A broad range of maze tasks has been used in animal-based neurobehavioral research [63] to study spatial working and reference memory [45, 62], search strategies [10, 47], and spatial pattern learning [11]. Similar maze tasks, conducted in simulation, have been proposed and adopted to study corresponding behavioral properties of RL methods [7, 14, 16]. Grid world environments are two-dimensional discrete versions of such mazes. Among their advantages are a lower starting difficulty and a slower scaling of complexity. They are also far less demanding of computational resources and do not require highly developed perception and motor systems on the agent's side. Nevertheless, grid worlds can provide rich and challenging tasks [15, 51, 60].
In our experiments, we studied the following aspects of the proposed model: spatial–temporal pattern representation learning, discovery and usage of state–action abstractions, an intrinsically motivated exploratory strategy, and learning in imagination through planning. Despite their simplicity, the constructed grid world tasks are able to highlight all the aforementioned aspects. For example, each task has distinct states that are visually similar; thus, to succeed, it is necessary to learn a helpful state representation that both distinguishes and clusters them. Also, part of the experiments were conducted in a four-room environment, which is divided into several zones interconnected by narrow passages. This makes it hard for the agent to switch zones, which can be partly mitigated by the use of state–action abstractions or a smart exploratory strategy. Finally, the overall difficulty of the four rooms task was calibrated so that there was enough room for improvement for the dreaming capability to speed up the course of the agent's learning. Given that, we treat our decision to test the model in grid world environments as a balanced choice between simplicity and experimental depth.
Consider a grid world environment. Each of its states can be defined by the agent's position; thus, the state space S contains all possible agent positions. The environment's transition function is deterministic. The action space A consists of four actions that move the agent to an adjacent grid cell: up, down, left, and right. However, when the agent attempts to move into a maze wall, its position remains unchanged. It is assumed that the maze is surrounded by obstacles, making it impossible for the agent to move outside. At each timestep, the agent receives an observation—a binary image of a small square window centered on it. The image consists of several channels, each of which is a binary mask marking the positions of objects of the corresponding type within the observation window. There are several channels for floors of different types, one channel for obstacles, and a channel for the vital resource (Fig. 6). The resource position corresponds to a goal state \(s_\text{g} \in S\). The agent is positively reinforced with the reward \(r = 1\) when it reaches the goal state. On the other hand, at each timestep, the agent receives a small negative reward signal \(r(s, a) {:}{=}-\text{cost}(a), a \in A\), where cost(a) is a real-valued function representing the energy cost of each action. We divide the interaction between the agent and the environment into episodes. At the start of an episode, the agent's position is initialized from the set of initial states \(S_{\text{ini}} \subset S\), and the episode finishes when the agent reaches the goal state or the time limit is reached.
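To make the setup concrete, the following is a minimal sketch of such an environment; the class, cell encoding, and channel layout are illustrative assumptions, not the actual implementation used in the experiments.

```python
import numpy as np

# Hypothetical encoding: cell values 0..n_floor-1 are floor types,
# `obstacle_id` marks walls; the goal (resource) position is an (i, j) tuple.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

class GridWorld:
    def __init__(self, maze, goal, action_cost, window=2, obstacle_id=4, n_floor_types=4):
        self.maze, self.goal, self.cost = maze, goal, action_cost
        self.r, self.obstacle_id, self.n_floor = window, obstacle_id, n_floor_types

    def step(self, pos, action):
        """Deterministic transition: bumping into a wall leaves the position unchanged."""
        di, dj = MOVES[action]
        ni, nj = pos[0] + di, pos[1] + dj
        if self.maze[ni, nj] == self.obstacle_id:   # the maze is surrounded by obstacles
            ni, nj = pos
        reward = -self.cost[action] + (1.0 if (ni, nj) == self.goal else 0.0)
        return (ni, nj), reward, (ni, nj) == self.goal

    def observe(self, pos):
        """Multi-channel binary image of a square window around the agent.
        Assumes the obstacle border is at least `window` cells thick."""
        i, j, r = pos[0], pos[1], self.r
        patch = self.maze[i - r:i + r + 1, j - r:j + r + 1]
        channels = [patch == t for t in range(self.n_floor)]   # floor-type channels
        channels.append(patch == self.obstacle_id)             # obstacle channel
        goal_mask = np.zeros_like(patch, dtype=bool)           # vital resource channel
        gi, gj = self.goal
        if abs(gi - i) <= r and abs(gj - j) <= r:
            goal_mask[gi - i + r, gj - j + r] = True
        channels.append(goal_mask)
        return np.stack(channels).astype(np.uint8)
```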
A single test trial lasts for several episodes. As a metric of the agent's performance during an episode, we use the number of steps required for the agent to reach the goal. The agent is allowed to accumulate experience for the entire duration of a test trial. However, depending on the experimental setup, we may also divide a single test trial into a sequence of tasks, each lasting for several episodes and defined by its own set of initial states \(S_{\text{ini}}\) and goal state \(s_\text{g}\). For every agent and environment setting, we perform several independent trials with different seed values.
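The testing protocol can be summarized by the following hypothetical loop; the agent interface (`act`, `learn`, `reset`) and the task format are placeholders for illustration only.

```python
import random

def run_trial(env, agent, tasks, episodes_per_task, max_steps, seed):
    """tasks: list of (initial_states, goal_state) pairs; returns steps-to-goal per episode."""
    rng = random.Random(seed)
    steps_log = []
    for initial_states, goal in tasks:
        env.goal = goal                           # a task fixes S_ini and s_g
        for _ in range(episodes_per_task):
            pos = rng.choice(list(initial_states))
            for step in range(1, max_steps + 1):
                action = agent.act(env.observe(pos))
                pos, reward, done = env.step(pos, action)
                agent.learn(reward, done)         # experience accumulates across episodes
                if done:
                    break
            steps_log.append(step)                # per-episode performance metric
            agent.reset()                         # episode boundary only; memory is kept
    return steps_log
```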
In the following subsections, we describe and discuss the experiments intended to investigate the advantages and caveats of individual HIMA modules, first in relatively simple cases and then in multitasking environments of increasing difficulty. The final experiment is carried out with the full-featured HIMA.
Abstract vs. elementary actions
The tests presented in this section were designed to compare the performance of an agent using only elementary actions with that of an agent that also uses abstract actions, in different environment settings. The agent that forms abstract actions corresponds to the HIMA model without the Dreaming and Empowerment blocks. The elementary actions agent is the same model but without the second level of the Hierarchy.
Four corridors experiment
Tests were conducted on a radial arm maze representing four corridors connected at the center (Fig. 8a). Each episode, the agent starts at the far end of a randomly chosen arm. Initially, the resource is positioned at the center of the corridor crossing. Then, after 1000 episodes, the resource is moved to the middle of one randomly chosen arm and remains there for the next 1000 episodes, until the end of the trial.
As shown in Fig. 7, the agent with abstract actions adapts much faster to the change of the goal position. As a result, it requires fewer steps in total to finish the trial. We also tried different inverse softmax temperatures \(\beta\) for the agent with elementary actions. As the figure shows, by increasing the temperature we can improve the elementary actions agent's performance, but at the expense of optimality in the first half of the trial. The experiment has shown that the agent with abstract actions can explore the environment in a more directed way than the agent with elementary actions, as the hierarchical structure allows it to learn four abstract actions for passing each of the corridors. So, when the position of the goal is changed, HIMA has a good chance to escape the local maximum learned by the first level of the hierarchy.
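For reference, a standard softmax (Boltzmann) action selection with inverse temperature \(\beta\), the knob varied in this comparison; this is generic code, not taken from HIMA.

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Boltzmann exploration: p(a) is proportional to exp(beta * Q(s, a))."""
    prefs = beta * (q_values - np.max(q_values))   # subtract max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return np.random.choice(len(q_values), p=probs)
```

Decreasing \(\beta\) (i.e., raising the temperature) flattens the action distribution and increases exploration, which helps after the goal relocation but hurts optimality in the first half of the trial.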
Four rooms experiment
The previous experiment was designed to show the type of cases where our current abstract action model is most effective. However, we also wanted to investigate more common cases in this domain and find out the limitations of our method for abstract action formation. So, we tested our agent in a classical four-room maze, which, because of its bottleneck structure, is often used to test abstract actions.
Trials were carried out on a map having the form of four connected rooms, with the resource placed in the left doorway. We consider two variations of the test. In the first one, each episode the agent starts randomly in one of the cells from the set marked in Fig. 8b. After 2000 episodes, the resource is moved to a randomly chosen corner of the lower-left room. The second test was performed on the same map, but each episode the agent starts in any unoccupied, randomly chosen cell. The goal state is relocated in the same way after 2000 episodes.
In the first variation of the experiment, the agent's initial positions were chosen so that HIMA can easily form two abstract actions: pass through the bottom door and pass through the right door. As can be seen from Fig. 9, the agent with abstract actions performs better after the goal position change than the agent with elementary actions and the same softmax temperature. Although we can adjust the softmax temperature to get similar performance around the reward relocation, it is still worse than the strategy with abstract actions in the long run. However, HIMA learns a suboptimal trajectory to the goal, as can be seen from the first half of the learning curve.
In the second experiment, there is much more variation between possible trajectories. For now, our HIMA model is not capable of generalizing abstract actions by a goal; instead, it learns the most repetitive action sequences. Since the agent can start in any position, it is not possible to single out the most repetitive action sequences here. So, in such cases, our method is not guaranteed to form useful abstract actions. Therefore, as can be seen from Fig. 10, the problem with the suboptimality of the abstract actions becomes more pronounced. Moreover, since the agent starts from different positions, directed exploration, which usually helps to pass through bottleneck states, is not as crucial.
The experiments have shown that HIMA is capable of learning useful abstract actions that improve the agent's exploration abilities in scenarios with non-stationary goal positions in environments with a low-connectivity graph of state transitions. The experiments have also demonstrated that better performance can be reached on tasks where any path to the goal on the transition graph can be decomposed into non-trivial sequences of elementary actions, as in the crossed corridors and the four rooms with restricted spawn set experiments. Otherwise, there is no guarantee that the strategy with abstract actions will be advantageous, even under the best learning conditions.
Four rooms and empowerment
In this subsection, we evaluate the model of empowerment (Sect. 4.3) on the four rooms task. The main goal is to compare the empowerment values predicted by our model with the ideal theoretical prediction. The most important factor here is the quality of the transition function, which predicts the possible next states from the current one. By the ideal empowerment we mean the value calculated from Eq. 6 with full information about the environment—the exact distribution of the reachable states S. This corresponds to having a perfect transition function. In contrast, the TM empowerment is the value calculated as described in Sect. 4.3 with the learned Temporal Memory.
To begin with, we analyze the ideal empowerment with respect to its depth: the number of prediction steps it uses. For the four rooms task, this analysis is shown in Fig. 11, which presents the field of values for 1–4 step empowerment. If the depth is small, then almost all states are equivalent. Such a signal is not very useful, as it does not highlight any special places that we want to find. As the depth increases, the picture changes, and for four-step empowerment the special places are clearly visible. We call this set of points the \(\epsilon\)-ring. The intuition is that it consists of cells from which the agent can reach the largest number of states. If the depth is increased further, this set becomes even clearer, but making such long predictions is very difficult (the number of possible path variants grows exponentially). So we opt for the four-step case.
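Since the environment is deterministic, the ideal n-step empowerment reduces, as far as we describe it here, to the logarithm of the number of distinct states reachable in n steps; whether this matches Eq. 6 exactly depends on its formulation, so the sketch below is an illustration under that assumption (`transition` is assumed to return the deterministic next state).

```python
from math import log2

def ideal_empowerment(transition, state, n_steps, n_actions=4):
    """n-step empowerment in a deterministic grid world: log2 of the number of
    distinct states reachable by some n-step action sequence from `state`."""
    frontier = {state}
    for _ in range(n_steps):
        frontier = {transition(s, a) for s in frontier for a in range(n_actions)}
    return log2(len(frontier))
```

Under this reading, at depth 1 almost every free cell reaches four distinct states (bumping into a single wall still yields four outcomes, since staying in place counts), which is consistent with the nearly uniform one-step field in Fig. 11; at depth 4, the cells of the \(\epsilon\)-ring stand out.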
As discussed in Sects. 4.4 and 4.3, empowerment with TM should be evaluated over clusters. To compare the ideal and TM empowerment, we train TM with a random agent walking 10,000 steps in the environment (after this number of steps, TM no longer improves its predictions). During this process, the clusters are also created. An example of the learned set of clusters is presented in the left part of Fig. 12. In the similarity matrix (on the right in Fig. 12), we can see that almost all clusters are different, but for some of them the similarity can be near 0.5. The latter is bad for empowerment because similar clusters can interfere, and the visit statistics \(\nu\) will be mixed (Sect. 4.3). To partially mitigate this, we use the median or mode as the statistic function \(\Lambda\). In addition, similar clusters may lead to false positive TM predictions—TM can start predicting states that actually cannot be the next ones (so-called phantoms). Generally, this problem can be solved simply by increasing the size of an SDR and decreasing its sparsity, but this requires more resources.
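A hypothetical sketch of how a cluster similarity matrix like the one in Fig. 12 can be computed, assuming each cluster is stored as a binary SDR and similarity is measured by the relative overlap of active bits; the exact similarity measure used in the paper may differ.

```python
import numpy as np

def sdr_similarity_matrix(clusters):
    """clusters: (n_clusters, sdr_size) binary array.
    Returns pairwise relative overlap |a & b| / min(|a|, |b|)."""
    c = clusters.astype(bool)
    inter = (c[:, None, :] & c[None, :, :]).sum(-1)   # pairwise intersection sizes
    sizes = c.sum(-1)
    min_sizes = np.minimum(sizes[:, None], sizes[None, :])
    return inter / np.maximum(min_sizes, 1)
```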
The final step of the empowerment analysis is the comparison of the ideal and TM empowerment values. We found that our proposed algorithm for the empowerment estimate cannot handle the case when several actions from a single state lead back to that same state, which is typical for corner positions. For example, in the top-left corner, moving up and moving left both result in staying in the corner. In this case, TM correctly predicts the next state—the corner position itself—but it does not account for the number of different transitions \((s, a) \rightarrow s'\) with \(s = s'\). One possible way to solve this is to use additional information about actions for TM predictions (as in the dreaming block), but this is a subject for future research. So, for a fairer comparison, we additionally calculate the ideal empowerment with this kind of restriction. We also compute the empowerment with TM for the mode and median statistics. The results are presented in Fig. 13.
We can see that the ideal variant yields the lowest values. The TM mode case is the closest to the ideal one, but it overestimates at the gates. The TM median overestimates even more. In our task, overestimation means that the prediction is blurred by intersections between states and by phantoms (in this case, the statistic \(\nu\) is shared between states). In both TM cases, the \(\epsilon\)-ring can be distinguished. The main conclusion is that the TM mode can be used as an approximator of the ideal empowerment. However, we should keep in mind the problem with corners. The heterogeneity of the estimated empowerment values is, in our opinion, the result of both poor semantics in the observation signal and a very basic visual processing system (our model lacks a proper visual cortex model, which is also a subject for future research).
Dreaming analysis
In this subsection, we discuss experiments with the dreaming block. First, we walk through the set of experiments that led us to the final version of the dreaming algorithm described in Sect. 4.5. Finally, we show the effect on the HIMA baseline performance of adding the dreaming block.
During the research and development of the dreaming algorithm, we were mostly concerned with two questions. Is the quality of the learned model sufficient to produce diverse and helpful (correct) planning rollouts? What should the decision-making strategy for starting [or preventing] the dreaming process be?
First of all, we studied the pure effects of dreaming in isolation from HIMA. For that, we took a very basic agent architecture instead of HIMA. It had a sequence of SP sub-blocks, which provided a joint state–action encoding. Over this encoding, the agent used a classic RL TD-learning method [59] to learn a distributed Q-value function, which in turn induced a softmax policy. For such an agent architecture, we implemented the dreaming block in the same way as it is implemented for HIMA. We tested dreaming in the four rooms setting, where both the initial agent position and the resource position were chosen randomly and stayed fixed for the whole duration of a trial. To exclude easy combinations, the trials were selected such that the agent's starting position was not in the same room as the resource.
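A minimal sketch of this basic agent, under the assumption that the joint state–action SDR is a set of active cell indices and the distributed Q-value is the mean of per-cell weights updated with one-step TD learning [59]; the SP encoder itself is omitted and `encode` is a placeholder.

```python
import numpy as np

class SDRQAgent:
    """Q(s, a) is distributed over the cells of the joint state-action SDR."""
    def __init__(self, n_cells, n_actions, lr=0.1, gamma=0.95, beta=3.0):
        self.w = np.zeros(n_cells)            # one weight per SDR cell
        self.n_actions, self.lr, self.gamma, self.beta = n_actions, lr, gamma, beta

    def q_value(self, sa_sdr):
        """sa_sdr: array of active cell indices of the joint (state, action) SDR."""
        return self.w[sa_sdr].mean()

    def act(self, encode, state):
        """encode(state, a) -> active cell indices; softmax policy over Q-values."""
        q = np.array([self.q_value(encode(state, a)) for a in range(self.n_actions)])
        p = np.exp(self.beta * (q - q.max()))
        p /= p.sum()
        return np.random.choice(self.n_actions, p=p)

    def td_update(self, sa_sdr, reward, next_q, done):
        """One-step TD update of the weights of the active cells."""
        target = reward + (0.0 if done else self.gamma * next_q)
        td_error = target - self.q_value(sa_sdr)
        self.w[sa_sdr] += self.lr * td_error / len(sa_sdr)
        return td_error
```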
Our initial version of the dreaming switching strategy was to make the probability proportional to the absolute TD error, because a high TD error indicates states where dreaming can contribute the most to the learning process. However, if the error is too high, it may also indicate that this state's neighborhood has not been properly explored yet; hence, dreaming should not be started, as we cannot rely on the inner model. So, we had to find a balanced TD error range within which dreaming is allowed to be activated. Experiments with such a strategy showed its ineffectiveness (see Fig. 14 on the left). It turned out that the TD error alone cannot guarantee good local quality of the learned model.
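A sketch of that initial switching rule; the band boundaries and the maximum probability are illustrative values, not the ones used in the experiments.

```python
import random

def should_dream_td(td_error, low=0.02, high=0.5, p_max=0.2):
    """Activate dreaming with probability proportional to |TD error|,
    but only inside the [low, high] band where the model is presumed reliable."""
    e = abs(td_error)
    if e < low or e > high:
        return False
    return random.random() < p_max * (e - low) / (high - low)
```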
To get a clue about a better dreaming switching strategy, we decided to investigate the situations when dreaming makes a positive impact on the agent's performance. Soon enough, a new problem arose—each dreaming rollout can potentially affect the further behavior and performance of the agent, so rollouts must be evaluated independently. On the other hand, most of the time, the effect of a single rollout is negligible or very stochastic. Moreover, independent rollout evaluation does not add to the understanding of their cumulative effect. All of this makes such an analysis highly inaccurate and speculative.
In the corresponding experiment, for each trial, we independently compared the performance of an agent without dreaming with that of the same agent allowed to dream only once during learning. That is, for each trial, we independently evaluated the outcome of dreaming at each position along the non-dreaming agent's trajectory—this showed us all the moments where a single dreaming rollout makes a positive or negative impact. The only conclusion we could draw from this experiment was that dreaming improves performance more consistently when it is activated near the starting point. These locations also share lower-than-average transition model anomaly values. This led us to the final version with anomaly-based dreaming switching.
Tests of anomaly-based dreaming switching were conducted with the same protocol as for TD error-based switching, but on harder tasks. They showed a significant improvement in the agent's performance. We compared the baseline agent without dreaming and an agent with the anomaly-based dreaming switching strategy (the switching probability at zero anomaly was \(p_{\max} = 0.12\)). The results are presented in Fig. 14 on the right. Dreaming showed faster convergence to the optimal policy. Based on that, we hypothesized that the effect of dreaming is comparable to an increased learning rate. So, we additionally evaluated the baseline with two different learning rates and included the results in Fig. 14 on the right. The baseline with a 50% increased learning rate (light blue) almost matched the dreaming agent's performance, while the baseline with a 25% decreased learning rate (blue) was two times slower—its episode axis is scaled down by a factor of two on the plot for better comparison. Besides the increased speed, we also noted the increased learning stability brought by anomaly-based dreaming.
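The final anomaly-based rule can be sketched as follows; \(p_{\max} = 0.12\) matches the experiment, while the decay shape and the threshold are illustrative assumptions.

```python
import random

def should_dream_anomaly(anomaly, p_max=0.12, threshold=0.3):
    """Dream only where the transition model is trusted: probability p_max at
    zero anomaly, decaying (here linearly) to zero at `threshold`."""
    p = p_max * max(0.0, 1.0 - anomaly / threshold)
    return random.random() < p
```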
Exhaustible resource experiment
Here we investigate how our agent behaves when resources are exhaustible and their extraction complexity increases. One test trial consists of 30 tasks at three levels of difficulty, with 10 tasks per level. The maze and the agent's initial state set are the same as in the four rooms experiment (see Fig. 8b). Tasks of different levels differ by the relative positions of the agent and the resource. On the first level, the resource is spawned in one of the two hallways of the room where the agent spawns (see Fig. 15). On the second level, the set of initial resource positions is restricted to the two rooms adjacent to the agent's room. On the final level, the resource can be spawned in any position except the room of the agent's initial position. A task corresponds to one goal and one set of the agent's initial positions. The task is changed when the agent has visited \(s_\text{g}\) more than 100 times, i.e., when the resource is exhausted. The difficulty level of the tasks increases every ten tasks. The trial continues until the agent passes the third level.
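Schematically, this trial structure looks as follows; `sample_task` and `run_episode` are placeholders abstracting the details of task generation and a single episode.

```python
def exhaustible_resource_trial(agent, env, sample_task, run_episode,
                               visits_to_exhaust=100, tasks_per_level=10, n_levels=3):
    """30 tasks in total: 3 difficulty levels, 10 tasks per level."""
    for level in range(1, n_levels + 1):
        for _ in range(tasks_per_level):
            initial_states, goal = sample_task(level)   # level restricts resource spawn
            goal_visits = 0
            while goal_visits <= visits_to_exhaust:     # task changes after >100 visits
                reached = run_episode(agent, env, initial_states, goal)
                goal_visits += int(reached)
```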
In the following subsections, we show the maximum contribution of different HIMA features to its overall performance, each feature on its own. Finally, we carry out the experiment with all the features enabled and compare our full-featured agent with the baseline. For the baseline, we use the basic version of HIMA with a one-level hierarchy and both empowerment and dreaming disabled. The baseline agent uses only one BGT block with one striatum region aggregating the extrinsic reward.
Abstract actions
Here we investigate the effect of adding the second level of the hierarchy to the baseline HIMA. As can be seen from Fig. 16, the agent with two levels of the hierarchy performs better on average across the tasks. We also selected a sequence of tasks consisting of conflict situations only and call it the hard set. In a conflict situation, a strategy learned for a previous task interferes with the successful accomplishment of the current task. There are eight first-level tasks and four tasks of the second and third levels. From Fig. 17, we can see that the agent with abstract actions performs significantly better than the agent with elementary actions. It can also be noted from Fig. 17a that the difference between the agents arises in tasks of levels two and three, where transitions between the rooms play a crucial role and the abstract actions have already been learned by the agent.
Figure 18 shows four examples of abstract actions used during the experiment, where \(I: S \mapsto [0, 1]\) is the probability of initiating an option in a corresponding state and \(\beta : S \mapsto [0, 1]\) is the termination probability. The big heat map for every option visualizes the number of times a transition to a state was predicted during the execution of the corresponding option. The two small heat maps correspond to the I and \(\beta\) functions.
Empowerment and other signals
Here we investigate the effect of adding variants of the intrinsic signal to the baseline HIMA model. We compare the following signals: anomaly, empowerment, constant, and random. Anomaly is a TM characteristic: the fraction of active SDR cells that were not predicted (\(\text{anomaly}=1-\text{precision}\)) for a state \(s_t\). Constant is a fixed value for all states. Random is a value drawn from a uniform distribution (from 0 to 1). Constant and random signals are independent of the agent's state.
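A minimal sketch of the anomaly signal as defined above: the share of currently active SDR cells that the Temporal Memory had not predicted.

```python
def tm_anomaly(active_cells, predicted_cells):
    """Fraction of active cells that were not predicted (the paper's 1 - precision)."""
    active, predicted = set(active_cells), set(predicted_cells)
    if not active:
        return 0.0
    return 1.0 - len(active & predicted) / len(active)
```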
HIMA has its own built-in intrinsic motivation, caused by optimistic initialization (see Sect. 8.2). The initial value function is zero, but at every step the agent gets a small negative reward that acts as a kind of counter (similar to exploration bonuses), so already visited states will be chosen with lower probability (as they will have lower value). This mechanism is always active and helps the agent to start with simple exploration.
To understand the influence of the additional striatum pathway alone (see Sect. 4.2), we use a constant intrinsic signal with zero value (zero-const in Fig. 19). Experiments show a significant improvement in the total steps metric from adding the intrinsic pathway in the Exhaustible Resource task. We can conclude that pathway weighting acts as a kind of "shaker" for the agent. When the agent reaches resources reliably, it does not use the intrinsic pathway (see Sect. 8.2). But when the task changes, the agent performs badly and needs more steps to reach a resource, so the extrinsic pathway turns off—its priority becomes close to zero. The agent then starts to take random actions (for the zero-const signal) controlled by the intrinsic pathway. This "shakes" the agent's behavior out of stagnation.
We tried other signals to make this process more intelligent (Fig. 19). Negative-const is a small negative constant (−0.01). We assumed that this signal would strengthen exploration because of the optimistic initialization in the intrinsic pathway. But this does not happen, and the results are worse than with zero-const.
An anomaly signal can be considered a standard prediction error. Normally, its value is between 0 and 1 (this is positive-anomaly), but we also consider negative-anomaly, which is shifted by −1. In Fig. 19, these signals do not improve on the zero-const variant but are better than the baseline.
In Sect. 5.2, we established that the most suitable prediction depth for the empowerment signal is four. Our goal is to understand how this intrinsic signal can influence the agent's performance, so we choose the ideal four-step empowerment signal (which uses the environment transition model) to minimize the negative effects of TM-predicted empowerment (see Sect. 5.2). This signal is shifted to lie in [0, 1] for positive-empowerment and in \([-1, 0]\) for negative-empowerment.
We expected that the empowerment signal would help the agent move between the rooms after many failed turns within one room, and this expectation was confirmed. We found that when the influence of the empowerment signal is large (\(\eta\, \text{pr}^{\text{int}} \gg \text{pr}^{\text{ext}}\), see Eq. 7), the agent begins to walk along the \(\epsilon\)-ring. This can lead to problems: if the priority of the intrinsic reward does not decrease, the agent will stay in this vicious circle and never find the resource. To solve exactly this, we define an exponential decay for \(\eta\) (Sect. 4.2).
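A sketch, under our reading of Eq. 7 and Sect. 4.2, of how the two pathways could be combined and how the exponential decay of \(\eta\) prevents the agent from circling the \(\epsilon\)-ring indefinitely; the exact form of Eq. 7 and the coefficients are assumptions for illustration.

```python
class PathwayMixer:
    """Weights the extrinsic and intrinsic pathways (a guess at the shape of Eq. 7)."""
    def __init__(self, eta_0=1.0, decay=0.999):
        self.eta, self.decay = eta_0, decay

    def combined_value(self, v_ext, v_int, pr_ext, pr_int):
        # extrinsic and intrinsic values weighted by priorities; eta scales the intrinsic term
        return pr_ext * v_ext + self.eta * pr_int * v_int

    def step(self):
        # exponential decay of the intrinsic weight (Sect. 4.2)
        self.eta *= self.decay
```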
Variants with empowerment show the best performance among the intrinsic signals that carry semantics, so we can suppose that empowerment is the most suitable of them for our architecture.
We have already drawn some conclusions from these experiments. However, it should be noted that the one-sigma confidence intervals of the metric intersect for all variants of intrinsic motivation (Fig. 19). To check whether the semantics of the signal matters, we evaluated an agent with positive-random (uniform on [0, 1]) and negative-random (uniform on \([-1, 0]\)) signals. As can be seen from Fig. 19, these signals also perform on par with the other intrinsic motivation variants. The reason for such behavior may be that, for the Exhaustible Resource task, priority "shaking" is enough and intelligent intrinsic signals are not necessary.
We also analyzed the agent's work process. In Fig. 20, the results averaged over several agent runs are shown. We found that, in terms of steps per task (Fig. 20a), in some cases the difference between the baseline and the other variants is not large. As can be seen from Fig. 20, the intrinsic motivation signals cannot be distinguished by their performance, but all are better than the baseline without the intrinsic modulation. So we can assume that in this task, the priority modulation ("shaking") is more important than the exact values of the intrinsic reward.
Dreaming
In this subsection, we discuss the effects of adding the dreaming block to the baseline HIMA. Previously, in Sect. 5.3, we have already shown that dreaming speeds up learning and makes it more stable. Results in the exhaustible resources experimental setup show similar effects of dreaming, now applied to the HIMA model (see Fig. 21). In first-level tasks, dreaming may sometimes decrease performance. However, as the difficulty increases, the positive effects of dreaming grow. Dreaming speeds up convergence within a task. It also accelerates exploration by cutting off less promising pathways.
HIMA
So far, we have been considering each component of our agent architecture separately. In this section, we present the results of the tests for the full-featured HIMA model. Before the final experiment, a grid search was performed over several parameters of the agent model with all components enabled. Parameter fine-tuning was carried out on a simplified version of the test with only the first two levels of difficulty and five tasks in each. The HIMA agent with the best parameters was then tested on the full version of the test.
First, we compared the full-featured HIMA against the baseline HIMA. Figure 22 shows that the full-featured HIMA model performs significantly better than the baseline. They perform on par in the first-level tasks, which do not require transitions between the rooms and for which simple softmax-based exploration is enough. The most conspicuous difference between the baseline and HIMA is at the second and third levels, where the abstract actions and the intrinsic reward facilitate more efficient exploration, while dreaming speeds up the whole learning process. Dreaming also helps to stabilize the strategy by improving the value function estimate in the striatum.
Second, we compared the full-featured HIMA against DeepRL baselines: DQN [44] and Option-Critic [5]. The networks for both methods were built on top of two fully connected ANN layers. The actor and critic parts of Option-Critic shared network weights and only had separate corresponding heads. DQN and the critic part of the Option-Critic architecture were trained offline, using regular, uniformly sampled experience replay. We fine-tuned the baselines' hyperparameters via grid search on a separate set of seeds within the same testing protocol. Figure 23 shows that both DeepRL methods were unable to adapt to the repeatedly changing tasks and showed extremely low performance compared to HIMA.
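For reference, a hedged PyTorch sketch of the Option-Critic baseline network as described: a shared two-layer fully connected trunk with separate actor and critic heads. Layer sizes and any head structure beyond what is stated in the text are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OptionCriticNet(nn.Module):
    """Shared two-layer trunk with separate critic (Q over options), intra-option
    policy, and termination heads; sizes are illustrative."""
    def __init__(self, obs_dim, n_actions, n_options, hidden=128):
        super().__init__()
        self.n_options, self.n_actions = n_options, n_actions
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.q_options = nn.Linear(hidden, n_options)                 # critic head
        self.intra_policy = nn.Linear(hidden, n_options * n_actions)  # actor head
        self.termination = nn.Linear(hidden, n_options)               # option termination

    def forward(self, obs):
        h = self.trunk(obs)
        q = self.q_options(h)
        pi_logits = self.intra_policy(h).view(-1, self.n_options, self.n_actions)
        beta = torch.sigmoid(self.termination(h))
        return q, pi_logits, beta
```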