FeUdal Networks for Hierarchical Reinforcement Learning


The paper introduces a hierarchical reinforcement learning framework called Feudal Networks (FuN) that enables agents to learn a hierarchy of temporal abstractions. The proposed model consists of a Manager and a Worker, with the Manager learning high-level policies and the Worker learning low-level policies. The approach is demonstrated to achieve state-of-the-art performance on a range of challenging tasks in the Atari domain.

Overview

FeUdal Network (FuN) is a hierarchical reinforcement learning method inspired by the feudal reinforcement learning proposal of Dayan and Hinton. The need for hierarchical reinforcement learning arises because most state-of-the-art (SOTA) reinforcement learning algorithms operate purely at the level of low-level execution: they receive states and emit primitive actions, without performing any high-level planning or goal selection based on state information. Hierarchical reinforcement learning overcomes this limitation by introducing higher-level policies or networks that plan and select goals, while lower-level policies execute actions accordingly.

FuN employs two modules: the Manager and the Worker. The Manager operates at a lower temporal resolution, receiving states and setting abstract goals that are conveyed to and enacted by the Worker [1]. In contrast, the Worker generates primitive actions by observing the environment at every tick. By decoupling the Manager and the Worker, FuN facilitates long timescale credit assignment and the emergence of sub-policies associated with different goals set by the Manager. This enables FuN to outperform strong baseline agents on tasks that require long-term credit assignment or memorization.

The model

FuN is a modular neural network consisting of two modules - the Worker and the Manager [1]. Let's go a bit more in depth on the roles of the two modules.
Figure 1: The schematic illustration of FuN

Figure 1 illustrates the overall design, and the following equations explain the forward dynamics of FuN:
Figure 2: Equations of the forward dynamics of FuN

The Manager and the Worker modules of FuN share a perceptual module that computes a shared representation $z_t$ from the environment observation $x_t$. The Manager's main objective is to generate a goal given states, while the Worker produces primitive actions conditioned on the external observation, its internal state, and the goal from the Manager. The goals $g_t$ are trained using an approximate transition policy gradient, which exploits the knowledge that the Worker's behaviour will align with the goal set by the Manager. The goals are then pooled and embedded into vectors $w_t$ via a linear transform $\phi$; these embeddings modulate the Worker's output to produce the policy – a vector of probabilities over primitive actions.
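Written out (following the notation of [1]), the forward dynamics summarised in Figure 2 are:
$$
\begin{aligned}
z_t &= f^{percept}(x_t), \\
s_t &= f^{Mspace}(z_t), \\
h^M_t,\ \hat{g}_t &= f^{Mrnn}(s_t, h^M_{t-1}); \quad g_t = \hat{g}_t / \lVert \hat{g}_t \rVert, \\
w_t &= \phi\Big(\sum_{i=t-c}^{t} g_i\Big), \\
h^W_t,\ U_t &= f^{Wrnn}(z_t, h^W_{t-1}), \\
\pi_t &= \mathrm{SoftMax}(U_t w_t).
\end{aligned}
$$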

The Worker is trained with an intrinsic reward to generate actions that move the agent in the goal directions to be achieved. The Worker's RNN produces an embedding matrix $U_t$, and the policy is computed from the product of $U_t$ and $w_t$. This training approach enables FuN to learn a hierarchy of sub-policies associated with different goals and facilitates long-term credit assignment, allowing the agent to outperform a strong baseline agent on tasks involving long-term credit assignment or memorization.

Goal Embedding

Computing the goal embedding is one of the novelties of this paper. Essentially, it modulates the policy via “a multiplicative interaction in a low dimensional goal-embedding space.” As a reminder, the embedded goal $w_t$ is obtained by a linear transformation from $R^d$ to $R^k$, where the last $c$ goals are first pooled by summation and then embedded into the vector $w_t$, so that the embedded vector can incorporate multiple goals. The multiplicative interaction between $U_t$ and $w_t$ lets the policy be modulated to incorporate the different goal directions coming from the Manager.

First, the last $c$ goals are pooled by summation. Because the goals are pooled over several time-steps, the conditioning from the Manager varies smoothly. Pooling the last $c$ goals creates a summary of the recent sub-goals the Manager has selected, which gives the Worker a sense of the overall direction the Manager wants to pursue.

Next, the pooled goals are embedded into the vector $w_t$ using the linear projection $\phi$. The projection $\phi$ is linear, with no biases, and is learnt with gradients coming from the Worker's actions. “Since $\phi$ has no biases it can never produce a constant non-zero vector – which is the only way the setup could ignore the Manager's input. It makes sure that the goal output by the Manager always influences the final policy.” [1]. By not including any bias terms, the linear projection $\phi$ ensures that the output vector is never shifted away from zero. This matters because if $\phi$ did include bias terms, it could learn to produce a constant non-zero vector regardless of the Manager's goals; in that case the Worker's policy would no longer be influenced by the Manager's input.

Lastly, the Worker's embedding matrix $U_t$ is combined with the goal embedding $w_t$ via a matrix-vector product, as sketched below.
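Here is a minimal PyTorch sketch of this modulation (tensor shapes and variable names are my own, chosen for illustration): pool the last $c$ goals, project them with a bias-free linear map $\phi$, and modulate the Worker's action embeddings $U_t$ with a matrix-vector product followed by a softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k, num_actions, c = 256, 16, 18, 10  # illustrative sizes

phi = nn.Linear(d, k, bias=False)  # bias-free projection R^d -> R^k

def worker_policy(U_t, recent_goals):
    """
    U_t:          (num_actions, k) action embeddings from the Worker's RNN
    recent_goals: list of the last c goal vectors, each of shape (d,)
    """
    # Pool the last c goals by summation so the Manager's conditioning varies smoothly.
    pooled = torch.stack(recent_goals).sum(dim=0)   # (d,)
    w_t = phi(pooled)                               # (k,) goal embedding
    # Multiplicative interaction: matrix-vector product, then softmax over actions.
    logits = U_t @ w_t                              # (num_actions,)
    return F.softmax(logits, dim=-1)

# Example usage with random tensors.
U_t = torch.randn(num_actions, k)
goals = [F.normalize(torch.randn(d), dim=0) for _ in range(c)]
pi_t = worker_policy(U_t, goals)
```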

Learning

The learning process follows a standard reinforcement learning setup: the agent selects actions given observations from the environment and receives rewards for the selected actions. The goal of the agent is to maximize the discounted return, and the agent's behaviour is defined by its action-selection policy $\pi$.

However, FuN proposes “instead to independently train the Manager to predict advantageous directions (transitions) in state space and to intrinsically reward the Worker to follow these directions.” If the Worker can fulfil the goal of moving in these directions (as it is rewarded for doing), then the agent should end up taking advantageous trajectories through state space.

The Worker network receives two inputs - the current state of the environment and a goal vector selected by the Manager. Based on these inputs, the Worker network outputs an action to be executed by the agent in the environment. The primary objective of the Worker network is to acquire a policy that maximizes the intrinsic reward, which is determined by the progress made towards the current goal vector.

The FuN architecture is trained with policy gradients: gradients of the expected reward are computed with respect to the network parameters, including those that produce the Manager's goals, and are used to update those parameters.

The issue with computing these gradients in the FuN architecture is that the state representation used by the Worker depends on the network parameters as well as on observations from the environment. This presents a challenge: unlike the conventional approach of training end-to-end through gradient descent either on the policy directly or via TD-learning, FuN takes a different route. Specifically, if the Manager's goals were trained end-to-end with gradients coming from the Worker, they would lose any semantic meaning and become just internal latent variables of the model. FuN therefore trains the Manager independently of the Worker's gradients. When computing the gradient of the cosine similarity between the state direction and the goal vector, FuN treats the state representation as fixed, i.e. it does not propagate gradients through it.

By doing this, the FeUdal Network architecture can use the cosine similarity between the state direction and the goal vector as a useful learning signal for the Manager, without having to backpropagate through the Worker's state representation.

The update rule (gradient) for the Manager's goal is as follows:
$$
\nabla g_t = A_t^M \nabla_\theta d_{cos} (s_{t+c}-s_t,g_t(\theta)),
$$
where $A_t^M = R_t - V_t^M(x_t,\theta)$ is the Manager's advantage function, computed using a value function estimate $V_t^M(x_t,\theta)$ from the internal critic.
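A rough PyTorch rendering of this update (variable names are my own, and in practice the terms would be summed over a rollout): the state direction $s_{t+c}-s_t$ is detached so that gradients flow only into the Manager's goal output, mirroring the fixed-state-representation assumption described above.

```python
import torch
import torch.nn.functional as F

def manager_loss(s_t, s_tpc, g_t, R_t, V_M):
    """
    s_t, s_tpc: Manager state representations at times t and t+c
    g_t:        goal emitted by the Manager at time t (requires grad)
    R_t:        discounted return from time t
    V_M:        internal critic's value estimate for the Manager
    """
    advantage = (R_t - V_M).detach()        # A^M_t, treated as a constant
    direction = (s_tpc - s_t).detach()      # treat the state direction as fixed
    cos = F.cosine_similarity(direction, g_t, dim=-1)
    # Gradient ascent on A^M_t * d_cos(...) == descent on the negative.
    return -(advantage * cos)

# Example usage with dummy tensors.
d = 256
s_t, s_tpc = torch.randn(d), torch.randn(d)
g_t = torch.randn(d, requires_grad=True)
loss = manager_loss(s_t, s_tpc, g_t, R_t=torch.tensor(1.0), V_M=torch.tensor(0.5))
loss.backward()
```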

The intrinsic reward that incentivizes the Worker to pursue the specified goals is defined as follows:
$$
r_t^I = \frac{1}{c} \sum_{i=1}^{c} d_{cos}(s_t - s_{t-i}, g_{t-i})
$$
The paper suggests that incorporating directions (latent goals) enables the Worker network to make directional shifts in the latent space without assuming arbitrary absolute locations, making it a more practical approach for high-dimensional and continuous latent spaces where exploring all possible locations would be difficult.

In fact, the feudal reinforcement learning formulation of Dayan and Hinton (1993) proposed completely hiding the reward signal from the lower levels of the hierarchy. This paper takes a softer approach, adding an intrinsic reward for following the goals to the environment reward. The Worker is therefore trained to maximise a weighted sum $R_t+\alpha R_t^I$, where $\alpha$ is a hyperparameter that regulates the influence of the intrinsic reward. A large $\alpha$ gives the intrinsic, goal-following reward more influence relative to the task reward from the environment.
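As a concrete sketch (again with made-up variable names and an illustrative value of $\alpha$), the intrinsic reward averages cosine similarities between recent state changes and the corresponding goals, and is then mixed into the Worker's reward with weight $\alpha$:

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(states, goals, t, c):
    """
    states: list of Manager state vectors s_0 .. s_t
    goals:  list of goal vectors g_0 .. g_t (assumes t >= c)
    Returns r^I_t = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i}).
    """
    sims = [
        F.cosine_similarity(states[t] - states[t - i], goals[t - i], dim=-1)
        for i in range(1, c + 1)
    ]
    return torch.stack(sims).mean()

# Example with dummy data; alpha is the hyperparameter weighting the
# intrinsic reward against the task reward (value here is illustrative).
d, c, t, alpha = 256, 10, 20, 0.5
states = [torch.randn(d) for _ in range(t + 1)]
goals = [F.normalize(torch.randn(d), dim=0) for _ in range(t + 1)]
r_I = intrinsic_reward(states, goals, t, c)  # enters R_t + alpha * R_t^I
```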

Finally, the Worker's task-specific policy $\pi$ can be trained to maximize this mixed reward with a conventional deep reinforcement learning algorithm such as actor-critic (Mnih et al., 2016):
$$
\nabla\pi_t=A_t^D \nabla_\theta \log \pi(a_t|x_t;\theta),
$$
where $A_t^D = (R_t + \alpha R_t^I - V_t^D(x_t;\theta))$, a.k.a. the advantage function, is calculated using an internal critic that estimates the value functions for both rewards. Importantly, the Worker and the Manager use different discount factors $\lambda$ for computing the return, since the Worker can be more greedy and focus on immediate rewards while the Manager can concentrate on long-term planning.
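A minimal sketch of the corresponding loss term in PyTorch (names are my own; the critic's value loss and any entropy bonus used in actor-critic training are omitted):

```python
import torch

def worker_loss(log_prob_a_t, R_t, R_I_t, V_D, alpha=0.5):
    """
    log_prob_a_t: log pi(a_t | x_t; theta) of the action actually taken
    R_t, R_I_t:   environment return and intrinsic return from time t
    V_D:          the internal critic's value estimate for the Worker
    alpha:        weight of the intrinsic reward (hyperparameter)
    """
    advantage = (R_t + alpha * R_I_t - V_D).detach()  # A^D_t, treated as a constant
    return -(advantage * log_prob_a_t)                # ascent on A^D_t * log pi

# Example with dummy values for a single timestep.
loss = worker_loss(log_prob_a_t=torch.tensor(-1.2, requires_grad=True),
                   R_t=torch.tensor(2.0), R_I_t=torch.tensor(0.3),
                   V_D=torch.tensor(1.5))
loss.backward()
```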

Transition Policy Gradients

The transition policy describes how the environment state changes when the agent follows a given sub-policy. A high-level policy $o_t = \mu(s_t,\theta)$ selects among sub-policies (possibly from a continuous set); each sub-policy is treated as a fixed behaviour - remember that a goal lasts for $c$ steps. The question is then how states evolve under a chosen sub-policy. The paper uses the transition distribution $p(s_{t+c}|s_t,o_t)$, the distribution over states reached at the end of the sub-policy given the start state. Composing the high-level policy with this transition distribution gives the transition policy $\pi^{TP}(s_{t+c}|s_t)=p(s_{t+c}|s_t,\mu(s_t,\theta))$.

The paper argues it is valid to refer to this as a policy, rather than a general transition distribution, because the original MDP is isomorphic to a new MDP whose policy is $\pi^{TP}$ and whose transition function is $s_{t+c} = \pi^{TP}(s_t)$. Therefore, the policy gradient theorem can be applied to the transition policy $\pi^{TP}$:
$$
\nabla_\theta \pi_t^{TP} = \mathbb{E}\left[(R_t - V(s_t))\,\nabla_{\theta} \log p(s_{t+c}\mid s_t,\mu(s_t,\theta))\right]
$$
The FuN framework departs from the conventional approach of directly optimizing the policy to maximize the cumulative reward. Instead, it models and optimizes the transition policy, a distribution over future states given the current state and the selected sub-policy. By modeling the transitions, the agent can bypass the complex trajectory of the Worker and directly optimize the predicted transition distribution: with knowledge of where the trajectories are likely to end up, we can follow the policy gradient of the predicted transition rather than of the Worker's behaviour. FuN assumes that the direction in state space, $s_{t+c}-s_t$, follows a von Mises-Fisher distribution.

The von Mises-Fisher distribution is used in FuN to model the direction in state-space that the agent may transition to when executing a given sub-policy. The Manager outputs the goal vector that encodes the desired direction of transition and sets the mean direction of the distribution. Meanwhile, the concentration parameter determines how likely the agent is to transition in the direction of the goal vector versus in a random direction.
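Roughly speaking, with a fixed concentration parameter this assumption gives
$$
p(s_{t+c}\mid s_t, o_t) \propto e^{\,d_{cos}(s_{t+c}-s_t,\; g_t(\theta))}
\quad\Rightarrow\quad
\nabla_\theta \log p(s_{t+c}\mid s_t, o_t) = \nabla_\theta\, d_{cos}(s_{t+c}-s_t,\; g_t(\theta)),
$$
so the exact transition policy gradient above reduces to the approximate update $\nabla g_t$ for the Manager's goals given earlier.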

Architecture

The perceptual module $f^{percept}$ is a convolutional network (CNN) followed by a fully connected layer. I think the perceptual module could be swapped for a different type of network depending on the use case. Each convolutional and fully-connected layer is followed by a rectifier non-linearity. This is substantially the same CNN as in (Mnih et al., 2016); the only difference is that in the pre-processing stage all colour channels are retained.
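A sketch of such a perceptual module in PyTorch; the layer sizes here are illustrative, patterned on a standard Atari CNN rather than copied from the paper.

```python
import torch
import torch.nn as nn

class PerceptualModule(nn.Module):
    """CNN followed by a fully connected layer, each followed by a ReLU."""
    def __init__(self, in_channels=3, d=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(d), nn.ReLU(),   # produces z_t, the shared representation
        )

    def forward(self, x_t):
        return self.net(x_t)

# Example: a batch with one 84x84 RGB observation (all colour channels retained).
z_t = PerceptualModule()(torch.randn(1, 3, 84, 84))
```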

Dilated LSTM

The paper proposes a novel RNN architecture for the Manager, which operates at a lower temporal resolution than the observations arriving from the environment. A dilated LSTM is defined analogously to dilated convolutional networks (Yu & Koltun, 2016).

It splits the full state $h_t$ into $r$ separate groups of sub-states, or 'cores'. The hidden state of the dilated LSTM is thus $h = \{\hat{h}^i\}_{i=1}^r$, i.e. it is composed of $r$ separate groups. At each timestep $t$, the network update is as follows:
$$
\hat{h}_t^{t\,\zeta\, r}, g_t = \mathrm{LSTM}(s_t, \hat{h}_{t-1}^{t\,\zeta\, r}; \theta^{LSTM}),
$$
where $\zeta$ denotes the modulo operation and indicates which group of cores is being updated at time $t$. Writing $\theta^{LSTM}$ explicitly stresses that the same set of parameters governs the update of each of the $r$ groups within the dLSTM.

The advantage of the dilated LSTM is that its $r$ groups of cores can preserve memories over long periods, while the dLSTM as a whole still processes and learns from every experience and can update its output at every step. This is similar in spirit to clockwork RNNs (Koutník et al., 2014); however, there the top level “ticks” at a fixed, slow pace, whereas the dLSTM observes all of the available training data. In the experiments $r = 10$ was used, and this was also the prediction horizon $c$ [1].
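A minimal sketch of the idea (my own implementation, not the authors' code): keep $r$ separate LSTM states, update only the core indexed by $t$ modulo $r$ at each step, and share a single set of LSTM parameters across all cores. How the cores are pooled into a single output is a design choice in this sketch.

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """r groups of LSTM 'cores' sharing one set of parameters (theta^LSTM)."""
    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r = r
        self.cell = nn.LSTMCell(input_size, hidden_size)   # shared parameters
        self.h = [torch.zeros(1, hidden_size) for _ in range(r)]
        self.c = [torch.zeros(1, hidden_size) for _ in range(r)]

    def forward(self, s_t, t):
        i = t % self.r                                     # which core ticks now
        self.h[i], self.c[i] = self.cell(s_t, (self.h[i], self.c[i]))
        # Output pools over all r cores (each was updated within the last r steps);
        # summation is one simple pooling choice.
        return torch.stack(self.h).sum(dim=0)

# Example usage: one tick per environment step.
dlstm = DilatedLSTM(input_size=256, hidden_size=256, r=10)
for t in range(30):
    g_hat = dlstm(torch.randn(1, 256), t)
```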

Conclusion

In conclusion, this paper presents an interesting approach to the challenge of task selection for reinforcement learning agents, which is often a difficult and non-deterministic problem. The FeUdal Network architecture allows for end-to-end training of both the Manager and Worker modules, which is a remarkable achievement. While training the Manager module on its own is not a difficult task, the paper’s approach of training the Manager and Worker together in a holistic process is novel and promising. However, the scalability of the Manager’s role remains a potential issue, as an increasing number of goals and state dimensions may not always result in appropriate goals for the Worker. Further research is necessary to determine the effectiveness of this approach in more complex and diverse environments, as well as to investigate potential solutions for the scalability issue. Overall, the FeUdal Network architecture presents an innovative approach to reinforcement learning with potential for further exploration and refinement.

If you want to explore the experimental results, please refer to the original paper for details.

References

[1] Vezhnevets, Alexander Sasha, et al. “Feudal networks for hierarchical reinforcement learning.” International Conference on Machine Learning. PMLR, 2017.

[2] Dayan, Peter and Hinton, Geoffrey E. Feudal reinforcement learning. In NIPS. Morgan Kaufmann Publishers, 1993.

[3] Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

[4] Kulkarni, Tejas D., Narasimhan, Karthik R., Saeedi, Ardavan, and Tenenbaum, Joshua B. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. arXiv preprint arXiv:1604.06057, 2016.
