Distilling Reinforcement Learning
What is Reinforcement Learning
Consider an agent situated in an unfamiliar setting, capable of garnering rewards through engagment with its surroundings. The agent’s objective is to perform actions that optimize the accumulation of rewards. Practical examples of this concept incldue a gaming bot striving for top scors or a robot executing physical tasks with tanglible objects, although the possibilities extend beyond these instances.
The primary objective of Reinforcement Learning (RL) is to develop an effective strategy for an agent through iterative trials and straightforward feedback. By emplying the optimial strategy, the agent can dynamically adpat to the environment, thereby maximizing future rewards.
Let’s delve into some fundamental concepts in RL.
An agent operates within an environment, and the way the environment, and the way the environment reacts to specific actions is dictacted by a model, which may be known or unknown to us. The agent can occupy on of many state ($s \in S$) within the environment and choose to take one of several actions ($a \in A$) to transition between states. The transition probabilities between state (P) determine the agents’ destination state. Upon taking an action, the environment offers a reward ($r \in R$) as feedback.
The model outlines the reward function and transition probabilities. Depending on whether we know the model’s working or not, there are two distnct situations.
Known Model: Execute model-based RL with perfect inffomration for planning. When the environment is fully understood, the optimal solution can be found using Dynamic Programming (DP). This is reminiscent of solving problems such as the “longest increasing subsequence” or “traveling salesman problem.”
Unknown Model: Perform model-free RL or attempt to explicitly learn the model as part of the algorithm, given that information is incomplete. The majority of the subsequent content addresses scenarios where the model is unknown.
The agent’s policy, $\pi(s)$, offeres guidance on the optimal action to take in a specific state to maximize total rewards. Each state is linked to a value function, $V(s)$, which forecasts the anticipated amount of future rewards obtainable in that state by following the corresponding policy. In essence, the value funcution measures a state’s quality. Both policy and value functions are the targets of reinforcement learning.
The interaction between the agent and the environment involves a series of actions and observed reward over time, $t = 1,2, …, T$. Throughout this process, the agent accumulates knowledge about the environment, learns the optimial policy, and decides on the next action to take to efficiently learn the best policy. Let’s denote the state, action, and reward at time step $t$ as $S_t, A_t$ and $R_t$ respectively. Therefore, the interaction sequence is fully represented by a single episode (also referred to as “trial” or “trajectory”) that concludes at the terminal state $S_T$:
$$
S_1, A_1,R_1,S_2,A_2,R_2,…,S_T
$$
Model
The model serves as a representation of the environment, enabling us to understand or deduce how the environment interacts with and offers feedback to the agent. The model consists of two primary components: the transition probability function $P$ and the reward function $R$.
Consider a situation where we are in state $s$ and decide to take action $a$, leading to the next state s' and receiving reward r. This single transtion step is denoted by the tuple $(s, a, s’, r)$.
The transition function $P$ documents the likelihood of transitioning from state s to s' upon taking action a and acquiring reward $r$. $\mathbb{P}$ symbolizes “probability” in this context.
$$
P(s’,r|s,a) = \mathbb{P}[S_{t+1}=s’, R_{t+1}=r|S_t=s, A_t=a]
$$
Thus, the state=transition function can be defined as a function of $P(s’, r|s,a):$
$$
P_{ss’}^a =P(s’|s,a)=\mathbb{P}[S_{t+1}=s’,|S_t=s, A_t=a]=\sum_{r \in R}P(s’,r|s,a)
$$
The reward function R is obtained by the next reward triggered by an action:
$$
R(s,a) = \mathbb{E}[R_{t+1}|S_t=s, A_t=a]=\sum_{r \in R}r\sum_{s’\in S}P(s’,r|s,a)
$$
Model-Based vs Model-free:
Model-based methods rely on constructing and utilizing a model of the environment to plan and make decions, while model-free methods learn the optimal policy directly from the agent’s experiences without building an explicit model. Each approach has its own advantages and disadvantages, depending on the problem and environment at hand
- Model-based: It relies on building an explicit model of the environment, which includes transition probabilities (how the environment changes with actions) and reward functions. It uses the model to plan and make decisions, siulating potential future scenarios to find the optimal policy. It can be more sample-efficient, as it leverages the environment’s model to learn faster and require fewer interactions with the environment. However, building an accurate model can be challenging, particularly for complex environments with large state and action spaces. Example: Dynamic Programming techniques like Value Iteration and Policy Iteration
- Model-free: It does not rely on explicit model of the environment; instead, it learns the optimal policy directly from the agent’s experiences (state transitions and rewards). It learns through trial-and-error, using techniques like Temporal Difference learning or Monte Carlo methods. This method can be more straightforward to implement, as they do not require an explicit model of the environment, making them suitable for complex and high-dimensional environments. However, they might require more interations with the environment to learn the optimal policy, making them less sample-efficient. Example: Q-Learning, SARSA, and Actor-Critic algorithms.
Policy
A policy, denoted as the agent’s behavior function $\pi$, provides guidance on which action to take in a given state $s$. It represents a mapping from state $s$ to action $a$ and can be either deterministic or stochastic:
- Deterministic: $\pi(s)=a$
- Stochastic: $\pi(a|s)= \mathbb{P}_{\pi}[A=a|S=s]$
On-policy vs Off-policy:
On-policy methods learn directly from the agent’s interactions with the environment while adhearing to the current policy. In contrast, off-policy methods learn from experiences generated by a different policy, allowing for greater flexibility and more efficient learning from a diverse set of experiences.
- On-policy: On-policy methods learn the value function and policy by using the same policy for both exploration and exploitation. Thy follow the current policy while making decisions and simultaneously update the policy based on the experiences gathered. These methods usually strike a balance between exploration and exploitation during learning.
- Off-policy: It separate the policy used for learning (target policy) from the policy used for exploration (behavior policy). The behavior policy is responsible for generating experiences, while the target policy is updated based on the experiences collected by the behavior policy. This separation allows off-policy methods to learn from a wider range of experiences, including historical data or experiences from other agents.
Value Function
The value function represents the expected cumulative reward that an agent can obtain, starting from a specific state and following a given policy. The value function helps the agent to estimate the long-term value of each state, considering future rewards it may receive by taking actions according to the policy. It is a key concept in RL, as it helps guid the agent’s decision-making process to maximize the total reward over time.  Let’s compute the return $G_t$ starting from time $t$:
$$
G_t = R_{t+1} + \gamma R_{t+2} + … \sum_{k=0}^{\infty}\gamma R_{t+k+1}
$$
The discount factor $\gamma$ which ranges from 0 to 1, discounts future rewars for several reasons:
- Future reward may be more uncertain
- Immediate rewards are oftern more appealing than delayed gratification, as humans tend to prefer to enjoying themselves now rather than waiting for years
- Discounting offeres mathematical simplicity, as it eliminates the need to account for infinite future steps when calculating returns
- It helps mitigate concerns about infinite loops in the state transition graph, ensuring the agent converges on an optimal policy
State-value function ($V(s)$): This function estimates the expected cumulative reward when starting from state $s$ and following a particular policy. It considers the value of the current state without taking into account the immediate action the agent will take.
$$
V_{\pi}(s) = \mathbb E_{\pi}[G_t|S_t=s]
$$
Action-value function($Q(s,a)$): This function estimates the expectued cumulative reward when starting from state $s$, taking action $a$, and then following a particular policy. It considers the value of the current state-action pair, accouting for both the state and the action the agent will take.
$$
Q_{\pi}(s,a) = \mathbb E_{\pi}[G_t|S_t=s, A_t=a]
$$
Furthermore, when following the target policy $\pi$ we can leverage the probability destribution over possible actions and the Q-values to derive the state-value:
$$
V_{\pi}(s)=\sum_{a \in A}Q_{\pi}(s,a) \pi (a|s)
$$
The action advantage function, also known as the “A-value,” represents the difference between the action-value (Q-value) and state-value (V-value). The function quantifies the relative benefit of taking a specific acion in a given state, compared to the average value of that state following the target policy. By calculating the A-value, we can identify which actions are advantageous or disadvantageous in a particular state, guiding the agent’s decision-making process for improved performance.
$$
A_{\pi}(s,a) = Q_{\pi}(s,a)- V_{\pi}(s)
$$
Markov Decision Processes
In more formal terms, the majority of reinforcement learning problems can be formulated as Markov Decision Processes (MDPs). All states in an MDP exhibit the “Markov” property, which means that the future depends solely on the current state, rather than the entire history:
$$
\mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, …, S_t]
$$
In other words, given the present state, the future and the past are conditionally independent. This is because the current state contains all the necessary information to determine future outcomes. A Markove decision process can be defined as $\mathcal{M} =<\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R}, \gamma>$, where:
- $\mathcal{S}$: a set of states;
- $\mathcal{A}$: a set of actions;
- $\mathcal{P}$: transition probability function;
- $\mathcal{R}$: reward function;
- $\gamma$: the discount factor is used to account for future rewards. In an unfamilar environment, we lack complete information about the transition probabilities ($\mathcal{P}$) and reward function ($\mathcal{R}$).
Bellman Equations
Bellman equations are essential in RL because they provide a systematic way to decompose and solve complex decision-making problems. They underpin many widely-used RL algorithms and help acheive optimality and convergence to effective policies - more reasons:
Optimal Substructure: Bellman equations exploit the fact that optimal solutions can be built from optimal-sub solutions. This property allows RL algorithms to find the best actions to take in any given state by decomposing complex problems into simpler sub-problems.
Dynamic Programming: Bellman equations form the foundation of dynamic programming techinques such as Value Iteration and Policy Iteration. These technique are used to compute optimal policies by iteratively updating state values or action values until convergence.
Temporal Difference Learning: Bellman equations are a key component of temporal difference learning algorithm like Q-learning and State-Action-Reward-State-Action (SARSA). These algorithms use the difference between successive state values to update the value estimates, which allows them to learn online and incrementally from experience.
Convergence to Optimal Policies: The recursive nature of the Bellman equations ensures that algorithms based on them can find optimal policies, provided certain conditions are met (like sufficient exploration). This guarantee of optimality is an attractive feature for solving complex decision-making problems.
Markov Decision Processes: Bellman equations enable the modeling of RL problems as Markov Decision Processes (MDPs), where the optimal decision-making process depends only on the current state and not the history. This simplificiation allows for efficient computation and generalization across a wide range or problems.
The value function can be broken down into the immediate reward combined with the discounted future values.
$$
\begin{flalign}
v(s) & = \mathbb{E}[G_t|S_t=s] \
          & = \mathbb{E}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+…|S_t=s] \
         & =\mathbb{E}[R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+…)|S_t=s] \
         & =\mathbb{E}[R_{t+1}+\gamma G_{t+1}|S_t=s] \
         & =\mathbb{E}[R_{t+1}+\gamma V(S_{t+1})|S_t=s] \
\end{flalign}
$$
$$
\begin{flalign}
Q(s,a) & =\mathbb{E}[R_{t+1}+\gamma V(S_{t+1})|S_t=s, A_t=a] \
& = \mathbb{E}[R_{t+1}+\gamma \mathbb{E}{a \backsim \pi}Q(S{t+1}, a) | S_t=s, A_t = a]
\end{flalign}
$$
The above recursive update process can be further separated into equations based on both state-value and action-value functions. As we progress through future action steps, we alternately extend $V$ and $Q$ by adhearing to the policy  $\pi$
Meta Reinforcement Learning
A robust meta-learning model should effectively generalize to novel tasks or environments, even those it hasn’t encountered during training. Minimal adaptation should occur during testing with limited exposure. Ideally, the model should self-adjust its internal hidden states to learn without requiring explicit fine-tuning (no gradient backpropagation on trainable variables).
References
https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
https://towardsdatascience.com/policy-gradients-in-reinforcement-learning-explained-ecec7df94245
https://towardsdatascience.com/proximal-policy-optimization-ppo-explained-abed1952457b