Algorithms
Taxonomy of Reinforcement Learning Algorithms (2018)
Image Credit: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html#a-taxonomy-of-rl-algorithms
Advantage Actor-Critic (A2C)
A2C, or Advantage Actor-Critic, is a synchronous version of the A3C policy gradient method. Instead of A3C's asynchronous updates, A2C is a synchronous, deterministic implementation that waits for each actor to finish its segment of experience before updating, averaging over all of the actors.
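A minimal sketch of that synchronous update, assuming PyTorch and illustrative network and batch shapes (not CleanRL's exact implementation): rollouts from all actors are stacked into a single batch and one gradient step is taken on the combined actor-critic loss.

```python
# Minimal A2C-style update sketch (PyTorch). Networks and shapes are illustrative assumptions.
import torch
import torch.nn as nn

num_actors, obs_dim, act_dim = 8, 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(policy.parameters()) + list(value_fn.parameters()), lr=7e-4)

# One synchronous batch: every actor has finished its rollout segment.
obs = torch.randn(num_actors, obs_dim)           # stacked observations from all actors
actions = torch.randint(act_dim, (num_actors,))  # actions the actors took
returns = torch.randn(num_actors)                # bootstrapped n-step returns (placeholders)

values = value_fn(obs).squeeze(-1)
advantages = returns - values.detach()           # advantage estimate A(s, a)
log_probs = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)

policy_loss = -(log_probs * advantages).mean()   # averaged over all actors at once
value_loss = (returns - values).pow(2).mean()
(policy_loss + 0.5 * value_loss).backward()
optimizer.step()
optimizer.zero_grad()
```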
Asynchronous Advantage Actor-Critic (A3C)
A3C, or Asynchronous Advantage Actor-Critic, is a policy gradient algorithm in reinforcement learning that maintains a policy $\pi(a_t \mid s_t; \theta)$ and an estimate of the value function $V(s_t; \theta_v)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value function.
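A small sketch of the $n$-step return computation the update relies on, using illustrative placeholder rewards (plain NumPy, without A3C's asynchronous machinery):

```python
# Sketch of n-step return computation used to update both the policy and the value
# function; the reward sequence and bootstrap value are illustrative placeholders.
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """R_t = r_t + gamma * r_{t+1} + ... + gamma^{T-t} * V(s_T) for one rollout segment."""
    returns = np.zeros(len(rewards))
    running = bootstrap_value                   # V(s_T) estimated by the critic
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [0.0, 0.0, 1.0, 0.0, 1.0]             # rewards from a 5-step segment
print(n_step_returns(rewards, bootstrap_value=0.5))
```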
Categorical DQN (C51)
C51 introduces a distributional perspective for DQN: instead of learning a single expected value for each action, C51 learns a full distribution over returns for each action, represented as a categorical distribution over 51 fixed atoms. Empirically, C51 demonstrates impressive performance on the ALE.
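A sketch of that distributional output, assuming PyTorch and an illustrative network: the model predicts a categorical distribution over a fixed support of atoms per action, and Q-values are recovered as its expectation (support bounds and atom count follow common C51 defaults).

```python
# Sketch of the C51 value distribution; network and observation are illustrative.
import torch
import torch.nn as nn

num_actions, num_atoms = 4, 51
v_min, v_max = -10.0, 10.0
atoms = torch.linspace(v_min, v_max, num_atoms)            # fixed support z_1..z_51

net = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, num_actions * num_atoms))

obs = torch.randn(1, 8)                                    # placeholder observation
logits = net(obs).view(-1, num_actions, num_atoms)
probs = logits.softmax(dim=-1)                             # p_i(s, a) over the atoms
q_values = (probs * atoms).sum(dim=-1)                     # Q(s, a) = sum_i p_i * z_i
action = q_values.argmax(dim=-1)                           # act greedily w.r.t. the expectation
```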
Deep Deterministic Policy Gradient (DDPG)
- https://arxiv.org/abs/1509.02971v6
- https://paperswithcode.com/method/ddpg
- https://docs.cleanrl.dev/rl-algorithms/ddpg/
DDPG, or Deep Deterministic Policy Gradient, is an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. It combines the actor-critic approach with insights from DQNs.
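A rough sketch of the two coupled updates, assuming PyTorch and illustrative networks and batch tensors: a DQN-style critic target built from target networks, and a deterministic actor trained to maximize the critic.

```python
# DDPG update sketch; networks, shapes, and the batch are illustrative assumptions.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 3, 1, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)

obs, act = torch.randn(32, obs_dim), torch.randn(32, act_dim)
rew, next_obs, done = torch.randn(32, 1), torch.randn(32, obs_dim), torch.zeros(32, 1)

# Critic: regress Q(s, a) toward r + gamma * Q'(s', mu'(s')) computed with target networks
with torch.no_grad():
    next_q = critic_target(torch.cat([next_obs, actor_target(next_obs)], dim=1))
    target_q = rew + gamma * (1 - done) * next_q
critic_loss = (critic(torch.cat([obs, act], dim=1)) - target_q).pow(2).mean()

# Actor: deterministic policy gradient, i.e. maximize Q(s, mu(s))
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=1)).mean()
```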
DeepCubeAI
DeepCubeAI is an algorithm that learns a discrete world model and employs deep reinforcement learning methods to learn a heuristic function that generalizes over start and goal states. It then integrates the learned model and the learned heuristic function with heuristic search, such as Q* search, to solve sequential decision-making problems.
Double DQN
A Double Deep Q-Network, or Double DQN, utilises Double Q-learning to reduce overestimation by decomposing the max operation in the target into action selection and action evaluation: the greedy action is selected according to the online network, while the target network is used to estimate its value.
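A sketch of that target computation with illustrative placeholder tensors (PyTorch): the online network selects the greedy action and the target network evaluates it.

```python
# Double DQN target sketch; Q-value tensors, rewards, and dones are placeholders.
import torch

def double_dqn_target(online_q_next, target_q_next, rewards, dones, gamma=0.99):
    """online_q_next / target_q_next: [batch, num_actions] Q-values at the next state."""
    greedy_actions = online_q_next.argmax(dim=1, keepdim=True)       # selection: online network
    evaluated_q = target_q_next.gather(1, greedy_actions).squeeze(1) # evaluation: target network
    return rewards + gamma * (1.0 - dones) * evaluated_q

rewards, dones = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
online_q_next, target_q_next = torch.randn(2, 4), torch.randn(2, 4)
print(double_dqn_target(online_q_next, target_q_next, rewards, dones))
```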
Deep Q-Network (DQN)
- https://arxiv.org/abs/1312.5602v1
- https://paperswithcode.com/method/dqn
- https://docs.cleanrl.dev/rl-algorithms/dqn/
A DQN, or Deep Q-Network, approximates the action-value (Q) function in a Q-learning framework with a neural network. In the Atari games case, the network takes several stacked frames of the game as input and outputs a Q-value for each possible action.
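A minimal sketch of the DQN loss, with a tiny MLP and a random batch standing in for the Atari CNN and replay-buffer samples: the online network is regressed toward a bootstrapped target from a periodically synced frozen copy.

```python
# DQN loss sketch; the network, batch, and dimensions are illustrative assumptions.
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 120), nn.ReLU(), nn.Linear(120, 2))  # 2 actions
target_net = copy.deepcopy(q_net)                                        # periodically synced copy

obs, actions = torch.randn(32, 4), torch.randint(2, (32,))
rewards, next_obs, dones = torch.randn(32), torch.randn(32, 4), torch.zeros(32)

with torch.no_grad():
    td_target = rewards + 0.99 * (1 - dones) * target_net(next_obs).max(dim=1).values
q_taken = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_taken, td_target)
```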
Phasic Policy Gradient (PPG)
PPG is a DRL algorithm that separates policy and value function training by introducing an auxiliary phase. Training proceeds by running PPO during the policy phase while saving all the experience into a replay buffer; the replay buffer is then used to train the value function. This makes the algorithm considerably slower than PPO, but improves sample efficiency on the Procgen benchmark.
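A rough sketch of the auxiliary phase as described above, with an illustrative value network and placeholder buffer contents: rollout data stored during the PPO (policy) phase is replayed to fit the value function.

```python
# Auxiliary-phase sketch; the buffer contents, network, and pass count are illustrative.
import torch
import torch.nn as nn

value_fn = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_fn.parameters(), lr=5e-4)

# Experience accumulated over several PPO policy-phase iterations.
replay_obs = torch.randn(4096, 8)
replay_returns = torch.randn(4096)

for _ in range(6):                                   # several passes over the stored data
    value_loss = (value_fn(replay_obs).squeeze(-1) - replay_returns).pow(2).mean()
    optimizer.zero_grad()
    value_loss.backward()
    optimizer.step()
```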
Proximal Policy Optimization (PPO)
- https://arxiv.org/abs/1707.06347v2
- https://paperswithcode.com/method/ppo
- https://docs.cleanrl.dev/rl-algorithms/ppo/
Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while using only first-order optimization.
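A sketch of the clipped surrogate objective that makes this first-order approach work, with placeholder log-probabilities and advantages:

```python
# PPO clipped surrogate loss sketch; inputs are illustrative placeholders.
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_coef=0.2):
    ratio = (new_log_probs - old_log_probs).exp()              # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    return -torch.min(unclipped, clipped).mean()               # pessimistic (clipped) bound

adv = torch.tensor([0.5, -1.0, 2.0])
loss = ppo_clip_loss(torch.tensor([-0.9, -1.2, -0.7]), torch.tensor([-1.0, -1.0, -1.0]), adv)
```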
Transformer-XL (PPO-TrXL)
Real-world tasks may expose imperfect information (e.g. partial observability). Such tasks require an agent to leverage memory capabilities. One way to do this is to use recurrent neural networks (e.g. LSTM) as seen in ppo_atari_lstm.py. Here, Transformer-XL is used as episodic memory in Proximal Policy Optimization (PPO).
Parallel Q Network (PQN)
PQN is a parallelized version of the Deep Q-learning algorithm. It is designed to be more efficient than DQN by collecting experience from many environments in parallel. PQN can be thought of as DQN (1) without a replay buffer and target networks, and (2) with layer normalization and parallel environments.
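A minimal sketch of that recipe, assuming PyTorch and illustrative shapes: a LayerNorm Q-network, one synchronous batch from parallel environments, and a bootstrap target taken from the online network itself.

```python
# PQN-style update sketch: no replay buffer, no target network; shapes are illustrative.
import torch
import torch.nn as nn

num_envs, obs_dim, num_actions = 16, 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 128), nn.LayerNorm(128), nn.ReLU(),
                      nn.Linear(128, num_actions))

# One synchronous step of fresh data from all parallel environments.
obs, next_obs = torch.randn(num_envs, obs_dim), torch.randn(num_envs, obs_dim)
actions = torch.randint(num_actions, (num_envs,))
rewards, dones = torch.randn(num_envs), torch.zeros(num_envs)

with torch.no_grad():                                  # bootstrap from the online network itself
    td_target = rewards + 0.99 * (1 - dones) * q_net(next_obs).max(dim=1).values
q_taken = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_taken, td_target)
```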
QDagger
QDagger is an extension of the DQN algorithm that reuses previously computed results, such as a teacher policy and the teacher's replay buffer, to help train a student policy. This eliminates the need to learn from scratch, improving sample efficiency and reducing the computational effort of training a new policy.
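A rough, simplified sketch of the reuse idea, with hypothetical networks, a placeholder TD loss, and a fixed distillation weight (the actual coefficient schedule in QDagger is more involved): the student's Q-learning loss is combined with a distillation term toward the teacher's policy, computed on data from the teacher's replay buffer.

```python
# Simplified QDagger-style distillation sketch; networks, batch, and the fixed
# distill_coef are hypothetical illustrations, not the exact algorithm.
import torch
import torch.nn as nn

num_actions = 6
student_q = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, num_actions))
teacher_q = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, num_actions))

obs = torch.randn(32, 8)                       # batch sampled from the teacher's replay buffer
td_loss = torch.tensor(0.0)                    # placeholder for the usual DQN TD loss

with torch.no_grad():
    teacher_policy = teacher_q(obs).softmax(dim=-1)        # softmax over teacher Q-values
student_log_policy = student_q(obs).log_softmax(dim=-1)
distill_loss = nn.functional.kl_div(student_log_policy, teacher_policy, reduction="batchmean")

distill_coef = 1.0                             # in QDagger this weight decays as the student improves
loss = td_loss + distill_coef * distill_loss
```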
Quantile Regression DQN (QR-DQN)
Quantile Regression DQN (QR-DQN) builds on Deep Q-Network (DQN) and makes use of quantile regression to explicitly model the distribution over returns, instead of predicting only the mean return as DQN does.
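A sketch of the quantile regression (asymmetric Huber) loss with placeholder quantile tensors; the reductions are simplified relative to the paper.

```python
# Quantile Huber loss sketch for QR-DQN; input tensors are illustrative placeholders.
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """pred_quantiles, target_quantiles: [batch, n_quantiles]."""
    n = pred_quantiles.shape[1]
    taus = (torch.arange(n, dtype=torch.float32) + 0.5) / n              # quantile midpoints
    # pairwise TD errors between every target quantile and every predicted quantile
    delta = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)  # [batch, n_pred, n_target]
    huber = torch.where(delta.abs() <= kappa,
                        0.5 * delta.pow(2),
                        kappa * (delta.abs() - 0.5 * kappa))
    weight = (taus.view(1, n, 1) - (delta.detach() < 0).float()).abs()   # |tau - 1{delta < 0}|
    return (weight * huber).mean()

pred, target = torch.randn(32, 51), torch.randn(32, 51)
loss = quantile_huber_loss(pred, target)
```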
Random Network Distillation (RND)
RND is an exploration bonus for RL methods that is easy to implement and enables significant progress in some hard-exploration Atari games such as Montezuma's Revenge. Proximal Policy Optimization (PPO) is used as the underlying RL method.
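A minimal sketch of the bonus, with illustrative network sizes: a fixed, randomly initialized target network and a trained predictor network; the intrinsic reward is the predictor's error on the current observation, which is high for novel states.

```python
# RND exploration bonus sketch; network sizes and the observation batch are illustrative.
import torch
import torch.nn as nn

target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32))
predictor_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32))
for p in target_net.parameters():
    p.requires_grad_(False)                       # the random target network is never trained

obs = torch.randn(16, 8)
target_feats = target_net(obs)
pred_feats = predictor_net(obs)
intrinsic_reward = (pred_feats - target_feats).pow(2).mean(dim=1)  # exploration bonus per state
predictor_loss = intrinsic_reward.mean()           # the same error trains the predictor
```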
Robust Policy Optimization (RPO)
RPO perturbs the distribution used to represent actions, with the goal of encouraging high-entropy actions and providing a better representation of the action space. The method is a simple modification on top of the PPO objective: in RPO, the mean of the action distribution is perturbed with a random number drawn from a Uniform distribution.
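A sketch of that one-line change on top of PPO for continuous actions, with illustrative shapes and an assumed perturbation range `rpo_alpha = 0.5`: the Gaussian used in the policy loss is built from the perturbed mean.

```python
# RPO perturbation sketch; shapes, std, and rpo_alpha are illustrative assumptions.
import torch

action_mean = torch.randn(32, 6)                 # output of the policy network
action_std = torch.ones(32, 6)

rpo_alpha = 0.5
noise = torch.empty_like(action_mean).uniform_(-rpo_alpha, rpo_alpha)
perturbed_dist = torch.distributions.Normal(action_mean + noise, action_std)

actions = torch.randn(32, 6)                     # actions taken during the rollout
new_log_probs = perturbed_dist.log_prob(actions).sum(dim=-1)  # fed into the usual PPO objective
```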
Soft Actor-Critic (SAC)
- https://arxiv.org/abs/1801.01290
- https://paperswithcode.com/method/soft-actor-critic
- https://docs.cleanrl.dev/rl-algorithms/sac/
Soft Actor-Critic (SAC) is an off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework: the actor aims to maximize expected return while also maximizing entropy, i.e., to succeed at the task while acting as randomly as possible. It operates over continuous action spaces and, in common implementations, uses a stochastic policy together with two Q-function critics.
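A sketch of the entropy-regularized (soft) Q target, with placeholder tensors and an assumed fixed temperature alpha: the bootstrap uses the minimum of the two target critics minus alpha times the log-probability of the sampled next action.

```python
# Soft Q target sketch; batch tensors and the fixed alpha are illustrative assumptions.
import torch

alpha, gamma = 0.2, 0.99
rewards, dones = torch.randn(32), torch.zeros(32)
next_log_prob = torch.randn(32)                     # log pi(a'|s') for a' sampled from the policy
q1_next, q2_next = torch.randn(32), torch.randn(32) # the two target critics evaluated at (s', a')

soft_value = torch.min(q1_next, q2_next) - alpha * next_log_prob
td_target = rewards + gamma * (1 - dones) * soft_value
```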
Twin Delayed Deep Deterministic (TD3)
- https://arxiv.org/abs/1802.09477v3
- https://paperswithcode.com/method/td3
- https://docs.cleanrl.dev/rl-algorithms/td3/
TD3 builds on the DDPG algorithm for reinforcement learning, with a couple of modifications aimed at tackling overestimation bias in the value function. In particular, it utilises clipped double Q-learning, delayed updates of the policy and target networks, and target policy smoothing.
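A sketch of the target computation that combines target policy smoothing and clipped double Q-learning, with placeholder tensors and commonly used hyperparameter values:

```python
# TD3 target sketch; batch tensors and hyperparameters are illustrative placeholders.
import torch

gamma, policy_noise, noise_clip = 0.99, 0.2, 0.5
rewards, dones = torch.randn(32), torch.zeros(32)
next_action = torch.randn(32, 6)                    # target actor's action at s'

# Target policy smoothing: add clipped noise to the target action.
noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
smoothed_action = (next_action + noise).clamp(-1.0, 1.0)

q1_next = torch.randn(32)                           # target critic 1 at (s', smoothed_action)
q2_next = torch.randn(32)                           # target critic 2 at (s', smoothed_action)
td_target = rewards + gamma * (1 - dones) * torch.min(q1_next, q2_next)  # clipped double Q
# The actor and the target networks are updated only every d-th critic update (delayed updates).
```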
Trust Region Policy Optimization (TRPO)
Trust Region Policy Optimization, or TRPO, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much by enforcing a KL divergence constraint on the size of the policy update at each iteration.
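A sketch of the two quantities the constrained update is built from, using small placeholder distributions; the full algorithm solves the constrained problem with conjugate gradients and a line search.

```python
# TRPO surrogate objective and KL constraint sketch; distributions, actions, and
# advantages are illustrative placeholders.
import torch
from torch.distributions import Categorical, kl_divergence

old_policy = Categorical(logits=torch.tensor([[1.0, 0.0], [0.5, 0.5]]))
new_policy = Categorical(logits=torch.tensor([[1.2, -0.1], [0.4, 0.6]]))
actions = torch.tensor([0, 1])
advantages = torch.tensor([1.0, -0.5])

ratio = (new_policy.log_prob(actions) - old_policy.log_prob(actions)).exp()
surrogate = (ratio * advantages).mean()             # maximize this ...
kl = kl_divergence(old_policy, new_policy).mean()   # ... subject to kl <= delta (e.g. 0.01)
```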