Glossary

Definitions of key terms, notations, and concepts used in reinforcement learning.


Action Space

The set of all valid actions in a given environment.

  • discrete action spaces: only a finite number of moves are available to the agent
  • continuous action spaces: actions are real-valued vectors
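
As a concrete illustration, here is a minimal NumPy sketch of sampling from the two kinds of action space; the action count, bounds, and dimensions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete action space: a finite set of moves, e.g. {0: left, 1: right, 2: jump}.
n_actions = 3
discrete_action = int(rng.integers(n_actions))      # one of the n_actions moves

# Continuous action space: real-valued vectors, e.g. joint torques in [-1, 1]^2.
low, high, act_dim = -1.0, 1.0, 2
continuous_action = rng.uniform(low, high, size=act_dim)

print(discrete_action, continuous_action)
```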

Advantage Function

The advantage function $A^{\pi}(s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after.

$$ A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s). $$
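
A quick tabular sketch of this relationship, with made-up estimates of $Q^{\pi}$ and $\pi$ for a two-state, three-action problem:

```python
import numpy as np

# Hypothetical tabular estimates for a 2-state, 3-action problem.
Q = np.array([[1.0, 2.0, 0.5],
              [0.0, 0.3, 0.9]])          # Q^pi(s, a)
pi = np.array([[0.2, 0.5, 0.3],
               [0.6, 0.1, 0.3]])         # pi(a | s)

# V^pi(s) is the expectation of Q^pi(s, .) under pi(. | s).
V = (pi * Q).sum(axis=1)

# A^pi(s, a): how much better action a is than the policy's average behaviour in s.
A = Q - V[:, None]

print(A)
print((pi * A).sum(axis=1))              # ~0: the advantage averages to zero under pi
```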


Markov Decision Processes

An MDP is a 5-tuple, $\langle S, A, R, P, \rho_0 \rangle$, where

  • $S$ is the set of all valid states,
  • $A$ is the set of all valid actions,
  • $R : S \times A \times S \to \mathbb{R}$ is the reward function, with $r_t = R(s_t, a_t, s_{t+1})$,
  • $P : S \times A \to \mathcal{P}(S)$ is the transition probability function, with $P(s'|s,a)$ being the probability of transitioning into state $s'$ if you start in state $s$ and take action $a$,
  • and $\rho_0$ is the starting state distribution.

The name Markov Decision Process refers to the fact that the system obeys the Markov property: transitions depend only on the most recent state and action, not on any prior history.
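
To make the tuple concrete, here is a toy two-state, two-action MDP written out as plain NumPy arrays; all numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])

# R[s, a, s'] = reward for the transition (s, a, s').
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0                          # reward 1 whenever the next state is 1

rho0 = np.array([1.0, 0.0])               # starting state distribution: always state 0

# One step of the process: by the Markov property, the next state depends
# only on the current state and action, not on any earlier history.
def step(s, a):
    s_next = int(rng.choice(n_states, p=P[s, a]))
    return s_next, R[s, a, s_next]

s = int(rng.choice(n_states, p=rho0))     # s_0 ~ rho_0
print(step(s, 0))
```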


Observation

Partial description of a state, which may omit information.

  • fully observed: the agent is able to observe the complete state of the environment
  • partially observed: the agent can only see a partial observation

Policy

A rule used by an agent to decide what actions to take.

  • deterministic policy: a policy that always chooses the same action for a given state, with no randomness:

$$ a_t = \mu(s_t) $$

  • stochastic policy: a policy that samples actions from a probability distribution conditioned on the state (e.g. categorical policies for discrete action spaces and diagonal Gaussian policies for continuous action spaces):

$$ a_t \sim \pi( \cdot \mid s_t) $$

  • parameterized policy: a policy whose outputs are computable functions that depend on a set of parameters (often denoted by $\theta$ or $\phi$):

$$ a_t = \mu_\theta(s_t) $$

$$ a_t \sim \pi_\theta( \cdot \mid s_t) $$
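
A minimal sketch of parameterized deterministic and stochastic (categorical) policies, using a single linear layer in NumPy; the dimensions and weights below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions = 4, 3
theta = rng.normal(size=(obs_dim, n_actions))   # policy parameters (one linear layer here)

def logits(s, theta):
    return s @ theta

# Deterministic parameterized policy: a_t = mu_theta(s_t).
def mu(s, theta):
    return int(np.argmax(logits(s, theta)))

# Stochastic parameterized (categorical) policy: a_t ~ pi_theta(. | s_t).
def pi_sample(s, theta):
    z = logits(s, theta)
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(rng.choice(n_actions, p=probs))

s = rng.normal(size=obs_dim)                    # a stand-in observation
print(mu(s, theta), pi_sample(s, theta))
```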


Reward and Return

The reward function $R$ depends on the current state of the world, the action just taken, and the next state of the world:

$$ r_t = R(s_t, a_t, s_{t+1}) $$

The return is the cumulative reward over a trajectory, $R(\tau)$.

  • finite-horizon undiscounted return: the sum of rewards obtained in a fixed window of steps:

$$ R(\tau) = \sum_{t=0}^T r_t $$

  • infinite-horizon discounted return: the sum of all rewards ever obtained by the agent, discounted by how far off in the future they’re obtained. For $\gamma \in (0,1)$:

$$ R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t $$
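
For example, both kinds of return can be computed directly from a recorded reward sequence (the numbers below are made up):

```python
# Rewards observed along one trajectory.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.99

# Finite-horizon undiscounted return: plain sum over the window of steps.
undiscounted = sum(rewards)

# Infinite-horizon discounted return, truncated to the rewards actually observed.
discounted = sum(gamma**t * r for t, r in enumerate(rewards))

print(undiscounted, discounted)
```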


State

A complete description of the state of the world.


Trajectory

A trajectory $\tau$ is a sequence of states and actions in the world.

$$ \tau = (s_0, a_0, s_1, a_1, \ldots) $$
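
A sketch of collecting a short trajectory by rolling out a uniform random policy in a toy random-walk environment (both invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    # Toy deterministic dynamics: move right on action 1, left on action 0.
    return s + (1 if a == 1 else -1)

s = 0                                     # s_0 drawn from the start-state distribution
trajectory = []                           # will hold (s_0, a_0, s_1, a_1, ...)
for t in range(5):
    a = int(rng.integers(2))              # a_t ~ pi(. | s_t), here uniform over {0, 1}
    trajectory += [s, a]
    s = env_step(s, a)

print(trajectory)
```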


Value Function

A value function gives the expected return when starting in a given state (or state-action pair) and then acting according to a particular policy forever after.

  • On-Policy Value Function $V^{\pi}(s)$: Expected return when starting in state $s$ and following policy $\pi$:

$$ V^{\pi}(s) = \underset{\tau \sim \pi}{\mathbb{E}}\left[R(\tau) \mid s_0 = s\right] $$

  • On-Policy Action-Value Function $Q^{\pi}(s,a)$: Expected return when starting in state $s$, taking action $a$, then following policy $\pi$:

$$ Q^{\pi}(s,a) = \underset{\tau \sim \pi}{\mathbb{E}}\left[R(\tau) \mid s_0 = s, a_0 = a\right] $$

  • Optimal Value Function $V^*(s)$: Expected return when starting in state $s$ and following the optimal policy:

$$ V^*(s) = \max_{\pi}\underset{\tau \sim \pi}{\mathbb{E}}\left[R(\tau) \mid s_0 = s\right] $$

  • Optimal Action-Value Function $Q^*(s,a)$: Expected return when starting in state $s$, taking action $a$, then following the optimal policy:

$$ Q^*(s,a) = \max_{\pi}\underset{\tau \sim \pi}{\mathbb{E}}\left[R(\tau) \mid s_0 = s, a_0 = a\right] $$
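
As a sanity check on these definitions, $V^{\pi}(s)$ can be estimated by Monte Carlo: average the discounted return $R(\tau)$ over many trajectories that start in $s$ and follow $\pi$. The environment and policy below are toy stand-ins invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon, n_rollouts = 0.99, 50, 1000

def pi(s):                       # a fixed stochastic policy: uniform over {0, 1}
    return int(rng.integers(2))

def env_step(s, a):              # toy dynamics and reward
    s_next = s + (1 if a == 1 else -1)
    return s_next, float(s_next == 3)   # reward 1 for reaching state 3

def value_estimate(s0):
    # Average the (truncated) discounted return over rollouts starting at s0.
    returns = []
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for t in range(horizon):
            a = pi(s)
            s, r = env_step(s, a)
            ret += gamma**t * r
        returns.append(ret)
    return np.mean(returns)

print(value_estimate(0))
```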