
The definition is correct, though not instantly obvious if you see it for the first time. Let me put it this way: *a policy is an agent’s strategy*.

For example, imagine a world where a robot moves across the room, and the task is to get to the target point (x, y), where it gets a reward. Here:

- the room is the *environment*;
- the robot's current position is a *state*;
- a *policy* is what the agent does to accomplish this task: a dumb robot may wander around randomly until it accidentally reaches the goal, while a smarter one goes straight to it.

Obviously, some policies are better than others, and there are multiple ways to assess them, namely the *state-value function* and the *action-value function*. The goal of RL is to learn the best policy. Now the definition should make more sense (note that in this context, time is better understood as the state):

*A policy defines the learning agent’s way of behaving at a given time.*
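
To make the "strategy" reading concrete, here is a minimal sketch (not from the original answer) comparing two policies for the robot example; the 3×3 grid, the goal cell, and both policies are invented for illustration:

```python
# A minimal sketch (assumptions: a 3x3 grid, goal at (2, 2),
# deterministic moves) of two policies for the robot example.
import random

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GOAL = (2, 2)

def random_policy(state):
    """Policy 1: wander randomly until the goal is hit by accident."""
    return random.choice(list(ACTIONS))

def greedy_policy(state):
    """Policy 2: head straight for the goal."""
    x, y = state
    return "right" if x < GOAL[0] else "up"

def steps_to_goal(policy, state=(0, 0)):
    """Count how many moves a policy needs to reach the goal."""
    steps = 0
    while state != GOAL:
        dx, dy = ACTIONS[policy(state)]
        # Clamp to the 3x3 grid so the robot cannot leave the room.
        state = (min(max(state[0] + dx, 0), 2), min(max(state[1] + dy, 0), 2))
        steps += 1
    return steps

print(steps_to_goal(greedy_policy))  # always 4 moves
print(steps_to_goal(random_policy))  # usually far more
```

Both functions are policies in the exact sense of the definition: each one fixes what the agent does in every state, and the step counts show why one is better than the other.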

### Formally

More formally, we should first define a *Markov Decision Process* (MDP) as a tuple (`S`, `A`, `P`, `R`, `γ`), where:

- `S` is a finite set of states
- `A` is a finite set of actions
- `P` is a state transition probability matrix (the probability of ending up in each next state, given the current state and an action)
- `R` is a reward function, given a state and an action
- `γ` is a discount factor between 0 and 1
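
As a hedged sketch of what such a tuple might look like in code, here is a tiny made-up MDP; the states `s0`/`s1`, the actions, and all the numbers are invented for illustration:

```python
# A tiny made-up MDP (S, A, P, R, gamma) as plain Python data structures.
S = ["s0", "s1"]          # finite set of states
A = ["stay", "go"]        # finite set of actions

# P[(s, a)] -> {next_state: probability}; each row sums to 1.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

# R[(s, a)] -> immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"):   1.0,
    ("s1", "stay"): 2.0,
    ("s1", "go"):   0.0,
}

gamma = 0.9               # discount factor in [0, 1]
```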

Then, a policy `π` is a probability distribution over actions given states, i.e. the likelihood of every action when the agent is in a particular state (of course, I'm skipping a lot of details here). This definition corresponds to the second part of your definition.
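
In code, such a policy is just a conditional distribution `π(a|s)` that the agent can sample from; the sketch below assumes the two-state MDP above and uses made-up probabilities:

```python
import random

# pi[s] -> {action: probability}; this is pi(a|s) = P[A_t = a | S_t = s].
pi = {
    "s0": {"stay": 0.1, "go": 0.9},
    "s1": {"stay": 0.8, "go": 0.2},
}

def sample_action(pi, s):
    """Draw an action for state s according to the policy pi."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s0"))  # "go" about 90% of the time
```

A deterministic policy is just the special case where one action per state has probability 1.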

I highly recommend David Silver’s RL course available on YouTube. The first two lectures focus particularly on MDPs and policies.
