
I am pretty new to RL. Could anyone suggest results/papers on whether policy gradient (or RL algorithms more generally) can be applied to problems where the action does not influence the next state, i.e. where the next state is independent of the action: P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t)?

I think it is doable, since this independence assumption does not change the derivation of the policy gradient: the transition probabilities do not depend on the policy parameters, so they drop out of the gradient either way.
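To make the claim concrete, here is a sketch of the standard score-function (REINFORCE) derivation, showing where the transition term drops out:

```latex
% Trajectory likelihood under policy \pi_\theta:
p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

% Taking the log-gradient, all \theta-free factors vanish:
\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)

% Hence the policy gradient
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```

Replacing P(s_{t+1} | s_t, a_t) with P(s_{t+1} | s_t) only changes factors that carry no θ-dependence, so the gradient expression is identical in form; only the distribution over trajectories changes.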

Also, I am curious about the difference between this RL setting, where the next state is independent of the action, P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t), and the multi-armed bandit setting. If the next state is independent of the actions, what would be the correct framework to start with?
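For what it's worth, the setting you describe looks like a contextual bandit (states arrive regardless of actions, and the action only affects the immediate reward), and plain REINFORCE still works there. A minimal sketch, where the environment, the reward function, and the hyperparameters are all made-up toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: 2 states drawn i.i.d. each step (the transition ignores the
# action entirely), 2 actions; reward is 1 when the action matches the state.
N_STATES, N_ACTIONS = 2, 2

def reward(s, a):
    return 1.0 if s == a else 0.0

# Tabular softmax policy: one logit per (state, action) pair.
theta = np.zeros((N_STATES, N_ACTIONS))

def policy(s):
    z = np.exp(theta[s] - theta[s].max())  # stable softmax
    return z / z.sum()

alpha = 0.5  # learning rate (arbitrary)
for episode in range(2000):
    s = rng.integers(N_STATES)        # next state independent of the action
    p = policy(s)
    a = rng.choice(N_ACTIONS, p=p)
    r = reward(s, a)
    # REINFORCE / score-function update: grad of log pi is one_hot(a) - p
    grad_logp = -p
    grad_logp[a] += 1.0
    theta[s] += alpha * r * grad_logp

# After training, the policy should pick a == s with high probability.
for s in range(N_STATES):
    print(f"state {s}: P(a=s) = {policy(s)[s]:.2f}")
```

Since the transitions never depend on the action, each step is effectively an independent one-step decision given the observed state, which is exactly the contextual bandit formulation; the policy gradient update above is unchanged from the MDP case, it just has no long-horizon credit assignment to do.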
