Closing the Gap
tl;dr:
Reward shaping holds in entropy-regularized (a.k.a. MaxEnt) RL. Any* two RL tasks can be related by an auxiliary “corrective” task. Solving a composite task (one whose reward is a function \(f\) of previous tasks’ rewards) can be done by applying \(f\) to the subtasks’ optimal value functions (\(f(Q_1, Q_2, \dots)\)) and then “closing the gap” by learning the corrective task’s value function.
*The two tasks must share the same state and action spaces and the same discount factor.
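In symbols (with \(K\) denoting the corrective task’s optimal soft value function), the claim is that the composite task’s optimal soft value function can be written as

\[
\widetilde{Q}(s,a) \;=\; f\big(Q_1(s,a),\, Q_2(s,a),\, \dots\big) \;+\; K(s,a).
\]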
The Story
Classical “reward shaping” is a technique for modifying an RL agent’s reward function to speed up training while leaving the optimal policy unchanged. We’ve found that the standard (potential-based) formulation of reward shaping carries over to the entropy-regularized (a.k.a. MaxEnt) RL setting.
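As a quick sketch of the potential-based form (my paraphrase, not a quote from the paper): shaping the reward with a state-dependent potential \(\phi\),

\[
\widetilde{r}(s,a) \;=\; r(s,a) \;+\; \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\!\left[\phi(s')\right] \;-\; \phi(s),
\]

shifts the soft-optimal value function to \(\widetilde{Q}(s,a) = Q(s,a) - \phi(s)\). Since the shift does not depend on the action, the MaxEnt-optimal (softmax) policy is unchanged, mirroring the classical result.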
After applying a composition function \(f\) to a set of subtasks’ rewards (\(\widetilde{r}(s,a)=f(r_1(s,a), r_2(s,a), \dots)\)), one may wonder whether the corresponding optimal value function \(\widetilde{Q}(s,a)\) can be found by applying \(f\) to the subtasks’ optimal value functions (\(\widetilde{Q}(s,a)=f(Q_1(s,a), Q_2(s,a), \dots)\)). This is not generally the case (but see this, this, and this for examples where it does hold, under suitable assumptions). Nevertheless, we can still find the optimal value function of the composite task by learning the optimal value function of a related “corrective task” (whose optimal value function we denote \(K\)). This corrective value function generalizes the \(C^\infty\) term introduced in this paper to arbitrary transformations of the rewards (rather than just convex combinations).
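Below is a minimal tabular sketch of this idea. It is my own reconstruction, not code from the paper: the corrective reward is taken to be the soft Bellman residual of the zero-shot estimate \(f(Q_1, Q_2)\), and the corrective task is solved under a prior policy reweighted by \(\exp(\beta f(Q_1, Q_2))\); the random MDP, the values of \(\beta\) and \(\gamma\), and the choice \(f = \max\) are all illustrative assumptions.

```python
# A tabular numerical sketch of the "corrective task" idea, under my own
# assumptions (this is NOT the paper's code): two subtasks are solved with
# soft value iteration, the composite reward is f(r1, r2), and the composite
# optimal value function is recovered as f(Q1, Q2) plus the value function K
# of a corrective task. The corrective reward below is the soft Bellman
# residual of f(Q1, Q2), solved under a reweighted prior policy; the MDP,
# beta, gamma, and f = max are all illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, beta = 6, 3, 0.9, 2.0   # beta = inverse temperature

# Random dynamics and subtask rewards.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
r1 = rng.random((n_states, n_actions))
r2 = rng.random((n_states, n_actions))
pi0 = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform prior policy

def soft_values(r, prior, iters=2000):
    """Solve Q = r + gamma * E_s'[ (1/beta) * log E_{a'~prior} exp(beta * Q) ]."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        V = np.log(np.sum(prior * np.exp(beta * Q), axis=1)) / beta
        Q = r + gamma * (P @ V)
    return Q

f = np.maximum                      # composition function (illustrative choice)
Q1, Q2 = soft_values(r1, pi0), soft_values(r2, pi0)
r_comp = f(r1, r2)                  # composite task reward
psi = f(Q1, Q2)                     # zero-shot estimate f(Q1, Q2)

# Corrective task: reward = soft Bellman residual of psi on the composite task,
# solved under the prior policy reweighted by exp(beta * psi)  (assumed form).
V_psi = np.log(np.sum(pi0 * np.exp(beta * psi), axis=1)) / beta
kappa = r_comp - psi + gamma * (P @ V_psi)
pi_psi = pi0 * np.exp(beta * (psi - V_psi[:, None]))   # normalized by construction
K = soft_values(kappa, pi_psi)

# "Close the gap": psi + K should match the composite task solved from scratch.
Q_comp = soft_values(r_comp, pi0)
print(np.abs(psi + K - Q_comp).max())   # ~1e-12 if the construction above is right
```

The final line checks that \(f(Q_1, Q_2) + K\) matches the composite value function solved from scratch; in function-approximation settings, the same structure would instead be used to train \(K\) with a soft Q-learning method.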
The Results
- Paper accepted in the Technical Track of AAAI 2023.
- Attended AAAI 2023 with a student scholarship to present the paper.
- See a brief presentation on YouTube.
The Details
🏗️ (Under construction)
Future Work
Some generalizations I am interested in exploring:
- Extend the results to the case where the corrective task does not share the composite task’s state and action spaces.
- Extend the results to the case where tasks have different discount factors (since \(\gamma\) is typically treated as a hyperparameter during training).