reinforce_agent
RL4CRN.agents.reinforce_agent
REINFORCE-style policy gradient agent.
This module provides REINFORCEAgent, a policy-gradient agent that
collects per-step log-probabilities and entropies during rollout, and performs
an update using a risk-seeking REINFORCE objective with an entropy bonus.
Optionally, it can add a self-imitation learning (SIL) loss computed from a
hall-of-fame buffer.
Terminology
This code treats the optimization target as a loss to be minimized (smaller is better). The variable name `rewards` in `update` actually represents per-sample final losses.
Risk-seeking objective
Let \(\ell_i\) be the final loss for sample \(i\) in a batch of size \(N\). Let \(\pi_\theta\) be the policy and let \(\log \pi_\theta(a_{i,t} \mid s_{i,t})\) be the log-probability of the action chosen at step \(t\).
A risk parameter \(r \in [0, 1]\) defines a top-k subset \(\mathcal{K}\) of the best samples (lowest losses), where (the exact rounding of \(k\) is an implementation detail):

\[
k = \lceil r N \rceil, \qquad \mathcal{K} = \{\, i : \ell_i \text{ is among the } k \text{ smallest losses} \,\}.
\]
The code computes a baseline \(b\) as the worst (largest) loss among these top-k samples (or \(\max_i \ell_i\) if \(k = 0\)), ensuring non-negative weights within the selected subset.
The (un-normalized) policy-gradient loss term used in the implementation is:

\[
\mathcal{L}_{\text{PG}} = \sum_{i \in \mathcal{K}} (\ell_i - b) \sum_t \log \pi_\theta(a_{i,t} \mid s_{i,t}),
\]

so that minimizing \(\mathcal{L}_{\text{PG}}\) increases the log-probability of samples whose loss beats the baseline.
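As a minimal, dependency-free sketch (the function name and the \(k = \lceil rN \rceil\) rounding are illustrative assumptions, not the RL4CRN API), the top-k selection, baseline, and per-sample weights might look like:

```python
import math

def risk_seeking_baseline(losses, risk):
    # Hypothetical sketch: pick the k lowest losses (k = ceil(risk * N))
    # and use the worst loss within that subset as the baseline b.
    n = len(losses)
    k = math.ceil(risk * n)
    order = sorted(range(n), key=lambda i: losses[i])  # ascending: best first
    top_k = order[:k]
    # Baseline: worst loss in the top-k, or the batch max if k == 0.
    b = max(losses[i] for i in top_k) if k > 0 else max(losses)
    # Weights b - l_i are non-negative within the selected subset.
    weights = {i: b - losses[i] for i in top_k}
    return top_k, b, weights

top_k, b, w = risk_seeking_baseline([3.0, 1.0, 2.0, 5.0], risk=0.5)
```

With `risk=0.5` and four samples, the two lowest losses (indices 1 and 2) are kept, the baseline is the worse of the two, and the weights are non-negative.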
Entropy regularization
An entropy term is subtracted from the objective to encourage exploration. The implementation tracks entropies per step and forms a batch-level entropy statistic with separate weights for the top-k subset and the remainder.
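The exact weighting is configuration-dependent; a toy sketch of a split entropy statistic (the function name and the per-group averaging are assumptions for illustration) could be:

```python
def split_entropy_bonus(entropies, top_k, w_top, w_rest):
    # Hypothetical: average per-sample entropies separately over the
    # top-k subset and the remainder, with distinct weights.
    in_top = set(top_k)
    top = [e for i, e in enumerate(entropies) if i in in_top]
    rest = [e for i, e in enumerate(entropies) if i not in in_top]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return w_top * mean(top) + w_rest * mean(rest)

bonus = split_entropy_bonus([1.0, 2.0, 3.0, 4.0], top_k=[0, 1],
                            w_top=1.0, w_rest=0.5)
```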
Self-imitation learning (optional): If enabled, an additional term \(\mathcal{L}_{\text{SIL}}\) is added to the total loss. It replays trajectories from a hall-of-fame buffer and reinforces actions whose final loss improves upon the current batch best.
Notes
- `act` stores tensors needed for the later update in internal lists. Call `update` once per rollout batch to clear this state.
- This agent assumes the policy supports a forward signature compatible with this code (see `act` for details).
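The notes above can be illustrated with a toy stand-in (not the real REINFORCEAgent) that mirrors the act/update bookkeeping:

```python
class ToyAgent:
    """Stand-in illustrating act/update state handling; not RL4CRN code."""
    def __init__(self):
        self.logPs_sequence = []          # filled by act(), cleared by update()

    def act(self, log_probs):
        self.logPs_sequence.append(log_probs)

    def update(self, losses):
        if not self.logPs_sequence:
            raise RuntimeError("update() called before any act()")
        # ... a policy-gradient step would happen here ...
        self.logPs_sequence.clear()       # one update per rollout batch

agent = ToyAgent()
for step_logps in ([-0.1, -0.2], [-0.3, -0.4]):  # two rollout steps
    agent.act(step_logps)
agent.update(losses=[1.5, 0.7])
```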
REINFORCEAgent
Bases: AbstractAgent
Risk-seeking REINFORCE agent with entropy regularization and optional SIL.
The agent performs batched rollouts, storing per-step log-probabilities and
entropies. During update, it selects the best samples according to a
risk parameter, forms a baseline from that subset, and applies a REINFORCE-
style policy gradient update with an entropy bonus.
| PARAMETER | DESCRIPTION |
|---|---|
| `policy` | Policy network used to sample actions and return log-probabilities and entropies. It must support being called as described in `act`. |
| `allow_input_influence` | Whether actions may include input influence. |
| `logger` | Optional logger. |
| `learning_rate` | Learning rate for the Adam optimizer. |
| `entropy_scheduler` | Dictionary controlling entropy regularization. If empty, defaults are used. |
| `risk_scheduler` | Dictionary controlling the risk parameter. If empty, defaults are used. |
| `sil_settings` | Dictionary controlling self-imitation learning. If empty, defaults are used. |
| `device` | Torch device. If None, defaults to CPU. |
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `logPs_sequence` | List of tensors containing per-step log-probabilities. |
| `entropies_sequence` | List of tensors containing per-step entropies. |
| `entropy_scheduler` | Dictionary of entropy scheduling parameters. |
| `risk_scheduler` | Dictionary of risk scheduling parameters. |
act(states, actuator, mode='full')
Sample actions from the policy for a batch of states.
This method performs the forward pass through the policy, stores the resulting log-probabilities and entropies for the later update, and converts raw policy outputs into environment actions via the provided actuator.
| PARAMETER | DESCRIPTION |
|---|---|
| `states` | Batch of observed states (typically a tensor). |
| `actuator` | Actuator that converts raw policy actions into environment actions. |
| `mode` | Policy mode. DEFAULT: `'full'` |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple` | A tuple. |

| RAISES | DESCRIPTION |
|---|---|
| `NotImplementedError` | If an unsupported `mode` is requested. |
self_imitation_learingin_loss(hof, final_loss_for_each_sample, top_k_indices, weighting_scheme='uniform', observer=None, tensorizer=None, stepper=None, sil_batch_size=None)
Compute self-imitation learning (SIL) loss using hall-of-fame samples.
The SIL term replays trajectories from the hall-of-fame (HoF) buffer and reinforces actions that yield a loss better than the current batch best.
For each HoF sample \(j\), the advantage used in the implementation is:

\[
A_j = \max\left(0,\; \ell_{\text{best}} - \ell_j\right),
\]

where \(\ell_{\text{best}}\) is the best (lowest) loss in the current batch among the selected top-k samples. Constrained to positive advantages, this encourages the agent to imitate only those HoF samples that improve upon its current best performance.
The SIL objective is:

\[
\mathcal{L}_{\text{SIL}} = -\sum_j w_j \, A_j \, \log \pi_\theta(\tau_j),
\]

where \(\log \pi_\theta(\tau_j)\) is the sum of log-probabilities assigned by the current policy to the replayed trajectory \(\tau_j\) and \(w_j\) are optional sample weights.
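A minimal sketch of this computation (the function name and argument layout are assumptions; the real method replays trajectories through observer/tensorizer/stepper components) might look like:

```python
def sil_loss(hof_losses, hof_logps, batch_best, weights=None):
    # Hypothetical sketch: positive-part advantages against the batch best,
    # weighting the trajectory log-probabilities of replayed HoF samples.
    if not hof_losses:                     # empty HoF -> no SIL term
        return 0.0
    if weights is None:
        weights = [1.0] * len(hof_losses)  # 'uniform' weighting
    total = 0.0
    for l_j, logp_j, w_j in zip(hof_losses, hof_logps, weights):
        adv = max(0.0, batch_best - l_j)   # imitate only improving samples
        total -= w_j * adv * logp_j
    return total

loss = sil_loss(hof_losses=[1.0, 3.0], hof_logps=[-0.5, -0.2], batch_best=2.0)
```

Only the first HoF sample (loss 1.0 < batch best 2.0) contributes; the second is masked out by the positive-part advantage.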
| PARAMETER | DESCRIPTION |
|---|---|
| `hof` | Hall-of-fame buffer. |
| `final_loss_for_each_sample` | Tensor-like object containing the final per-sample loss for the current batch. |
| `top_k_indices` | Tensor of indices of the selected top-k samples in the current batch. |
| `weighting_scheme` | Currently only `'uniform'` is supported. DEFAULT: `'uniform'` |
| `observer` | Observer used to produce observations from environments. DEFAULT: `None` |
| `tensorizer` | Tensorizer used to convert observations to tensors. DEFAULT: `None` |
| `stepper` | Stepper used to apply actions in the environment. DEFAULT: `None` |
| `sil_batch_size` | Number of HoF samples to replay. DEFAULT: `None` |
| RETURNS | DESCRIPTION |
|---|---|
| float or tensor | Scalar SIL loss. Returns 0.0 if the HoF is empty. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If `weighting_scheme` is not `'uniform'`. |
| `NotImplementedError` | If |
update(rewards, step_iteration=None, hof=None, use_sil=False, sil_weighting_scheme='uniform', observer=None, tensorizer=None, stepper=None, sil_batch_size=None)
Update the policy using stored rollout statistics and final losses.
This method consumes the sequences collected by `act` and performs a single optimization step.
Important
The argument `rewards` is treated as final losses to minimize; smaller values are better.
Overview of computations
- Stack and sum log-probabilities and entropies across steps.
- Select top-k samples according to the risk parameter.
- Compute a baseline (worst loss in top-k, or batch max if k=0).
- Form the policy gradient loss on the selected subset.
- Subtract an entropy regularization term.
- Optionally add a self-imitation learning (SIL) term.
- Backpropagate, clip gradients, and update policy parameters.
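The computations above can be sketched end to end in plain Python (the function name, the \(k = \lceil rN \rceil\) rounding, and the mean-entropy averaging are assumptions; the real implementation operates on torch tensors and then backpropagates):

```python
import math

def reinforce_update_loss(logp_steps, ent_steps, losses, risk, ent_coef):
    # Hypothetical sketch of the scalar loss assembled during update().
    n = len(losses)
    # 1. Sum log-probabilities and entropies across steps (per sample).
    logp = [sum(step[i] for step in logp_steps) for i in range(n)]
    ent = [sum(step[i] for step in ent_steps) for i in range(n)]
    # 2. Select top-k samples by lowest loss.
    k = math.ceil(risk * n)
    top_k = sorted(range(n), key=lambda i: losses[i])[:k]
    # 3. Baseline: worst loss in the top-k (batch max if k == 0).
    b = max(losses[i] for i in top_k) if k > 0 else max(losses)
    # 4. Policy-gradient term on the selected subset.
    pg = sum((losses[i] - b) * logp[i] for i in top_k)
    # 5. Subtract the entropy regularization term.
    return pg - ent_coef * sum(ent) / n

loss = reinforce_update_loss(
    logp_steps=[[-0.5, -1.0], [-0.5, -1.0]],  # T=2 steps, N=2 samples
    ent_steps=[[0.1, 0.2], [0.1, 0.2]],
    losses=[1.0, 3.0], risk=0.5, ent_coef=0.01,
)
```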
| PARAMETER | DESCRIPTION |
|---|---|
| `rewards` | List/array/tensor of length `N` holding the final per-sample losses (smaller is better). |
| `step_iteration` | Optional integer step for logging. DEFAULT: `None` |
| `hof` | Optional hall-of-fame buffer used for SIL. DEFAULT: `None` |
| `use_sil` | If True, adds the SIL loss term. DEFAULT: `False` |
| `sil_weighting_scheme` | Weighting scheme for SIL samples. Currently only `'uniform'` is supported. DEFAULT: `'uniform'` |
| `observer` | Observer used for SIL replay. DEFAULT: `None` |
| `tensorizer` | Tensorizer used for SIL replay. DEFAULT: `None` |
| `stepper` | Stepper used for SIL replay. DEFAULT: `None` |
| `sil_batch_size` | Number of HoF samples to replay. DEFAULT: `None` |
| RETURNS | DESCRIPTION |
|---|---|
|  | None. |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If called before any call to `act`. |
translate_state(state)
Translate an environment state into an agent-specific representation.
Concrete agents should override this method to implement state encoding / feature extraction suitable for their policy and value function(s).
| PARAMETER | DESCRIPTION |
|---|---|
| `state` | Environment state object. |

| RETURNS | DESCRIPTION |
|---|---|
|  | An agent-specific representation of `state`. |
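A minimal override sketch (the subclass name and the dict-shaped state are assumptions for illustration, not RL4CRN types):

```python
class FeatureAgent:
    """Hypothetical subclass sketch showing a translate_state override."""

    def translate_state(self, state):
        # Flatten an environment state (here assumed to be a dict of
        # scalars) into a fixed-order feature vector for the policy.
        return [float(state[key]) for key in sorted(state)]

features = FeatureAgent().translate_state({"temp": 3, "conc": 1})
```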