
reinforce_agent

RL4CRN.agents.reinforce_agent

REINFORCE-style policy gradient agent.

This module provides REINFORCEAgent, a policy-gradient agent that collects per-step log-probabilities and entropies during rollout, and performs an update using a risk-seeking REINFORCE objective with an entropy bonus. Optionally, it can add a self-imitation learning (SIL) loss computed from a hall-of-fame buffer.

Terminology

This code treats the optimization target as a loss to be minimized (smaller is better). The variable name rewards in update actually represents per-sample final losses.

Risk-seeking objective

Let \(\ell_i\) be the final loss for sample \(i\) in a batch of size \(N\). Let \(\pi_\theta\) be the policy and let \(\log \pi_\theta(a_{i,t} \mid s_{i,t})\) be the log-probability of the action chosen at step \(t\).

A risk parameter \(r \in [0, 1]\) defines a top-k subset \(\mathcal{K}\) of the best samples (lowest losses), where:

\[k = \left\lfloor N (1 - r) \right\rfloor\]

The code computes a baseline \(b\) as the worst (largest) loss among these top-k samples (or \(\max_i \ell_i\) if \(k = 0\)). With this choice the advantage \(\ell_i - b\) is non-positive for every selected sample, so minimizing the surrogate increases the log-probability of the best-performing samples.

The (un-normalized) policy-gradient loss term used in the implementation is:

\[\mathcal{L}_{\text{PG}} = \frac{1}{k} \sum_{i \in \mathcal{K}} (\ell_i - b) \sum_t \log \pi_\theta(a_{i,t}\mid s_{i,t})\]
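The top-k selection, baseline, and policy-gradient term above can be sketched as follows (a minimal illustration with hypothetical names; the actual implementation may differ in detail):

```python
import torch

def risk_seeking_pg_loss(losses, summed_logps, risk):
    """Sketch of the risk-seeking surrogate above (names are illustrative).

    losses:       (N,) final loss per sample (smaller is better)
    summed_logps: (N,) per-sample sum of log-probabilities over steps
    risk:         r in [0, 1]; higher keeps fewer samples
    """
    N = losses.shape[0]
    k = int(N * (1.0 - risk))                  # k = floor(N * (1 - r))
    if k == 0:
        return losses.new_zeros(()), losses.max()
    topk_losses, topk_idx = torch.topk(losses, k, largest=False)  # lowest = best
    baseline = topk_losses.max()               # worst loss within the top-k
    advantages = topk_losses - baseline        # non-positive on the subset
    pg_loss = (advantages * summed_logps[topk_idx]).sum() / k
    return pg_loss, baseline
```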
Entropy regularization

An entropy term is subtracted from the objective to encourage exploration. The implementation tracks entropies per step and forms a batch-level entropy statistic with separate weights for the top-k subset and the remainder.
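The two-weight entropy statistic can be sketched like this (an assumed shape for the computation; the library's exact weighting may differ):

```python
import torch

def weighted_entropy(entropies, topk_idx, topk_weight, remainder_weight):
    """Batch entropy statistic with separate top-k / remainder weights (sketch)."""
    mask = torch.zeros(entropies.shape[0], dtype=torch.bool)
    mask[topk_idx] = True
    topk_term = entropies[mask].mean() if mask.any() else entropies.new_zeros(())
    rem_term = entropies[~mask].mean() if (~mask).any() else entropies.new_zeros(())
    return topk_weight * topk_term + remainder_weight * rem_term
```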

Self-imitation learning (optional): If enabled, an additional term \(\mathcal{L}_{\text{SIL}}\) is added to the total loss. It replays trajectories from a hall-of-fame buffer and reinforces actions whose final loss improves upon the current batch best.

Notes
  • act stores tensors needed for the later update in internal lists. Call update once per rollout batch to clear this state.
  • This agent assumes the policy supports a forward signature compatible with this code (see act for details).

REINFORCEAgent

Bases: AbstractAgent

Risk-seeking REINFORCE agent with entropy regularization and optional SIL.

The agent performs batched rollouts, storing per-step log-probabilities and entropies. During update, it selects the best samples according to a risk parameter, forms a baseline from that subset, and applies a REINFORCE-style policy gradient update with an entropy bonus.

PARAMETER DESCRIPTION
policy

Policy network used to sample actions and return log-probabilities and entropies. It must support being called as policy(states, mode=...) and return (raw_actions, logPs, entropies) for action sampling. For SIL replay, it must also support being called as policy(observations, mode='full', action=raw_actions) and return per-sample log-probabilities.
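A minimal policy satisfying both call signatures might look like the following (illustrative only; `TinyPolicy` is not part of the library):

```python
import torch
from torch import nn
from torch.distributions import Categorical

class TinyPolicy(nn.Module):
    """Toy policy matching the two call signatures described above."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Linear(state_dim, n_actions)

    def forward(self, states, mode="full", action=None):
        # `mode` is accepted for interface compatibility only.
        dist = Categorical(logits=self.net(states))
        if action is not None:            # SIL replay: score given raw actions
            return dist.log_prob(action)
        raw_actions = dist.sample()       # rollout: sample and report stats
        return raw_actions, dist.log_prob(raw_actions), dist.entropy()
```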

allow_input_influence

Whether actions may include input influence. The 'parameters' mode with input influence is not implemented.

DEFAULT: False

logger

Optional logger providing log_metric(name, value, step=...).

DEFAULT: None

learning_rate

Learning rate for the Adam optimizer.

DEFAULT: 0.001

entropy_scheduler

Dictionary controlling entropy regularization. If empty, defaults are used. Supported keys:

  • entropy_weight: global multiplier for entropy regularization
  • topk_entropy_weight: weight for entropy of top-k subset
  • remainder_entropy_weight: weight for entropy of the remainder subset
  • entropy_update_coefficient: multiplicative update factor
  • entropy_schedule: update period (iterations)
  • minimum_entropy_weight: lower bound for entropy_weight

DEFAULT: {}
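Under the key names above, the periodic multiplicative update might be implemented along these lines (a sketch; the real schedule logic may differ):

```python
def step_entropy_weight(scheduler, iteration):
    """Decay entropy_weight every entropy_schedule iterations, with a floor."""
    if iteration > 0 and iteration % scheduler["entropy_schedule"] == 0:
        decayed = scheduler["entropy_weight"] * scheduler["entropy_update_coefficient"]
        scheduler["entropy_weight"] = max(decayed, scheduler["minimum_entropy_weight"])
    return scheduler["entropy_weight"]
```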

risk_scheduler

Dictionary controlling the risk parameter. If empty, defaults are used. Supported keys:

  • risk: initial risk value \(r\) (higher means fewer samples used)
  • risk_update: additive increment for risk
  • max_risk: upper bound for risk
  • risk_schedule: update period (iterations)

DEFAULT: {}
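Analogously, the additive risk update could be sketched as (assumed key names; the real schedule logic may differ):

```python
def step_risk(scheduler, iteration):
    """Increase risk every risk_schedule iterations, capped at max_risk."""
    if iteration > 0 and iteration % scheduler["risk_schedule"] == 0:
        scheduler["risk"] = min(scheduler["risk"] + scheduler["risk_update"],
                                scheduler["max_risk"])
    return scheduler["risk"]
```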

sil_settings

Dictionary controlling self-imitation learning. If empty, defaults are used. Supported keys:

  • sil_loss_weight: multiplier for the SIL term
  • use_adaptive_baseline: if True, uses an exponential moving baseline
  • baseline_annealing_rate: EMA coefficient for adaptive baseline

DEFAULT: {}

device

Torch device. If None, defaults to CPU.

DEFAULT: None

ATTRIBUTE DESCRIPTION
logPs_sequence

List of tensors containing per-step log-probabilities.

entropies_sequence

List of tensors containing per-step entropies.

entropy_scheduler

Dictionary of entropy scheduling parameters.

risk_scheduler

Dictionary of risk scheduling parameters.

act(states, actuator, mode='full')

Sample actions from the policy for a batch of states.

This method performs the forward pass through the policy, stores the resulting log-probabilities and entropies for the later update, and converts raw policy outputs into environment actions via the provided actuator.

PARAMETER DESCRIPTION
states

Batch of observed states (typically a tensor of shape (N, state_dim)).

actuator

Actuator that converts raw policy actions into environment actions (must provide actuate(policy_action)).

mode

Policy mode. Expected values include 'full', 'partial', and 'parameters' (depending on the policy implementation).

DEFAULT: 'full'

RETURNS DESCRIPTION

A tuple (actions, raw_actions):

  • actions: list of environment actions produced by the actuator.
  • raw_actions: list of raw policy actions prior to actuation (used by self-imitation learning).
RAISES DESCRIPTION
NotImplementedError

If mode == 'parameters' and allow_input_influence is True.

self_imitation_learingin_loss(hof, final_loss_for_each_sample, top_k_indices, weighting_scheme='uniform', observer=None, tensorizer=None, stepper=None, sil_batch_size=None)

Compute self-imitation learning (SIL) loss using hall-of-fame samples.

The SIL term replays trajectories from the hall-of-fame (HoF) buffer and reinforces actions that yield a loss better than the current batch best.

For each HoF sample \(j\), the advantage used in the implementation is:

\[ A_j = \max(0, \ell_{\text{best}} - \ell^{\text{HoF}}_j)\]

where \(\ell_{\text{best}}\) is the best (lowest) loss in the current batch among the selected top-k samples. Because advantages are clipped at zero, the agent imitates only those HoF samples that improve upon its current best performance.

The SIL objective is:

\[\mathcal{L}_{\text{SIL}} = -\frac{1}{M} \sum_{j=1}^M w_j A_j \log \pi_\theta(\tau_j)\]

where \(\log \pi_\theta(\tau_j)\) is the sum of log-probabilities assigned by the current policy to the replayed trajectory \(\tau_j\) and \(w_j\) are optional sample weights.
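Given per-trajectory log-probabilities and losses, the SIL objective above reduces to a few lines (a sketch with illustrative names, not the library's implementation):

```python
import torch

def sil_loss(hof_logps, hof_losses, best_batch_loss, weights=None):
    """Sketch of the SIL objective above.

    hof_logps:  (M,) sum of current-policy log-probs per replayed trajectory
    hof_losses: (M,) final loss of each hall-of-fame sample
    """
    advantages = torch.clamp(best_batch_loss - hof_losses, min=0.0)  # A_j
    if weights is None:
        weights = torch.ones_like(advantages)                        # 'uniform'
    return -(weights * advantages * hof_logps).mean()
```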

PARAMETER DESCRIPTION
hof

Hall-of-fame buffer providing __len__ and sample(batch_size). Each sample must be cloneable (used to create replay environments) and must provide get_raw_action(j) and get_action(j) for each replay step.

final_loss_for_each_sample

Tensor-like object containing the final per-sample loss for the current batch.

top_k_indices

Tensor of indices of the selected top-k samples in the current batch.

weighting_scheme

Currently only 'uniform' is supported.

DEFAULT: 'uniform'

observer

Observer used to produce observations from environments.

DEFAULT: None

tensorizer

Tensorizer used to convert observations to tensors.

DEFAULT: None

stepper

Stepper used to apply actions in the environment.

DEFAULT: None

sil_batch_size

Number of HoF samples to replay. Defaults to len(hof).

DEFAULT: None

RETURNS DESCRIPTION

Scalar SIL loss (float or tensor). Returns 0.0 if HoF is empty.

RAISES DESCRIPTION
ValueError

If observer, tensorizer, or stepper is not provided.

NotImplementedError

If weighting_scheme is not implemented.

update(rewards, step_iteration=None, hof=None, use_sil=False, sil_weighting_scheme='uniform', observer=None, tensorizer=None, stepper=None, sil_batch_size=None)

Update the policy using stored rollout statistics and final losses.

This method consumes the sequences collected by act and performs a single optimization step.

Important

The argument rewards is treated as final losses to minimize. Smaller values are better.

Overview of computations
  • Stack and sum log-probabilities and entropies across steps.
  • Select top-k samples according to the risk parameter.
  • Compute a baseline (worst loss in top-k, or batch max if k=0).
  • Form the policy gradient loss on the selected subset.
  • Subtract an entropy regularization term.
  • Optionally add a self-imitation learning (SIL) term.
  • Backpropagate, clip gradients, and update policy parameters.
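The final step of the list above could be sketched as a hypothetical helper (the agent performs these operations internally; names and the clipping norm are assumptions):

```python
import torch
from torch import nn

def apply_update(policy, optimizer, total_loss, max_grad_norm=1.0):
    """Backpropagate, clip gradients, and step the optimizer (sketch)."""
    optimizer.zero_grad()
    total_loss.backward()
    nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
```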
PARAMETER DESCRIPTION
rewards

List/array/tensor of length N containing final per-sample losses for the batch.

step_iteration

Optional integer step for logging.

DEFAULT: None

hof

Optional hall-of-fame buffer used for SIL.

DEFAULT: None

use_sil

If True, adds the SIL loss term.

DEFAULT: False

sil_weighting_scheme

Weighting scheme for SIL samples. Currently only 'uniform' is supported.

DEFAULT: 'uniform'

observer

Observer used for SIL replay.

DEFAULT: None

tensorizer

Tensorizer used for SIL replay.

DEFAULT: None

stepper

Stepper used for SIL replay.

DEFAULT: None

sil_batch_size

Number of HoF samples to replay.

DEFAULT: None

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
RuntimeError

If called before any act call (no stored logPs).

translate_state(state)

Translate an environment state into an agent-specific representation.

Concrete agents should override this method to implement state encoding / feature extraction suitable for their policy and value function(s).

PARAMETER DESCRIPTION
state

Environment state object.

RETURNS DESCRIPTION

An agent-specific representation of state (e.g., tensors, feature vectors, graphs). The exact type depends on the implementation.