add_reaction_by_ordered_index

`RL4CRN.policies.add_reaction_by_ordered_index`

`AddReactionByOrderedIndex`

Bases: AddReactionByIndex

Extension of AddReactionByIndex that enforces an ordered reaction-selection scheme.

The base policy samples a reaction index from the library (excluding already-present reactions) and then samples its parameters. This subclass adds two extra structural constraints:

1) Template-aware ordering. At the first call in an episode/batch, the current IOCRN reaction multi-hot vector is snapshotted as a template (template_mask). Only reactions added after this snapshot are considered "added by the agent". Ordering constraints apply only to these added reactions, so template reactions do not affect the allowed index range.

2) Sequentiality constraint. Once the agent has added at least one reaction, each subsequent reaction must have an index strictly greater than the maximum index among the agent-added reactions so far: r_next > max(added_indices). This is enforced with either:

- a soft penalty (finite constraint_strength), or
- a hard mask (constraint_strength = inf), making violations impossible.
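
The sequentiality rule can be illustrated with a minimal PyTorch sketch. The function name `apply_ordering_constraint` and its arguments are hypothetical and chosen for illustration; this is not the library's implementation, only one way the soft and hard variants described above could be written.

```python
import math
import torch

def apply_ordering_constraint(logits, added_mask, constraint_strength=math.inf):
    """Penalize or mask every index <= the largest agent-added index.

    logits:     (N, M) structure logits
    added_mask: (N, M) bool, True where the agent has already added a reaction
    """
    n, m = logits.shape
    indices = torch.arange(m, dtype=logits.dtype)            # [0, 1, ..., M-1]
    # Largest added index per row; -1 when the agent has added nothing yet.
    filler = torch.full((n, m), -1.0, dtype=logits.dtype)
    max_added = torch.where(added_mask, indices.expand(n, m), filler)
    max_added = max_added.max(dim=1, keepdim=True).values     # (N, 1)
    out_of_order = indices.unsqueeze(0) <= max_added          # (N, M) bool
    if math.isinf(constraint_strength):
        # Hard mask: out-of-order choices get zero probability.
        return logits.masked_fill(out_of_order, -math.inf)
    # Soft penalty: subtract a finite amount from out-of-order logits.
    return logits - constraint_strength * out_of_order.to(logits.dtype)

logits = torch.zeros(2, 5)
added = torch.tensor([[False, False, True, False, False],    # agent added index 2
                      [False, False, False, False, False]])  # nothing added yet
hard = apply_ordering_constraint(logits, added)
soft = apply_ordering_constraint(logits, added, constraint_strength=3.0)
```

In the first row only indices 3 and 4 remain unpenalized; the second row is untouched because no reaction has been added yet.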

Additionally, an optional combinatorial bias term can be added to the structure logits to shape the policy toward a uniform distribution over unordered sets of a target size (rather than uniform over ordered action sequences).
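
To see why a completion-count bias yields a uniform distribution over sets, the following plain-Python sketch may help. It assumes no template reactions, a flat (uninformative) policy, and that the weight of picking index i is C(M - 1 - i, slots left after this pick); that weight is an interpretation of the bias described above, not code taken from the library, and `set_probability` is a hypothetical helper.

```python
from itertools import combinations
from math import comb, isclose

def set_probability(subset, M):
    """Probability of building `subset` in increasing index order when each
    step picks index i with weight C(M - 1 - i, slots left after this pick)."""
    K = len(subset)
    prob, lowest_allowed = 1.0, 0
    for step, chosen in enumerate(subset):
        remaining_after = K - step - 1
        # Number of ways to fill the remaining slots with indices above i.
        weights = {i: comb(M - 1 - i, remaining_after)
                   for i in range(lowest_allowed, M)}
        prob *= weights[chosen] / sum(weights.values())
        lowest_allowed = chosen + 1      # ordering: next index must be larger
    return prob

M, K = 6, 3
probs = [set_probability(s, M) for s in combinations(range(M), K)]
```

Every one of the C(6, 3) = 20 subsets receives the same probability, 1/20, whereas uniform sampling over the allowed indices at each step would favor some sets over others.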

Compared to the base class, the parameter heads/generators are unchanged; only the structure sampling logits are modified prior to constructing the categorical distribution.

`__init__(num_reactions, num_parameters, num_inputs, encoder_attributes, deep_layer_size, structure_head_attributes, parameter_head_attributes, input_influence_head_attributes, target_set_size, masks=None, continuous_distribution={'type': 'lognormal'}, discrete_distribution={'type': 'categorical', 'categories': torch.tensor([1, 2])}, entropy_weights_per_head=None, structure_head_temperature={'target_entropy_ratio_to_max': 1.0, 'initial_temperature': 1.0, 'rate': 0.0, 'current_temperature': 1.0}, allow_input_influence=False, device=None, combinatorial_bias_enabled=True, constraint_strength=float('inf'))`

Initialize the ordered-index reaction-addition policy.

All parameters from AddReactionByIndex are supported. Additional parameters:

PARAMETER	DESCRIPTION
`target_set_size`	int. Desired total number of reactions in the final CRN (including template reactions). Used to compute the combinatorial prior so that, under an uninformative policy, the probability of arriving at a particular final set is approximately uniform: P(set) ∝ 1 / C(M, K), where M is the library size and K is `target_set_size`.
`combinatorial_bias_enabled`	bool, default=True. If True, adds a combinatorial bias term to the structure logits that accounts for how many completions remain if a given index is chosen next.
`constraint_strength`	float, default=inf. Strength of the ordering constraint. If finite, a subtractive penalty is applied to out-of-order logits (soft constraint); if infinite, out-of-order choices are treated as impossible (hard mask).

Internal state

- template_mask (torch.Tensor or None): Snapshot of the initial reaction multi-hot vector for the current episode/batch. Shape (N, M). Set on the first forward call after reset_template().
- library_indices (torch.Tensor): Float tensor [0, 1, ..., M-1] used to compute max indices efficiently.

`reset_template()`

Reset the internal template snapshot.

Call this at the start of a new episode (or whenever the “template CRN” changes) so that the next call to forward captures the current reaction multi-hot vector as template_mask.

Why this matters

The ordering constraint is designed to apply only to reactions added by the agent. Resetting the template ensures that pre-existing/template reactions do not influence the computed max_added_index and therefore do not restrict future choices.
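
The effect of a stale template can be shown with a small stand-in class. `TemplateTracker` and `added_by_agent` are hypothetical names used only for illustration; the sketch mimics the snapshot-on-first-call behavior described above and is not the policy class itself.

```python
import torch

class TemplateTracker:
    """Hypothetical stand-in for the policy's template bookkeeping."""

    def __init__(self, num_reactions):
        self.num_reactions = num_reactions
        self.template_mask = None

    def reset_template(self):
        self.template_mask = None

    def added_by_agent(self, state):
        present = state[:, :self.num_reactions]
        if self.template_mask is None:       # first call after a reset
            self.template_mask = present.clone()
        return (present - self.template_mask) > 0.5

tracker = TemplateTracker(num_reactions=4)
first = tracker.added_by_agent(torch.tensor([[0., 0., 0., 1.]]))   # template: reaction 3
second = tracker.added_by_agent(torch.tensor([[0., 1., 0., 1.]]))  # agent added reaction 1

# A new episode starts from a different template (reaction 0 only).
new_episode = torch.tensor([[1., 0., 0., 0.]])
stale = tracker.added_by_agent(new_episode)   # old snapshot still in use
tracker.reset_template()
fresh = tracker.added_by_agent(new_episode)   # snapshot retaken
```

Without the reset, reaction 0 of the new template is misread as agent-added (`stale`), which would wrongly raise the maximum added index; after the reset nothing is flagged (`fresh`).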

`forward(state, mode='full', action=None, structure_temp=None)`

Sample or score actions under ordered-index and combinatorial constraints.

This method mirrors AddReactionByIndex.forward but modifies the structure logits before sampling/scoring the reaction index.

PARAMETER	DESCRIPTION
`state`	torch.Tensor. Batched observation tensor (N, D). The first M entries must be the reaction multi-hot vector indicating reactions present in the current IOCRN.
`mode`	{"full", "partial"}. "full" samples structure and parameters (supported); "partial" is not implemented.
`action`	list[dict] or None. If provided, the method computes log π(action | state) for the given actions instead of sampling. The action dictionaries must include a "reaction index" and parameter fields consistent with the configured generators (same as base class).
`structure_temp`	float or None. Optional temperature override for the structure head logits.

RETURNS	DESCRIPTION
	If `action is None`: `actions` (list[dict]), the sampled actions, one per batch element; `log_probabilities` (torch.Tensor), the log-probability per batch element, including structure and parameter terms; `entropies` (torch.Tensor), the weighted entropy per batch element (structure and parameter heads).
	If `action is not None`: `log_probabilities` (torch.Tensor), the log-probability per batch element for the provided actions.

Ordering logic

1) Template snapshot (first call only). If template_mask is not set, store state[:, :M] as the template.
2) Determine agent-added reactions. Compute added_reactions_mask = (state[:, :M] - template_mask) > 0.5, together with num_added_by_agent and total_existing_counts.
3) Sequentiality mask. Let max_added_index be the maximum library index among added reactions. If the agent has added at least one reaction, indices <= max_added_index are penalized or masked (depending on constraint_strength).
4) Combinatorial bias (optional). A bias term is added to each candidate reaction index i, representing the log-count of ways to complete the remaining set after choosing i, accounting for template-fixed items. Invalid completions yield -inf and are hard-masked.
5) Hard vs. soft masks.
Hard mask:
- template reactions (cannot re-select fixed/template entries)
- impossible completions from combinatorial bias (-inf)
Soft mask:
- out-of-order indices (sequentiality violations), penalized according to constraint_strength (equivalent to a hard mask when constraint_strength = inf)
6) Emergency valve. If all logits become -inf for any batch element, the logit of the last index is set to 0 to avoid crashing the categorical distribution construction.
7) Sampling / scoring. Build a Categorical over the masked logits (with temperature) and sample or evaluate the provided indices.
8) Parameter generation. Delegates to the same continuous/discrete parameter generators as the base class.
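
The template hard mask, the emergency valve, and the final sampling step can be sketched as follows. `sample_structure` is a hypothetical helper written for this illustration, assuming the logits have already received any ordering penalty and combinatorial bias; the library's actual forward pass may differ in detail.

```python
import math
import torch
from torch.distributions import Categorical

def sample_structure(logits, template_mask, temperature=1.0):
    """Mask template reactions, guard against all -inf rows, then sample."""
    # Hard mask: template reactions can never be re-selected.
    logits = logits.masked_fill(template_mask > 0.5, -math.inf)
    # Emergency valve: if an entire row is -inf, reopen its last index so
    # that the categorical distribution stays well defined.
    dead_rows = (logits == -math.inf).all(dim=1)
    logits[dead_rows, -1] = 0.0
    dist = Categorical(logits=logits / temperature)
    index = dist.sample()
    return index, dist.log_prob(index)

logits = torch.zeros(2, 3)
template = torch.tensor([[1., 1., 1.],    # everything masked: valve opens index 2
                         [1., 0., 1.]])   # only index 1 is selectable
index, log_prob = sample_structure(logits, template)
```

Both rows leave exactly one selectable index, so the sampled indices are deterministic here and their log-probabilities are zero.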

Notes

This class does not change the parameterization; it only constrains structure sampling.
The “entropy correction” term when combinatorial bias is enabled modifies the structure entropy signal by adding E_p[bias], which corresponds to optimizing toward the biased prior (i.e., minimizing KL(p || exp(bias)) up to a constant).
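
A short derivation of this correspondence, given as a sketch. The symbols are introduced here for illustration: b_i is the bias of index i, Z = Σ_i exp(b_i) is summed over indices with finite bias, and q_i = exp(b_i) / Z is the normalized biased prior.

\[H(p) + \mathbb{E}_p[b] = -\sum_i p_i \log p_i + \sum_i p_i b_i = -\sum_i p_i \log\frac{p_i}{e^{b_i}} = \log Z - \mathrm{KL}(p \,\|\, q)\]

Because log Z does not depend on p, maximizing the corrected entropy is equivalent to minimizing KL(p || q).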

`log_combinations(n, k)`

Compute the logarithm of the binomial coefficient, log C(n, k), in a numerically stable way.

This helper is used to build combinatorial priors/biases over remaining action choices. It supports tensor-valued inputs and returns -inf for invalid pairs (k < 0 or k > n), which is convenient when treating invalid combinations as impossible events.

PARAMETER	DESCRIPTION
`n`	torch.Tensor. Number of items available (broadcastable).
`k`	torch.Tensor. Number of items to choose (broadcastable).

RETURNS	DESCRIPTION
	torch.Tensor. `log(C(n, k))` with the broadcast shape of `n` and `k`. Entries corresponding to invalid (n, k) pairs are `-inf`.

Notes

Uses the identity:

\[\log C(n, k) = \log\Gamma(n+1) - \log\Gamma(k+1) - \log\Gamma(n-k+1)\]

and clamps intermediate values to avoid NaNs when masking invalid inputs.
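
A minimal sketch of such a helper, assuming `torch.lgamma` for the log-gamma terms. The name `log_combinations_sketch` and the specific clamping strategy are illustrative choices, not the library's code.

```python
import torch

def log_combinations_sketch(n, k):
    """log C(n, k) for broadcastable inputs; -inf where k < 0 or k > n."""
    n = torch.as_tensor(n, dtype=torch.float32)
    k = torch.as_tensor(k, dtype=torch.float32)
    n, k = torch.broadcast_tensors(n, k)
    valid = (k >= 0) & (k <= n)
    # Clamp so lgamma only sees arguments >= 1, even on invalid entries;
    # those entries are overwritten with -inf below.
    n_safe = n.clamp(min=0)
    k_safe = torch.minimum(k.clamp(min=0), n_safe)
    log_c = (torch.lgamma(n_safe + 1)
             - torch.lgamma(k_safe + 1)
             - torch.lgamma(n_safe - k_safe + 1))
    return torch.where(valid, log_c, torch.full_like(log_c, float("-inf")))

# n broadcasts against k: log C(5, 2), log C(5, 0), then two invalid pairs.
out = log_combinations_sketch(5, torch.tensor([2., 0., 6., -1.]))
```

The first two entries are log 10 and 0; the last two are `-inf` because k > n and k < 0 respectively.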