add_reaction_by_index

RL4CRN.policies.add_reaction_by_index

Neural-network policies for adding reactions to an IOCRN.

This module contains policy networks that map a tensorized IOCRN observation to a distribution over actions that extend the CRN by adding one reaction.

In the default “add reaction by index” formulation, an action is a dictionary:

  • reaction index (int): which library reaction to add next
  • continuous parameters (list[float]): sampled continuous parameters (masked per reaction)
  • discrete parameters (list[int] | None): sampled discrete parameters (masked per reaction)
  • parameters (array-like): concatenation of the continuous and discrete parameters (if any)
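A sampled action might look like the following sketch. The key names mirror the fields described above, but the exact strings used by the implementation may differ:

```python
# Hypothetical sampled action for a library of M = 4 reactions,
# where the chosen reaction has two continuous and one discrete parameter.
action = {
    "reaction index": 2,                    # which library reaction to add
    "continuous parameters": [0.73, 1.41],  # masked per reaction
    "discrete parameters": [1],             # None if the reaction has none
    "parameters": [0.73, 1.41, 1.0],        # continuous + discrete concatenated
}
```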

The policy factorizes the joint action distribution into a structure term and (optional) parameter terms:

\[\pi(a | s) = \pi_{struct}(r | s) \cdot \pi_{cont}(\theta_c | s, r) \cdot \pi_{disc}(\theta_d | s, r, \theta_c)\]

Log-probabilities and entropies returned by the policy correspond to this factorization:

\[\log \pi(a|s) = \log \pi_{struct}(r|s) + \log \pi_{cont}(\theta_c|s,r) + \log \pi_{disc}(\theta_d|s,r,\theta_c)\]
\[H(\pi) = w_s H(\pi_{struct}) + w_c H(\pi_{cont}) + w_d H(\pi_{disc}) \quad \text{(weighted per head)}\]
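In code, the factorized log-probability and the weighted entropy reduce to simple sums over per-head tensors. A minimal sketch with made-up head outputs (all values and names are illustrative):

```python
import torch

N = 3  # batch size
# Per-head log-probabilities, shape (N,) each (illustrative values).
logp_struct = torch.tensor([-1.2, -0.8, -2.0])
logp_cont = torch.tensor([-0.5, -0.3, -0.7])
logp_disc = torch.tensor([-0.7, -0.9, -0.4])

# Joint log-probability: sum of the head terms.
log_prob = logp_struct + logp_cont + logp_disc

# Weighted entropy: w_s * H_struct + w_c * H_cont + w_d * H_disc.
weights = {"structure": 1.0, "continuous": 0.1, "discrete": 0.1}
H_struct = torch.tensor([1.3, 1.1, 0.9])
H_cont = torch.tensor([0.4, 0.5, 0.6])
H_disc = torch.tensor([0.2, 0.3, 0.1])
entropy = (weights["structure"] * H_struct
           + weights["continuous"] * H_cont
           + weights["discrete"] * H_disc)
```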

Masking is used to:

  • forbid selecting reactions already present in the IOCRN (structure logits masked to -∞),
  • forbid sampling parameters that do not exist for the chosen reaction (dimension masks),
  • forbid invalid discrete-category combinations when using a flattened logit space (logit masks).

Temperature scaling can be applied to the structure logits to control exploration:

\[ \pi_{struct}(r|s) = \text{softmax}\left(\frac{z_r(s)}{T}\right) \]

where T may be adapted online to target a desired entropy ratio.
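The effect of T on the structure distribution can be illustrated directly (a sketch; the logit values are made up):

```python
import torch

z = torch.tensor([2.0, 1.0, 0.0])  # structure logits for M = 3 reactions

def structure_probs(z, T):
    # Temperature-scaled softmax: higher T flattens the distribution.
    return torch.softmax(z / T, dim=-1)

sharp = structure_probs(z, T=0.5)   # low T: near-greedy
flat = structure_probs(z, T=10.0)   # high T: near-uniform
```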

AddReactionByIndex

Bases: Module

Policy network that samples one reaction addition for each element of a batch of IOCRNs.

The policy has an encoder + multiple “heads”:

  • Encoder: maps the observation vector state to a learned embedding h.
  • Structure head: produces logits over M library reactions, then samples a reaction index.
  • Continuous parameter generator (optional): samples continuous parameters for the chosen reaction.
  • Discrete parameter generator (optional): samples discrete parameters for the chosen reaction.

The action distribution factorizes as:

\[\pi(a|s) = \pi_{struct}(r|s) \cdot \pi_{cont}(\theta_c|s,r) \cdot \pi_{disc}(\theta_d|s,r,\theta_c)\]

where:

  • \(r\) is the reaction index (0..M-1),
  • \(\theta_c\) are continuous parameters (e.g. LogNormal),
  • \(\theta_d\) are discrete parameters (e.g. Categorical).

Notes: State layout (no input-influence observation): state ∈ R^{N×(M+K)}

  • state[:, :M] : multi-hot “reactions present” indicator
  • state[:, M:] : flattened parameter vector (0 where not present)

If allow_input_influence=True, the expected state layout is larger (includes additional per-input parameter influence features). This path is partially scaffolded but not implemented end-to-end in the current code.
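Under this layout a batch observation can be assembled as follows. This is a sketch with made-up M and K; in practice the observer/tensorizer produces this tensor:

```python
import torch

N, M, K = 2, 4, 6  # batch size, library size, flattened parameter length

reactions_present = torch.zeros(N, M)
reactions_present[0, 1] = 1.0  # sample 0 already contains reaction 1
parameters = torch.zeros(N, K)
parameters[0, 2:4] = torch.tensor([0.5, 1.2])  # its parameters, 0 elsewhere

state = torch.cat([reactions_present, parameters], dim=1)  # shape (N, M + K)
assert state.shape == (N, M + K)
assert not torch.isnan(state).any()  # forward() asserts no NaNs
```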

Returns from forward:

  • sampled action dictionaries (unless action is provided),
  • log-probabilities (per batch element),
  • entropies (per batch element, weighted per head).

__init__(num_reactions, num_parameters, num_inputs, encoder_attributes, deep_layer_size, structure_head_attributes, parameter_head_attributes, input_influence_head_attributes, masks=None, zero_reaction_idx=None, stop_flag=False, continuous_distribution={'type': 'lognormal'}, discrete_distribution={'type': 'categorical', 'categories': torch.tensor([1, 2])}, entropy_weights_per_head=None, structure_head_temperature={'target_entropy_ratio_to_max': 1.0, 'initial_temperature': 1.0, 'rate': 0.0, 'current_temperature': 1.0}, allow_input_influence=False, device=None)

Initialize the AddReactionByIndex policy.

PARAMETER DESCRIPTION
num_reactions

int Number of candidate reactions in the library (denoted M).

num_parameters

int Size of the flattened global parameter vector across the library (denoted K). This corresponds to the “explicit” parameterization used by observers/tensorizers.

num_inputs

int Number of IO inputs (denoted p). Only relevant when using input-influence features.

encoder_attributes

dict Configuration for the encoder MLP (hidden_size, num_layers).

deep_layer_size

int Dimensionality of the encoder output embedding h(s).

structure_head_attributes

dict Configuration for the structure head MLP (hidden_size, num_layers).

parameter_head_attributes

dict Configuration for parameter generator backbones (hidden_size, num_layers).

input_influence_head_attributes

dict Reserved for a future input-influence head (currently not implemented).

masks

dict or None Optional masks derived from the reaction library:

  • 'continuous': float mask of shape (M, max_num_continuous_params)
  • 'discrete': float mask of shape (M, max_num_discrete_params)
  • 'logit': bool mask of shape (M, total_num_discrete_combinations)

These masks are used to ensure only existing parameters/logits are used for each reaction.
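For example, with M = 3 library reactions where reaction 0 has two continuous and one discrete parameter, reaction 1 has one continuous parameter, and reaction 2 has none (a sketch; the real masks come from the reaction library):

```python
import torch

masks = {
    # (M, max_num_continuous_params): 1 where the parameter exists
    "continuous": torch.tensor([[1.0, 1.0],
                                [1.0, 0.0],
                                [0.0, 0.0]]),
    # (M, max_num_discrete_params)
    "discrete": torch.tensor([[1.0],
                              [0.0],
                              [0.0]]),
    # (M, total_num_discrete_combinations): valid flattened category combos
    "logit": torch.tensor([[True, True],
                           [False, False],
                           [False, False]]),
}
```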

zero_reaction_idx

int or None If provided, the policy will be allowed to resample the “zero reaction” more than once.

stop_flag

bool If True, the policy will stop adding reactions when the “zero reaction” is selected.

continuous_distribution

dict Continuous parameter distribution spec passed to ParameterGeneratorFromDistribution (e.g. {"type": "lognormal", ...}). The policy sets dim automatically from masks.

discrete_distribution

dict Discrete parameter distribution spec (e.g. {"type": "categorical", "categories": ...}). The policy sets dim automatically from masks. Current implementation assumes the same categories for each discrete dimension.

entropy_weights_per_head

dict or None Entropy weights for each head. Keys: {'structure','continuous','discrete','input_influence'}. Used to form a weighted entropy signal: H_total = Σ_i w_i H_i

structure_head_temperature

dict Temperature schedule state for the structure head. Expected keys:

  • target_entropy_ratio_to_max
  • initial_temperature
  • rate
  • current_temperature

The logits are scaled as z/T before constructing the Categorical distribution.

allow_input_influence

bool If True, the observation and architecture include additional features/heads for input influence. (Currently not implemented.)

device

torch.device or None Device where parameters and tensors should live.

forward(state, mode='full', action=None, structure_temp=None)

Sample actions (or score provided actions) for a batch of IOCRN observations.

PARAMETER DESCRIPTION
state

torch.Tensor Batched observation tensor of shape (N, D). For allow_input_influence=False, D = M + K and the layout is:

  • state[:, :M] : multi-hot indicator of reactions already present
  • state[:, M:] : flattened parameters (0 for absent reactions)

The method asserts that the input contains no NaNs.

mode

{"full", "partial"}

  • full: sample reaction structure and parameters.
  • partial: intended for parameter-only decisions given a fixed structure (not implemented in current code).

action

list[dict] or None If provided, the policy does not sample; it computes log π(action|state) for the given batch of actions (used e.g. in SIL replay / scoring). Each dict must include at least:

  • reaction index
  • continuous parameters (if continuous generator exists)
  • discrete parameters (if discrete generator exists)

structure_temp

float or None If provided, overrides the structure-head temperature T used for this call.

RETURNS DESCRIPTION

  • If action is None:
    • actions : list[dict] Sampled actions, one per batch element.
    • log_probabilities : torch.Tensor Log-probabilities per batch element, shape (N,). Computed as the sum of head log-probabilities: log π(a|s) = log π_struct + log π_cont + log π_disc (+ log π_input_influence)
    • entropies : torch.Tensor Weighted entropy per batch element, shape (N,): H = w_s H_struct + w_c H_cont + w_d H_disc
  • If action is not None:
    • log_probabilities : torch.Tensor Log-probabilities of the provided actions, shape (N,).
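The sampling-vs-scoring distinction mirrors the standard torch.distributions pattern: sample when no action is given, otherwise evaluate log_prob of the provided action. A sketch using only the structure head (logits and actions are made up):

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([[2.0, 0.0, 1.0],
                       [0.0, 3.0, 0.0]])  # (N, M) structure logits
dist = Categorical(logits=logits)

# Sampling path (action is None): draw actions and score them.
sampled = dist.sample()            # shape (N,)
logp_sampled = dist.log_prob(sampled)

# Scoring path (action provided): no sampling, just log-probabilities.
given = torch.tensor([2, 1])       # e.g. actions from a SIL replay buffer
logp_given = dist.log_prob(given)  # shape (N,)
```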
Implementation details

Structure sampling with masking

Let z(s) be the structure logits (N×M). Reactions already present are masked: z_masked = z(s) with z_masked[r_present] = -∞. Temperature scaling is then applied, z_T = z_masked / T, and a Categorical distribution is formed: r ~ Categorical(logits=z_T).
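Putting the masking and temperature steps together (a sketch; variable names and values are illustrative):

```python
import torch
from torch.distributions import Categorical

z = torch.tensor([[1.0, 2.0, 0.5, 0.0]])  # (N=1, M=4) raw structure logits
present = torch.tensor([[0, 1, 0, 0]], dtype=torch.bool)  # reaction 1 already in the IOCRN
T = 1.5

z_masked = z.masked_fill(present, float("-inf"))  # forbid re-adding present reactions
dist = Categorical(logits=z_masked / T)           # temperature-scaled structure head

r = dist.sample()                 # sampled reaction index, shape (1,)
assert not present[0, r]          # a masked reaction is never drawn
assert dist.probs[0, 1] == 0.0    # -inf logit -> zero probability
```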

Adaptive temperature (training only)

When sampling (action is None) in training mode, the current temperature is nudged based on the observed mean structure entropy relative to the maximum entropy log(M).
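The exact update rule lives in the implementation; the following is one plausible sketch of such a nudge, in which the rule and all names are assumptions rather than the library's code:

```python
import math

def nudge_temperature(T, mean_entropy, M, target_ratio, rate):
    # Move T up when observed entropy is below the target fraction of the
    # maximum entropy log(M) (more exploration), and down when above it.
    max_entropy = math.log(M)
    error = target_ratio - mean_entropy / max_entropy
    return max(T + rate * error, 1e-6)  # keep T strictly positive

T = nudge_temperature(T=1.0, mean_entropy=0.5, M=8, target_ratio=1.0, rate=0.1)
```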

Parameter generation

Continuous and discrete parameters are generated conditionally using ParameterGeneratorFromDistribution, and are masked so that nonexistent parameters are zeroed out and/or omitted from the returned per-sample lists.
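Dimension masking reduces to an elementwise product with the chosen reaction's mask row (a sketch; mask shapes follow the masks description above, values are made up):

```python
import torch

# Continuous mask for M = 3 reactions, up to 2 continuous parameters each.
cont_mask = torch.tensor([[1.0, 1.0],
                          [1.0, 0.0],
                          [0.0, 0.0]])

r = torch.tensor([1])                 # chosen reaction index per batch element
theta_c = torch.tensor([[0.8, 0.3]])  # raw samples from the generator

masked = theta_c * cont_mask[r]       # zero out nonexistent parameters
# Per-sample list keeps only the dimensions that exist for reaction r:
params_list = [row[mask.bool()].tolist()
               for row, mask in zip(masked, cont_mask[r])]
```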