add_reaction_by_index

RL4CRN.policies.add_reaction_by_index

Neural-network policies for adding reactions to an IOCRN.

This module contains policy networks that map a tensorized IOCRN observation to a distribution over actions that extend the CRN by adding one reaction.

In the default “add reaction by index” formulation, an action is a dictionary:

  • reaction index (int): which library reaction to add next
  • continuous parameters (list[float]): sampled continuous parameters (masked per reaction)
  • discrete parameters (list[int] | None): sampled discrete parameters (masked per reaction)
  • parameters (array-like): concatenation of the continuous and discrete parameters (if any)
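A sampled action might look like the following sketch. The key names mirror the fields described above, but the exact strings used by the implementation may differ:

```python
# Hypothetical sampled action for a library of M = 4 reactions,
# where the chosen reaction has two continuous and one discrete parameter.
action = {
    "reaction index": 2,                    # which library reaction to add
    "continuous parameters": [0.73, 1.41],  # masked per reaction
    "discrete parameters": [1],             # None if the reaction has none
    "parameters": [0.73, 1.41, 1.0],        # continuous + discrete concatenated
}
```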

The policy factorizes the joint action distribution into a structure term and (optional) parameter terms:

\[\pi(a | s) = \pi_{struct}(r | s) \cdot \pi_{cont}(\theta_c | s, r) \cdot \pi_{disc}(\theta_d | s, r, \theta_c)\]

Log-probabilities and entropies returned by the policy correspond to this factorization:

\[\log \pi(a|s) = \log \pi_{struct}(r|s) + \log \pi_{cont}(\theta_c|s,r) + \log \pi_{disc}(\theta_d|s,r,\theta_c)\]
\[H(\pi) = w_s H(\pi_{struct}) + w_c H(\pi_{cont}) + w_d H(\pi_{disc}) \quad \text{(weighted per head)}\]
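In code, the factorized log-probability and the weighted entropy reduce to simple sums over per-head tensors. A minimal sketch with made-up head outputs (all values and names are illustrative):

```python
import torch

N = 3  # batch size
# Per-head log-probabilities, shape (N,) each (illustrative values).
logp_struct = torch.tensor([-1.2, -0.8, -2.0])
logp_cont = torch.tensor([-0.5, -0.3, -0.7])
logp_disc = torch.tensor([-0.7, -0.9, -0.4])

# Joint log-probability: sum of the head terms.
log_prob = logp_struct + logp_cont + logp_disc

# Weighted entropy: w_s * H_struct + w_c * H_cont + w_d * H_disc.
weights = {"structure": 1.0, "continuous": 0.1, "discrete": 0.1}
H_struct = torch.tensor([1.3, 1.1, 0.9])
H_cont = torch.tensor([0.4, 0.5, 0.6])
H_disc = torch.tensor([0.2, 0.3, 0.1])
entropy = (weights["structure"] * H_struct
           + weights["continuous"] * H_cont
           + weights["discrete"] * H_disc)
```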

Masking is used to:

  • forbid selecting reactions already present in the IOCRN (structure logits masked to -∞),
  • forbid sampling parameters that do not exist for the chosen reaction (dimension masks),
  • forbid invalid discrete-category combinations when using a flattened logit space (logit masks).

Temperature scaling can be applied to the structure logits to control exploration:

\[ \pi_{struct}(r|s) = \text{softmax}\left(\frac{z_r(s)}{T}\right) \]

where T may be adapted online to target a desired entropy ratio.
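The effect of T on the structure distribution can be illustrated directly (a sketch; the logit values are made up):

```python
import torch

z = torch.tensor([2.0, 1.0, 0.0])  # structure logits for M = 3 reactions

def structure_probs(z, T):
    # Temperature-scaled softmax: higher T flattens the distribution.
    return torch.softmax(z / T, dim=-1)

sharp = structure_probs(z, T=0.5)   # low T: near-greedy
flat = structure_probs(z, T=10.0)   # high T: near-uniform
```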

AddReactionByIndex

Bases: Module

Policy network that samples one reaction addition for each element of a batch of IOCRNs.

The policy has an encoder + multiple “heads”:

  • Encoder: maps the observation vector state to a learned embedding h.
  • Structure head: produces logits over M library reactions, then samples a reaction index.
  • Continuous parameter generator (optional): samples continuous parameters for the chosen reaction.
  • Discrete parameter generator (optional): samples discrete parameters for the chosen reaction.

The action distribution factorizes as:

\[\pi(a|s) = \pi_{struct}(r|s) \cdot \pi_{cont}(\theta_c|s,r) \cdot \pi_{disc}(\theta_d|s,r,\theta_c)\]

where:

  • \(r\) is the reaction index (0..M-1),
  • \(\theta_c\) are continuous parameters (e.g. LogNormal),
  • \(\theta_d\) are discrete parameters (e.g. Categorical).

Notes: State layout (no input-influence observation): state ∈ R^{N×(M+K)}

  • state[:, :M] : multi-hot “reactions present” indicator
  • state[:, M:] : flattened parameter vector (0 where not present)

If allow_input_influence=True, the expected state layout is larger (includes additional per-input parameter influence features). This path is partially scaffolded but not implemented end-to-end in the current code.
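Under this layout a batch observation can be assembled as follows. This is a sketch with made-up M and K; in practice the observer/tensorizer produces this tensor:

```python
import torch

N, M, K = 2, 4, 6  # batch size, library size, flattened parameter length

reactions_present = torch.zeros(N, M)
reactions_present[0, 1] = 1.0  # sample 0 already contains reaction 1
parameters = torch.zeros(N, K)
parameters[0, 2:4] = torch.tensor([0.5, 1.2])  # its parameters, 0 elsewhere

state = torch.cat([reactions_present, parameters], dim=1)  # shape (N, M + K)
assert state.shape == (N, M + K)
assert not torch.isnan(state).any()  # forward() asserts no NaNs
```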

Returns from forward:

  • sampled action dictionaries (unless action is provided),
  • log-probabilities (per batch element),
  • entropies (per batch element, weighted per head).

__init__(num_reactions, num_parameters, num_inputs, encoder_attributes, deep_layer_size, structure_head_attributes, parameter_head_attributes, input_influence_head_attributes, masks=None, zero_reaction_idx=None, stop_flag=False, continuous_distribution={'type': 'lognormal'}, discrete_distribution={'type': 'categorical', 'categories': torch.tensor([1, 2])}, entropy_weights_per_head=None, structure_head_temperature={'target_entropy_ratio_to_max': 1.0, 'initial_temperature': 1.0, 'rate': 0.0, 'current_temperature': 1.0}, allow_input_influence=False, device=None)

Initialize the AddReactionByIndex policy.

PARAMETER DESCRIPTION
num_reactions

int Number of candidate reactions in the library (denoted M).

num_parameters

int Size of the flattened global parameter vector across the library (denoted K). This corresponds to the “explicit” parameterization used by observers/tensorizers.

num_inputs

int Number of IO inputs (denoted p). Only relevant when using input-influence features.

encoder_attributes

dict Configuration for the encoder MLP (hidden_size, num_layers).

deep_layer_size

int Dimensionality of the encoder output embedding h(s).

structure_head_attributes

dict Configuration for the structure head MLP (hidden_size, num_layers).

parameter_head_attributes

dict Configuration for parameter generator backbones (hidden_size, num_layers).

input_influence_head_attributes

dict Reserved for a future input-influence head (currently not implemented).

masks

dict or None Optional masks derived from the reaction library:

  • 'continuous': float mask of shape (M, max_num_continuous_params)
  • 'discrete': float mask of shape (M, max_num_discrete_params)
  • 'logit': bool mask of shape (M, total_num_discrete_combinations)

These masks are used to ensure only existing parameters/logits are used for each reaction.
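For example, with M = 3 library reactions where reaction 0 has two continuous and one discrete parameter, reaction 1 has one continuous parameter, and reaction 2 has none (a sketch; the real masks come from the reaction library):

```python
import torch

masks = {
    # (M, max_num_continuous_params): 1 where the parameter exists
    "continuous": torch.tensor([[1.0, 1.0],
                                [1.0, 0.0],
                                [0.0, 0.0]]),
    # (M, max_num_discrete_params)
    "discrete": torch.tensor([[1.0],
                              [0.0],
                              [0.0]]),
    # (M, total_num_discrete_combinations): valid flattened category combos
    "logit": torch.tensor([[True, True],
                           [False, False],
                           [False, False]]),
}
```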

zero_reaction_idx

int or None If provided, the policy will be allowed to resample the “zero reaction” more than once.

stop_flag

bool If True, the policy will stop adding reactions when the “zero reaction” is selected.

continuous_distribution

dict Continuous parameter distribution spec passed to ParameterGeneratorFromDistribution (e.g. {"type": "lognormal", ...}). The policy sets dim automatically from masks.

discrete_distribution

dict Discrete parameter distribution spec (e.g. {"type": "categorical", "categories": ...}). The policy sets dim automatically from masks. Current implementation assumes the same categories for each discrete dimension.

entropy_weights_per_head

dict or None Entropy weights for each head. Keys: {'structure','continuous','discrete','input_influence'}. Used to form a weighted entropy signal: H_total = Σ_i w_i H_i

structure_head_temperature

dict Temperature schedule state for the structure head. Expected keys:

  • target_entropy_ratio_to_max
  • initial_temperature
  • rate
  • current_temperature

The logits are scaled as z/T before constructing the Categorical distribution.

allow_input_influence

bool If True, the observation and architecture include additional features/heads for input influence. (Currently not implemented.)

device

torch.device or None Device where parameters and tensors should live.

forward(state, mode='full', action=None, structure_temp=None)

Sample actions (or score provided actions) for a batch of IOCRN observations.

PARAMETER DESCRIPTION
state

torch.Tensor Batched observation tensor of shape (N, D). For allow_input_influence=False, D = M + K and the layout is:

  • state[:, :M] : multi-hot indicator of reactions already present
  • state[:, M:] : flattened parameters (0 for absent reactions)

The method asserts that the input contains no NaNs.

mode

{"full", "partial"}

  • full: sample reaction structure and parameters.
  • partial: intended for parameter-only decisions given a fixed structure (not implemented in current code).

action

list[dict] or None If provided, the policy does not sample; it computes log π(action|state) for the given batch of actions (used e.g. in SIL replay / scoring). Each dict must include at least:

  • reaction index
  • continuous parameters (if continuous generator exists)
  • discrete parameters (if discrete generator exists)

structure_temp

float or None If provided, overrides the structure-head temperature T used for this call.

RETURNS DESCRIPTION

  • If action is None:
    • actions : list[dict] Sampled actions, one per batch element.
    • log_probabilities : torch.Tensor Log-probabilities per batch element, shape (N,). Computed as the sum of head log-probabilities: log π(a|s) = log π_struct + log π_cont + log π_disc (+ log π_input_influence)
    • entropies : torch.Tensor Weighted entropy per batch element, shape (N,): H = w_s H_struct + w_c H_cont + w_d H_disc
  • If action is not None:
    • log_probabilities : torch.Tensor Log-probabilities of the provided actions, shape (N,).
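The sampling-vs-scoring distinction mirrors the standard torch.distributions pattern: sample when no action is given, otherwise evaluate log_prob of the provided action. A sketch using only the structure head (logits and actions are made up):

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([[2.0, 0.0, 1.0],
                       [0.0, 3.0, 0.0]])  # (N, M) structure logits
dist = Categorical(logits=logits)

# Sampling path (action is None): draw actions and score them.
sampled = dist.sample()            # shape (N,)
logp_sampled = dist.log_prob(sampled)

# Scoring path (action provided): no sampling, just log-probabilities.
given = torch.tensor([2, 1])       # e.g. actions from a SIL replay buffer
logp_given = dist.log_prob(given)  # shape (N,)
```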
Implementation details

Structure sampling with masking

Let z(s) be the structure logits (N×M). Reactions already present are masked: z_masked = z(s) with z_masked[r_present] = -∞. Temperature scaling is then applied, z_T = z_masked / T, and a Categorical distribution is formed: r ~ Categorical(logits=z_T).
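Putting the masking and temperature steps together (a sketch; variable names and values are illustrative):

```python
import torch
from torch.distributions import Categorical

z = torch.tensor([[1.0, 2.0, 0.5, 0.0]])  # (N=1, M=4) raw structure logits
present = torch.tensor([[0, 1, 0, 0]], dtype=torch.bool)  # reaction 1 already in the IOCRN
T = 1.5

z_masked = z.masked_fill(present, float("-inf"))  # forbid re-adding present reactions
dist = Categorical(logits=z_masked / T)           # temperature-scaled structure head

r = dist.sample()                 # sampled reaction index, shape (1,)
assert not present[0, r]          # a masked reaction is never drawn
assert dist.probs[0, 1] == 0.0    # -inf logit -> zero probability
```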

Adaptive temperature (training only)

When sampling (action is None) in training mode, the current temperature is nudged based on the observed mean structure entropy relative to the maximum entropy log(M).
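The exact update rule lives in the implementation; the following is one plausible sketch of such a nudge, in which the rule and all names are assumptions rather than the library's code:

```python
import math

def nudge_temperature(T, mean_entropy, M, target_ratio, rate):
    # Move T up when observed entropy is below the target fraction of the
    # maximum entropy log(M) (more exploration), and down when above it.
    max_entropy = math.log(M)
    error = target_ratio - mean_entropy / max_entropy
    return max(T + rate * error, 1e-6)  # keep T strictly positive

T = nudge_temperature(T=1.0, mean_entropy=0.5, M=8, target_ratio=1.0, rate=0.1)
```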

Parameter generation

Continuous and discrete parameters are generated conditionally using ParameterGeneratorFromDistribution, and are masked so that nonexistent parameters are zeroed out and/or omitted from the returned per-sample lists.
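Dimension masking reduces to an elementwise product with the chosen reaction's mask row (a sketch; mask shapes follow the masks description above, values are made up):

```python
import torch

# Continuous mask for M = 3 reactions, up to 2 continuous parameters each.
cont_mask = torch.tensor([[1.0, 1.0],
                          [1.0, 0.0],
                          [0.0, 0.0]])

r = torch.tensor([1])                 # chosen reaction index per batch element
theta_c = torch.tensor([[0.8, 0.3]])  # raw samples from the generator

masked = theta_c * cont_mask[r]       # zero out nonexistent parameters
# Per-sample list keeps only the dimensions that exist for reaction r:
params_list = [row[mask.bool()].tolist()
               for row, mask in zip(masked, cont_mask[r])]
```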