deterministic

`RL4CRN.rewards.deterministic`

RL4CRN.rewards

Reward / cost functions for evaluating IOCRN behaviors under different design tasks.

This module provides task-specific wrappers around an IOCRN's simulation interface (e.g., IOCRN.transient_response and IOCRN.transient_response_piecewise) and converts the resulting trajectories into scalar performance measures that can be used as RL rewards (or costs).

Included objectives:

Dynamic tracking (continuous-valued): weighted L1/L2 tracking error to a reference trajectory or setpoint across multiple scenarios.
Piecewise tracking: same as above, but with piecewise-constant inputs and segmented time horizons (useful for protocols / sequences).
Oscillation shaping: penalizes deviations from desired oscillatory features such as frequency, mean level, damping, and peak ratios using oscillation_metrics.
Logic circuit scoring: evaluates steady-state binary behavior (via BCE or thresholded mismatch) for combinational circuits and piecewise logic protocols (e.g., latches).
Custom relationship tracking: evaluates arbitrary algebraic constraints between species trajectories defined by a user-supplied function (targeting zero error).

All functions return a scalar performance (interpretable as cost unless you negate it) and update crn.last_task_info with metadata such as the reward value, task type, and the simulation settings that produced it.

`dynamic_tracking_error(crn, u_list, x0_list, time_horizon, r_list, w, norm=1, relative=False, LARGE_NUMBER=10000.0)`

Compute a dynamic tracking cost for an IOCRN over a batch of scenarios.

The function simulates the CRN for each scenario in the Cartesian product of u_list and x0_list (as implemented by crn.transient_response) and evaluates how well the output trajectories track the provided references.

PARAMETER	DESCRIPTION
`crn`	IOCRN An IOCRN-like object implementing `transient_response(u_list, x0_list, time_horizon, LARGE_NUMBER=...)`. The simulation is expected to return `(t, x_list, y_list, last_task_info)`, where `y_list` is a list of output trajectories.
`u_list`	list[np.ndarray] List of constant input vectors. Each element has shape `(p,)`, where `p` is the number of CRN inputs.
`x0_list`	list[np.ndarray] List of initial state vectors. Each element has shape `(n,)`, where `n` is the number of CRN species.
`time_horizon`	np.ndarray 1D array of evaluation times with shape `(T,)`.
`r_list`	list[np.ndarray] List of reference signals/targets for each scenario. The expected shape and interpretation depend on `performance_metric`. Common usage is a list of arrays with shape `(q, T)` or `(q,)` (setpoints), where `q` is the number of outputs.
`w`	np.ndarray Weights for the tracking error. Typically shape `(q, T)` so each output and time point can be weighted differently. (Exact expectations follow `performance_metric`.)
`norm`	int, default=1 Norm used in the tracking error. Passed to `performance_metric`. Common options are 1 (L1) or 2 (L2 / squared error).
`relative`	bool, default=False If True, compute a relative error (as supported by `performance_metric`).
`LARGE_NUMBER`	float, default=1e4 Divergence penalty passed to `transient_response`. If the integrator fails or becomes unstable, trajectories may be filled with `LARGE_NUMBER`, which typically yields a large cost.

RETURNS	DESCRIPTION
`performance`	float Scalar tracking cost aggregated across scenarios, outputs, and time.
`last_task_info`	dict Updated `crn.last_task_info` dictionary, augmented with: 'reward': performance 'setpoint': r_list 'initial_conditions': x0_list 'reward type': 'dynamic_tracking_error'

`habituation_error_piecewise(crn, u_nested_list, x0_list, nested_time_horizon, w, LARGE_NUMBER=10000.0, min_peak=0.1, max_peak=2.0)`

Compute a habituation cost for a piecewise protocol using peak ratios.

This function simulates the CRN under a piecewise-constant input protocol and evaluates "habituation" as a change in peak response across repeated stimulus windows.

Convention used here

nested_time_horizon defines K segments with time grids t_0, t_1, ..., t_{K-1}.
We treat even-indexed segments (0, 2, 4, ...) as "stimulus" windows. Peaks are measured in those windows for each scenario output trajectory.
A habituation score is computed from ratios of consecutive stimulus peaks: ratio_k = peak_{k+1} / peak_k (Lower ratios indicate stronger habituation.)

The function also enforces peak bounds. If any measured peak is outside [min_peak, max_peak], it returns LARGE_NUMBER.

PARAMETER	DESCRIPTION
`crn`	IOCRN IOCRN-like object implementing: `transient_response_piecewise(u_nested_list, x0_list, nested_time_horizon, LARGE_NUMBER=...)`.
`u_nested_list`	list[list[np.ndarray]] List of input protocols. Each protocol is a list of input vectors (p,) applied segment-wise. The inner list length must match `len(nested_time_horizon)`.
`x0_list`	list[np.ndarray] List of initial conditions (n,).
`nested_time_horizon`	list[np.ndarray] List of time grids, one per segment. These will be stitched by the simulator.
`w`	Union[float, Sequence[float], np.ndarray] Weights applied to each peak ratio term. If a scalar, the same weight is applied to all ratios. If a sequence, it should have length equal to the number of ratios (n_peaks - 1).
`LARGE_NUMBER`	float, default=1e4 Penalty returned if simulation diverges or peak constraints fail. TYPE: `float` DEFAULT: `10000.0`
`min_peak`	float, default=0.1 Minimum acceptable peak value. Peaks below this are considered invalid. TYPE: `float` DEFAULT: `0.1`
`max_peak`	float, default=2.0 Maximum acceptable peak value. Peaks above this are considered invalid. TYPE: `float` DEFAULT: `2.0`

RETURNS	DESCRIPTION
	Tuple[float, dict]: - performance: Scalar habituation cost (lower is better). - last_task_info: `crn.last_task_info` updated with metadata: - 'reward': performance - 'min_peak', 'max_peak' - 'initial_conditions' - 'reward type': 'habituation_error_piecewise'

`check_off_ss_invariant(crn, t, x, off_intervals, u_off, ss_tol_abs=0.0001, ss_tol_rel=0.001, xtol=1e-09, maxfev=2000)`

x: full state trajectory, shape (n, T) off_intervals: list of (start,end) absolute u_off: input vector for OFF

`habituation_metric(intervals, t, y_list, w, min_peak=0.1, max_peak=2.0, base_valley=0.0, LARGE_NUMBER=10000.0)`

Compute a habituation cost from output peaks across repeated stimulus windows.

This metric extracts peak amplitudes from specified time intervals and computes ratios of consecutive stimulus peaks. By default, it assumes even-indexed intervals (0, 2, 4, ...) correspond to repeated stimulus windows.

Steps

1) For each scenario output trajectory in y_list, compute peaks in stimulus windows (even intervals). 2) Enforce peak bounds: if any peak is outside [min_peak, max_peak], return LARGE_NUMBER. 3) Compute peak ratios: ratio_k = peak_{k+1} / peak_k. 4) Return weighted mean of ratios (lower implies stronger habituation).

Notes

If there are fewer than 2 stimulus windows, this metric cannot form a ratio and returns LARGE_NUMBER.
If any peak is zero or extremely small, division can blow up; we guard with a small epsilon.

PARAMETER	DESCRIPTION
`intervals`	Sequence[Tuple[float, float]] Time intervals (start, end) defining protocol segments. TYPE: `Sequence[Tuple[float, float]]`
`t`	np.ndarray Stitched time vector from the simulator, shape (T,). TYPE: `ndarray`
`y_list`	Sequence[np.ndarray] List of output trajectories, one per scenario. Each element is typically shape (q, T) (q outputs). TYPE: `Sequence[ndarray]`
`w`	Union[float, Sequence[float], np.ndarray] Weights for each ratio term. If scalar, repeated. If sequence, must match number of ratios (n_peaks - 1) or will be broadcast/clipped. TYPE: `Union[float, Sequence[float], ndarray]`
`min_peak`	float, default=0.1 Minimum acceptable peak amplitude. TYPE: `float` DEFAULT: `0.1`
`max_peak`	float, default=2.0 Maximum acceptable peak amplitude. TYPE: `float` DEFAULT: `2.0`
`LARGE_NUMBER`	float, default=1e4 Penalty returned if constraints fail or insufficient windows exist. TYPE: `float` DEFAULT: `10000.0`

RETURNS	DESCRIPTION
`float`	Scalar habituation cost. Lower is better. TYPE: `float`

`dynamic_tracking_error_piecewise(crn, u_nested_list, x0_list, nested_time_horizon, r_list, w, norm=1, relative=False, LARGE_NUMBER=10000.0)`

Compute a dynamic tracking cost for piecewise-constant input protocols.

This is the piecewise analogue of dynamic_tracking_error. Instead of constant inputs over a single horizon, each scenario specifies a sequence of inputs applied over segmented time horizons.

PARAMETER	DESCRIPTION
`crn`	IOCRN An IOCRN-like object implementing `transient_response_piecewise(u_nested_list, x0_list, nested_time_horizon, LARGE_NUMBER=...)`.
`u_nested_list`	list[list[np.ndarray]] List of input protocols. Each element is a sequence `[u_0, u_1, ..., u_K]`, where each `u_k` has shape `(p,)`. The inner list length must match `len(nested_time_horizon)`.
`x0_list`	list[np.ndarray] List of initial state vectors, each of shape `(n,)`.
`nested_time_horizon`	list[np.ndarray] List of time grids `[t_0, t_1, ..., t_K]`, one per protocol segment. Each `t_k` is a 1D array of times for that segment. The CRN simulator is responsible for stitching the full trajectory.
`r_list`	list[np.ndarray] Reference signals/targets for each scenario (see `performance_metric`).
`w`	np.ndarray Weights for the tracking error (typically shape `(q, T_full)`), where `T_full` matches the concatenated time grid used in the simulation.
`norm`	int, default=1 Norm used in the tracking error (passed to `performance_metric`).
`relative`	bool, default=False If True, compute a relative error (as supported by `performance_metric`).
`LARGE_NUMBER`	float, default=1e4 Divergence penalty passed to the simulator.

RETURNS	DESCRIPTION
`performance`	float Scalar tracking cost.
`last_task_info`	dict Updated `crn.last_task_info` with reward metadata (same keys as `dynamic_tracking_error`).

`oscillation_error(crn, u_list, x0_list, time_horizon, f_list=None, mean_list=None, w=[1 / 4, 1 / 4, 1 / 4, 1 / 4], t0=0, LARGE_NUMBER=10000.0)`

Compute an oscillation-shaping cost based on output time-series metrics.

The CRN is simulated (as in transient_response), then oscillatory features are extracted via oscillation_metrics. The returned scalar cost is a weighted sum of several error components:

mean error (if mean_list is provided)
frequency error (if f_list is provided)
damping deviation from 1
peak ratio r1 deviation from 1

PARAMETER	DESCRIPTION
`crn`	IOCRN IOCRN-like object implementing `transient_response`.
`u_list`	list[np.ndarray] List of constant input vectors, each of shape `(p,)`.
`x0_list`	list[np.ndarray] List of initial states, each of shape `(n,)`.
`time_horizon`	np.ndarray 1D array of evaluation times with shape `(T,)`.
`f_list`	list[np.ndarray] or None, default=None Desired oscillation frequencies per scenario and output. Expected format follows `oscillation_metrics`. If None, frequency error is not included.
`mean_list`	list[np.ndarray] or None, default=None Desired mean values per scenario and output (format follows `oscillation_metrics`). If None, mean error is not included.
`w`	list[float], default=[1/4, 1/4, 1/4, 1/4] Weights `[mean_error, frequency_error, damping_error, r1_error]`.
`t0`	float, default=0 Time threshold after which oscillation metrics are evaluated (to ignore transients).
`LARGE_NUMBER`	float, default=1e4 Divergence penalty passed to the simulator.

RETURNS	DESCRIPTION
`performance`	float Scalar oscillation cost.
`last_task_info`	dict Updated `crn.last_task_info`, augmented with: 'reward': performance 'frequency': f_list 'reward type': 'oscillation_error'

`logic_circuit_reward(crn, u_list, x0_list, time_horizon, r_list, w, norm=1, relative=False, LARGE_NUMBER=10000.0)`

Compute a steady-state logic circuit cost using binary cross-entropy (BCE).

The CRN is simulated for each scenario. For each output trace, the final time point is treated as the steady-state output y_ss and compared against the target logic value r using BCE:

\[\text{BCE}(r, y_ss) = - [ r \log(y_ss) + (1-r) \log(1-y_ss) ].\]

Notes

This function currently ignores w, norm, and relative (kept for API compatibility with tracking rewards).
Outputs are clipped to [1e-6, 1-1e-6] to avoid log(0).

PARAMETER	DESCRIPTION
`crn`	IOCRN IOCRN-like object implementing `transient_response`.
`u_list`	list[np.ndarray] List of constant inputs, each shape `(p,)`.
`x0_list`	list[np.ndarray] List of initial states, each shape `(n,)`.
`time_horizon`	np.ndarray 1D array of evaluation times with shape `(T,)`.
`r_list`	list[np.ndarray] List of desired binary targets per scenario. Each `r` is expected to have shape `(q,)` (one target per output).
`w`	np.ndarray Unused (present for signature compatibility).
`norm`	int Unused.
`relative`	bool Unused.
`LARGE_NUMBER`	float, default=1e4 Divergence penalty passed to the simulator.

RETURNS	DESCRIPTION
`performance`	float Mean BCE across scenarios and outputs (lower is better).
`last_task_info`	dict Updated `crn.last_task_info`, augmented with: 'reward': performance 'setpoint': r_list 'initial_conditions': x0_list 'reward type': 'dynamic_tracking_error' (kept as-is; consider renaming to 'logic_circuit_reward')

`dynamic_tracking_error_piecewise_logic(crn, u_nested_list, x0_list, nested_time_horizon, r_list, w, norm=1, relative=False, LARGE_NUMBER=10000.0)`

Compute a piecewise logic tracking cost using thresholded mismatch.

This is intended for sequential / protocol-driven logic tasks (e.g. latches), where targets are specified as binary values and outputs are evaluated using a 0.5 threshold across the entire time horizon (not only at steady state).

Internally, the CRN is simulated with transient_response_piecewise, then the score is computed by performance_metric_logic: \(\text{mean}(|1[r>0.5] - 1[y>0.5]|)\) over scenarios, outputs, and time.

PARAMETER	DESCRIPTION
`crn`	IOCRN IOCRN-like object implementing `transient_response_piecewise`.
`u_nested_list`	list[list[np.ndarray]] List of input protocols (see `dynamic_tracking_error_piecewise`).
`x0_list`	list[np.ndarray] Initial states, each shape `(n,)`.
`nested_time_horizon`	list[np.ndarray] List of time grids per segment.
`r_list`	list[np.ndarray] Target logic values per scenario, each expected shape `(q,)` or broadcastable to outputs (used as constant targets across time by `performance_metric_logic`).
`w`	np.ndarray Unused (present for signature compatibility).
`norm`	int Unused.
`relative`	bool Unused.
`LARGE_NUMBER`	float, default=1e4 Divergence penalty passed to the simulator.

RETURNS	DESCRIPTION
`performance`	float Mean thresholded mismatch across scenarios, outputs, and time.
`last_task_info`	dict Updated `crn.last_task_info` with reward metadata.

`performance_metric_logic(r_list, y_list)`

Compute a binary (thresholded) mismatch score between targets and outputs.

Targets r_list are treated as desired binary outputs (thresholded at 0.5). Outputs y_list are thresholded at 0.5 across all time points. The returned score is the mean absolute mismatch: \(\text{mean}(|1[r>0.5] - 1[y>0.5]|)\) averaged over scenarios, outputs, and time.

PARAMETER	DESCRIPTION
`r_list`	list[np.ndarray] List of reference logic targets, typically each of shape `(q,)` where `q` is the number of outputs.
`y_list`	list[np.ndarray] List of output trajectories, each of shape `(q, T)`.

RETURNS	DESCRIPTION
	float Mean mismatch rate in [0, 1], where 0 indicates perfect logic behavior.

`track_relationship(crn, u_list, x0_list, time_horizon, w, species_names, relationship_func, norm=1, LARGE_NUMBER=10000.0)`

Compute a cost for enforcing an algebraic relationship between species trajectories.

This utility is for tasks where the objective is not tracking a pre-specified reference trajectory, but rather satisfying a constraint among species, e.g.

\[A(t) - B(t) = 0,\]

\[A(t) + B(t) - C(t) = 0,\]

\[A(t)B(t) - C(t) = 0,\]

etc.

The user supplies relationship_func, which is called on the requested species trajectories and should return an error signal that is zero when the desired relationship holds. The function then aggregates the error into a scalar cost using a weighted L1 or L2 norm across time (and across scenarios).

PARAMETER	DESCRIPTION
`crn`	IOCRN IOCRN-like object implementing `transient_response`.
`u_list`	list[np.ndarray] List of constant input vectors, each shape `(p,)`.
`x0_list`	list[np.ndarray] List of initial conditions, each shape `(n,)`.
`time_horizon`	np.ndarray 1D time grid of shape `(T,)`.
`w`	np.ndarray Weight array for the relationship error over time. Typically shape `(q_rel, T)` where `q_rel` is the output dimension of `relationship_func` (often 1). If `w` is 1D `(T,)`, it is treated as `(1, T)`.
`species_names`	list[str] Names of species to feed into `relationship_func`, in the same order as the function arguments.
`relationship_func`	callable Function mapping species trajectories to an error signal. It will be called as: relationship_func(traj_1, traj_2, ..., traj_N) where each `traj_i` has shape `(T,)`. The function should return either: a 1D array `(T,)` (interpreted as a single error channel), or a 2D array `(q_rel, T)` for multiple error channels.
`norm`	int, default=1 Norm for aggregation: mean weighted absolute error mean weighted squared error
`LARGE_NUMBER`	float, default=1e4 Divergence penalty passed to `transient_response`. If trajectories contain values near `LARGE_NUMBER`, the relationship error is set to `LARGE_NUMBER` to strongly penalize divergence.

RETURNS	DESCRIPTION
`performance`	float Scalar relationship-tracking cost.
`last_task_info`	dict Updated `crn.last_task_info`, augmented with: 'reward': performance 'initial_conditions': x0_list 'reward type': 'relationship_tracking' 'tracked_species': species_names

`habituation_metric_with_gap(*, intervals, t, y_list, w, n_repeats_pre, n_repeats_post, gap_weight=5.0, recovery_tol=0.05, dishabituate_rho=1.0, min_peak=0.1, max_peak=2.0, LARGE_NUMBER=10000.0, sensitization=False)`

Returns: habituation loss + gap consistency penalty. Keeps your "log(max ratio)" style for habituation.