deterministic
RL4CRN.rewards.deterministic
RL4CRN.rewards
Reward / cost functions for evaluating IOCRN behaviors under different design tasks.
This module provides task-specific wrappers around an IOCRN's simulation interface
(e.g., IOCRN.transient_response and IOCRN.transient_response_piecewise)
and converts the resulting trajectories into scalar performance measures that can be
used as RL rewards (or costs).
Included objectives:
- Dynamic tracking (continuous-valued): weighted L1/L2 tracking error to a reference trajectory or setpoint across multiple scenarios.
- Piecewise tracking: same as above, but with piecewise-constant inputs and segmented time horizons (useful for protocols / sequences).
- Oscillation shaping: penalizes deviations from desired oscillatory features such as frequency, mean level, damping, and peak ratios using oscillation_metrics.
- Logic circuit scoring: evaluates steady-state binary behavior (via BCE or thresholded mismatch) for combinational circuits and piecewise logic protocols (e.g., latches).
- Custom relationship tracking: evaluates arbitrary algebraic constraints between species trajectories defined by a user-supplied function (targeting zero error).
All functions return a scalar performance (interpretable as cost unless you negate it)
and update crn.last_task_info with metadata such as the reward value, task type, and
the simulation settings that produced it.
dynamic_tracking_error(crn, u_list, x0_list, time_horizon, r_list, w, norm=1, relative=False, LARGE_NUMBER=10000.0)
Compute a dynamic tracking cost for an IOCRN over a batch of scenarios.
The function simulates the CRN for each scenario in the Cartesian product
of u_list and x0_list (as implemented by crn.transient_response) and
evaluates how well the output trajectories track the provided references.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| crn | IOCRN | An IOCRN-like object implementing |
| u_list | list[np.ndarray] | List of constant input vectors. Each element has shape |
| x0_list | list[np.ndarray] | List of initial state vectors. Each element has shape |
| time_horizon | np.ndarray | 1D array of evaluation times with shape |
| r_list | list[np.ndarray] | List of reference signals/targets for each scenario. The expected shape and interpretation depend on |
| w | np.ndarray | Weights for the tracking error. Typically shape |
| norm | int, default=1 | Norm used in the tracking error. Passed to |
| relative | bool, default=False | If True, compute a relative error (as supported by |
| LARGE_NUMBER | float, default=1e4 | Divergence penalty passed to |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| performance | float | Scalar tracking cost aggregated across scenarios, outputs, and time. |
| last_task_info | dict | Updated |
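As a rough illustration of the weighted L1/L2 tracking error described above, the following sketch scores toy output trajectories against constant setpoints. The helper `weighted_tracking_cost`, the scenario ordering, and the toy "simulation" are assumptions made for this example; the library's own aggregation may differ.

```python
import itertools
import numpy as np

def weighted_tracking_cost(y_list, r_list, w, norm=1, relative=False, eps=1e-12):
    """Hypothetical helper: mean weighted L1/L2 error between outputs and references.

    y_list, r_list: one (q, T) array per scenario; w: weights broadcastable to (q, T).
    """
    costs = []
    for y, r in zip(y_list, r_list):
        y = np.asarray(y, dtype=float)
        r = np.asarray(r, dtype=float)
        err = np.abs(y - r)
        if relative:
            err = err / (np.abs(r) + eps)   # guard against division by zero
        if norm == 2:
            err = err ** 2
        costs.append(np.mean(w * err))
    return float(np.mean(costs))

# Scenarios as a Cartesian product of inputs and initial states
# (the ordering used by the real simulator is an assumption here).
u_list = [np.array([0.5]), np.array([1.0])]
x0_list = [np.array([0.0, 0.0])]
scenarios = list(itertools.product(u_list, x0_list))

t = np.linspace(0.0, 1.0, 5)
# Toy stand-in for simulated outputs: y ramps linearly toward the input level.
y_list = [u[0] * t[None, :] for u, _ in scenarios]
r_list = [np.full((1, t.size), u[0]) for u, _ in scenarios]
w = np.ones((1, t.size))

cost = weighted_tracking_cost(y_list, r_list, w, norm=1)
```

Because the value is a cost, an RL loop would typically use its negative as the reward.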
habituation_error_piecewise(crn, u_nested_list, x0_list, nested_time_horizon, w, LARGE_NUMBER=10000.0, min_peak=0.1, max_peak=2.0)
Compute a habituation cost for a piecewise protocol using peak ratios.
This function simulates the CRN under a piecewise-constant input protocol and evaluates "habituation" as a change in peak response across repeated stimulus windows.
Convention used here
- nested_time_horizon defines K segments with time grids t_0, t_1, ..., t_{K-1}.
- We treat even-indexed segments (0, 2, 4, ...) as "stimulus" windows. Peaks are measured in those windows for each scenario output trajectory.
- A habituation score is computed from ratios of consecutive stimulus peaks: ratio_k = peak_{k+1} / peak_k. (Lower ratios indicate stronger habituation.)
The function also enforces peak bounds. If any measured peak is outside
[min_peak, max_peak], it returns LARGE_NUMBER.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| crn | IOCRN | IOCRN-like object implementing: |
| u_nested_list | list[list[np.ndarray]] | List of input protocols. Each protocol is a list of input vectors (p,) applied segment-wise. The inner list length must match |
| x0_list | list[np.ndarray] | List of initial conditions (n,). |
| nested_time_horizon | list[np.ndarray] | List of time grids, one per segment. These will be stitched by the simulator. |
| w | Union[float, Sequence[float], np.ndarray] | Weights applied to each peak ratio term. If a scalar, the same weight is applied to all ratios. If a sequence, it should have length equal to the number of ratios (n_peaks - 1). |
| LARGE_NUMBER | float, default=1e4 | Penalty returned if simulation diverges or peak constraints fail. |
| min_peak | float, default=0.1 | Minimum acceptable peak value. Peaks below this are considered invalid. |
| max_peak | float, default=2.0 | Maximum acceptable peak value. Peaks above this are considered invalid. |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| performance | float | Scalar habituation cost (lower is better). |
| last_task_info | dict | |
check_off_ss_invariant(crn, t, x, off_intervals, u_off, ss_tol_abs=0.0001, ss_tol_rel=0.001, xtol=1e-09, maxfev=2000)
- x: full state trajectory, shape (n, T)
- off_intervals: list of absolute (start, end) intervals
- u_off: input vector for OFF
habituation_metric(intervals, t, y_list, w, min_peak=0.1, max_peak=2.0, base_valley=0.0, LARGE_NUMBER=10000.0)
Compute a habituation cost from output peaks across repeated stimulus windows.
This metric extracts peak amplitudes from specified time intervals and computes ratios of consecutive stimulus peaks. By default, it assumes even-indexed intervals (0, 2, 4, ...) correspond to repeated stimulus windows.
Steps
1) For each scenario output trajectory in y_list, compute peaks in stimulus
windows (even intervals).
2) Enforce peak bounds: if any peak is outside [min_peak, max_peak], return
LARGE_NUMBER.
3) Compute peak ratios: ratio_k = peak_{k+1} / peak_k.
4) Return weighted mean of ratios (lower implies stronger habituation).
Notes
- If there are fewer than 2 stimulus windows, this metric cannot form a ratio and returns LARGE_NUMBER.
- If any peak is zero or extremely small, division can blow up; we guard with a small epsilon.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| intervals | Sequence[Tuple[float, float]] | Time intervals (start, end) defining protocol segments. |
| t | np.ndarray | Stitched time vector from the simulator, shape (T,). |
| y_list | Sequence[np.ndarray] | List of output trajectories, one per scenario. Each element is typically shape (q, T) (q outputs). |
| w | Union[float, Sequence[float], np.ndarray] | Weights for each ratio term. If scalar, repeated. If sequence, must match number of ratios (n_peaks - 1) or will be broadcast/clipped. |
| min_peak | float, default=0.1 | Minimum acceptable peak amplitude. |
| max_peak | float, default=2.0 | Maximum acceptable peak amplitude. |
| LARGE_NUMBER | float, default=1e4 | Penalty returned if constraints fail or insufficient windows exist. |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| | float | Scalar habituation cost. Lower is better. |
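The four steps listed for habituation_metric can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library code: the helper name `habituation_cost_sketch`, the per-output averaging, and the exact epsilon guard are choices made for the example, and `base_valley` is not modelled.

```python
import numpy as np

def habituation_cost_sketch(intervals, t, y_list, w=1.0, min_peak=0.1,
                            max_peak=2.0, LARGE_NUMBER=1e4, eps=1e-12):
    """Illustrative peak-ratio habituation cost (assumed details, see lead-in)."""
    t = np.asarray(t, dtype=float)
    stimulus = list(intervals)[0::2]            # even-indexed intervals
    if len(stimulus) < 2:
        return float(LARGE_NUMBER)              # cannot form a ratio
    costs = []
    for y in y_list:
        y = np.atleast_2d(np.asarray(y, dtype=float))    # (q, T)
        peaks = []
        for start, end in stimulus:
            mask = (t >= start) & (t <= end)
            if not mask.any():
                return float(LARGE_NUMBER)
            peaks.append(y[:, mask].max(axis=1))         # peak per output
        peaks = np.stack(peaks)                          # (n_peaks, q)
        if (peaks < min_peak).any() or (peaks > max_peak).any():
            return float(LARGE_NUMBER)                   # peak bounds violated
        ratios = peaks[1:] / (peaks[:-1] + eps)          # (n_peaks - 1, q)
        per_ratio = ratios.mean(axis=1)                  # average over outputs
        if np.isscalar(w):
            weights = np.full(per_ratio.shape, float(w))
        else:
            weights = np.asarray(w, dtype=float)
        costs.append(np.average(per_ratio, weights=weights))
    return float(np.mean(costs))

# Toy protocol: stimulus windows at (0,1), (2,3), (4,5) with halving peaks.
intervals = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
t = np.linspace(0.0, 5.0, 501)
y = np.zeros_like(t)
for (start, end), amp in zip(intervals[0::2], [1.0, 0.5, 0.25]):
    y[(t >= start) & (t <= end)] = amp

cost = habituation_cost_sketch(intervals, t, [y])
```

In this toy case each stimulus peak is half of the previous one, so the weighted mean ratio is close to 0.5.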
dynamic_tracking_error_piecewise(crn, u_nested_list, x0_list, nested_time_horizon, r_list, w, norm=1, relative=False, LARGE_NUMBER=10000.0)
Compute a dynamic tracking cost for piecewise-constant input protocols.
This is the piecewise analogue of dynamic_tracking_error. Instead of
constant inputs over a single horizon, each scenario specifies a sequence
of inputs applied over segmented time horizons.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| crn | IOCRN | An IOCRN-like object implementing |
| u_nested_list | list[list[np.ndarray]] | List of input protocols. Each element is a sequence |
| x0_list | list[np.ndarray] | List of initial state vectors, each of shape |
| nested_time_horizon | list[np.ndarray] | List of time grids |
| r_list | list[np.ndarray] | Reference signals/targets for each scenario (see |
| w | np.ndarray | Weights for the tracking error (typically shape |
| norm | int, default=1 | Norm used in the tracking error (passed to |
| relative | bool, default=False | If True, compute a relative error (as supported by |
| LARGE_NUMBER | float, default=1e4 | Divergence penalty passed to the simulator. |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| performance | float | Scalar tracking cost. |
| last_task_info | dict | Updated |
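To make the piecewise-constant protocol concrete, the sketch below expands one protocol (a list of input vectors plus one time grid per segment) onto a single stitched time axis. The helper `stitch_protocol` is hypothetical, and it assumes each segment grid is expressed in segment-relative time starting at 0; the simulator's actual stitching convention may differ.

```python
import numpy as np

def stitch_protocol(u_protocol, nested_time_horizon):
    """Hypothetical helper: expand a piecewise-constant protocol onto one time axis.

    Assumes segment-relative time grids (each starting at 0).
    Returns (t, u) with t of shape (T,) and u of shape (p, T).
    """
    if len(u_protocol) != len(nested_time_horizon):
        raise ValueError("need exactly one input vector per time segment")
    t_parts, u_parts, offset = [], [], 0.0
    for u, t_seg in zip(u_protocol, nested_time_horizon):
        t_seg = np.asarray(t_seg, dtype=float)
        u = np.asarray(u, dtype=float)
        t_parts.append(t_seg + offset)                        # shift to absolute time
        u_parts.append(np.tile(u[:, None], (1, t_seg.size)))  # hold u constant
        offset += t_seg[-1]                                   # next segment starts here
    return np.concatenate(t_parts), np.concatenate(u_parts, axis=1)

# ON / OFF / ON protocol over three segments of length 2.
u_protocol = [np.array([1.0]), np.array([0.0]), np.array([1.0])]
grids = [np.linspace(0.0, 2.0, 3)] * 3
t, u = stitch_protocol(u_protocol, grids)
```

Segment boundaries appear twice in `t` (once as the end of one segment and once as the start of the next), which keeps the input switch explicit in the stitched arrays.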
oscillation_error(crn, u_list, x0_list, time_horizon, f_list=None, mean_list=None, w=[1 / 4, 1 / 4, 1 / 4, 1 / 4], t0=0, LARGE_NUMBER=10000.0)
Compute an oscillation-shaping cost based on output time-series metrics.
The CRN is simulated (as in transient_response), then oscillatory features
are extracted via oscillation_metrics. The returned scalar cost is a weighted
sum of several error components:
- mean error (if mean_list is provided)
- frequency error (if f_list is provided)
- damping deviation from 1
- peak ratio r1 deviation from 1
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| crn | IOCRN | IOCRN-like object implementing |
| u_list | list[np.ndarray] | List of constant input vectors, each of shape |
| x0_list | list[np.ndarray] | List of initial states, each of shape |
| time_horizon | np.ndarray | 1D array of evaluation times with shape |
| f_list | list[np.ndarray] or None, default=None | Desired oscillation frequencies per scenario and output. Expected format follows |
| mean_list | list[np.ndarray] or None, default=None | Desired mean values per scenario and output (format follows |
| w | list[float], default=[1/4, 1/4, 1/4, 1/4] | Weights |
| t0 | float, default=0 | Time threshold after which oscillation metrics are evaluated (to ignore transients). |
| LARGE_NUMBER | float, default=1e4 | Divergence penalty passed to the simulator. |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| performance | float | Scalar oscillation cost. |
| last_task_info | dict | Updated |
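The weighted sum of the four error components can be illustrated with a small sketch. Everything here is an assumption for the example: the helper names, the order in which the weights map to the components, and the FFT-based frequency estimate, which only stands in for whatever oscillation_metrics computes. Damping and r1 are passed in as given numbers rather than derived.

```python
import numpy as np

def mean_and_frequency(t, y, t0=0.0):
    """Estimate mean level and dominant frequency of y for t >= t0 (uniform grid)."""
    keep = t >= t0                      # discard the transient
    t, y = t[keep], y[keep]
    mean = y.mean()
    spectrum = np.abs(np.fft.rfft(y - mean))
    freqs = np.fft.rfftfreq(y.size, d=t[1] - t[0])
    return float(mean), float(freqs[np.argmax(spectrum)])

def oscillation_cost_sketch(mean, freq, damping, r1, mean_target=None,
                            f_target=None, w=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four terms; the weight ordering is an assumption."""
    mean_err = 0.0 if mean_target is None else abs(mean - mean_target)
    freq_err = 0.0 if f_target is None else abs(freq - f_target)
    terms = np.array([mean_err, freq_err, abs(damping - 1.0), abs(r1 - 1.0)])
    return float(np.dot(np.asarray(w, dtype=float), terms))

# Toy sustained oscillation: mean 1.0, frequency 0.5 (in units of 1/time).
t = np.linspace(0.0, 10.0, 1000, endpoint=False)
y = 1.0 + 0.5 * np.sin(2.0 * np.pi * 0.5 * t)

mean_est, f_est = mean_and_frequency(t, y, t0=2.0)
# damping = 1 and r1 = 1 are placeholder values for an undamped oscillation.
cost = oscillation_cost_sketch(mean_est, f_est, damping=1.0, r1=1.0,
                               mean_target=1.0, f_target=0.5)
```

When mean_target or f_target is omitted, the corresponding term contributes zero, mirroring the "if provided" behaviour described above.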
logic_circuit_reward(crn, u_list, x0_list, time_horizon, r_list, w, norm=1, relative=False, LARGE_NUMBER=10000.0)
Compute a steady-state logic circuit cost using binary cross-entropy (BCE).
The CRN is simulated for each scenario. For each output trace, the final time
point is treated as the steady-state output y_ss and compared against the
target logic value r using BCE:
\(\text{BCE}(r, y_{ss}) = -\left[r \log(y_{ss}) + (1 - r)\log(1 - y_{ss})\right]\)
Notes
- This function currently ignores w, norm, and relative (kept for API compatibility with tracking rewards).
- Outputs are clipped to [1e-6, 1-1e-6] to avoid log(0).
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| crn | IOCRN | IOCRN-like object implementing |
| u_list | list[np.ndarray] | List of constant inputs, each shape |
| x0_list | list[np.ndarray] | List of initial states, each shape |
| time_horizon | np.ndarray | 1D array of evaluation times with shape |
| r_list | list[np.ndarray] | List of desired binary targets per scenario. Each |
| w | np.ndarray | Unused (present for signature compatibility). |
| norm | int | Unused. |
| relative | bool | Unused. |
| LARGE_NUMBER | float, default=1e4 | Divergence penalty passed to the simulator. |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| performance | float | Mean BCE across scenarios and outputs (lower is better). |
| last_task_info | dict | Updated |
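A minimal sketch of the steady-state BCE scoring described above, applied to toy trajectories for an AND gate. The helper `steady_state_bce` and the assumed target shape (one value per output) are illustrative and not taken from the library.

```python
import numpy as np

def steady_state_bce(y_list, r_list, eps=1e-6):
    """Mean binary cross-entropy between final-time outputs and binary targets."""
    losses = []
    for y, r in zip(y_list, r_list):
        y = np.atleast_2d(np.asarray(y, dtype=float))        # (q, T)
        y_ss = np.clip(y[:, -1], eps, 1.0 - eps)             # final time point, clipped
        r = np.asarray(r, dtype=float).reshape(-1)           # (q,) targets (assumed)
        losses.append(-(r * np.log(y_ss) + (1.0 - r) * np.log(1.0 - y_ss)))
    return float(np.mean(np.concatenate(losses)))

# AND gate over inputs (0,0), (0,1), (1,0), (1,1): targets 0, 0, 0, 1.
r_list = [np.array([0.0]), np.array([0.0]), np.array([0.0]), np.array([1.0])]
# Toy trajectories that settle at 0.1 for OFF cases and 0.9 for the ON case.
finals = [0.1, 0.1, 0.1, 0.9]
y_list = [np.linspace(0.5, y_end, 10)[None, :] for y_end in finals]

cost = steady_state_bce(y_list, r_list)
```

Every scenario here is 0.1 away from its target, so each contributes -log(0.9) and the mean equals that value.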
dynamic_tracking_error_piecewise_logic(crn, u_nested_list, x0_list, nested_time_horizon, r_list, w, norm=1, relative=False, LARGE_NUMBER=10000.0)
Compute a piecewise logic tracking cost using thresholded mismatch.
This is intended for sequential / protocol-driven logic tasks (e.g. latches), where targets are specified as binary values and outputs are evaluated using a 0.5 threshold across the entire time horizon (not only at steady state).
Internally, the CRN is simulated with transient_response_piecewise, then
the score is computed by performance_metric_logic:
\(\text{mean}(|1[r>0.5] - 1[y>0.5]|)\) over scenarios, outputs, and time.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| crn | IOCRN | IOCRN-like object implementing |
| u_nested_list | list[list[np.ndarray]] | List of input protocols (see |
| x0_list | list[np.ndarray] | Initial states, each shape |
| nested_time_horizon | list[np.ndarray] | List of time grids per segment. |
| r_list | list[np.ndarray] | Target logic values per scenario, each expected shape |
| w | np.ndarray | Unused (present for signature compatibility). |
| norm | int | Unused. |
| relative | bool | Unused. |
| LARGE_NUMBER | float, default=1e4 | Divergence penalty passed to the simulator. |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| performance | float | Mean thresholded mismatch across scenarios, outputs, and time. |
| last_task_info | dict | Updated |
performance_metric_logic(r_list, y_list)
Compute a binary (thresholded) mismatch score between targets and outputs.
Targets r_list are treated as desired binary outputs (thresholded at 0.5).
Outputs y_list are thresholded at 0.5 across all time points. The returned
score is the mean absolute mismatch:
\(\text{mean}(|1[r>0.5] - 1[y>0.5]|)\)
averaged over scenarios, outputs, and time.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| r_list | list[np.ndarray] | List of reference logic targets, typically each of shape |
| y_list | list[np.ndarray] | List of output trajectories, each of shape |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| | float | Mean mismatch rate in [0, 1], where 0 indicates perfect logic behavior. |
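The thresholded mismatch formula can be written out in a few lines of NumPy. This is a sketch of the formula as stated, with a hypothetical helper name and the assumption that each target array has the same shape as (or broadcasts against) its output trajectory.

```python
import numpy as np

def thresholded_mismatch(r_list, y_list, threshold=0.5):
    """mean(|1[r > 0.5] - 1[y > 0.5]|) over scenarios, outputs, and time."""
    diffs = []
    for r, y in zip(r_list, y_list):
        r_bin = (np.asarray(r, dtype=float) > threshold).astype(float)
        y_bin = (np.asarray(y, dtype=float) > threshold).astype(float)
        diffs.append(np.abs(r_bin - y_bin).ravel())
    return float(np.mean(np.concatenate(diffs)))

# One scenario, one output, four time points.
r_list = [np.array([[0.0, 0.0, 1.0, 1.0]])]
y_list = [np.array([[0.1, 0.6, 0.9, 0.4]])]

score = thresholded_mismatch(r_list, y_list)  # two of four points disagree
```

Unlike BCE, this score is piecewise constant in the outputs: moving an output from 0.6 to 0.9 does not change it, while crossing 0.5 does.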
track_relationship(crn, u_list, x0_list, time_horizon, w, species_names, relationship_func, norm=1, LARGE_NUMBER=10000.0)
Compute a cost for enforcing an algebraic relationship between species trajectories.
This utility is for tasks where the objective is not tracking a pre-specified reference trajectory, but rather satisfying a constraint among species, e.g.
etc.
The user supplies relationship_func, which is called on the requested species
trajectories and should return an error signal that is zero when the desired
relationship holds. The function then aggregates the error into a scalar cost
using a weighted L1 or L2 norm across time (and across scenarios).
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| crn | IOCRN | IOCRN-like object implementing |
| u_list | list[np.ndarray] | List of constant input vectors, each shape |
| x0_list | list[np.ndarray] | List of initial conditions, each shape |
| time_horizon | np.ndarray | 1D time grid of shape |
| w | np.ndarray | Weight array for the relationship error over time. Typically shape |
| species_names | list[str] | Names of species to feed into |
| relationship_func | callable | Function mapping species trajectories to an error signal. It will be called as: relationship_func(traj_1, traj_2, ..., traj_N) where each |
| norm | int, default=1 | Norm for aggregation: |
| LARGE_NUMBER | float, default=1e4 | Divergence penalty passed to |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| performance | float | Scalar relationship-tracking cost. |
| last_task_info | dict | Updated |
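As an illustration of how a user-supplied relationship_func might be scored, the sketch below enforces the hypothetical constraint A = 2B on two toy trajectories. The helper `relationship_cost`, the dictionary layout of the trajectories, and the exact form of the weighted L1/L2 aggregation are assumptions for this example only.

```python
import numpy as np

def relationship_cost(trajectories, species_names, relationship_func, w, norm=1):
    """Aggregate a relationship error signal into a scalar with a weighted norm.

    trajectories: dict mapping species name -> (T,) array (hypothetical layout).
    """
    args = [np.asarray(trajectories[name], dtype=float) for name in species_names]
    err = np.asarray(relationship_func(*args), dtype=float)   # zero when satisfied
    if norm == 1:
        return float(np.sum(w * np.abs(err)))
    if norm == 2:
        return float(np.sqrt(np.sum(w * err ** 2)))
    raise ValueError("norm must be 1 or 2")

def a_equals_twice_b(a, b):
    """Example constraint (hypothetical): A should equal 2 * B at all times."""
    return a - 2.0 * b

t = np.linspace(0.0, 1.0, 5)
trajectories = {"A": 2.0 * t + 0.1, "B": t}   # A is offset by a constant 0.1
w = np.full(t.size, 1.0 / t.size)             # uniform weights summing to 1

cost_l1 = relationship_cost(trajectories, ["A", "B"], a_equals_twice_b, w, norm=1)
cost_l2 = relationship_cost(trajectories, ["A", "B"], a_equals_twice_b, w, norm=2)
```

The order of species_names determines the argument order of relationship_func, so the two must be kept in sync.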
habituation_metric_with_gap(*, intervals, t, y_list, w, n_repeats_pre, n_repeats_post, gap_weight=5.0, recovery_tol=0.05, dishabituate_rho=1.0, min_peak=0.1, max_peak=2.0, LARGE_NUMBER=10000.0, sensitization=False)
Returns the habituation loss plus a gap-consistency penalty. The habituation term keeps the "log(max ratio)" form.