Batch runner to compute population-level sentencing metrics and suitability scores for all individuals, writing a flat file (CSV/Parquet). The pipeline is strict about missing inputs: when nothing can be evaluated for a person, it emits NaNs instead of 0 so the case can be flagged, and individual metrics are skipped when their prerequisites are not present (no fabricated values). Metrics are named and extensible; new metrics can be added without changing positional order.
- `config.py` — Paths (DEV/PROD), column map (`COLS`), defaults (`DEFAULTS`), offense lists (`OFFENSE_LISTS`), and metric weights (`METRIC_WEIGHTS`).
- `compute_metrics.py` — Library functions to read raw tables and compute named features for a single ID (skip-if-missing).
- `sentencing_math.py` — Pure math (no I/O): time decomposition, proportions, frequency/trend, rehab, and name-based suitability.
- `run.py` — CLI to iterate over the cohort and export results (CSV or Parquet).
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -U pip pandas numpy openpyxl tqdm pyarrow

Edit `config.py`:
- `CFG_PROFILE`: `DEV` uses local files; the default `PROD` can read GitHub raw URLs.
- `PATHS`: set `demographics`, `current_commitments`, `prior_commitments` (CSV/XLSX).
- `COLS`: map canonical fields (any `None` disables that field/metric).
- `OFFENSE_LISTS`: explicit code lists; unlisted → `other` (no implicit fallback).
- `METRIC_WEIGHTS`: dict by name; only present features contribute to the score.
- Optional: set `DEFAULTS["months_elapsed_total"]` and `DEFAULTS["freq_min_rate"]`/`DEFAULTS["freq_max_rate"]` if you want frequency metrics to appear without per-person exposure inference.
- `DEFAULT_TIME_ELAPSED_YEARS` and `SEVERITY_DECAY_RATE`: control the time horizon and exponential decay in `severity_trend`. If `DEFAULT_TIME_ELAPSED_YEARS` is not `None`, it overrides the computed years elapsed; `SEVERITY_DECAY_RATE` is the λ in `severity = Δv · exp(−λ · years_elapsed)`.
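For orientation, here is a minimal, illustrative sketch of what those `config.py` fields can look like; the paths, column names, and weights below are placeholders, not the project's shipped defaults (only the offense codes, λ, and time horizon mirror the worked examples later in this document).

```python
# Illustrative config.py sketch — values are placeholders, not shipped defaults.
CFG_PROFILE = "DEV"  # "PROD" can point at GitHub raw URLs instead of local files

PATHS = {
    "demographics": "data/demographics.csv",
    "current_commitments": "data/current_commitments.xlsx",
    "prior_commitments": "data/prior_commitments.xlsx",
}

# Canonical field -> raw column name; set a field to None to disable it
# (and any metric that depends on it). The column names here are hypothetical.
COLS = {
    "id": "cdcr_id",
    "age": "age_years",
    "offense_code": "offense",
}

# Explicit code lists; anything unlisted is classified as "other".
OFFENSE_LISTS = {
    "violent": ["187", "211", "245"],
    "nonviolent": ["459", "484", "10851"],
}

DEFAULTS = {
    "months_elapsed_total": None,  # set to surface frequency metrics without exposure inference
    "freq_min_rate": None,
    "freq_max_rate": None,
}

DEFAULT_TIME_ELAPSED_YEARS = 10.0  # overrides the computed years elapsed when not None
SEVERITY_DECAY_RATE = 0.15         # λ in severity = Δv · exp(−λ · years_elapsed)

# Weights by metric name; only features present for a person contribute.
METRIC_WEIGHTS = {
    "desc_nonvio_curr": 1.0,
    "desc_nonvio_past": 1.0,
    "age": 1.0,
    "severity_trend": 1.0,
}
```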
# CSV (auto format by extension)
python run.py --out population_metrics.csv
# Parquet
python run.py --out population_metrics.parquet --format parquet
# Subset of IDs (CSV with a column named as config.COLS['id'])
python run.py --out subset_metrics.csv --ids-csv ids_subset.csv
# Quick smoke test
python run.py --out sample.csv --limit 500

Flags:
- `--include-aux`: adds diagnostics like `age_value`, `% completed`, `time_outside`, and flattened offense counts (`counts_current_...`, `counts_prior_...`).
- `--print-every N`: also prints progress every N rows (besides tqdm).
- `--fail-fast`: aborts on the first error (default: continue and log a per-ID error).
- Main file: CSV/Parquet with one row per ID, including the named metrics, `score` (suitability), and `score_out_of` (the weighted maximum attainable over the present metrics).
- If an ID has no evaluable metrics (e.g. all offenses → `other`, or required denominators are 0), the runner writes `score = NaN`, `score_out_of = NaN`, `score_ratio = NaN`, and `evaluated = 0`. This lets downstream tools tell "not evaluated / insufficient data" apart from "evaluated and low score".
- If `--include-aux` is used, the file also includes `time_outside_months` (the paper's outᵢᵗ), representing total months spent outside prison across all convictions.
- Errors (if any): `*.errors.jsonl` with `{id, error}` records.
- Console preview prints the first rows/columns for a quick check.
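Downstream, the `evaluated` flag and the NaN scores make it easy to keep "insufficient data" separate from "low score". A small pandas sketch, assuming the `population_metrics.csv` output from the CLI example above:

```python
import pandas as pd

# Output file name taken from the CLI example above.
df = pd.read_csv("population_metrics.csv")

# Rows with evaluated == 0 carry NaN scores: insufficient data, not a low score.
not_evaluated = df[df["evaluated"] == 0]
scored = df[df["evaluated"] == 1]

print(f"not evaluated: {len(not_evaluated)} of {len(df)} IDs")
print(scored.sort_values("score_ratio", ascending=False).head())
```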
These examples walk through exactly what the pipeline computes for a specific ID: counts → denominators → proportions → time pieces → trend/frequency → named vector → suitability. The LaTeX below matches the paper notation.
CDCR ID: 00173d8423
Offense Lists (active for this run)
- Violent: ['187', '211', '245']
- Nonviolent: ['459', '484', '10851']
- Current offense rows found: 11
- Prior offense rows found: 6
- Current: {'violent': 1, 'nonviolent': 1, 'other': 9, 'clash': 0}
- Prior: {'violent': 0, 'nonviolent': 4, 'other': 2, 'clash': 0}
- `current_sentence_months = 10000.000`
- `completed_months = 330.000`
- `past_time_months = NA`
- `pct_current_completed = 3.300`
- `time_outside_months = 0.000`
Nonviolent proportions:
- `desc_nonvio_curr = 1/2 = 0.500` (see Eq. DESC-NONVIO-CURR)
- `desc_nonvio_past = 4/4 = 1.000` (see Eq. DESC-NONVIO-PAST)

Violent proportions for trend:
- `desc_vio_curr = 1/2 = 0.500` (see Eq. DESC-VIO-CURR)
- `desc_vio_past = 0/4 = 0.000` (see Eq. DESC-VIO-PAST)

Severity trend:
- `severity_trend = 0.500 * exp(-0.150 * 10.0) = 0.112` (see Eq. SEVERITY-TREND; `severity_trend = Δv * exp(−λ * T)`)

Frequency (per month outside):
- `raw_freq_violent = NA`; `raw_freq_total = NA`
- normalized: **SKIPPED** (requires `time_outside > 0`, `freq_min_rate` and `freq_max_rate`) (see Eqs. FREQ-VIO, FREQ-TOTAL)

Age (min–max):
- `age_raw = 38.000`, `min = 18.000`, `max = 90.000` → `age = 0.278` (see Eq. AGE-NORM)
Order: desc_nonvio_curr, desc_nonvio_past, age, freq_violent, freq_total, severity_trend, edu_general, edu_advanced, rehab_general, rehab_advanced
Values: [0.500, 1.000, 0.278, SKIPPED, SKIPPED, 0.112, SKIPPED, SKIPPED, SKIPPED, SKIPPED]
Score: 1.889 (out of 3.000) — 63.0% of maximum
Contributing metrics: age, desc_nonvio_curr, desc_nonvio_past, severity_trend
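The headline numbers above can be reproduced with a few lines of plain Python. This sketch assumes unit weights for the four present metrics, which is consistent with the "out of 3.000" shown:

```python
import math

# Counts for CDCR ID 00173d8423 (offenses classified as "other" do not enter denominators).
curr = {"violent": 1, "nonviolent": 1}
prior = {"violent": 0, "nonviolent": 4}

desc_nonvio_curr = curr["nonviolent"] / (curr["violent"] + curr["nonviolent"])     # 0.500
desc_nonvio_past = prior["nonviolent"] / (prior["violent"] + prior["nonviolent"])  # 1.000
desc_vio_curr = curr["violent"] / (curr["violent"] + curr["nonviolent"])           # 0.500
desc_vio_past = prior["violent"] / (prior["violent"] + prior["nonviolent"])        # 0.000

delta_v = max(0.0, desc_vio_curr - desc_vio_past)        # 0.500
severity_trend = delta_v * math.exp(-0.150 * 10.0)       # ≈ 0.112

age = (38.0 - 18.0) / (90.0 - 18.0)                      # ≈ 0.278

score = desc_nonvio_curr + desc_nonvio_past + age + severity_trend  # ≈ 1.889
print(round(score, 3), f"{score / 3.0:.1%}")             # 1.889 63.0%
```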
CDCR ID: 0029029e5b
Offense Lists (active for this run)
- Violent: ['187', '211', '245']
- Nonviolent: ['459', '484', '10851']
- Current offense rows found: 1
- Prior offense rows found: 2
- Current: {'violent': 1, 'nonviolent': 0, 'other': 0, 'clash': 0}
- Prior: {'violent': 2, 'nonviolent': 0, 'other': 0, 'clash': 0}
- `current_sentence_months = 84.000`
- `completed_months = 67.200`
- `past_time_months = NA`
- `pct_current_completed = 80.000`
- `time_outside_months = 0.000`
Nonviolent proportions:
- `desc_nonvio_curr = 0/1 = 0.000` (see Eq. DESC-NONVIO-CURR)
- `desc_nonvio_past = 0/2 = 0.000` (see Eq. DESC-NONVIO-PAST)

Violent proportions for trend:
- `desc_vio_curr = 1/1 = 1.000` (see Eq. DESC-VIO-CURR)
- `desc_vio_past = 2/2 = 1.000` (see Eq. DESC-VIO-PAST)

Severity trend:
- `severity_trend = 0.000` (Δv = 0.000, so the severity trend is 0; see Eq. SEVERITY-TREND, `severity_trend = Δv * exp(−λ * T)`)

Frequency (per month outside):
- `violent_total = 3`; `total_conv = 3`; `time_outside = 0.000`
- `raw_freq_violent = NA`; `raw_freq_total = NA`
- normalized: **SKIPPED** (requires `time_outside > 0`, `freq_min_rate` and `freq_max_rate`) (see Eqs. FREQ-VIO, FREQ-TOTAL)

Age (min–max):
- `age_raw = 38.000`, `min = 18.000`, `max = 90.000` → `age = 0.278` (see Eq. AGE-NORM)
Order: desc_nonvio_curr, desc_nonvio_past, age, freq_violent, freq_total, severity_trend, edu_general, edu_advanced, rehab_general, rehab_advanced
Values: [0.000, 0.000, 0.278, SKIPPED, SKIPPED, 0.000, SKIPPED, SKIPPED, SKIPPED, SKIPPED]
Score: 0.278 (out of 3.000) — 9.3% of maximum
Contributing metrics: age, desc_nonvio_curr, desc_nonvio_past, severity_trend
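For contrast with the first example, this short sketch shows the gating that drives this ID's result: Δv is clamped at 0 because the violent share does not increase, and the frequency metrics are skipped because `time_outside` is 0 (unit weights are assumed, as above):

```python
import math

# CDCR ID 0029029e5b: the violent share is 1.0 both currently and in the past.
desc_vio_curr, desc_vio_past = 1 / 1, 2 / 2
delta_v = max(0.0, desc_vio_curr - desc_vio_past)    # 0.0 → severity term vanishes
severity_trend = delta_v * math.exp(-0.150 * 10.0)   # 0.0

time_outside = 0.0
frequency_evaluable = time_outside > 0               # False → frequency metrics SKIPPED

age = (38.0 - 18.0) / (90.0 - 18.0)                  # ≈ 0.278
score = 0.0 + 0.0 + age + severity_trend             # ≈ 0.278 (out of 3.000, ≈ 9.3%)
print(round(score, 3), frequency_evaluable)
```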
macOS/Linux
CFG_PROFILE=DEV python docs_1/make_worked_example.py --uid "0029029e5b" --violent "187,211,245" --nonviolent "459,484,10851" --age-years 38 --exposure-months 480 --freq-bounds "0,0.05" --out docs_1/README_worked_example_0029029e5b.md
CFG_PROFILE=DEV python docs_1/make_worked_example.py --uid "00173d8423" --violent "187,211,245" --nonviolent "459,484,10851" --age-years 38 --exposure-months 480 --freq-bounds "0,0.05" --out docs_1/README_worked_example_00173d8423.mdWindows PowerShell
$env:CFG_PROFILE="DEV"
python docs_1\make_worked_example.py --uid "0029029e5b" --violent "187,211,245" --nonviolent "459,484,10851" --age-years 38 --exposure-months 480 --freq-bounds "0,0.05" --out "docs_1\README_worked_example_0029029e5b.md"
python docs_1\make_worked_example.py --uid "00173d8423" --violent "187,211,245" --nonviolent "459,484,10851" --age-years 38 --exposure-months 480 --freq-bounds "0,0.05" --out "docs_1\README_worked_example_00173d8423.md"- Descriptive proportions:
- Violent proportions (used in trend):
- Severity trend (updated exponential form):
  We first define the change in violent share between past and current. Let `Δv = max(0, desc_vio_curr − desc_vio_past)`; only increases in violent share are penalized (if the violent share does not increase, `Δv = 0`). Let `λ > 0` be the decay rate (configured as `SEVERITY_DECAY_RATE` in `config.py`), and let `T` be the years elapsed (computed from the data, or overridden by `DEFAULT_TIME_ELAPSED_YEARS`). The implemented severity trend is:

  `severity_trend = Δv * exp(−λ * T)`
  Properties:
  - The ideal value is 0 (no recent worsening in severity).
  - Larger values correspond to more recent increases in violent share.
  - The exponential term down-weights very old changes.
- Frequency (per month outside; min–max normalized if bounds are set): `raw_freq_violent = violent_total / time_outside` and `raw_freq_total = total_conv / time_outside`.
  Normalized: `freq = (raw − freq_min_rate) / (freq_max_rate − freq_min_rate)`, computed only when `time_outside > 0` and both bounds are configured.
- Age (min–max): `age = (age_raw − age_min) / (age_max − age_min)`.
- Suitability (weighted dot product of available normalized metrics): `score = Σ_k w_k · m_{k,i}`, summing over the metrics present for person i.
  Out of: `score_out_of = Σ_k w_k · x_k*`, and `score_ratio = score / score_out_of` when the denominator is positive.
Direction:
We encode orientation using `d_k ∈ {+1, −1}`:
- `d_k = +1` → a higher metric value `m_{k,i}` increases suitability
- `d_k = −1` → a higher metric value `m_{k,i}` decreases suitability
Weights w_k ≥ 0 represent magnitudes only.
The numerator uses w_k * m_{k,i} and the “out-of” denominator uses w_k * x_k*,
where x_k* = 1 for d_k = +1 (positive-direction metrics)
and x_k* = 0 for d_k = −1 (negative-direction metrics).
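The following is a minimal, standalone sketch of these formulas in Python. The helper names are illustrative only and are not the functions actually exported by `sentencing_math.py`.

```python
import math
from typing import Dict, Optional, Tuple

def severity_trend(vio_curr: float, vio_past: float, lam: float, years: float) -> float:
    """Δv · exp(−λ · T), penalizing only increases in violent share."""
    delta_v = max(0.0, vio_curr - vio_past)
    return delta_v * math.exp(-lam * years)

def minmax(value: float, lo: Optional[float], hi: Optional[float]) -> Optional[float]:
    """Plain min–max normalization; None when bounds are missing or degenerate."""
    if lo is None or hi is None or hi <= lo:
        return None
    return (value - lo) / (hi - lo)

def suitability(metrics: Dict[str, Optional[float]],
                weights: Dict[str, float],
                directions: Dict[str, int]) -> Tuple[float, float, float]:
    """Weighted score over the metrics that are present.

    numerator   = Σ w_k · m_k                      (present, weighted metrics only)
    denominator = Σ w_k · x_k*,  x_k* = 1 if d_k = +1 else 0
    """
    present = [k for k, v in metrics.items() if v is not None and k in weights]
    num = sum(weights[k] * metrics[k] for k in present)
    den = sum(weights[k] * (1.0 if directions.get(k, +1) == +1 else 0.0) for k in present)
    ratio = num / den if den > 0 else float("nan")
    return num, den, ratio
```

With unit weights and `severity_trend` treated as a negative-direction metric, this sketch is consistent with the 1.889 / 3.000 ≈ 63.0% result in the first worked example.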
Notes:
- Rehab/education metrics are per-month-inside rates, then min–max normalized only if inputs and bounds are provided; otherwise they are omitted.
- Proportion metrics are computed only when their denominators are > 0; otherwise the metric is SKIPPED.
- Frequency requires both `time_outside > 0` and `freq_min_rate`/`freq_max_rate` in `config.py`.
- Offense classification uses only `OFFENSE_LISTS`; anything unlisted → `other` (and does not contribute to denominators).
- Suitability uses only present (gated) features with explicit `METRIC_WEIGHTS` (no hidden zero-weights).
- When comparing individuals (similarity), compute on the intersection of present features and require a minimum shared-dimension count (e.g., ≥ 3); see the sketch after this list. Consider also Euclidean or Tanimoto for sensitivity analysis.
- If no metrics pass the gating (denominators of 0, missing exposure, missing age, etc.), the scorer returns NaN (or None, depending on the runner) and sets `evaluated = 0`. This is intentional; we do not fabricate zeros for unevaluable people.
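As one concrete reading of the similarity note above, here is a sketch of cosine similarity restricted to the intersection of present features, with a minimum shared-dimension requirement. The function name is illustrative (it is not part of the repo's API), and Euclidean or Tanimoto variants could be swapped in for sensitivity analysis.

```python
import math
from typing import Dict, Optional

def cosine_similarity_shared(a: Dict[str, Optional[float]],
                             b: Dict[str, Optional[float]],
                             min_shared: int = 3) -> Optional[float]:
    """Cosine similarity over metrics present (non-None) in both vectors.

    Returns None when fewer than `min_shared` dimensions are shared, so callers
    can distinguish "not comparable" from "low similarity".
    """
    shared = [k for k in a if k in b and a[k] is not None and b[k] is not None]
    if len(shared) < min_shared:
        return None
    dot = sum(a[k] * b[k] for k in shared)
    na = math.sqrt(sum(a[k] ** 2 for k in shared))
    nb = math.sqrt(sum(b[k] ** 2 for k in shared))
    if na == 0.0 or nb == 0.0:
        return None
    return dot / (na * nb)
```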
Programmatic example:

```python
import math

import pandas as pd

import config as CFG
import compute_metrics as cm
import sentencing_math as sm

# Load source tables
demo = cm.read_table(CFG.PATHS["demographics"])
cur = cm.read_table(CFG.PATHS["current_commitments"])
pri = cm.read_table(CFG.PATHS["prior_commitments"])

# Take a few IDs for the demo (drop missing IDs before casting to str)
ids = demo[CFG.COLS["id"]].dropna().astype(str).unique().tolist()[:3]

rows = []
for uid in ids:
    feats, aux = cm.compute_features(uid, demo, cur, pri, CFG.OFFENSE_LISTS)

    # Name-based suitability; may return NaN/None if no evaluable metrics
    score_ratio, num, den = sm.suitability_score_named(
        feats,
        weights=CFG.METRIC_WEIGHTS,
        directions=getattr(CFG, "METRIC_DIRECTIONS", {}),
        return_parts=True,
    )  # score_ratio == (num / den) when den > 0

    # NaN / "not evaluated" safe handling
    no_denom = (
        den is None
        or den == 0
        or (isinstance(den, float) and math.isnan(den))
    )
    if no_denom:
        score_ratio_safe = math.nan
        num_safe = math.nan
        den_safe = math.nan
        evaluated = 0
    else:
        score_ratio_safe = float(score_ratio)
        num_safe = float(num)
        den_safe = float(den)
        evaluated = 1

    # Optional: expose time_outside / pct_completed if present in aux
    time_outside_months = aux.get("time_outside")
    pct_completed = aux.get("pct_completed")

    rows.append(
        {
            CFG.COLS["id"]: uid,
            **feats,                           # all computed named metrics
            "score": num_safe,                 # numerator (Σ w·m)
            "score_out_of": den_safe,          # denominator (Σ w·x*)
            "score_ratio": score_ratio_safe,
            "evaluated": evaluated,            # 1 = evaluated, 0 = not evaluable
            "time_outside_months": time_outside_months,
            "pct_completed": pct_completed,
        }
    )

df = pd.DataFrame(rows)
print(df.head())
```

Troubleshooting:
- No computable features: verify `COLS` names and the required time fields in `DEFAULTS['require_time_fields']`.
- All similarities/metrics look constant: set meaningful `freq_min_rate`/`freq_max_rate` and ensure the exposure window is computed.
- XLSX read errors: `pip install openpyxl`.
- Parquet write errors: `pip install pyarrow`.
MIT