Joosep Pata edited this page Sep 25, 2025 · 44 revisions

Public results in CMS

MLPF outputs

The PF and MLPF NANOAOD files produced with CMSSW can be listed with:

gfal-ls root://xrootd.hep.kbfi.ee:1094//store/user/jpata/mlpf/results/CMSSW_15_0_5_mlpf_v2.6.0pre1_puppi_2372e2/cuda_False

The raw model outputs for debugging can be listed with:

gfal-ls root://xrootd.hep.kbfi.ee:1094//store/user/jpata/mlpf/models/pyg-cms_20250722_101813_274478/preds_checkpoint-10-3.812332

The model itself can be found in the same directory, or at huggingface.co/jpata/particleflow/tree/main/cms/v2.6.0pre1/pyg-cms_20250722_101813_274478.

Code setup in CMSSW

The following should work on lxplus:

#ensure proxy is set
voms-proxy-init -voms cms -valid 192:00
voms-proxy-info

#Initialize EL8
cmssw-el8

export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_15_0_5
cd CMSSW_15_0_5/src
cmsenv
git cms-init
#no further cd is needed: we are already in $CMSSW_BASE/src, where git cms-init sets up the repository
git checkout CMSSW_15_0_5
git checkout -b my_dev_branch
git cms-merge-topic jpata:2372e2eca3fea355e15a9ff6b79f9f3111a90b21

#use the following instead of cms-merge-topic if you plan to commit changes to the dev branch: https://github.com/jpata/cmssw/pull/73
#git cms-checkout-topic jpata:pfanalysis_caloparticle_CMSSW_15_0_5_debug_withpu

#compile
scram b -j4

#download the latest MLPF model
mkdir -p RecoParticleFlow/PFProducer/data/mlpf
wget https://huggingface.co/jpata/particleflow/resolve/main/cms/v2.6.0pre1/pyg-cms_20250722_101813_274478/checkpoints/test_fp32_fused.onnx -O RecoParticleFlow/PFProducer/data/mlpf/mlpf_5M_attn2x3x256_bm12_relu_checkpoint10_8xmi250_fp32_fused_20250722.onnx

Running MLPF in CMSSW

PF validation

We use the following datasets for rerunning reconstruction and PF (to be updated):

QCD_PU: /RelValQCD_FlatPt_15_3000HS_14/CMSSW_14_1_0-140X_mcRun3_2024_realistic_v21_STD_ReHLT_2024_PU-v1/GEN-SIM-DIGI-RAW
QCD_noPU: /RelValQCD_FlatPt_15_3000HS_14/CMSSW_14_1_0-140X_mcRun3_2024_realistic_v21_STD_Recycled_2024_noPU-v2/GEN-SIM-DIGI-RAW
TTbar_PU: /RelValTTbar_14TeV/CMSSW_14_1_0-140X_mcRun3_2024_realistic_v21_STD_ReHLT_2024_PU-v1/GEN-SIM-DIGI-RAW
TTbar_noPU: /RelValTTbar_14TeV/CMSSW_14_1_0-140X_mcRun3_2024_realistic_v21_STD_Recycled_2024_noPU-v2/GEN-SIM-DIGI-RAW
JetMET0: /JetMET0/Run2024B-v1/RAW, /JetMET0/Run2024C-v1/RAW

Since we need to rerun reconstruction, the datasets need to be in the GEN-SIM-DIGI-RAW (or RAW for data) tier.

Currently, only RelVal datasets are available at this tier. These datasets have been copied to disk at T2_EE_Estonia to ensure access.

MINIAOD with PF and MLPF

The PF validation workflows can be run using the scripts in the particleflow repository:

cd particleflow

#the number 1 signifies the row index (filename) in the input file to process
#mlpf corresponds to MLPF with PUPPI, pf corresponds to standard PF
./scripts/cmssw/validation_job.sh False mlpf scripts/cmssw/qcd_pu.txt QCD_PU 1
./scripts/cmssw/validation_job.sh False pf scripts/cmssw/qcd_pu.txt QCD_PU 1

The MINIAOD output will be in $CMSSW_BASE/out/QCD_PU_mlpfpu and $CMSSW_BASE/out/QCD_PU_pf.

Training

Datasets

There are three stages of datasets:

  • Raw CMSSW dump as a flat ROOT TTree pftree, generated using CMSSW_15_0_5 and PFAnalysisNtuplizer.cc
  • Postprocessed events as a pkl.bz2 file, containing the inputs and target particles after the physics-based target definition
  • Efficient ML training dataset in the .array-record format suitable for high-performance IO
    • gfal-ls root://xrootd.hep.kbfi.ee:1094//store/user/jpata/mlpf/tensorflow_datasets/2.8.0

Raw CMSSW dump in pftree

This ROOT TTree, named pftree, contains event-by-event information for developing and validating machine learning algorithms for particle flow reconstruction in the CMS experiment. The data is organized into several collections, detailed below.


Event Information

This section details the unique identification for each collision event.

  • run, lumi, event: These integer values store the run number, luminosity section, and event number, respectively. Together, they form a unique identifier for each collision event.
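
As a small illustration of using the (run, lumi, event) triplet as a unique key, the sketch below zips three per-event lists into tuples and reports any triplet that appears more than once, for example after merging several ntuple files. The helper names and the toy values are hypothetical; in practice the three lists would be read from the run, lumi and event branches.

```python
from collections import Counter


def event_keys(runs, lumis, events):
    """Zip the three per-event lists into hashable (run, lumi, event) keys."""
    return list(zip(runs, lumis, events))


def find_duplicates(keys):
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]


# Toy values standing in for the run, lumi and event branches.
runs = [1, 1, 1, 1]
lumis = [10, 10, 11, 10]
events = [100, 101, 100, 100]

keys = event_keys(runs, lumis, events)
duplicates = find_duplicates(keys)
print(duplicates)
```

Note that the event number alone is not sufficient as a key in this toy input: event 100 appears in two different luminosity sections.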

Generator-Level Information

Here you'll find information about the particles as they were produced in the initial hard interaction, before interacting with the detector.

  • gen_*: This collection describes the particles from the event generator. It includes their kinematics (pt, eta, phi, px, py, pz, energy), charge (charge), particle type (pdgid), and status (status). The gen_daughters branch maps the decay products for each particle.
  • genjet_*: Contains the four-momentum (pt, eta, phi, energy) of jets clustered from the generator-level particles.
  • genmet_*: Stores the transverse momentum (pt) and direction (phi) of the generated missing transverse energy (MET).

Simulation Information

This section provides a "ground truth" of how particles interacted with the detector, based on the Geant4 simulation.

  • trackingparticle_*: Describes simulated particles that leave a trace in the tracking detectors. It includes their kinematics (pt, eta, phi, etc.), production vertex (ovx, ovy, ovz), decay vertex (dvx, dvy, dvz), particle type (pid), and charge (charge).
  • caloparticle_*: Details simulated particles as they deposit energy in the calorimeters. It includes their kinematics (pt, eta, phi, energy), particle type (pid), charge (charge), and the total simulated energy deposited (simenergy). It also contains an index (idx_trackingparticle) to link back to the corresponding TrackingParticle.
  • simcluster_*: Represents clusters of energy depositions in the calorimeter from a single simulated particle. It stores their kinematics (pt, eta, phi, energy), particle type (pid), charge (charge), and indices to the parent CaloParticle (idx_caloparticle) and TrackingParticle (idx_trackingparticle).
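
The index branches can be followed to regroup the collections. The sketch below is a minimal illustration, assuming that simcluster idx_caloparticle holds, for each SimCluster, the position of its parent in the caloparticle_* arrays, and that a negative value marks a SimCluster without a parent (the sentinel convention is an assumption, not something stated here). The function name and toy values are hypothetical.

```python
from collections import defaultdict


def simclusters_per_caloparticle(idx_caloparticle):
    """Map each CaloParticle index to the list of its SimCluster indices."""
    groups = defaultdict(list)
    for isc, icp in enumerate(idx_caloparticle):
        if icp >= 0:  # assumption: a negative index means "no parent"
            groups[icp].append(isc)
    return dict(groups)


# Toy values standing in for the SimCluster branches of one event.
simcluster_idx_caloparticle = [0, 0, 1, -1, 1]
simcluster_energy = [5.0, 3.0, 2.0, 0.5, 1.0]

groups = simclusters_per_caloparticle(simcluster_idx_caloparticle)

# Sum the SimCluster energies belonging to each CaloParticle.
energy_per_caloparticle = {
    icp: sum(simcluster_energy[isc] for isc in members)
    for icp, members in groups.items()
}
print(groups, energy_per_caloparticle)
```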

Reconstructed Particle Flow Elements

These are the reconstructed tracks and calorimeter clusters that serve as inputs to the particle flow algorithm.

  • element_*: The properties of each PFBlockElement.
    • Identification: type (e.g., track, ECAL cluster, HCAL cluster), charge.
    • Kinematics: Transverse momentum (pt), momentum components (px, py, pz), energy (energy), and their associated errors. It also includes corrected energy (corr_energy) and its error.
    • Position: Position at various detector layers (eta_ecal, phi_ecal, eta_hcal, phi_hcal) and vertex information (vx, vy, vz).
    • Track-specific: Number of hits (num_hits), trajectory parameters (lambda, theta) and their errors.
    • Cluster-specific: Information about cluster shape and flags (cluster_flags).
    • Muon-specific: Number of hits in the muon systems (muon_dt_hits, muon_csc_hits) and muon type (muon_type).
    • Electron-specific: Information from the GSF algorithm and electron seed classifiers (gsf_electronseed_*).

Reconstructed Particle Candidates

This section contains the final output of the standard particle flow algorithm.

  • pfcandidate_*: The properties of the final reconstructed particles (PFCandidate). This includes their kinematics (pt, eta, phi, px, py, pz, energy) and their identified particle type (pdgid).

Associations and Links

This set of branches links the different data collections, enabling performance studies and algorithm training.

  • *_to_element: These branches link the ground truth particles to the reconstructed elements.
    • trackingparticle_to_element: Links TrackingParticle to PFBlockElement.
    • caloparticle_to_element: Links CaloParticle to PFBlockElement.
    • simcluster_to_element: Links SimCluster to PFBlockElement.
    • The *_cmp branches store a "comparison" metric for the link, related to shared energy or hits.
  • element_to_candidate: Links each PFBlockElement to the final PFCandidate it is part of.
  • caloparticle_to_simcluster: Links a CaloParticle to the SimClusters it generated.
  • element_distance_*: Stores the pre-calculated "distance" between pairs of PFBlockElements (element_distance_i, element_distance_j, element_distance_d), which is a measure of their likelihood of originating from the same particle. This is a key input for graph-based machine learning models.
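
The three element_distance_* branches read naturally as a sparse edge list: entry k says that elements i[k] and j[k] are separated by distance d[k]. The sketch below turns such triplets into a neighbor map. It assumes that the distance is symmetric and that each pair is stored only once, neither of which is stated above; the function name and toy values are hypothetical.

```python
from collections import defaultdict


def build_neighbors(idx_i, idx_j, dist):
    """Turn (i, j, d) triplets into a symmetric map {i: {j: d}}."""
    neighbors = defaultdict(dict)
    for i, j, d in zip(idx_i, idx_j, dist):
        neighbors[i][j] = d
        neighbors[j][i] = d  # assumption: the distance is symmetric
    return neighbors


# Toy values standing in for the three branches of one event.
element_distance_i = [0, 0, 2]
element_distance_j = [1, 2, 3]
element_distance_d = [0.05, 0.30, 0.10]

neighbors = build_neighbors(
    element_distance_i, element_distance_j, element_distance_d
)

# Nearest linked element to element 0.
closest_to_0 = min(neighbors[0], key=neighbors[0].get)
print(closest_to_0, dict(neighbors[2]))
```

Element pairs that do not appear in the branches simply have no entry in the map, so the structure stays proportional to the number of stored links rather than to the square of the number of elements.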

Postprocessed inputs and targets

The Python script postprocessing2.py transforms particle flow data from a ROOT file into a format optimized for machine learning. Its final output is a single bzip2-compressed Python pickle file (.pkl.bz2) that contains a list of dictionaries, where each dictionary holds the processed data for one event.

Event Data Structure

Each event dictionary contains several NumPy arrays that represent the inputs and targets for an ML model.


Machine Learning Arrays
  • Xelem: The primary input feature array. Each row corresponds to a single Particle Flow (PF) element, such as a track or calorimeter cluster, after removing less informative Preshower and Bremsstrahlung elements. Features include the element's kinematics (pt, eta, phi), type, charge, and other detector-specific measurements.
  • ytarget: The ground truth target array, with a one-to-one correspondence with the rows of Xelem. Each target is constructed by merging all simulated CaloParticles that contributed to the corresponding input element. Its features include the true particle ID (pid), kinematics, and an index (jet_idx) linking it to a targetjet.
  • ycand: A "baseline" truth array representing the output of the standard CMSSW PFCandidate reconstruction, associated back to the input elements. It shares the same structure as ytarget and is used for performance comparisons.

Supporting Arrays
  • pythia: An array of stable generator-level particles from Pythia, excluding neutrinos.
  • genjet: An array of jets clustered from the stable pythia particles. Each jet is described by its pt, eta, phi, and energy.
  • targetjet: An array of jets clustered from the ytarget truth particles, providing a physics-level target for jet reconstruction.
  • genmet: An array containing the generator-level missing transverse energy (pt and phi) from the original ROOT file.
  • full_graph (Optional): If run with the --save-full-graph flag, the script also saves the complete networkx graph object for each event, which is useful for debugging.
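
A file with this layout can be written and read back with the Python standard library, assuming it is a plain bzip2-compressed pickle as the pkl.bz2 extension listed among the dataset stages suggests. The sketch below round-trips a toy list of event dictionaries through a temporary file. The toy events use plain lists and only three of the keys; the real files hold NumPy arrays, so NumPy must be importable when unpickling them. The helper names and the file name are hypothetical.

```python
import bz2
import os
import pickle
import tempfile


def save_events(path, events):
    """Write a list of per-event dictionaries as a bzip2-compressed pickle."""
    with bz2.BZ2File(path, "wb") as fi:
        pickle.dump(events, fi)


def load_events(path):
    """Read the list of per-event dictionaries back."""
    with bz2.BZ2File(path, "rb") as fi:
        return pickle.load(fi)


# Toy stand-ins: two events, the second one empty.
toy_events = [
    {"Xelem": [[1.0, 0.5]], "ytarget": [[211, 1.1]], "ycand": [[211, 0.9]]},
    {"Xelem": [], "ytarget": [], "ycand": []},
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.pkl.bz2")
    save_events(path, toy_events)
    loaded = load_events(path)

for iev, event in enumerate(loaded):
    print(iev, sorted(event.keys()), len(event["Xelem"]))
```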

Efficient ML format

This dataset is designed for machine learning and contains event-by-event information stored as flat NumPy arrays. Each event is a dictionary containing the following keys.


X (Input Features)

This is a 2D array of shape (num_elements, num_features) representing the input detector elements for the ML model.

  • Content: Each row corresponds to a single Particle Flow (PF) element from the detector.
  • Filtering: Preshower (PS1, PS2) and Bremsstrahlung (BREM) elements are removed from the original set.
  • Features: The features for each element are derived from the elem_branches in postprocessing2.py and organized by the X_FEATURES list in cms_utils.py. Key features include:
    • typ_idx: An integer index representing the element type (TRACK, ECAL, HCAL, etc.), mapped from the ELEM_NAMES_CMS list.
    • Kinematics: pt, eta, sin_phi, cos_phi, energy, px, py, pz.
    • Detector-specific information: layer, depth, charge, position at ECAL/HCAL (eta_ecal, phi_ecal, etc.), and muon system hits (muon_dt_hits, muon_csc_hits).
    • Errors and quality flags: pterror, etaerror, phierror, cluster_flags, etc.
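
The kinematic features store the azimuthal angle as the pair (sin_phi, cos_phi) rather than as phi itself. The reason is not stated here; a likely motivation is that the pair is continuous across the ±π boundary, where phi itself jumps. The sketch below illustrates that property with the standard library only; the function names are hypothetical.

```python
import math


def encode_phi(phi):
    """Represent an azimuthal angle as (sin_phi, cos_phi)."""
    return math.sin(phi), math.cos(phi)


def decode_phi(sin_phi, cos_phi):
    """Recover phi in (-pi, pi] from its (sin_phi, cos_phi) encoding."""
    return math.atan2(sin_phi, cos_phi)


# Two directions that are physically close but sit on opposite
# sides of the phi = +-pi boundary.
phi_a = math.pi - 0.01
phi_b = -math.pi + 0.01

# Difference of the raw angles: large, although the directions nearly coincide.
raw_gap = abs(phi_a - phi_b)

# Distance between the encoded points on the unit circle: small.
sin_a, cos_a = encode_phi(phi_a)
sin_b, cos_b = encode_phi(phi_b)
encoded_gap = math.hypot(sin_a - sin_b, cos_a - cos_b)

print(raw_gap, encoded_gap, decode_phi(sin_a, cos_a))
```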

ytarget (Ground Truth Particles)

This is a 2D array of shape (num_elements, num_truth_features), with one row per row of X, representing the ground truth particle corresponding to each input element.

  • Content: Each row is a target particle constructed from one or more simulated CaloParticles that are associated with the corresponding input element. If multiple CaloParticles are linked to a single element, their four-vectors are summed to form one target particle.
  • Features: The features are defined by the Y_FEATURES list in cms_utils.py. They include:
    • typ_idx: An integer index for the particle type, where specific PDGIDs are mapped to a simplified set of classes (e.g., ch.had, n.had, gamma, ele, mu) defined in CLASS_NAMES_CMS.
    • Kinematics: charge, pt, eta, sin_phi, cos_phi, energy.
    • Provenance: ispu (a flag for pileup), generatorStatus, and simulatorStatus.
    • jet_idx: An index indicating which targetjet this particle belongs to. A value of -1 means it's not part of a jet. This information is currently not used by the algorithm.
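
The four-vector summation mentioned under Content can be sketched as follows: each particle is converted from (pt, eta, phi, energy) to Cartesian components, the components are added, and the sum is converted back. This is a minimal illustration under those assumptions, not the code in postprocessing2.py, and it does not cover how the type or charge of the merged particle is chosen. The function names and toy values are hypothetical.

```python
import math


def to_cartesian(pt, eta, phi, energy):
    """Convert (pt, eta, phi, energy) to (px, py, pz, energy)."""
    return (pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta), energy)


def merge_particles(particles):
    """Sum four-vectors given as (pt, eta, phi, energy) tuples."""
    px = py = pz = energy = 0.0
    for particle in particles:
        cx, cy, cz, ce = to_cartesian(*particle)
        px += cx
        py += cy
        pz += cz
        energy += ce
    pt = math.hypot(px, py)
    # eta is undefined for pt == 0; fall back to 0.0 in this sketch.
    eta = math.asinh(pz / pt) if pt > 0 else 0.0
    phi = math.atan2(py, px)
    return pt, eta, phi, energy


# Two collinear, massless toy particles: energy = pt * cosh(eta).
eta0, phi0 = 0.5, 1.0
particles = [
    (10.0, eta0, phi0, 10.0 * math.cosh(eta0)),
    (5.0, eta0, phi0, 5.0 * math.cosh(eta0)),
]

merged_pt, merged_eta, merged_phi, merged_energy = merge_particles(particles)
print(merged_pt, merged_eta, merged_phi, merged_energy)
```

For collinear particles the transverse momenta simply add; for particles at different angles the merged pt is smaller than the scalar sum, which is why the sum is done on the Cartesian components.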

ycand (CMS PFCandidate Truth)

This is a 2D array with the same shape as ytarget, representing the "baseline" truth from the standard CMS PFCandidate reconstruction.

  • Content: Each row corresponds to the reconstructed PFCandidate that was primarily associated with the input element. This allows for a direct comparison between the ML model's output and the standard reconstruction.
  • Features: It has the same feature set as ytarget, as defined by Y_FEATURES.

Jet and MET Collections

These are event-level 2D arrays containing kinematic information, used for quickly cross-checking the reconstruction in the ML training scripts.

  • genjets: Jets clustered using the anti-kT algorithm (R=0.4) from stable Pythia generator particles (excluding neutrinos). Each row is a jet with (pt, eta, phi, energy).
  • targetjets: Jets clustered from the ytarget truth particles, using the same algorithm. Each row is a jet with (pt, eta, phi, energy).
  • genmet: The generator-level missing transverse energy (pt, phi) from the original ROOT file.
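
One quantity that could be compared against genmet in such a cross-check is the missing transverse momentum of a particle collection, computed as the negative vector sum of the particles' transverse momenta. The sketch below uses that standard definition on (pt, phi) pairs; it is an illustration rather than the code used in the training scripts, and the function name and toy values are hypothetical.

```python
import math


def missing_pt(particles):
    """Return (met_pt, met_phi) for an iterable of (pt, phi) pairs.

    The missing transverse momentum is the negative vector sum of
    the transverse momenta of all particles.
    """
    sum_px = sum(pt * math.cos(phi) for pt, phi in particles)
    sum_py = sum(pt * math.sin(phi) for pt, phi in particles)
    return math.hypot(sum_px, sum_py), math.atan2(-sum_py, -sum_px)


# Two back-to-back particles of equal pt: the event is balanced.
balanced = [(20.0, 0.0), (20.0, math.pi)]

# Unequal pt: 20 units are missing, pointing opposite to the harder particle.
unbalanced = [(30.0, 0.0), (10.0, math.pi)]

balanced_met, _ = missing_pt(balanced)
unbalanced_met, unbalanced_phi = missing_pt(unbalanced)
print(balanced_met, unbalanced_met, unbalanced_phi)
```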

Dataset physics configurations

  • no pileup: CMSSW_15_0_5, auto:phase1_2023_realistic, Realistic25ns13p6TeVEarly2023Collision
    • TTbar_14TeV_TuneCUETP8M1_cfi: cms_pf_ttbar_nopu
    • ZTT_All_hadronic_14TeV_TuneCUETP8M1_cfi: cms_pf_ztt_nopu
    • QCDForPF_14TeV_TuneCUETP8M1_cfi: cms_pf_qcd_nopu
  • with pileup: CMSSW_15_0_5, auto:phase1_2023_realistic, Realistic25ns13p6TeVEarly2023Collision, Run3_Flat55To75_PoissonOOTPU, /RelValMinBias_14TeV/CMSSW_14_1_0_pre7-140X_mcRun3_2024_realistic_v21_STD_MinBias_2026D110_GenSim-v1/GEN-SIM
    • TTbar_14TeV_TuneCUETP8M1_cfi: cms_pf_ttbar
    • ZTT_All_hadronic_14TeV_TuneCUETP8M1_cfi: cms_pf_ztt
    • QCDForPF_14TeV_TuneCUETP8M1_cfi: cms_pf_qcd

Generating MLPF training samples

If you want to regenerate ML training samples from scratch with CMSSW, see the following scripts:

mlpf/data_cms/genjob_nopu.sh
mlpf/data_cms/genjob_pu55to75.sh

PyTorch training

Copy the datasets from xrootd (about 1.8 TB of disk space required):

gfal-copy -r root://xrootd.hep.kbfi.ee:1094//store/user/jpata/mlpf/tensorflow_datasets/2.8.0 ./

Download the Singularity image with the PyTorch environment, saving it under the name used in the command below:

wget https://jpata.web.cern.ch/jpata/pytorch.simg:2024-12-03 -O pytorch.simg

On a machine with a single GPU, the following is a quick test of the training workflow:

singularity exec --env CUDA_VISIBLE_DEVICES=0 -B /scratch/persistent --nv \
    --env PYTHONPATH=`pwd` \
    --env KERAS_BACKEND=torch \
    pytorch.simg python3.10 mlpf/pipeline.py --dataset cms --gpus 1 \
    --data-dir ./tensorflow_datasets --config parameters/pytorch/pyg-cms.yaml \
    --train --test --make-plots --num-epochs 10 --gpu-batch-multiplier 1 \
    --num-workers 4 --prefetch-factor 100 --checkpoint-freq 1 --ntrain 1000 --ntest 1000 --nvalid 1000

Workflow

---
config:
  markdownAutoWrap: false
---
graph TD;
  subgraph genjob [genjob_pu55to75.sh, genjob_nopu.sh]
    Samples(TTbar_14TeV_TuneCUETP8M1_cfi.py)-->|cmsDriver.py| gensim(standard GEN-SIM-RECO)
    gensim -->|PFAnalysisNtuplizer.cc| pfntuple(PFElements, CaloParticles, SimClusters: flat *.root)
  end
  subgraph dataprep [Dataset preprocessing]
    pfntuple-->|postprocessing2.py| postprocessing(MLPF inputs and targets: *.pkl.bz2);
    postprocessing -->|tfds build heptfds/cms_pf/ttbar.py| tfds(ML dataset splits 1-10: *.tfrecords)
  end
    pfntuple -->|mlpf/data/cms/plot_cms.py| dataset_plots(Dataset plots: *.pkl)
    postprocessing -->|mlpf/data/cms/plot_cms.py| dataset_plots
  subgraph ml [ML training & eval]
    tfds -->|mlpf/pipeline.py --train ...| checkpoints(checkpoint-epoch-loss.pth)
    checkpoints -->|mlpf/pipeline.py --load checkpoint.pth --test ... | predictions(Predictions: *.parquet)
    checkpoints -->|cms-validate-onnx.ipynb| onnx(ONNX model: *.onnx)
    predictions -->|mlpf/pipeline.py --load checkpoint.pth --make-plots | eval_plots(Validation plots: *.pdf)
  end
  subgraph inference
    onnx -->|cmsDriver ... -s RECO ... --procModifiers mlpf| mlpfnanoaod(BTV NANOAOD)
    mlpfnanoaod -->|cmssw-validation.ipynb| cmsswplots(CMSSW validation plots: *.pdf)
    mlpfnanoaod -->|cmssw-validation-data.ipynb| cmsswplots(CMSSW validation plots: *.pdf)
  end