CMS
- EPS 2025:
- CERN-CMS-DP-2025-033: https://cds.cern.ch/record/2937578
- ACAT 2022:
- CERN-CMS-DP-2022-061: http://cds.cern.ch/record/2842375
- ACAT 2021:
- J. Phys. Conf. Ser. 2438 012100: http://dx.doi.org/10.1088/1742-6596/2438/1/012100
- CERN-CMS-DP-2021-030: https://cds.cern.ch/record/2792320
The PF and MLPF NANOAOD from CMSSW can be found at
gfal-ls root://xrootd.hep.kbfi.ee:1094//store/user/jpata/mlpf/results/CMSSW_15_0_5_mlpf_v2.6.0pre1_puppi_2372e2/cuda_False
The raw model outputs for debugging can be found at
gfal-ls root://xrootd.hep.kbfi.ee:1094//store/user/jpata/mlpf/models/pyg-cms_20250722_101813_274478/preds_checkpoint-10-3.812332
The model itself can be found in the same directory, or also at huggingface.co/jpata/particleflow/tree/main/cms/v2.6.0pre1/pyg-cms_20250722_101813_274478.
The following should work in lxplus:
#ensure proxy is set
voms-proxy-init -voms cms -valid 192:00
voms-proxy-info
#Initialize EL8
cmssw-el8
export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_15_0_5
cd CMSSW_15_0_5/src
cmsenv
git cms-init
git checkout CMSSW_15_0_5
git checkout -b my_dev_branch
git cms-merge-topic jpata:2372e2eca3fea355e15a9ff6b79f9f3111a90b21
#use the following instead of cms-merge-topic if you plan to commit changes to the dev branch: https://github.com/jpata/cmssw/pull/73
#git cms-checkout-topic jpata:pfanalysis_caloparticle_CMSSW_15_0_5_debug_withpu
#compile
scram b -j4
#download the latest MLPF model
mkdir -p RecoParticleFlow/PFProducer/data/mlpf
wget https://huggingface.co/jpata/particleflow/resolve/main/cms/v2.6.0pre1/pyg-cms_20250722_101813_274478/checkpoints/test_fp32_fused.onnx -O RecoParticleFlow/PFProducer/data/mlpf/mlpf_5M_attn2x3x256_bm12_relu_checkpoint10_8xmi250_fp32_fused_20250722.onnx
We use the following datasets for rerunning reconstruction and PF (to be updated):
QCD_PU: /RelValQCD_FlatPt_15_3000HS_14/CMSSW_14_1_0-140X_mcRun3_2024_realistic_v21_STD_ReHLT_2024_PU-v1/GEN-SIM-DIGI-RAW
QCD_noPU: /RelValQCD_FlatPt_15_3000HS_14/CMSSW_14_1_0-140X_mcRun3_2024_realistic_v21_STD_Recycled_2024_noPU-v2/GEN-SIM-DIGI-RAW
TTbar_PU: /RelValTTbar_14TeV/CMSSW_14_1_0-140X_mcRun3_2024_realistic_v21_STD_ReHLT_2024_PU-v1/GEN-SIM-DIGI-RAW
TTbar_noPU: /RelValTTbar_14TeV/CMSSW_14_1_0-140X_mcRun3_2024_realistic_v21_STD_Recycled_2024_noPU-v2/GEN-SIM-DIGI-RAW
JetMET0: /JetMET0/Run2024B-v1/RAW, /JetMET0/Run2024C-v1/RAW
Since we need to rerun reconstruction, the datasets need to be in the GEN-SIM-DIGI-RAW (or RAW for data) tier.
Currently, only RelVal datasets are available at this tier. These datasets have been copied to disk at T2_EE_Estonia to ensure access.
The PF validation workflows can be run using the scripts in scripts/cmssw of the particleflow repository:
cd particleflow
#the number 1 signifies the row index (filename) in the input file to process
#mlpf corresponds to MLPF with PUPPI, pf corresponds to standard PF
./scripts/cmssw/validation_job.sh False mlpf scripts/cmssw/qcd_pu.txt QCD_PU 1
./scripts/cmssw/validation_job.sh False pf scripts/cmssw/qcd_pu.txt QCD_PU 1
The MINIAOD output will be in $CMSSW_BASE/out/QCD_PU_mlpfpu and $CMSSW_BASE/out/QCD_PU_pf.
There are three stages of datasets:
- Raw CMSSW dump as a flat ROOT TTree pftree, generated using CMSSW_15_0_5 and PFAnalysisNtuplizer.cc
- Postprocessed events as a pkl.bz2 file, containing the inputs and target particles after the physics-based target definition
- Efficient ML training dataset in the .array-record format suitable for high-performance IO
gfal-ls root://xrootd.hep.kbfi.ee:1094//store/user/jpata/mlpf/tensorflow_datasets/2.8.0
This ROOT TTree, named pftree, contains event-by-event information for developing and validating machine learning algorithms for particle flow reconstruction in the CMS experiment. The data is organized into several collections, detailed below.
This section details the unique identification for each collision event.
- run, lumi, event: These integer values store the run number, luminosity section, and event number, respectively. Together, they form a unique identifier for each collision event.
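As a small illustration of using the (run, lumi, event) triplet as a key, the sketch below checks a set of events for duplicates. The helper name find_duplicate_events and the toy values are hypothetical; in practice the three lists would come from the corresponding pftree branches.

```python
from collections import Counter


def find_duplicate_events(run, lumi, event):
    """Return the (run, lumi, event) keys that occur more than once."""
    counts = Counter(zip(run, lumi, event))
    return sorted(key for key, n in counts.items() if n > 1)


# Toy branch contents (made-up values), one entry per event
run = [1, 1, 1, 2]
lumi = [10, 10, 11, 10]
event = [100, 101, 100, 100]
duplicates = find_duplicate_events(run, lumi, event)
print(duplicates)
```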
Here you'll find information about the particles as they were produced in the initial hard interaction, before interacting with the detector.
- gen_*: This collection describes the particles from the event generator. It includes their kinematics (pt, eta, phi, px, py, pz, energy), charge (charge), particle type (pdgid), and status (status). The gen_daughters branch maps the decay products for each particle.
- genjet_*: Contains the four-momentum (pt, eta, phi, energy) of jets clustered from the generator-level particles.
- genmet_*: Stores the transverse momentum (pt) and direction (phi) of the generated missing transverse energy (MET).
This section provides a "ground truth" of how particles interacted with the detector, based on the Geant4 simulation.
- trackingparticle_*: Describes simulated particles that leave a trace in the tracking detectors. It includes their kinematics (pt, eta, phi, etc.), production vertex (ovx, ovy, ovz), decay vertex (dvx, dvy, dvz), particle type (pid), and charge (charge).
- caloparticle_*: Details simulated particles as they deposit energy in the calorimeters. It includes their kinematics (pt, eta, phi, energy), particle type (pid), charge (charge), and the total simulated energy deposited (simenergy). It also contains an index (idx_trackingparticle) to link back to the corresponding TrackingParticle.
- simcluster_*: Represents clusters of energy depositions in the calorimeter from a single simulated particle. It stores their kinematics (pt, eta, phi, energy), particle type (pid), charge (charge), and indices to the parent CaloParticle (idx_caloparticle) and TrackingParticle (idx_trackingparticle).
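The index branches described above can be chained to walk from a SimCluster to its parent CaloParticle and on to the TrackingParticle. The following is a minimal sketch with toy values. The full branch names (simcluster_idx_caloparticle, caloparticle_idx_trackingparticle) are inferred from the collection prefixes, and treating a negative index as "no link" is an assumption of this example, not something the ntuple documentation confirms.

```python
def resolve_simcluster_parents(simcluster_idx_caloparticle,
                               caloparticle_idx_trackingparticle):
    """For each SimCluster return (caloparticle index, trackingparticle index).

    Negative indices are treated as "no link" (assumption of this sketch).
    """
    parents = []
    for icp in simcluster_idx_caloparticle:
        if icp < 0:
            parents.append((None, None))
            continue
        itp = caloparticle_idx_trackingparticle[icp]
        parents.append((icp, itp if itp >= 0 else None))
    return parents


# Toy per-event index arrays (made-up values)
simcluster_idx_caloparticle = [0, 0, 1]
caloparticle_idx_trackingparticle = [3, -1]
parents = resolve_simcluster_parents(
    simcluster_idx_caloparticle, caloparticle_idx_trackingparticle
)
print(parents)
```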
These are the reconstructed tracks and calorimeter clusters, which serve as the inputs to the particle flow algorithm.
- element_*: The properties of each PFBlockElement.
  - Identification: type (e.g., track, ECAL cluster, HCAL cluster), charge.
  - Kinematics: Transverse momentum (pt), momentum components (px, py, pz), energy (energy), and their associated errors. It also includes corrected energy (corr_energy) and its error.
  - Position: Position at various detector layers (eta_ecal, phi_ecal, eta_hcal, phi_hcal) and vertex information (vx, vy, vz).
  - Track-specific: Number of hits (num_hits), trajectory parameters (lambda, theta) and their errors.
  - Cluster-specific: Information about cluster shape and flags (cluster_flags).
  - Muon-specific: Number of hits in the muon systems (muon_dt_hits, muon_csc_hits) and muon type (muon_type).
  - Electron-specific: Information from the GSF algorithm and electron seed classifiers (gsf_electronseed_*).
This section contains the final output of the standard particle flow algorithm.
- pfcandidate_*: The properties of the final reconstructed particles (PFCandidate). This includes their kinematics (pt, eta, phi, px, py, pz, energy) and their identified particle type (pdgid).
This set of branches links the different data collections, enabling performance studies and algorithm training.
- *_to_element: These branches link the ground truth particles to the reconstructed elements.
  - trackingparticle_to_element: Links TrackingParticle to PFBlockElement.
  - caloparticle_to_element: Links CaloParticle to PFBlockElement.
  - simcluster_to_element: Links SimCluster to PFBlockElement.
  - The *_cmp branches store a "comparison" metric for the link, related to shared energy or hits.
- element_to_candidate: Links PFBlockElement to the final PFCandidate they are part of.
- caloparticle_to_simcluster: Links a CaloParticle to the SimClusters it generated.
- element_distance_*: Stores the pre-calculated "distance" between pairs of PFBlockElements (element_distance_i, element_distance_j, element_distance_d), which is a measure of their likelihood of originating from the same particle. This is a key input for graph-based machine learning models.
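The three element_distance_* branches can be read as a sparse edge list, with i and j as element indices and d as the edge weight. The sketch below turns such triplets into an adjacency map using toy values. Treating the edges as undirected and sorting neighbours by increasing distance are assumptions of this example, and build_adjacency is a hypothetical helper.

```python
from collections import defaultdict


def build_adjacency(dist_i, dist_j, dist_d):
    """Build {element: [(neighbour, distance), ...]} from (i, j, d) triplets."""
    adjacency = defaultdict(list)
    for i, j, d in zip(dist_i, dist_j, dist_d):
        # Store each edge in both directions (undirected graph, assumed)
        adjacency[i].append((j, d))
        adjacency[j].append((i, d))
    # Sort each neighbour list by increasing distance
    return {k: sorted(v, key=lambda nd: nd[1]) for k, v in adjacency.items()}


# Toy contents of the three branches for one event (made-up values)
element_distance_i = [0, 0, 1]
element_distance_j = [1, 2, 2]
element_distance_d = [0.05, 0.30, 0.10]
adjacency = build_adjacency(
    element_distance_i, element_distance_j, element_distance_d
)
print(adjacency[0])
```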
The Python script postprocessing2.py transforms particle flow data from a ROOT file into a format optimized for machine learning. Its final output is a single Python pickle file (.pkl) that contains a list of dictionaries, where each dictionary holds the processed data for one event.
Each event dictionary contains several NumPy arrays that represent the inputs and targets for an ML model.
- Xelem: The primary input feature array. Each row corresponds to a single Particle Flow (PF) element, such as a track or calorimeter cluster, after removing less informative Preshower and Bremsstrahlung elements. Features include the element's kinematics (pt, eta, phi), type, charge, and other detector-specific measurements.
- ytarget: The ground truth target array, with a one-to-one correspondence with the rows of Xelem. Each target is constructed by merging all simulated CaloParticles that contributed to the corresponding input element. Its features include the true particle ID (pid), kinematics, and an index (jet_idx) linking it to a targetjet.
- ycand: A "baseline" truth array representing the output of the standard CMSSW PFCandidate reconstruction, associated back to the input elements. It shares the same structure as ytarget and is used for performance comparisons.
- pythia: An array of stable generator-level particles from Pythia, excluding neutrinos.
- genjet: An array of jets clustered from the stable pythia particles. Each jet is described by its pt, eta, phi, and energy.
- targetjet: An array of jets clustered from the ytarget truth particles, providing a physics-level target for jet reconstruction.
- genmet: An array containing the generator-level missing transverse energy (pt and phi) from the original ROOT file.
- full_graph (Optional): If run with the --save-full-graph flag, the script also saves the complete networkx graph object for each event, which is useful for debugging.
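A postprocessed file of this kind can be inspected with the Python standard library alone. The sketch below assumes the file is a bz2-compressed pickle holding a list of event dictionaries, as described above. It first writes a toy file so that the example is self-contained, and plain lists stand in for the NumPy arrays. load_events and the toy values are hypothetical.

```python
import bz2
import os
import pickle
import tempfile


def load_events(path):
    """Load a list of per-event dictionaries from a bz2-compressed pickle."""
    with bz2.open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-in for a postprocessed file (made-up values, lists instead of arrays)
toy_events = [
    {
        "Xelem": [[1, 10.0], [4, 5.0]],
        "ytarget": [[1, 9.5], [5, 5.2]],
        "ycand": [[1, 9.8], [5, 4.9]],
    }
]
path = os.path.join(tempfile.mkdtemp(), "toy.pkl.bz2")
with bz2.open(path, "wb") as f:
    pickle.dump(toy_events, f)

events = load_events(path)
for ev in events:
    # Xelem, ytarget and ycand are row-aligned: one row per PF element
    assert len(ev["Xelem"]) == len(ev["ytarget"]) == len(ev["ycand"])
print(len(events), sorted(events[0].keys()))
```

Unpickling executes code embedded in the file, so only files from a trusted source should be loaded this way.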
This dataset is designed for machine learning and contains event-by-event information stored as flat NumPy arrays. Each event is a dictionary containing the following keys.
This is a 2D array of shape (num_elements, num_features) representing the input detector elements for the ML model.
- Content: Each row corresponds to a single Particle Flow (PF) element from the detector.
- Filtering: Preshower (PS1, PS2) and Bremsstrahlung (BREM) elements are removed from the original set.
- Features: The features for each element are derived from the elem_branches in postprocessing2.py and organized by the X_FEATURES list in cms_utils.py. Key features include:
  - typ_idx: An integer index representing the element type (TRACK, ECAL, HCAL, etc.), mapped from the ELEM_NAMES_CMS list.
  - Kinematics: pt, eta, sin_phi, cos_phi, energy, px, py, pz.
  - Detector-specific information: layer, depth, charge, position at ECAL/HCAL (eta_ecal, phi_ecal, etc.), and muon system hits (muon_dt_hits, muon_csc_hits).
  - Errors and quality flags: pterror, etaerror, phierror, cluster_flags, etc.
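The filtering and the sin_phi/cos_phi encoding described above can be sketched as follows. The ordering of ELEM_NAMES here is an assumption (it follows the usual PFBlockElement type enumeration and should be checked against ELEM_NAMES_CMS in cms_utils.py). The reduced feature set and the helpers make_rows and phi_from_row are hypothetical. Encoding the angle as a sine/cosine pair presumably avoids the discontinuity of phi at ±π, and the angle can be recovered with atan2.

```python
import math

# Assumed ordering of element type names; verify against ELEM_NAMES_CMS
ELEM_NAMES = ["NONE", "TRACK", "PS1", "PS2", "ECAL", "HCAL",
              "GSF", "BREM", "HFEM", "HFHAD", "SC", "HO"]
DROPPED = {"PS1", "PS2", "BREM"}


def make_rows(elements):
    """Drop preshower/brem elements and encode phi as (sin_phi, cos_phi)."""
    rows = []
    for el in elements:
        if ELEM_NAMES[el["type"]] in DROPPED:
            continue
        rows.append([el["type"], el["pt"], el["eta"],
                     math.sin(el["phi"]), math.cos(el["phi"]), el["energy"]])
    return rows


def phi_from_row(row):
    """Recover phi from the sin_phi and cos_phi columns."""
    return math.atan2(row[3], row[4])


# Toy elements (made-up values): a track, a PS1 hit, a brem, an ECAL cluster
toy_elements = [
    {"type": 1, "pt": 12.0, "eta": 0.5, "phi": 3.0, "energy": 13.5},
    {"type": 2, "pt": 0.0, "eta": 1.7, "phi": 0.2, "energy": 0.01},
    {"type": 7, "pt": 0.0, "eta": 0.5, "phi": 3.0, "energy": 0.5},
    {"type": 4, "pt": 0.0, "eta": 0.4, "phi": -2.9, "energy": 7.0},
]
rows = make_rows(toy_elements)
print(len(rows), [r[0] for r in rows])
```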
This is a 2D array of the same length as X, (num_elements, num_truth_features), representing the ground truth particle corresponding to each input element.
- Content: Each row is a target particle constructed from one or more simulated CaloParticles that are associated with the corresponding input element. If multiple CaloParticles are linked to a single element, their four-vectors are summed to form one target particle.
- Features: The features are defined by the Y_FEATURES list in cms_utils.py. They include:
  - typ_idx: An integer index for the particle type, where specific PDGIDs are mapped to a simplified set of classes (e.g., ch.had, n.had, gamma, ele, mu) defined in CLASS_NAMES_CMS.
  - Kinematics: charge, pt, eta, sin_phi, cos_phi, energy.
  - Provenance: ispu (a flag for pileup), generatorStatus, and simulatorStatus.
  - jet_idx: An index indicating which targetjet this particle belongs to. A value of -1 means it's not part of a jet. This information is currently not used by the algorithm.
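The four-vector summation mentioned above can be illustrated with a short sketch. It converts (pt, eta, phi, energy) to Cartesian components, sums them, and converts back to the kinematic features listed for ytarget. The helper names are hypothetical, and how the merged particle's type and charge are chosen is not covered here.

```python
import math


def to_p4(pt, eta, phi, energy):
    """Convert (pt, eta, phi, energy) to Cartesian (px, py, pz, E)."""
    return (pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta), energy)


def merge_particles(particles):
    """Sum four-vectors; return (pt, eta, sin_phi, cos_phi, energy)."""
    px = py = pz = e = 0.0
    for p in particles:
        x, y, z, t = to_p4(*p)
        px, py, pz, e = px + x, py + y, pz + z, e + t
    pt = math.hypot(px, py)
    eta = math.asinh(pz / pt) if pt > 0 else 0.0
    phi = math.atan2(py, px)
    return (pt, eta, math.sin(phi), math.cos(phi), e)


# Two toy particles (made-up values) linked to the same input element
merged = merge_particles([
    (10.0, 0.0, 0.0, 10.0),
    (10.0, 0.0, math.pi / 2, 10.0),
])
print(merged)
```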
This is a 2D array with the same shape as ytarget, representing the "baseline" truth from the standard CMS PFCandidate reconstruction.
- Content: Each row corresponds to the reconstructed PFCandidate that was primarily associated with the input element. This allows for a direct comparison between the ML model's output and the standard reconstruction.
- Features: It has the same feature set as ytarget, as defined by Y_FEATURES.
These are event-level 2D arrays containing kinematic information, used for quickly cross-checking the reconstruction in the ML training scripts.
- genjets: Jets clustered using the anti-kT algorithm (R=0.4) from stable Pythia generator particles (excluding neutrinos). Each row is a jet with (pt, eta, phi, energy).
- targetjets: Jets clustered from the ytarget truth particles, using the same algorithm. Each row is a jet with (pt, eta, phi, energy).
- genmet: The generator-level missing transverse energy (pt, phi) from the original ROOT file.
- no pileup: CMSSW_15_0_5, auto:phase1_2023_realistic, Realistic25ns13p6TeVEarly2023Collision
  - TTbar_14TeV_TuneCUETP8M1_cfi → cms_pf_ttbar_nopu
  - ZTT_All_hadronic_14TeV_TuneCUETP8M1_cfi → cms_pf_ztt_nopu
  - QCDForPF_14TeV_TuneCUETP8M1_cfi → cms_pf_qcd_nopu
- with pileup: CMSSW_15_0_5, auto:phase1_2023_realistic, Realistic25ns13p6TeVEarly2023Collision, Run3_Flat55To75_PoissonOOTPU, /RelValMinBias_14TeV/CMSSW_14_1_0_pre7-140X_mcRun3_2024_realistic_v21_STD_MinBias_2026D110_GenSim-v1/GEN-SIM
  - TTbar_14TeV_TuneCUETP8M1_cfi → cms_pf_ttbar
  - ZTT_All_hadronic_14TeV_TuneCUETP8M1_cfi → cms_pf_ztt
  - QCDForPF_14TeV_TuneCUETP8M1_cfi → cms_pf_qcd
If you want to regenerate ML training samples from scratch with CMSSW, check the scripts
mlpf/data_cms/genjob_nopu.sh
mlpf/data_cms/genjob_pu55to75.sh
Copy the datasets from xrootd (about 1.8TB of disk space required):
gfal-copy -r root://xrootd.hep.kbfi.ee:1094//store/user/jpata/mlpf/tensorflow_datasets/2.8.0 ./
Download the pytorch distribution:
wget https://jpata.web.cern.ch/jpata/pytorch.simg:2024-12-03 -O pytorch.simg
On a machine with a single GPU, the following is a quick test of the training workflow
singularity exec --env CUDA_VISIBLE_DEVICES=0 -B /scratch/persistent --nv \
--env PYTHONPATH=`pwd` \
--env KERAS_BACKEND=torch \
pytorch.simg python3.10 mlpf/pipeline.py --dataset cms --gpus 1 \
--data-dir ./tensorflow_datasets --config parameters/pytorch/pyg-cms.yaml \
--train --test --make-plots --num-epochs 10 --gpu-batch-multiplier 1 \
--num-workers 4 --prefetch-factor 100 --checkpoint-freq 1 --ntrain 1000 --ntest 1000 --nvalid 1000
---
config:
markdownAutoWrap: false
---
graph TD;
subgraph genjob [genjob_pu55to75.sh, genjob_nopu.sh]
Samples(TTbar_14TeV_TuneCUETP8M1_cfi.py)-->|cmsDriver.py| gensim(standard GEN-SIM-RECO)
gensim -->|PFAnalysisNtuplizer.cc| pfntuple(PFElements, CaloParticles, SimClusters: flat *.root)
end
subgraph dataprep [Dataset preprocessing]
pfntuple-->|postprocessing2.py| postprocessing(MLPF inputs and targets: *.pkl.bz2);
postprocessing -->|tfds build heptfds/cms_pf/ttbar.py| tfds(ML dataset splits 1-10: *.tfrecords)
end
pfntuple -->|mlpf/data/cms/plot_cms.py| dataset_plots(Dataset plots: *.pkl)
postprocessing -->|mlpf/data/cms/plot_cms.py| dataset_plots
subgraph ml [ML training & eval]
tfds -->|mlpf/pipeline.py --train ...| checkpoints(checkpoint-epoch-loss.pth)
checkpoints -->|mlpf/pipeline.py --load checkpoint.pth --test ... | predictions(Predictions: *.parquet)
checkpoints -->|cms-validate-onnx.ipynb| onnx(ONNX model: *.onnx)
predictions -->|mlpf/pipeline.py --load checkpoint.pth --make-plots | eval_plots(Validation plots: *.pdf)
end
subgraph inference
onnx -->|cmsDriver ... -s RECO ... --procModifiers mlpf| mlpfnanoaod(BTV NANOAOD)
mlpfnanoaod -->|cmssw-validation.ipynb| cmsswplots(CMSSW validation plots: *.pdf)
mlpfnanoaod -->|cmssw-validation-data.ipynb| cmsswplots(CMSSW validation plots: *.pdf)
end