11 changes: 6 additions & 5 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -11,8 +11,9 @@ docs/_build
# pixi environments
.pixi/*
!.pixi/config.toml


# data files
outputs/
preproc_uproot/
# others
preproc_uproot/
skimmed/
test*
output*
_*
146 changes: 139 additions & 7 deletions README.md
@@ -72,22 +72,30 @@ pip install --upgrade pip
pip install -r requirements.txt
```

#### Data Pre-processing
#### Data Skimming

The analysis expects pre-processed data files. If you do not have them, you can generate them by running the pre-processing step. This will download the necessary data from the CERN Open Data Portal and skim it according to the configuration.
The analysis expects skimmed data files. If you do not have them, you can generate them by running the skimming step. This will download the necessary data from the CERN Open Data Portal and skim it according to the configuration.

```bash
# This command overrides the default config to run only the pre-processing step.
# It may take a while to download and process the data.
python run.py general.run_preprocessing=True general.run_mva_training=False general.analysis=nondiff general.run_histogramming=False general.run_statistics=False
# This command runs only the skimming step to produce skimmed files
python analysis.py general.run_skimming=True general.analysis=skip

# Or run skimming and then analysis in one command
python analysis.py general.run_skimming=True
```

The skimming system provides three modes:

1. **Skim-only mode**: `general.analysis=skip` - Only performs skimming, no analysis
2. **Skim-and-analyse mode**: `general.run_skimming=True` - Skims data then runs analysis
3. **Analysis-only mode**: `general.run_skimming=False` - Uses existing skimmed files for analysis

### 2. Run the Differentiable Analysis

Once the pre-processed data is available, you can run the main analysis with a single command:
Once the skimmed data is available, you can run the main analysis with a single command:

```bash
python run.py
python analysis.py
```

### 3. What is Happening?
@@ -121,9 +129,18 @@ The default configuration (`user/configuration.py`) is set up to perform a diffe
- [1. The Configuration File (`user/configuration.py`)](#1-the-configuration-file-userconfigurationpy)
- [2. Defining Analysis Logic](#2-defining-analysis-logic)
- [3. Running the Analysis](#3-running-the-analysis)
- [Config-Driven Skimming Framework](#config-driven-skimming-framework)
- [Dataset Configuration](#dataset-configuration)
- [Skimming Configuration](#skimming-configuration)
- [Selection Functions](#selection-functions)
- [Integration with Main Configuration](#integration-with-main-configuration)
- [Usage Examples](#usage-examples)
- [Advanced Features](#advanced-features)
- [Configuration Reference](#configuration-reference)
- [`general` Block](#general-block)
- [`preprocess` Block](#preprocess-block)
- [`datasets` Block](#datasets-block)
- [`skimming` Block](#skimming-block)
- [`jax` Block](#jax-block)
- [`mva` Block](#mva-block)
- [`channels` Block](#channels-block)
@@ -303,6 +320,91 @@ The allowed top-level keys for CLI overrides are:

Attempting to override other keys (e.g., `jax.params`) will result in an error. To change these, you must edit the `user/configuration.py` file directly.
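The restriction above can be sketched as a small allow-list check. This is a hypothetical illustration only, not the actual `load_config_with_restricted_cli` implementation in `utils/schema.py`, and the allow-list contents shown are an assumption:

```python
# Hypothetical sketch of restricted `section.key=value` CLI overrides.
ALLOWED_TOP_LEVEL = {"general", "preprocess"}  # assumption: the real allow-list may differ

def apply_cli_overrides(config: dict, args: list) -> dict:
    """Apply `section.key=value` overrides, rejecting disallowed sections."""
    for arg in args:
        key, _, raw = arg.partition("=")
        section, _, field = key.partition(".")
        if section not in ALLOWED_TOP_LEVEL:
            raise ValueError(f"Override of '{section}' is not allowed")
        # Minimal literal parsing: map booleans, leave everything else a string
        value = {"True": True, "False": False}.get(raw, raw)
        config.setdefault(section, {})[field] = value
    return config
```

With this sketch, `general.run_skimming=True` is accepted while `jax.params=...` raises an error, mirroring the behaviour described above.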

## Skimming Integration

The framework provides an integrated skimming system that handles data preprocessing before analysis.

### Usage Modes

The skimming system operates in three modes:

1. **Skim-only**: `general.analysis=skip` - Only performs skimming, no analysis
2. **Skim-and-analyse**: `general.run_skimming=True` - Skims data then runs analysis
3. **Analysis-only**: `general.run_skimming=False` - Uses existing skimmed files

### Dataset Configuration

The dataset manager expects text files containing lists of ROOT file paths. Configure datasets in `user/skim.py` by pointing to these text files:

```python
# user/skim.py - See existing implementation for details
dataset_manager_config = {
"datasets": [
{
"name": "signal",
"directory": "datasets/signal/", # Directory containing .txt files with ROOT file lists
"cross_section": 1.0,
},
# ... other datasets
]
}
```

Each dataset directory should contain `.txt` files where each line is a path to a ROOT file.
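As a hedged sketch of how such a directory can be resolved into ROOT file paths (the helper name is hypothetical; the real logic lives in the dataset manager in `utils/datasets.py`):

```python
from pathlib import Path

def collect_root_files(directory: str) -> list:
    """Read every .txt file in `directory`; each non-empty line is a ROOT file path."""
    files = []
    for txt in sorted(Path(directory).glob("*.txt")):
        for line in txt.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):  # assumption: comment lines allowed
                files.append(line)
    return files
```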

### Skimming Configuration

Define your skimming selection in `user/cuts.py` (see `default_skim_selection` for reference) and configure it in `user/skim.py`:

```python
# user/skim.py - See existing implementation for details
skimming_config = {
"nanoaod_selection": {
"function": default_skim_selection,
"use": [("Muon", None), ("Jet", None), ("PuppiMET", None), ("HLT", None)]
},
"uproot_cut_string": "HLT_TkMu50*(PuppiMET_pt>50)",
# ... other settings
}
```
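The `uproot_cut_string` above multiplies a boolean trigger flag by a comparison, which acts as a logical AND of the two cuts. A toy numpy illustration (the event values are made up):

```python
import numpy as np

# Toy per-event inputs mimicking the branches used in the cut string
HLT_TkMu50 = np.array([1, 1, 0, 1], dtype=bool)   # trigger decisions
PuppiMET_pt = np.array([60.0, 30.0, 80.0, 55.0])  # MET in GeV

# Same expression as "HLT_TkMu50*(PuppiMET_pt>50)"
mask = HLT_TkMu50 * (PuppiMET_pt > 50)
# → [True, False, False, True]: events must both fire the trigger and pass the MET cut
```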

### Integration

Connect the configurations in `user/configuration.py`:

```python
# user/configuration.py - See existing implementation for details
from user.skim import dataset_manager_config, skimming_config

config = {
"general": {
"run_skimming": False, # Set to True to enable
},
"preprocess": {
"skimming": skimming_config
},
"datasets": dataset_manager_config,
# ... rest of configuration
}
```

### Running

```bash
# Skim and analyse
python analysis.py general.run_skimming=True

# Skim only
python analysis.py general.run_skimming=True general.analysis=skip

# Analyze with existing skimmed files
python analysis.py
```

The framework automatically manages file paths, creates output directories (`{output_dir}/skimmed/`), and handles the transition from skimming to analysis.
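A minimal sketch of that path layout, assuming the documented `{output_dir}/{dataset}/file__{idx}/part_X.root` pattern (the helper itself is hypothetical):

```python
from pathlib import Path

def skimmed_path(output_dir: str, dataset: str, file_idx: int, part: int) -> Path:
    """Build the skimmed-file path following the documented directory pattern."""
    return Path(output_dir) / dataset / f"file__{file_idx}" / f"part_{part}.root"
```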

---

## Configuration Reference

The analysis is controlled by a central configuration dictionary, typically defined in `user/configuration.py`.
@@ -348,6 +450,36 @@ Settings for the initial data skimming and filtering step.
| `branches` | `dict` | *Required* | Mapping of collection names to branch lists. |
| `ignore_missing` | `bool` | `False` | Ignore missing branches if `True`. |
| `mc_branches` | `dict` | *Required* | Additional branches for MC samples. |
| `skimming` | `dict` | `None` | Skimming configuration (see `skimming` block below). |

---

### `datasets` Block

List of dataset configurations defining data sample properties.

| Parameter | Type | Default | Description |
|------------------|------------|-------------|-----------------------------------------------------|
| `name` | `str` | *Required* | Unique dataset identifier. |
| `directory` | `str` | *Required* | Path to dataset files. |
| `cross_section` | `float` | *Required* | Cross-section in picobarns (pb). |
| `tree_name` | `str` | `"Events"` | ROOT tree name. |
| `weight_branch` | `str` | `"genWeight"` | Event weight branch name. |
| `metadata` | `dict` | `{}` | Additional dataset metadata. |
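The `cross_section` and `weight_branch` fields are conventionally combined with the integrated luminosity to normalise MC samples. A hedged sketch of that standard recipe (not necessarily this framework's exact code):

```python
import numpy as np

def event_weights(gen_weight: np.ndarray, cross_section_pb: float, lumi_pb: float) -> np.ndarray:
    """Per-event weights scaling the sample to lumi * sigma expected events."""
    # Standard MC normalisation: L * sigma * w_i / sum(w)
    return lumi_pb * cross_section_pb * gen_weight / np.sum(gen_weight)
```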

---

### `skimming` Block

Configuration for the data skimming step (part of `preprocess` block).

| Parameter | Type | Default | Description |
|----------------------|------------|-------------------|------------------------------------------------|
| `selection_function` | `Callable` | *Required* | Selection function that returns a PackedSelection object. |
| `selection_use` | `list[tuple]` | *Required* | List of (object, variable) tuples specifying inputs for the selection function. |
| `output_dir` | `str` | *Required* | Base directory for skimmed files. Files follow the structure `{output_dir}/{dataset}/file__{idx}/part_X.root`. |
| `chunk_size` | `int` | `100000` | Number of events processed per chunk (retained for configuration compatibility). |
| `tree_name` | `str` | `"Events"` | ROOT tree name for input and output files. |

---

84 changes: 48 additions & 36 deletions analysis.py
@@ -7,39 +7,30 @@
"""
import logging
import sys
import warnings

from coffea.nanoevents import NanoAODSchema, NanoEventsFactory

from analysis.diff import DifferentiableAnalysis
from analysis.nondiff import NonDiffAnalysis
from user.configuration import config as ZprimeConfig
from utils.input_files import construct_fileset
from utils.logging import ColoredFormatter
from utils.datasets import ConfigurableDatasetManager
from utils.logging import setup_logging, log_banner
from utils.schema import Config, load_config_with_restricted_cli
from utils.metadata_extractor import NanoAODMetadataGenerator
from utils.skimming import process_workitems_with_skimming

# -----------------------------
# Logging Configuration
# -----------------------------
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(ColoredFormatter())
if root_logger.hasHandlers():
root_logger.handlers.clear()
root_logger.addHandler(handler)
setup_logging()

logger = logging.getLogger("AnalysisDriver")
logging.getLogger("jax._src.xla_bridge").setLevel(logging.ERROR)

# ANSI color codes
MAGENTA = "\033[95m"
RESET = "\033[0m"
NanoAODSchema.warn_missing_crossrefs = False
warnings.filterwarnings("ignore", category=FutureWarning, module="coffea.*")

def _banner(text: str) -> str:
"""Creates a magenta-colored banner for logging."""
return (
f"\n{MAGENTA}\n{'=' * 80}\n"
f"{' ' * ((80 - len(text)) // 2)}{text.upper()}\n"
f"{'=' * 80}{RESET}"
)
# -----------------------------
# Main Driver
# -----------------------------
@@ -52,31 +43,52 @@ def main():
full_config = load_config_with_restricted_cli(ZprimeConfig, cli_args)
config = Config(**full_config) # Pydantic validation
logger.info(f"Luminosity: {config.general.lumi}")
dataset_manager = (
ConfigurableDatasetManager(config.datasets) if config.datasets else None
)

fileset = construct_fileset(
max_files_per_sample=config.general.max_files
)
logger.info(log_banner("metadata and workitems extraction"))
# Generate metadata and fileset from NanoAODs
generator = NanoAODMetadataGenerator(dataset_manager=dataset_manager)
generator.run(generate_metadata=config.general.run_metadata_generation)
fileset = generator.fileset
workitems = generator.workitems
if not workitems:
logger.error("No workitems available. Please ensure metadata generation completed successfully.")
sys.exit(1)

analysis_mode = config.general.analysis
if analysis_mode == "nondiff":
logger.info(_banner("Running Non-Differentiable Analysis"))
nondiff_analysis = NonDiffAnalysis(config)
nondiff_analysis.run_analysis_chain(fileset)
logger.info(log_banner("SKIMMING AND PROCESSING"))
logger.info(f"Processing {len(workitems)} workitems")

# Process workitems with dask-awkward
processed_datasets = process_workitems_with_skimming(workitems, config, fileset, generator.nanoaods_summary)


analysis_mode = config.general.analysis
if analysis_mode == "skip":
logger.info(log_banner("Skim-Only Mode: Skimming Complete"))
logger.info("✅ Skimming completed successfully. Analysis skipped as requested.")
logger.info("Skimmed files are available in the configured output directories.")
return
elif analysis_mode == "nondiff":
logger.info(log_banner("Running Non-Differentiable Analysis"))
nondiff_analysis = NonDiffAnalysis(config, processed_datasets)
nondiff_analysis.run_analysis_chain()
elif analysis_mode == "diff":
logger.info(_banner("Running Differentiable Analysis"))
diff_analysis = DifferentiableAnalysis(config)
diff_analysis.run_analysis_optimisation(fileset)
else:
logger.info(_banner("Running both Non-Differentiable and Differentiable Analysis"))
logger.info(log_banner("Running Differentiable Analysis"))
diff_analysis = DifferentiableAnalysis(config, processed_datasets)
diff_analysis.run_analysis_optimisation()
else: # "both"
logger.info(log_banner("Running both Non-Differentiable and Differentiable Analysis"))
# Non-differentiable analysis
logger.info("Running Non-Differentiable Analysis")
nondiff_analysis = NonDiffAnalysis(config)
nondiff_analysis.run_analysis_chain(fileset)
nondiff_analysis = NonDiffAnalysis(config, processed_datasets)
nondiff_analysis.run_analysis_chain()
# Differentiable analysis
logger.info("Running Differentiable Analysis")
diff_analysis = DifferentiableAnalysis(config)
diff_analysis.run_analysis_optimisation(fileset)
diff_analysis = DifferentiableAnalysis(config, processed_datasets)
diff_analysis.run_analysis_optimisation()


if __name__ == "__main__":
5 changes: 4 additions & 1 deletion analysis/base.py
@@ -60,7 +60,7 @@ def is_jagged(array_like: ak.Array) -> bool:
class Analysis:
"""Base class for physics analysis implementations."""

def __init__(self, config: Dict[str, Any]) -> None:
def __init__(self, config: Dict[str, Any], processed_datasets: Optional[Dict[str, List[Tuple[Any, Dict[str, Any]]]]] = None) -> None:
"""
Initialize analysis with configuration for systematics, corrections,
and channels.
@@ -73,11 +73,14 @@ def __init__(self, config: Dict[str, Any]) -> None:
- 'corrections': Correction configurations
- 'channels': Analysis channel definitions
- 'general': General settings including output directory
processed_datasets : Optional[Dict[str, List[Tuple[Any, Dict[str, Any]]]]], optional
Pre-processed datasets from skimming, by default None
"""
self.config = config
self.channels = config.channels
self.systematics = config.systematics
self.corrections = config.corrections
self.processed_datasets = processed_datasets
self.corrlib_evaluators = self._load_correctionlib()
self.dirs = self._prepare_dirs()
