Split big data eras into multiple anaTuples #222

Merged
kandrosov merged 4 commits into cms-flaf:main from kandrosov:dataSplit2
Feb 3, 2026

@kandrosov
Contributor

No description provided.

Contributor

Copilot AI left a comment

Pull request overview

This PR implements the ability to split large data eras into multiple anaTuple output files, addressing scalability issues with very large data eras. The changes introduce a configurable nEventsPerFile parameter that can be set differently for data vs. MC, defaulting to 1,000,000 events for data and 100,000 for MC.
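The per-group lookup described above might be resolved along these lines. This is a minimal sketch, not the code from the PR: the helper name `resolve_n_events_per_file`, the group keys `"data"` and `"mc"`, and the acceptance of a bare integer are assumptions made for illustration. Only the default values come from the overview.

```python
# Hypothetical helper: names and config layout are assumptions, not the PR's API.
# Only the two default values are taken from the PR overview.
DEFAULT_N_EVENTS_PER_FILE = {"data": 1_000_000, "mc": 100_000}


def resolve_n_events_per_file(setting, process_group):
    """Return the events-per-file threshold for one process group.

    `setting` may be None (use the defaults), an int (same value for every
    group), or a dict mapping group name -> int (per-group override).
    """
    if process_group not in DEFAULT_N_EVENTS_PER_FILE:
        raise KeyError(f"unknown process group: {process_group!r}")
    if setting is None:
        return DEFAULT_N_EVENTS_PER_FILE[process_group]
    if isinstance(setting, int):
        return setting
    # Dict-based configuration: fall back to the default for missing groups.
    return setting.get(process_group, DEFAULT_N_EVENTS_PER_FILE[process_group])
```

With this shape, a configuration that overrides only one group leaves the other at its default.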

Changes:

  • Unified merge strategy creation through a new CreateMergeStrategy wrapper function that dispatches to data or MC-specific strategies
  • Enhanced CreateDataMergeStrategy to split data eras into multiple output files based on event count thresholds
  • Added handling for oversized input files in CreateMCMergeStrategy to prevent processing failures
  • Made nEventsPerFile configurable per process group with dictionary-based configuration
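The dispatch in the first bullet can be pictured as a thin wrapper that picks one of two strategy functions. The sketch below is illustrative only: the signatures of `CreateMergeStrategy`, `CreateDataMergeStrategy`, and `CreateMCMergeStrategy` are not shown on this page, so the snake_case stand-ins, their arguments, and their return values are all hypothetical.

```python
# Stand-in strategies. The real functions' signatures and return types are
# not visible in this PR page, so everything here is a hypothetical shape.
def data_strategy(files, n_events_per_file):
    return {"kind": "data", "n_events_per_file": n_events_per_file,
            "inputs": list(files)}


def mc_strategy(files, n_events_per_file):
    return {"kind": "mc", "n_events_per_file": n_events_per_file,
            "inputs": list(files)}


def create_merge_strategy(files, is_data, n_events_per_file):
    """Single entry point that forwards to the data or MC strategy."""
    strategy = data_strategy if is_data else mc_strategy
    return strategy(files, n_events_per_file)
```

Callers then need only one call site, and the data/MC distinction lives in a single place.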

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| AnaProd/tasks.py | Updated to use unified CreateMergeStrategy function and enhanced nEventsPerFile configuration to support per-process-group values |
| AnaProd/AnaTupleFileList.py | Added wrapper function CreateMergeStrategy, modified CreateDataMergeStrategy to generate multiple output files per era, and added oversized file handling to CreateMCMergeStrategy |
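One way to produce several output files per era from an event-count threshold, while tolerating an oversized input, is a greedy pass over the inputs. The following is a sketch under assumptions, not the algorithm used in `AnaTupleFileList.py`: the function name `split_by_event_count`, the `(name, n_events)` input format, and the choice to give an oversized input its own chunk are all hypothetical.

```python
def split_by_event_count(input_files, n_events_per_file):
    """Group (name, n_events) pairs into consecutive chunks.

    A chunk is closed as soon as adding the next input would push it past
    `n_events_per_file`. An input that alone exceeds the threshold gets a
    chunk of its own instead of raising an error (an assumed policy).
    """
    if n_events_per_file <= 0:
        raise ValueError("n_events_per_file must be positive")
    chunks, current, current_events = [], [], 0
    for name, n_events in input_files:
        if current and current_events + n_events > n_events_per_file:
            chunks.append(current)
            current, current_events = [], 0
        current.append(name)
        current_events += n_events
    if current:
        chunks.append(current)
    return chunks


# Made-up file names and event counts, for illustration only.
era_files = [("a.root", 600_000), ("b.root", 500_000), ("c.root", 300_000)]
chunks = split_by_event_count(era_files, 1_000_000)
# chunks == [["a.root"], ["b.root", "c.root"]]
```

Each inner list would correspond to one output anaTuple file. A greedy pass keeps the input order but does not balance chunk sizes.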

@kandrosov
Contributor Author

@cms-flaf-bot please test

@cms-flaf-bot
Collaborator

pipeline#13949620 started

@cms-flaf-bot
Collaborator

pipeline#13949620 passed

@kandrosov merged commit 04c920d into cms-flaf:main on Feb 3, 2026
5 checks passed
@kandrosov deleted the dataSplit2 branch on February 3, 2026, 17:54

2 participants