Skip to content

Merge COSIDAG development branch with cleaned examples and workflow improvements#7

Open
falric05 wants to merge 99 commits intomainfrom
pr/dev_bgo
Open

Merge COSIDAG development branch with cleaned examples and workflow improvements#7
falric05 wants to merge 99 commits intomainfrom
pr/dev_bgo

Conversation

@falric05
Copy link
Contributor

Summary

This Pull Request merges the development work from pr/dev_bgo into main.

The branch has been cleaned before opening the PR, but it still represents
a cumulative set of changes developed over time.

Main Changes

  • Added several clean and self-contained COSIDAG example DAGs
  • Improved and reorganized example workflows for documentation and tutorials
  • Disabled automatic retrigger in example COSIDAGs for didactic clarity
  • Removed unused, experimental, or redundant files from the branch
  • Minor cleanup and refactoring related to DAG structure and configuration

Motivation

The goal of this PR is to provide clear and realistic COSIDAG examples that
can be used as reference material for new users of cosiflow, as well as
to improve overall maintainability and clarity of example workflows.

Scope and Impact

  • This PR mainly affects examples, documentation, and helper utilities
  • No changes to production pipelines are intended
  • Core COSIDAG logic is not modified

Notes for Reviewers

  • The PR is relatively large because it merges a long-lived development branch
  • Changes are logically grouped by commits where possible
  • Please focus the review on correctness, clarity, and consistency of the examples

falric05 and others added 30 commits April 24, 2025 11:58
* Add setup script and update entrypoint for Airflow environment initialization

* Updates documentation for creating .env file and removes setup script

* Add an initialization DAG to run the initialization script every two hours and launch the main DAG

* Add Dockerfiles and entrypoint scripts for Airflow and web GUI setup

* Remove the 'heasarc' directory creation from the Dockerfiles and the web GUI entrypoint script to make the given folder be mounted as the 'heasarc' directory

* Refactor file handling in DataPipeline to use shutil.move for better directory management

* Add an interface to browse PDF files in DL0 directory

* Update .gitignore to exclude all files in the given folder except explorer.js and index.html

* Update .gitignore to exclude all files in the data directory except explorer.js and index.html

* Update the initialization DAG to start immediately and improve task management

* Remove unused timedelta import in cosipipe_cosipy.py

* Update cosipipe_cosipy.py

Removed commented line code.

* Rename initialization DAG to 'cosipy_contactsimulator' for clarity

* Update README.md to improve DAG build and testing instructions

* Update UI text for clarity and consistency in explorer.js and index.html

* Updates the instructions in README.md for building and running Docker on Mac and Linux, improving clarity and consistency.

* Set the start date of the DAG 'cosipy_contactsimulator' to a specific time to avoid unexpected behavior
…n the airflow service, and updated the data path in the docker-compose
- Created DAG `cosiflow_alert_monitor` that periodically reads log file `data_pipeline.log`
- Implemented `alert_manager` module with error parsing, YAML rules, deduplication and notification sending
- Integration with Mailhog for local email sending testing
- Added Airflow plugin to access Mailhog Web UI via "Develop tools" menu
- Added support for SMTP environment variables via `.env`
…tatements. Add new notify_email and removed central logging file
…structure

- Changed data mount path in docker-compose.yaml to align with new directory structure.
- Added Conda Terms of Service acceptance in Dockerfile for required channels.
- Enhanced entrypoint script to export COSI directory structure environment variables and create necessary directories if not present.
- Updated Python version comment in environment.yml for clarity.
…ad functionality

- Introduced `heasarc_explorer_plugin` for browsing data files in a specified directory.
- Implemented Flask routes for home, folder navigation, and file downloads.
- Added a basic HTML template for the data explorer interface.
- Created a view plugin for integration with Airflow's app builder.
- Removed the obsolete `dl3_explorer_view_plugin` to streamline the codebase.
…lowing all file types

- Updated `explorer_home` and `explorer_folder` functions to include `current_path` in the template context.
- Modified file listing in `explorer_folder` to show all file types instead of just PDFs.
- Added a visual element in `explorer.html` to display the current path for better user navigation.
… directory paths

- Updated directory path definitions in `DataPipeline` to utilize environment variables for better flexibility.
- Ensured the input directory is created if it doesn't exist and adjusted the inotify watch to monitor the base directory directly.
… COSI installation

- Added `unzip` to the list of installed packages for the Airflow Docker environment.
- Updated the COSI installation process to install `py7zr` and changed the git checkout to version `v0.3.x` for compatibility.
… and navigation improvements

- Added tags for better organization in the `fail_task` DAG.
- Enhanced `explorer.html` with additional CSS comments for clarity.
- Improved navigation button descriptions and added comments for folder and file link functionalities.
- Introduced `cosipipe_tsmap__extpythonenv.py` for multi-task TS map computation with external Python environment.
- Added `cosipipe_tsmap__singletask__extpythonenv.py` for optimized single-task execution of the TS map pipeline.
- Created `cosipipe_tsmap_mulres.py` for multi-resolution TS map computation.
- Implemented `cosipipe_tsmap.py` for standard TS map computation.
- Developed supporting scripts for data preparation, binning, aggregation, and TS map computation.
- Enhanced `tsmap_pipeline.py` to manage the entire TS map processing workflow, supporting both standard and multi-resolution modes.
- Added detailed logging and error handling for improved pipeline robustness.
- Included cleanup tasks for better resource management post-execution.
- Introduced `dag_parallel_test_1` and `dag_parallel_test_2` for parallel task execution.
- Each DAG includes two BashOperator tasks that simulate a 60-second sleep.
- Configured with a maximum of 2 active runs and a concurrency of 3 for testing parallelism.
…vices

- Added a new volume mapping for the pipeline directory to the Airflow service configuration, enhancing the environment setup for pipeline execution.
- Introduced `cosipipe_lightcurve.py` DAG to automate the process of generating GRB light curves from newly arrived compressed folders.
- Implemented a sequence of tasks including waiting for new archives, decompressing them, binning GRB sources and backgrounds, and plotting the light curve.
- Created `cosipipe_lc_ops.py` with utility functions for archive decompression, input validation, and data binning, ensuring modularity and reusability.
- Enhanced the pipeline's robustness with error handling and validation checks for required input files.
…vigation

- Added a new route for file previewing, allowing users to view file contents directly in the interface.
- Implemented security checks to ensure safe access to files and directories.
- Enhanced the HTML template with a two-column layout for file listings and previews, improving user experience.
- Included JavaScript functionality for single-click preview and double-click download actions on files.
- Added error handling and user feedback for file access issues and loading states.
- Created a new example environment file to define essential Airflow environment variables.
- Included settings for Airflow admin credentials, SMTP configuration, and COSI directory structure.
- This file serves as a template for users to set up their local environment for the Airflow application.
- Modified the Dockerfile to set default values for user and group IDs as empty, enabling users to specify their own values during the build process.
- This change enhances flexibility for user management within the Airflow environment.
- Introduced UID, GID, and DISPLAY variables to the .env.example file, allowing users to specify their own bootstrap ID settings for containerized environments.
- This update enhances the configurability of the Airflow environment setup.
…apping

- Added UID and GID environment variables for user-specific configurations in the Postgres, Airflow, and Mailhog services.
- Enabled volume mapping for Postgres data to persist across container restarts.
- Updated Airflow service to utilize an environment file for configuration, improving setup flexibility.
- Changed default values for MY_UID and MY_GID in the Dockerfile to specific integers (12050 and 10000, respectively).
- This update ensures a consistent user and group configuration for the Airflow environment during the build process.
- Changed the SQLAlchemy connection string to specify the Postgres host and port explicitly, enhancing clarity and ensuring proper connectivity for the Airflow environment.
- Introduced new functions for reading configuration values from Airflow Variables, environment variables, and defaults, enhancing flexibility in managing settings.
- Added helper functions for retrieving integer, float, and boolean configurations, improving usability for users needing type-specific values.
- Updated documentation to reflect the new configuration capabilities and usage instructions.
- Introduced a new DAG, cosidag_tsmap.py, for processing GRB data through a series of tasks including binning GRB and background data, and computing TS maps.
- Implemented custom Python callables for data binning and TS map computation, utilizing ExternalPythonOperator for execution in a separate environment.
- Added a new module, cosipipe_tsmap_ops_cosidag.py, containing inlined logic from previous scripts for GRB data processing, including functions for binning and TS map generation.
- Enhanced configuration management by integrating Airflow Variables and environment variables for flexibility in data handling.
- Updated documentation to reflect the new DAG structure and operational details for users.
- Introduced parameters for managing parallelism in the COSIDAG, including max_active_runs, max_active_tasks, and concurrency limits.
- These enhancements allow for better resource management and task execution control within the DAG, improving overall performance and efficiency.
- Added an `auto_retrig` parameter to the COSIDAG class to control automatic DAG retriggering upon detecting new files.
- Updated the logic to conditionally create the `TriggerDagRunOperator` based on the `auto_retrig` setting, improving flexibility in task execution.
- Adjusted documentation to reflect the new parameter and its impact on the DAG workflow.
- Updated the `_date_filter_ok` function to accept a list of date queries, improving flexibility in filtering subfolders based on date criteria.
- Introduced helper functions for parsing date strings and applying date queries, enhancing code readability and maintainability.
- Modified the `_find_new_folder` and `COSIDAG` class to accommodate the new date query logic, ensuring consistent behavior across the module.
- Enhanced documentation to reflect changes in date handling and configuration options for users.
- Introduced a new module, date_helper.py, containing functions for parsing date strings and applying date queries.
- Implemented helper functions to enhance date filtering capabilities, improving flexibility and maintainability in date handling.
- Added regex-based checks for folder naming conventions related to dates, ensuring proper identification of date-related folders.
- Changed the date parameter to date_queries in cosidag_example.py and cosidag_tsmap.py for improved date filtering consistency.
- Introduced a new DAG, cosidag_lcurve.py, for processing light curve data, including custom Python callables for data binning and plotting.
- Enhanced configuration options for the new DAG, including monitoring folders and file patterns for GRB and background data.
- Introduced a new module, cosipipe_lc_ops_cosidag.py, containing functions for binning GRB source and background data, as well as plotting light curves.
- Implemented self-contained functions that do not rely on DAG-level globals, enhancing modularity and reusability.
- Added configuration management for binning processes, including YAML file generation for input parameters.
- Enhanced logging for better traceability during data processing steps.
- Updated comments for better readability, changing Italian to English for consistency.
- Improved the handling of the MAILHOG_WEBUI_URL environment variable by removing unnecessary quotes around the default value.
- Enhanced readability by organizing tasks within their respective DAG contexts.
- Updated descriptions and tags for better categorization and understanding of each DAG's purpose.
- Introduced comprehensive documentation for the COSIDAG module, detailing its purpose, workflow structure, and configuration options.
- Included a minimal example for implementation, along with sections on processed folder tracking, disabling automatic retrigger, and integration with the MailHog link plugin.
- Enhanced clarity on how COSIDAG facilitates dynamic file discovery and task orchestration in scientific workflows.
- Updated the docker-compose.yaml file to include a new volume mount for the modules directory, enhancing the Airflow setup by allowing access to additional module resources.
- Created a new "cosipy_develop" environment for COSIPY tools and dependencies, allowing for development on the latest branch.
- Included installation of necessary packages and cloning of the COSIPY repository for development purposes.
- Deleted multiple outdated DAG files including cosipipe_cosipy_external_python.py, cosipipe_cosipy.py, cosipipe_lightcurve.py, and others to streamline the project structure.
- Removed associated pipeline scripts and utility functions that are no longer in use, enhancing code maintainability and clarity.
- This cleanup is part of an effort to consolidate and refactor the existing workflows for better performance and organization.
- Introduced two new DAGs: cosidag_tutorial_a_svd and cosidag_tutorial_b_reconstruct, demonstrating the process of factorizing a text matrix and reconstructing it from factors, respectively.
- Each DAG includes custom Python and external Python operators for handling data processing and visualization tasks.
- Enhanced the modularity of the workflow by organizing tasks and providing clear documentation within the code.
- These additions aim to provide practical examples for users to understand and utilize the COSIDAG framework effectively.
…ed chaining logic

- Updated the COSIDAG class to allow optional disabling of tasks: `check_new_file` is not created if `monitoring_folders` is empty, and `automatic_retrig` is not created if set to False.
- Improved the chaining logic to adapt automatically based on the existence of tasks, ensuring a more flexible workflow.
- Enhanced documentation within the code to clarify the new behavior and usage of date queries.
- Refactored several sections for better readability and maintainability.
- Updated the output file naming convention in the save_bkg_window function to use a more descriptive format, changing from "_cut.fits" to "_window.fits".
- Improved code clarity by replacing the previous output path assignment with a more robust method using os.path.splitext.
- This change enhances the readability and maintainability of the code while ensuring the output file is appropriately named.
- Updated the output file naming convention in the bin_grb_data and bin_background_data functions to correctly handle file extensions.
- Improved code clarity by ensuring the extension is extracted without modifying it, enhancing the robustness of file path generation.
- These changes contribute to better maintainability and accuracy in file handling within the pipeline.
- Introduced a new DAG named `init_pipelines` to manage the initialization and staging of COSI pipeline data.
- Implemented tasks for preparing raw directories, resolving configuration parameters, staging files using an external Python operator, and creating symlinks for processed files.
- Enhanced the workflow with logic to handle file validation and re-downloading of corrupted files, ensuring robustness in data handling.
- The DAG is designed to optimize the staging process with clear documentation and structured task dependencies.
- Modified the background file pattern in both cosidag_lcurve.py and cosidag_tsmap.py to include a more specific naming convention, changing from "Total_BG*_unbinned_*.fits*" to "Total_BG*_unbinned_*_window.fits*".
- This change enhances clarity and ensures that the correct files are targeted during processing, improving the overall accuracy of the data handling workflow.
- Updated the README to introduce a new workflow for initializing and triggering pipelines through the `init_pipelines` DAG, eliminating the need for manual scripts.
- Enhanced clarity by detailing the steps for enabling COSIDAGs and triggering pipelines via the Airflow Web UI.
- Added sections on supported pipelines and important changes from the old workflow to the new, emphasizing a cleaner and more reproducible process.
- Introduced a new README file detailing all available DAGs and COSIDAGs in the repository.
- Included sections on workflow types, purposes, inputs/outputs, task layouts, operator types, and XCom usage for each DAG.
- Enhanced documentation to provide clear guidance on the structure and functionality of the pipelines, facilitating better understanding and usage for developers and users.
- Removed outdated note regarding the COSIFEST 2025 talk.
- Added a new section explaining the COSIDAG concept, its benefits, and its role in scientific pipelines.
- Updated the tutorial section to provide a comprehensive guide on writing and customizing COSIDAGs.
- Included a new section listing available DAGs and COSIDAGs, serving as a catalog for users.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants