@davramov (Contributor)
A CLI tool to monitor the health and run status of our self-hosted Prefect servers across multiple beamlines.

Features

  • Multi-auth support: Handles both splash_auth (BL7011, BL832) and Keycloak (dichroism, BL733, BL931) authentication
  • Deployment-level summaries: Shows run counts by state (completed, failed, running, scheduled) per deployment
  • Failure tracking: Displays failure rates as percentages and time since last failure
  • Color-coded output: Green for healthy, red for failures, blue for running, dim for scheduled/metadata
  • Progress indicator: Shows fetch progress for servers with many runs
  • Overall summary: Aggregates stats across all servers

Requirements
Environment Variables:

    PREFECT_API_KEY     API key for splash_auth servers (BL7011, BL832)
    KC_USERNAME         Keycloak username for keycloak-protected servers
    KC_PASSWORD         Keycloak password for keycloak-protected servers
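
The script reads these at startup. A minimal sketch of the credential check, using only the standard library (the helper name is illustrative, not the actual implementation):

    import os
    import sys

    # Credentials documented above: splash_auth servers use the API key,
    # Keycloak-protected servers use the username/password pair.
    REQUIRED_VARS = ("PREFECT_API_KEY", "KC_USERNAME", "KC_PASSWORD")

    def check_environment() -> None:
        """Warn up front if any expected credential is missing."""
        missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
        if missing:
            print(f"warning: missing environment variables: {', '.join(missing)}",
                  file=sys.stderr)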

Extending
Additional beamlines can be added to the PrefectServer enum by giving each new member its URL and authentication method (splash_auth vs. Keycloak), as sketched below.
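
A rough sketch of what an entry might look like (the field layout and the new BL999 member are assumptions for illustration; the URLs are taken from the report below):

    from enum import Enum

    class AuthMethod(Enum):
        SPLASH_AUTH = "splash_auth"
        KEYCLOAK = "keycloak"

    class PrefectServer(Enum):
        # (name, API base URL, auth method) -- layout is illustrative
        BL7011 = ("BL7011", "https://flow-xpcs.als.lbl.gov", AuthMethod.SPLASH_AUTH)
        BL832 = ("BL832", "https://flow-prd.als.lbl.gov", AuthMethod.SPLASH_AUTH)
        BL931 = ("BL931", "https://flow-931.als.lbl.gov", AuthMethod.KEYCLOAK)
        # New beamline: add a member with its URL and auth method.
        BL999 = ("BL999", "https://flow-999.als.lbl.gov", AuthMethod.KEYCLOAK)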

Usage

python prefect_status.py -H 24       # Last 24 hours
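
-H only sets the lookback window for the run query; a minimal sketch of the argument parsing and cutoff computation (the long option name and function names are assumptions):

    import argparse
    from datetime import datetime, timedelta, timezone

    def parse_args() -> argparse.Namespace:
        parser = argparse.ArgumentParser(description="Prefect server health check")
        parser.add_argument("-H", "--hours", type=int, default=24,
                            help="look back this many hours of flow runs")
        return parser.parse_args()

    def lookback_cutoff(hours: int) -> datetime:
        """Earliest flow-run start time to include in the report."""
        return datetime.now(timezone.utc) - timedelta(hours=hours)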

Example output:

dichroism (https://flow-dichroism.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 10 runs      
  10 runs, 4 failures (40.0%), 2 unhealthy deployments
  ✗ new_file_402_flight_check: scheduled: 3, failed: 2 (40.0%) (last failure: 1h ago)
  ✗ new_file_631_flight_check: scheduled: 3, failed: 2 (40.0%) (last failure: 1h ago)
  ✓ new_file_402_flow: no runs
  ✓ new_file_631_flow: no runs
  ✓ prune_data402: no runs
  ✓ prune_data631: no runs
  ✓ run_dichroism_dispatcher: no runs

BL7011 (https://flow-xpcs.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 5 runs      
  5 runs, all healthy
  ✓ new_file_7011_flight_check: scheduled: 3, completed: 2
  ✓ new_file_7011_flow: no runs
  ✓ prune_data7011: no runs
  ✓ run_7011_dispatcher: no runs

BL733 (https://flow-733.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 5 runs      
  5 runs, all healthy
  ✓ new_file_733_flight_check: scheduled: 3, completed: 2
  ✓ new_file_733_flow: no runs
  ✓ prune_data733: no runs
  ✓ run_733_dispatcher: no runs

BL832 (https://flow-prd.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 286 runs      
  222 runs, 2 failures (0.9%), 1 unhealthy deployments
  ✗ prune_spot832: completed: 42, failed: 2 (4.5%) (last failure: 53m ago)
  ✓ alcf_recon_flow: no runs
  ✓ ingest_dataset: no runs
  ✓ nersc_recon_flow: completed: 64
  ✓ nersc_streaming_flow: no runs
  ✓ new_file_832: no runs
  ✓ prune_alcf832_raw: no runs
  ✓ prune_alcf832_scratch: no runs
  ✓ prune_data832: completed: 49
  ✓ prune_data832_raw: no runs
  ✓ prune_data832_scratch: no runs
  ✓ prune_nersc832_alsdev_pscratch_raw: no runs
  ✓ prune_nersc832_alsdev_pscratch_scratch: no runs
  ✓ run_832_dispatcher: completed: 64
  ✓ test_transfers_832: completed: 1
  ✓ test_transfers_832_grafana: no runs

BL931 (https://flow-931.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 5 runs      
  5 runs, 2 failures (40.0%), 1 unhealthy deployments
  ✗ new_file_931_flight_check: scheduled: 3, failed: 2 (40.0%) (last failure: 1h ago)
  ✓ new_file_931_flow: no runs
  ✓ prune_data931: no runs
  ✓ run_931_dispatcher: no runs

==================================================
OVERALL SUMMARY
==================================================
2/5 servers healthy
247 total runs, 8 failures (3.2%)

@davramov (Contributor Author)

I added an argument (-f) that prints the run names of failed jobs under each deployment. Example:

  ✗ prune_spot832: completed: 42, failed: 2 (4.5%) (last failure: 1h ago)
      └ delete spot832: 20251221_121209_T48_PC7_d.h5 [failed] (1h ago)
      └ delete spot832: 20251221_113153_T45_PC6_10.h5 [failed] (2h ago)
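
For reference, assuming -f combines with -H as expected, the invocation that produces this would be:

    python prefect_status.py -H 24 -f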
