@davramov (Contributor)
A CLI tool to monitor the health and run status of our self-hosted Prefect servers across multiple beamlines.

Features

  • Multi-auth support: Handles both splash_auth (BL7011, BL832) and Keycloak (dichroism, BL733, BL931) authentication
  • Deployment-level summaries: Shows run counts by state (completed, failed, running, scheduled) per deployment
  • Failure tracking: Displays failure rates as percentages and time since last failure
  • Color-coded output: Green for healthy, red for failures, blue for running, dim for scheduled/metadata
  • Progress indicator: Shows fetch progress for servers with many runs
  • Overall summary: Aggregates stats across all servers

Requirements
Environment Variables:

    PREFECT_API_KEY     API key for splash_auth servers (BL7011, BL832)
    KC_USERNAME         Keycloak username for keycloak-protected servers
    KC_PASSWORD         Keycloak password for keycloak-protected servers
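
The script reads these at startup. A minimal sketch of the credential check, using only the standard library (the helper name is illustrative, not the actual implementation):

    import os
    import sys

    # Credentials documented above: splash_auth servers use the API key,
    # Keycloak-protected servers use the username/password pair.
    REQUIRED_VARS = ("PREFECT_API_KEY", "KC_USERNAME", "KC_PASSWORD")

    def check_environment() -> None:
        """Warn up front if any expected credential is missing."""
        missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
        if missing:
            print(f"warning: missing environment variables: {', '.join(missing)}",
                  file=sys.stderr)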

Extending
Additional beamlines can be added to the PrefectServer enum by giving each new member its URL and authentication method (splash_auth vs. Keycloak), as sketched below.
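
A rough sketch of what an entry might look like (the field layout and the new BL999 member are assumptions for illustration; the URLs are taken from the report below):

    from enum import Enum

    class AuthMethod(Enum):
        SPLASH_AUTH = "splash_auth"
        KEYCLOAK = "keycloak"

    class PrefectServer(Enum):
        # (name, API base URL, auth method) -- layout is illustrative
        BL7011 = ("BL7011", "https://flow-xpcs.als.lbl.gov", AuthMethod.SPLASH_AUTH)
        BL832 = ("BL832", "https://flow-prd.als.lbl.gov", AuthMethod.SPLASH_AUTH)
        BL931 = ("BL931", "https://flow-931.als.lbl.gov", AuthMethod.KEYCLOAK)
        # New beamline: add a member with its URL and auth method.
        BL999 = ("BL999", "https://flow-999.als.lbl.gov", AuthMethod.KEYCLOAK)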

Usage

python prefect_status.py -H 24       # Last 24 hours
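
-H only sets the lookback window for the run query; a minimal sketch of the argument parsing and cutoff computation (the long option name and function names are assumptions):

    import argparse
    from datetime import datetime, timedelta, timezone

    def parse_args() -> argparse.Namespace:
        parser = argparse.ArgumentParser(description="Prefect server health check")
        parser.add_argument("-H", "--hours", type=int, default=24,
                            help="look back this many hours of flow runs")
        return parser.parse_args()

    def lookback_cutoff(hours: int) -> datetime:
        """Earliest flow-run start time to include in the report."""
        return datetime.now(timezone.utc) - timedelta(hours=hours)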

Example output:

dichroism (https://flow-dichroism.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 10 runs      
  10 runs, 4 failures (40.0%), 2 unhealthy deployments
  ✗ new_file_402_flight_check: scheduled: 3, failed: 2 (40.0%) (last failure: 1h ago)
  ✗ new_file_631_flight_check: scheduled: 3, failed: 2 (40.0%) (last failure: 1h ago)
  ✓ new_file_402_flow: no runs
  ✓ new_file_631_flow: no runs
  ✓ prune_data402: no runs
  ✓ prune_data631: no runs
  ✓ run_dichroism_dispatcher: no runs

BL7011 (https://flow-xpcs.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 5 runs      
  5 runs, all healthy
  ✓ new_file_7011_flight_check: scheduled: 3, completed: 2
  ✓ new_file_7011_flow: no runs
  ✓ prune_data7011: no runs
  ✓ run_7011_dispatcher: no runs

BL733 (https://flow-733.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 5 runs      
  5 runs, all healthy
  ✓ new_file_733_flight_check: scheduled: 3, completed: 2
  ✓ new_file_733_flow: no runs
  ✓ prune_data733: no runs
  ✓ run_733_dispatcher: no runs

BL832 (https://flow-prd.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 286 runs      
  222 runs, 2 failures (0.9%), 1 unhealthy deployments
  ✗ prune_spot832: completed: 42, failed: 2 (4.5%) (last failure: 53m ago)
  ✓ alcf_recon_flow: no runs
  ✓ ingest_dataset: no runs
  ✓ nersc_recon_flow: completed: 64
  ✓ nersc_streaming_flow: no runs
  ✓ new_file_832: no runs
  ✓ prune_alcf832_raw: no runs
  ✓ prune_alcf832_scratch: no runs
  ✓ prune_data832: completed: 49
  ✓ prune_data832_raw: no runs
  ✓ prune_data832_scratch: no runs
  ✓ prune_nersc832_alsdev_pscratch_raw: no runs
  ✓ prune_nersc832_alsdev_pscratch_scratch: no runs
  ✓ run_832_dispatcher: completed: 64
  ✓ test_transfers_832: completed: 1
  ✓ test_transfers_832_grafana: no runs

BL931 (https://flow-931.als.lbl.gov) [last 1d]
--------------------------------------------------
  fetched 5 runs      
  5 runs, 2 failures (40.0%), 1 unhealthy deployments
  ✗ new_file_931_flight_check: scheduled: 3, failed: 2 (40.0%) (last failure: 1h ago)
  ✓ new_file_931_flow: no runs
  ✓ prune_data931: no runs
  ✓ run_931_dispatcher: no runs

==================================================
OVERALL SUMMARY
==================================================
2/5 servers healthy
247 total runs, 8 failures (3.2%)

@davramov (Contributor Author)

I added an argument (-f) that prints the run names of failed jobs under each deployment. Example:

  ✗ prune_spot832: completed: 42, failed: 2 (4.5%) (last failure: 1h ago)
      └ delete spot832: 20251221_121209_T48_PC7_d.h5 [failed] (1h ago)
      └ delete spot832: 20251221_113153_T45_PC6_10.h5 [failed] (2h ago)
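
For reference, assuming -f combines with -H as expected, the invocation that produces this would be:

    python prefect_status.py -H 24 -f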
