```bash
python pull_agency_info_api.py --output-dir metadata_output --overwrite=False --verbose
```

This will output the agency info and corresponding documents to the metadata_output directory.
By default, all available documents are written in both JSON and CSV formats.
```bash
ls metadata_output
#> 2025-10-30_agency_info.csv
#> 2025-10-30_all_agency_info.json
#> 2025-10-30_combined_pdf_content_details.csv
```

Next, generate the download list, comparing the API metadata against what is already in the Downloads folder:

```bash
python get_download_list.py --download-folder Downloads --available-files "metadata_output/$(date +"%Y-%m-%d")_combined_pdf_content_details.csv"
```

```bash
ls metadata_output
#> 2025-10-30_agency_info.csv
#> 2025-10-30_all_agency_info.json
#> 2025-10-30_combined_pdf_content_details.csv
#> extra_files.txt
#> missing_files.csv
```

`extra_files.txt` contains files that are in `Downloads` but are not found from the API (most likely due to naming discrepancies). `missing_files.csv` lists the missing files in CSV format with the header:
```
generated_filename,agency_name,agency_id,FileExtension,CreatedDate,Title,ContentBodyId,Id,ContentDocumentId
```
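Before downloading, you can take a quick look at the missing-files list with pandas; a minimal sketch, using the columns from the documented header above:

```python
import pandas as pd

# Peek at the missing-files list; columns come from the header documented above.
missing = pd.read_csv("metadata_output/missing_files.csv")
print(missing[["generated_filename", "agency_name", "CreatedDate"]].head())
print(missing["agency_name"].value_counts().head())  # agencies with the most missing files
```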
Download the missing files:

```bash
python download_all_pdfs.py --csv metadata_output/missing_files.csv --output-dir Downloads
```

```bash
ls Downloads/ | head
# 42ND_CIRCUIT_COURT_-_FAMILY_DIVISION_42ND_CIRCUIT_COURT_-_FAMILY_DIVISION_Interim_2025_2025-07-18_069cs0000104BR0AAM.pdf
# ADOPTION_AND_FOSTER_CARE_SPECIALISTS,_INC._CB440295542_INSP_201_2020-03-14_0698z000005Hpu5AAC.pdf
# ADOPTION_AND_FOSTER_CARE_SPECIALISTS,_INC._CB440295542_ORIG.pdf_2008-06-24_0698z000005HozQAAS.pdf
# ADOPTION_ASSOCIATES,_INC_Adoption_Associates_INC_Renewal_2025_2025-08-20_069cs0000163byMAAQ.pdf
# ADOPTION_OPTION,_INC._CB560263403_ORIG.pdf_2004-05-08_0698z000005Hp18AAC.pdf
```

Then check the md5sums of the downloaded files.
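One way to do this is a short Python pass over the Downloads folder (a sketch, assuming a plain MD5 over the raw file bytes; the project may record checksums elsewhere):

```python
import hashlib
from pathlib import Path

# Print an md5sum-style line for every downloaded PDF.
for pdf in sorted(Path("Downloads").glob("*.pdf")):
    digest = hashlib.md5(pdf.read_bytes()).hexdigest()
    print(f"{digest}  {pdf.name}")
```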
Extract text from PDFs and save to parquet files:
```bash
python3 pdf_parsing/extract_pdf_text.py --pdf-dir Downloads --parquet-dir pdf_parsing/parquet_files
```
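To confirm the extraction worked, you can peek at one of the resulting parquet files (a sketch; the exact column layout depends on extract_pdf_text.py, so this just prints whatever is there):

```python
import pandas as pd
from pathlib import Path

# Open the first parquet file and report its shape and columns.
first = next(Path("pdf_parsing/parquet_files").glob("*.parquet"))
df = pd.read_parquet(first)
print(first.name, df.shape)
print(df.columns.tolist())
```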
Extract basic document information from parquet files to CSV:

```bash
python3 pdf_parsing/extract_document_info.py --parquet-dir pdf_parsing/parquet_files -o document_info.csv
```

The output CSV contains:
- Agency ID (License #)
- Agency name
- Document title (extracted from document content, e.g., "Special Investigation Report", "Renewal Inspection Report")
- Inspection/report date
- Special Investigation Report indicator (whether the document is a SIR)
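A quick way to sanity-check the result (a sketch; the exact CSV column names come from extract_document_info.py, so print them rather than assuming):

```python
import pandas as pd

# Load the document info CSV and report what it contains.
info = pd.read_csv("document_info.csv")
print(info.columns.tolist())
print(info.head())
```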
After running the document extraction script, you can investigate random documents to see the original text alongside the parsed information:
```bash
cd pdf_parsing
python3 investigate_violations.py
```

Categories:
- `sir` - Special Investigation Reports only (default)
- `all` - Any document
To investigate a specific document by its SHA256 hash:
```bash
python3 pdf_parsing/investigate_sha.py <sha256>
```

Example:

```bash
python3 pdf_parsing/investigate_sha.py 6e5b899cf078b4bf0829e4dce8113aaac61edfa5bc0958efa725ae8607008f68
```

This will display:
- Parsed violation information (agency, date, violations found)
- Original document text from the parquet file
This is useful for debugging parsing issues or verifying specific documents.
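If you want to find the hash for a particular downloaded file, a minimal sketch (this assumes the hash is the SHA256 of the raw PDF bytes, which may not match how the pipeline derives its hashes; check the parsing scripts if the lookup misses):

```python
import hashlib
from pathlib import Path

# Compute a SHA256 for one downloaded PDF (path is illustrative, not a real file).
pdf = Path("Downloads/SOME_AGENCY_SOME_REPORT.pdf")
print(hashlib.sha256(pdf.read_bytes()).hexdigest())
```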
See pdf_parsing/README.md for more details.
A lightweight web dashboard is included to visualize agency documents and reports.
The website can be built with a single command:
```bash
cd website
./build.sh
```

This will:
- Generate document info CSV from parquet files
- Create JSON data files from the document info (deriving agency info automatically)
- Build the static website with Vite
The built website will be in the dist/ directory.
```bash
# Install dependencies
cd website
npm install

# Start development server
npm run dev
```

The site is configured for automatic deployment on Netlify:
- Push changes to your repository
- Netlify will automatically run the build process from the `website` directory
- The site will be deployed from the `dist/` directory
Configuration is in website/netlify.toml.
See website/README.md for more details about the dashboard.
Automatically generate and maintain AI summaries for Special Investigation Reports (SIRs) using the OpenRouter API (DeepSeek v3.2).
All AI queries use prompt caching to reduce costs when making multiple queries about the same document. The document text is sent as a common prefix, allowing OpenRouter to cache it across queries:
- First query: Full cost
- Subsequent queries: Significant savings via `cache_discount`
- Typical savings: Up to 10x on large documents
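A minimal sketch of the common-prefix pattern (illustrative, not the project's actual query code; OpenRouter's chat completions endpoint is OpenAI-compatible, and the DeepSeek model slug here is an assumption):

```python
import os
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_KEY']}"}

def ask(document_text: str, question: str) -> str:
    """Send the document as an identical leading message on every call,
    so the provider can serve it from cache on repeat queries."""
    payload = {
        "model": "deepseek/deepseek-chat",  # assumed slug; check OpenRouter's model list
        "messages": [
            {"role": "user", "content": document_text},  # cacheable common prefix
            {"role": "user", "content": question},       # varies per query
        ],
    }
    resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```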
See CACHING_INVESTIGATION.md for details on implementation and verification.
A GitHub Actions workflow automatically:
- Scans parquet files for new SIRs
- Compares against existing summaries in `pdf_parsing/sir_summaries.csv`
- Generates AI summaries for up to 100 new SIRs weekly
- Commits results to the repository
To trigger manually: Go to Actions → "Update SIR Summaries" → Run workflow
To run locally:

```bash
cd pdf_parsing
export OPENROUTER_KEY="your-api-key"
python3 update_summaryqueries.py --count 100
```

The AI analyzes each report to provide:
- Summary: Incident description and culpability assessment
- Violation status: Whether allegations were substantiated (y/n)
Results are appended to pdf_parsing/sir_summaries.csv with complete metadata including token usage, cost, and cache discount information.
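To inspect the accumulated summaries (a sketch; columns beyond the metadata described above aren't guaranteed, so print them rather than assuming):

```python
import pandas as pd

# Report how many SIR summaries exist and what metadata is recorded.
sirs = pd.read_csv("pdf_parsing/sir_summaries.csv")
print(len(sirs), "summaries")
print(sirs.columns.tolist())
```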
See pdf_parsing/README.md for complete documentation.