
MCYJ Parsing Script

1. Get all the available documents from the Michigan Welfare public search API

python pull_agency_info_api.py --output-dir metadata_output --overwrite=False --verbose

This will output the agency info and corresponding documents to the metadata_output directory. By default, all available documents are written in both JSON and CSV formats.

1. Output

ls metadata_output
#> 2025-10-30_agency_info.csv
#> 2025-10-30_all_agency_info.json
#> 2025-10-30_combined_pdf_content_details.csv
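
For a quick sanity check of the metadata before moving on, something like the following works (a minimal sketch using pandas; the exact columns depend on what the API returns):

from datetime import date

import pandas as pd

# Peek at today's combined document listing produced by pull_agency_info_api.py.
csv_path = f"metadata_output/{date.today().isoformat()}_combined_pdf_content_details.csv"
df = pd.read_csv(csv_path)

print(df.shape)          # (number of documents, number of columns)
print(list(df.columns))  # column names as returned by the API
print(df.head())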

2. Get a list of extra and missing files among the downloaded documents

python get_download_list.py --download-folder Downloads --available-files "metadata_output/$(date +"%Y-%m-%d")_combined_pdf_content_details.csv"

2. Output

ls metadata_output
#> 2025-10-30_agency_info.csv
#> 2025-10-30_all_agency_info.json
#> 2025-10-30_combined_pdf_content_details.csv
#> extra_files.txt
#> missing_files.csv
  • extra_files.txt contains files that are in Downloads but are not found via the API (most likely due to naming discrepancies)
  • missing_files.csv lists the missing files in CSV format with the header:
generated_filename,agency_name,agency_id,FileExtension,CreatedDate,Title,ContentBodyId,Id,ContentDocumentId
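
Conceptually, the comparison performed by get_download_list.py boils down to two set differences between the filenames on disk and the generated_filename column of the API listing. A rough sketch of the idea (not the script's actual implementation; it assumes the combined CSV also carries a generated_filename column) is:

from pathlib import Path

import pandas as pd

download_dir = Path("Downloads")
available = pd.read_csv("metadata_output/2025-10-30_combined_pdf_content_details.csv")

# Filenames actually on disk vs. filenames the API says should exist.
on_disk = {p.name for p in download_dir.iterdir() if p.is_file()}
extra_files = sorted(on_disk - set(available["generated_filename"]))             # local files unknown to the API
missing = available[~available["generated_filename"].isin(on_disk)]              # listed by the API, not downloaded

Path("metadata_output/extra_files.txt").write_text("\n".join(extra_files) + "\n")
missing.to_csv("metadata_output/missing_files.csv", index=False)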

3. Download missing documents

python download_all_pdfs.py --csv metadata_output/missing_files.csv --output-dir Downloads

3. Output

ls Downloads/ | head
# 42ND_CIRCUIT_COURT_-_FAMILY_DIVISION_42ND_CIRCUIT_COURT_-_FAMILY_DIVISION_Interim_2025_2025-07-18_069cs0000104BR0AAM.pdf
# ADOPTION_AND_FOSTER_CARE_SPECIALISTS,_INC._CB440295542_INSP_201_2020-03-14_0698z000005Hpu5AAC.pdf
# ADOPTION_AND_FOSTER_CARE_SPECIALISTS,_INC._CB440295542_ORIG.pdf_2008-06-24_0698z000005HozQAAS.pdf
# ADOPTION_ASSOCIATES,_INC_Adoption_Associates_INC_Renewal_2025_2025-08-20_069cs0000163byMAAQ.pdf
# ADOPTION_OPTION,_INC._CB560263403_ORIG.pdf_2004-05-08_0698z000005Hp18AAC.pdf

4. Check duplicates and update file metadata

Check the md5sums of the downloaded PDFs to identify duplicates.
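
There is no dedicated script listed for this step; one way to spot byte-for-byte duplicates is a small sketch using Python's hashlib:

import hashlib
from collections import defaultdict
from pathlib import Path

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Group files in Downloads by checksum; any group with more than one entry is a duplicate set.
by_hash = defaultdict(list)
for pdf in Path("Downloads").glob("*.pdf"):
    by_hash[md5sum(pdf)].append(pdf.name)

for checksum, names in by_hash.items():
    if len(names) > 1:
        print(checksum, names)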

5. Extract text from PDFs and extract basic document info

Extract text from PDFs and save to parquet files:

python3 pdf_parsing/extract_pdf_text.py --pdf-dir Downloads --parquet-dir pdf_parsing/parquet_files
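
For reference, the core of such an extraction loop might look like the sketch below. This is not extract_pdf_text.py itself; it assumes pypdf and pyarrow are installed, and the output filename is only illustrative:

from pathlib import Path

import pandas as pd
from pypdf import PdfReader

rows = []
for pdf_path in sorted(Path("Downloads").glob("*.pdf")):
    reader = PdfReader(pdf_path)
    # Concatenate the text of every page; pages with no extractable text yield "".
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    rows.append({"filename": pdf_path.name, "text": text})

Path("pdf_parsing/parquet_files").mkdir(parents=True, exist_ok=True)
# One parquet file for the whole batch; the real script may shard its output differently.
pd.DataFrame(rows).to_parquet("pdf_parsing/parquet_files/extracted_text.parquet")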

Extract basic document information from parquet files to CSV:

python3 pdf_parsing/extract_document_info.py --parquet-dir pdf_parsing/parquet_files -o document_info.csv

The output CSV contains:

  • Agency ID (License #)
  • Agency name
  • Document title (extracted from document content, e.g., "Special Investigation Report", "Renewal Inspection Report")
  • Inspection/report date
  • Special Investigation Report indicator (whether document is a SIR)

6. Investigate documents

After running the document extraction script, you can investigate random documents to see the original text alongside the parsed information:

cd pdf_parsing
python3 investigate_violations.py

Categories:

  • sir - Special Investigation Reports only (default)
  • all - Any document

Investigate a specific document by SHA

To investigate a specific document by its SHA256 hash:

python3 pdf_parsing/investigate_sha.py <sha256>

Example:

python3 pdf_parsing/investigate_sha.py 6e5b899cf078b4bf0829e4dce8113aaac61edfa5bc0958efa725ae8607008f68

This will display:

  • Parsed violation information (agency, date, violations found)
  • Original document text from the parquet file

This is useful for debugging parsing issues or verifying specific documents.
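
If you only know which PDF you are interested in but not its hash, you can compute the SHA256 locally (a small sketch, assuming the pipeline hashes the raw PDF bytes):

import hashlib
from pathlib import Path

pdf = Path("Downloads/ADOPTION_OPTION,_INC._CB560263403_ORIG.pdf_2004-05-08_0698z000005Hp18AAC.pdf")
sha256 = hashlib.sha256(pdf.read_bytes()).hexdigest()
print(sha256)  # pass this value to pdf_parsing/investigate_sha.py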

See pdf_parsing/README.md for more details.

7. Web Dashboard

A lightweight web dashboard is included to visualize agency documents and reports.

Building the Website

The website can be built with a single command:

cd website
./build.sh

This will:

  1. Generate document info CSV from parquet files
  2. Create JSON data files from the document info (deriving agency info automatically)
  3. Build the static website with Vite

The built website will be in the dist/ directory.
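
The "deriving agency info automatically" part of step 2 above amounts to grouping the per-document rows by agency. A rough sketch of the idea follows; it is not the actual build script, and the column names and output path are hypothetical:

import pandas as pd

docs = pd.read_csv("document_info.csv")

# "agency_id" / "agency_name" are hypothetical column labels; check document_info.csv
# for the actual ones (Agency ID (License #), Agency name, ...).
agencies = (
    docs.groupby(["agency_id", "agency_name"], as_index=False)
        .size()
        .rename(columns={"size": "document_count"})
)

# Hypothetical output path; the build script decides where the JSON actually lives.
agencies.to_json("agencies.json", orient="records", indent=2)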

Local Development

# Install dependencies
cd website
npm install

# Start development server
npm run dev

Netlify Deployment

The site is configured for automatic deployment on Netlify:

  • Push changes to your repository
  • Netlify will automatically run the build process from the website directory
  • The site will be deployed from the dist/ directory

Configuration is in website/netlify.toml.

See website/README.md for more details about the dashboard.

8. AI-Powered SIR Summaries

Automatically generate and maintain AI summaries for Special Investigation Reports (SIRs) using the OpenRouter API (DeepSeek v3.2).

Prompt Caching Optimization

All AI queries use prompt caching to reduce costs when making multiple queries about the same document. The document text is sent as a common prefix, allowing OpenRouter to cache it across queries:

  • First query: Full cost
  • Subsequent queries: Significant savings via cache_discount
  • Typical savings: Up to 10x on large documents

See CACHING_INVESTIGATION.md for details on implementation and verification.
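
At the API level this just means keeping the long document text at the start of the prompt and varying only the question at the end, so repeated requests share an identical prefix. A minimal sketch against OpenRouter's OpenAI-compatible chat completions endpoint (the model slug is a placeholder and some_sir_report.txt is a hypothetical text dump of one SIR; see update_summaryqueries.py for the real values):

import os

import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_KEY']}"}

document_text = open("some_sir_report.txt").read()  # hypothetical plain-text dump of one SIR

def ask(question: str) -> str:
    """Send one question about the document, keeping the document as a shared prefix."""
    payload = {
        "model": "deepseek/deepseek-chat",  # placeholder slug for DeepSeek v3.2
        "messages": [
            {"role": "system", "content": "You summarize child-welfare inspection reports."},
            # The identical long prefix across calls is what enables provider-side prompt caching.
            {"role": "user", "content": document_text + "\n\nQuestion: " + question},
        ],
    }
    resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

summary = ask("Summarize the incident and assess culpability.")
violation = ask("Were the allegations substantiated? Answer y or n.")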

Automated Updates

A GitHub Actions workflow automatically:

  1. Scans parquet files for new SIRs
  2. Compares against existing summaries in pdf_parsing/sir_summaries.csv
  3. Generates AI summaries for up to 100 new SIRs weekly
  4. Commits results to the repository

To trigger manually: Go to Actions → "Update SIR Summaries" → Run workflow

Local Usage

cd pdf_parsing
export OPENROUTER_KEY="your-api-key"
python3 update_summaryqueries.py --count 100

The AI analyzes each report to provide:

  • Summary: Incident description and culpability assessment
  • Violation status: Whether allegations were substantiated (y/n)

Results are appended to pdf_parsing/sir_summaries.csv with complete metadata including token usage, cost, and cache discount information.

See pdf_parsing/README.md for complete documentation.
