This repository contains a production-grade payroll automation system that extracts, parses, and uploads payroll data using LLMs (Google Gemini) and browser automation (Playwright). It is designed to process 1,000+ semi-structured files and auto-fill payroll templates with high accuracy.
- 🔍 LLM-powered Parsing: Converts messy payroll reports (PDF, RTF, Excel) into structured JSON.
- 📄 CSV Auto-Fill: Populates Accountant's World-compatible CSV templates per client.
- 🤖 Portal Automation: Automates CSV uploads and tax payments to Accountant’s World (AW).
- 🧠 Smart Field Detection: Dynamically maps earnings, deductions, and tax categories.
- 📂 Batch Payroll Support: Processes multiple client folders in one click.
- 📊 Streamlit Interface: Simple UI to review extracted data and approve uploads.
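As a rough illustration of the CSV Auto-Fill and Smart Field Detection features, the sketch below maps parsed labels onto template columns with the standard `csv` module. The column names, the `field_map` dictionary, and the record layout (`employee_id`, `amounts`) are hypothetical and are not taken from the actual Accountant's World template or from this repository's code.

```python
import csv
import io


def populate_template(template_columns, id_column, records, field_map):
    """Fill one CSV row per employee and report labels that could not be mapped.

    template_columns: ordered column names of the (hypothetical) template.
    id_column:        name of the column that holds the employee number.
    records:          parsed records shaped like
                      {"employee_id": "101", "amounts": {"Reg Pay": 1000.0}}.
    field_map:        parsed label -> template column (assumed mapping).
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=template_columns, lineterminator="\n")
    writer.writeheader()
    unmapped = []

    for record in records:
        # Start from an empty row so every template column is present.
        row = {column: "" for column in template_columns}
        row[id_column] = record["employee_id"]
        for label, amount in record["amounts"].items():
            column = field_map.get(label)
            if column is None:
                # Keep unknown labels for manual review instead of guessing.
                unmapped.append((record["employee_id"], label))
                continue
            row[column] = f"{amount:.2f}"
        writer.writerow(row)

    return buffer.getvalue(), unmapped
```

Returning the unmapped labels alongside the CSV text keeps a human in the loop: anything the mapping does not recognize can be shown in the review UI rather than silently dropped.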
Bookkeeping firms handling 50+ client payrolls
Automation-first accountants aiming to cut labor costs
LLM startups showcasing GenAI for operations
Any business tired of manual AW uploads
- Python 3.10+
- Streamlit – UI
- Playwright – Headless browser automation
- Google Gemini API – Large Language Model parsing
- pandas, PyMuPDF, python-docx, striprtf – Data + doc processing
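To show how the document-processing libraries listed above might fit together, here is a minimal sketch of type detection and raw text extraction. The function names, the extension table, and the tab-separated flattening of Excel sheets are assumptions made for illustration; the repository's own extractors may work differently. Third-party imports are deferred so that only the library needed for a given file type has to be installed.

```python
from pathlib import Path

# Assumed mapping from file extension to input kind.
KIND_BY_SUFFIX = {".pdf": "pdf", ".rtf": "rtf", ".xlsx": "excel", ".xls": "excel"}


def detect_type(path):
    """Return 'pdf', 'rtf', or 'excel' based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in KIND_BY_SUFFIX:
        raise ValueError(f"unsupported payroll file type: {path!r}")
    return KIND_BY_SUFFIX[suffix]


def extract_raw_text(path):
    """Extract plain text from a payroll report (sketch, not the repo's code)."""
    kind = detect_type(path)

    if kind == "pdf":
        import fitz  # PyMuPDF

        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)

    if kind == "rtf":
        from striprtf.striprtf import rtf_to_text

        raw = Path(path).read_text(encoding="utf-8", errors="ignore")
        return rtf_to_text(raw)

    # Excel: read every cell as text and flatten the sheet to tab-separated lines.
    import pandas as pd

    frame = pd.read_excel(path, header=None, dtype=str)
    return frame.fillna("").to_csv(sep="\t", index=False, header=False)
```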
flowchart LR
%% Direction: Left to Right for better readability
%% Updated to reflect current pipeline: extract → LLM parse → validate → CSV → AW automation → outputs
subgraph A[Ingestion]
A1[Select or Upload Files]
A2[Detect Type: PDF / RTF / Excel]
A3[Extract Raw Text]
A1 --> A2 --> A3
end
subgraph B[Chunking + LLM Parsing]
B1[Split Into Employee Chunks]
B2[Send Chunks to Gemini API]
B3[Store Parsed JSON]
A3 --> B1 --> B2 --> B3
end
subgraph C[Post‑Processing]
C1[Analyze Labels: Earnings / Deductions / Taxes]
C2[Validate Fields and Totals]
B3 --> C1 --> C2
end
subgraph D[CSV Mapping + Output]
D1[Map Emp# and Fields to CSV Template]
D2[Populate Final CSV]
D3[Review in UI]
C2 --> D1 --> D2 --> D3
end
subgraph E[Approval + Automation]
E1{Approve Payroll?}
E2[Upload Payroll CSV to Accountant's World]
E3[Fill Tax Payments on Accountant's World]
D3 --> E1
E1 -- Yes --> E2 --> E3
E1 -- No --> D3
end
subgraph F[Logging + Reports]
F1[Write Audit Log and Status]
F2[Save Reports and CSV]
F3[Optional: Upload Reports to SharePoint]
E2 --> F1
E3 --> F1
F1 --> F2 --> F3
end
%% Final User Actions
F2 --> G[Download Files]
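The "Validate Fields and Totals" step in the diagram above could look roughly like the following. This is a sketch under stated assumptions: the record schema (`employee_id`, `earnings`, `deductions`, `taxes`, `net_pay`) and the rule that net pay equals earnings minus deductions minus taxes are illustrative, not confirmed details of this project.

```python
from decimal import Decimal

# Hypothetical schema for one parsed employee record.
REQUIRED_FIELDS = ("employee_id", "earnings", "deductions", "taxes", "net_pay")


def _total(amounts):
    """Sum a {label: amount} mapping using Decimal to avoid float drift."""
    return sum((Decimal(str(value)) for value in amounts.values()), Decimal("0"))


def validate_record(record, tolerance=Decimal("0.01")):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems

    gross = _total(record["earnings"])
    expected_net = gross - _total(record["deductions"]) - _total(record["taxes"])
    reported_net = Decimal(str(record["net_pay"]))

    # Assumed rule: net pay = earnings - deductions - taxes, within a cent.
    if abs(expected_net - reported_net) > tolerance:
        problems.append(
            f"net pay mismatch for {record['employee_id']}: "
            f"expected {expected_net}, reported {reported_net}"
        )
    return problems
```

Records that come back with a non-empty problem list are natural candidates to surface in the review step before anyone approves an upload.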
Payroll-LLM-Extractor/
│
├── streamlit_app.py                 # Main Streamlit app
├── upload_runner.py                 # Automates payroll CSV uploads to AW
├── upload_tax.py                    # Automates tax form filling on AW
│
├── src/
│   ├── send_chunk_llm.py            # Gemini API integration
│   ├── excel_raw_text_chunk.py      # Raw text extraction from Excel
│   └── populate_csv_template.py     # Populates CSV template
│
├── utils/
│   ├── gemni_parser.py              # Gemini prompt + parsing logic
│   ├── extract_rtf_pdf.py           # PDF/RTF chunker
│   ├── label_analysis.py            # Detects suspicious earnings
│   └── populate_csv.py              # Final CSV output generator
│
├── agent_project/
│   ├── agents.py                    # Playwright login + upload logic
│   └── opt.py                       # Browser headless setup
│
├── misc/                            # Experimental scripts + prompts
├── requirements.txt
└── README.md

- Upload a client’s payroll file (PDF, RTF, Excel).
- Extracts text chunks → src/excel_raw_text_chunk.py
- Sends chunks to Gemini → src/send_chunk_llm.py
- Post-processes earnings/deductions → utils/label_analysis.py
- Auto-fills CSV using template → src/populate_csv_template.py
- Uploads CSV and fills taxes on AW → upload_runner.py, upload_tax.py
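The chunk-extraction step in the workflow above can be sketched as a split on a per-employee header line. The `Employee:` marker is a hypothetical delimiter chosen for this example; real payroll reports vary, and the repository's chunkers may use different rules.

```python
import re

# Hypothetical header that starts each employee's section in the raw text.
EMPLOYEE_HEADER = re.compile(r"^Employee:\s*\S", re.MULTILINE)


def split_into_employee_chunks(raw_text):
    """Split extracted report text into one chunk per employee.

    Text before the first header (for example a report title) is dropped.
    If no header is found, the whole text is returned as a single chunk.
    """
    starts = [match.start() for match in EMPLOYEE_HEADER.finditer(raw_text)]
    if not starts:
        stripped = raw_text.strip()
        return [stripped] if stripped else []

    # Each chunk runs from its header to the next header (or end of text).
    ends = starts[1:] + [len(raw_text)]
    return [raw_text[start:end].strip() for start, end in zip(starts, ends)]
```

Sending one employee per request keeps each prompt small and makes it easy to retry or flag a single employee when the model's output fails validation.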
pip install -r requirements.txt
streamlit run streamlit_app.py
Make sure you have a .env file with your Gemini API key:
GEMINI_API_KEY=your_gemini_key_here
AW credentials are managed securely within the agent_project/ automation scripts or set manually before automation.
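For reference, the Gemini key from the .env file can be read with the standard library alone. The helper names below (`load_env_file`, `get_gemini_api_key`) are hypothetical; the project itself may rely on a dedicated package for this, so treat the snippet as a sketch of the idea rather than the actual loading code.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=value pairs from a .env file into os.environ (existing values win)."""
    env_path = Path(path)
    if not env_path.is_file():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        cleaned = value.strip().strip('"').strip("'")
        os.environ.setdefault(key.strip(), cleaned)


def get_gemini_api_key():
    """Return the configured key or fail early with a clear message."""
    key = os.environ.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or the environment")
    return key
```

Failing fast on a missing key avoids starting a batch run that would only error out at the first LLM call.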
Aryan Jain
MS Data Science @ Drexel | Python | AI Automation | Fintech
🔗 LinkedIn • 📬 aryanflory@gmail.com
✅ SharePoint sync for final reports
✅ AI-driven field mapping & validation
✅ Full audit trail + override logging
✅ Email notifications for payroll events
MIT — free to use and modify with credit.