
aryanj10/Payroll-LLM-Extractor


💼 Payroll LLM Extractor & Automation System


This repository contains a production-grade payroll automation system that extracts, parses, and uploads payroll data using LLMs (Google Gemini) and browser automation (Playwright). It is designed to process 1,000+ semi-structured files and auto-fill payroll templates with high accuracy.


🚀 Key Features

  • 🔍 LLM-powered Parsing: Converts messy payroll reports (PDF, RTF, Excel) into structured JSON.
  • 📄 CSV Auto-Fill: Populates a CSV template compatible with Accountant's World (AW) for each client.
  • 🤖 Portal Automation: Automates CSV uploads and tax payments on the AW portal.
  • 🧠 Smart Field Detection: Dynamically maps earnings, deductions, and tax categories.
  • 📂 Batch Payroll Support: Processes multiple client folders in one click.
  • 📊 Streamlit Interface: Simple UI to review extracted data and approve uploads.
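
To illustrate the kind of mapping that "Smart Field Detection" refers to, here is a minimal sketch that sorts payroll line labels into earnings, deductions, and taxes by keyword. It is an assumption for illustration only: the keyword lists and the `categorize_label` function are hypothetical and are not taken from this repository, which may rely on the LLM output instead.

```python
import re
from typing import Dict, Set

# Hypothetical keyword sets; a real deployment would tune these per client.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "taxes": {"fica", "medicare", "withholding", "tax"},
    "deductions": {"401k", "health", "dental", "garnishment", "insurance"},
    "earnings": {"regular", "overtime", "bonus", "commission", "salary"},
}


def categorize_label(label: str) -> str:
    """Return 'taxes', 'deductions', 'earnings', or 'unknown' for a label."""
    # Compare whole lowercase tokens so short keywords never match
    # inside unrelated words.
    tokens = set(re.findall(r"[a-z0-9]+", label.lower()))
    for category, keywords in CATEGORY_KEYWORDS.items():
        if tokens & keywords:
            return category
    return "unknown"


if __name__ == "__main__":
    for label in ["Federal Withholding", "Overtime", "401k Employee", "Misc Adj"]:
        print(f"{label!r} -> {categorize_label(label)}")
```

Labels that come back as `unknown` are the natural candidates to surface for manual review in the UI.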

📌 Use Cases

  • Bookkeeping firms handling 50+ client payrolls
  • Automation-first accountants aiming to cut labor costs
  • LLM startups showcasing GenAI for operations
  • Any business tired of manual AW uploads


🧠 Tech Stack

  • Python 3.10+
  • Streamlit – UI
  • Playwright – Headless browser automation
  • Google Gemini API – Large Language Model parsing
  • pandas, PyMuPDF, python-docx, striprtf – Data + doc processing

📊 Workflow

flowchart LR
    %% Direction: Left to Right for better readability
    %% Updated to reflect current pipeline: extract → LLM parse → validate → CSV → AW automation → outputs

    subgraph A[Ingestion]
        A1[Select or Upload Files]
        A2[Detect Type: PDF / RTF / Excel]
        A3[Extract Raw Text]
        A1 --> A2 --> A3
    end

    subgraph B[Chunking + LLM Parsing]
        B1[Split Into Employee Chunks]
        B2[Send Chunks to Gemini API]
        B3[Store Parsed JSON]
        A3 --> B1 --> B2 --> B3
    end

    subgraph C[Post‑Processing]
        C1[Analyze Labels: Earnings / Deductions / Taxes]
        C2[Validate Fields and Totals]
        B3 --> C1 --> C2
    end

    subgraph D[CSV Mapping + Output]
        D1[Map Emp# and Fields to CSV Template]
        D2[Populate Final CSV]
        D3[Review in UI]
        C2 --> D1 --> D2 --> D3
    end

    subgraph E[Approval + Automation]
        E1{Approve Payroll?}
        E2[Upload Payroll CSV to Accountant's World]
        E3[Fill Tax Payments on Accountant's World]
        D3 --> E1
        E1 -- Yes --> E2 --> E3
        E1 -- No  --> D3
    end

    subgraph F[Logging + Reports]
        F1[Write Audit Log and Status]
        F2[Save Reports and CSV]
        F3[Optional: Upload Reports to SharePoint]
        E2 --> F1
        E3 --> F1
        F1 --> F2 --> F3
    end

    %% Final User Actions
    F2 --> G[Download Files]
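
The "Detect Type" step in the Ingestion stage can be sketched as a lookup on the file extension. This is a minimal illustration under stated assumptions: `detect_file_type` and the extension table are hypothetical names, and the repository's own detection logic may differ (for example, by inspecting file contents).

```python
from pathlib import Path

# Hypothetical mapping from file extension to the extractor family to use.
EXTENSION_TO_TYPE = {
    ".pdf": "pdf",
    ".rtf": "rtf",
    ".xlsx": "excel",
    ".xls": "excel",
}


def detect_file_type(path: str) -> str:
    """Classify a payroll file as 'pdf', 'rtf', or 'excel' by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_TYPE:
        raise ValueError(f"Unsupported payroll file type: {suffix or '(none)'}")
    return EXTENSION_TO_TYPE[suffix]


if __name__ == "__main__":
    for name in ["client_a/march.PDF", "client_b/payroll.xlsx"]:
        print(name, "->", detect_file_type(name))
```

Failing loudly on unknown extensions keeps unsupported files out of the LLM parsing stage instead of sending it empty text.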

🧰 Project Structure

Payroll-LLM-Extractor/
│
├── streamlit_app.py # Main Streamlit app
├── upload_runner.py # Automates payroll CSV uploads to AW
├── upload_tax.py # Automates tax form filling on AW
│
├── src/
│ ├── send_chunk_llm.py # Gemini API integration
│ ├── excel_raw_text_chunk.py # Raw text extraction from Excel
│ └── populate_csv_template.py # Populates CSV template
│
├── utils/
│ ├── gemni_parser.py # Gemini prompt + parsing logic
│ ├── extract_rtf_pdf.py # PDF/RTF chunker
│ ├── label_analysis.py # Detects suspicious earnings
│ └── populate_csv.py # Final CSV output generator
│
├── agent_project/
│ ├── agents.py # Playwright login + upload logic
│ └── opt.py # Browser headless setup
│
├── misc/ # Experimental scripts + prompts
├── requirements.txt
└── README.md

🧪 How It Works

  1. Upload a client’s payroll file (PDF, RTF, or Excel).
  2. Extract text chunks → src/excel_raw_text_chunk.py (Excel), utils/extract_rtf_pdf.py (PDF/RTF)
  3. Send chunks to Gemini → src/send_chunk_llm.py
  4. Post-process earnings/deductions → utils/label_analysis.py
  5. Auto-fill the CSV from the client template → src/populate_csv_template.py
  6. After approval, upload the CSV and fill tax payments on AW → upload_runner.py, upload_tax.py
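
As a rough illustration of the validation and CSV-fill steps, the sketch below checks that each parsed record's totals reconcile and then writes the records into a fixed set of template columns. The record keys (`emp_no`, `gross`, `deductions`, `taxes`, `net`) and both function names are assumptions made for this example; the actual JSON schema and AW template columns are not documented here.

```python
import csv
import io
from typing import Dict, Iterable, List


def totals_match(record: Dict[str, float], tolerance: float = 0.01) -> bool:
    """Check that gross - deductions - taxes equals net, within a tolerance."""
    expected_net = record["gross"] - record["deductions"] - record["taxes"]
    return abs(expected_net - record["net"]) <= tolerance


def fill_csv(records: Iterable[dict], columns: List[str]) -> str:
    """Write records into the given template columns and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # drop parsed fields the template does not use
        restval="",             # leave template columns blank when data is missing
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    parsed = [
        {"emp_no": "101", "gross": 1000.0, "deductions": 100.0, "taxes": 200.0, "net": 700.0},
    ]
    valid = [r for r in parsed if totals_match(r)]
    print(fill_csv(valid, ["emp_no", "gross", "net"]))
```

Records that fail `totals_match` would be held back for review rather than written to the upload file.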

▶️ Streamlit Demo

pip install -r requirements.txt
streamlit run streamlit_app.py

Make sure you have a .env file with your Gemini API key:

GEMINI_API_KEY=your_gemini_key_here
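
For a quick check that the key is picked up before launching the app, the following sketch reads a `.env` file using only the standard library and refuses the placeholder value. The helper names `load_env_file` and `require_gemini_key` are hypothetical; the project itself may load the file differently (the python-dotenv package is a common choice for this).

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines and export them to os.environ."""
    loaded: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


def require_gemini_key() -> str:
    """Return GEMINI_API_KEY or raise if it is unset or still the placeholder."""
    key = os.environ.get("GEMINI_API_KEY", "")
    if not key or key == "your_gemini_key_here":
        raise RuntimeError("GEMINI_API_KEY is missing or still the placeholder")
    return key


if __name__ == "__main__":
    load_env_file()
    try:
        require_gemini_key()
        print("Gemini API key found.")
    except RuntimeError as exc:
        print(f"Configuration problem: {exc}")
```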

AW credentials are managed securely within the agent_project/ automation scripts or set manually before automation.


👨‍💻 About the Author

Aryan Jain
MS Data Science @ Drexel | Python | AI Automation | Fintech
🔗 LinkedIn • 📬 aryanflory@gmail.com


🧭 Future Enhancements

✅ SharePoint sync for final reports

✅ AI-driven field mapping & validation

✅ Full audit trail + override logging

✅ Email notifications for payroll events

⚖️ License

MIT — free to use and modify with credit.
