SEC EDGAR data ingestion system that processes filings for 580+ companies through an automated parsing pipeline.
GlassBox is a data engineering system that ingests and processes SEC EDGAR filings for financial analysis. The system automates the collection and parsing of regulatory filings.
Key Highlights:
- ✅ Data Ingestion: Automated SEC EDGAR filing collection
- ✅ Parsing Pipeline: HTML → structured JSON conversion
- ✅ Scale: Processes 580+ companies
- ✅ Automation: Fully automated pipeline
- Python 3.11+
- Web Scraping (SEC EDGAR access)
- HTML Parsing (Structured data extraction)
- JSON (Data storage format)
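To illustrate the SEC EDGAR access step, here is a minimal standard-library sketch. It assumes the public submissions endpoint on `data.sec.gov`, which may differ from the URLs GlassBox actually collects from; the function names and the contact string are hypothetical. The SEC asks automated clients to identify themselves with a descriptive `User-Agent` header, so the sketch sets one.

```python
import urllib.request

# Assumption: the public EDGAR submissions endpoint; GlassBox may use other URLs.
SUBMISSIONS_URL = "https://data.sec.gov/submissions/CIK{cik:010d}.json"


def submissions_url(cik: int) -> str:
    """Return the submissions URL for a company, zero-padding the CIK to 10 digits."""
    return SUBMISSIONS_URL.format(cik=cik)


def build_request(cik: int, contact: str) -> urllib.request.Request:
    """Build (but do not send) a request that identifies the client to the SEC."""
    return urllib.request.Request(
        submissions_url(cik),
        headers={"User-Agent": contact},
    )
```

Passing the result of `build_request(...)` to `urllib.request.urlopen` would perform the download; the sketch stops short of that so it can be run offline.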
┌──────────────┐
│  SEC EDGAR   │ (580+ companies)
└──────┬───────┘
       │
┌──────▼───────┐
│  Ingestion   │ (Automated collection)
└──────┬───────┘
       │
┌──────▼───────┐
│   Parsing    │ (HTML → JSON)
└──────┬───────┘
       │
┌──────▼───────┐
│   Storage    │ (Structured JSON)
└──────────────┘
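The parsing stage in the diagram above (HTML to JSON) could be sketched as follows. This is not GlassBox's actual parser: it assumes the data of interest sits in HTML table cells, and the names `TableRowExtractor` and `html_to_json` are hypothetical. It uses only the standard library's `html.parser` and `json` modules.

```python
import json
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row: list[str] | None = None   # cells of the row being read
        self._cell: list[str] | None = None  # text chunks of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def html_to_json(html: str) -> str:
    """Parse HTML and return the extracted table rows as a JSON string."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    return json.dumps({"rows": parser.rows}, indent=2)
```

The resulting string can be written straight to a `.json` file for the storage stage. Real filings contain nested tables and layout-only markup, so a production parser would need more handling than this sketch shows.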
- SEC EDGAR: Automated filing collection
- Scale: 580+ companies processed
- Automation: Fully automated pipeline
- HTML Parsing: Extracts structured data from HTML
- JSON Output: Clean, structured data format
- Reliability: Robust error handling
- Scalable: Handles large-scale data processing
- Automated: Minimal manual intervention
- Reliable: Error handling and retry logic
- Companies Processed: 580+
- Data Quality: Clean, structured JSON output
- Reliability: Automated, robust pipeline
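The error handling and retry logic listed above is not described in detail, so the following is only one plausible shape for it: a small helper that retries a failing step with exponential backoff. The name `with_retries` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A download or parse step can be wrapped as `with_retries(lambda: fetch(url))`, where `fetch` stands for whatever function performs the step. Catching every `Exception` keeps the sketch short; a real pipeline would likely retry only on transient errors such as timeouts.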
Phase: Completed - Functional system
Completed:
- ✅ SEC EDGAR ingestion (580+ companies)
- ✅ HTML parsing pipeline
- ✅ JSON output format
- ✅ Automated processing
- Portfolio: fabienpierret.github.io/projects/glassbox
- Author: Fabien Pierret
MIT License - See LICENSE file for details
Built with ❤️ for demonstrating data engineering capabilities