
fabienpierret/glassbox


GlassBox - SEC EDGAR Data Ingestion

An SEC EDGAR data ingestion system that processes filings for 580+ companies through an automated parsing pipeline.



🎯 Overview

GlassBox is a data engineering system that ingests and processes SEC EDGAR filings for financial analysis. The system automates the collection and parsing of regulatory filings.

Key Highlights:

  • ✅ Data Ingestion: Automated SEC EDGAR filing collection
  • ✅ Parsing Pipeline: HTML → structured JSON conversion
  • ✅ Scale: Processes 580+ companies
  • ✅ Automation: Fully automated pipeline

🛠️ Technologies

  • Python 3.11+
  • Web Scraping (SEC EDGAR access)
  • HTML Parsing (Structured data extraction)
  • JSON (Data storage format)

📊 Architecture

┌──────────────┐
│  SEC EDGAR   │  (580+ companies)
└──────┬───────┘
       │
┌──────▼───────┐
│  Ingestion   │  (Automated collection)
└──────┬───────┘
       │
┌──────▼───────┐
│  Parsing     │  (HTML → JSON)
└──────┬───────┘
       │
┌──────▼───────┐
│  Storage     │  (Structured JSON)
└──────────────┘
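
The flow in the diagram can be sketched as a small driver that passes each company through the three stages. This is a minimal illustration, not GlassBox's actual code: `run_pipeline`, the `fetch` and `parse` callables, and the one-file-per-company layout are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def run_pipeline(
    company_ids: list[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], dict],
    out_dir: Path,
) -> dict[str, str]:
    """Run each company through fetch -> parse -> store.

    Returns a mapping of company id to error text for companies that failed.
    All names here are hypothetical; the real project may be organized differently.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    failures: dict[str, str] = {}
    for company_id in company_ids:
        try:
            html = fetch(company_id)   # ingestion stage
            record = parse(html)       # parsing stage (HTML -> dict)
        except Exception as exc:       # one bad company should not stop the batch
            failures[company_id] = repr(exc)
            continue
        # storage stage: one structured JSON file per company (assumed layout)
        path = out_dir / f"{company_id}.json"
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return failures
```

Passing the stages in as callables keeps the driver testable without network access: a stub `fetch` can return canned HTML while the real one talks to EDGAR.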

🚀 Key Features

1. Data Ingestion

  • SEC EDGAR: Automated filing collection
  • Scale: 580+ companies processed
  • Automation: Fully automated pipeline

2. Parsing Pipeline

  • HTML Parsing: Extracts structured data from HTML
  • JSON Output: Clean, structured data format
  • Reliability: Robust error handling
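
To make the HTML → JSON step concrete, here is a minimal sketch that turns a simple HTML table into a list of JSON records using only the standard library. It assumes the first row holds the column headers and that tables are not nested; `TableExtractor`, `table_to_records`, and `html_table_to_json` are hypothetical names, and the parser GlassBox actually uses is not described in this README.

```python
import json
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row: list[str] | None = None
        self._cell: list[str] | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell into single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_to_records(html: str) -> list[dict[str, str]]:
    """Treat the first row as headers and map each later row to a dict."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def html_table_to_json(html: str) -> str:
    """Serialize the extracted records as indented JSON."""
    return json.dumps(table_to_records(html), indent=2)
```

Real filings contain far messier markup (nested tables, spanning cells, inline styling), so a production parser would need more cases than this sketch handles.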

3. Architecture

  • Scalable: Handles large-scale data processing
  • Automated: Minimal manual intervention
  • Reliable: Error handling and retry logic
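
Retry logic of the kind mentioned above is often written as a wrapper with exponential backoff. The sketch below is one common pattern, shown under assumptions: the name `with_retries`, the default of three attempts, and the 0.5-second base delay are illustrative and not taken from the GlassBox code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the delay after each failure.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")
```

A call such as `with_retries(lambda: download(url))` (with `download` standing in for any network call) would then tolerate transient failures. The injectable `sleep` parameter exists so tests can run without real delays.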

📈 Results

  • Companies Processed: 580+
  • Data Quality: Clean, structured JSON output
  • Reliability: Automated, robust pipeline

📝 Status

Phase: Completed (functional system)

Completed:

  • ✅ SEC EDGAR ingestion (580+ companies)
  • ✅ HTML parsing pipeline
  • ✅ JSON output format
  • ✅ Automated processing

🔗 Links


📄 License

MIT License. See the LICENSE file for details.


Built with ❤️ for demonstrating data engineering capabilities
