This project implements a robust, three-tier hybrid log classification framework designed to process system logs with high accuracy and efficiency. By combining deterministic rules, local machine learning, and advanced Large Language Models (LLMs), the system ensures that every log is categorized correctly, regardless of its complexity.
To balance speed, cost, and reasoning capability, the system processes logs through the following hierarchy:
- Tier 1: Regular Expressions (Regex)
- Purpose: Instant classification for predictable, high-frequency patterns.
- Logic: If a log matches a predefined regex rule, it is labeled immediately, bypassing heavier models to save compute resources.
- Tier 2: BERT / Sentence Transformers
- Purpose: Local machine learning for complex but standard log patterns.
- Logic: Generates embeddings with `Sentence-Transformers` and classifies them with a `Logistic Regression` model (built with `scikit-learn 1.8.0`) for logs that fall through the Regex tier. This provides high-speed inference without external API costs.
- Tier 3: Llama 3.3 70B (LLM)
- Purpose: High-reasoning fallback for "Unclassified" or legacy system logs.
- Logic: Any log not confidently caught by Tier 1 or 2 (or originating from `LegacyCRM`) is sent to the `Llama-3.3-70b-versatile` model via the Groq API. This ensures even the most ambiguous logs are labeled with human-like reasoning.
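The cascade described above can be sketched as follows. The function names, regex rules, and tier stubs here are illustrative, not the project's actual code; the real routing lives in `classify.py` and the three `*_processor.py` modules:

```python
import re

# Tier 1: hypothetical rules -- the real patterns live in regex_processor.py.
REGEX_RULES = {
    r"User \S+ logged (in|out)": "user_action",
    r"Backup (started|completed)": "system_notification",
}

def classify_regex(log_message):
    """Return a label on a rule hit, else None so the log escalates."""
    for pattern, label in REGEX_RULES.items():
        if re.search(pattern, log_message):
            return label
    return None

def classify_bert(log_message):
    """Stub for Tier 2 (Sentence-Transformers embeddings + Logistic Regression)."""
    return "Unclassified"

def classify_llm(log_message):
    """Stub for Tier 3 (Groq API call to the Llama 3.3 70B model)."""
    return "llm_label"

def classify(source, log_message):
    """Route one log through the Regex -> BERT -> LLM cascade."""
    if source == "LegacyCRM":           # legacy logs go straight to the LLM tier
        return classify_llm(log_message)
    label = classify_regex(log_message)
    if label is not None:               # Tier 1 hit: no model inference needed
        return label
    label = classify_bert(log_message)  # Tier 2: local ML
    if label != "Unclassified":
        return label
    return classify_llm(log_message)    # Tier 3: high-reasoning fallback
```

The ordering is the point: the cheapest check runs first, and each tier only sees logs the previous one could not handle.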
```
log_classification/
├── server.py              # FastAPI backend entry point
├── classify.py            # Main coordination logic (the "Master Brain")
├── bert_processor.py      # Tier 2: Sentence Transformer logic
├── llm_processor.py       # Tier 3: Groq / Llama 3.3 API logic
├── regex_processor.py     # Tier 1: Pattern-based rules
├── log_classifier.joblib  # Trained local ML model
├── requirements.txt       # Project dependencies (locked to dev_env)
├── .env                   # Private API keys (excluded from Git)
├── resources/             # Directory for output CSV files
└── test.csv               # Sample input data for testing
```
Make sure Python is installed; an environment manager such as Anaconda is recommended. Then install the project dependencies:
pip install -r requirements.txt
Create a .env file in the root directory and add your Groq API key:
GROQ_API_KEY=your_actual_key_here
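The backend reads this key at startup. A minimal stdlib-only loader is sketched below for illustration; the project may instead rely on a package such as `python-dotenv`, which handles quoting and edge cases properly:

```python
import os

def load_env(path=".env"):
    """Export KEY=VALUE lines from a .env file into os.environ.
    Stdlib-only sketch; python-dotenv does this more robustly."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                # setdefault: a value already exported in the shell wins
                os.environ.setdefault(key.strip(), value.strip())
```

After `load_env()` runs, the Groq client can read `os.environ["GROQ_API_KEY"]`.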
Launch the backend using the Python interpreter. Ensure your virtual environment (e.g., dev_env) is active before running:
# run the server script directly
python server.py
Note: The script is configured to initialize the Uvicorn server automatically on 127.0.0.1:8000.
Once the server is running, you can interact with the classification pipeline through the following interfaces:
- Interactive Swagger UI: http://127.0.0.1:8000/docs — Recommended for testing CSV uploads.
- Alternative ReDoc: http://127.0.0.1:8000/redoc
- Navigate to the Swagger UI.
- Expand the `POST /classify/` endpoint.
- Click "Try it out" and upload your `test.csv`.
- The system will process the logs through the Regex ➔ BERT ➔ LLM pipeline and return a downloadable CSV with a `target_label` column.
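Outside the browser, the same upload can be scripted. A hedged sketch is shown below; the form-field name `file` and the output path are assumptions, so confirm the actual field name in the Swagger UI before relying on it:

```python
API_URL = "http://127.0.0.1:8000/classify/"  # server started in the previous step

def classify_csv(csv_path, out_path="resources/output.csv"):
    """POST a CSV to the running server and save the labeled result.
    Assumes the third-party `requests` package (pip install requests)
    and a multipart form field named "file" -- both are assumptions."""
    import requests
    with open(csv_path, "rb") as f:
        resp = requests.post(API_URL, files={"file": f})
    resp.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(resp.content)  # the processed CSV with the target_label column
    return out_path
```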
- Endpoint: `POST /classify/`
- Payload: A `.csv` file containing two required columns: `source` and `log_message`.
- Output: A processed `.csv` file containing an additional `target_label` column.
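For reference, a minimal valid input file can be generated with the standard library; the sample rows below are illustrative, not taken from the project's `test.csv`:

```python
import csv

# Build a minimal input CSV with the two required columns.
rows = [
    {"source": "WebServer", "log_message": "User alice logged in"},
    {"source": "LegacyCRM", "log_message": "Lead conversion workflow stalled"},
]

with open("test.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["source", "log_message"])
    writer.writeheader()
    writer.writerows(rows)
```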
This project was inspired by and built upon the foundational concepts from the Codebasics Hybrid Log Classification project.
- Upgraded LLM Tier: Replaced default LLM logic with Llama 3.3 70B via the Groq API for state-of-the-art reasoning.
- Modernized Stack: Updated the pipeline to be compatible with scikit-learn 1.8.0 and FastAPI 0.125.0.
- Deployment Ready: Integrated a full FastAPI backend with dedicated endpoint handling for CSV batch processing.