This project implements a robust, three-tier hybrid log classification framework designed to process system logs with high accuracy and efficiency. By combining deterministic rules, local machine learning, and advanced Large Language Models (LLMs), the system ensures that every log is categorized correctly, regardless of its complexity.
To balance speed, cost, and reasoning capability, the system processes logs through the following hierarchy:
- Tier 1: Regular Expressions (Regex)
- Purpose: Instant classification for predictable, high-frequency patterns.
- Logic: If a log matches a predefined regex rule, it is labeled immediately, bypassing heavier models to save compute resources.
- Tier 2: BERT / Sentence Transformers
- Purpose: Local machine learning for complex but standard log patterns.
- Logic: Generates embeddings with `Sentence-Transformers` and classifies them with a `Logistic Regression` model (built with `scikit-learn 1.8.0`) for logs that fall through the Regex tier. This provides high-speed inference without external API costs.
- Tier 3: Llama 3.3 70B (LLM)
- Purpose: High-reasoning fallback for "Unclassified" or legacy system logs.
- Logic: Any log not confidently caught by Tier 1 or 2 (or originating from `LegacyCRM`) is sent to the `Llama-3.3-70b-versatile` model via the Groq API. This ensures even the most ambiguous logs are labeled with human-like reasoning.
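The cascade described above can be sketched as follows. The function names, regex rules, and tier stubs here are illustrative, not the project's actual code; the real routing lives in `classify.py` and the three `*_processor.py` modules:

```python
import re

# Tier 1: hypothetical rules -- the real patterns live in regex_processor.py.
REGEX_RULES = {
    r"User \S+ logged (in|out)": "user_action",
    r"Backup (started|completed)": "system_notification",
}

def classify_regex(log_message):
    """Return a label on a rule hit, else None so the log escalates."""
    for pattern, label in REGEX_RULES.items():
        if re.search(pattern, log_message):
            return label
    return None

def classify_bert(log_message):
    """Stub for Tier 2 (Sentence-Transformers embeddings + Logistic Regression)."""
    return "Unclassified"

def classify_llm(log_message):
    """Stub for Tier 3 (Groq API call to the Llama 3.3 70B model)."""
    return "llm_label"

def classify(source, log_message):
    """Route one log through the Regex -> BERT -> LLM cascade."""
    if source == "LegacyCRM":           # legacy logs go straight to the LLM tier
        return classify_llm(log_message)
    label = classify_regex(log_message)
    if label is not None:               # Tier 1 hit: no model inference needed
        return label
    label = classify_bert(log_message)  # Tier 2: local ML
    if label != "Unclassified":
        return label
    return classify_llm(log_message)    # Tier 3: high-reasoning fallback
```

The ordering is the point: the cheapest check runs first, and each tier only sees logs the previous one could not handle.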
```
log_classification/
├── server.py              # FastAPI backend entry point
├── classify.py            # Main coordination logic (the "Master Brain")
├── bert_processor.py      # Tier 2: Sentence Transformer logic
├── llm_processor.py       # Tier 3: Groq / Llama 3.3 API logic
├── regex_processor.py     # Tier 1: Pattern-based rules
├── log_classifier.joblib  # Trained local ML model
├── requirements.txt       # Project dependencies (locked to dev_env)
├── .env                   # Private API keys (excluded from Git)
├── resources/             # Directory for output CSV files
└── test.csv               # Sample input data for testing
```
Make sure Python is installed; an environment manager such as Anaconda is recommended. Then install the project dependencies:
pip install -r requirements.txt
Create a .env file in the root directory and add your Groq API key:
GROQ_API_KEY=your_actual_key_here
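The backend reads this key at startup. A minimal stdlib-only loader is sketched below for illustration; the project may instead rely on a package such as `python-dotenv`, which handles quoting and edge cases properly:

```python
import os

def load_env(path=".env"):
    """Export KEY=VALUE lines from a .env file into os.environ.
    Stdlib-only sketch; python-dotenv does this more robustly."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                # setdefault: a value already exported in the shell wins
                os.environ.setdefault(key.strip(), value.strip())
```

After `load_env()` runs, the Groq client can read `os.environ["GROQ_API_KEY"]`.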
Launch the backend using the Python interpreter. Ensure your virtual environment (e.g., dev_env) is active before running:
# run the server script directly
python server.py
Note: The script is configured to initialize the Uvicorn server automatically on 127.0.0.1:8000.
Once the server is running, you can interact with the classification pipeline through the following interfaces:
- Interactive Swagger UI: http://127.0.0.1:8000/docs — Recommended for testing CSV uploads.
- Alternative ReDoc: http://127.0.0.1:8000/redoc
- Navigate to the Swagger UI.
- Expand the `POST /classify/` endpoint.
- Click "Try it out" and upload your `test.csv`.
- The system will process the logs through the Regex ➔ BERT ➔ LLM pipeline and return a downloadable CSV with a `target_label` column.
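Outside the browser, the same upload can be scripted. A hedged sketch is shown below; the form-field name `file` and the output path are assumptions, so confirm the actual field name in the Swagger UI before relying on it:

```python
API_URL = "http://127.0.0.1:8000/classify/"  # server started in the previous step

def classify_csv(csv_path, out_path="resources/output.csv"):
    """POST a CSV to the running server and save the labeled result.
    Assumes the third-party `requests` package (pip install requests)
    and a multipart form field named "file" -- both are assumptions."""
    import requests
    with open(csv_path, "rb") as f:
        resp = requests.post(API_URL, files={"file": f})
    resp.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(resp.content)  # the processed CSV with the target_label column
    return out_path
```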
- Endpoint: `POST /classify/`
- Payload: A `.csv` file containing two required columns: `source` and `log_message`.
- Output: A processed `.csv` file containing an additional `target_label` column.
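For reference, a minimal valid input file can be generated with the standard library; the sample rows below are illustrative, not taken from the project's `test.csv`:

```python
import csv

# Build a minimal input CSV with the two required columns.
rows = [
    {"source": "WebServer", "log_message": "User alice logged in"},
    {"source": "LegacyCRM", "log_message": "Lead conversion workflow stalled"},
]

with open("test.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["source", "log_message"])
    writer.writeheader()
    writer.writerows(rows)
```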
This project was inspired by and built upon the foundational concepts from the Codebasics Hybrid Log Classification project.
- Upgraded LLM Tier: Replaced default LLM logic with Llama 3.3 70B via the Groq API for state-of-the-art reasoning.
- Modernized Stack: Updated the pipeline to be compatible with scikit-learn 1.8.0 and FastAPI 0.125.0.
- Deployment Ready: Integrated a full FastAPI backend with dedicated endpoint handling for CSV batch processing.