This repository contains code for MLOps experiments using DVC for data tracking and MLflow for model management.
Building machine learning systems at scale requires more than a notebook and a saved model: you must track parameters, monitor performance, and maintain quality over time. MLOps defines practices that make this process efficient, such as organizing the work into data processing, training, and evaluation pipelines, where each step plays a distinct, essential role.
The development of an ML system is cyclical, with frequent iteration between steps (source: *Designing Machine Learning Systems*, Chip Huyen).
DVC tracks every version of your data. In this project, datasets such as data_00 and data_01 are combined into combined_data, which is then preprocessed to produce the working version of the Yahoo Questions dataset.
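As a minimal sketch (not the project's exact code), a DVC-tracked file can be read back at any Git revision through DVC's Python API; the path and revision below are hypothetical:

```python
import pandas as pd
import dvc.api

# Stream the tracked file as it existed at a given revision
# (a Git commit, branch, or tag).
with dvc.api.open("data/combined_data.csv", rev="main") as f:
    df = pd.read_csv(f)

print(df.shape)
```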
MLflow tracks all experiments, making it possible to review metrics, parameters, and model versions through its UI.
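For illustration, a run might log its configuration and results like this (a hedged sketch using MLflow's standard Python API; the experiment name and values are placeholders, not the project's actual results):

```python
import mlflow

mlflow.set_experiment("flowq-baseline")  # hypothetical experiment name

with mlflow.start_run():
    # Everything logged here appears in the MLflow UI
    # for side-by-side comparison across runs.
    mlflow.log_param("alpha", 1.0)       # example hyperparameter
    mlflow.log_metric("f1_macro", 0.87)  # example value, not a real result
```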
The initial data exploration included identifying unusual characters, emojis, and missing values. This step helps ensure the dataset is well understood before modeling.
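The checks below sketch this kind of exploration, assuming a pandas DataFrame with a text column; the column name `question` is hypothetical:

```python
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({"question": ["How do planes fly? \u2708\ufe0f", None, "What is DVC?"]})

# Count missing values per column.
print(df.isna().sum())

# Flag rows containing non-ASCII characters (emojis, unusual symbols).
non_ascii = df["question"].dropna().str.contains(r"[^\x00-\x7F]", regex=True)
print(int(non_ascii.sum()), "rows contain non-ASCII characters")
```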
The first model was MultinomialNB, a strong baseline for text classification. Class distribution analysis showed no significant imbalance. Before modeling, the text went through standard NLP preprocessing steps (sketched after this list):
- data cleaning
- tokenization
- lemmatization or stemming
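A minimal preprocessing sketch, assuming NLTK (the project may use a different toolkit):

```python
import re
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of tokenizer and lemmatizer resources.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    text = re.sub(r"[^a-z\s]", " ", text.lower())     # cleaning
    tokens = nltk.word_tokenize(text)                 # tokenization
    return [lemmatizer.lemmatize(t) for t in tokens]  # lemmatization

print(preprocess("The planes were flying quickly!"))
```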
Improving data quality led to better results. The model was built with scikit-learn, and hyperparameters were tuned with GridSearchCV; tuning was kept deliberately light, with effort focused on data quality rather than intensive model tweaking. The final output is a per-class classification report, as in the sketch below.
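As a hedged sketch of this setup (not the repository's exact code; the toy corpus and parameter grid are illustrative), a TF-IDF + MultinomialNB pipeline tuned with GridSearchCV might look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data for illustration; the real project uses the Yahoo Questions dataset.
texts = ["how do planes fly", "what lifts a plane", "best pasta recipe",
         "how to cook rice", "why is the sky blue", "what causes rain"]
labels = ["science", "science", "food", "food", "science", "science"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Deliberately small grid, mirroring the project's light-touch tuning.
search = GridSearchCV(pipeline, {"nb__alpha": [0.1, 1.0]}, cv=2)
search.fit(texts, labels)

preds = search.predict(texts)  # in practice, predict on a held-out test set
print(classification_report(labels, preds))
```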
```bibtex
@misc{Carlos2025FlowQ,
  author       = {Lima, Carlos},
  title        = {FlowQ},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/CllsPy/FlowQ}},
}
```