FlowQ — An MLOps-driven NLP system for modeling and analyzing Yahoo Questions data.

FlowQ

This repository contains code for MLOps experiments using DVC for data tracking and MLflow for model management.

What's MLOps

Building machine learning systems at scale requires more than a notebook and a saved model. You must track parameters, monitor performance, and maintain quality over time. MLOps defines practices that make this process efficient, such as running data processing, training, and evaluation as pipelines. Each step plays a distinct and essential role.

image

The development of an ML system is cyclical, with frequent iteration between steps (source: ML Systems – Chip Huyen).

DVC

DVC tracks every version of your data. In this project, raw datasets such as data_00 and data_01 are merged into combined_data, which is then preprocessed to produce the cleaned version of the Yahoo Questions dataset.
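
The merge step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `.csv` extension, the `question` and `label` columns, and the `combine_csv` helper are hypothetical and are not taken from the repository.

```python
import csv
import tempfile
from pathlib import Path


def combine_csv(part_paths, out_path):
    """Concatenate CSV files that share a header; return the number of data rows."""
    rows, fieldnames = [], None
    for path in part_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            elif reader.fieldnames != fieldnames:
                raise ValueError(f"{path} has a different header: {reader.fieldnames}")
            rows.extend(reader)
    if fieldnames is None:
        raise ValueError("no input files were given")
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Toy demonstration in a temporary directory (file names and columns are hypothetical).
workdir = Path(tempfile.mkdtemp())
(workdir / "data_00.csv").write_text(
    "question,label\nHow do planes fly?,science\n", encoding="utf-8"
)
(workdir / "data_01.csv").write_text(
    "question,label\nWho won the cup?,sports\n", encoding="utf-8"
)
combined_path = workdir / "combined_data.csv"
n_rows = combine_csv([workdir / "data_00.csv", workdir / "data_01.csv"], combined_path)
```

Once a combined file like this exists, running `dvc add` on it is one way to put that version under DVC's control; whether this project uses `dvc add` or a DVC pipeline stage is not stated here.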

image

MLflow

MLflow tracks all experiments, making it possible to review metrics, parameters, and model versions through its UI.
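
As a rough sketch of how a run might be recorded, the helper below flattens a nested metrics dictionary and logs it together with the parameters. The function names, the run name, and the numbers are placeholders chosen for illustration, not the project's actual code or results.

```python
def flatten_report(report):
    """Flatten a nested metrics dict into {name: float} pairs."""
    flat = {}
    for key, value in report.items():
        name = str(key).replace(" ", "_")
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{name}.{sub_key}"] = float(sub_value)
        else:
            flat[name] = float(value)
    return flat


def log_run(params, report, run_name="baseline"):
    """Log one experiment: parameters plus flattened metrics."""
    import mlflow  # lazy import: only needed when a run is actually logged

    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(params)
        mlflow.log_metrics(flatten_report(report))


# Placeholder values for illustration only, not results from this project.
example_report = {"accuracy": 0.5, "macro avg": {"precision": 0.5, "f1-score": 0.25}}
flat_metrics = flatten_report(example_report)
```

With MLflow installed, a call such as `log_run({"alpha": 1.0}, example_report)` would create a run that can then be inspected by starting the UI with `mlflow ui`.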

image

NLP

Exploratory Data Analysis

The initial data exploration included identifying unusual characters, emojis, and missing values. This step helps ensure the dataset is well understood before modeling.
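
A minimal sketch of such a check, using only the standard library, is shown below. The `profile_texts` helper is hypothetical, and treating Unicode category `So` (symbol, other) as "emoji-like" is a simplifying assumption that catches many emojis but not all of them.

```python
import unicodedata
from collections import Counter


def profile_texts(texts):
    """Count missing entries, emoji-like symbols, and other non-ASCII characters."""
    missing = 0
    unusual = Counter()
    emoji_like = Counter()
    for text in texts:
        if text is None or not text.strip():
            missing += 1
            continue
        for ch in text:
            if ch.isascii():
                continue
            if unicodedata.category(ch) == "So":
                emoji_like[ch] += 1
            else:
                unusual[ch] += 1
    return {"missing": missing, "unusual": unusual, "emoji_like": emoji_like}


# Made-up sample rows: one normal, two missing, one with an accent and an emoji.
sample = ["What is DVC?", None, "   ", "Is caf\u00e9 spelled right? \U0001F600"]
report = profile_texts(sample)
```

Counting characters instead of removing them right away keeps the exploration separate from the cleaning decisions that follow.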

Modeling

The first model was MultinomialNB, a strong baseline for text classification. Class distribution analysis showed no significant imbalance. Before modeling, the project followed standard NLP preprocessing steps:

  • data cleaning
  • tokenization
  • lemmatization or stemming

image
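
The three preprocessing steps listed above can be illustrated with a small standard-library sketch. The regular expressions and the suffix-stripping `toy_stem` function are assumptions made for this example; the stemmer is deliberately crude and is not the Porter algorithm or whatever the project actually uses.

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9]+")
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def clean(text):
    """Data cleaning: drop HTML-like tags and URLs, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    return text.lower()


def tokenize(text):
    """Tokenization: keep runs of lowercase letters and digits."""
    return TOKEN_RE.findall(text)


def toy_stem(token):
    """Stemming (toy version): strip one common suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    return [toy_stem(token) for token in tokenize(clean(text))]


tokens = preprocess("Why are <b>cats</b> always jumping? See https://example.com")
# tokens == ["why", "are", "cat", "alway", "jump", "see"]
```

The output also shows the limits of naive stemming: "always" becomes "alway", which is why a real stemmer or lemmatizer from an NLP library is usually preferred.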

Improving data quality led to better results. The final output included a classification report similar to the one below:

image

The model was built with scikit-learn, and hyperparameters were tuned using grid search. Tuning was kept minimal; the focus was on data quality rather than intensive model tweaking.
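
For orientation, a small grid search over a MultinomialNB pipeline could look like the sketch below. The TF-IDF features, the `alpha` grid, the scoring choice, and the toy sentences are all assumptions for illustration; the repository's actual features, grid, and data may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up dataset, only to make the sketch runnable.
texts = [
    "which team won the football match",
    "the striker scored a late goal",
    "the coach changed the team lineup",
    "the match ended in a penalty shootout",
    "how do I install a new graphics driver",
    "my laptop will not boot after the update",
    "which keyboard is best for programming",
    "the software update fixed the driver bug",
]
labels = ["sports"] * 4 + ["tech"] * 4

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# A deliberately small grid over the smoothing parameter only.
search = GridSearchCV(
    pipeline,
    param_grid={"nb__alpha": [0.1, 1.0]},
    cv=2,
    scoring="f1_macro",
)
search.fit(texts, labels)

prediction = search.predict(["the team won the match"])
```

After fitting, `search.best_params_` holds the selected `alpha`, and `sklearn.metrics.classification_report` can be applied to held-out predictions to produce a per-class report.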

Citation

@misc{Carlos2025FlowQ,
  author = {Lima, Carlos},
  title = {FlowQ},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/CllsPy/FlowQ}},
}
