Welcome to my GitHub profile! I'm passionate about:
- Data engineering
- Data science
- Machine learning
I love solving real-world problems through code.
- I'm currently working on signature detection in documents: Signature Detection in Documents
- École Polytechnique, Master in Data Science
  - September 2025 - December 2026 (Ongoing)
  - Relevant Courses:
    - Optimization for Data Science
    - Deep Learning (PyTorch, Keras)
    - Advanced AI for text and graphs (LoRA, RAG, graph AI)
- INSA Lyon, Software Engineering
  - September 2020 - July 2025 (Validated)
  - Relevant Courses:
    - Foundation of Data Engineering
    - Machine Learning and Data Analytics
    - Object Oriented Programming (C++)
  - 2 years of STEM classes
Here are some of the notable projects I've worked on during my academic journey:
Bridging Structured Chemical Graphs and Natural Language
A multi-modal system that translates 2D molecular structures into human-readable scientific descriptions. It aligns symbolic graph representations with semantic text using a dual-tower architecture and contrastive learning, with the goal of automating chemical database enrichment and drug discovery reporting.
- AI: ChEmbed (BASF-AI) & Graph Transformer with Global Self-Attention.
- Architecture: Dual-Tower Encoder with Trainable Adapter Layers.
- Techniques: InfoNCE Contrastive Loss, Hard Negative Mining via Tanimoto Similarity, and Matryoshka Representation Learning.
- Focus: Multi-modal Alignment, Graph Representation Learning, and Domain-Specific NLP.
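
The techniques listed above can be illustrated with a minimal sketch. It is not the project's actual code: it assumes fingerprints are stored as sets of on-bit indices, computes the loss in one direction only (graph to text), and the names `tanimoto`, `info_nce`, and `hardest_negative` are hypothetical.

```python
import math


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def hardest_negative(anchor: int, fingerprints: list[set[int]]) -> int:
    """Index of the other molecule whose fingerprint is most similar to the anchor's."""
    candidates = [i for i in range(len(fingerprints)) if i != anchor]
    return max(candidates, key=lambda i: tanimoto(fingerprints[anchor], fingerprints[i]))


def info_nce(sim: list[list[float]], temperature: float = 0.07) -> float:
    """Mean InfoNCE loss over a batch.

    sim[i][j] is the similarity between graph embedding i and text
    embedding j; the matching pair sits on the diagonal.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # negative log-softmax of the positive pair
    return total / len(sim)
```

Picking negatives with high Tanimoto similarity forces the model to separate molecules that look alike structurally, which is the usual motivation for hard negative mining.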
Check out the Code
Real-time RAG Agent for Public Transport
A robust AI assistant that helps users navigate the Paris transport network. It uses LLM tool-calling to query live APIs for itineraries and traffic disruptions.
- AI: Llama 3.1 405B (via NVIDIA NIM)
- Architecture: Microservices (Docker Compose)
- Data Pipeline: Apache Kafka (KRaft) & Elasticsearch
- Focus: Tool-calling, RAG, & Real-time Data Streaming
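
As a rough illustration of the tool-calling idea, the sketch below routes a JSON tool call emitted by the model to a Python function. The tool names, their arguments, and the JSON shape are assumptions made for this example; the stubs return placeholder data instead of querying a live API.

```python
import json
from typing import Callable


def get_itinerary(origin: str, destination: str) -> dict:
    # Stub: a real tool would call a live journey-planning API here.
    return {"origin": origin, "destination": destination, "legs": []}


def get_disruptions(line: str) -> dict:
    # Stub: a real tool would query a live traffic-disruption feed here.
    return {"line": line, "disruptions": []}


# Registry of tools the model is allowed to call (hypothetical names).
TOOLS: dict[str, Callable[..., dict]] = {
    "get_itinerary": get_itinerary,
    "get_disruptions": get_disruptions,
}


def dispatch(tool_call_json: str) -> dict:
    """Run the tool named in an LLM tool call and return its result."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**call.get("arguments", {}))
```

The result of `dispatch` would then be appended to the conversation so the model can phrase the final answer.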
Check out the Code
Dataset: 2025 Roland Garros Final (Time-series sequences)
Goal: Classify "Hit", "Bounce", and "Air" states from raw (x,y) coordinates.
Methodology:
- Feature Engineering: Transformed raw coordinates into kinematic features (Acceleration, Jerk, Turn Angle) to capture physical "shocks."
- Supervised Learning: Implemented an optimized LightGBM model, outperforming CatBoost and XGBoost baselines in handling extreme class imbalance.
- Unsupervised Learning: Developed a pipeline using UMAP embeddings + Gaussian Mixture Models (GMM) to cluster events without labels.
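
The feature engineering step can be sketched with plain finite differences. This is an illustrative version only: it assumes evenly spaced frames, and the function name `kinematic_features` and the time step `dt` are hypothetical.

```python
import math


def kinematic_features(points: list[tuple[float, float]], dt: float = 1.0) -> list[dict]:
    """Speed, acceleration, jerk, and turn angle from consecutive (x, y) positions."""
    # Finite differences: velocity, then acceleration, then jerk.
    vel = [((x2 - x1) / dt, (y2 - y1) / dt)
           for (x1, y1), (x2, y2) in zip(points, points[1:])]
    acc = [((vx2 - vx1) / dt, (vy2 - vy1) / dt)
           for (vx1, vy1), (vx2, vy2) in zip(vel, vel[1:])]
    jerk = [((ax2 - ax1) / dt, (ay2 - ay1) / dt)
            for (ax1, ay1), (ax2, ay2) in zip(acc, acc[1:])]

    features = []
    for i in range(len(jerk)):
        (vx1, vy1), (vx2, vy2) = vel[i], vel[i + 1]
        cross = vx1 * vy2 - vy1 * vx2
        dot = vx1 * vx2 + vy1 * vy2
        features.append({
            "speed": math.hypot(vx2, vy2),
            "acceleration": math.hypot(*acc[i]),
            "jerk": math.hypot(*jerk[i]),
            "turn_angle": math.atan2(cross, dot),  # signed, in radians
        })
    return features
```

A sharp change of direction, such as a bounce, shows up as a spike in acceleration and turn angle, which is what makes these features useful to a classifier.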
Check out the Code
Other projects
- Description: This project implements an ETL (Extract, Transform, Load) pipeline that collects and processes data from social media platforms like Reddit and HealthUnlocked. It focuses on posts about ADHD and uses an agentic/LLM approach to extract demographics and enrich the dataset.
- Technologies Used: Python, Pandas, Airflow, Docker, Redis, Reddit API, HealthUnlocked API, MongoDB, Mistral LLM, Ollama 1B LLMs
- Highlights:
- Data Ingestion & Storage: Implemented a robust data ingestion pipeline to scrape posts from Reddit and HealthUnlocked. Ensured data quality by checking for duplicates using Redis and storing the data in MongoDB.
- Data Augmentation: Utilized Large Language Models (LLMs) like Mistral and Ollama 1B to perform sentiment analysis, keyword extraction, gender inference, and detection of self-diagnosis and self-medication mentions, enriching the dataset with valuable insights.
- Data Cleaning & Staging: Processed and cleaned the augmented data using pandas, ensuring consistency and accuracy before transferring it to the staging database.
- Production Database: Designed a common database schema to facilitate efficient querying and visualization, preparing the data for in-depth analysis and reporting.
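
The duplicate check in the ingestion step might look like the sketch below. To keep it runnable without a server, an in-memory `SeenStore` class stands in for the Redis set; that class, the `source` and `id` field names, and the hashing scheme are assumptions for illustration, not the project's code.

```python
import hashlib


class SeenStore:
    """In-memory stand-in for a Redis set (hypothetical helper)."""

    def __init__(self) -> None:
        self._keys: set[str] = set()

    def add(self, key: str) -> bool:
        """Return True if the key is new, much as Redis SADD reports 1 for a new member."""
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def fingerprint(post: dict) -> str:
    """Stable hash of the fields that identify a post."""
    raw = f"{post['source']}:{post['id']}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def deduplicate(posts: list[dict], seen: SeenStore) -> list[dict]:
    """Keep only posts whose fingerprint has not been seen before."""
    return [p for p in posts if seen.add(fingerprint(p))]
```

Only the posts returned by `deduplicate` would then be written to MongoDB, so repeated scraping runs do not store the same post twice.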
- Description: The main purpose of the project was to put in place a data collection protocol on the production lines of Geberit's factory in Haldensleben, Germany.
- Technologies Used: MS SQL Server, C#, CSHTML, OPC-UA, SAP Plant Connectivity (PCo), ASP.NET.
- Highlights:
- Stakeholder Engagement & Database Design: Collaborated with stakeholders to align solutions with business goals and contributed to the design of the database schema in SQL Server. Also helped select appropriate communication protocols (OPC-UA and SAP Plant Connectivity).
- Dashboards & Real-Time Data Viewing: Designed interactive dashboards in C# and CSHTML using the MVC model to highlight key performance indicators (KPIs) and provide real-time data visualization.
- Programming Languages: C++, C, Python, Java, JavaScript
- Frameworks & Tools: Docker, Vue.js, Git, Pandas, TensorFlow, scikit-learn, PyTorch
- Databases: MySQL, MongoDB, Redis, Neo4j, Kafka
- Email: youssefsidhom92@gmail.com
- LinkedIn: Youssef SIDHOM
I have played basketball all my life, although I am only 175 cm (5.7 ft) tall.


