fasfous92/README.md

👋 Hi, I'm Youssef Sidhom

Welcome to my GitHub profile! I'm passionate about:

  • Data engineering
  • Data science
  • Machine learning

I love solving real-world problems through code.


🔭 Current Work


🏫 Education

  • École Polytechnique, Master in Data Science

    • September 2025 - December 2026 (Ongoing)
    • Relevant Courses:
      • Optimization for Data Science
      • Deep Learning (PyTorch, Keras)
      • Advanced AI for text and graphs (LoRA, RAG, graph AI)
  • INSA Lyon, Software Engineering

    • September 2020 - July 2025 (Validated)
    • Relevant Courses:
      • Foundation of Data Engineering
      • Machine Learning and Data Analytics
      • Object-Oriented Programming (C++)
      • 2 years of STEM classes

💻 Projects

Here are some of the notable projects I’ve worked on during my academic journey:

Bridging Structured Chemical Graphs and Natural Language

A multi-modal system that translates 2D molecular structures into human-readable scientific descriptions. It aligns symbolic graph representations with semantic text using a dual-tower architecture and contrastive learning, with the aim of automating chemical database enrichment and drug discovery reporting.

  • 🧠 AI: ChEmbed (BASF-AI) & Graph Transformer with Global Self-Attention.
  • Architecture: Dual-Tower Encoder with Trainable Adapter Layers.
  • Techniques: InfoNCE Contrastive Loss, Hard Negative Mining via Tanimoto Similarity, and Matryoshka Representation Learning.
  • Focus: Multi-modal Alignment, Graph Representation Learning, and Domain-Specific NLP.
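
As a rough illustration of the techniques listed above, here is a minimal sketch of an InfoNCE loss over paired graph and text embeddings, plus Tanimoto-based hard negative selection. It is not taken from the project's code: the function names, the temperature value, and the representation of fingerprints as sets of on-bit indices are all assumptions made for the example, and it uses plain Python rather than a deep learning framework.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(graph_embs, text_embs, temperature=0.07):
    """Mean InfoNCE loss in the graph -> text direction.

    Assumes graph_embs[i] and text_embs[i] describe the same molecule;
    every other text in the batch acts as a negative.
    """
    total = 0.0
    for i, g in enumerate(graph_embs):
        logits = [cosine(g, t) / temperature for t in text_embs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(graph_embs)


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def hardest_negative(anchor_idx, fingerprints):
    """Index of the molecule most similar to the anchor, excluding the anchor itself."""
    candidates = [j for j in range(len(fingerprints)) if j != anchor_idx]
    return max(candidates, key=lambda j: tanimoto(fingerprints[anchor_idx], fingerprints[j]))
```

Structurally similar molecules make harder negatives because their descriptions tend to differ only in small details, which is the motivation for ranking candidates by Tanimoto similarity in this sketch.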

🔍 Check out the Code

Real-time RAG Agent for Public Transport

A robust AI assistant that helps users navigate the Paris transport network. It uses LLM tool-calling to query live APIs for itineraries and traffic disruptions.

  • 🧠 AI: Llama 3.1 405B (via NVIDIA NIM)
  • Architecture: Microservices (Docker Compose)
  • Data Pipeline: Apache Kafka (KRaft) & Elasticsearch
  • Focus: Tool-calling, RAG, & Real-time Data Streaming
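
The tool-calling idea can be sketched as a small dispatcher that receives the tool call emitted by the model and routes it to a Python function. This is a hypothetical illustration, not the project's implementation: the tool names `get_itinerary` and `get_disruptions`, their arguments, and the JSON shape of the call are assumptions, and the tool bodies return placeholder data where a real assistant would query a live transport API.

```python
import json


def get_itinerary(origin, destination):
    # Hypothetical tool: a real version would query a live itinerary API.
    return {"origin": origin, "destination": destination, "legs": []}


def get_disruptions(line):
    # Hypothetical tool: a real version would query a live traffic API.
    return {"line": line, "disruptions": []}


# Registry mapping the tool names exposed to the LLM to Python callables.
TOOLS = {"get_itinerary": get_itinerary, "get_disruptions": get_disruptions}


def dispatch_tool_call(tool_call_json):
    """Run the tool requested by the model and return its result as JSON.

    Assumes the model emits {"name": ..., "arguments": {...}}.
    """
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**call.get("arguments", {}))
    return json.dumps(result)
```

In a loop of this kind, the JSON string returned by `dispatch_tool_call` would be sent back to the model as the tool result so it can write the final answer.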

🔍 Check out the Code

Dataset: 2025 Roland Garros Final (Time-series sequences)

Goal: Classify "Hit", "Bounce", and "Air" states from raw (x,y) coordinates.

Methodology:

  • Feature Engineering: Transformed raw coordinates into kinematic features (Acceleration, Jerk, Turn Angle) to capture physical "shocks."
  • Supervised Learning: Implemented an optimized LightGBM model, outperforming CatBoost and XGBoost baselines in handling extreme class imbalance.
  • Unsupervised Learning: Developed a pipeline using UMAP embeddings + Gaussian Mixture Models (GMM) to cluster events without labels.
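
The feature engineering step can be sketched with finite differences over the (x, y) track. This is a minimal illustration under assumptions, not the project's code: the function name, the fixed time step `dt`, and the choice of the unsigned angle between consecutive velocity vectors as the turn angle are all choices made for the example.

```python
import math


def _diff(seq, dt):
    """Finite difference of a sequence of 2D vectors."""
    return [((bx - ax) / dt, (by - ay) / dt) for (ax, ay), (bx, by) in zip(seq, seq[1:])]


def _norms(vectors):
    """Euclidean norm of each 2D vector."""
    return [math.hypot(x, y) for x, y in vectors]


def kinematic_features(points, dt=1.0):
    """Derive speed, acceleration, jerk and turn angle from (x, y) positions.

    Each successive difference shortens the series by one sample.
    """
    velocity = _diff(points, dt)
    acceleration = _diff(velocity, dt)
    jerk = _diff(acceleration, dt)

    # Turn angle: unsigned angle between consecutive velocity vectors, in radians.
    turn_angle = []
    for (ax, ay), (bx, by) in zip(velocity, velocity[1:]):
        cross = ax * by - ay * bx
        dot = ax * bx + ay * by
        turn_angle.append(abs(math.atan2(cross, dot)))

    return {
        "speed": _norms(velocity),
        "acceleration": _norms(acceleration),
        "jerk": _norms(jerk),
        "turn_angle": turn_angle,
    }
```

On a smooth trajectory the acceleration and turn angle stay near zero, while a hit or bounce shows up as a spike, which is the physical "shock" these features are meant to expose to the classifier.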

🔍 Check out the Code

Other projects
  • Description:
    The main purpose of this project is to implement an ETL (Extract, Transform, Load) pipeline to collect and process data from social media platforms like Reddit and HealthUnlocked. The project focuses on extracting data about ADHD and aims to extract demographics using an agentic/LLM approach to enrich the dataset with valuable insights.
  • Technologies Used: Python, Pandas, Airflow, Docker, Redis, Reddit API, HealthUnlocked API, MongoDB, Mistral LLM, Ollama 1B LLMs
  • Highlights:
    • Data Ingestion & Storage: Implemented a robust data ingestion pipeline to scrape posts from Reddit and HealthUnlocked. Ensured data quality by checking for duplicates using Redis and storing the data in MongoDB.
    • Data Augmentation: Utilized Large Language Models (LLMs) like Mistral and Ollama 1B to perform sentiment analysis, keyword extraction, gender inference, and detection of self-diagnosis and self-medication mentions, enriching the dataset with valuable insights.
    • Data Cleaning & Staging: Processed and cleaned the augmented data using pandas, ensuring consistency and accuracy before transferring it to the staging database.
    • Production Database: Designed a common database schema to facilitate efficient querying and visualization, preparing the data for in-depth analysis and reporting.
  • Description:
    The main purpose of the project was to put in place a data collection protocol on the production lines of Geberit's factory in Haldensleben, Germany.
  • Technologies Used: MS SQL Server, C#, CSHTML, OPC-UA, SAP Plant Connectivity (PCo), ASP.NET
  • Highlights:
    • Stakeholder Engagement & Database Design: Collaborated with stakeholders to align solutions with business goals and contributed to the design of the database schema in SQL Server. This collaboration also helped in choosing appropriate communication protocols (OPC-UA and SAP Plant Connectivity).
    • Dashboards & Real-Time Data Viewing: Designed interactive dashboards in C# and CSHTML using the MVC model to highlight key performance indicators (KPIs) and provide real-time data visualization.
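
The duplicate check mentioned under Data Ingestion & Storage for the ADHD ETL project can be sketched as follows. This is an illustration under assumptions rather than the project's code: an in-memory set stands in for a Redis set (where `SADD` reports whether the member was newly added), and the post fields `source` and `id` are hypothetical.

```python
import hashlib


class SeenPosts:
    """In-memory stand-in for a Redis set used to track ingested posts."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, key):
        """Return True if the key was not seen before, and record it."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def post_fingerprint(source, post_id):
    """Stable key for a post, built from its platform and platform-specific id."""
    return hashlib.sha256(f"{source}:{post_id}".encode("utf-8")).hexdigest()


def filter_new_posts(posts, seen):
    """Keep only posts that have not been ingested yet."""
    return [p for p in posts if seen.add_if_new(post_fingerprint(p["source"], p["id"]))]
```

Only the posts returned by `filter_new_posts` would then be written to the document store, so re-running the scraper does not create duplicate records.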

🌟 Skills

  • Programming Languages: C++, C, Python, Java, JavaScript
  • Frameworks & Tools: Docker, Vue.js, Git, Pandas, TensorFlow, scikit-learn, PyTorch
  • Databases: MySQL, MongoDB, Redis, Neo4j, Kafka

📫 How to Reach Me


⚡ Fun Fact

πŸ€ I have played basketball all my life I am only 175cm (5.7ft) tall πŸ€

Pinned

  1. NourJadiri/mental_health_disorders_analysis Public

    Analysis of the frequency of self-diagnosed people with mental health disorders

    Jupyter Notebook

  2. public_transport_RAG Public

    Python 2

  3. QSA_tennis_bounce_hit Public

    Python

  4. Geberit_WebHMI Public archive

    HTML 1

  5. Molecular_graph_captionning Public

    Python

  6. Object_detection_in_documents Public

    Python