This repository showcases some of the projects I have completed during my career. If you have read this far, it means you are interested in my work; I appreciate it. Please share your thoughts if you would like to help me improve.
You can reach me by email: visit my personal homepage or send a message to halit_vural ( at ) techno.study.
- NLP state-of-the-art project for clone detection on developers' code - CodeBERT, UniXcoder, Doc2Vec optimizations - word embeddings, transformers - (ongoing)
- NLP for Customer Satisfaction Analysis of e-commerce commentary data - BERT fine-tuning, AdamW optimization - lemmatization, stemming, stopwords, normalization
- Price Prediction on Autoscout 2019 data scraped from an online trading company - Linear, Ridge, Lasso Regression, AdaBoost, XGBoost - Pandas, Numpy, Matplotlib, Seaborn
- EDA and RFM with Cohort Analysis for Customer Segmentation - KMeans clustering, Silhouette Analysis - Pandas, Numpy, Matplotlib, Seaborn
- EDA on multivariate biological heart-stroke data from several countries - KNN, Logistic Regression - Pandas, Numpy, Seaborn, Yellowbrick
- Microarray gene-expression analysis for cancer classification - PCA & SVD dimensionality reduction, Feature selection - Machine Learning Algorithms ANN, KNN, DT, RF, SVM
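To give a flavor of the dimensionality-reduction step in the microarray project, here is a minimal PCA sketch. The data here is synthetic random noise standing in for the real gene-expression matrix, and the shapes and the 95% variance threshold are illustrative assumptions, not the project's actual settings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for microarray data: 100 samples x 2000 genes
# (the real dataset and its class labels are not included here).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2000))

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # far fewer columns than the original 2000
```

With at most `n_samples` usable components, PCA collapses the 2000-gene matrix into a much smaller feature space before the classifiers (ANN, KNN, DT, RF, SVM) are trained.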
The dataset contains details of a bank's customers. The goal of the project is to predict customer churn. The target is a binary variable indicating whether the customer left the bank (closed their account) or stayed.
Deep Learning with TensorFlow
Pandas
Class weights
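The class-weights idea can be sketched as follows. This is a minimal example on synthetic labels using scikit-learn's `compute_class_weight`; the 80/20 class ratio is an assumption for illustration, not the actual churn rate in the bank data. In Keras, the resulting dict would be passed as `model.fit(..., class_weight=class_weight)`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic churn labels: ~20% churners, mimicking an imbalanced target
y = np.array([0] * 800 + [1] * 200)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))

# "balanced" assigns n_samples / (n_classes * count(class)) to each class,
# so the minority (churn) class gets the larger weight.
print(class_weight)
```

Weighting the loss this way makes the network pay proportionally more attention to the rare churners instead of defaulting to the majority class.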
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
The aim of this project is to predict whether a credit card transaction is fraudulent. This is not easy to do. First, we needed to analyze and understand our data well in order to draw our roadmap and choose the right methods. Accordingly, we examined the frequency distributions of the variables, observed their correlations, and checked for multicollinearity. The distribution of the target classes over the other variables was then visualized.
We then handled missing values and outliers. After these steps and the basic data pre-processing, we moved on to model building. Starting with Logistic Regression, we evaluated model performance, applied unbalanced data techniques to improve it, and observed their effects. We then used four different algorithms in the model-building phase. In the final step, we deployed the model using Streamlit.
Logistic Regression, Random Forest, XGBoost, and Neural Network algorithms
Unbalanced Data Techniques
Seaborn, Matplotlib and Yellowbrick
Streamlit API
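One of the unbalanced-data techniques mentioned above is class weighting inside the classifier itself. Here is a minimal sketch using scikit-learn's `class_weight="balanced"` on a synthetic imbalanced set; the real credit card data has only 0.172% frauds, so a milder 2% ratio is assumed here purely so the toy example still has positives in the test split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic imbalanced data standing in for the credit card transactions
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.98],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Plain model vs. one that reweights the loss toward the rare fraud class
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print("recall (plain):   ", r_plain)
print("recall (weighted):", r_weighted)
```

On fraud data, recall on the minority class is usually the metric that matters; accuracy alone is misleading when 99.8% of transactions are legitimate.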
This dataset was created by combining different datasets that were already available independently but had not been combined before. Five heart datasets are combined over 11 common features, making it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:
- Cleveland: 303 observations
- Hungarian: 294 observations
- Switzerland: 123 observations
- Long Beach VA: 200 observations
- Statlog (Heart) Data Set: 270 observations
Total: 1190 observations
Duplicated: 272 observations
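The curation step above, concatenating the source datasets over their shared columns and dropping the duplicated observations, can be sketched with pandas. The tiny frames below are made-up stand-ins with 2 columns instead of the real 11:

```python
import pandas as pd

# Toy stand-ins for two of the five source datasets; the real project
# concatenates all five over their 11 shared features.
cleveland = pd.DataFrame({"age": [63, 54, 54], "chol": [233, 239, 239]})
hungarian = pd.DataFrame({"age": [54, 48], "chol": [239, 275]})

# Stack the frames, then remove rows that appear more than once
combined = pd.concat([cleveland, hungarian], ignore_index=True)
deduped = combined.drop_duplicates().reset_index(drop=True)

print(len(combined), len(deduped))  # 5 rows before, 3 unique rows after
```

Applied to the five real datasets, the same two calls take the 1,190 raw observations down by the 272 duplicates.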
KNN, Logistic Regression
Pandas, Numpy
Seaborn, Yellowbrick
Linear, Ridge, Lasso Regression, AdaBoost, XGBoost
Pandas, Numpy, Matplotlib, Seaborn
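Comparing plain, Ridge, and Lasso regression on held-out data can be sketched as below. The data is synthetic (the scraped Autoscout set is not included), and the alpha values are illustrative assumptions rather than the tuned ones from the project:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the car price data
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Fit the three linear models and score each on the test split
scores = {}
for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))

print(scores)
```

Ridge and Lasso add L2 and L1 penalties respectively, which trade a little training fit for less overfitting; the boosting models (AdaBoost, XGBoost) are then benchmarked against these linear baselines.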
Other projects I have completed are not listed here; they are either confidential to some degree or not written up. I will add more of the available ones later.