This project builds a sentiment analysis system that uses machine learning to flag negative comments. The dataset contains statements, their IDs, and labels indicating whether each statement is positive or negative. Three models are trained: logistic regression, Support Vector Machine (SVM), and Random Forest. They are intended as an improvement on an existing project that used the Naive Bayes algorithm. The project uses the sklearn and pandas libraries.
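As a rough illustration of the setup described above, the following sketch trains the three models on TF-IDF features and compares their test accuracy. The toy data, the column names `id`, `statement` and `label`, the TF-IDF step, and the use of `LinearSVC` for the SVM are all assumptions made for this example; the project's real dataset and preprocessing are not shown here.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; column names are assumed, not taken from the real dataset.
df = pd.DataFrame({
    "id": range(1, 13),
    "statement": [
        "I love this, great work",
        "You are so stupid and useless",
        "What a wonderful day",
        "Nobody likes you, go away",
        "Thanks for the helpful answer",
        "This is the worst thing ever",
        "Really nice photo, well done",
        "You are a pathetic loser",
        "Such a kind and thoughtful post",
        "I hate everything about you",
        "Brilliant idea, I like it",
        "Shut up, you are an idiot",
    ],
    # 1 = negative, 0 = positive
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    df["statement"], df["label"],
    test_size=0.25, stratify=df["label"], random_state=0,
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit one TF-IDF + classifier pipeline per model and record its test accuracy.
scores = {}
for name, clf in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, pipeline.predict(X_test))

print(scores)
```

On a dataset this small the scores are not meaningful; the point is only the shape of the comparison loop.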
Among the models used, Random Forest achieves the highest accuracy (0.993). This can be attributed to the following properties of random forests:
- They are based on decision trees, so the scaling of the variables does not matter. Any monotonic transformation of a single variable is implicitly captured by a tree.
- They use the random subspace method and bagging to reduce overfitting.
- With a suitable implementation, a random forest can handle missing data easily.
- Automated feature selection is built in, since each split picks the most informative feature among the candidates it considers.
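Some of the properties listed above can be inspected directly in sklearn. The sketch below is a minimal example on synthetic data (generated with `make_classification`, an assumption made for illustration, not the project's dataset): `bootstrap=True` enables bagging, `max_features="sqrt"` restricts each split to a random subset of features, `oob_score=True` estimates accuracy on the samples each tree did not see, and `feature_importances_` gives a ranking that can be used for feature selection.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only: 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # bagging: each tree is fit on a bootstrap sample
    oob_score=True,       # evaluate each tree on the samples it did not see
    random_state=0,
)
forest.fit(X, y)

print(f"Out-of-bag accuracy: {forest.oob_score_:.3f}")

# Rank features by impurity-based importance, highest first.
ranked = sorted(
    enumerate(forest.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for index, importance in ranked:
    print(f"feature {index}: {importance:.3f}")
```

Note that no feature scaling step is needed before fitting, which reflects the first point in the list.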
Automatic Detection of Cyberbullying on Social Networks based on Bullying Features - Rui Zhao, Anna Zhou, Kezhi Mao
- The Enhanced Bag of Words representation concatenates BoW features, latent semantic features, and bullying features.
- A linear SVM is used to detect bullying messages.
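The feature concatenation summarised above could be sketched as follows. This is a simplified illustration and not the paper's implementation: the toy texts, the hypothetical `BULLYING_WORDS` lexicon, the count-based bullying feature, and the use of `TruncatedSVD` over TF-IDF as the latent semantic step are all assumptions made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical lexicon; the paper builds its bullying features differently.
BULLYING_WORDS = {"stupid", "idiot", "loser", "ugly"}

def bullying_counts(texts):
    """Return one column: the number of lexicon words in each text."""
    return np.array(
        [[sum(tok in BULLYING_WORDS for tok in t.lower().split())] for t in texts],
        dtype=float,
    )

# Toy data for illustration only (1 = bullying, 0 = not bullying).
texts = [
    "you are a stupid loser",
    "have a great day everyone",
    "nobody wants an ugly idiot like you",
    "thanks for sharing this recipe",
    "what an idiot you are",
    "the weather is lovely today",
    "such a loser and so stupid",
    "see you at the game tonight",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

bow = CountVectorizer()
tfidf = TfidfVectorizer()
lsa = TruncatedSVD(n_components=2, random_state=0)

def fit_features(train_texts):
    """Fit all three extractors and return the concatenated feature matrix."""
    x_bow = bow.fit_transform(train_texts)
    x_lsa = lsa.fit_transform(tfidf.fit_transform(train_texts))
    x_bully = bullying_counts(train_texts)
    return hstack([x_bow, csr_matrix(x_lsa), csr_matrix(x_bully)]).tocsr()

def transform_features(new_texts):
    """Apply the already fitted extractors to new texts."""
    x_bow = bow.transform(new_texts)
    x_lsa = lsa.transform(tfidf.transform(new_texts))
    x_bully = bullying_counts(new_texts)
    return hstack([x_bow, csr_matrix(x_lsa), csr_matrix(x_bully)]).tocsr()

X_train = fit_features(texts)
classifier = LinearSVC().fit(X_train, labels)

predictions = classifier.predict(
    transform_features(["you stupid idiot", "lovely game today"])
)
print(predictions)
```

Keeping the three feature blocks as separate columns lets the linear SVM weight each of them independently.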