Weakly Supervised Spam-Label Classification

DSC180 Quarter 2 Capstone Project

Using a list of categories and seed words that represent each category, we classify harmful spam messages into categories such as insurance scams, medical sales, software sales, and more. In doing so, we hope to ease the burden on non-technical people in today's world, since spammers continue to slip past detection systems, and we want to find and highlight the patterns their messages share. We compare models ranging from simple methods such as TF-IDF to complex language-model-based approaches such as ConWea with BERT, and examine whether such large, computationally costly models are worth using.

You can see more details about our project on our website.

Data

The data is available on Google Drive.
Please unzip and place the files into the following locations:
Annotated Spam Messages -> data/raw/spam/Annotated/
Unannotated Spam Messages -> data/raw/spam/Unannotated/
Non-spam (Ham) Messages -> data/raw/ham/

The dataset should contain the following files:

Annotated Spam Messages
ex) data/raw/spam/Annotated/medical-sales/xyz.txt
- where xyz is the name of any file annotated as medical-sales spam
- Other folders follow the same pattern, one folder per category
Non-spam (ham) Messages
ex) data/raw/ham/xyz.txt
Seedwords JSON file
ex) data/out/seedwords.json
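
The layout above can be read with a few lines of Python. This is only a minimal sketch, not code from this repository: the helper names load_annotated_spam and load_seedwords are hypothetical, and the seed-word file is assumed to be a JSON object mapping each category name to a list of words.

```python
import json
from pathlib import Path


def load_annotated_spam(root):
    """Return {category: [message text, ...]}, one entry per category folder."""
    root = Path(root)
    messages = {}
    if not root.is_dir():
        # Data has not been downloaded yet; return an empty mapping.
        return messages
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Each sub-folder name (e.g. medical-sales) is treated as the label.
        messages[category_dir.name] = [
            f.read_text(encoding="utf-8", errors="ignore")
            for f in sorted(category_dir.glob("*.txt"))
        ]
    return messages


def load_seedwords(path):
    """Load the seed words (assumed format: {category: [word, ...]})."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    spam = load_annotated_spam("data/raw/spam/Annotated")
    for category, texts in spam.items():
        print(f"{category}: {len(texts)} messages")
```

Running the sketch from the repository root prints one line per category folder, which is a quick way to confirm the files were unzipped into the right place.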

Running the Project

DSMLP Command

launch.sh -i gbirch11/dsc180b [-m d] [-g 1]

Note: -m is an optional argument that requests more RAM on the machine; we HIGHLY RECOMMEND setting d to 16 or 32 for faster processing.
Running with -g 1 is also highly recommended, especially when running the ConWea model.

launch.sh -i gbirch11/dsc180b -m 32 -g 1

To run this project, execute the following command:

python run.py [test | data]

Note: If running python run.py test:
A very simple set of test data will be used to produce results.
The trends in these results are not consistent with those from the full dataset.

If running python run.py data:
The whole dataset will be used to produce results.

Example commands include:
python run.py test
python run.py data

Note: The above commands only run the TF-IDF, Word2Vec, and FastText models. To run our best model, ConWea, see the section below.
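
To give a feel for the simplest of these baselines, here is a rough sketch of how seed words and TF-IDF weights could be combined to label a message. It is an illustration under stated assumptions, not the implementation in run.py: the function names are hypothetical, the tokenizer is a plain regular expression, and the IDF uses a common smoothed form.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def classify(documents, seedwords):
    """Label each document with the category whose seed words weigh most."""
    labels = []
    for vector in tfidf_vectors(documents):
        scores = {
            label: sum(vector.get(word.lower(), 0.0) for word in words)
            for label, words in seedwords.items()
        }
        best = max(scores, key=scores.get)
        # Leave the message unlabeled when no seed word occurs in it.
        labels.append(best if scores[best] > 0 else None)
    return labels
```

For example, classify(["Cheap pills online"], {"medical-sales": ["pills"]}) labels the message medical-sales, while a message containing no seed word gets None. Multi-word seed phrases would not match in this sketch, since scoring is done on single tokens.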

Running the ConWea Model

Since ConWea is a huge model using BERT, we have split running it into the following separate commands:

1. Navigate to the ConWea model directory using
cd src/models/ConWea
2. To contextualize the corpus and seed words, run
a) For testing: python contextualize.py --dataset_path "../../../test/testdata/" --temp_dir "temp/" --gpu_id 0
b) For full data: python contextualize.py --dataset_path "../../../data/raw/spam/Annotated/" --temp_dir "temp/" --gpu_id 0
3. To train the model and observe results, run
a) For testing: python train.py --dataset_path "../../../test/testdata/" --gpu_id 0
b) For full data: python train.py --dataset_path "../../../data/raw/spam/Annotated/" --gpu_id 0
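
If you would rather drive both ConWea steps from one script, the commands above could be wrapped roughly as follows. This is a sketch only: build_conwea_commands and run_conwea are hypothetical helpers that are not part of this repository, and they use only the flags and paths listed above. sys.executable stands in for python so that the same interpreter is reused.

```python
import subprocess
import sys

# Dataset paths copied from the commands above (relative to src/models/ConWea).
DATASET_PATHS = {
    "test": "../../../test/testdata/",
    "data": "../../../data/raw/spam/Annotated/",
}


def build_conwea_commands(mode, gpu_id=0, temp_dir="temp/"):
    """Return the contextualize and train commands as argument lists."""
    if mode not in DATASET_PATHS:
        raise ValueError(f"mode must be one of {sorted(DATASET_PATHS)}")
    dataset = DATASET_PATHS[mode]
    return [
        [sys.executable, "contextualize.py", "--dataset_path", dataset,
         "--temp_dir", temp_dir, "--gpu_id", str(gpu_id)],
        [sys.executable, "train.py", "--dataset_path", dataset,
         "--gpu_id", str(gpu_id)],
    ]


def run_conwea(mode, conwea_dir="src/models/ConWea"):
    """Run both steps in order from the ConWea directory; stop on failure."""
    for command in build_conwea_commands(mode):
        subprocess.run(command, cwd=conwea_dir, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_conwea("test") to execute them.
    for command in build_conwea_commands("test"):
        print(" ".join(command))
```

Because check=True is set, training does not start if the contextualization step fails.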

Note: Be warned that running ConWea on the full dataset will take ~3 hours. Running ConWea on the test data takes ~20 minutes.
Note: ConWea trains using multiple layers and many epochs. Since our test data is small, it is safe to interrupt the run from the terminal (CTRL+C) after the first iteration has occurred. The layers are kept for consistency with full datasets.
