SVM for Spam Classification

The training and test sets are generated by transforming emails into binary feature vectors, which the SVM training algorithm uses to fit an optimal model. The trained model is then stored in 'model.mat' and can later be used to predict whether an email is spam by running 'prediction.m'.

This project is part of the Stanford Machine Learning course on Coursera.
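
As a rough illustration of how a linear-kernel SVM turns a binary feature vector into a spam / non-spam decision, here is a minimal Python sketch. It is not the project's Octave code: the function name `predict_spam`, the weight vector `w`, and the bias `b` are hypothetical, and the actual contents of 'model.mat' are not shown in this README.

```python
def predict_spam(x, w, b):
    """Return 1 (spam) if the linear decision score is non-negative, else 0.

    x: binary feature vector (list of 0/1 values)
    w: weight vector of the same length (hypothetical values)
    b: bias term (hypothetical value)
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0


# Toy example with made-up weights, for illustration only.
w = [1.5, -2.0, 0.5]
b = -1.0
print(predict_spam([1, 0, 1], w, b))
print(predict_spam([0, 1, 1], w, b))
```

With a linear kernel the decision depends only on the sign of the weighted sum, so prediction is a single dot product over the vocabulary-sized vector.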

File structure

/spamTrain.mat and /spamTest.mat

spamTrain.mat contains 4000 training examples of spam and non-spam email, while spamTest.mat contains 1000 test examples. Each original email was processed using the processEmail and emailFeatures functions and converted into a vector.

/vocab.txt

The vocabulary list was built by choosing all words that occur at least 100 times in the spam corpus, resulting in a list of 1899 words.

/training.m

Trains an SVM with a linear kernel for spam classification and writes 'model' to model.mat after training.

/prediction.m

Reads an email (without headers) from 'input.txt' and predicts whether it is spam.

/processEmail.m

Implements the following email preprocessing and normalization steps: lower-casing, stripping HTML, normalizing URLs, normalizing email addresses, normalizing numbers, normalizing dollar signs, word stemming, and removal of non-words.
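
The preprocessing steps and the conversion to a binary vector described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the project's Octave implementation: the placeholder tokens (`httpaddr`, `emailaddr`, `dollar`, `number`), the function names, and the four-word toy vocabulary are assumptions, and word stemming is left out.

```python
import re


def normalize_email(text):
    """Apply the listed normalization steps (stemming omitted) and return tokens."""
    text = text.lower()                                        # lower-casing
    text = re.sub(r"<[^<>]+>", " ", text)                      # strip HTML tags
    text = re.sub(r"(http|https)://[^\s]*", "httpaddr", text)  # normalize URLs
    text = re.sub(r"[^\s]+@[^\s]+", "emailaddr", text)         # normalize email addresses
    text = re.sub(r"[$]+", "dollar", text)                     # normalize dollar signs
    text = re.sub(r"[0-9]+", "number", text)                   # normalize numbers
    return re.findall(r"[a-z]+", text)                         # keep alphabetic words only


def binary_features(tokens, vocab):
    """Return a 0/1 vector with one entry per vocabulary word."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


# Toy vocabulary for illustration; the real list in vocab.txt has 1899 words.
vocab = ["emailaddr", "free", "httpaddr", "pay"]
tokens = normalize_email("<b>Visit</b> http://example.com NOW, pay $100 to bob@example.com!")
print(tokens)
print(binary_features(tokens, vocab))
```

Each entry of the resulting vector only records whether the corresponding vocabulary word appears in the email, not how often it appears.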

Tutorial

1. Start Octave and change to the folder that contains all the source files.

2. Type

training

in Octave to train the SVM.

3. Copy the email you want to test (without headers) into 'input.txt'.

4. Type

prediction

in Octave to predict whether the email is spam.
