The training and test sets are generated by transforming emails into binary feature vectors, which the SVM training algorithm uses to fit a model. The model is then stored in 'model.mat' and can be used later to predict whether an email is spam by running 'prediction.m'.
This project is part of the Stanford Machine Learning course on Coursera.
-
/spamTrain.mat and /spamTest.mat
spamTrain.mat contains 4000 training examples of spam
and non-spam email, while spamTest.mat contains 1000 test examples. Each
original email was processed using the processEmail and emailFeatures
functions and converted into a binary feature vector.
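
To make the conversion concrete, here is a minimal sketch of turning a list of vocabulary indices into a binary feature vector. It assumes the email has already been reduced to indices into the 1899-word vocabulary; the index values and variable names are hypothetical and this is not necessarily how emailFeatures is written.

```octave
% Minimal sketch: map vocabulary indices to a binary feature vector.
n = 1899;                         % vocabulary size (see vocab.txt)
word_indices = [86 916 794 86];   % hypothetical indices; repeats are allowed
x = zeros(n, 1);                  % one entry per vocabulary word
x(word_indices) = 1;              % mark words that occur; repeats still give 1
present = nnz(x);                 % number of distinct vocabulary words present
```

Because the vector is binary, a word that appears several times in the email contributes the same as a word that appears once.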
-
/vocab.txt
The vocabulary list was selected by choosing all words that occur at least 100 times in the spam corpus,
resulting in a list of 1899 words.
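
The frequency threshold can be illustrated with a small sketch. The toy token list and the threshold of 2 are assumptions chosen to keep the example short; the actual list was built with a threshold of 100 on the full corpus, and the script that built it is not part of this project.

```octave
% Toy corpus of already-normalized tokens (hypothetical data).
tokens = {'click', 'here', 'click', 'free', 'click', 'free', 'now'};
min_count = 2;                              % the real list uses 100

[words, ~, idx] = unique(tokens);           % sorted unique words, token -> word map
counts = accumarray(idx(:), ones(numel(idx), 1));   % occurrences per unique word
vocab = words(counts >= min_count);         % keep only the frequent words
```

Here 'click' and 'free' survive the cut, while 'here' and 'now' occur only once and are dropped.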
-
/training.m
It trains an SVM with a linear kernel for spam classification and writes 'model' to model.mat after training.
-
/prediction.m
It reads an email (without headers) from 'input.txt' and predicts whether it is spam or not.
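
For a linear kernel, the prediction step reduces to the sign of a weighted sum over the feature vector. The sketch below shows that decision rule on a 5-word toy vocabulary. The field names w and b and all the numbers are assumptions for illustration; the structure actually stored in model.mat may differ.

```octave
% Hypothetical linear model over a 5-word vocabulary.
model.w = [1.5; -0.5; 0.0; 2.0; -1.0];   % one weight per vocabulary word
model.b = -1.0;                          % bias term

x = [1; 0; 0; 1; 0];                     % binary feature vector of one email
score = model.w' * x + model.b;          % linear decision value
is_spam = score >= 0;                    % non-negative score means spam
```

With these toy values the score is 1.5 + 2.0 - 1.0 = 2.5, so the email would be flagged as spam.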
-
/processEmail.m
It implements the following email preprocessing and normalization steps:
lower-casing, stripping HTML, normalizing URLs, normalizing email addresses, normalizing numbers, normalizing dollar signs, word stemming, and removal of non-words.
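
Several of these steps can be approximated with regexprep. The following is a rough sketch, not the actual contents of processEmail.m: the regular expressions, the placeholder tokens (httpaddr, emailaddr, number, dollar) and the sample email are assumptions, and word stemming and non-word removal are left out.

```octave
% Hypothetical sample email body.
email = 'Visit <b>http://example.com</b> NOW, pay $100 or mail bob@example.com';

s = lower(email);                                        % lower-casing
s = regexprep(s, '<[^<>]+>', ' ');                       % strip HTML tags
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr');   % normalize URLs
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');          % normalize email addresses
s = regexprep(s, '[0-9]+', 'number');                    % normalize numbers
s = regexprep(s, '[$]+', 'dollar');                      % normalize dollar signs
```

After these substitutions every URL, address, number and dollar sign is represented by the same token, so the classifier sees "a URL" rather than one particular URL.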
-
1. Start Octave and change to the folder that contains all the source files.
2. Type
training
at the Octave prompt to train an SVM.
3. Copy the email you want to test (without headers) into 'input.txt'.
4. Type
prediction
at the Octave prompt to predict whether the email is spam or not.