SVM for Spam Classification

The training and test sets are generated by transforming emails into binary feature vectors, which the SVM training algorithm uses to fit a model. The resulting model is then stored in 'model.mat' and can later be used to predict whether an email is spam by running 'prediction.m'.

This project is part of the Stanford Machine Learning course on Coursera.
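
As a rough illustration of what a binary feature vector looks like, the sketch below marks which vocabulary words appear in a tokenized email. It is written in Python for brevity and is not the project's Octave code; the function name `binary_features` and the three-word vocabulary are hypothetical.

```python
def binary_features(tokens, vocab):
    """Return a 0/1 list with one entry per vocabulary word.

    Entry i is 1 if vocab[i] occurs at least once in tokens, else 0.
    Hypothetical helper, not part of the repository.
    """
    index = {word: i for i, word in enumerate(vocab)}
    x = [0] * len(vocab)
    for token in tokens:
        i = index.get(token)
        if i is not None:  # words outside the vocabulary are ignored
            x[i] = 1
    return x


# Toy vocabulary for illustration only (the real list has 1899 words).
vocab = ["buy", "cheap", "now"]
features = binary_features(["buy", "now", "now", "hello"], vocab)
print(features)  # [1, 0, 1]
```

Repeated words do not change the result, because the vector records presence only, not counts.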

File structure

  • /spamTrain.mat and /spamTest.mat

    spamTrain.mat contains 4000 training examples of spam and non-spam email, while spamTest.mat contains 1000 test examples. Each original email was processed using the processEmail and emailFeatures functions and converted into a feature vector.

  • /vocab.txt

    The vocabulary list was built by choosing all words that occur at least 100 times in the spam corpus, resulting in a list of 1899 words.

  • /training.m

    Trains an SVM with a linear kernel for spam classification and writes 'model' to model.mat after training.

  • /prediction.m

    Reads an email (without headers) from 'input.txt' and predicts whether it is spam.

  • /processEmail.m

    Implements the following email preprocessing and normalization steps:

    lower-casing, stripping HTML, normalizing URLs, normalizing email addresses, normalizing numbers, normalizing dollar signs, word stemming, and removal of non-words.
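
The normalization steps listed for processEmail.m can be sketched roughly as follows. This is a Python approximation using regular expressions, not the project's Octave implementation. The placeholder tokens (`httpaddr`, `emailaddr`, `number`, `dollar`) and the function name `normalize_email` are assumptions made for illustration, and word stemming is left out because it needs a stemmer library.

```python
import re


def normalize_email(text):
    """Approximate the listed preprocessing steps (stemming omitted).

    Hypothetical helper; the placeholder tokens are illustrative choices.
    """
    text = text.lower()                                      # lower-casing
    text = re.sub(r"<[^<>]+>", " ", text)                    # strip HTML tags
    text = re.sub(r"(http|https)://\S*", "httpaddr", text)   # normalize URLs
    text = re.sub(r"\S+@\S+", "emailaddr", text)             # normalize email addresses
    text = re.sub(r"[0-9]+", "number", text)                 # normalize numbers
    text = re.sub(r"[$]+", "dollar", text)                   # normalize dollar signs
    # Split on anything that is not a letter or digit, dropping empty tokens.
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


tokens = normalize_email("<b>Buy</b> NOW for $100 at http://x.com or mail me@x.com!")
print(tokens)
# ['buy', 'now', 'for', 'dollarnumber', 'at', 'httpaddr', 'or', 'mail', 'emailaddr']
```

The order of the substitutions matters in this sketch: URLs and email addresses are replaced before numbers, so digits inside them do not produce stray `number` tokens.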

Tutorial

1. Start Octave and move to the folder that contains all the source files.

2. Type

training

in Octave to train an SVM.

3. Copy the email (without headers) that you want to test into 'input.txt'.

4. Type

prediction

in Octave to predict whether the email is spam.
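
Conceptually, prediction with a linear-kernel SVM comes down to checking the sign of a weighted sum of the feature vector. The Python sketch below shows that decision rule only. The weights `w`, the bias `b`, and the function name `predict_spam` are made-up values for illustration and do not reflect the contents of model.mat or how prediction.m is written.

```python
def predict_spam(x, w, b):
    """Linear decision rule: classify as spam when w . x + b >= 0.

    Hypothetical helper with illustrative parameters.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return score >= 0


# Made-up weights for a three-word vocabulary.
w = [1.5, -2.0, 0.5]
b = -1.0
print(predict_spam([1, 0, 1], w, b))  # True  (score = 1.0)
print(predict_spam([0, 1, 0], w, b))  # False (score = -3.0)
```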