Problem:

This breast cancer database was obtained from Dr. Wolberg’s office at the University of Wisconsin Hospitals, Madison. Each record contains values for morphological and pathological features of a tumor sample taken from a patient. The class column indicates whether the tumor has been characterized as benign or malignant.

Question 1: Build a classifier to identify patients with a benign or malignant tumor based on the tumor characteristics.

Question 2: As an oncologist, you would want to reduce your false positives as well as your false negatives.

a. Identify the number of false positives and false negatives.

b. Improve your classification model to reduce the number of patients who are predicted as having a benign tumor but actually have a malignant tumor (false negatives).
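As a rough illustration of what Question 2 asks for, the sketch below counts false positives and false negatives from a list of predictions, and shows how lowering the decision threshold on the predicted probability of malignancy trades false negatives for false positives. It assumes the label coding of the UCI file (2 = benign, 4 = malignant); the helper names `confusion_counts` and `predict_with_threshold`, the labels, and the probabilities are all hypothetical and are not taken from the notebooks.

```python
def confusion_counts(y_true, y_pred, positive=4):
    """Return (tn, fp, fn, tp), treating `positive` as the malignant label."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1  # malignant tumor predicted as benign
        elif pred == positive:
            fp += 1  # benign tumor predicted as malignant
        else:
            tn += 1
    return tn, fp, fn, tp


def predict_with_threshold(malignant_probs, threshold=0.5, benign=2, malignant=4):
    """Label a sample malignant when its predicted probability reaches the threshold."""
    return [malignant if p >= threshold else benign for p in malignant_probs]


# Hypothetical true labels and hypothetical model probabilities of malignancy.
y_true = [2, 2, 4, 4, 4, 2]
probs = [0.10, 0.60, 0.90, 0.40, 0.35, 0.20]

default_counts = confusion_counts(y_true, predict_with_threshold(probs, 0.5))
lowered_counts = confusion_counts(y_true, predict_with_threshold(probs, 0.3))
print("threshold 0.5 (tn, fp, fn, tp):", default_counts)
print("threshold 0.3 (tn, fp, fn, tp):", lowered_counts)
```

On this toy input, lowering the threshold removes the false negatives without adding false positives. On real data, a lower threshold usually increases false positives, so the threshold should be chosen on validation data.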

BreastCancerWisconsinDataSet

Analysis and machine learning algorithms for the Breast Cancer Wisconsin (Original) dataset: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

Data Preprocessing

The most important step for this dataset is data preprocessing, because the "Bare Nuclei" column contains "?" values. These are replaced by 0. A Random Forest classifier is then used to view the feature importances; Sample Code Number and Mitoses can be considered unimportant.
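A minimal sketch of this preprocessing step with pandas follows. The rows are invented and the column names are assumptions about how the CSV is labeled; the actual file in this repository may use different headers. Reading "?" as missing and then filling with 0 is one way to get the same result as replacing "?" with 0 directly.

```python
import io

import pandas as pd

# Made-up rows in the general layout of the UCI file; these are not real records.
raw = io.StringIO(
    "Sample code number,Clump Thickness,Bare Nuclei,Mitoses,Class\n"
    "1,5,1,1,2\n"
    "2,8,?,1,4\n"
    "3,3,10,2,4\n"
)

# na_values="?" makes pandas read the placeholder as NaN instead of a string.
df = pd.read_csv(raw, na_values="?")

# Replace the missing "Bare Nuclei" values with 0 and restore an integer dtype.
df["Bare Nuclei"] = df["Bare Nuclei"].fillna(0).astype(int)

# Drop the identifier and the feature judged unimportant, and split off the label.
features = df.drop(columns=["Sample code number", "Mitoses", "Class"])
labels = df["Class"]
print(features)
```

Replacing missing values with 0 puts them outside the 1 to 10 range of the other values, so a model sees them as a separate category. Imputing the median would be an alternative worth comparing.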

Algorithm

We can use Random Forest on the processed data. First, use GridSearchCV to find the best parameters for the Random Forest classifier. With the best parameters, running cross_val_score with cv=10 gives a score of around 96%. Then, using train_test_split, we can compute the confusion matrix on a held-out test set:

True Negatives: 133 ± 2
False Positives: 5 ± 2
True Positives: 68 ± 2
False Negatives: 4 ± 2
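The workflow described above can be sketched with scikit-learn as follows. To keep the example self-contained it uses synthetic data from `make_classification` as a stand-in for the processed dataset, so its numbers will not match the ones reported here. The parameter grid and the 30% test split are assumptions, not the settings from the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the processed data (699 rows, 9 features, about 35% positive).
X, y = make_classification(
    n_samples=699, n_features=9, weights=[0.65, 0.35], random_state=0
)

# Small illustrative grid (an assumption; the real grid is not stated in this README).
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

# Re-create the classifier with the best parameters and score it with 10-fold CV.
best_rf = RandomForestClassifier(random_state=0, **search.best_params_)
scores = cross_val_score(best_rf, X, y, cv=10)
print("mean 10-fold accuracy:", scores.mean())

# Hold out a test set and compute the confusion matrix on it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
best_rf.fit(X_train, y_train)
tn, fp, fn, tp = confusion_matrix(y_test, best_rf.predict(X_test)).ravel()
print("TN", tn, "FP", fp, "TP", tp, "FN", fn)
```

Strictly speaking, tuning on all of the data and then cross-validating on the same data gives an optimistic score. Running the grid search on the training split only is the safer variant.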

More Analysis

Tree-based algorithms can have a higher rate of false negatives because they may not handle imbalanced datasets well, so we try the Naïve Bayes algorithm (Gaussian NB). This improves our confusion matrix to:

True Negatives: 132 ± 2
False Positives: 6 ± 2
True Positives: 69 ± 1
False Negatives: 2 ± 1
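A comparable sketch for the Naïve Bayes variant is shown below, again on synthetic stand-in data rather than the real dataset. The split size and random seeds are assumptions. The recall line makes the false negative rate explicit, since that is the quantity this step tries to improve.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the processed data; results will differ from the README's.
X, y = make_classification(
    n_samples=699, n_features=9, weights=[0.65, 0.35], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# GaussianNB has no hyperparameters that need tuning for a first comparison.
nb = GaussianNB().fit(X_train, y_train)
tn, fp, fn, tp = confusion_matrix(y_test, nb.predict(X_test)).ravel()

# Recall on the positive class: the share of positive cases that were caught.
recall = tp / (tp + fn)
print("TN", tn, "FP", fp, "TP", tp, "FN", fn, "recall", round(recall, 3))
```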

Further Work

We can try to improve accuracy further by using other algorithms, such as Support Vector Machines, together with better feature selection. With more time spent on this dataset, an accuracy of 98% with very few false negatives may be achievable.
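One possible starting point for the Support Vector Machine experiment is sketched below. Nothing here comes from the notebooks: the RBF kernel, the scaling step, `class_weight="balanced"`, and scoring by recall are all assumptions chosen to keep the focus on false negatives, and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the processed data.
X, y = make_classification(
    n_samples=699, n_features=9, weights=[0.65, 0.35], random_state=0
)

# Scale features inside the pipeline so each CV fold is scaled on its own training part.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))

# Score by recall on the positive class, since missed positives are the main concern.
recall_scores = cross_val_score(svm, X, y, cv=10, scoring="recall")
print("mean 10-fold recall:", recall_scores.mean())
```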


aryankhandal0/BreastCancerWisconsinDataSet
