The inspiration for this project came from my own laziness: I did not want to write the same code again and again, for example doing EDA in a notebook, changing a column name in a cell, re-running it, and so on. With tens of features, doing all of that by hand gets tedious fast. So I wanted a script that can perform some basic EDA, Feature Selection and Engineering, and Cross Validation, train a basic model, and generate all kinds of reports.
In short, I wanted an ML pipeline that can quickly tell me various things about my data, try basic Linear and Logistic Regression, and set a baseline for my initial performance.
Now I can just look at the various reports and test runs, gain insights about my data, and work on the use-case-specific parts instead of wasting time on the initial analysis like EDA.
The idea is not to solve the problem at hand entirely, but rather to get a quick sneak peek inside the data and see what is possible with it.
├── README.md
├── config.yaml
├── docs
│ ├── index.html
│ ├── search.js
│ ├── src
│ │ ├── categorical_features.html
│ │ ├── cross_validation.html
│ │ ├── dataset.html
│ │ ├── dispatcher.html
│ │ ├── eda.html
│ │ ├── engine.html
│ │ ├── feature_generator.html
│ │ ├── feature_selection.html
│ │ ├── loss.html
│ │ ├── matrices.html
│ │ ├── numerical_features.html
│ │ ├── pipeline.html
│ │ ├── predict.html
│ │ ├── train.html
│ │ └── utils.html
│ └── src.html
├── requirements.txt
├── run.sh
├── src
│ ├── __init__.py
│ ├── categorical_features.py
│ ├── cross_validation.py
│ ├── dataset.py
│ ├── dispatcher.py
│ ├── eda.py
│ ├── engine.py
│ ├── feature_generator.py
│ ├── feature_selection.py
│ ├── loss.py
│ ├── matrices.py
│ ├── numerical_features.py
│ ├── pipeline.py
│ ├── predict.py
│ ├── train.py
│ └── utils.py
- config.yaml --- all the configuration related to the pipeline is entered here in one place. It is explained in detail below.
- docs/src.html --- the documentation for all the source code. It contains the entire code along with the docstring for each function, describing what the function does, its arguments, and its return values.
- run.sh --- the script you actually execute to run the pipeline.
- src --- folder with all the code. The file names are pretty much self-explanatory, so I will skip them.
The config file should have the following key-value pairs:
input: {
train_file (str): path to the train CSV,
test_file (str): path to the test CSV,
target_cols (list of str): list of the target column names,
for example: ["target"]
output_path (str): path where you want to store all the reports and models generated by the pipeline
}
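As an illustration, the input block could look like this in config.yaml (the paths and the exact YAML layout here are placeholders, not taken from the repo):

```yaml
input:
  train_file: "input/train.csv"   # placeholder path to the training CSV
  test_file: "input/test.csv"     # placeholder path to the test CSV
  target_cols: ["target"]         # names of the target column(s)
  output_path: "output/"          # where reports and models get written
```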
feature_selection: {
categorical_features: {
enc_types (str):
"label" for Label Encoding
"ohe" for One Hot Encoding
"binary" for Binarization
handle_na (bool): True if you want the code to handle NaN values, else False.
num_best (int): number of best features to select if select_best is True.
},
numerical_features: {
currently empty
},
cols_to_drop (list): list of the columns that you are sure you want to drop from the dataset.
They can be both categorical and numerical columns.
run_tests (bool): True if you want to run and see the various results for the features, else False.
select_best (bool): True if you want the code to decide the best features for the given data, else False.
}
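A sketch of the feature_selection block under the same assumptions (the column name in cols_to_drop is hypothetical):

```yaml
feature_selection:
  categorical_features:
    enc_types: "label"        # "label", "ohe", or "binary"
    handle_na: True           # let the code handle NaN values
    num_best: 5               # only used when select_best is True
  numerical_features: {}      # currently empty
  cols_to_drop: ["id"]        # hypothetical column to drop
  run_tests: True             # generate the feature test reports
  select_best: False          # let the code pick the best features
```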
cross_validation: {
problem_type (str): supported values are
"binary_classification",
"multiclass_classification",
"single_column_regression",
"multi_column_regression",
"multilabel_classification"
multilabel_delimiter (str, optional): the character that separates your multilabels in the input data,
shuffle (bool, optional): whether you want to shuffle the input data,
num_folds (int, optional): number of folds you want to split the input data into,
random_state (int, optional): random state used when shuffling the data
}
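For example, a 5-fold setup for binary classification might be configured like this (all values are illustrative):

```yaml
cross_validation:
  problem_type: "binary_classification"
  multilabel_delimiter: ","   # only matters for multilabel_classification
  shuffle: True
  num_folds: 5
  random_state: 42
```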
training: {
model:
"logistic" for Logistic Regression
"linear" for Linear Regression
}
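For instance:

```yaml
training:
  model: "logistic"   # "logistic" or "linear"
```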
ml_flow: {
experiment_name (str): name of the experiment you want to log the results to,
experiment_exist (bool): whether the experiment already exists or not,
tracking_uri (str): "" currently set to empty,
run_name (str): run name under which you want to do the tracking
}
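An illustrative ml_flow block (the experiment and run names are hypothetical):

```yaml
ml_flow:
  experiment_name: "baseline"   # hypothetical experiment name
  experiment_exist: False       # set True if the experiment already exists
  tracking_uri: ""              # currently left empty
  run_name: "first_run"         # hypothetical run name
```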
- Step 1
pip3 install -r requirements.txt
- Step 2
Change the config file to suit your settings and use case
- Step 3
Run the following
./run.sh
- Step 4
Sit back and enjoy :)
This is the best quality video that GitHub allowed me to upload; I hope it helps...
pipeline_compressed.mov
The base code is taken from Abhishek Thakur, a Kaggle Grandmaster. The original code can be found in his YouTube tutorials, on his GitHub profile, and in his book.
But of course I have made some tweaks and additions to the original code as per my taste, and refined it to make it more robust and better suited to my purpose.
For references, you can check out the links below for his profiles. YouTube GitHub
Please keep in mind that there are a lot of things that still need to be done, and testing is needed on many more use cases. Unfortunately, the time I can give this project is limited, so the code may at times break, crash, or give unwanted results. I would appreciate it if you could raise an Issue when that happens.