mlTemplate

The inspiration for this code came from being too lazy to write the same code again and again: for example, doing EDA in a notebook, changing a column name in a cell, re-running it, and so on. With tens of features, doing everything by hand quickly becomes tedious. So I wanted a script that can perform basic EDA, Feature Selection and Engineering, and Cross Validation, train a basic model, and generate all kinds of reports.

In short, I wanted an ML pipeline that can quickly tell me various things about my data, try basic Linear and Logistic Regression, and set a baseline for my initial performance.

Now I can just look at the various reports and test runs, gain insights about my data, and work on the use-case-specific parts instead of wasting time on the initial analysis like EDA.

The idea is not to solve the problem at hand entirely, but rather to get a quick sneak peek inside the data and see what is possible with it.

Folder Structure

├── README.md
├── config.yaml
├── docs
│   ├── index.html
│   ├── search.js
│   ├── src
│   │   ├── categorical_features.html
│   │   ├── cross_validation.html
│   │   ├── dataset.html
│   │   ├── dispatcher.html
│   │   ├── eda.html
│   │   ├── engine.html
│   │   ├── feature_generator.html
│   │   ├── feature_selection.html
│   │   ├── loss.html
│   │   ├── matrices.html
│   │   ├── numerical_features.html
│   │   ├── pipeline.html
│   │   ├── predict.html
│   │   ├── train.html
│   │   └── utils.html
│   └── src.html
├── requirements.txt
├── run.sh
├── src
│   ├── __init__.py
│   ├── categorical_features.py
│   ├── cross_validation.py
│   ├── dataset.py
│   ├── dispatcher.py
│   ├── eda.py
│   ├── engine.py
│   ├── feature_generator.py
│   ├── feature_selection.py
│   ├── loss.py
│   ├── matrices.py
│   ├── numerical_features.py
│   ├── pipeline.py
│   ├── predict.py
│   ├── train.py
│   └── utils.py

Important Files and Folders

  • config.yaml --- all configuration for the pipeline is entered here in one place. Explained in detail below.
  • docs/src.html --- documentation for all the source code. It contains the entire code along with the docstring for each function, which describes what that function does, its arguments, and its return values.
  • run.sh --- the file you actually need to run to launch the pipeline.
  • src --- folder with all the code. The file names are pretty much self-explanatory, so I will skip them.

Config.yaml

The config file should have the following key-value pairs:

input: {
        train_file (str): path to the train csv,
        test_file (str): path to the test csv,
        target_cols (list of str): list of the target columns,
                                   for example: ["target"]
        output_path (str): path where you want to store all the reports and models generated by the pipeline
        }
feature_selection: {
                    categorical_features: {
                                        enc_types (string):
                                        "label" for Label Encoding
                                        "ohe" for One Hot Encoding
                                        "binary" for Binarization
                                        handle_na (bool): True if you want the code to handle NaN values, else False.
                                        num_best (int): number of best features to select if select_best is True.
                                        },
                    numerical_features: {
                                          currently EMPTY
                                    },

                    cols_to_drop (list): list of the columns that you are sure you want to drop from the dataset.
                                         They can be both categorical and numerical columns.
                    run_tests (bool): True if you want to run and see the various results for the features, else False.
                    select_best (bool): True if you want the code to decide the best features of the given data, else False.

                }
cross_validation: {
                problem_type (string): supported values are
                "binary_classification",
                "multiclass_classification",
                "single_column_regression",
                "multi_column_regression",
                "multilabel_classification"

                multilabel_delimiter (string, optional): the character that separates your multilabels in the input data,
                shuffle (bool, optional): whether to shuffle the input data,
                num_folds (int, optional): number of folds to split the input data into,
                random_state (int, optional): random state used to shuffle the data
                }
training: {
                    model:
                    "logistic" for Logistic Regression
                    "linear" for Linear Regression
                }
ml_flow: {
        experiment_name (str): name of the experiment you want to log the results to,
        experiment_exist (bool): whether the experiment already exists,
        tracking_uri (str): "" currently set to empty,
        run_name (str): run name under which you want to do the tracking
}
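Put together, a filled-in config.yaml might look like the sketch below. All paths and values here are placeholders chosen for illustration, and the exact layout in the repo may differ slightly:

```yaml
# Hypothetical example values; adjust paths and options to your dataset.
input:
  train_file: input/train.csv      # placeholder path
  test_file: input/test.csv        # placeholder path
  target_cols: ["target"]
  output_path: output/

feature_selection:
  categorical_features:
    enc_types: label               # or "ohe" / "binary"
    handle_na: True
    num_best: 10
  numerical_features: {}
  cols_to_drop: ["id"]
  run_tests: True
  select_best: False

cross_validation:
  problem_type: binary_classification
  shuffle: True
  num_folds: 5
  random_state: 42

training:
  model: logistic

ml_flow:
  experiment_name: baseline
  experiment_exist: False
  tracking_uri: ""
  run_name: first_run
```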

How To Run

  • Step 1

Installation

pip3 install -r requirements.txt
  • Step 2

Change the config file to suit your setting and use case

  • Step 3

Run the following

./run.sh
  • Step 4

Sit back and enjoy :)

Example Video

This is the best video that GitHub allowed me to upload; I hope it helps...

pipeline_compressed.mov

Full disclosure

The base code is taken from Abhishek Thakur, a Kaggle Grandmaster. It can be found in his YouTube tutorials, on his GitHub profile, and in his book.

Of course, I have made some tweaks and additions to the original code to my taste, and have refined it to be more robust and better suited to my purpose.

For references, you can check out the links below for his profiles: YouTube, GitHub

Collaboration

Please keep in mind that there are a lot of things that still need to be done, and testing is needed on many more use cases. Unfortunately, my time for this project is limited, so at times the code might break, crash, or give unwanted results. I would appreciate it if you could raise an Issue when that happens.

I am also currently looking for collaborators to work on this together, so if anyone is interested, DM me.

If you like the repo, I would appreciate it if you could leave a Star :)
