
LightGBM with smarter features


Tulip 🌷

🌷 code

We continue working with a single model, LightGBM. Our primary focus is on feature engineering. Thanks to this approach we obtained significant gains on both local CV and the LB 🏆.

Validation

  • 5-fold stratified K-fold (5 is a parameter in the configuration file: neptune.yaml#L38); a minimal sketch is shown below
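
For reference, here is a minimal sketch of that validation scheme using scikit-learn's StratifiedKFold; X, y are placeholders and the shuffle/seed settings are illustrative rather than taken from the config:

    from sklearn.model_selection import StratifiedKFold

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)
    for fold, (train_idx, valid_idx) in enumerate(cv.split(X, y)):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_valid, y_valid = X.iloc[valid_idx], y.iloc[valid_idx]
        # fit LightGBM on this fold and collect out-of-fold predictions here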

Preprocessing

Application data

        # Replace the placeholder codes used in the application table with NaN
        X['CODE_GENDER'].replace('XNA', np.nan, inplace=True)
        X['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)
        X['NAME_FAMILY_STATUS'].replace('Unknown', np.nan, inplace=True)
        X['ORGANIZATION_TYPE'].replace('XNA', np.nan, inplace=True)

Bureau data

        # Fill missing bureau amounts and counts with the transformer's configured fill value
        X['AMT_CREDIT_SUM'].fillna(self.fill_value, inplace=True)
        X['AMT_CREDIT_SUM_DEBT'].fillna(self.fill_value, inplace=True)
        X['AMT_CREDIT_SUM_OVERDUE'].fillna(self.fill_value, inplace=True)
        X['CNT_CREDIT_PROLONG'].fillna(self.fill_value, inplace=True)

Previous Application data

        # 365243 is the dataset's placeholder for an unknown date, so treat it as missing
        X['DAYS_FIRST_DRAWING'].replace(365243, np.nan, inplace=True)
        X['DAYS_FIRST_DUE'].replace(365243, np.nan, inplace=True)
        X['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace=True)
        X['DAYS_LAST_DUE'].replace(365243, np.nan, inplace=True)
        X['DAYS_TERMINATION'].replace(365243, np.nan, inplace=True)

Feature Extraction

Application data -> eda-application.ipynb 📝

  • Diff aggregated features: take the difference and the absolute difference between each feature value and the mean of that feature within a groupby group (see the sketch after the importance listing below).
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_3_diff                                           0.180676
OCCUPATION_TYPE_mean_EXT_SOURCE_3_diff                                                               0.180197
OCCUPATION_TYPE_mean_EXT_SOURCE_2_diff                                                               0.162917
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_2_diff                                           0.159719
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_2_diff        0.157282
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_2_diff                                               0.153519
CODE_GENDER_ORGANIZATION_TYPE_mean_EXT_SOURCE_1_diff                                                 0.145575
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_1_diff                                               0.142056
OCCUPATION_TYPE_mean_EXT_SOURCE_1_diff                                                               0.141020
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_1_diff                                           0.137948
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1_diff                    0.135217
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1_diff        0.133843
...
OCCUPATION_TYPE_mean_DAYS_BIRTH_abs_diff                                                             0.003032
OCCUPATION_TYPE_mean_DAYS_ID_PUBLISH_abs_diff                                                        0.002765
CODE_GENDER_ORGANIZATION_TYPE_mean_AMT_INCOME_TOTAL_abs_diff                                         0.002290
OCCUPATION_TYPE_mean_CNT_FAM_MEMBERS_diff                                                            0.001500
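
A minimal sketch of how such diff features can be computed with pandas; the helper name is illustrative, not the exact function used in the pipeline:

    import pandas as pd

    def add_groupby_diff_features(X, groupby_cols, feature):
        # mean of `feature` within each group, broadcast back to the rows
        group_mean = X.groupby(groupby_cols)[feature].transform('mean')
        prefix = '_'.join(groupby_cols)
        X['{}_mean_{}_diff'.format(prefix, feature)] = X[feature] - group_mean
        X['{}_mean_{}_abs_diff'.format(prefix, feature)] = (X[feature] - group_mean).abs()
        return X

    # e.g. the top feature from the listing above:
    # X = add_groupby_diff_features(X, ['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE'], 'EXT_SOURCE_3')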

Installment Payments data -> eda-installments.ipynb 📝

  • Hand-crafted features (a sketch of how they are derived follows the listing):
instalment_paid_late_mean                 0.028205
last_instalment_paid_late_mean            0.026989
instalment_paid_late_sum                  0.024870
instalment_paid_late_in_days_min          0.014083
last_instalment_paid_late_in_days_min     0.012402
last_instalment_paid_late_in_days_mean    0.011999
instalment_paid_late_in_days_mean         0.011797
instalment_paid_late_in_days_sum          0.011590
last_instalment_paid_late_in_days_sum     0.011505
last_instalment_paid_late_in_days_max     0.011501
last_instalment_paid_over_amount_std      0.010351
instalment_paid_over_amount_sum           0.009034
instalment_paid_late_in_days_max          0.008772
instalment_paid_late_in_days_std          0.008682
instalment_paid_over_sum                  0.008642
last_instalment_paid_over_mean            0.008530
last_instalment_paid_over_amount_sum      0.008483
last_instalment_paid_over_amount_max      0.008421
instalment_paid_over_amount_std           0.008273
instalment_paid_over_amount_max           0.008270
last_instalment_paid_over_amount_mean     0.008213
instalment_paid_over_amount_mean          0.007711
last_instalment_paid_over_amount_min      0.007622
instalment_paid_over_mean                 0.007265
instalment_paid_over_amount_min           0.006588
last_instalment_paid_over_count           0.004973
last_instalment_paid_late_count           0.004973
last_instalment_paid_late_in_days_std     0.002640
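
A minimal sketch of how the paid-late / paid-over quantities can be derived from installments_payments.csv; column names follow the Home Credit schema, but the aggregation shown is illustrative rather than the exact pipeline code:

    import pandas as pd

    installments = pd.read_csv('installments_payments.csv')

    # positive values: the instalment was paid after the due date
    installments['instalment_paid_late_in_days'] = (
        installments['DAYS_ENTRY_PAYMENT'] - installments['DAYS_INSTALMENT'])
    installments['instalment_paid_late'] = (
        installments['instalment_paid_late_in_days'] > 0).astype(int)

    # positive values: more was paid than the prescribed instalment amount
    installments['instalment_paid_over_amount'] = (
        installments['AMT_PAYMENT'] - installments['AMT_INSTALMENT'])
    installments['instalment_paid_over'] = (
        installments['instalment_paid_over_amount'] > 0).astype(int)

    # aggregate per applicant; the last_instalment_* variants apply the same
    # aggregations restricted to each applicant's most recent credit
    features = installments.groupby('SK_ID_CURR').agg({
        'instalment_paid_late': ['mean', 'sum'],
        'instalment_paid_late_in_days': ['mean', 'min', 'max', 'sum', 'std'],
        'instalment_paid_over_amount': ['mean', 'min', 'max', 'sum', 'std'],
    })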

Previous application data -> eda-previous_application.ipynb 📝

  • Added nan_count (a count of missing values per previous application record); importances of the previous-application features are listed below, with a sketch after the list.
previous_application_prev_was_refused                           0.056848
previous_application_days_first_drawing_last_5_credits_mean     0.050464
previous_application_days_first_drawing_last_3_credits_mean     0.048791
previous_application_days_decision_about_last_5_credits_mean    0.042162
previous_application_days_first_drawing_last_1_credits_mean     0.038439
previous_application_prev_was_approved                          0.036611
previous_application_days_decision_about_last_3_credits_mean    0.033704
previous_application_term_of_last_5_credits_mean                0.027091
previous_application_term_of_last_3_credits_mean                0.021879
previous_application_number_of_prev_application                 0.019762
previous_application_days_decision_about_last_1_credits_mean    0.016399
previous_application_term_of_last_1_credits_mean                0.013643
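
A minimal sketch of the nan_count feature and the last-k-credits aggregates, assuming the Home Credit previous_application schema; the helper is illustrative rather than the exact pipeline code:

    import pandas as pd

    prev = pd.read_csv('previous_application.csv')

    # number of missing values in each previous application record
    prev['nan_count'] = prev.isnull().sum(axis=1)
    prev['prev_was_refused'] = (prev['NAME_CONTRACT_STATUS'] == 'Refused').astype(int)
    prev['prev_was_approved'] = (prev['NAME_CONTRACT_STATUS'] == 'Approved').astype(int)

    def last_k_credits_mean(df, column, k):
        # DAYS_DECISION is negative, so values closer to zero are more recent
        recent = (df.sort_values('DAYS_DECISION', ascending=False)
                    .groupby('SK_ID_CURR')
                    .head(k))
        return (recent.groupby('SK_ID_CURR')[column]
                      .mean()
                      .rename('previous_application_{}_last_{}_credits_mean'.format(column, k)))

    # roughly corresponds to previous_application_days_decision_about_last_5_credits_mean
    days_decision_last_5 = last_k_credits_mean(prev, 'DAYS_DECISION', 5)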

Model

  • We continue working with a single LightGBM model, implemented here: models.py#L80 💻
  • The results for the new set of features are simply great:
    • CV 0.7905 🎆
    • LB 0.801 🎉
  • We trained the model with the following hyper-parameters (check the config file 📒):
  lgbm__boosting_type: gbdt
  lgbm__objective: binary
  lgbm__metric: auc
  lgbm__number_boosting_rounds: 5000
  lgbm__early_stopping_rounds: 100
  lgbm__learning_rate: 0.1
  lgbm__max_bin: 300
  lgbm__max_depth: -1
  lgbm__num_leaves: 35
  lgbm__min_child_samples: 50
  lgbm__subsample: 1.0
  lgbm__subsample_freq: 1
  lgbm__colsample_bytree: 0.2
  lgbm__min_gain_to_split: 0.5
  lgbm__reg_lambda: 100.0
  lgbm__reg_alpha: 0.0
  lgbm__scale_pos_weight: 1
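
A minimal sketch of how these settings map onto LightGBM's Python API; the lgbm__ prefix is the project's config namespace, and X_train, y_train, X_valid, y_valid are placeholders:

    import lightgbm as lgb

    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.1,
        'max_bin': 300,
        'max_depth': -1,
        'num_leaves': 35,
        'min_child_samples': 50,
        'subsample': 1.0,
        'subsample_freq': 1,
        'colsample_bytree': 0.2,
        'min_gain_to_split': 0.5,
        'reg_lambda': 100.0,
        'reg_alpha': 0.0,
        'scale_pos_weight': 1,
    }

    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

    model = lgb.train(
        params,
        train_set,
        num_boost_round=5000,                                  # lgbm__number_boosting_rounds
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=100)],   # lgbm__early_stopping_rounds
    )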

Pipeline diagram

Since the diagram below is quite wide (it uses multiple input files), here is a link to the larger version.

[Pipeline diagram: HC-solution-4]