This repository has been archived by the owner on Jun 22, 2022. It is now read-only.
LightGBM with smarter features
Kamil A. Kaczmarek edited this page Jul 11, 2018 · 10 revisions
We continue working with a single model: LightGBM. Our primary focus is feature engineering. Thanks to this approach we obtained significant gains on both local CV and the leaderboard (LB).
- 5-fold stratified K-fold (`5` is a parameter in the configuration file: neptune.yaml#L38)
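As a rough illustration of the validation scheme, here is a minimal sketch of stratified K-fold splitting with scikit-learn. The toy target, `shuffle=True` and `random_state=0` are assumptions made for this example only; the actual settings live in the project configuration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 90 negatives, 10 positives (illustrative data only).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

n_splits = 5  # the wiki says this value is read from neptune.yaml
# shuffle and random_state are assumptions for this sketch.
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)

fold_positive_rates = []
for train_idx, valid_idx in skf.split(X, y):
    # Stratification keeps the share of positives roughly equal in every fold.
    fold_positive_rates.append(float(y[valid_idx].mean()))
```

Stratification matters for an imbalanced binary target, because every validation fold then sees about the same share of positives as the full training set.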
X['CODE_GENDER'].replace('XNA', np.nan, inplace=True)
X['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)
X['NAME_FAMILY_STATUS'].replace('Unknown', np.nan, inplace=True)
X['ORGANIZATION_TYPE'].replace('XNA', np.nan, inplace=True)
X['AMT_CREDIT_SUM'].fillna(self.fill_value, inplace=True)
X['AMT_CREDIT_SUM_DEBT'].fillna(self.fill_value, inplace=True)
X['AMT_CREDIT_SUM_OVERDUE'].fillna(self.fill_value, inplace=True)
X['CNT_CREDIT_PROLONG'].fillna(self.fill_value, inplace=True)
X['DAYS_FIRST_DRAWING'].replace(365243, np.nan, inplace=True)
X['DAYS_FIRST_DUE'].replace(365243, np.nan, inplace=True)
X['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace=True)
X['DAYS_LAST_DUE'].replace(365243, np.nan, inplace=True)
X['DAYS_TERMINATION'].replace(365243, np.nan, inplace=True)
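The cleaning lines above replace sentinel values (`'XNA'`, `'Unknown'`, `365243`) with `NaN`. In recent pandas versions, calling `replace(..., inplace=True)` on a selected column can be flagged as chained assignment and may leave the parent frame unchanged, so the sketch below uses plain assignment instead. The helper name `clean_sentinels` and the toy frame are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd


def clean_sentinels(X, columns, sentinel):
    """Return a copy of X where `sentinel` in the given columns becomes NaN."""
    X = X.copy()
    for col in columns:
        # Assign the result back instead of relying on inplace=True.
        X[col] = X[col].replace(sentinel, np.nan)
    return X


# Toy application table (illustrative values only).
application = pd.DataFrame({
    'CODE_GENDER': ['M', 'XNA', 'F'],
    'DAYS_EMPLOYED': [-1200, 365243, -300],
})
application = clean_sentinels(application, ['CODE_GENDER'], 'XNA')
application = clean_sentinels(application, ['DAYS_EMPLOYED'], 365243)
```

LightGBM handles `NaN` natively, which is presumably why the sentinels are mapped to missing values rather than imputed.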
Application data -> eda-application.ipynb
- `Diff` aggregated features: simply take the `diff` and the `absolute diff` between the feature value and its groupby aggregate.
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_3_diff 0.180676
OCCUPATION_TYPE_mean_EXT_SOURCE_3_diff 0.180197
OCCUPATION_TYPE_mean_EXT_SOURCE_2_diff 0.162917
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_2_diff 0.159719
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_2_diff 0.157282
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_2_diff 0.153519
CODE_GENDER_ORGANIZATION_TYPE_mean_EXT_SOURCE_1_diff 0.145575
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_1_diff 0.142056
OCCUPATION_TYPE_mean_EXT_SOURCE_1_diff 0.141020
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_1_diff 0.137948
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1_diff 0.135217
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1_diff 0.133843
...
OCCUPATION_TYPE_mean_DAYS_BIRTH_abs_diff 0.003032
OCCUPATION_TYPE_mean_DAYS_ID_PUBLISH_abs_diff 0.002765
CODE_GENDER_ORGANIZATION_TYPE_mean_AMT_INCOME_TOTAL_abs_diff 0.002290
OCCUPATION_TYPE_mean_CNT_FAM_MEMBERS_diff 0.001500
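The diff features listed above can be sketched roughly as follows. This is a minimal example, assuming the diff is computed as the feature value minus the group aggregate (the sign convention is not stated on this page); the helper name `add_group_diff_features` and the toy data are hypothetical.

```python
import pandas as pd


def add_group_diff_features(df, groupby_cols, feature, agg='mean'):
    """Add <group>_<agg>_<feature>_diff and _abs_diff columns to a copy of df."""
    df = df.copy()
    agg_name = '{}_{}_{}'.format('_'.join(groupby_cols), agg, feature)
    # transform() broadcasts the group aggregate back to every row of the group.
    group_value = df.groupby(groupby_cols)[feature].transform(agg)
    # Assumption: diff = feature value minus group aggregate.
    df[agg_name + '_diff'] = df[feature] - group_value
    df[agg_name + '_abs_diff'] = (df[feature] - group_value).abs()
    return df


# Toy data (illustrative values only).
toy = pd.DataFrame({
    'OCCUPATION_TYPE': ['Drivers', 'Drivers', 'Managers', 'Managers'],
    'EXT_SOURCE_3': [0.2, 0.6, 0.5, 0.7],
})
toy = add_group_diff_features(toy, ['OCCUPATION_TYPE'], 'EXT_SOURCE_3')
```

Passing several columns in `groupby_cols` yields the longer names seen in the list, for example `NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_3_diff`.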
Installment Payments data -> eda-installments.ipynb
- Hand crafted features:
instalment_paid_late_mean 0.028205
last_instalment_paid_late_mean 0.026989
instalment_paid_late_sum 0.024870
instalment_paid_late_in_days_min 0.014083
last_instalment_paid_late_in_days_min 0.012402
last_instalment_paid_late_in_days_mean 0.011999
instalment_paid_late_in_days_mean 0.011797
instalment_paid_late_in_days_sum 0.011590
last_instalment_paid_late_in_days_sum 0.011505
last_instalment_paid_late_in_days_max 0.011501
last_instalment_paid_over_amount_std 0.010351
instalment_paid_over_amount_sum 0.009034
instalment_paid_late_in_days_max 0.008772
instalment_paid_late_in_days_std 0.008682
instalment_paid_over_sum 0.008642
last_instalment_paid_over_mean 0.008530
last_instalment_paid_over_amount_sum 0.008483
last_instalment_paid_over_amount_max 0.008421
instalment_paid_over_amount_std 0.008273
instalment_paid_over_amount_max 0.008270
last_instalment_paid_over_amount_mean 0.008213
instalment_paid_over_amount_mean 0.007711
last_instalment_paid_over_amount_min 0.007622
instalment_paid_over_mean 0.007265
instalment_paid_over_amount_min 0.006588
last_instalment_paid_over_count 0.004973
last_instalment_paid_late_count 0.004973
last_instalment_paid_late_in_days_std 0.002640
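A minimal sketch of how features of this kind might be built, assuming "paid late" is derived from `DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT` and "paid over" from `AMT_PAYMENT - AMT_INSTALMENT`. Both definitions are assumptions, and the `last_` variants (presumably restricted to the most recent installments) are not covered here. The toy table is illustrative only.

```python
import pandas as pd

# Toy installments table using column names from the competition data.
installments = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2, 2],
    'DAYS_INSTALMENT': [-90, -60, -30, -45, -15],
    'DAYS_ENTRY_PAYMENT': [-92, -55, -30, -45, -10],
    'AMT_INSTALMENT': [100.0, 100.0, 100.0, 50.0, 50.0],
    'AMT_PAYMENT': [100.0, 120.0, 100.0, 50.0, 40.0],
})

# Assumed definitions: positive values mean "late" or "paid more than due".
installments['instalment_paid_late_in_days'] = (
    installments['DAYS_ENTRY_PAYMENT'] - installments['DAYS_INSTALMENT'])
installments['instalment_paid_late'] = (
    installments['instalment_paid_late_in_days'] > 0).astype(int)
installments['instalment_paid_over_amount'] = (
    installments['AMT_PAYMENT'] - installments['AMT_INSTALMENT'])
installments['instalment_paid_over'] = (
    installments['instalment_paid_over_amount'] > 0).astype(int)

aggregations = {
    'instalment_paid_late_in_days': ['min', 'max', 'mean', 'sum', 'std'],
    'instalment_paid_late': ['mean', 'sum'],
    'instalment_paid_over_amount': ['min', 'max', 'mean', 'sum', 'std'],
    'instalment_paid_over': ['mean', 'sum'],
}
features = installments.groupby('SK_ID_CURR').agg(aggregations)
# Flatten the (column, statistic) MultiIndex into names like
# instalment_paid_late_mean.
features.columns = ['_'.join(col) for col in features.columns]
features = features.reset_index()
```

The resulting per-client table can then be merged into the application data on `SK_ID_CURR`.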
Previous application data -> eda-previous_application.ipynb
- added `nan_count`
previous_application_prev_was_refused 0.056848
previous_application_days_first_drawing_last_5_credits_mean 0.050464
previous_application_days_first_drawing_last_3_credits_mean 0.048791
previous_application_days_decision_about_last_5_credits_mean 0.042162
previous_application_days_first_drawing_last_1_credits_mean 0.038439
previous_application_prev_was_approved 0.036611
previous_application_days_decision_about_last_3_credits_mean 0.033704
previous_application_term_of_last_5_credits_mean 0.027091
previous_application_term_of_last_3_credits_mean 0.021879
previous_application_number_of_prev_application 0.019762
previous_application_days_decision_about_last_1_credits_mean 0.016399
previous_application_term_of_last_1_credits_mean 0.013643
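The sketch below shows one plausible way to derive a few of these features. It assumes that "prev was approved/refused" refers to the status of the most recent previous application, that recency is ordered by `DAYS_DECISION`, and that "term" means `CNT_PAYMENT`; none of these definitions is confirmed on this page. The toy data is illustrative only.

```python
import pandas as pd

# Toy previous_application table using column names from the competition data.
prev = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2],
    'NAME_CONTRACT_STATUS': ['Approved', 'Refused', 'Approved', 'Refused'],
    'DAYS_DECISION': [-900, -400, -100, -50],
    'CNT_PAYMENT': [12.0, 24.0, 6.0, 36.0],
})

# DAYS_DECISION is negative; values closer to zero are more recent,
# so an ascending sort puts the latest application last within each client.
prev = prev.sort_values(['SK_ID_CURR', 'DAYS_DECISION'])
prev['prev_was_approved'] = (prev['NAME_CONTRACT_STATUS'] == 'Approved').astype(int)
prev['prev_was_refused'] = (prev['NAME_CONTRACT_STATUS'] == 'Refused').astype(int)

grouped = prev.groupby('SK_ID_CURR')
features = pd.DataFrame({
    'previous_application_number_of_prev_application': grouped.size(),
    'previous_application_prev_was_approved': grouped['prev_was_approved'].last(),
    'previous_application_prev_was_refused': grouped['prev_was_refused'].last(),
})

for k in (1, 3, 5):
    # tail(k) keeps the k most recent applications per client.
    last_k = grouped.tail(k)
    name = 'previous_application_term_of_last_{}_credits_mean'.format(k)
    features[name] = last_k.groupby('SK_ID_CURR')['CNT_PAYMENT'].mean()

features = features.reset_index()
```

When a client has fewer than `k` previous applications, `tail(k)` simply returns all of them, so the last-3 and last-5 means coincide for short histories.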
- We continue working with a single LightGBM model, implemented here: models.py#L80
- Results for the new set of features are simply great:
  - CV 0.7905
  - LB 0.801
- We trained the model with the following hyper-parameters (check the config file):
lgbm__boosting_type: gbdt
lgbm__objective: binary
lgbm__metric: auc
lgbm__number_boosting_rounds: 5000
lgbm__early_stopping_rounds: 100
lgbm__learning_rate: 0.1
lgbm__max_bin: 300
lgbm__max_depth: -1
lgbm__num_leaves: 35
lgbm__min_child_samples: 50
lgbm__subsample: 1.0
lgbm__subsample_freq: 1
lgbm__colsample_bytree: 0.2
lgbm__min_gain_to_split: 0.5
lgbm__reg_lambda: 100.0
lgbm__reg_alpha: 0.0
lgbm__scale_pos_weight: 1
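For readers who want to try these settings outside the project pipeline, here is a minimal sketch that maps the prefixed config keys onto the native LightGBM training API. The helpers `split_lgbm_config` and `train` are hypothetical, and the sketch assumes a recent LightGBM version where early stopping is passed as a callback; it is not the project's own training code.

```python
config = {
    'lgbm__boosting_type': 'gbdt',
    'lgbm__objective': 'binary',
    'lgbm__metric': 'auc',
    'lgbm__number_boosting_rounds': 5000,
    'lgbm__early_stopping_rounds': 100,
    'lgbm__learning_rate': 0.1,
    'lgbm__max_bin': 300,
    'lgbm__max_depth': -1,
    'lgbm__num_leaves': 35,
    'lgbm__min_child_samples': 50,
    'lgbm__subsample': 1.0,
    'lgbm__subsample_freq': 1,
    'lgbm__colsample_bytree': 0.2,
    'lgbm__min_gain_to_split': 0.5,
    'lgbm__reg_lambda': 100.0,
    'lgbm__reg_alpha': 0.0,
    'lgbm__scale_pos_weight': 1,
}


def split_lgbm_config(config, prefix='lgbm__'):
    """Strip the prefix and separate training-loop settings from model params."""
    params = {key[len(prefix):]: value
              for key, value in config.items() if key.startswith(prefix)}
    num_boost_round = params.pop('number_boosting_rounds')
    early_stopping_rounds = params.pop('early_stopping_rounds')
    return params, num_boost_round, early_stopping_rounds


def train(X_train, y_train, X_valid, y_valid, config):
    """Train one booster on a single train/validation split (hypothetical helper)."""
    import lightgbm as lgb  # imported lazily so the config helper works without it

    params, num_boost_round, early_stopping_rounds = split_lgbm_config(config)
    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    return lgb.train(
        params,
        train_set,
        num_boost_round=num_boost_round,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(early_stopping_rounds)],
    )


params, num_boost_round, early_stopping_rounds = split_lgbm_config(config)
```

With `learning_rate: 0.1` and early stopping after 100 rounds without improvement, the 5000-round budget acts as an upper bound rather than a target.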
Since the diagram below is quite wide (it uses multiple input files), here is a link to the larger version.
Check our GitHub organization https://github.com/neptune-ml for more cool stuff.
Kamil & Kuba, core contributors
- chestnut 🌰: LightGBM and basic features
- seedling 🌱: Sklearn and XGBoost algorithms and groupby features
- blossom 🌼: LightGBM on selected features
- tulip 🌷: LightGBM with smarter features
- sunflower 🌻: LightGBM clean dynamic features
- four leaf clover 🍀: Stacking by feature diversity and model diversity