LightGBM clean dynamic features
We continue working with a single model, LightGBM. Our primary focus is on feature engineering. Thanks to this approach we obtained significant gains on both local CV and LB.
- 5-fold stratified K-fold (5 is a parameter in the configuration file: neptune.yaml#L38)
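As an illustration of the validation scheme, here is a minimal sketch of stratified 5-fold splitting with scikit-learn. The toy data, the variable names and the use of `StratifiedKFold` are assumptions for the example; the project's own splitting code is not shown here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data standing in for the real feature matrix and binary target (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = np.array([1] * 20 + [0] * 180)  # 10% positives, i.e. an imbalanced target

n_splits = 5  # in the project this value is read from the configuration file
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)

fold_positive_rates = []
for fold_id, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    # Stratification keeps the class balance of every validation fold
    # close to the balance of the full data set.
    fold_positive_rates.append(y[valid_idx].mean())
    print(fold_id, len(train_idx), len(valid_idx), fold_positive_rates[-1])
```

Stratifying matters with an imbalanced target: without it, a fold can end up with noticeably more or fewer positives than the others, which makes per-fold AUC noisier.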
We realized that we had previously been cleaning tables only for the hand-crafted features and not for the aggregations, which was a mistake. We have fixed it in this release.
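# Assign the cleaned column back instead of using chained indexing, which may silently modify a copy
X['DAYS_LAST_PHONE_CHANGE'] = X['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan)
bureau['DAYS_CREDIT_ENDDATE'] = bureau['DAYS_CREDIT_ENDDATE'].mask(bureau['DAYS_CREDIT_ENDDATE'] < -40000)
bureau['DAYS_CREDIT_UPDATE'] = bureau['DAYS_CREDIT_UPDATE'].mask(bureau['DAYS_CREDIT_UPDATE'] < -40000)
bureau['DAYS_ENDDATE_FACT'] = bureau['DAYS_ENDDATE_FACT'].mask(bureau['DAYS_ENDDATE_FACT'] < -40000)
credit_card['AMT_DRAWINGS_ATM_CURRENT'] = credit_card['AMT_DRAWINGS_ATM_CURRENT'].mask(credit_card['AMT_DRAWINGS_ATM_CURRENT'] < 0)
credit_card['AMT_DRAWINGS_CURRENT'] = credit_card['AMT_DRAWINGS_CURRENT'].mask(credit_card['AMT_DRAWINGS_CURRENT'] < 0)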
Altogether, we use 1092 features in our solution (it does need some pruning :) )
Application data -> eda-application.ipynb
- Hand Crafted features
credit_per_child 0.033503
credit_per_person 0.023462
child_to_non_child_ratio 0.020943
credit_per_non_child 0.020244
cnt_non_child 0.012195
income_per_non_child 0.001947
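The feature names above suggest simple ratios of application columns. The sketch below shows one plausible way to compute them with pandas; the formulas and the helper name `add_family_features` are assumptions inferred from the names, not a copy of the project's code.

```python
import numpy as np
import pandas as pd


def add_family_features(X):
    """Hypothetical helper: formulas are guesses based on the feature names."""
    X = X.copy()
    X['cnt_non_child'] = X['CNT_FAM_MEMBERS'] - X['CNT_CHILDREN']
    # Turn zero denominators into NaN so that divisions give NaN rather than inf.
    non_child = X['cnt_non_child'].replace(0, np.nan)
    X['child_to_non_child_ratio'] = X['CNT_CHILDREN'] / non_child
    X['income_per_non_child'] = X['AMT_INCOME_TOTAL'] / non_child
    X['credit_per_person'] = X['AMT_CREDIT'] / X['CNT_FAM_MEMBERS']
    # Adding 1 avoids dividing by zero for applicants without children (assumption).
    X['credit_per_child'] = X['AMT_CREDIT'] / (1 + X['CNT_CHILDREN'])
    X['credit_per_non_child'] = X['AMT_CREDIT'] / non_child
    return X


# Toy rows standing in for the application table.
application = pd.DataFrame({
    'AMT_CREDIT': [400000.0, 100000.0],
    'AMT_INCOME_TOTAL': [200000.0, 90000.0],
    'CNT_CHILDREN': [2, 0],
    'CNT_FAM_MEMBERS': [4.0, 1.0],
})
features = add_family_features(application)
print(features[['cnt_non_child', 'credit_per_person', 'credit_per_child']])
```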
- Aggregation features: Adding more numerical values over which aggregations are performed helped considerably. We are now using the following:
cols_to_agg = ['AMT_CREDIT',
'AMT_ANNUITY',
'AMT_INCOME_TOTAL',
'AMT_GOODS_PRICE',
'EXT_SOURCE_1',
'EXT_SOURCE_2',
'EXT_SOURCE_3',
'OWN_CAR_AGE',
'REGION_POPULATION_RELATIVE',
'DAYS_REGISTRATION',
'CNT_CHILDREN',
'CNT_FAM_MEMBERS',
'DAYS_ID_PUBLISH',
'DAYS_BIRTH',
'DAYS_EMPLOYED'
]
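To make the idea of aggregation features concrete, here is a minimal sketch of group-wise aggregations with pandas. The grouping column (`CODE_GENDER`), the chosen statistics, the `_diff` features and the helper name `add_group_aggregations` are assumptions for illustration; the project's actual grouping recipes may differ.

```python
import pandas as pd


def add_group_aggregations(df, groupby_cols, cols_to_agg, aggs=('mean', 'max')):
    """Hypothetical helper: attach group statistics and row-vs-group differences."""
    df = df.copy()
    for col in cols_to_agg:
        for agg in aggs:
            name = '{}_{}_{}'.format('_'.join(groupby_cols), agg, col)
            # transform() broadcasts the group statistic back to every row.
            df[name] = df.groupby(groupby_cols)[col].transform(agg)
            # How far the applicant is from the statistic of their group.
            df[name + '_diff'] = df[col] - df[name]
    return df


# Toy rows standing in for the application table.
application = pd.DataFrame({
    'CODE_GENDER': ['F', 'F', 'M', 'M'],
    'AMT_CREDIT': [100.0, 300.0, 200.0, 600.0],
    'AMT_ANNUITY': [10.0, 30.0, 20.0, 40.0],
})
out = add_group_aggregations(application, ['CODE_GENDER'], ['AMT_CREDIT', 'AMT_ANNUITY'])
print(out.filter(like='AMT_CREDIT'))
```

Every column added to `cols_to_agg` multiplies the number of generated features by the number of groupings and statistics, which is one way a feature count can grow into the hundreds.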
Installment Payments data -> eda-installments.ipynb
- Hand crafted features:
We added fraction features that divide each short-term feature by its long-term counterpart. They may not add much in terms of correlation, but they help tree-based models.
def last_k_installment_features_with_fractions(gr, periods, period_fractions):
    features = InstallmentPaymentsFeatures.last_k_installment_features(gr, periods)

    for short_period, long_period in period_fractions:
        short_feature_names = get_feature_names_by_period(features, short_period)
        long_feature_names = get_feature_names_by_period(features, long_period)

        for short_feature, long_feature in zip(short_feature_names, long_feature_names):
            old_name_chunk = '_{}_'.format(short_period)
            new_name_chunk = '_{}by{}_fraction_'.format(short_period, long_period)
            fraction_feature_name = short_feature.replace(old_name_chunk, new_name_chunk)
            features[fraction_feature_name] = safe_div(features[short_feature], features[long_feature])
    return features
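The function above relies on two helpers that are not shown, `safe_div` and `get_feature_names_by_period`. The self-contained sketch below gives one possible implementation of each and applies the same fraction logic to a plain dict of precomputed features. The helper bodies, the function name `add_period_fractions` and the toy feature names are assumptions, not the project's code.

```python
import numpy as np


def safe_div(a, b):
    """Assumed behavior: return a / b, or NaN if b is zero or a value is missing."""
    if a is None or b is None or np.isnan(a) or np.isnan(b) or b == 0:
        return np.nan
    return a / b


def get_feature_names_by_period(features, period):
    """Assumed behavior: sorted names containing '_<period>_'.

    Sorting matters because the caller pairs short- and long-period
    names positionally with zip().
    """
    chunk = '_{}_'.format(period)
    return sorted(name for name in features if chunk in name)


def add_period_fractions(features, period_fractions):
    features = dict(features)  # do not mutate the caller's dict
    for short_period, long_period in period_fractions:
        short_names = get_feature_names_by_period(features, short_period)
        long_names = get_feature_names_by_period(features, long_period)
        for short_name, long_name in zip(short_names, long_names):
            new_name = short_name.replace(
                '_{}_'.format(short_period),
                '_{}by{}_fraction_'.format(short_period, long_period))
            features[new_name] = safe_div(features[short_name], features[long_name])
    return features


# Toy features for one applicant (hypothetical names and values).
toy_features = {
    'last_5_paid_late_mean': 2.0,
    'last_5_paid_late_sum': 4.0,
    'last_50_paid_late_mean': 4.0,
    'last_50_paid_late_sum': 0.0,
}
result = add_period_fractions(toy_features, period_fractions=[(5, 50)])
print(result['last_5by50_fraction_paid_late_mean'])
```

A fraction like "last 5 versus last 50" tells a tree directly whether recent behavior is better or worse than the long-run behavior, something a single split on either raw feature cannot express.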
POS Cash Balance application data -> eda-pos_cash_balance.ipynb
Below are the 3 feature groups generated for POS cash balance:
last_10_pos_cash_paid_late_with_tolerance_mean 0.052731
last_50_pos_cash_paid_late_with_tolerance_mean 0.048322
all_installment_pos_cash_paid_late_with_tolerance_mean 0.047050
last_10_pos_cash_paid_late_mean 0.043763
last_50_pos_cash_paid_late_with_tolerance_count 0.043304
last_50_pos_cash_paid_late_count 0.043304
all_installment_pos_cash_paid_late_count 0.035632
all_installment_pos_cash_paid_late_with_tolerance_count 0.035632
last_50_pos_cash_paid_late_mean 0.032820
all_installment_pos_cash_paid_late_mean 0.030616
...
last_1_SK_DPD_DEF_max 0.003592
last_50_SK_DPD_min 0.003585
last_50_SK_DPD_DEF_min 0.002355
last_1_pos_cash_paid_late_count NaN
last_1_pos_cash_paid_late_with_tolerance_count NaN
last_loan_pos_cash_paid_late_with_tolerance_mean 0.049801
last_loan_pos_cash_paid_late_mean 0.042730
last_loan_pos_cash_paid_late_with_tolerance_sum 0.028442
last_loan_pos_cash_paid_late_sum 0.015614
last_loan_SK_DPD_std 0.007400
last_loan_SK_DPD_DEF_std 0.007180
last_loan_SK_DPD_max 0.006939
last_loan_SK_DPD_DEF_max 0.006845
last_loan_SK_DPD_mean 0.006002
last_loan_SK_DPD_sum 0.004737
last_loan_pos_cash_paid_late_count 0.003446
last_loan_SK_DPD_DEF_mean 0.003391
last_loan_SK_DPD_DEF_min 0.003115
last_loan_SK_DPD_min 0.002458
last_loan_SK_DPD_DEF_sum 0.002150
60_period_trend_SK_DPD_DEF 0.010600
60_period_trend_SK_DPD 0.009394
6_period_trend_SK_DPD_DEF 0.004313
12_period_trend_SK_DPD_DEF 0.004157
30_period_trend_SK_DPD_DEF 0.003879
30_period_trend_SK_DPD 0.003474
6_period_trend_SK_DPD 0.001397
12_period_trend_SK_DPD 0.000429
1_period_trend_SK_DPD NaN
1_period_trend_SK_DPD_DEF NaN
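One way to read the `*_period_trend_*` features listed above is as the slope of a line fitted to the last k monthly values. The sketch below implements that reading with `np.polyfit`; the slope definition, the helper names and the use of `MONTHS_BALANCE` for ordering are assumptions rather than the project's confirmed method.

```python
import numpy as np
import pandas as pd


def trend_feature(values):
    """Slope of a least-squares line through values in chronological order."""
    values = np.asarray(values, dtype=float)
    if len(values) < 2:
        return np.nan  # a slope needs at least two points
    x = np.arange(len(values))
    slope, _intercept = np.polyfit(x, values, deg=1)
    return slope


def period_trends(group, periods, columns):
    """Hypothetical helper: trend of each column over the last `period` months."""
    # MONTHS_BALANCE counts months relative to the application (negative values),
    # so ascending order puts the oldest record first.
    group = group.sort_values('MONTHS_BALANCE')
    features = {}
    for period in periods:
        recent = group.tail(period)
        for col in columns:
            name = '{}_period_trend_{}'.format(period, col)
            features[name] = trend_feature(recent[col])
    return features


# Toy history of a single applicant.
pos_cash = pd.DataFrame({
    'MONTHS_BALANCE': [-1, -3, -2],
    'SK_DPD': [4, 0, 2],
    'SK_DPD_DEF': [1, 1, 1],
})
trends = period_trends(pos_cash, periods=[1, 3], columns=['SK_DPD', 'SK_DPD_DEF'])
print(trends)
```

Under this reading a 1-period trend is undefined because a single point has no slope, which would be consistent with the NaN values reported for the `1_period_trend_*` features.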
- We continue working with a single LightGBM model implemented here: models.py#L80
- Results for the new set of features are rather nice:
- CV 0.7950
- LB 0.804
- We trained the model with the following hyper-parameters (check the config file):
# Light GBM
lgbm__boosting_type: gbdt
lgbm__objective: binary
lgbm__metric: auc
lgbm__number_boosting_rounds: 5000
lgbm__early_stopping_rounds: 100
lgbm__learning_rate: 0.02
lgbm__max_bin: 300
lgbm__max_depth: -1
lgbm__num_leaves: 30
lgbm__min_child_samples: 70
lgbm__subsample: 1.0
lgbm__subsample_freq: 1
lgbm__colsample_bytree: 0.05
lgbm__min_gain_to_split: 0.5
lgbm__reg_lambda: 100
lgbm__reg_alpha: 0.0
lgbm__scale_pos_weight: 1
lgbm__is_unbalance: False
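The `lgbm__` prefix suggests that these entries are routed to the LightGBM step of the pipeline. As a rough sketch, the snippet below strips the prefix and separates model parameters from training-loop settings. The helper name `split_lgbm_config` and the split itself are assumptions; the dict holds only a subset of the values listed above.

```python
# A subset of the configuration above, written as a flat dict for illustration.
config = {
    'lgbm__boosting_type': 'gbdt',
    'lgbm__objective': 'binary',
    'lgbm__metric': 'auc',
    'lgbm__number_boosting_rounds': 5000,
    'lgbm__early_stopping_rounds': 100,
    'lgbm__learning_rate': 0.02,
    'lgbm__num_leaves': 30,
    'lgbm__colsample_bytree': 0.05,
    'lgbm__reg_lambda': 100,
}


def split_lgbm_config(config, prefix='lgbm__'):
    """Hypothetical helper: strip the prefix and separate training-loop settings."""
    training_keys = {'number_boosting_rounds', 'early_stopping_rounds'}
    model_params, training_params = {}, {}
    for key, value in config.items():
        if not key.startswith(prefix):
            continue  # entry belongs to another pipeline step
        name = key[len(prefix):]
        target = training_params if name in training_keys else model_params
        target[name] = value
    return model_params, training_params


model_params, training_params = split_lgbm_config(config)
print(model_params)
print(training_params)
```

With LightGBM installed, `model_params` could be passed as `params` to `lightgbm.train` and `training_params['number_boosting_rounds']` as `num_boost_round`. Note how the low learning rate (0.02) is paired with a high round limit and early stopping, and how `colsample_bytree: 0.05` lets each tree see only about 5% of the features.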
Since the diagram below is quite wide (it uses multiple input files), here is a link to the larger version.
[larger version](https://gist.githubusercontent.com/jakubczakon/cac72983726a970690ba7c33708e100b/raw/36b1da8a23f9226e4e936c176d7d928024ed7c00/home_credit_solution_5_pipeline.png)
Check our GitHub organization https://github.com/neptune-ml for more cool stuff.
Kamil & Kuba, core contributors
- chestnut 🌰: LightGBM and basic features
- seedling 🌱: Sklearn and XGBoost algorithms and groupby features
- blossom 🌼: LightGBM on selected features
- tulip 🌷: LightGBM with smarter features
- sunflower 🌻: LightGBM clean dynamic features
- four leaf clover 🍀: Stacking by feature diversity and model diversity