This repository has been archived by the owner on Jun 22, 2022. It is now read-only.
LightGBM on selected features
Kamil A. Kaczmarek edited this page Jul 11, 2018 · 27 revisions
In this solution we focus on feature engineering. We made use of several files in the dataset: previous_application, application, pos_cash_balance, installments_payments, bureau.
- CODE_GENDER: replace XNA with np.nan
- DAYS_EMPLOYED: replace 365243 with np.nan
- NAME_FAMILY_STATUS: replace Unknown with np.nan
- ORGANIZATION_TYPE: replace XNA with np.nan
- No missing value imputation
- Encode as categorical the following columns:
CATEGORICAL_COLUMNS = ['CODE_GENDER', 'EMERGENCYSTATE_MODE', 'FLAG_CONT_MOBILE',
'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_18',
'FLAG_EMAIL', 'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_PHONE', 'FLAG_WORK_PHONE',
'FONDKAPREMONT_MODE', 'HOUR_APPR_PROCESS_START', 'HOUSETYPE_MODE',
'LIVE_CITY_NOT_WORK_CITY', 'LIVE_REGION_NOT_WORK_REGION',
'NAME_CONTRACT_TYPE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
'OCCUPATION_TYPE', 'ORGANIZATION_TYPE',
'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'REG_REGION_NOT_LIVE_REGION',
'REG_REGION_NOT_WORK_REGION',
'WALLSMATERIAL_MODE', 'WEEKDAY_APPR_PROCESS_START']
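The cleaning steps above can be sketched in pandas as follows. This is a minimal sketch, not the repository's code: the function name `clean_application` and the shortened `CATEGORICAL_SUBSET` list are hypothetical, and using the pandas `category` dtype is an assumption about what "encode as categorical" means here (the pipeline may use a different encoder).

```python
import numpy as np
import pandas as pd

# Shortened, illustrative subset of the CATEGORICAL_COLUMNS list above (hypothetical name).
CATEGORICAL_SUBSET = ['CODE_GENDER', 'NAME_FAMILY_STATUS', 'ORGANIZATION_TYPE']


def clean_application(X, categorical_columns=CATEGORICAL_SUBSET):
    """Replace placeholder values with np.nan and cast categorical columns."""
    X = X.copy()
    X['CODE_GENDER'] = X['CODE_GENDER'].replace('XNA', np.nan)
    X['DAYS_EMPLOYED'] = X['DAYS_EMPLOYED'].replace(365243, np.nan)
    X['NAME_FAMILY_STATUS'] = X['NAME_FAMILY_STATUS'].replace('Unknown', np.nan)
    X['ORGANIZATION_TYPE'] = X['ORGANIZATION_TYPE'].replace('XNA', np.nan)
    # No imputation: missing values are left as NaN.
    for column in categorical_columns:
        # Assumption: pandas category dtype stands in for the categorical encoding.
        X[column] = X[column].astype('category')
    return X


# Tiny made-up frame, only to show the effect of the cleaning.
raw = pd.DataFrame({
    'CODE_GENDER': ['M', 'XNA', 'F'],
    'DAYS_EMPLOYED': [-1200, 365243, -300],
    'NAME_FAMILY_STATUS': ['Married', 'Unknown', 'Single'],
    'ORGANIZATION_TYPE': ['XNA', 'School', 'Bank'],
})
cleaned = clean_application(raw)
```

After cleaning, the placeholder values show up as missing values rather than as a separate category or an extreme number.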
Application data -> eda-application.ipynb
- Raw numerical columns
- Raw categorical columns
- Engineered features
X['annuity_income_percentage'] = X['AMT_ANNUITY'] / X['AMT_INCOME_TOTAL']
X['car_to_birth_ratio'] = X['OWN_CAR_AGE'] / X['DAYS_BIRTH']
X['car_to_employ_ratio'] = X['OWN_CAR_AGE'] / X['DAYS_EMPLOYED']
X['children_ratio'] = X['CNT_CHILDREN'] / X['CNT_FAM_MEMBERS']
X['credit_to_annuity_ratio'] = X['AMT_CREDIT'] / X['AMT_ANNUITY']
X['credit_to_goods_ratio'] = X['AMT_CREDIT'] / X['AMT_GOODS_PRICE']
X['credit_to_income_ratio'] = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL']
X['days_employed_percentage'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
X['ext_sources_mean'] = X[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
X['income_credit_percentage'] = X['AMT_INCOME_TOTAL'] / X['AMT_CREDIT']
X['income_per_child'] = X['AMT_INCOME_TOTAL'] / (1 + X['CNT_CHILDREN'])
X['income_per_person'] = X['AMT_INCOME_TOTAL'] / X['CNT_FAM_MEMBERS']
X['payment_rate'] = X['AMT_ANNUITY'] / X['AMT_CREDIT']
X['phone_to_birth_ratio'] = X['DAYS_LAST_PHONE_CHANGE'] / X['DAYS_BIRTH']
X['phone_to_employ_ratio'] = X['DAYS_LAST_PHONE_CHANGE'] / X['DAYS_EMPLOYED']
- ext_sources_mean has the strongest correlation with the target. Check below:
ext_sources_mean 0.222052
credit_to_goods_ratio 0.069427
car_to_birth_ratio 0.048824
days_employed_percentage 0.042206
phone_to_birth_ratio 0.033991
credit_to_annuity_ratio 0.032102
car_to_employ_ratio 0.030553
children_ratio 0.021223
annuity_income_percentage 0.014265
payment_rate 0.012704
income_per_child 0.012529
credit_to_income_ratio 0.007727
income_per_person 0.006571
phone_to_employ_ratio 0.004562
income_credit_percentage 0.001817
- Aggregated features. Aggregations are constructed from recipes (check pipeline_config.py) like this:
AGGREGATION_RECIPIES = [
(['CODE_GENDER', 'NAME_EDUCATION_TYPE'], [('AMT_ANNUITY', 'max'),
('AMT_CREDIT', 'max'),
('EXT_SOURCE_1', 'mean'),
('EXT_SOURCE_2', 'mean'),
('OWN_CAR_AGE', 'max'),
('OWN_CAR_AGE', 'sum')]),
]
Again, features constructed from EXT_SOURCE_X are the most important. Check correlations below:
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1 0.089964
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_2 0.089235
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1 0.086676
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_1 0.083520
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_2 0.082742
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_ELEVATORS_AVG 0.078057
OCCUPATION_TYPE_mean_EXT_SOURCE_1 0.076587
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_AMT_REQ_CREDIT_BUREAU_YEAR 0.074528
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_YEARS_BUILD_AVG 0.073816
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_NONLIVINGAREA_AVG 0.073730
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_OWN_CAR_AGE 0.073535
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_APARTMENTS_AVG 0.072854
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_BASEMENTAREA_AVG 0.072231
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_1 0.071557
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_AMT_CREDIT 0.071023
OCCUPATION_TYPE_mean_EXT_SOURCE_2 0.070659
CODE_GENDER_ORGANIZATION_TYPE_mean_EXT_SOURCE_1 0.070028
CODE_GENDER_NAME_EDUCATION_TYPE_max_OWN_CAR_AGE 0.067620
CODE_GENDER_REG_CITY_NOT_WORK_CITY_mean_CNT_CHILDREN 0.066059
CODE_GENDER_NAME_EDUCATION_TYPE_max_AMT_CREDIT 0.065317
CODE_GENDER_NAME_EDUCATION_TYPE_max_AMT_ANNUITY 0.064173
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_2 0.063390
CODE_GENDER_NAME_EDUCATION_TYPE_sum_OWN_CAR_AGE 0.062637
CODE_GENDER_ORGANIZATION_TYPE_mean_DAYS_REGISTRATION 0.052398
CODE_GENDER_ORGANIZATION_TYPE_mean_AMT_ANNUITY 0.052341
CODE_GENDER_ORGANIZATION_TYPE_mean_AMT_INCOME_TOTAL 0.050268
OCCUPATION_TYPE_mean_DAYS_EMPLOYED 0.050074
CODE_GENDER_REG_CITY_NOT_WORK_CITY_mean_AMT_ANNUITY 0.048534
OCCUPATION_TYPE_mean_AMT_ANNUITY 0.046566
CODE_GENDER_REG_CITY_NOT_WORK_CITY_mean_DAYS_ID_PUBLISH 0.040932
OCCUPATION_TYPE_mean_DAYS_REGISTRATION 0.035164
OCCUPATION_TYPE_mean_CNT_CHILDREN 0.019836
OCCUPATION_TYPE_mean_EXT_SOURCE_3 0.007225
OCCUPATION_TYPE_mean_CNT_FAM_MEMBERS 0.005959
OCCUPATION_TYPE_mean_DAYS_BIRTH 0.003795
OCCUPATION_TYPE_mean_DAYS_ID_PUBLISH 0.002663
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_3 0.001598
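One way such recipes could be turned into features and then ranked is sketched below. It assumes that each feature is a group-level statistic broadcast back to every row, that names follow the `<groupby columns>_<aggregation>_<column>` pattern seen in the list above, and that the reported numbers are absolute Pearson correlations. The helper names `add_groupby_features` and `correlation_with_target` are hypothetical, and the recipe is shortened.

```python
import pandas as pd

# Shortened recipe in the same (groupby columns, [(column, aggregation)]) shape.
AGGREGATION_RECIPIES = [
    (['CODE_GENDER', 'NAME_EDUCATION_TYPE'], [('AMT_CREDIT', 'max'),
                                              ('EXT_SOURCE_1', 'mean')]),
]


def add_groupby_features(X, recipes):
    """Attach one group-level statistic per (groupby columns, column, aggregation)."""
    X = X.copy()
    for groupby_cols, specs in recipes:
        for select, agg in specs:
            name = '{}_{}_{}'.format('_'.join(groupby_cols), agg, select)
            # transform broadcasts the group statistic back to each row of the group
            X[name] = X.groupby(groupby_cols)[select].transform(agg)
    return X


def correlation_with_target(X, target, feature_names):
    """Absolute Pearson correlation of each feature with the target, descending."""
    return X[feature_names].corrwith(target).abs().sort_values(ascending=False)


# Made-up rows, only to exercise the helpers.
X = pd.DataFrame({
    'CODE_GENDER': ['F', 'F', 'M', 'M'],
    'NAME_EDUCATION_TYPE': ['Higher', 'Higher', 'Secondary', 'Secondary'],
    'AMT_CREDIT': [100.0, 300.0, 200.0, 50.0],
    'EXT_SOURCE_1': [0.2, 0.4, 0.6, 0.8],
})
target = pd.Series([1, 1, 0, 0])
X_agg = add_groupby_features(X, AGGREGATION_RECIPIES)
new_features = [c for c in X_agg.columns if c not in X.columns]
ranking = correlation_with_target(X_agg, target, new_features)
```

Rows whose groupby key is missing need separate handling, since pandas drops NaN keys from groups by default.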
Bureau data -> eda-bureau.ipynb
- Hand-crafted features (with correlation with target):
bureau_credit_active_binary 0.105735
bureau_debt_credit_ratio 0.096372
bureau_credit_enddate_percentage 0.053573
bureau_total_customer_overdue 0.052995
bureau_total_customer_credit 0.041768
bureau_total_customer_debt 0.019435
bureau_number_of_loan_types 0.018792
bureau_average_of_past_loans_per_type 0.014492
bureau_average_creditdays_prolonged 0.011719
bureau_overdue_debt_ratio 0.008374
bureau_number_of_past_loans 0.006160
- statistical aggregations (mean, sum, min, max)
SK_ID_CURR_mean_DAYS_CREDIT 0.089729
SK_ID_CURR_min_DAYS_CREDIT 0.075248
SK_ID_CURR_mean_DAYS_CREDIT_UPDATE 0.068927
SK_ID_CURR_sum_DAYS_CREDIT_ENDDATE 0.053735
SK_ID_CURR_max_DAYS_CREDIT 0.049782
SK_ID_CURR_mean_DAYS_CREDIT_ENDDATE 0.046983
SK_ID_CURR_min_DAYS_CREDIT_UPDATE 0.042864
SK_ID_CURR_sum_DAYS_CREDIT 0.042000
SK_ID_CURR_sum_DAYS_CREDIT_UPDATE 0.041404
...
SK_ID_CURR_min_AMT_CREDIT_SUM_DEBT 0.000242
SK_ID_CURR_min_CNT_CREDIT_PROLONG 0.000182
SK_ID_CURR_min_AMT_CREDIT_SUM_OVERDUE 0.000003
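The `SK_ID_CURR_<aggregation>_<column>` features can be sketched as a groupby over the one-to-many table followed by a left join onto the application table. This is an illustration under assumptions: the helper name `aggregate_per_customer` is hypothetical, the data is made up, and the repository may build these features differently.

```python
import pandas as pd


def aggregate_per_customer(table, columns, aggregations=('mean', 'sum', 'min', 'max')):
    """Aggregate a one-to-many table down to one row per SK_ID_CURR."""
    features = table.groupby('SK_ID_CURR')[list(columns)].agg(list(aggregations))
    # Flatten the (column, aggregation) MultiIndex into SK_ID_CURR_<agg>_<column>.
    features.columns = ['SK_ID_CURR_{}_{}'.format(agg, column)
                        for column, agg in features.columns]
    return features.reset_index()


# Made-up bureau records: customer 1 has two credits, customer 2 has one.
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'DAYS_CREDIT': [-100, -300, -50],
})
application = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

bureau_features = aggregate_per_customer(bureau, ['DAYS_CREDIT'])
# Left join keeps customers without bureau history; their features stay NaN.
merged = application.merge(bureau_features, on='SK_ID_CURR', how='left')
```

Customer 3 has no bureau records, so all of its aggregated features are NaN after the merge, which is consistent with the no-imputation choice described earlier.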
Credit Card data -> eda-credit_card.ipynb
- Hand-crafted features (with correlation with target):
credit_card_drawings_atm 0.038106
credit_card_installments_per_loan 0.031622
credit_card_total_instalments 0.031304
credit_card_drawings_total 0.023680
credit_card_number_of_loans 0.004388
credit_card_average_of_days_past_due 0.003195
credit_card_avg_loading_of_credit_limit 0.002944
credit_card_cash_card_ratio 0.002414
- statistical aggregations (mean, sum, min, max, var)
SK_ID_CURR_mean_CNT_DRAWINGS_ATM_CURRENT 0.107692
SK_ID_CURR_max_CNT_DRAWINGS_CURRENT 0.101389
SK_ID_CURR_mean_AMT_BALANCE 0.087177
SK_ID_CURR_mean_CNT_DRAWINGS_CURRENT 0.082520
SK_ID_CURR_max_AMT_BALANCE 0.068798
SK_ID_CURR_min_AMT_BALANCE 0.064163
SK_ID_CURR_max_CNT_DRAWINGS_ATM_CURRENT 0.063729
SK_ID_CURR_var_CNT_DRAWINGS_CURRENT 0.062892
SK_ID_CURR_mean_MONTHS_BALANCE 0.062081
SK_ID_CURR_min_MONTHS_BALANCE 0.061359
SK_ID_CURR_var_CNT_DRAWINGS_ATM_CURRENT 0.061123
SK_ID_CURR_mean_AMT_DRAWINGS_ATM_CURRENT 0.059925
SK_ID_CURR_sum_MONTHS_BALANCE 0.059051
SK_ID_CURR_var_MONTHS_BALANCE 0.058817
SK_ID_CURR_mean_AMT_DRAWINGS_CURRENT 0.058732
SK_ID_CURR_max_AMT_DRAWINGS_CURRENT 0.052318
SK_ID_CURR_sum_CNT_DRAWINGS_CURRENT 0.050685
SK_ID_CURR_sum_CNT_DRAWINGS_ATM_CURRENT 0.049970
SK_ID_CURR_sum_AMT_CREDIT_LIMIT_ACTUAL 0.045460
SK_ID_CURR_sum_CNT_INSTALMENT_MATURE_CUM 0.042363
...
SK_ID_CURR_max_AMT_PAYMENT_CURRENT 0.000438
SK_ID_CURR_sum_CNT_DRAWINGS_OTHER_CURRENT 0.000227
SK_ID_CURR_max_CNT_DRAWINGS_OTHER_CURRENT 0.000008
SK_ID_CURR_min_SK_DPD NaN
SK_ID_CURR_min_SK_DPD_DEF NaN
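A NaN correlation, as in the last two rows above, is what pandas returns when a column has zero variance. Whether that is the actual cause here is an assumption that has not been checked against the data. The sketch below uses made-up values to show the effect and one way to spot such columns.

```python
import pandas as pd

# Made-up data: one constant feature and one that varies with the target.
frame = pd.DataFrame({
    'TARGET': [0, 1, 0, 1, 0],
    'constant_feature': [0, 0, 0, 0, 0],   # zero variance
    'varying_feature': [1, 5, 2, 6, 1],
})

# Absolute Pearson correlation of every feature with the target.
correlations = frame.corr()['TARGET'].drop('TARGET').abs()

# Columns with at most one distinct value cannot be correlated with anything.
constant_columns = [column for column in frame.columns
                    if frame[column].nunique(dropna=True) <= 1]
```

Features flagged this way carry no signal in a correlation screen and can be dropped before training.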
Installments Payments data -> eda-installments.ipynb
- statistical aggregations (mean, sum, min, max, var)
SK_ID_CURR_min_DAYS_ENTRY_PAYMENT 0.058794
SK_ID_CURR_min_DAYS_INSTALMENT 0.058648
SK_ID_CURR_var_DAYS_INSTALMENT 0.052273
SK_ID_CURR_var_DAYS_ENTRY_PAYMENT 0.052071
SK_ID_CURR_mean_DAYS_ENTRY_PAYMENT 0.043992
SK_ID_CURR_mean_DAYS_INSTALMENT 0.043509
SK_ID_CURR_sum_DAYS_ENTRY_PAYMENT 0.035227
SK_ID_CURR_sum_DAYS_INSTALMENT 0.035064
SK_ID_CURR_min_NUM_INSTALMENT_VERSION 0.032039
SK_ID_CURR_sum_NUM_INSTALMENT_VERSION 0.030063
SK_ID_CURR_mean_NUM_INSTALMENT_VERSION 0.027323
SK_ID_CURR_min_AMT_PAYMENT 0.025724
SK_ID_CURR_sum_AMT_PAYMENT 0.024375
SK_ID_CURR_mean_AMT_PAYMENT 0.023169
SK_ID_CURR_min_AMT_INSTALMENT 0.020257
SK_ID_CURR_sum_AMT_INSTALMENT 0.019811
SK_ID_CURR_max_NUM_INSTALMENT_VERSION 0.018611
SK_ID_CURR_mean_AMT_INSTALMENT 0.018409
SK_ID_CURR_sum_NUM_INSTALMENT_NUMBER 0.017441
SK_ID_CURR_var_NUM_INSTALMENT_VERSION 0.011427
SK_ID_CURR_mean_NUM_INSTALMENT_NUMBER 0.009537
SK_ID_CURR_max_NUM_INSTALMENT_NUMBER 0.006304
SK_ID_CURR_var_AMT_PAYMENT 0.003841
SK_ID_CURR_max_DAYS_INSTALMENT 0.003231
SK_ID_CURR_min_NUM_INSTALMENT_NUMBER 0.002334
SK_ID_CURR_max_AMT_INSTALMENT 0.002324
SK_ID_CURR_max_DAYS_ENTRY_PAYMENT 0.002298
SK_ID_CURR_var_AMT_INSTALMENT 0.002151
SK_ID_CURR_max_AMT_PAYMENT 0.001554
SK_ID_CURR_var_NUM_INSTALMENT_NUMBER 0.001040
Pos Cash Balance -> eda-pos_cash_balance.ipynb
- statistical aggregations (mean, sum, min, max, var)
SK_ID_CURR_min_MONTHS_BALANCE 0.055307
SK_ID_CURR_var_MONTHS_BALANCE 0.048760
SK_ID_CURR_sum_MONTHS_BALANCE 0.040570
SK_ID_CURR_mean_MONTHS_BALANCE 0.034543
SK_ID_CURR_max_SK_DPD_DEF 0.009580
SK_ID_CURR_mean_SK_DPD_DEF 0.006496
SK_ID_CURR_min_SK_DPD 0.005444
SK_ID_CURR_mean_SK_DPD 0.005436
SK_ID_CURR_sum_SK_DPD_DEF 0.004950
SK_ID_CURR_max_SK_DPD 0.004763
SK_ID_CURR_sum_SK_DPD 0.004740
SK_ID_CURR_min_SK_DPD_DEF 0.004702
SK_ID_CURR_max_MONTHS_BALANCE 0.004321
SK_ID_CURR_var_SK_DPD_DEF 0.004076
SK_ID_CURR_var_SK_DPD 0.003361
Previous application data -> eda-previous_application.ipynb
- statistical aggregations (mean, sum, min, max, var)
SK_ID_CURR_min_DAYS_DECISION 0.053434
SK_ID_CURR_var_DAYS_DECISION 0.048513
SK_ID_CURR_mean_DAYS_DECISION 0.046864
SK_ID_CURR_var_CNT_PAYMENT 0.041960
SK_ID_CURR_sum_RATE_DOWN_PAYMENT 0.041693
SK_ID_CURR_max_RATE_DOWN_PAYMENT 0.040096
SK_ID_CURR_mean_HOUR_APPR_PROCESS_START 0.035927
SK_ID_CURR_mean_AMT_ANNUITY 0.034871
SK_ID_CURR_mean_RATE_DOWN_PAYMENT 0.033601
SK_ID_CURR_min_AMT_ANNUITY 0.032249
SK_ID_CURR_min_HOUR_APPR_PROCESS_START 0.031427
SK_ID_CURR_max_HOUR_APPR_PROCESS_START 0.030847
SK_ID_CURR_max_CNT_PAYMENT 0.029439
...
SK_ID_CURR_sum_AMT_GOODS_PRICE 0.004662
SK_ID_CURR_sum_AMT_APPLICATION 0.004607
SK_ID_CURR_var_AMT_DOWN_PAYMENT 0.002022
- 5-fold stratified K-fold validation (5 is a parameter in the configuration file: neptune.yaml#L38)
- We continue to use LightGBM (check models.py#L80)
- Results -> CV 0.784, LB 0.790 were obtained for the following set of hyper-parameters (check config):
# Light GBM
lgbm_random_search_runs: 0
lgbm__device: cpu # gpu cpu
lgbm__boosting_type: gbdt
lgbm__objective: binary
lgbm__metric: auc
lgbm__number_boosting_rounds: 500
lgbm__early_stopping_rounds: 50
lgbm__learning_rate: 0.1
lgbm__max_bin: 300
lgbm__max_depth: -1
lgbm__num_leaves: 100
lgbm__min_child_samples: 600
lgbm__subsample: 1.0
lgbm__subsample_freq: 1
lgbm__colsample_bytree: 0.1
lgbm__min_gain_to_split: 0.5
lgbm__reg_lambda: 50.0
lgbm__reg_alpha: 0.0
lgbm__scale_pos_weight: 1
- Diagrams below show that there is not much diversity between folds (which is good) but quite a large gap between train and valid scores (which is badish and suggests overfitting).
- Tweaking the parameters responsible for regularization may help:
lgbm__min_child_samples: 600
lgbm__subsample: 1.0
lgbm__subsample_freq: 1
lgbm__colsample_bytree: 0.1
lgbm__min_gain_to_split: 0.5
lgbm__reg_lambda: 50.0
lgbm__reg_alpha: 0.0
- One can also investigate slightly different tree architectures. As of now the trees are deep and wide:
lgbm__max_depth: -1
lgbm__num_leaves: 100
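The 5-fold stratified split mentioned above can be sketched with scikit-learn's `StratifiedKFold`. This is an illustration on synthetic data, not the repository's validation code: the constant `N_FOLDS`, the shuffling, the seed, and the 8% positive rate are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5  # hypothetical name for the fold count kept in the configuration file

# Synthetic, imbalanced binary target (roughly 8% positives), made up for the demo.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = (rng.uniform(size=1000) < 0.08).astype(int)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=0)
fold_positive_rates = []
for train_idx, valid_idx in skf.split(X, y):
    # A model would be fit on X[train_idx], y[train_idx] and scored on the
    # validation part here; comparing the two scores per fold exposes the
    # train/valid gap discussed above.
    fold_positive_rates.append(y[valid_idx].mean())
```

Stratification keeps the share of positives nearly identical in every validation fold, so differences in score between folds are not caused by differing class balance.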
For your reference, we put the entire pipeline here.
Check our GitHub organization https://github.com/neptune-ml for more cool stuff.
Kamil & Kuba, core contributors
- chestnut 🌰: LightGBM and basic features
- seedling 🌱: Sklearn and XGBoost algorithms and groupby features
- blossom 🌼: LightGBM on selected features
- tulip 🌷: LightGBM with smarter features
- sunflower 🌻: LightGBM clean dynamic features
- four leaf clover 🍀: Stacking by feature diversity and model diversity