Extremely randomized trees #2671
Conversation
Here is a short script to test the performance on the data from the Porto Seguro competition on Kaggle. Overfitting is reduced slightly using extra-trees. However, it's a little slower, and I'm not sure why.
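The script itself is not reproduced above. A minimal sketch of such a comparison, assuming the Kaggle train.csv is in the working directory and that this PR exposes the new option as the `extra_trees` parameter, might look like this (not the author's actual benchmark):

```python
# A minimal sketch only, not the original benchmark script from this comment.
# Assumes the Kaggle Porto Seguro train.csv is in the working directory and
# that this PR exposes the new option as the `extra_trees` parameter.
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')
X = df.drop(columns=['id', 'target'])
y = df['target']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

for extra in [False, True]:
    params = {'objective': 'binary', 'learning_rate': 0.1,
              'extra_trees': extra, 'verbose': -1}
    booster = lgb.train(params, lgb.Dataset(X_train, y_train), num_boost_round=200)
    # the train/valid AUC gap is a rough measure of overfitting
    print('extra_trees =', extra,
          'train AUC =', roc_auc_score(y_train, booster.predict(X_train)),
          'valid AUC =', roc_auc_score(y_valid, booster.predict(X_valid)))
```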
@btrotta As always, great contribution! Thank you very much!
If it's not hard for you, could you please add to your benchmark the extra-trees forest from scikit-learn and LightGBM with {'boosting': 'rf', 'extra_trees': True}? I think it would be interesting to compare.
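A rough sketch of the two extra configurations being requested here, using synthetic data for self-containedness (the real benchmark would of course use the Porto Seguro data):

```python
# Rough sketch of the additional comparisons requested above: scikit-learn's
# extra-trees forest and LightGBM's random-forest mode with extra_trees.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# scikit-learn extremely randomized trees
et = ExtraTreesClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X, y)

# LightGBM random-forest mode with extra trees; rf boosting requires bagging
# to be enabled (bagging_fraction < 1 and bagging_freq > 0)
rf_params = {
    'objective': 'binary',
    'boosting': 'rf',
    'extra_trees': True,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'feature_fraction': 0.8,
    'verbose': -1,
}
rf_booster = lgb.train(rf_params, lgb.Dataset(X, y), num_boost_round=200)
```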
@btrotta For random trees, I think it is not necessary to construct histograms, which are the most time-consuming part in LightGBM. You can simply generate the random trees, then predict over these trees, then boost, and move on to the next tree, ...
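As I read this suggestion, the idea is roughly the following. This is a runnable toy sketch in plain NumPy (an oblivious, fixed-depth random tree for brevity; the helper names are made up and none of this is LightGBM's internal code):

```python
# Toy sketch of the suggested procedure: grow a tree with completely random
# structure (no histograms), then fit its leaf values from the gradients and
# hessians, as in an ordinary boosting step. Illustration only.
import numpy as np

def grow_random_tree(X, depth, rng):
    """One (feature, threshold) pair per level: an oblivious-style tree, kept simple on purpose."""
    splits = []
    for _ in range(depth):
        f = rng.integers(X.shape[1])
        splits.append((f, rng.uniform(X[:, f].min(), X[:, f].max())))
    return splits

def leaf_indices(X, splits):
    idx = np.zeros(len(X), dtype=int)
    for f, thr in splits:
        idx = idx * 2 + (X[:, f] > thr)
    return idx

def fit_leaf_values(leaf_idx, grad, hess, num_leaves, lam=1.0):
    values = np.zeros(num_leaves)
    for leaf in range(num_leaves):
        m = leaf_idx == leaf
        if m.any():
            values[leaf] = -grad[m].sum() / (hess[m].sum() + lam)  # Newton step per leaf
    return values

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
pred = np.zeros(len(X))
for _ in range(10):  # boosting loop: random structure, refit leaves, next tree
    grad, hess = pred - y, np.ones(len(X))  # squared-error gradients/hessians for simplicity
    splits = grow_random_tree(X, depth=3, rng=rng)
    idx = leaf_indices(X, splits)
    pred += 0.1 * fit_leaf_values(idx, grad, hess, 2 ** 3)[idx]
print('final MSE:', np.mean((pred - y) ** 2))
```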
@guolinke Thanks for the advice! I will try implementing it that way.
@guolinke I made an attempt at implementing your suggestion, but it actually made the performance worse, so I think I must be doing something wrong. Code is on a new branch (https://github.com/btrotta/LightGBM/tree/extra2) if you want to take a look. Here's a summary of how I tried to implement it. When finding each new node, we try a random split of each feature on the node's data, then choose the feature having the best random split and split the node on that feature. In theory, it seems to me that this should be at least as fast as the normal GBDT algorithm, since instead of constructing the full histogram (where we need to aggregate gradients and hessians for many bins and save the result in memory), we only have to sum up gradients and hessians for one split. But in fact it takes around twice as long (using the example script above), and I don't understand why. I'd be grateful if you have any insights. Note: I didn't re-implement
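For concreteness, here is a small self-contained sketch of that per-node search (one random threshold per feature, then keep the best feature), written against plain NumPy arrays rather than LightGBM's bin/histogram structures. It is not the code from the linked branch, just the same idea:

```python
# Illustrative sketch of the per-node split search described above: one random
# threshold per feature, only one gradient/hessian accumulation per feature
# instead of a full histogram, then pick the best feature.
import numpy as np

def best_random_split(X, grad, hess, rng, lam=1.0):
    """Return the best (feature, threshold, gain) among one random threshold per feature."""
    def score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)
    g_all, h_all = grad.sum(), hess.sum()
    best = (None, None, -np.inf)
    for f in range(X.shape[1]):
        col = X[:, f]
        thr = rng.uniform(col.min(), col.max())        # one random threshold per feature
        left = col <= thr
        g_l, h_l = grad[left].sum(), hess[left].sum()  # only one split to accumulate
        g_r, h_r = g_all - g_l, h_all - h_l
        gain = score(g_l, h_l) + score(g_r, h_r) - score(g_all, h_all)
        if gain > best[2]:
            best = (f, thr, gain)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
grad = rng.normal(size=1000)
hess = np.ones(1000)
print(best_random_split(X, grad, hess, rng))
```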
@btrotta Sorry for the late response. In this procedure, you don't need to call
Also refer to LightGBM/src/boosting/gbdt.cpp, lines 298 to 321 at commit 516bd37.
You may need to get leaf index predictions before refitting. However, leaf index prediction over feature bin values is not implemented; you can refer to
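For reference, both steps (leaf index prediction and refitting the existing tree structures) are reachable from LightGBM's public Python API, which may be the easiest way to see the intended flow. This is only an illustration over raw feature values, not the feature-bin-value path discussed above:

```python
# Sketch of the leaf-index / refit workflow via the public Python API.
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)

booster = lgb.train({'objective': 'binary', 'verbose': -1},
                    lgb.Dataset(X, y), num_boost_round=20)

leaf_index = booster.predict(X, pred_leaf=True)  # leaf index of every row in every tree
refit_booster = booster.refit(X, y)              # keep tree structures, refit the leaf outputs
print(leaf_index.shape)
```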
If I understand correctly, you're suggesting that for each new node we just randomly choose 1 feature and 1 threshold, and then split. I think this is not the usual definition of extremely randomized trees; for example, see the sklearn docs (https://scikit-learn.org/stable/modules/ensemble.html#forest).
So it chooses 1 random threshold for each feature, but it evaluates many features and then splits on the best one. I think if we only choose 1 random feature, the algorithm may not fit the data well. Indeed, in the original paper on extremely randomized trees (https://link.springer.com/content/pdf/10.1007/s10994-006-6226-1.pdf), they experiment with varying the size
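The paper's K (the number of candidate features tried per node) corresponds to max_features in scikit-learn's extra trees. A small illustration of varying it (my example, not taken from the paper or this PR):

```python
# Illustration of varying the size of the random feature subset in
# scikit-learn's extra trees: K = 1 is a fully random feature choice,
# K = n_features evaluates a random threshold on every feature.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
for k in [1, 5, 20]:
    et = ExtraTreesClassifier(n_estimators=100, max_features=k, random_state=0)
    print('max_features =', k, 'cv accuracy =', cross_val_score(et, X, y, cv=3).mean())
```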
@btrotta I see.
Updated benchmark script:
Output:
min_constraint, max_constraint, meta_->monotone_type);
// gain with split is worse than without split
if (current_gain <= min_gain_shift) continue;
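// with extra_trees enabled, only the one randomly sampled threshold for this feature is evaluated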
if (!meta_->config->extra_trees || t - 1 + offset == rand_threshold) {
Maybe change FindBestThresholdSequence to a template method: template<bool is_rand> FindBestThresholdSequence?
With the template, the function will be expanded at compile time, so it doesn't affect run-time performance.
@StrikerRUS is this ready to merge?
Yes, I think so. But we should wait for CI fixes.
@btrotta
@StrikerRUS @guolinke Thanks for your reviews!
Option to use extremely randomized trees as base learner, as requested in #2583.