
documentation: Clarify n_subfeatures in build_tree? #224

Closed
mlesnoff opened this issue Mar 22, 2023 · 4 comments

Comments

@mlesnoff

I have a question about a point that I could not find in the documentation of DecisionTree.jl.

For function build_tree, it is indicated for argument n_subfeatures:

n_subfeatures: number of features to select at random (default: 0, keep all)

Is the random feature selection done at each split of the tree, or only once before building the tree?

For the function build_forest, it is indicated that the selection is done at each split (as in a usual random forest):

n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))

Therefore I presume that it is the same for build_tree, but I am not sure. Could you confirm (and possibly add it to the doc)?

Another question: what is the method used to split, the "exact" method or an approximate histogram-based method? I did not find any indication in the doc.
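
For context, here is a minimal sketch of how I am calling both functions, using the positional-argument form from the package README as I understand it (the synthetic data, seed, and parameter values below are just placeholders):

```julia
using DecisionTree, Random

# Placeholder classification data.
features = rand(100, 4)
labels   = rand(["a", "b"], 100)

# Positional arguments as in the README; n_subfeatures = 2 here.
n_subfeatures = 2; max_depth = -1; min_samples_leaf = 1
min_samples_split = 2; min_purity_increase = 0.0

tree = build_tree(labels, features,
                  n_subfeatures, max_depth,
                  min_samples_leaf, min_samples_split,
                  min_purity_increase;
                  rng = 42)

# build_forest documents the per-split selection explicitly.
forest = build_forest(labels, features, n_subfeatures, 10)  # 10 trees
```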

@ablaom
Member

ablaom commented Mar 22, 2023

Is the random feature selection done at each split of the tree, or only once before building the tree?

Yes, at each split.

What is the method used to split: the "exact" method or an approximate histogram-based method? I did not find any indication in the doc.

The implementation is CART, which means exact.

You may want to keep in mind that every split assumes the feature is ordered and uses that ordering in the splitting algorithm. This means certain splits are never considered if the feature is actually unordered but is accepted by the algorithm because it is encoded with an ordered type, such as Int.

Another common gotcha is that setting n_subfeatures to the maximum number of features does not strictly recreate the classic CART algorithm because the features are still shuffled, leading to RNG-dependent resolution of feature ties (draws).
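
A rough sketch of that last point. I'm assuming here that the default n_subfeatures = 0 ("keep all") skips the subsampling step entirely, as the docstring suggests; the data and seeds are placeholders:

```julia
using DecisionTree, Random

rng = MersenneTwister(0)
features = rand(rng, 200, 6)
labels   = rand(rng, ["x", "y"], 200)

p = size(features, 2)   # total number of features

# n_subfeatures = p: every feature is a candidate at every split,
# but the candidates are still shuffled, so tied splits can be
# resolved differently under different seeds.
t1 = build_tree(labels, features, p; rng = 1)
t2 = build_tree(labels, features, p; rng = 2)   # may differ from t1 on ties

# Default n_subfeatures = 0 ("keep all"): no subsampling step,
# closest to classic deterministic CART (assumption).
t0 = build_tree(labels, features, 0)
```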

@ablaom
Member

ablaom commented Mar 22, 2023

Closed as tracked.

@ablaom ablaom closed this as completed Mar 22, 2023
@mlesnoff
Author

Thanks for the info @ablaom, and congrats on this package, which has become very fast.

@ablaom
Member

ablaom commented Mar 22, 2023

You're welcome.

congrats on this package, which has become very fast

I'm just a maintainer. The main work was carried out by @bensadeghi and others.
