[python] Output model to a pandas DataFrame #2592

Merged
merged 7 commits into from
Jan 10, 2020

pford221
Contributor

Added trees_to_dataframe method to Booster class in python API. Based on issue #2578.

@msftclas

msftclas commented Nov 26, 2019

CLA assistant check
All CLA requirements met.

@StrikerRUS
Collaborator

You can simply add new commits in the df_trees branch and they will appear here.
It also seems that your new unit test itself fails. Please fix it along with the PEP 8 errors (you can find them here: https://travis-ci.org/microsoft/LightGBM/jobs/616989890#L502). Thanks!

@pford221
Contributor Author

pford221 commented Nov 28, 2019

Hi,

I have formatted the code according to the PEP 8 standard, and the unit test I added is working. Thanks!

@StrikerRUS
Collaborator

The tests fail due to bad indentation in the file. One more reminder: you are NOT restricted to an 80-character line length. I guess that was the reason you decided to change the indentation.
https://github.com/microsoft/LightGBM/tree/master/python-package#development-guide

@pford221
Contributor Author

Hi. I appreciate your patience! I'm not too familiar with PEP 8, so I relied on a PEP 8 auto-formatting Python library. Maybe it doesn't exactly conform to LightGBM's standards, or maybe something went wrong when I copied and pasted from the PEP 8 formatted .py file into basic.py.

I went through all the Travis CI logs and corrected the formatting issues that I found. I then committed and squashed all the commits into one for easier review.

Thanks!

@pford221 pford221 force-pushed the df_trees branch 2 times, most recently from 5d86213 to 6e89d19 Compare November 29, 2019 18:53
@StrikerRUS
Collaborator

No problem! Thank you very much for this PR! I'll review it shortly.

Collaborator

@StrikerRUS StrikerRUS left a comment

@pford221 Sorry for the delay. Please address some comments below.

python-package/lightgbm/basic.py
feature_names=feature_names))

if PANDAS_INSTALLED:
    return DataFrame(m_list, columns=m_list[0].keys())
Collaborator

m_list[0].keys() is not expected to be sorted and it can cause problems in different Python versions/implementations:

Changed in version 0.25.0: If data is a list of dicts, column order follows insertion-order for Python 3.6 and later.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

I guess you can simply omit columns argument and pandas will name columns properly by default.

Contributor Author

I don't understand this suggestion. Not passing in the columns from m_list[0].keys() will cause the columns to be sorted alphabetically in Pandas versions prior to 0.25.0. Therefore, to ensure consistency in all versions of Pandas, we explicitly pass the order of the columns.

Collaborator

Ah, I see! But now there is no particular order at all, as m_list[0].keys() is a set-like collection. Maybe it should be sorted(m_list[0].keys()) to ensure that columns are always sorted alphabetically for any Python implementation and pandas version?

Contributor Author

@pford221 pford221 Dec 13, 2019

My intention is not to have the columns sorted alphabetically, even though that would be consistent across versions of Python, as you suggest. I think the current ordering of the columns matches how one might think of a tree's hierarchy: tree_index, node_index, ..., leaf count.

A few options:
1.) For users with a pre-3.6 version of Python, the column ordering will be non-deterministic. I'll defer to you as a library maintainer to judge whether that is too unappealing.

2.) Instead of using a dict, we use a collections.OrderedDict to preserve the order of the key-value pairs. I'm not sure whether this will have any performance or other downstream effects, but it should make sure the ordering is consistent.

3.) I hard-code the order of the columns with a non-dynamic list. I don't love this option because if the dictionary keys in the model_list ever change, we'll have to remember to update this static list of column names/dictionary keys.

4.) Just sort the column names/dictionary keys alphabetically and forgo the "natural" order I specified above. As I mentioned, I personally think the ordering matters for understanding the structure of the model, so this is the least preferred option from my perspective.

Collaborator

Yes, I love your way of ordering too. But we should be as consistent as possible. So I personally see only option 2 as a good workaround. Can we try it? I don't think there will be any significant overhead compared to a plain dict. If pandas is compatible with OrderedDict, then everything should be OK, I hope.

Contributor Author

@pford221 pford221 Dec 15, 2019

Great. I went with option 2. We now import OrderedDict at the top of basic.py. I hope that's OK.
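
As an illustration of option 2, here is a minimal sketch (not the code from this PR) of why building each row as a collections.OrderedDict gives a stable column order. The helper make_row and the example values are hypothetical; only the column names tree_index, node_index and split_gain come from this thread.

```python
from collections import OrderedDict


def make_row(tree_index, node_index, split_gain):
    # Hypothetical row builder: keys are inserted in the order we want
    # the columns to appear. OrderedDict keeps that order on every
    # Python version, whereas plain dicts only guarantee it from 3.7.
    row = OrderedDict()
    row['tree_index'] = tree_index
    row['node_index'] = node_index
    row['split_gain'] = split_gain
    return row


rows = [make_row(0, '0-S0', 1.5), make_row(0, '0-L0', None)]

# The first row's keys can now serve as an explicit, deterministic
# column list, e.g. DataFrame(rows, columns=columns) when pandas is available.
columns = list(rows[0].keys())
print(columns)
```

Deriving the column list from the rows themselves avoids the drawback of option 3: there is no separate static list to keep in sync when keys change.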

@pford221
Contributor Author

pford221 commented Dec 7, 2019

Hello. I can't work out why the Travis CI build failed. It appears to be related to a Too Many Requests network error. Please let me know if something in this PR is causing the issue.

Thanks!

@StrikerRUS
Collaborator

Thank you for your updates! I'll review them soon.

Speaking of Travis, I'm sorry the failed test confused you. That test checks all URLs in our docs and is quite unstable for obvious reasons. I've simply re-run it, and now everything seems to be OK.

python-package/lightgbm/basic.py
    node['weight'] = tree['internal_weight']
    node['count'] = tree['internal_count']
else:
    node['leaf_value'] = tree['leaf_value']
Collaborator

Maybe merge leaf_value and internal_value into a single value column, as was done for count and weight?

Contributor Author

Hi. I can do this, but I have a couple of concerns. The first is that I don't know what internal_value represents. I originally excluded it from the output of this method because I didn't know whether it would be useful. Second, users can intuitively interpret the leaf_value column as the terminal node predictions for each tree, so they might be confused to find a single value column for every node and have trouble interpreting its meaning.

Collaborator

/*! \brief Output of non-leaf nodes */
std::vector<double> internal_value_;

// save current leaf value to internal node before change
internal_weight_[new_node_idx] = leaf_weight_[leaf];
internal_value_[new_node_idx] = leaf_value_[leaf];

I suppose that it is the leaf_value from previous boosting stages. So I guess it should be merged, because it represents the same thing, and keeping the two separate is even more confusing given the significant number of NaN values in the corresponding columns. With the S/L encoding in the node_index field, I think users will easily find the actual leaf output values.

Contributor Author

I have combined leaf_value and internal_value to be value. However, it seems like internal_value is on a different scale than leaf_value. Here's a screenshot of the output from the breast cancer dataset for a single tree model with max_depth = 2.

[image: screenshot of the trees_to_dataframe output]
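
To make the merged value column concrete, here is a rough sketch of flattening one nested tree into one row per node. It is not the implementation from this PR. The toy tree is hand-written with invented numbers, the function flatten is hypothetical, and the node_index format (tree index, then S for a split or L for a leaf, then the node's own index) is an assumption about the S/L encoding mentioned above.

```python
from collections import OrderedDict

# Hand-written toy tree mimicking the nested layout discussed above;
# the numbers are made up and are not output from a real model.
toy_tree = {
    'split_index': 0,
    'split_feature': 2,
    'split_gain': 10.0,
    'internal_value': 0.0,
    'internal_count': 100,
    'left_child': {'leaf_index': 0, 'leaf_value': -0.2, 'leaf_count': 60},
    'right_child': {'leaf_index': 1, 'leaf_value': 0.3, 'leaf_count': 40},
}


def flatten(node, tree_index, parent=None, rows=None):
    """Walk the tree depth-first and emit one flat row per node."""
    if rows is None:
        rows = []
    is_split = 'split_index' in node
    row = OrderedDict()
    row['tree_index'] = tree_index
    if is_split:
        # 'S' marks a split (internal) node.
        row['node_index'] = '{}-S{}'.format(tree_index, node['split_index'])
        row['value'] = node['internal_value']
        row['count'] = node['internal_count']
    else:
        # 'L' marks a leaf.
        row['node_index'] = '{}-L{}'.format(tree_index, node['leaf_index'])
        row['value'] = node['leaf_value']
        row['count'] = node['leaf_count']
    row['parent_index'] = parent
    row['split_gain'] = node.get('split_gain')  # None for leaves
    rows.append(row)
    if is_split:
        flatten(node['left_child'], tree_index, row['node_index'], rows)
        flatten(node['right_child'], tree_index, row['node_index'], rows)
    return rows


rows = flatten(toy_tree, tree_index=0)
for row in rows:
    print(dict(row))
```

With a single value column, leaf outputs can still be picked out by filtering on node_index entries that contain "-L".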

if PANDAS_INSTALLED:
    return DataFrame(model_list, columns=model_list[0].keys())
else:
    return model_list
Collaborator

Just a question: in which cases is this kind of output useful? As a source for some other dataframe-like object, or something else? In other words, in what way is it better than the raw output from dump_model()? I'm asking because it contradicts the method name a little.

Contributor Author

This is a very convenient way to look at things like the maximum and minimum split gains, which is useful for selecting hyperparameters, or ranges of hyperparameters, such as min_gain_to_split. You would have to write a lot of custom code to do that from the dump_model() output, but with pandas it's very easy. It's also a convenient way to see how heavily trees are being pruned, how imbalanced they are, etc. Finally, it's useful for figuring out which variables tend to have "interaction" effects, which can tell you something about your data. All of those things are much more difficult when the data is in a nested key:value structure like the one from dump_model().
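
A small sketch of the kind of analysis described here, assuming pandas is available. The DataFrame below is a hand-built stand-in for the method's output, not real model output: the column names follow those mentioned in this thread, while the feature names and numbers are invented.

```python
import pandas as pd

# Hand-built stand-in for the method's output; values are invented.
df = pd.DataFrame({
    'tree_index': [0, 0, 0, 1, 1, 1, 1, 1],
    'node_index': ['0-S0', '0-L0', '0-L1',
                   '1-S0', '1-S1', '1-L0', '1-L1', '1-L2'],
    'split_feature': ['f2', None, None, 'f0', 'f2', None, None, None],
    'split_gain': [10.0, None, None, 4.0, 0.5, None, None, None],
})

# Rows with a split gain are internal (split) nodes.
splits = df.dropna(subset=['split_gain'])

# Smallest and largest gain actually realised by any split.
print(splits['split_gain'].min(), splits['split_gain'].max())

# Number of leaves per tree, a rough view of how far each tree grew.
leaves_per_tree = df[df['split_gain'].isna()].groupby('tree_index').size()
print(leaves_per_tree.to_dict())
```

The same questions asked of nested dump_model() output would need a recursive walk over every tree before any aggregation could start.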

Collaborator

Thank you very much for providing the rationale for this PR, but I was referring to the case when pandas is not installed: return model_list. Shouldn't we simply raise an error when pandas is not detected?

... but with pandas it's very easy.

Contributor Author

@pford221 pford221 Dec 15, 2019

Ah, I see. I agree, I don't think a list format would be useful at all. In the latest commit, it's now raising a LightGBM exception if pandas is not installed.
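
The behaviour agreed on here can be sketched as follows. This is an illustration only: MissingDependencyError and rows_to_dataframe are hypothetical names standing in for the library's own exception type and method, and the error message is invented.

```python
import importlib.util

# True when pandas can be imported in the current environment.
PANDAS_INSTALLED = importlib.util.find_spec('pandas') is not None


class MissingDependencyError(Exception):
    """Hypothetical stand-in for the library's own exception type."""


def rows_to_dataframe(rows, pandas_installed=PANDAS_INSTALLED):
    # Fail loudly instead of silently returning a different type.
    if not pandas_installed:
        raise MissingDependencyError(
            'This method cannot be run without pandas installed')
    import pandas as pd
    return pd.DataFrame(rows)


# Simulate an environment without pandas to show the failure mode.
try:
    rows_to_dataframe([{'tree_index': 0}], pandas_installed=False)
except MissingDependencyError as err:
    print('raised:', err)
```

Raising keeps the return type predictable: callers either get a DataFrame or an explicit error, never a plain list under a method name that promises a DataFrame.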

@StrikerRUS StrikerRUS changed the title Output model to a pandas DataFrame [python] Output model to a pandas DataFrame Dec 13, 2019
@pford221 pford221 force-pushed the df_trees branch 2 times, most recently from 3ca484a to e0daa16 Compare December 15, 2019 23:54
Collaborator

@StrikerRUS StrikerRUS left a comment

Thank you very much for the quick fixes! Hopefully this is the last round of review:

tests/python_package_test/test_basic.py
mod_split = bst.feature_importance('split')
mod_gains = bst.feature_importance('gain')
np.testing.assert_equal(tree_split, mod_split)
np.testing.assert_allclose(tree_gains, mod_gains)
Collaborator

Can we have some more tests? For instance, we can check that there are exactly 10 trees in the DataFrame.

Contributor Author

I added two more tests. One to make sure the node count in the top-level node of each of the trees is the same length as the data and the other to ensure we have 10 trees (which makes me a bit worried as I've seen whole trees get pruned but with min_gain_to_split at 0 in this test, we should be fine).
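
The checks discussed in this review can be sketched on toy data as below. The rows are invented and use two trees rather than ten. Identifying a root node by parent_index being None is an assumption made for this example. In the real test, the per-feature totals would be compared against bst.feature_importance('split') and bst.feature_importance('gain').

```python
from collections import Counter, defaultdict

n_samples = 100  # assumed size of the toy training set

# Invented rows in the flat one-row-per-node layout; leaves have no
# split_feature or split_gain.
rows = [
    {'tree_index': 0, 'parent_index': None, 'split_feature': 'f2', 'split_gain': 10.0, 'count': 100},
    {'tree_index': 0, 'parent_index': '0-S0', 'split_feature': None, 'split_gain': None, 'count': 60},
    {'tree_index': 0, 'parent_index': '0-S0', 'split_feature': None, 'split_gain': None, 'count': 40},
    {'tree_index': 1, 'parent_index': None, 'split_feature': 'f0', 'split_gain': 4.0, 'count': 100},
    {'tree_index': 1, 'parent_index': '1-S0', 'split_feature': 'f2', 'split_gain': 0.5, 'count': 70},
    {'tree_index': 1, 'parent_index': '1-S0', 'split_feature': None, 'split_gain': None, 'count': 30},
    {'tree_index': 1, 'parent_index': '1-S1', 'split_feature': None, 'split_gain': None, 'count': 50},
    {'tree_index': 1, 'parent_index': '1-S1', 'split_feature': None, 'split_gain': None, 'count': 20},
]

# Check 1: the expected number of distinct trees is present.
assert len({r['tree_index'] for r in rows}) == 2

# Check 2: every root node covers the whole training set.
assert all(r['count'] == n_samples for r in rows if r['parent_index'] is None)

# Check 3: per-feature split counts and summed gains, which could then
# be compared against the model's own importance values.
split_counts = Counter(
    r['split_feature'] for r in rows if r['split_feature'] is not None)
gain_sums = defaultdict(float)
for r in rows:
    if r['split_feature'] is not None:
        gain_sums[r['split_feature']] += r['split_gain']

print(dict(split_counts), dict(gain_sums))
```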

Collaborator

Very awesome! Many thanks!

Collaborator

@StrikerRUS StrikerRUS left a comment

Awesome job! Thanks a lot for implementing this!

I think we should wait for a second review, as some points seem debatable.

@StrikerRUS
Collaborator

For the auto-closing mechanism: fixed #2578. I also think this can be treated as a fix for #2320, because with this PR it is possible to obtain the actual values as follows: actual_min_gain_to_split = model.trees_to_dataframe()['split_gain'].min()

@StrikerRUS
Collaborator

@jameslamb Were your comments fully addressed?

@jameslamb
Collaborator

@jameslamb Were your comments fully addressed?

ah, totally missed the followup comment and commit in my GitHub emails! Yes they were, thank you for the fix @pford221

@jameslamb jameslamb self-requested a review January 8, 2020 15:28
Collaborator

@StrikerRUS StrikerRUS left a comment

@pford221 Thanks a lot for quickly fixing the edge case! I'd suggest rewriting two pieces of code below for better efficiency:

Collaborator

@StrikerRUS StrikerRUS left a comment

@pford221 Thank you very much for your contribution and patience! I think we are good to merge this now.

@StrikerRUS StrikerRUS merged commit 301402c into microsoft:master Jan 10, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020

5 participants