[python] Output model to a pandas DataFrame #2592
Conversation
Hi, I have formatted the code according to the PEP 8 standard, and the unit test I added is passing. Thanks!
Tests fail due to bad indents in the file. One more reminder: you are NOT restricted to an 80-character line length. I guess that was the reason you decided to change the indents.
Hi. I appreciate your patience! I'm not too familiar with PEP 8, so I relied on a PEP 8 auto-formatting Python library, and maybe it doesn't exactly conform to LightGBM's standards, or maybe something happened in the copy-and-paste from the PEP 8-formatted .py file to basic.py. I went through all the Travis CI logs and corrected the formatting issues I found. I then squashed all the commits into one for easier review. Thanks!
(force-pushed from 5d86213 to 6e89d19)
No problem! Thank you very much for this PR! I'll review it shortly.
@pford221 Sorry for the delay. Please address some comments below.
python-package/lightgbm/basic.py (outdated)

                                         feature_names=feature_names))

            if PANDAS_INSTALLED:
                return DataFrame(m_list, columns=m_list[0].keys())
`m_list[0].keys()` is not expected to be sorted, and that can cause problems across Python versions/implementations:

> Changed in version 0.25.0: If data is a list of dicts, column order follows insertion-order for Python 3.6 and later.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

I guess you can simply omit the `columns` argument and pandas will name the columns properly by default.
I don't understand this suggestion. Not passing in the columns from `m_list[0].keys()` will cause the columns to be sorted alphabetically in pandas versions prior to 0.25.0. Therefore, to ensure consistency across all versions of pandas, we explicitly pass the column order.
Ah, I see! But now there is no particular order at all, as `m_list[0].keys()` is a set-like collection. Maybe it should be `sorted(m_list[0].keys())` to ensure that the columns are always sorted alphabetically for any Python implementation and pandas version?
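For illustration, here is a minimal stdlib-only sketch of the ordering concern (the keys are made up): plain `dict` key order is only guaranteed to be insertion order on Python 3.7+ (and CPython 3.6), whereas `sorted()` gives the same alphabetical order everywhere.

```python
# A node record as this method might build it (illustrative keys only).
row = {'tree_index': 0, 'node_index': '0-S0', 'value': 1.5}

# Insertion order: guaranteed only on Python 3.7+ / CPython 3.6.
print(list(row.keys()))

# Alphabetical order: stable on every Python implementation and version.
print(sorted(row.keys()))  # ['node_index', 'tree_index', 'value']
```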
My intention is not to have the columns sorted alphabetically, even though that would be consistent across versions of Python using `sorted()` as you suggest. I think the current ordering of the columns is intuitive to how one might think of a tree's hierarchy: `tree_index`, `node_index`, ..., leaf count.
A couple of options:

1. For users on a pre-3.6 version of Python, the column ordering will be non-deterministic. I'll defer to you as a library maintainer to judge whether that is too unappealing.
2. Instead of a `dict`, we use a `collections.OrderedDict` to preserve the order of the key/value pairs. I'm not sure whether this has any performance or other downstream effects, but it should keep the ordering consistent.
3. I hard-code the order of the columns in a static list. I don't love this option because if the dictionary keys in `model_list` ever change, we'll have to remember to update this static list of column names/dictionary keys.
4. Just sort the column names/dictionary keys alphabetically and forgo the "natural" order I specified above. As I mentioned, I personally think the ordering helps users understand the structure of the model, so this is my least preferred option.
Yes, I love your way of ordering too, but we should be as consistent as possible, so I personally see only option 2 as a good workaround. Can we try it? I don't think there will be any significant overhead compared to a plain `dict`. If pandas is compatible with `OrderedDict`, then everything should be OK, I hope.
Great. I went with option 2. We import `OrderedDict` at the top of basic.py. I hope that's OK.
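A minimal sketch of option 2 (the record contents are illustrative, not the exact PR code): each node record is built as an `OrderedDict` so key order survives on Python < 3.7, and pandas then takes the column order from the records' insertion order.

```python
from collections import OrderedDict

import pandas as pd

# Build each node record with an explicit, stable key order.
m_list = [
    OrderedDict([('tree_index', 0), ('node_index', '0-S0'), ('count', 100)]),
    OrderedDict([('tree_index', 0), ('node_index', '0-L0'), ('count', 60)]),
]

# pandas (>= 0.25 for plain dicts, any recent version for OrderedDict)
# preserves this insertion order as the column order.
df = pd.DataFrame(m_list)
print(list(df.columns))  # ['tree_index', 'node_index', 'count']
```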
Post-review changes
Hello. I can't debug why the Travis CI build failed; it appears to be related to a Too Many Requests network error. Please let me know if something in this PR is causing the issue. Thanks!
Thank you for your updates! I'll review them soon. Speaking of Travis, I'm sorry for confusing you with the failed test. That test checks all URLs in our docs and is quite unstable for obvious reasons. I've simply re-run it, and now everything seems to be OK.
python-package/lightgbm/basic.py (outdated)

                node['weight'] = tree['internal_weight']
                node['count'] = tree['internal_count']
            else:
                node['leaf_value'] = tree['leaf_value']
Maybe merge `leaf_value` and `internal_value` into a single `value`, as was done for `count` and `weight`?
Hi. I can do this, but I have a couple of concerns. The first is that I don't know what `internal_value` represents; I originally excluded it from the output of this method because I didn't know whether it would be useful. Second, users will intuitively interpret the `leaf_value` column as the terminal-node predictions for each tree, so they might be confused to find a single `value` column for every node and have trouble interpreting its meaning.
LightGBM/include/LightGBM/tree.h (lines 395-396 at d7f8aa5):

    /*! \brief Output of non-leaf nodes */
    std::vector<double> internal_value_;

LightGBM/include/LightGBM/tree.h (lines 432-434 at d7f8aa5):

    // save current leaf value to internal node before change
    internal_weight_[new_node_idx] = leaf_weight_[leaf];
    internal_value_[new_node_idx] = leaf_value_[leaf];
I suppose that it is the `leaf_value` from previous boosting stages. So I guess it should be merged, because it represents the same thing and right now it confuses even more, with a significant number of NaN values in the corresponding columns. With the `S`/`L` encoding in the `node_index` field, I think users will easily find the actual leaf output values.
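A hedged sketch of what the merge could look like (the field names follow `dump_model()`'s node dicts, but the helper itself is hypothetical, not the PR's actual code):

```python
from collections import OrderedDict

def node_to_row(tree, tree_index, node_index):
    # Hypothetical helper: flatten one dump_model() node dict into a row,
    # folding leaf_value/internal_value into a single 'value' column,
    # mirroring what is already done for 'count' and 'weight'.
    node = OrderedDict()
    node['tree_index'] = tree_index
    node['node_index'] = node_index
    if 'leaf_value' in tree:  # leaf node
        node['value'] = tree['leaf_value']
        node['count'] = tree.get('leaf_count')
    else:                     # internal (split) node
        node['value'] = tree['internal_value']
        node['count'] = tree['internal_count']
    return node
```

With this shape, no column is NaN for half the rows, and the `S`/`L` markers in `node_index` still distinguish splits from leaves.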
python-package/lightgbm/basic.py (outdated)

        if PANDAS_INSTALLED:
            return DataFrame(model_list, columns=model_list[0].keys())
        else:
            return model_list
Just a question: in what cases is this kind of output useful? As a source for some other DataFrame-like object, or for what? In other words, in what way is it better than the raw output from `dump_model()`? I'm asking because it contradicts the method name a little bit.
This is a very convenient way to look at things like the maximum and minimum split gains, which is useful for selecting hyperparameters or ranges of hyperparameters such as `min_gain_to_split`. You would have to write a lot of custom code to do that from `dump_model()` output, but with pandas it's very easy. It's also a convenient way to see how much the trees are being pruned, how imbalanced they are, etc. Finally, it's useful for figuring out which variables tend to have "interaction" effects, which can tell you something about your data. All of those things are much more difficult when the data is in a nested key:value structure like the one from `dump_model()`.
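As a hedged illustration of that workflow (`tree_df` below is a made-up stand-in for the per-node DataFrame, with NaN `split_gain` on leaf nodes; the values are not real model output):

```python
import pandas as pd

# Stand-in for the per-node DataFrame this PR produces.
tree_df = pd.DataFrame({
    'tree_index': [0, 0, 0, 1, 1],
    'split_gain': [12.5, 3.1, None, 8.4, None],  # None on leaf nodes
})

# Inspecting the range of split gains, e.g. to choose min_gain_to_split,
# is a one-liner once the model is tabular.
splits = tree_df.dropna(subset=['split_gain'])
print(splits['split_gain'].min(), splits['split_gain'].max())  # 3.1 12.5
```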
Thank you very much for providing the rationale for this PR, but I was referring to the case when pandas is not installed: `return model_list`. Shouldn't we simply raise an error when pandas is not detected?

> ... but with pandas it's very easy.
Ah, I see. I agree; I don't think a list format would be useful at all. In the latest commit, it now raises a LightGBM exception if pandas is not installed.
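A minimal sketch of that guard (the message string and surrounding function body are assumptions; `LGBMException` is LightGBM's real exception class, stubbed here to keep the snippet self-contained):

```python
PANDAS_INSTALLED = False  # LightGBM sets this flag via a try/except import


class LGBMException(Exception):
    """Stand-in for lightgbm.basic.LGBMException."""


def trees_to_dataframe(model_list):
    # Raise instead of silently returning a plain list of dicts.
    if not PANDAS_INSTALLED:
        raise LGBMException('Cannot build a DataFrame: please install pandas first.')
    from pandas import DataFrame
    return DataFrame(model_list)
```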
(force-pushed from 3ca484a to e0daa16)
Thank you very much for the quick fixes! I hope this is the last round of review:
    mod_split = bst.feature_importance('split')
    mod_gains = bst.feature_importance('gain')
    np.testing.assert_equal(tree_split, mod_split)
    np.testing.assert_allclose(tree_gains, mod_gains)
Can we have some more tests? For instance, we can check that there are exactly 10 trees in the DataFrame.
I added two more tests: one to make sure the node count in the top-level node of each tree equals the length of the data, and another to ensure we have exactly 10 trees (which makes me a bit worried, as I've seen whole trees get pruned, but with `min_gain_to_split` at 0 in this test we should be fine).
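A hedged sketch of those two checks (names are illustrative, not the actual test code; `tree_df` stands in for the DataFrame produced from a booster trained on `n_rows` rows for `n_trees` rounds):

```python
import pandas as pd

def check_tree_dataframe(tree_df, n_rows, n_trees):
    # Exactly n_trees distinct trees ended up in the frame.
    assert tree_df['tree_index'].nunique() == n_trees
    # The root (first) node of every tree counted every training row.
    roots = tree_df.groupby('tree_index').first()
    assert (roots['count'] == n_rows).all()

# Tiny made-up frame: two trees, root node first within each tree.
tree_df = pd.DataFrame({
    'tree_index': [0, 0, 1, 1],
    'count': [100, 60, 100, 40],
})
check_tree_dataframe(tree_df, n_rows=100, n_trees=2)
```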
Very awesome! Many thanks!
Awesome job! Thanks a lot for implementing this!
I think we should wait for a second review, as some points seem debatable.
@jameslamb Were your comments fully addressed?
Ah, I totally missed the follow-up comment and commit in my GitHub emails! Yes, they were. Thank you for the fix @pford221
@pford221 Thanks a lot for quickly fixing the edge case! I'd suggest rewriting two code pieces below for better efficiency:
@pford221 Thank you very much for your contribution and patience! I think we are good to merge this now.
Added `trees_to_dataframe` method to the Booster class in the Python API. Based on issue #2578.