[python] Output model to a pandas DataFrame #2592
Conversation
Hi, I have formatted the code according to the PEP 8 standard, and the unit test I added is passing. Thanks!
Tests fail due to bad indents in the file. One more reminder: you are NOT restricted to an 80-character line length. I guess that was the reason you decided to change the indents.
Hi. I appreciate your patience! I'm not too familiar with PEP 8, so I relied on a PEP 8 auto-formatting Python library, and maybe it doesn't exactly conform to LightGBM's standards, or maybe something happened in the copy-and-paste from the PEP 8-formatted .py file to basic.py. I went through all the Travis CI logs and corrected the formatting issues I found. I then squashed all the commits into one for easier review. Thanks!
(force-pushed from 5d86213 to 6e89d19)
No problem! Thank you very much for this PR! I'll review it shortly.
@pford221 Sorry for the delay. Please address some comments below.
python-package/lightgbm/basic.py (outdated)

                                         feature_names=feature_names))

            if PANDAS_INSTALLED:
                return DataFrame(m_list, columns=m_list[0].keys())
`m_list[0].keys()` is not expected to be sorted, and that can cause problems across Python versions/implementations:

> Changed in version 0.25.0: If data is a list of dicts, column order follows insertion-order for Python 3.6 and later.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

I guess you can simply omit the `columns` argument and pandas will name the columns properly by default.
I don't understand this suggestion. Not passing in the columns from `m_list[0].keys()` will cause the columns to be sorted alphabetically in pandas versions prior to 0.25.0. Therefore, to ensure consistency across all versions of pandas, we explicitly pass the column order.
Ah, I see! But now there is no particular order at all, as `m_list[0].keys()` is a set-like collection. Maybe it should be `sorted(m_list[0].keys())` to ensure that the columns are always sorted alphabetically for any Python implementation and pandas version?
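For illustration, here is a minimal stdlib-only sketch of the ordering concern (the keys are made up): plain `dict` key order is only guaranteed to be insertion order on Python 3.7+ (and CPython 3.6), whereas `sorted()` gives the same alphabetical order everywhere.

```python
# A node record as this method might build it (illustrative keys only).
row = {'tree_index': 0, 'node_index': '0-S0', 'value': 1.5}

# Insertion order: guaranteed only on Python 3.7+ / CPython 3.6.
print(list(row.keys()))

# Alphabetical order: stable on every Python implementation and version.
print(sorted(row.keys()))  # ['node_index', 'tree_index', 'value']
```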
My intention is not to have the columns sorted alphabetically, even though that would be consistent across versions of Python using `sorted()` as you suggest. I think the current ordering of the columns is intuitive to how one might think of a tree's hierarchy: `tree_index`, `node_index`, ..., leaf count.
A couple of options:

1. For users on a pre-3.6 version of Python, the column ordering will be non-deterministic. I'll defer to you as a library maintainer to judge whether that is too unappealing.
2. Instead of a `dict`, we use a `collections.OrderedDict` to preserve the order of the key/value pairs. I'm not sure whether this has any performance or other downstream effects, but it should keep the ordering consistent.
3. I hard-code the order of the columns in a static list. I don't love this option because if the dictionary keys in `model_list` ever change, we'll have to remember to update this static list of column names/dictionary keys.
4. Just sort the column names/dictionary keys alphabetically and forgo the "natural" order I specified above. As I mentioned, I personally think the ordering helps users understand the structure of the model, so this is my least preferred option.
Yes, I love your way of ordering too, but we should be as consistent as possible, so I personally see only option 2 as a good workaround. Can we try it? I don't think there will be any significant overhead compared to a plain `dict`. If pandas is compatible with `OrderedDict`, then everything should be OK, I hope.
Great. I went with option 2. We import `OrderedDict` at the top of basic.py. I hope that's OK.
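A minimal sketch of option 2 (the record contents are illustrative, not the exact PR code): each node record is built as an `OrderedDict` so key order survives on Python < 3.7, and pandas then takes the column order from the records' insertion order.

```python
from collections import OrderedDict

import pandas as pd

# Build each node record with an explicit, stable key order.
m_list = [
    OrderedDict([('tree_index', 0), ('node_index', '0-S0'), ('count', 100)]),
    OrderedDict([('tree_index', 0), ('node_index', '0-L0'), ('count', 60)]),
]

# pandas (>= 0.25 for plain dicts, any recent version for OrderedDict)
# preserves this insertion order as the column order.
df = pd.DataFrame(m_list)
print(list(df.columns))  # ['tree_index', 'node_index', 'count']
```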
Post-review changes
Hello. I can't debug why the Travis CI build failed; it appears to be related to a Too Many Requests network error. Please let me know if something in this PR is causing the issue. Thanks!
Thank you for your updates! I'll review them soon. Speaking of Travis, I'm sorry for confusing you with the failed test. That test checks all URLs in our docs and is quite unstable for obvious reasons. I've simply re-run it, and now everything seems to be OK.
python-package/lightgbm/basic.py (outdated)

                node['weight'] = tree['internal_weight']
                node['count'] = tree['internal_count']
            else:
                node['leaf_value'] = tree['leaf_value']
Maybe merge `leaf_value` and `internal_value` into a single `value`, as was done for `count` and `weight`?
Hi. I can do this, but I have a couple of concerns. The first is that I don't know what `internal_value` represents; I originally excluded it from the output of this method because I didn't know whether it would be useful. Second, users will intuitively interpret the `leaf_value` column as the terminal-node predictions for each tree, so they might be confused to find a single `value` column for every node and have trouble interpreting its meaning.
LightGBM/include/LightGBM/tree.h (lines 395-396 at d7f8aa5):

    /*! \brief Output of non-leaf nodes */
    std::vector<double> internal_value_;

LightGBM/include/LightGBM/tree.h (lines 432-434 at d7f8aa5):

    // save current leaf value to internal node before change
    internal_weight_[new_node_idx] = leaf_weight_[leaf];
    internal_value_[new_node_idx] = leaf_value_[leaf];
I suppose that it is the `leaf_value` from previous boosting stages. So I guess it should be merged, because it represents the same thing and right now it confuses even more, with a significant number of NaN values in the corresponding columns. With the `S`/`L` encoding in the `node_index` field, I think users will easily find the actual leaf output values.
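A hedged sketch of what the merge could look like (the field names follow `dump_model()`'s node dicts, but the helper itself is hypothetical, not the PR's actual code):

```python
from collections import OrderedDict

def node_to_row(tree, tree_index, node_index):
    # Hypothetical helper: flatten one dump_model() node dict into a row,
    # folding leaf_value/internal_value into a single 'value' column,
    # mirroring what is already done for 'count' and 'weight'.
    node = OrderedDict()
    node['tree_index'] = tree_index
    node['node_index'] = node_index
    if 'leaf_value' in tree:  # leaf node
        node['value'] = tree['leaf_value']
        node['count'] = tree.get('leaf_count')
    else:                     # internal (split) node
        node['value'] = tree['internal_value']
        node['count'] = tree['internal_count']
    return node
```

With this shape, no column is NaN for half the rows, and the `S`/`L` markers in `node_index` still distinguish splits from leaves.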
python-package/lightgbm/basic.py (outdated)

        if PANDAS_INSTALLED:
            return DataFrame(model_list, columns=model_list[0].keys())
        else:
            return model_list
Just a question: in what cases is this kind of output useful? As a source for some other DataFrame-like object, or for what? In other words, in what way is it better than the raw output from `dump_model()`? I'm asking because it contradicts the method name a little bit.
This is a very convenient way to look at things like the maximum and minimum split gains, which is useful for selecting hyperparameters or ranges of hyperparameters such as `min_gain_to_split`. You would have to write a lot of custom code to do that from `dump_model()` output, but with pandas it's very easy. It's also a convenient way to see how much the trees are being pruned, how imbalanced they are, etc. Finally, it's useful for figuring out which variables tend to have "interaction" effects, which can tell you something about your data. All of those things are much more difficult when the data is in a nested key:value structure like the one from `dump_model()`.
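As a hedged illustration of that workflow (`tree_df` below is a made-up stand-in for the per-node DataFrame, with NaN `split_gain` on leaf nodes; the values are not real model output):

```python
import pandas as pd

# Stand-in for the per-node DataFrame this PR produces.
tree_df = pd.DataFrame({
    'tree_index': [0, 0, 0, 1, 1],
    'split_gain': [12.5, 3.1, None, 8.4, None],  # None on leaf nodes
})

# Inspecting the range of split gains, e.g. to choose min_gain_to_split,
# is a one-liner once the model is tabular.
splits = tree_df.dropna(subset=['split_gain'])
print(splits['split_gain'].min(), splits['split_gain'].max())  # 3.1 12.5
```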
Thank you very much for providing the rationale for this PR, but I was referring to the case when pandas is not installed: `return model_list`. Shouldn't we simply raise an error when pandas is not detected?

> ... but with pandas it's very easy.
Ah, I see. I agree; I don't think a list format would be useful at all. In the latest commit, it now raises a LightGBM exception if pandas is not installed.
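A minimal sketch of that guard (the message string and surrounding function body are assumptions; `LGBMException` is LightGBM's real exception class, stubbed here to keep the snippet self-contained):

```python
PANDAS_INSTALLED = False  # LightGBM sets this flag via a try/except import


class LGBMException(Exception):
    """Stand-in for lightgbm.basic.LGBMException."""


def trees_to_dataframe(model_list):
    # Raise instead of silently returning a plain list of dicts.
    if not PANDAS_INSTALLED:
        raise LGBMException('Cannot build a DataFrame: please install pandas first.')
    from pandas import DataFrame
    return DataFrame(model_list)
```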
(force-pushed from 3ca484a to e0daa16)
Thank you very much for the quick fixes! I hope this is the last round of review:
    mod_split = bst.feature_importance('split')
    mod_gains = bst.feature_importance('gain')
    np.testing.assert_equal(tree_split, mod_split)
    np.testing.assert_allclose(tree_gains, mod_gains)
Can we have some more tests? For instance, we can check that there are exactly 10 trees in the DataFrame.
I added two more tests: one to make sure the node count in the top-level node of each tree equals the length of the data, and another to ensure we have exactly 10 trees (which makes me a bit worried, as I've seen whole trees get pruned, but with `min_gain_to_split` at 0 in this test we should be fine).
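A hedged sketch of those two checks (names are illustrative, not the actual test code; `tree_df` stands in for the DataFrame produced from a booster trained on `n_rows` rows for `n_trees` rounds):

```python
import pandas as pd

def check_tree_dataframe(tree_df, n_rows, n_trees):
    # Exactly n_trees distinct trees ended up in the frame.
    assert tree_df['tree_index'].nunique() == n_trees
    # The root (first) node of every tree counted every training row.
    roots = tree_df.groupby('tree_index').first()
    assert (roots['count'] == n_rows).all()

# Tiny made-up frame: two trees, root node first within each tree.
tree_df = pd.DataFrame({
    'tree_index': [0, 0, 1, 1],
    'count': [100, 60, 100, 40],
})
check_tree_dataframe(tree_df, n_rows=100, n_trees=2)
```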
Very awesome! Many thanks!
Awesome job! Thanks a lot for implementing this!
I think we should wait for a second review, as some points seem debatable.
@jameslamb Were your comments fully addressed?
Ah, I totally missed the follow-up comment and commit in my GitHub emails! Yes, they were. Thank you for the fix @pford221
@pford221 Thanks a lot for quickly fixing the edge case! I'd suggest rewriting two code pieces below for better efficiency:
@pford221 Thank you very much for your contribution and patience! I think we are good to merge this now.
Added `trees_to_dataframe` method to the Booster class in the Python API. Based on issue #2578.