
[R-package] Fixed R implementation of upper_bound() and lower_bound() for lgb.Booster #2785

Merged
jameslamb merged 5 commits into microsoft:master on Feb 23, 2020

Conversation

jameslamb
Collaborator

@JoanFM thanks for contributing #2737! Unfortunately, there were some issues on the R side. This PR attempts to address them:

  • removes the trailing _ from method names
  • fixes a bug where the lower_bound() method was storing its return value in the variable upper_bound
  • adds unit tests to ensure the implementation is working (see the sketch after this list)
  • adds the new functions to lightgbm_R.h (without that, LGBM_BoosterGetUpperBoundValue_R and LGBM_BoosterGetLowerBoundValue_R are not callable from R)
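
As a rough illustration of the kind of unit test this adds, here is a minimal sketch assuming the testthat setup already used in R-package/tests/testthat/test_basic.R (not the exact test in this PR):

library(lightgbm)
library(testthat)

test_that("lower_bound() and upper_bound() return sensible values", {
  data(agaricus.train, package = "lightgbm")
  bst <- lightgbm(
    data = agaricus.train$data
    , label = agaricus.train$label
    , num_leaves = 5L
    , nrounds = 10L
    , objective = "binary"
    , metric = "binary_error"
  )
  # both bounds should be numeric scalars, with lower strictly below upper
  expect_true(is.numeric(bst$lower_bound()))
  expect_true(is.numeric(bst$upper_bound()))
  expect_lt(bst$lower_bound(), bst$upper_bound())
})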

I still need some help though @guolinke @JoanFM @StrikerRUS ... I am not getting the answers I'd expect. For example, I would expect the lower bound to be 0 and the upper bound to be 1 for binary classification, but running this:

data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test
nrounds <- 10L
bst <- lightgbm(
  data = train$data
  , label = train$label
  , num_leaves = 5L
  , nrounds = nrounds
  , objective = "binary"
  , metric = "binary_error"
)

I'm confused by the results:

bst$lower_bound()
[1] -1950774382
bst$upper_bound()
[1] 1196082214

I'll look back through #2737, but maybe there is just something fundamental that I've misunderstood?

@jameslamb
Collaborator Author

> I'm confused by the results:
>
> bst$lower_bound()
> [1] -1950774382
> bst$upper_bound()
> [1] 1196082214
>
> I'll look back through #2737, but maybe there is just something fundamental that I've misunderstood?

OK, I realized one thing that was causing the surprising results... the return type needs to be double, not int. Fixed in a0dc638.
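
For intuition, here is a standalone illustration (not the actual LightGBM code path) of why reading a C double back through an int-typed buffer yields garbage values like the ones above:

# Reinterpret the bytes of an IEEE-754 double as two 32-bit integers.
# This is roughly what happens when a value written as a double on the C side
# is read back through an integer-typed buffer on the R side.
bytes <- writeBin(-1.590853, raw())
readBin(bytes, what = "integer", n = 2L)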

Now that same example returns values like this:

bst$lower_bound()
[1] -1.590853
bst$upper_bound()
[1] 1.871015

Is there something fundamental I'm missing? If the target in the training data is bounded between 0 and 1, how is it possible for something tree-based to predict a value lower than 0 or greater than 1?

When I re-predict on all of the training data, I get the results I'd expect:

preds <- bst$predict(data = train$data)

summary(preds)

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2046 0.2048 0.2137 0.4821 0.7934 0.7934

@JoanFM
Contributor

JoanFM commented Feb 20, 2020

Hello @jameslamb, there might be an error, but there is no direct relationship between the minimum you get in your data and the bound values; the bound values may be totally unreachable.

These bounds are computed by looking at the min and max leaf values of every tree and adding them up. They are most likely not reachable values; they serve as a lower or upper bound that the model is guaranteed never to surpass, without having to look at any data.

I hope it helps

@jameslamb
Collaborator Author

> Hello @jameslamb, there might be an error, but there is no direct relationship between the minimum you get in your data and the bound values; the bound values may be totally unreachable.
>
> These bounds are computed by looking at the min and max leaf values of every tree and adding them up. They are most likely not reachable values; they serve as a lower or upper bound that the model is guaranteed never to surpass, without having to look at any data.
>
> I hope it helps

Thanks, Joan. Maybe there's something I've misunderstood, but I still don't get how a tree-based model that has only ever seen data between 0 and 1 could ever predict a value outside of that range, even theoretically, since the leaf values are produced by voting or averaging (depending on the task). Right?

@JoanFM
Contributor

JoanFM commented Feb 20, 2020

> Thanks, Joan. Maybe there's something I've misunderstood, but I still don't get how a tree-based model that has only ever seen data between 0 and 1 could ever predict a value outside of that range, even theoretically, since the leaf values are produced by voting or averaging (depending on the task). Right?

Hey James, it does not need to be an attainable value. It is equivalent to walking each tree, taking the min (or max) of its leaf values, and adding those up across trees. Imagine:

tree 0
leaves: -1, 1

tree 1
leaves: 0.5, -0.5

The lower and upper bounds would be -1.5 and 1.5. Maybe these values are not reachable by any actual input; they are just a conservative bound.
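
As a plain R illustration of that computation (a toy sketch with made-up leaf values, not the LightGBM implementation):

# For each tree, take the min (or max) of its leaf values, then sum across trees.
tree_leaves <- list(
  c(-1.0, 1.0)    # tree 0
  , c(0.5, -0.5)  # tree 1
)
lower_bound <- sum(vapply(tree_leaves, min, numeric(1L)))  # -1.5
upper_bound <- sum(vapply(tree_leaves, max, numeric(1L)))  #  1.5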

@jameslamb
Collaborator Author

> Hey James, it does not need to be an attainable value. It is equivalent to walking each tree, taking the min (or max) of its leaf values, and adding those up across trees. Imagine:
>
> tree 0
> leaves: -1, 1
>
> tree 1
> leaves: 0.5, -0.5
>
> The lower and upper bounds would be -1.5 and 1.5. Maybe these values are not reachable by any actual input; they are just a conservative bound.

Thanks @JoanFM. I was thinking about this the wrong way. Fixed the tests in ec9a941.

I think we are good!
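
If I'm reading Joan's explanation right, the bounds apply to the raw leaf-value sums, not to the sigmoid-transformed probabilities returned for the binary objective. A rough sketch of that check, continuing from the training snippet above and assuming the rawscore argument of the Booster's predict() method (not necessarily what ec9a941 does):

# compare the bounds against raw scores rather than probabilities
raw_preds <- bst$predict(data = train$data, rawscore = TRUE)
all(raw_preds >= bst$lower_bound())  # expected TRUE
all(raw_preds <= bst$upper_bound())  # expected TRUE

# the probabilities reported earlier are the sigmoid of these raw scores
summary(plogis(raw_preds))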

@JoanFM
Contributor

JoanFM commented Feb 20, 2020

> Thanks @JoanFM. I was thinking about this the wrong way. Fixed the tests in ec9a941.
>
> I think we are good!

You are welcome! Thank you, and sorry for the screw-up with the R-package in the previous PR.

@jameslamb
Collaborator Author

> You are welcome! Thank you, and sorry for the screw-up with the R-package in the previous PR.

No problem! My fault for not giving you a review sooner. Thanks again for the contributions!

@StrikerRUS (Collaborator) left a comment


@jameslamb Thank you very much for the prompt fixes!

R-package/tests/testthat/test_basic.R (review comment resolved)
jameslamb merged commit 790c1e3 into microsoft:master on Feb 23, 2020
guolinke added the fix label on Mar 1, 2020
jameslamb deleted the bugfix/bounds branch on March 11, 2020
The lock bot locked this conversation as resolved and limited it to collaborators on May 20, 2020