
[R-package] Fixed R implementation of upper_bound() and lower_bound() for lgb.Booster #2785

Merged
jameslamb merged 5 commits into microsoft:master on Feb 23, 2020

Conversation

jameslamb
Collaborator

@JoanFM thanks for contributing #2737! Unfortunately, there were some issues on the R side. This PR attempts to address them:

  • removes the trailing _ from method names
  • fixes a bug where the lower_bound() method was storing its return value in the variable upper_bound
  • adds unit tests to ensure the implementation is working (see the sketch after this list)
  • adds the new functions to lightgbm_R.h (without that, LGBM_BoosterGetUpperBoundValue_R and LGBM_BoosterGetLowerBoundValue_R are not callable from R)
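
As a rough illustration of the kind of unit test this adds, here is a minimal sketch assuming the testthat setup already used in R-package/tests/testthat/test_basic.R (not the exact test in this PR):

library(lightgbm)
library(testthat)

test_that("lower_bound() and upper_bound() return sensible values", {
  data(agaricus.train, package = "lightgbm")
  bst <- lightgbm(
    data = agaricus.train$data
    , label = agaricus.train$label
    , num_leaves = 5L
    , nrounds = 10L
    , objective = "binary"
    , metric = "binary_error"
  )
  # both bounds should be numeric scalars, with lower strictly below upper
  expect_true(is.numeric(bst$lower_bound()))
  expect_true(is.numeric(bst$upper_bound()))
  expect_lt(bst$lower_bound(), bst$upper_bound())
})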

I still need some help though @guolinke @JoanFM @StrikerRUS ... I am not getting the answers I'd expect. For example, I would expect the lower bound to be 0 and the upper bound to be 1 for binary classification, but running this:

data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test
nrounds <- 10L
bst <- lightgbm(
  data = train$data
  , label = train$label
  , num_leaves = 5L
  , nrounds = nrounds
  , objective = "binary"
  , metric = "binary_error"
)

I'm confused by the results:

bst$lower_bound()
[1] -1950774382
bst$upper_bound()
[1] 1196082214

I'll look back through #2737, but maybe there is just something fundamental that I've misunderstood?

@jameslamb
Collaborator Author

> I'm confused by the results:
>
> bst$lower_bound()
> [1] -1950774382
> bst$upper_bound()
> [1] 1196082214
>
> I'll look back through #2737, but maybe there is just something fundamental that I've misunderstood?

OK, I realized one thing that was causing the surprising results... the return type needs to be double, not int. Fixed in a0dc638.
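
For intuition, here is a standalone illustration (not the actual LightGBM code path) of why reading a C double back through an int-typed buffer yields garbage values like the ones above:

# Reinterpret the bytes of an IEEE-754 double as two 32-bit integers.
# This is roughly what happens when a value written as a double on the C side
# is read back through an integer-typed buffer on the R side.
bytes <- writeBin(-1.590853, raw())
readBin(bytes, what = "integer", n = 2L)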

Now that same example returns values like this:

bst$lower_bound()
[1] -1.590853
bst$upper_bound()
[1] 1.871015

Is there something fundamental I'm missing? If the target in the training data is bounded between 0 and 1, how is it possible for something tree-based to predict a value lower than 0 or greater than 1?

When I re-predict on all of the training data, I get the results I'd expect:

preds <- bst$predict(data = train$data)

summary(preds)

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2046 0.2048 0.2137 0.4821 0.7934 0.7934

@JoanFM
Contributor

JoanFM commented Feb 20, 2020

Hello @jameslamb, there might be an error, but there is no direct relationship between the minimum you get in your data and the bound values; the bound values may be totally unreachable.

These bounds are computed by looking at the min and max leaf values of every tree and adding them up. They are most likely not reachable values; they serve as a lower or upper bound that the model is guaranteed never to surpass, without having to look at any data.

I hope it helps

@jameslamb
Collaborator Author

> Hello @jameslamb, there might be an error, but there is no direct relationship between the minimum you get in your data and the bound values; the bound values may be totally unreachable.
>
> These bounds are computed by looking at the min and max leaf values of every tree and adding them up. They are most likely not reachable values; they serve as a lower or upper bound that the model is guaranteed never to surpass, without having to look at any data.
>
> I hope it helps

Thanks, Joan. Maybe there's something I've misunderstood, but I still don't get how a tree-based model that has only ever seen data between 0 and 1 could ever predict a value outside of that range, even theoretically, since the leaf values are produced by voting or averaging (depending on the task). Right?

@JoanFM
Contributor

JoanFM commented Feb 20, 2020

> Thanks, Joan. Maybe there's something I've misunderstood, but I still don't get how a tree-based model that has only ever seen data between 0 and 1 could ever predict a value outside of that range, even theoretically, since the leaf values are produced by voting or averaging (depending on the task). Right?

Hey James, it does not need to be an attainable value. It is equivalent to walking each tree, taking the min (or max) of its leaf values, and adding those up across trees. Imagine:

tree 0
leaves: -1, 1

tree 1
leaves: 0.5, -0.5

The lower and upper bounds would be -1.5 and 1.5. Maybe these values are not reachable by any actual input; they are just a conservative bound.
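
As a plain R illustration of that computation (a toy sketch with made-up leaf values, not the LightGBM implementation):

# For each tree, take the min (or max) of its leaf values, then sum across trees.
tree_leaves <- list(
  c(-1.0, 1.0)    # tree 0
  , c(0.5, -0.5)  # tree 1
)
lower_bound <- sum(vapply(tree_leaves, min, numeric(1L)))  # -1.5
upper_bound <- sum(vapply(tree_leaves, max, numeric(1L)))  #  1.5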

@jameslamb
Collaborator Author

> Hey James, it does not need to be an attainable value. It is equivalent to walking each tree, taking the min (or max) of its leaf values, and adding those up across trees. Imagine:
>
> tree 0
> leaves: -1, 1
>
> tree 1
> leaves: 0.5, -0.5
>
> The lower and upper bounds would be -1.5 and 1.5. Maybe these values are not reachable by any actual input; they are just a conservative bound.

Thanks @JoanFM. I was thinking about this the wrong way. Fixed the tests in ec9a941.

I think we are good!
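
If I'm reading Joan's explanation right, the bounds apply to the raw leaf-value sums, not to the sigmoid-transformed probabilities returned for the binary objective. A rough sketch of that check, continuing from the training snippet above and assuming the rawscore argument of the Booster's predict() method (not necessarily what ec9a941 does):

# compare the bounds against raw scores rather than probabilities
raw_preds <- bst$predict(data = train$data, rawscore = TRUE)
all(raw_preds >= bst$lower_bound())  # expected TRUE
all(raw_preds <= bst$upper_bound())  # expected TRUE

# the probabilities reported earlier are the sigmoid of these raw scores
summary(plogis(raw_preds))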

@JoanFM
Contributor

JoanFM commented Feb 20, 2020

> Thanks @JoanFM. I was thinking about this the wrong way. Fixed the tests in ec9a941.
>
> I think we are good!

You are welcome! Thank you, and sorry for the screw-up with the R-package in the previous PR.

@jameslamb
Collaborator Author

> You are welcome! Thank you, and sorry for the screw-up with the R-package in the previous PR.

No problem! My fault for not giving you a review sooner. Thanks again for the contributions!

@StrikerRUS (Collaborator) left a comment


@jameslamb Thank you very much for the prompt fixes!

R-package/tests/testthat/test_basic.R (review comment resolved)
jameslamb merged commit 790c1e3 into microsoft:master on Feb 23, 2020
guolinke added the fix label on Mar 1, 2020
jameslamb deleted the bugfix/bounds branch on March 11, 2020
The lock bot locked this conversation as resolved and limited it to collaborators on May 20, 2020