
gblinear non-determinism #4919

Closed
honzasterba opened this issue Oct 8, 2019 · 10 comments · Fixed by #4929
Comments

@honzasterba
Contributor

honzasterba commented Oct 8, 2019

The last three calls in the code below produce very different results on each call. This is not reproducible when using the gbtree booster.

import xgboost as xgb
import pandas as pd

# read in data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
dtrain = xgb.DMatrix(train, label=train["ID"])
dtest = xgb.DMatrix(test, label=test["ID"])
param = {
    'booster': 'gblinear',
    'max_depth': 5,            # tree parameter, ignored by gblinear
    'response_column': 'AGE',  # not an XGBoost parameter, ignored
    'objective': "reg:gamma",
    'verbosity': 3,
    'seed': 1
}
# identical training calls, yet the predictions differ between runs
print(xgb.train(param, dtrain, 1).predict(dtest))
print(xgb.train(param, dtrain, 1).predict(dtest))
print(xgb.train(param, dtrain, 1).predict(dtest))
@honzasterba
Contributor Author

data.zip

@trivialfis
Member

Could you try providing a seed parameter?

@honzasterba
Contributor Author

Setting the seed still produces vastly different results on each run (I am updating the sample code).

@RAMitchell
Member

gblinear uses the shotgun algorithm by default which is strongly nondeterministic. Try setting "updater":"coord_descent".

@honzasterba
Contributor Author

Thanks, that makes it deterministic. If you consider gblinear being nondeterministic even with a seed set to not be a bug, we can close this issue.

@trivialfis
Member

Closing. I forgot that shotgun is a thread-sanitizer killer.

@michalkurka

@trivialfis IMHO this should not be closed without at least some fix. People expect the model to be reproducible if they set a seed. An improvement to the documentation, and perhaps a warning when gblinear is run with a user-defined seed, seem appropriate in this case.

@trivialfis
Member

@michalkurka Good point.

trivialfis reopened this Oct 9, 2019
@RAMitchell
Member

RAMitchell commented Oct 9, 2019

This is documented under xgboost parameters. @michalkurka where would you expect to see this?

Choice of algorithm to fit linear model

shotgun: Parallel coordinate descent algorithm based on shotgun algorithm. Uses ‘hogwild’ parallelism and therefore produces a nondeterministic solution on each run.

coord_descent: Ordinary coordinate descent algorithm. Also multithreaded but still produces a deterministic solution.
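A toy illustration (not XGBoost code) of why lock-free 'hogwild' updates are nondeterministic: threads apply their floating-point updates in whatever order they happen to run, and floating-point addition is not associative, so a different interleaving gives a different result.

```python
# Non-associativity of IEEE 754 addition in miniature: the grouping
# (i.e. the order in which a parallel reduction combines terms)
# changes the rounded result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # one thread interleaving
right = a + (b + c)  # another interleaving of the same updates
print(left == right)  # False: the two orders round differently
```

A deterministic updater such as coord_descent fixes the order of these additions, so every run rounds the same way.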

@michalkurka

@RAMitchell This is something I would ideally want to see in the Python documentation, e.g. here: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier

I would certainly not expect the default to be non-deterministic; IMHO a warning on that page would make sense.

trivialfis added a commit to trivialfis/xgboost that referenced this issue Oct 11, 2019
trivialfis added a commit that referenced this issue Oct 12, 2019
* Remove nthread, seed, silent. Add tree_method, gpu_id, num_parallel_tree. Fix #4909.
* Check data shape. Fix #4896.
* Check element of eval_set is tuple. Fix #4875
* Add doc for random_state with hogwild. Fixes #4919.
@lock lock bot locked as resolved and limited conversation to collaborators Jan 10, 2020