
boosters that should be training identically are not #6711

Open
bitfilms opened this issue Feb 17, 2021 · 6 comments

@bitfilms

Hello -

We've found an issue where models that should end up identical after training (assuming a deterministic random number generator) do not come out identical under all conditions. See the code below, in which we train four boosters with identical parameter sets on the same data for three rounds each.

Two are trained for three rounds each in series, i.e., we train booster 1 for three rounds before moving on to booster 2. Their predictions are identical and always appear to come out that way, as expected: both start from the same seed, even though the colsample_* parameters introduce random sampling.

The other two are trained in an "interleaved" fashion, i.e., we train one booster for one round, then train the second booster for one round, and repeat until both have been trained for three rounds. Those boosters' predictions match neither the first two nor each other, even though they too were trained with the same seed, params, and training data.

It appears that setting any one of the colsample_* parameters below 1.0 triggers this effect. We suspect that a random number generator held at module level, rather than per booster, preserves its state between calls to xgb.train(), which puts the training out of sync when you alternate between boosters.

If you uncomment the two lines in our code featuring the variable clear_the_pipes and re-run, you will see that calling xgb.train() with xgb_model=None does seem to reset things for the interleaved boosters. Indeed, this is the only explanation for why the first two models train identically: if the random state were not reset after the first three calls to xgb.train(), the second model would come out differently.

We feel that even with colsample_* params affecting the training outcome, the results should be repeatable and all four of our boosters should end up the same. Perhaps the random state should be kept on a per-booster basis, so that every call to xgb.train() sets the seed accordingly.

Thanks in advance for your help with this one!

import xgboost as xgb
import pandas as pd
import numpy as np

train_df = pd.DataFrame(np.random.rand(100, 10))
label_df = pd.DataFrame(np.random.rand(100, 1))

trainDM = xgb.DMatrix(data=train_df, label=label_df)

param_dict = {
    'colsample_bytree': 0.25,   # any colsample_* < 1.0 triggers the effect
    'colsample_bylevel': 1.0,
    'colsample_bynode': 1.0,
    'seed': 4512}

boosters_trained_in_series = [None, None]
boosters_trained_interleaved = [None, None]

# clear_the_pipes = None

# Series: finish all 3 rounds of booster 0 before starting booster 1.
for i in range(0, 2):
    for a in range(0, 3):
        boosters_trained_in_series[i] = xgb.train(param_dict, trainDM, num_boost_round=1, xgb_model=boosters_trained_in_series[i])

# Interleaved: alternate one round at a time between the two boosters.
for a in range(0, 3):
    for i in range(0, 2):
        boosters_trained_interleaved[i] = xgb.train(param_dict, trainDM, num_boost_round=1, xgb_model=boosters_trained_interleaved[i])
        # clear_the_pipes = xgb.train(param_dict, trainDM, num_boost_round=1, xgb_model=None)

predictions_df = pd.DataFrame()

all_boosters = boosters_trained_in_series + boosters_trained_interleaved

# Columns 0 and 1 (series) come out identical; columns 2 and 3
# (interleaved) match neither the series columns nor each other.
for i in range(len(all_boosters)):
    predictions_df[i] = all_boosters[i].predict(trainDM)

predictions_df

This is all in Python 3.7.9, xgboost 1.3.3.
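To see why a module-level engine would produce exactly this pattern, here is a minimal numpy analogy. This is our own illustration of the hypothesis, not xgboost's actual implementation: np.random.RandomState stands in for the internal engine, and reseeding on booster construction mirrors what xgb_model=None appears to do above.

import numpy as np

def rounds(rng, n):
    # each draw stands in for the column sampling done in one boosting round
    return [rng.rand() for _ in range(n)]

# Series: each new "booster" gets a freshly seeded engine.
series_1 = rounds(np.random.RandomState(4512), 3)
series_2 = rounds(np.random.RandomState(4512), 3)
assert series_1 == series_2          # identical, as observed

# Interleaved: both "boosters" pull from ONE shared engine, so each
# only sees every other draw and neither matches the series run.
shared = np.random.RandomState(4512)
inter_1, inter_2 = [], []
for _ in range(3):
    inter_1 += rounds(shared, 1)
    inter_2 += rounds(shared, 1)
assert inter_1 != series_1 and inter_2 != series_1 and inter_1 != inter_2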

@trivialfis
Member

The random number engine is not reset during each run.

@trivialfis
Member

If you remove colsample_bytree, the results should be the same (within floating point error). Resetting the random engine for each run would have other effects. For example, the result of:

booster_0 = train({...}, dtrain, num_boost_round=1)
booster_1 = train({...}, dtrain, num_boost_round=1, xgb_model=booster_0)

should equal:

booster = train({...}, dtrain, num_boost_round=2)

If the random number engine were reset during each run, the above wouldn't hold.
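For reference, this equivalence is easy to check directly. A minimal sketch on synthetic data (any small regression set works; note that no colsample_* parameters are set):

import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(np.random.rand(100, 10), label=np.random.rand(100))
params = {'seed': 4512}              # no colsample_*, so no RNG draws

# Two rounds in one burst vs. one round resumed from a one-round model.
booster = xgb.train(params, dtrain, num_boost_round=2)
booster_0 = xgb.train(params, dtrain, num_boost_round=1)
booster_1 = xgb.train(params, dtrain, num_boost_round=1, xgb_model=booster_0)

# Passes: resumed training matches single-burst training.
np.testing.assert_allclose(booster.predict(dtrain),
                           booster_1.predict(dtrain))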

@bitfilms
Author

Yes, resetting the random number engine on every call would be a bad idea, because you want to be able to interrupt training and still reach the same results. In other words, we are in agreement that:

booster_0 = train({...}, dtrain, num_boost_round=1)
booster_1 = train({...}, dtrain, num_boost_round=1, xgb_model=booster_0)

should always come up with the same results as

booster = train({...}, dtrain, num_boost_round=2)

However, the current code does NOT guarantee this, as my post demonstrates. If the user makes any call to xgb.train() in between the booster_0 and booster_1 calls above, on an entirely different booster instance, it perturbs the RNG state so that booster_1 does not end up the same as booster (again, within floating point error).

In other words, this simple change will keep booster_1 from equalling booster, even though we both agree it should:

booster_0 = train({...}, dtrain, num_boost_round=1)
booster_irrelevant = train({ params with colsample_* values < 1.0}, dtrain, num_boost_round=1, xgb_model=booster_0)
booster_1 = train({...}, dtrain, num_boost_round=1, xgb_model=booster_0)

Interrupting training on one booster by training another should never change the outcome on the first booster. And yet it does: the call that trains booster_irrelevant is far from irrelevant, since it advances the RNG state inside xgboost and thereby changes booster_1 from what it should have been.

If you cannot train in bursts (say, two bursts of 10 rounds each) and achieve the same results as a single burst of 20, that seems like a bug that should be fixed. Otherwise, the only way to guarantee the same model outcome in xgboost is to always retrain from the very start, which is untenable for boosters that require hundreds of rounds. And yet, in the current code, that is exactly what users must do if they want repeatable results (assuming they use any colsample_* parameters that touch the RNG state).
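Spelled out on the same synthetic setup as my original report, the counterexample is (a sketch; on xgboost 1.3.3 the final print shows a non-trivial difference instead of ~0):

import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(np.random.rand(100, 10), label=np.random.rand(100))
params = {'colsample_bytree': 0.25, 'seed': 4512}

# Two rounds in one burst.
booster = xgb.train(params, dtrain, num_boost_round=2)

booster_0 = xgb.train(params, dtrain, num_boost_round=1)
# Far from irrelevant: this call advances the shared engine.
booster_irrelevant = xgb.train(params, dtrain, num_boost_round=1, xgb_model=booster_0)
booster_1 = xgb.train(params, dtrain, num_boost_round=1, xgb_model=booster_0)

print(np.abs(booster.predict(dtrain) - booster_1.predict(dtrain)).max())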

My suggestion is not to reset the random number engine with each call to xgb.train(), but instead to save the random state on a per-booster basis. Then, even if the param set includes colsample_* or other stochastic settings, when you return to training the model in question, the RNG is restored to wherever it left off, and you achieve repeatable results even when training in bursts.

I am not familiar with the internals of the code, so I am not sure whether my fix is easy or even possible. But right now, anyone who trains boosters in bursts with xgboost and works with more than one booster at a time is not guaranteed the same results between runs, unless they avoid the parameters that draw from the RNG!
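Conceptually, the suggestion amounts to each booster owning its generator. A toy numpy sketch of that design (a stand-in for illustration, not xgboost internals):

import numpy as np

class FakeBooster:
    """Toy stand-in: owns its RNG, so its random stream is isolated."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.draws = []

    def train_one_round(self):
        # one draw stands in for one round's column sampling
        self.draws.append(self.rng.random())

# Interleaving two boosters no longer perturbs either stream.
a, b = FakeBooster(4512), FakeBooster(4512)
for _ in range(3):
    a.train_one_round()
    b.train_one_round()

# A third booster trained alone, in series, matches both.
c = FakeBooster(4512)
for _ in range(3):
    c.train_one_round()

assert a.draws == b.draws == c.draws   # interleaved == series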

@trivialfis
Member

trivialfis commented Feb 17, 2021

Because IF the user makes any calls to xgb.train() in between the booster_0 and booster_1 calls above, on an entirely different booster instance

That's true.

Making the state local to the booster is possible with a few changes. But saving it might be more involved. Let me figure something out.

@bitfilms
Author

Thank you!

@bitfilms
Author

bitfilms commented May 25, 2024 via email
