boosters that should be training identically are not #6711

bitfilms · 2021-02-17T06:35:20Z

Hello -

We've found an issue where models that should end up identical after training (assuming a deterministic random number generator) do not under all conditions. See the following code in which we train 4 boosters with identical parameter sets on the same data for 3 rounds each.

Two are trained for 3 rounds each in series, aka, we train booster 1 for three rounds before moving on to booster 2. Their predictions are identical and always appear to come out that way (as expected because they start with the same seed even though there is random sampling going on with the colsample_* parameters).

However, two are trained in an "interleaved" fashion, aka, we train one booster for one round, then train the second booster for one round, then repeat until both have been trained for 3 rounds. Those boosters do not have identical predictions to either of the first two or each other. Even though they too were trained with the same seed, params, and training data.

It appears that any one of the colsample_* parameters being < 1.0 will trigger this effect. We suspect that a random seed in the module instead of the booster is preserving its state between calls to xgb.train() which puts the training out of sync when you train on different boosters.

If you uncomment the two lines in our code featuring the variable clear_the_pipes and re-run, you will see that calling xgb.train with xgb_model=None does seem to reset things a bit for the interleaved boosters. Indeed, this is the only explanation for why the first two models train identically. Because if the random state were not reset after the first 3 calls to xgb.train(), the second model would come out differently.

We feel that even with colsample_* params impacting the training outcomes, the results should be repeatable and all four of our boosters should end up the same. Perhaps a random seed should be kept on a per-booster basis so that every call to xgb.train() will set the seed accordingly.

Thanks in advance for your help with this one!

import xgboost as xgb
import pandas as pd
import numpy as np

train_df = pd.DataFrame( np.random.rand( 100, 10))
valid_df = pd.DataFrame( np.random.rand( 100, 1))

trainDM  = xgb.DMatrix( data=train_df, label=valid_df )


param_dict = {
 'colsample_bytree': .25,
 'colsample_bylevel': 1.0,
 'colsample_bynode': 1.0,
 'seed': 4512}

boosters_trained_in_series = [ None, None ]
boosters_trained_interleaved = [ None, None ]

# clear_the_pipes = None

for i in range( 0, 2):
    for a in range( 0, 3):
        boosters_trained_in_series[ i] = xgb.train( param_dict, trainDM, num_boost_round=1, xgb_model=boosters_trained_in_series[ i])

for a in range(0, 3):
    for i in range( 0, 2):
        boosters_trained_interleaved[ i] = xgb.train( param_dict, trainDM, num_boost_round=1, xgb_model=boosters_trained_interleaved[ i])
        # clear_the_pipes = xgb.train(param_dict, trainDM, num_boost_round=1, xgb_model=None )

predictions_df = pd.DataFrame()

all_boosters = boosters_trained_in_series + boosters_trained_interleaved

for i in range( len( all_boosters)):
    predictions_df[ i] = all_boosters[ i].predict( trainDM)

predictions_df

This is all in python 3.79, xgboost 1.3.3.

The text was updated successfully, but these errors were encountered:

trivialfis · 2021-02-17T18:50:57Z

The random number engine is not reset during each run.

trivialfis · 2021-02-17T19:00:44Z

If you remove the colsample_bytree the results should be the same (within floating point errors). Whether to reset the random engine for each run has different effects. For example, the result of:

booster_0 = train({...}, dtrain, num_boost_round=1)
booster_1 = train({...}, dtrain, num_boost_round=1, xgb_model=booster_0)

should equal to:

booster = train({...}, dtrain, num_boost_round=2)

If the random number engine were reset during each run, the above wouldn't hold.

bitfilms · 2021-02-17T22:19:55Z

Yes, resetting the random number engine would be a bad idea. Because yes, you want to be able to interrupt training and reach the same results. In other words, we are in agreement that:

booster_0 = train({...}, dtrain, num_boost_round=1)
booster_1 = train({...}, dtrain, num_boost_round=1, xgb_model=booster_0)

should always come up with the same results as

booster = train({...}, dtrain, num_boost_round=2)

However, the current code does NOT guarantee this as my post demonstrates! Because IF the user makes any calls to xgb.train() in between the booster_0 and booster_1 calls above, on an entirely different booster instance, it can mess up the RNG state so that booster_1 does not end up the same as booster (again, within floating point errors).

In other words, this simple change will keep booster from equalling booster_1 as we both agree it should:

booster_0 = train({...}, dtrain, num_boost_round=1)
booster_irrelevant = train({ params with colsample_* values < 1.0}, dtrain, num_boost_round=1, xgb_model=booster_0)
booster_1 = train({...}, dtrain, num_boost_round=1, xgb_model=booster_0)

It 100% should not be the case that interrupting training on one booster by training another should change the outcome on the first booster. And yet, it does. Because that call to train booster_irrelevant is far from irrelevant: it tweaks the RNG state in xgb thereby changing the fate of booster_1 from what it should have been.

If you cannot train in bursts (let's say two bursts of 10 rounds each) and achieve the same results as if you trained in a single burst of 20, then that seems like a bug that should be fixed. Otherwise, the only way to guarantee the same model outcome in xgboost is to always start training from the very start. Which is untenable for boosters that require hundreds of rounds of training. And yet, in the current code, that is exactly what users must do if they want repeatable results (assuming they use any colsample_* parameters that tweak the RNG state).

My suggestion is not to reset the random number engine with each call to xgb.train() but instead to save the random state on a per-booster basis. Then, even if the param set includes colsample_* or other stochastic settings, when you return to training the model in question you reset the RNG to wherever it left off before and achieve repeatable results, even if you train in bursts.

I am not familiar with the internals of the code so I am not clear on whether my fix is easy or even possible. But right now anyone who trains boosters in bursts using xgboost, and who work with more than one booster at a time, is not guaranteed to get the same results between their runs unless they avoid using the parameters that adjust the RNG state!

trivialfis · 2021-02-17T22:40:18Z

Because IF the user makes any calls to xgb.train() in between the booster_0 and booster_1 calls above, on an entirely different booster instance

That's true.

Making the state local to the booster is possible with a few changes. But saving it might be more involved. Let me figure something out.

bitfilms · 2021-02-17T23:17:47Z

Thank you!

bitfilms · 2024-05-25T04:12:36Z

Still an outstanding issue as far as I know.

…

On Thu, May 23, 2024 at 8:27 PM andrew-esteban-imc ***@***.***> wrote: Hi there. Just wondering if there has been any progress on this issue? We have found that resuming hist training from a checkpoint when any of subsample, colsample_bytree, colsample_bylevel and colsample_bynode are set below 1 results in a non-deterministic result. With our current setup, it's fairly common for training jobs to be interrupted (they are running in a contested cluster without guaranteed resources), so we rely on checkpointing to reduce time. — Reply to this email directly, view it on GitHub <#6711 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACA5ETDE3NURQNRQHGN5MITZD2XQ3AVCNFSM4XXYZOTKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMJSHA2DIOJUGY4A> . You are receiving this because you authored the thread.Message ID: ***@***.***>

trivialfis added the type: bug label Mar 30, 2021

This was referenced May 28, 2021

Different dask vs. non-dask training even on 1 worker, 1 chunk, 1 n_jobs #7004

Closed

Remove auto configuration of seed_per_iteration. #7009

Merged

marcofavoritobi mentioned this issue Mar 18, 2023

Reproducibility issue of XGBoostSampler results for Windows and Linux bancaditalia/black-it#49

Open

andrew-esteban-imc mentioned this issue May 24, 2024

Hist training with checkpointing is non-deterministic based on subsample #10324

Open

trivialfis mentioned this issue May 31, 2024

[WIP] Remove global random engine. #10354

Draft

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

boosters that should be training identically are not #6711

boosters that should be training identically are not #6711

bitfilms commented Feb 17, 2021

trivialfis commented Feb 17, 2021

trivialfis commented Feb 17, 2021

bitfilms commented Feb 17, 2021

trivialfis commented Feb 17, 2021 •

edited

Loading

bitfilms commented Feb 17, 2021

bitfilms commented May 25, 2024 via email

boosters that should be training identically are not #6711

boosters that should be training identically are not #6711

Comments

bitfilms commented Feb 17, 2021

trivialfis commented Feb 17, 2021

trivialfis commented Feb 17, 2021

bitfilms commented Feb 17, 2021

trivialfis commented Feb 17, 2021 • edited Loading

bitfilms commented Feb 17, 2021

bitfilms commented May 25, 2024 via email

trivialfis commented Feb 17, 2021 •

edited

Loading