Non-deterministic results with dask when using version 1.6.2 #8701

Comments
It takes some care to get deterministic results using dask. The issue you referred to is specific to the local GPU cluster, where there's a single host and multiple GPUs occupying the same host address. With or without that fix, xgboost should be deterministic on other setups as long as the data input is deterministic. In essence, one needs to ensure the data partitioning is exactly the same for each and every run.
Hi @trivialfis, would you have any recommendations on how to ensure the partitioning is exactly the same? I am repartitioning the dask dataframe before I fit my model, but I am having it automatically partition the data into 500 MB chunks. I could change this to a specific number of partitions if that would help ensure consistency. Any help you can offer is much appreciated! I can also post on the dask forum. Thank you!
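For what it's worth, here is a minimal sketch of the difference between size-based and count-based repartitioning, using a synthetic dataframe as a stand-in for the real data (partition counts are placeholders):

```python
import pandas as pd
import dask.dataframe as dd

# Synthetic stand-in for the real training data, which in this thread is
# sorted on a datetime index before repartitioning.
pdf = pd.DataFrame({"x": range(1_000), "y": [0, 1] * 500})
ddf = dd.from_pandas(pdf, npartitions=8)

# Size-based repartitioning (e.g. partition_size="500MB") lets dask choose
# the partition boundaries, which can shift between runs; an explicit, fixed
# partition count keeps the boundaries identical as long as the row order is
# identical from run to run.
ddf = ddf.repartition(npartitions=4)
```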
Hi @trivialfis, here you can see the AUC for the model is different even though the input rows for each of the partitions are exactly the same. These partitions have been sorted on a datetime index and then repartitioned, and they produced the same number of output rows, so I would expect their content to be identical. I am hoping there is still something I am doing wrong, since for dask/xgboost to be practical for machine learning I think it is very important for the results to be reproducible. Thank you so much for any help or clarification you can offer! Please let me know if I am mistaken in anything that I have said.
Let me try to reproduce it later; I'm currently on holiday. Could you please share a reproducible example that I can run? If not, could you please check whether the models generated by each training session are the same by:
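A minimal sketch of such a check, assuming each run saves its booster to a JSON file (the file names below are placeholders): save the model from each training session and compare checksums.

```python
import hashlib

def sha256_of(path: str) -> str:
    # Hash the serialized model file so two training runs can be compared
    # byte for byte.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# e.g. clf.get_booster().save_model("model_run1.json") after each run
print(sha256_of("model_run1.json") == sha256_of("model_run2.json"))
```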
So far, yes. We run benchmarks and tests. But you have opened an issue on this topic, so we will have to double-check. Also, an AUC of 0.49 seems wrong.
Hi @trivialfis,

TLDR:
Some Theories:
Reproducible Example and Responses:

- When I compare the model files using your sha256sum method, they are indeed different, even though the partitions of the dataframe are exactly the same (see below for more info).
- I am running the process with a small dataset while I try to understand what is causing variation from run to run, and this dataset is very unbalanced, which is likely leading to the poor performance.

More Information:
I was able to make a small reproducible example where, instead of querying the data, preprocessing it, and then fitting the model, I read the preprocessed data from a parquet file and then fit the model. This helped me understand one type of variance in the xgboost results: if I didn't shut down the dask cluster between runs, the results would be different every time. Once I started shutting down the cluster between runs, the results for this small example became deterministic.

I then revisited my more complicated process where I query the data and preprocess it. I tried shutting the cluster down between runs, and there is still variation in the results. I am now trying to isolate which step of my process creates this variation compared to the simpler example. To start, I stopped querying the data and instead saved the query output to a parquet file and read from that. This did reduce the variation in the sense that the results now oscillate between a handful of options. I am now trying to isolate or remove parts of the preprocessing to find out where more of this variation could be coming from.

In Summary:
Thank you so much for all of your help, and I hope you are enjoying your holiday!
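The "shut down the cluster between runs" step described above could look roughly like this; the scheduler address is a placeholder, since the cluster in this thread is started through an external API:

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address

# ... read the cached data, fit xgboost, save and score the model ...

# Restarting the workers (or shutting the cluster down entirely) clears any
# state cached from a previous run before the next training session starts.
client.restart()
# or: client.shutdown() followed by starting a fresh cluster
```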
Thank you for sharing the detailed information! That provides a lot of insight into the issue. Indeed, getting deterministic results can sometimes be quite difficult, if not impossible. I have some observations of my own from other workflows:
In the end, it's a delicate process to ensure reproducibility on distributed/parallel systems.
Hi @trivialfis, I am currently persisting the dask array into cluster memory before fitting xgboost, if that is helpful information. I cannot thank you enough for giving me more options to try out! If I can't get it to work, then it might just be the reality for this process.
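For context, persisting before the fit might look roughly like this, with synthetic data standing in for the real arrays and a placeholder scheduler address:

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, wait

client = Client("tcp://scheduler:8786")  # placeholder address

# Synthetic stand-in for the real features and labels.
X = da.random.random((50_000, 10), chunks=(5_000, 10))
y = da.random.randint(0, 2, size=(50_000,), chunks=(5_000,))

# Persist so every fit sees the same materialized partitions instead of
# re-executing the upstream graph, and block until the data is in memory.
X, y = client.persist([X, y])
wait([X, y])

clf = xgb.dask.DaskXGBClassifier(random_state=0)
clf.fit(X, y)
```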
Yup, sometimes I cache the ETL result to avoid repeating potentially non-deterministic operations.
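As one concrete form of that caching, the ETL output could be written to parquet once and read back for each training run (paths and the example frame are placeholders):

```python
import pandas as pd
import dask.dataframe as dd

# Stand-in for the output of the ETL/preprocessing step.
features = dd.from_pandas(
    pd.DataFrame({"x": range(100), "y": [0, 1] * 50}), npartitions=4
)

# Materialize the ETL result once; later runs read the cached copy instead of
# re-executing a graph that may distribute rows differently each time.
features.to_parquet("cached_features/")
cached = dd.read_parquet("cached_features/")
```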
Closing as this is a feature request for dask instead of XGBoost. The non-deterministic behaviour comes from data distribution. |
Hello,
When using xgboost with dask, I am getting non-deterministic results even when I set the random_state parameter. This is very similar to #7927, which was reportedly fixed with a merge on Aug 11, 2022. Since version 1.6.2 was released on Aug 23, 2022, I assumed the fix would be included, but every time I fit xgboost to the same dask array I get different feature importances and AUC scores. I am using distributed computation on a Kubernetes cluster for this experimentation. Due to other required packages I can only go up to xgboost version 1.6.2, but please let me know if this has been fixed in a later version or if I am mistaken about anything I have said.
It is difficult for me to provide an example because we use an outside company's software to start up the dask cluster through an API, but any distributed dask fitting should show the same result.
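For reference, a minimal sketch of the kind of fit being described, with synthetic data and a placeholder scheduler address standing in for the real setup:

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client

# Placeholder; the real cluster is started via an external API.
client = Client("tcp://scheduler:8786")

# Synthetic stand-in for the real training data.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

# Even with random_state fixed, results can differ between runs if the data
# partitioning seen by the workers differs between runs.
clf = xgb.dask.DaskXGBClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
print(clf.feature_importances_)
```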
Summary information:
xgboost version: 1.6.2
dask version: 2022.2.0
distributed version: 2021.11.2