The current way we handle data hashing doesn't survive package upgrades. For example, with pandas we have been pickling dataframes and hashing the bytes, and the hashes change across pandas upgrades even when the data itself doesn't.
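A minimal sketch of the failure mode being described, assuming the hash is taken over the pickle byte stream (the function name here is hypothetical, not the project's actual API):

```python
import hashlib
import pickle

import pandas as pd

def pickle_digest(obj) -> str:
    """Hypothetical sketch of the current approach: hash the pickle bytes."""
    # The byte stream encodes pandas internals, so it can change
    # across pandas versions even when the data is unchanged.
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(pickle_digest(df))  # may differ across pandas versions for identical data
```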
The risk (which is the reason, I assume, it was not done this way already) is that pickle's memoization will interfere with hashing and create spurious changes in the pickle string of dtypes, with the final consequence of assigning different hash values to seemingly identical objects.
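A small illustration of the memoization hazard, assuming hashes are computed over the pickle byte stream: pickle memoizes by object identity, so equal values can serialize to different bytes depending on whether they share an object.

```python
import pickle

inner = [1, 2, 3]
shared = pickle.dumps([inner, inner])          # second element becomes a memo reference
copies = pickle.dumps([[1, 2, 3], [1, 2, 3]])  # both elements serialized in full

assert [inner, inner] == [[1, 2, 3], [1, 2, 3]]  # equal values...
assert shared != copies                          # ...different byte streams, so different hashes
```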
I think there's a really deep issue here: to be truly reproducible, we need a hash that's more aware of the data, since certain serialized formats change version-to-version even though the underlying raw data is identical.
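A sketch of one possible direction, using pandas' own value-based hasher (pd.util.hash_pandas_object) rather than pickle bytes; whether that hasher itself stays stable across pandas versions is an assumption worth verifying, so treat this as an illustration, not a drop-in fix:

```python
import hashlib

import pandas as pd

def dataframe_digest(df: pd.DataFrame) -> str:
    """Digest a DataFrame from its values and schema, not its pickle bytes."""
    h = hashlib.sha256()
    # hash_pandas_object derives one uint64 per row from the underlying values
    h.update(pd.util.hash_pandas_object(df, index=True).values.tobytes())
    # fold in column names and dtypes so schema changes alter the digest
    h.update(repr([(c, str(t)) for c, t in df.dtypes.items()]).encode())
    return h.hexdigest()

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(dataframe_digest(df))  # derived from values, not serialization internals
```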