Feat/all ids are hashes #36
Conversation
… file-content resp. to ensure reproducibility
A few small things.
I think we need to clean up this module in the future, though, as it seems to be drowning in tiny, interlinked details everywhere. But let's hold off on that until the format is completely fixed, as it still seems to be in flux.
Co-authored-by: Dan Saattrup Nielsen <[email protected]>
LGTM! Just need the tests to pass now. Seems like removing the |
LGTM!
This PR fixes the following potential bug: the recording_ids and speaker_ids for a given entity might change when rerunning the build_coral_data.py script, because they were being assigned concurrently. Now we create the IDs as hashes of the recording content and of the speaker name and email, respectively. We use the adler32 hashing algorithm because it is fast, produces short hashes, and has a low collision rate (even when truncated). Thanks to this last property, we can truncate the hashes so that every ID has the same length. This way of producing IDs is rather time-costly; at the moment, running the script takes ~15 min. Note that some files cannot be decoded, in which case we derive their ID from the filename, which is still unique.
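For illustration, here is a minimal sketch of how such deterministic IDs could be derived with adler32. The helper names (make_recording_id, make_speaker_id) and the 8-character truncation length are assumptions for the example; the actual build_coral_data.py implementation may differ.

```python
import zlib
from pathlib import Path

ID_LENGTH = 8  # hypothetical fixed ID length after truncation


def make_recording_id(audio_path: Path) -> str:
    """Derive a deterministic recording ID from the file content.

    Falls back to hashing the filename if the file cannot be read.
    """
    try:
        data = audio_path.read_bytes()
    except OSError:
        # Undecodable/unreadable file: hash the (unique) filename instead
        data = audio_path.name.encode("utf-8")
    checksum = zlib.adler32(data)
    # Zero-pad to 8 hex digits and truncate so every ID has the same length
    return f"{checksum:08x}"[:ID_LENGTH]


def make_speaker_id(name: str, email: str) -> str:
    """Derive a deterministic speaker ID from the speaker's name and email."""
    checksum = zlib.adler32(f"{name}{email}".encode("utf-8"))
    return f"{checksum:08x}"[:ID_LENGTH]
```

Because the hash depends only on the input bytes, rerunning the script yields the same IDs every time, which is the reproducibility property the PR is after.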
We also automatically generate a short readme pertaining to the processed data, as this was requested by Anna (DIKU).