Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/all ids are hashes #36

Merged
merged 8 commits into from
Oct 6, 2023
Merged

Feat/all ids are hashes #36

merged 8 commits into from
Oct 6, 2023

Conversation

AJDERS
Copy link
Contributor

@AJDERS AJDERS commented Sep 7, 2023

This PR fixes the following potential bug: recording_id's, speaker_id's for a given entity might change when rerunning the build_coral_data.py-script, due to these being assign concurrently. Now we create the ids as hashes of the recording, and the speaker name and mail respectively.

We use the adler32-hashing algorithm because its fast, produces short hashes, and has a low collision rate (even when truncating). Due to this last upside, we are able to truncate the hashes to produces equally long hashes every time. This way of producing ids is rather time-costly; atm running the script takes ~15 min...

Note that some files are unable to be decoded, in which case we produce their id from their filename, which is still unique.

We also automatically generate a short readme pertaining to the processed data, as this was requested Anna (DIKU).

@AJDERS AJDERS requested a review from saattrupdan September 7, 2023 16:48
Copy link
Collaborator

@saattrupdan saattrupdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small things.

I think we need to clean up this module in the future, however, as it seems to be drowning in tiny details everywhere that are interlinked. But let's wait with that until the format is completely fixed, as it seems to still be in flux.

src/coral_models/prepare_raw_data.py Show resolved Hide resolved
src/coral_models/prepare_raw_data.py Outdated Show resolved Hide resolved
@AJDERS AJDERS requested a review from saattrupdan October 4, 2023 12:03
@saattrupdan
Copy link
Collaborator

LGTM! Just need the tests to pass now. Seems like removing the use_mps_device line in the wav2vec2 module, and replacing no_cuda with use_cpu in the same module should do the trick.

Copy link
Collaborator

@saattrupdan saattrupdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@AJDERS AJDERS merged commit 29aa2a5 into main Oct 6, 2023
7 checks passed
@saattrupdan saattrupdan deleted the feat/all-ids-are-hashes branch October 26, 2023 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants