Feat/all ids are hashes #36
Conversation
… file-content resp. to ensure reproducibility
A few small things.
I think we need to clean up this module in the future, though, as it seems to be drowning in tiny, interlinked details everywhere. But let's hold off on that until the format is completely fixed, as it still seems to be in flux.
Co-authored-by: Dan Saattrup Nielsen <[email protected]>
LGTM! Just need the tests to pass now. Seems like removing the |
LGTM!
This PR fixes the following potential bug: the recording_ids and speaker_ids for a given entity might change when rerunning the build_coral_data.py script, because they were being assigned concurrently. Now we create the IDs as hashes of the recording content and of the speaker name and email, respectively. We use the adler32 hashing algorithm because it is fast, produces short hashes, and has a low collision rate (even when truncated). Thanks to this last property, we can truncate the hashes so that every ID has the same length. This way of producing IDs is rather time-costly; at the moment, running the script takes ~15 min. Note that some files cannot be decoded, in which case we derive their ID from the filename, which is still unique.
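For illustration, here is a minimal sketch of how such deterministic IDs could be derived with adler32. The helper names (make_recording_id, make_speaker_id) and the 8-character truncation length are assumptions for the example; the actual build_coral_data.py implementation may differ.

```python
import zlib
from pathlib import Path

ID_LENGTH = 8  # hypothetical fixed ID length after truncation


def make_recording_id(audio_path: Path) -> str:
    """Derive a deterministic recording ID from the file content.

    Falls back to hashing the filename if the file cannot be read.
    """
    try:
        data = audio_path.read_bytes()
    except OSError:
        # Undecodable/unreadable file: hash the (unique) filename instead
        data = audio_path.name.encode("utf-8")
    checksum = zlib.adler32(data)
    # Zero-pad to 8 hex digits and truncate so every ID has the same length
    return f"{checksum:08x}"[:ID_LENGTH]


def make_speaker_id(name: str, email: str) -> str:
    """Derive a deterministic speaker ID from the speaker's name and email."""
    checksum = zlib.adler32(f"{name}{email}".encode("utf-8"))
    return f"{checksum:08x}"[:ID_LENGTH]
```

Because the hash depends only on the input bytes, rerunning the script yields the same IDs every time, which is the reproducibility property the PR is after.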
We also automatically generate a short readme pertaining to the processed data, as this was requested by Anna (DIKU).