An AVES/HuBERT transformer-based Elephant Rumble Detector for the FruitPunch AI "AI For Forest Elephants" project.
When trained and tested on the dataset The Cornell Lab's Elephant Listening Project provided to FruitPunch, it scores well (F1 score, precision, recall, and accuracy all in the high 90%s -- details below).
From Central African soundscapes like this ...
... this model isolates the elephant-related sounds like this ...
... by efficiently classifying each fraction of a second of 24-hour audio files as "elephant" or "not elephant", and generating a structured "Raven Selection File" listing the elephant-related time intervals.
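To make that output concrete: a Raven selection table is a tab-separated file of time intervals. Below is a minimal, hypothetical sketch of how per-slice predictions could be merged into intervals and written in that format (the column names follow common Raven Pro defaults, and the 0-500 Hz band is an assumption; the tool's actual output may differ):

```python
import csv

SLICE_SEC = 1 / 3  # assumed slice length; see the evaluation notes below

def write_raven_selection_file(labels: list[int], path: str) -> None:
    """Merge runs of positive 1/3-second slices into intervals and
    write them as a tab-separated Raven selection table."""
    intervals, start = [], None
    for i, label in enumerate(labels + [0]):     # sentinel closes a trailing run
        if label and start is None:
            start = i
        elif not label and start is not None:
            intervals.append((start * SLICE_SEC, i * SLICE_SEC))
            start = None
    with open(path, "w", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(["Selection", "View", "Channel",
                    "Begin Time (s)", "End Time (s)",
                    "Low Freq (Hz)", "High Freq (Hz)"])
        for n, (begin, end) in enumerate(intervals, 1):
            # 0-500 Hz roughly brackets rumble fundamentals (assumed band)
            w.writerow([n, "Spectrogram 1", 1, f"{begin:.3f}", f"{end:.3f}", 0, 500])
```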
Challenging because:
- Such forests have complex background sounds from a wide variety of animals.
- Forests muffle sounds, and elephants far from the recording devices may produce sounds barely above the noise floor.
- High-quality off-the-shelf audio classifiers tend to do best on human voices, birds, or man-made devices.
```
pip install git+https://github.com/ramayer/[email protected]
elephant-rumble-inference test.wav --visualizations=5 --save-raven *.wav
```
Installation note:
- TorchAudio depends on FFmpeg versions below 7, according to the TorchAudio docs.
- On MacOS, you can get that using `brew install ffmpeg@6`.
- On Windows, someone reported luck with `conda install -c conda-forge 'ffmpeg<7'`.
Example Notebooks:
- A Google Colab notebook demoing the command-line tool here.
- The training notebook that generated the highest-scoring model so far can be seen here.
More detailed usage examples below.
- This model is based on AVES (https://arxiv.org/abs/2210.14493); a loading sketch follows this list.
- AVES, in turn, is based on HuBERT, a self-supervised transformer architecture that models raw waveforms (not spectrograms), originally developed for human speech.
- AVES-bio was pre-trained on a wide range of unlabeled biological sounds, and performs well when fine-tuned for specific animals (cows, crows, bats, whales, mosquitos, fruit-bats, …) compared against spectrogram-based models like ResNets.
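For reference, the upstream AVES repository distributes TorchAudio-compatible checkpoints that can be loaded roughly like this (the file names follow the upstream AVES README and are assumptions here, not files shipped with this project):

```python
import json
import torch
import torchaudio
from torchaudio.models import wav2vec2_model

# File names below follow the upstream AVES release (assumptions here).
with open("aves-base-bio.torchaudio.model_config.json") as f:
    config = json.load(f)
model = wav2vec2_model(**config, aux_num_out=None)
model.load_state_dict(torch.load("aves-base-bio.torchaudio.pt"))
model.eval()

waveform, sr = torchaudio.load("test.wav")   # mono, at the model's expected rate
with torch.no_grad():
    # extract_features returns one (batch, frames, dim) tensor per
    # transformer layer; each frame covers roughly 20 ms of audio.
    features, _ = model.extract_features(waveform)
embeddings = features[-1]                    # use the last layer's features
```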
Unlike most other approaches in this challenge, this model directly analyzes the audio waveforms, without ever generating spectrogram images or using any image-classification or object-detection models.
Because AVES is an unsupervised model, trained on unannotated audio inputs, it needs to be either fine-tuned or augmented with additional classifiers trained to recognize the animals of interest.
- AVES is great at creating clusters of similar biological sounds
- But it doesn't know which clusters go with which animals.
- Visualizing AVES embeddings shows this is a promising approach.
The image above shows a UMAP visualization of AVES embeddings of our data:
- Red = elephants in our labeled data.
- Blue = non-elephant sound clips from the same .wav files.
- (larger, interactive version here)
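A minimal sketch of how a plot like the one above can be produced with the umap-learn package, assuming the embeddings and labels are already computed (the file names below are purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
import umap

# Assumed precomputed: one AVES embedding per 1/3-second audio slice,
# and a boolean label derived from the Raven selection files.
embeddings = np.load("aves_embeddings.npy")   # shape: (n_slices, 768)
is_elephant = np.load("is_elephant.npy")      # shape: (n_slices,), bool

# Project the high-dimensional embeddings down to 2-D for plotting.
emb_2d = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(embeddings)

plt.scatter(emb_2d[~is_elephant, 0], emb_2d[~is_elephant, 1],
            s=2, c="tab:blue", label="not elephant")
plt.scatter(emb_2d[is_elephant, 0], emb_2d[is_elephant, 1],
            s=2, c="tab:red", label="elephant")
plt.legend()
plt.show()
```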
Observe that there are multiple distinct clusters. Listening to the sounds associated with each cluster suggests:
- The multiple distinct red clusters seem to represent the different kinds of sounds elephants can make, ranging from trumpeting to squeaking. Often, it seems, a rumble crescendos into roaring or trumpeting: "the fundamental frequency within a single elephant call may vary over 4 octaves, starting with a Rumble at 27 Hz and grading into a Roar at 470 Hz!"
- Different small blue clusters represent different animals in the background sounds. Monkeys, buffalo, birds, and insects are all present in the training data.
- The large blue cluster is non-specific background noise and other non-biological sounds that AVES-bio was trained to be largely indifferent to, including wind, rain, thunder, etc.
- The path connecting the main elephant-rumble cluster with the blue background-noise cluster seems to represent fainter and fainter rumbles fading into the background noise.
A simpler Support Vector Machine should likely have been able to separate those clusters, but all my attempts failed, perhaps because elephant trumpeting is more similar to other animals' calls than to elephant rumbles, and a kernel trick would be needed to make an SVM put the wide variety of elephant sounds into a single class. Instead I gave up on that approach and added a simple two-fully-connected-layer model (sketched below) that performed well when given an appropriate mix of positive and negative training data.
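A hedged sketch of that two-layer classifier head, assuming 768-dimensional AVES embeddings as input (the hidden size here is illustrative; the training notebook has the real hyperparameters):

```python
import torch
import torch.nn as nn

class RumbleClassifier(nn.Module):
    """Two fully connected layers on top of AVES embeddings."""

    def __init__(self, embedding_dim: int = 768, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),   # logits: "not elephant" vs "elephant"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# e.g. logits = RumbleClassifier()(embeddings)   # embeddings: (n, 768)
```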
Challenge - AVES didn’t have our frequencies or timescales in mind:
- AVES seems easily distracted by high-pitched animals in the background noise.
- HuBERT, designed to recognize syllables in speech, generates feature vectors at a 20ms timescale, which would be expensive on our 24-hour-long clips.
Solution - resample and pitch-shift by re-tagging the sample rate:
- Upshifting the audio by 3-4 octaves shifts elephant rumbles into human hearing ranges, where most audio software operates best.
- Speeding up the audio turns a ½-second elephant-rumble "syllable" into 31ms (~ HuBERT's timescale).
- Speeding up the audio by 16x and upshifting 4 octaves reduces compute by a factor of 16 (see the sketch below).
And as a side effect, it shifts elephant speech into human hearing ranges, and they sound awesome!!!
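A minimal sketch of the re-tagging trick with torchaudio, assuming a model input rate of 16 kHz (that target rate is an assumption, not something this README specifies):

```python
import torchaudio

SPEEDUP = 16                       # 16x faster == 4 octaves up (2**4 == 16)
MODEL_SR = 16_000                  # assumed AVES/HuBERT input rate

waveform, true_sr = torchaudio.load("test.wav")

# "Re-tag" the sample rate: the samples are untouched, but downstream
# code is told they were recorded SPEEDUP times faster.  This both
# speeds the audio up 16x (cutting compute ~16x) and shifts elephant
# rumbles up 4 octaves into the range the model was trained on.
claimed_sr = true_sr * SPEEDUP

# Only a cheap resample is needed to match the model's expected rate.
if claimed_sr != MODEL_SR:
    waveform = torchaudio.functional.resample(waveform, claimed_sr, MODEL_SR)
```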
The version of the model being tested here was:
- Trained on every labeled rumble from FruitPunch.AI's `01. Data/cornell_data/Rumble/Training`, plus an equal duration of unlabeled segments from the same audio files, assumed to be non-elephant-related.
- Tested entirely against every labeled rumble from `01. Data/cornell_data/Rumble/Testing`, plus an equal duration of unlabeled sounds from the same files.
Unfortunately these datasets are not publicly available, but you might reach out to The Elephant Listening Project and/or FruitPunch.ai.
In both cases:
- the things being classified are 1/3-of-a-second slices of the audio.
- every 1/3-of-a-second slice that was labeled in the Raven files was considered a "positive" (a simplified sketch of this labeling follows the list below).
- about the same number of 1/3-second slices from the same files were used as "negative" labels. Those were picked to be near in time to the "positive" samples, to try to capture similar background noise (buffalos, airplanes, bugs).
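A simplified sketch of that positive-labeling step, assuming Raven begin/end times in seconds (the negative-sampling logic, which picks nearby slices, is omitted):

```python
# Hypothetical sketch: turning Raven selection intervals into per-slice
# labels, assuming a 1/3-second slice length.
SLICE_SEC = 1 / 3

def slice_labels(total_sec: float,
                 rumble_intervals: list[tuple[float, float]]) -> list[int]:
    """Label each 1/3-second slice 1 if it overlaps any labeled rumble."""
    n_slices = int(total_sec / SLICE_SEC)
    labels = [0] * n_slices
    for begin, end in rumble_intervals:
        first = int(begin / SLICE_SEC)
        last = min(int(end / SLICE_SEC), n_slices - 1)
        for i in range(first, last + 1):
            labels[i] = 1
    return labels
```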
Many of the wrong guesses seem to be related to:
- Differences of opinion about where a rumble starts or ends. If the Raven file thinks a rumble starts or ends 2/3 of a second sooner than this classifier does, that difference in overlap counts as both a false negative and a false positive.
- Some unusual background noises that I didn't yet add to the manual "negative" classes. If a sound is "animal-like", AVES may consider it "more elephant-like than not" unless the companion model is given appropriate negative training samples.
Code review would be welcome; I hope I'm doing this evaluation fairly. The notebook with the entire training run that produced this particular model can be found here: https://github.com/ramayer/elephant-rumble-inference/blob/main/notebooks/training_notebook.ipynb
This really wants a GPU.
- It processes 24 hours of audio in 22 seconds on a 2060 GPU
- It processes 24 hours of audio in 28 minutes on a CPU
so about a 70x speedup on a GPU.
This has not been tested on a GPU with less than 6 GB of VRAM.
- I was only able to make this work using conda.
- On Windows, someone reported luck with `conda install -c conda-forge 'ffmpeg<7'`.
- Documentation may assume Linux-like paths, and examples may need to be adjusted.
- As per https://pytorch.org/audio/stable/installation.html: "TorchAudio official binary distributions are compatible with FFmpeg version 6, 5 and 4. (>=4.4, <7)." -- so it specifically needs an older version of FFmpeg.
- On MacOS, I needed to `brew install ffmpeg@6` for it to run properly.
- TODO - add this.
- Certain audio files not in the test dataset had curious sounds in the background (bugs? different airplanes?) causing a lot of false positives. Adding such examples to the training data as "not a rumble" and retraining should improve its performance on those files.
- Despite performing well on the audio recorders in both this project's test and train datasets, the pretrained version of the model performs poorly in environments with different background noises, like these South African elephant recordings. It regains its performance when re-trained with negative "not an elephant" examples of background noise from those other regions. Training a version on more diverse not-elephant sounds would produce a more robust model.
- Please submit issues to https://github.com/ramayer/elephant-rumble-inference .
- Also available on the FruitPunch GitLab here: https://gitlab.com/fruitpunch/projects/ai-for-forest-elephants-2/modelling/ai-for-forest-elephant-2/-/tree/aves-based-models?ref_type=heads