(Picture on the left: Rembrandt, "Portrait of an Evangelist Writing", via Wikipedia)
This is a simple, fast, customizable machine learning sound classifier for live audio in real time.
- MobileNetV2, a lightweight CNN designed for mobile, is used as the base model.
- (new) A lighter/faster AlexNet-based CNN model is also provided.
- TensorFlow graph for faster runtime prediction and for portability.
- Pre-built models for the Freesound Dataset Kaggle 2018 are ready to use.
- (new) Real-time inference on Raspberry Pi is supported.
- Ensembling a sequence of predictions (a moving average) for stable results.
- Sample-wise normalization for robustness to level drift (see the sketch after this list).
- Easy-to-customize examples.
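Sample-wise normalization mentioned above can be as simple as standardizing each sample by its own statistics; a minimal sketch, not the repository's exact code:

```python
import numpy as np

def samplewise_normalize(X, eps=1e-7):
    """Standardize one sample (e.g. a log-mel spectrogram) by its own
    mean and std, so the absolute recording level does not matter."""
    return (X - X.mean()) / (X.std() + eps)
```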
This project is a spin-off from my solution to the Kaggle competition "Freesound General-Purpose Audio Tagging Challenge", created to try the models in a real environment.
The competition provided a dataset as usual, so all of that work was done with statically recorded samples. Unlike the competition, this project is working example code that records live audio and classifies it in real time.
(new) A lighter AlexNet-based model has also been added, which makes it possible to run on a Raspberry Pi.
If you implement this in your mobile app, or in any embedded app such as detecting a door opening/closing and sending a notification from a Raspberry Pi, I would really appreciate it if you could let me know. You can open a pull request with your application, post it as an issue here, or reach me via Twitter.
Install the following modules.
- Python 3.6.
- Tensorflow, tested with 1.9.
- Keras, tested with 2.2.0.
- LibROSA, tested with 0.6.0.
- PyAudio, tested with 0.2.11.
- EasyDict, tested with 1.4.
- Pandas, tested with 0.23.0.
- Matplotlib, tested with 2.2.2.
- imbalanced-learn (for training only), tested with 0.3.3.
- (some others, to be updated)
Simply running realtime_predictor.py
will start listening to the default audio input and predicting. The following result is an example of typing on the computer keyboard of the laptop that was running it.
You can see 'Bus' after many 'Computer_keyboard' predictions. 'Bus' appears many times with usual home environmental noise, like the sound of cars running on a nearby road, so this is expected.
$ python realtime_predictor.py
Using TensorFlow backend.
Fireworks 31 0.112626694
Computer_keyboard 8 0.13604443
Computer_keyboard 8 0.1880264
Computer_keyboard 8 0.20889345
Computer_keyboard 8 0.23447378
Computer_keyboard 8 0.27483797
Computer_keyboard 8 0.30053857
Computer_keyboard 8 0.31214702
Computer_keyboard 8 0.31657898
Computer_keyboard 8 0.30582297
Computer_keyboard 8 0.31986332
Computer_keyboard 8 0.289629
Computer_keyboard 8 0.22962372
Computer_keyboard 8 0.20274992
Computer_keyboard 8 0.2013374
Computer_keyboard 8 0.1868646
Computer_keyboard 8 0.17860062
Computer_keyboard 8 0.17743647
Computer_keyboard 8 0.18610674
Bus 22 0.13272804
Bus 22 0.14847495
Bus 22 0.15111914
Bus 22 0.16028993
Bus 22 0.18327181
Bus 22 0.1998493
Bus 22 0.2240002
Bus 22 0.2510223
Bus 22 0.28142408
Bus 22 0.32763514
Bus 22 0.35531467
:
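Under the hood, the live input side boils down to a PyAudio capture loop. Below is a minimal hedged sketch, assuming the default input device and 16-bit mono capture; the actual realtime_predictor.py adds feature conversion, prediction, and ensembling on top of this.

```python
# Minimal sketch of live audio capture with PyAudio; illustrative only.
import pyaudio
import numpy as np

SR = 44100          # sampling rate; 16000 for the small models
CHUNK = SR // 10    # rt_oversamples = 10 -> capture 10 times per second

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=SR,
                 input=True, frames_per_buffer=CHUNK)
try:
    while True:
        raw = stream.read(CHUNK)
        # convert to float32 in [-1, 1] for the feature pipeline
        samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
        # ... feed `samples` into mel-spectrogram conversion and prediction ...
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()
```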
Three example scripts are provided:
- realtime_predictor.py - The basic example; it predicts the 41 classes provided by the Freesound dataset.
- deskwork_detector.py - A customization example that predicts 3 deskwork sounds only, using emoji to visualize prediction probabilities.
- premitive_file_predictor.py - A very simple example; no prediction ensemble is used, and it predicts from a file only.
Example Applications are provided for some datasets; they explain how to train models.
And cnn-laser-machine-listener
is an end-to-end example that showcases training a model, converting it to .pb, and using it for prediction in real time.
Check the Example Applications when you want to train a model on your own dataset.
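For orientation, converting a trained Keras model to a .pb graph on TF 1.x typically looks like the hedged sketch below; this is illustrative of the general technique, not the exact code of cnn-laser-machine-listener.

```python
# Freeze a Keras model into a TensorFlow .pb graph (TF 1.x style).
import tensorflow as tf
from keras import backend as K
from keras.models import load_model

K.set_learning_phase(0)  # inference mode before loading the model
model = load_model('mobilenetv2_small_fsd2018_41cls.h5')
sess = K.get_session()
# Bake variables into constants so the graph is self-contained.
frozen = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph_def, [t.op.name for t in model.outputs])
tf.train.write_graph(frozen, '.', 'model.pb', as_text=False)
```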
The following pretrained models are provided. The 44.1 kHz model is for high-quality sound classification, and the 16 kHz models are for general-purpose sounds.
File name | Dataset | Base Model | Audio input |
---|---|---|---|
mobilenetv2_fsd2018_41cls.h5 | FSDKaggle2018 | MobileNetV2 | 44.1kHz, 128 n_mels, 128 time hop |
mobilenetv2_small_fsd2018_41cls.h5 | FSDKaggle2018 | MobileNetV2 | 16kHz, 64 n_mels, 64 time hop |
alexbased_small_fsd2018_41cls.h5 | FSDKaggle2018 | AlexNet | 16kHz, 64 n_mels, 64 time hop |
About 9,000 samples from FSDKaggle2018 were used for training. Compared with ImageNet, 9,000 samples is not much, so we cannot expect these models to have a strong representation capability, but they are still better than starting from scratch on small problems.
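For a rough idea of how one of these models might be driven offline, here is a hedged sketch following the 16 kHz / 64 n_mels row of the table; the repository's own code is the reference, and the windowing and hop values below are illustrative assumptions.

```python
# Hedged sketch of offline prediction with the 16 kHz / 64 n_mels model.
import numpy as np
import librosa
from keras.models import load_model

# May need custom_objects depending on your Keras version.
model = load_model('mobilenetv2_small_fsd2018_41cls.h5')

wav, _ = librosa.load('sample/fireworks.wav', sr=16000)
mels = librosa.feature.melspectrogram(y=wav, sr=16000, n_mels=64)
logmels = librosa.power_to_db(mels)
# Sample-wise normalization as described in the feature list above.
x = (logmels - logmels.mean()) / (logmels.std() + 1e-7)
# One 64-frame window, shaped (batch, n_mels, frames, channel).
x = x[:, :64][np.newaxis, ..., np.newaxis]
probs = model.predict(x)[0]
print(int(np.argmax(probs)), float(np.max(probs)))
```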
See RaspberryPi.md for details.
The following 3 parameters are important for fine-tuning the program's behavior in your environment.
conf.rt_process_count = 1
conf.rt_oversamples = 10
conf.pred_ensembles = 10
conf.rt_oversamples sets how many times raw audio capture happens per second. If this is 10, capture happens 10 times per second. Response will be quicker with a larger value, as long as your environment is powerful enough.
conf.rt_process_count sets the duration of audio conversion and the prediction that follows, as a count of audio captures: 1 means audio conversion and prediction happen on every audio capture; if it is 2, one prediction happens per two audio captures. Response will also be quicker if you set this smaller, like 1, but prediction takes time, so you might need to set it larger on a less powerful machine.
On my MacBook Pro with a 4-core CPU, 1 is fine.
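The arithmetic behind these two parameters, spelled out (illustrative variable names):

```python
# With the defaults above: capture 10 times a second, predict on every capture.
rt_oversamples = 10
rt_process_count = 1
capture_period = 1.0 / rt_oversamples                  # 0.1 s between captures
prediction_period = capture_period * rt_process_count  # 0.1 s between predictions
```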
conf.pred_ensembles sets the number of predictions to ensemble. Past-to-present predictions are held in a FIFO, and this number is used as the size of that FIFO. The ensemble is computed by calculating the geometric mean of all the predictions in the FIFO.
For example, if conf.pred_ensembles = 5, the past 4 predictions and the present prediction will be used to calculate the present ensemble prediction result.
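A minimal sketch of this ensembling (names are illustrative; the repository's code is the reference):

```python
# Keep the last `pred_ensembles` softmax outputs in a FIFO and take
# their geometric mean for a stable, smoothed prediction.
from collections import deque
import numpy as np

pred_ensembles = 10
fifo = deque(maxlen=pred_ensembles)  # oldest prediction drops out automatically

def ensemble_predict(raw_pred):
    """raw_pred: softmax probabilities for the present frame, shape (n_classes,)."""
    fifo.append(raw_pred)
    # geometric mean = exp(mean(log(p))); epsilon avoids log(0)
    logs = np.log(np.array(fifo) + 1e-12)
    return np.exp(logs.mean(axis=0))
```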
The following examples introduce the possible impact of applying this kind of classifier in your applications.
A sample fireworks.wav
was recorded on an iPhone as a video during my trip to a local festival; the audio was extracted from the video with QuickTime.
So this is essentially iPhone-recorded audio, just like what would be available in your iPhone app. If you embed the prediction model, your app will understand that 'Fireworks are going on'.
$ CUDA_VISIBLE_DEVICES= python premitive_file_predictor.py sample/fireworks.wav
Using TensorFlow backend.
Fireworks 0.7465853
Fireworks 0.7307739
Fireworks 0.9102228
Fireworks 0.83495927
Fireworks 0.8383171
Fireworks 0.87180334
Fireworks 0.67869043
Fireworks 0.8273291
Fireworks 0.9263128
Fireworks 0.6631776
Fireworks 0.2644468
Laughter 0.2599133
A simple dedicated example is ready: deskwork_detector.py
will detect 3 deskwork sounds: Writing, Using scissors, and Computer keyboard typing.
And a sample desktop_sounds.wav
was recorded on my PC, capturing the sounds of the 3 deskwork actions in sequence.
So this has a similar impact to the iPhone example; we are capable of detecting specific actions.
$ python deskwork_detector.py -f sample/desktop_sounds.wav
Using TensorFlow backend.
📝 📝 📝 📝 📝 Writing 0.40303284
📝 📝 📝 📝 📝 📝 Writing 0.53238016
📝 📝 📝 📝 📝 📝 Writing 0.592431
📝 📝 📝 📝 📝 📝 Writing 0.5818318
📝 📝 📝 📝 📝 📝 📝 Writing 0.60172915
📝 📝 📝 📝 📝 📝 📝 Writing 0.6218055
📝 📝 📝 📝 📝 📝 📝 Writing 0.6176261
📝 📝 📝 📝 📝 📝 📝 Writing 0.6429986
📝 📝 📝 📝 📝 📝 📝 Writing 0.64290494
📝 📝 📝 📝 📝 📝 📝 Writing 0.6534221
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.70015717
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.7130502
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.72495073
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.75500214
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.7705893
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.7760113
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.79651505
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.7876685
📝 📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.8026823
📝 📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.80895096
📝 📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.8053692
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.7975255
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.77262956
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.76053137
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.7428124
📝 📝 📝 📝 📝 📝 📝 📝 Writing 0.70416236
📝 📝 📝 📝 📝 📝 📝 Writing 0.6924045
📝 📝 📝 📝 📝 📝 📝 Writing 0.66129446
📝 📝 📝 📝 📝 📝 📝 Writing 0.6085751
📝 📝 📝 📝 📝 📝 Writing 0.5530443
📝 📝 📝 📝 📝 📝 Writing 0.50439316
📝 📝 📝 📝 Writing 0.36155683
📝 📝 📝 Writing 0.25736108
📝 📝 Writing 0.19337422
✨ ✨ Computer_keyboard 0.17075704
✨ ✨ ✨ Computer_keyboard 0.2984399
✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.42496625
✨ ✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.5532899
✨ ✨ ✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.64340276
✨ ✨ ✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.6802248
✨ ✨ ✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.66601527
✨ ✨ ✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.62935054
✨ ✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.58397347
✨ ✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.526138
✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.4632419
✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.42490804
✨ ✨ ✨ ✨ ✨ Computer_keyboard 0.40419492
✨ ✨ ✨ ✨ Computer_keyboard 0.38645235
✨ ✨ ✨ ✨ Computer_keyboard 0.36492997
✨ ✨ ✨ ✨ Computer_keyboard 0.33910704
✨ ✨ ✨ ✨ Computer_keyboard 0.3251212
✨ ✨ ✨ ✨ Computer_keyboard 0.30858216
✨ ✨ ✨ Computer_keyboard 0.27181208
✨ ✨ ✨ Computer_keyboard 0.263743
✨ ✨ ✨ Computer_keyboard 0.23821306
✨ ✨ ✨ Computer_keyboard 0.20494641
✨ ✨ Computer_keyboard 0.12053269
✨ ✨ Computer_keyboard 0.15198663
✂️ ✂️ Scissors 0.16703679
✂️ ✂️ ✂️ Scissors 0.21830729
✂️ ✂️ ✂️ Scissors 0.28535327
✂️ ✂️ ✂️ ✂️ Scissors 0.3840115
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.52598876
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.6164208
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.6412893
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.7195315
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.7480882
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.76995486
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.7952656
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.79389054
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.8043309
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.79848546
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.801064
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.79365206
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.7864271
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.71144533
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.70292735
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.6741176
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.6504746
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.6561931
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.6303668
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.6297095
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.6140897
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.57456887
✂️ ✂️ ✂️ ✂️ ✂️ ✂️ Scissors 0.5549676
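Drawing bars like the ones above is straightforward; here is a hedged sketch of the idea (the emoji mapping and scaling are illustrative, not necessarily what deskwork_detector.py uses):

```python
# Illustrative only: print an emoji bar whose length tracks the
# ensemble probability, similar in spirit to deskwork_detector.py.
EMOJI = {'Writing': '📝 ', 'Computer_keyboard': '✨ ', 'Scissors': '✂️ '}

def show_prediction(label, prob, scale=11):
    bar = EMOJI.get(label, '* ') * int(prob * scale)
    print(bar, label, prob)

show_prediction('Writing', 0.75500214)  # -> a bar of 8 emoji, then the label
```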
- Documenting more details on training with your own dataset