@@ -1074,6 +1074,7 @@ performance increased to 83.7%, which seems to be state-of-the-art as of April 2
10741074
10751075
10761076### Audio waveform models
1077+ \label{section:audio-waveform-models}
10771078
10781079Recently, approaches that use the raw audio waveform as input have also been documented.
10791080
@@ -1086,7 +1087,6 @@ They show that the resulting spectrograms have frequency responses with
10861087a shape similar to mel-spectrograms.
10871088The model achieves a 66.3% accuracy score on Urbansound8k [@EnvNet2] with raw audio input.
10881089
1089-
10901090In [@VeryDeepESC], the authors evaluated a number of deep CNNs using only 1D convolutions.
10911091Raw audio with an 8kHz sample rate was used as the input.
10921092Their 18-layer model (M18) achieved 71% accuracy on Urbansound8k,
@@ -1593,13 +1593,13 @@ they had accuracy scores that were 6.1 and 10.2 percentage points lower, respect
15931593
15941594`FIXME: change confusion matrix color scale to show nuances in 0-20% range`
15951595
1596- ` TODO: plot MAC versus compute time `
1596+ <!-- TODO: plot MAC versus compute time -->
15971597
15981598![Confusion matrix on Urbansound8k](./results/confusion_test.png){height=30%}
15991599
16001600![Confusion matrix in reduced groups with only foreground sounds](./results/grouped_confusion_test_foreground.png){height=30%}
16011601
1602- ` TODO: add error analysis plots `
1602+ <!-- TODO: add error analysis plots -->
16031603
16041604
16051605<!-- MAYBE: plot training curves over epochs -->
@@ -1616,15 +1616,17 @@ they had accuracy scores that were 6.1 and 10.2 percentage points lower, respect
16161616
16171617The model used on-device (the 16kHz model with 30 mel filters)
16181618scored 72% on the associated validation set, fold 9.
1619+ When running on the device, the model execution took 43 ms per analysis window,
1620+ while preprocessing of the mel-spectrogram took approximately 60 ms.
16191621
16201622Figure \ref {figure: demo } shows a closeup of the on-device testing scenario.
16211623When playing back a few sounds the system was able to
16221624correctly classify classes such as "dog barking" most of the time.
16231625The classes "jackhammer" and "drilling" were confused several times (in both directions),
16241626but these were often also hard to distinguish by ear.
16251627The system seemed to struggle with the "children playing" class.
1626- When not playing any sound, the GPU fan noise from the nearby computer
1627- was classified as "air conditioner" - which sounded pretty close.
1628+ When not playing any sound, the GPU fan noise from the nearby machine-learning rig
1629+ was classified as "air conditioner", which the author agrees sounded pretty close.
16281630
16291631
16301632\newpage
@@ -1663,53 +1665,107 @@ This hyperparameter was set to a seemingly conservative reduction of 2x
16631665It is possible that this choice of hyperparameter is critical and that other values
16641666would have performed better, but this has not been investigated.
16651667
1666-
16671668Of the models compared, the Strided-DS family of models
16681669appears to give the highest accuracy relative to model compute requirements.
16691670The largest model, Strided-DS-24, was able to achieve near-Baseline performance while using 12x less CPU.
16701671The CPU usage of this model is 11%, well within the requirement of 50%,
1671- allowing the microcontroller to sleep for up to 7/ amounts of time even
1672- when classifications are performed for every 720ms block (real-time).
1672+ allowing the microcontroller to sleep for up to 80% of the time even
1673+ when classifications are performed for every 720 ms window (real-time).
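
As a rough check of this figure, the sleep fraction follows directly from the per-window times (a minimal sketch; it assumes a simple periodic schedule where the CPU is active only for preprocessing and inference, using the ~11% inference CPU usage and ~60 ms preprocessing time reported in this work):

```python
# Rough duty-cycle estimate for real-time classification.
# Assumes one analysis window every 720 ms, ~11% CPU for inference
# (Strided-DS-24) and ~60 ms for mel-spectrogram preprocessing,
# with the CPU otherwise sleeping.
WINDOW_MS = 720.0
INFERENCE_MS = 0.11 * WINDOW_MS  # ~79 ms
PREPROCESS_MS = 60.0

active_fraction = (INFERENCE_MS + PREPROCESS_MS) / WINDOW_MS
sleep_fraction = 1.0 - active_fraction
print(f"active {active_fraction:.1%}, sleep {sleep_fraction:.1%}")
# -> active 19.3%, sleep 80.7%, consistent with "up to 80%" sleep
```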
16731674
16741675The smaller models in the family (with 20, 16 and 12 filters) had lower compute requirements and correspondingly
16751676lower accuracies, suggesting that a tradeoff between model requirements and performance is possible.
1676-
16771677The Strided-DS-3x3 variation, with 4 layers of 3x3 convolutions instead,
16781678was close in performance to the Strided-DS models with 3 layers of 5x5 convolutions.
1679+ This could be investigated further; there may exist variations of this 3x3 model
1680+ that would perform better than the 5x5 models.
16791681
1680- The on-device model which was trained on 16kHz with 30 mel filters (on a single fold),
1681- looked to perform similarly to those with the full 22kHz and 60 mel-filters.
1682+ From a one-fold spot check, the on-device model trained on a
1683+ 16kHz sample rate with 30 mel filters appeared to perform similarly to those with the full 22kHz and 60 mel filters.
16821684This may suggest that perhaps the feature representation (and thus compute requirements)
16831685can be reduced even further without much reduction in performance.
16841686
1687+ ## Spectrogram processing time
1688+
1689+ Interestingly, the mel-feature preprocessing took 60 ms on device,
1690+ which is on the same order as the inference time of the efficient models (38-81 ms).
1691+ This means that the CPU bottleneck is not just the model inference time;
1692+ the spectrogram calculation must also be optimized to reach even lower power consumption.
1693+ In the FP-SENSING1 example used, the spectrogram computation already uses ARM-specific
1694+ optimized codepaths from CMSIS, albeit with floating-point rather than fixed-point arithmetic.
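
For reference, the preprocessing measured here corresponds to a standard log-mel pipeline along these lines (a minimal sketch using librosa; `n_fft` and `hop_length` are illustrative assumptions, while the 16kHz sample rate and 30 mel filters match the deployed configuration):

```python
# Sketch of the log-mel spectrogram preprocessing measured on device.
# n_fft and hop_length are illustrative assumptions; the sample rate
# (16kHz) and number of mel bands (30) match the on-device configuration.
import numpy as np
import librosa

def logmel_frames(audio: np.ndarray, sr: int = 16000,
                  n_fft: int = 1024, hop_length: int = 512,
                  n_mels: int = 30) -> np.ndarray:
    # STFT -> power spectrogram: the FFT is the main per-frame cost,
    # and the part that the CMSIS codepaths accelerate
    spec = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)) ** 2
    # Mel filterbank projection: a fixed (n_mels x n_fft/2+1) matrix multiply
    mels = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=n_mels)
    # Log compression
    return librosa.power_to_db(mels, ref=np.max)

window = np.random.randn(int(0.720 * 16000))  # one 720 ms analysis window
features = logmel_frames(window)
print(features.shape)  # (30, n_frames)
```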
1695+
1696+ This is an opportunity for end-to-end models
1697+ that take raw audio as input instead of requiring preprocessed spectrograms
1698+ (see section \ref{section:audio-waveform-models}),
1699+ as they might be able to do this more efficiently.
1700+ When low-power hardware accelerators for Convolutional Neural Networks become available,
1701+ an end-to-end CNN model will become extra interesting,
1702+ as it would also allow the filterbank processing to be offloaded to the CNN co-processor.
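
To illustrate why this offloading is plausible: the mel filterbank is just a fixed linear transform applied per spectrogram frame, which can be expressed as a 1x1 convolution and thus executed on a CNN accelerator. A hypothetical Keras sketch (shapes and parameters are illustrative assumptions, not the method used in this work):

```python
# Sketch: expressing the mel filterbank as a fixed CNN layer,
# so a CNN co-processor could run it alongside the classifier.
import numpy as np
import librosa
from tensorflow import keras

SR, N_FFT, N_MELS = 16000, 1024, 30
n_bins = N_FFT // 2 + 1  # 513 spectrogram bins per frame

# Fixed mel filter matrix, shape (n_mels, n_bins)
mel_matrix = librosa.filters.mel(sr=SR, n_fft=N_FFT, n_mels=N_MELS)

# A 1x1 convolution over (time, 1, n_bins) inputs applies the
# filterbank independently to each spectrogram frame.
filterbank = keras.layers.Conv2D(
    N_MELS, kernel_size=1, use_bias=False, trainable=False)
filterbank.build((None, None, 1, n_bins))
filterbank.set_weights([mel_matrix.T.reshape(1, 1, n_bins, N_MELS)])

frames = np.random.rand(1, 44, 1, n_bins).astype("float32")  # power spectrogram
mel_frames = filterbank(frames)  # -> shape (1, 44, 1, 30)
```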
1703+
1704+
1705+ ## Practical evaluation
1706+
1707+ Deployments of noise monitoring systems,
1708+ and especially of systems with noise classification capabilities, are still rare.
1709+ Of the 5 large-scale research projects mentioned in the introduction,
1710+ only the SONYC deployment seems to have some level of noise classification capability.
1711+ It is therefore hard to answer whether the 70% accuracy achieved on Urbansound8k,
1712+ or even the state-of-the-art accuracy of 83%,
1713+ is sufficient for a useful real-world noise classification system.
1714+
1715+ From a critical perspective, 70.9% on Urbansound8k is likely below human-level performance.
1716+ While no studies have been done on human-level performance on Urbansound8k directly,
1717+ it is estimated to be 81.3% for ESC-50 and 95.7% for ESC-10 [@ESC-50, ch 3.1],
1718+ and PiczakCNN, which scored 73% on Urbansound8k, scored only 62% on ESC-50 and 81% on ESC-10.
1719+
1720+ From an optimistic perspective, the vast majority of cities today do not use widespread
1721+ noise monitoring equipment.
1722+ So *any* sensor with sound-level monitoring and even rudimentary classification capabilities
1723+ would add new information that could potentially be of use.
1724+ The key to successful application is to design a system and practice
1725+ that makes use of this information, taking its limitations into account.
1726+
1727+ From table \ref{table: results } it can be seen that the accuracy for
1728+ foreground sounds is around 5 percentage points better than the overall accuracy,
1729+ reaching above 75%.
1730+ Background sounds, on the other hand, have a much lower accuracy,
1731+ with the best models under 62%, an 8 percentage point drop (or more).
1732+ This is expected since the signal-to-noise ratio is lower.
1733+ If the information of interest is the predominant sound in an area
1734+ close to the sensor, one could take this into account by only
1735+ classifying loud (and probably closer) sounds,
1736+ in order to achieve higher precision, as sketched below.
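
A minimal sketch of such level-gated classification (the threshold value and the `classify` function are hypothetical placeholders):

```python
# Sketch: only classify analysis windows that are loud enough to
# plausibly be a nearby, predominant sound source.
# LEVEL_THRESHOLD_DB and classify() are hypothetical placeholders.
from typing import Callable, Optional
import numpy as np

LEVEL_THRESHOLD_DB = -30.0  # relative to full scale; calibrated on site

def window_level_db(audio: np.ndarray) -> float:
    """RMS level of one analysis window, in dB relative to full scale."""
    rms = np.sqrt(np.mean(audio ** 2))
    return 20.0 * np.log10(max(rms, 1e-10))

def classify_if_loud(audio: np.ndarray,
                     classify: Callable[[np.ndarray], str]) -> Optional[str]:
    """Run the classifier only on loud windows; skip quiet (background) ones."""
    if window_level_db(audio) < LEVEL_THRESHOLD_DB:
        return None  # too quiet: treat as background, also saves power
    return classify(audio)
```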
1737+
1738+ In Urbansound8k, classification is done on 4-second intervals.
1739+ In a noise monitoring situation this granularity of information is possibly not needed.
1740+ For example, to understand temporal patterns across a day or week,
1741+ the predominant noise source(s) per 15 minutes or even per hour might
1742+ be a more suitable time-scale.
1743+ For sound sources with a relatively long duration (much more than 4 seconds),
1744+ such as children playing, drilling or street music, it should
1745+ be possible to achieve higher accuracy by combining many predictions over time, as sketched below.
1746+ However, this is unlikely to help for short, intermittent sounds ("events")
1747+ such as a car honk or a gun-shot.
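
A minimal sketch of such temporal aggregation, using majority voting over per-window predictions within each 15-minute interval (the (timestamp, class) input layout is a hypothetical assumption):

```python
# Sketch: aggregate per-window class predictions into the predominant
# sound source per 15-minute interval, via majority voting.
from collections import Counter
from typing import Dict, List, Tuple

INTERVAL_S = 15 * 60  # 15-minute summarization interval

def predominant_sources(predictions: List[Tuple[float, str]]) -> Dict[int, str]:
    """predictions: (unix_timestamp, predicted_class) per analysis window."""
    intervals: Dict[int, Counter] = {}
    for timestamp, label in predictions:
        bucket = int(timestamp // INTERVAL_S)
        intervals.setdefault(bucket, Counter())[label] += 1
    # The most common prediction wins within each interval
    return {bucket: counts.most_common(1)[0][0]
            for bucket, counts in intervals.items()}
```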
16851748
1686- ## Practical implications
1687-
1688- Accuracy when considering only foreground sounds improved significantly.
1689- Median improvement.
1690-
1691- Far from the state-of-the-art when not considering performance constraints
1692- Probably below human-level accuracy. Ref ESC-50
1749+ <!--
1750+ TODO: include error analysis
1751+ -->
16931752
16941753Before deployment in the field, a more systematic validation of the on-device performance must be performed.
16951754
1696- Classification is done on 4 second intervals (as that is what is available in Urbansound8k)
1697- In a noise monitoring situation this is probably way too fine grained.
1698- A detailed time-line might desire per minute resolution.
1699- Overall picture might be OK with 15 minute or maybe even hourly summarization.
1700- predominant sound source
1701-
17021755
17031756Is the easiest-to-classify sound also the loudest,
17041757i.e. the one contributing the most to the increased sound level?
17051758
17061759
1760+ <!--
17071761
17081762When considering the reduced 5-group classification.
17091763Some misclassifications are within a group of classes, and this increases accuracy.
17101764Example...
17111765However still have significant confusion for some groups...
17121766
1767+ -->
1768+
17131769
17141770<!--
17151771Almost reaching level of PiczakCNN[@SB-CNN] with data augmentation,
@@ -1757,15 +1813,6 @@ Utilizing larger amounts of training data might be able to increase performance
17571813Possible techniques for this are transfer learning [@PretrainingSpeechCommandRecognition],
17581814or applying stronger data augmentation techniques (such as Mixup [@Mixup] or SpecAugment [@SpecAugment]).
17591815
1760- <!--
1761- Low-power hardware accelerators for Convolutional Neural Networks will hopefully
1762- become available over the next few years.
1763- This may enable larger models at the same power budget,
1764- or to reduce power consumption at a given predictive performance level.
1765- End-to-end CNN models using raw audio as input becomes extra interesting with such a co-processor,
1766- since it allows also the filterbank processing to be offloaded from the general purpose CPU.
1767- -->
1768-
17691816In a practical deployment of on-sensor classification, it is still desirable to
17701817collect *some* data for evaluation of performance and further training.
17711818This could be sampled at random, but an on-sensor implementation