
Commit c3a759e

report: Bunch more writing
1 parent 2ea1b27 commit c3a759e

File tree

1 file changed (+80, -33 lines)


report/report.md

Lines changed: 80 additions & 33 deletions
@@ -1074,6 +1074,7 @@ performance increased to 83.7%, which seems to be state-of-the-art as of April 2
 
 
 ### Audio waveform models
+\label{section:audio-waveform-models}
 
 Recently approaches that use the raw audio waveform as input have also been documented.
 
@@ -1086,7 +1087,6 @@ They show that the resulting spectrograms have frequency responses with
 a shape similar to mel-spectrograms.
 The model achieves a 66.3% accuracy score on Urbansound8k[@EnvNet2] with raw audio input.
 
-
 In [@VeryDeepESC], authors evaluated a number of deep CNNs using only 1D convolutions.
 Raw audio with 8kHz sample rate was used as the input.
 Their 18 layer model (M18) got a 71% accuracy on Urbansound8k,
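
For illustration of what such a raw-waveform model looks like, below is a minimal Keras sketch using only strided 1D convolutions on 8kHz audio. This is not the M18 architecture from [@VeryDeepESC]; the layer count, filter sizes and strides are illustrative only.

```python
# Illustrative raw-audio 1D-CNN (not the M18 model from @VeryDeepESC).
# Input: ~4 seconds of 8 kHz mono audio, matching the Urbansound8k clip length.
from tensorflow import keras
from tensorflow.keras import layers

def raw_audio_cnn(samples=32000, n_classes=10):
    inputs = keras.Input(shape=(samples, 1))
    x = inputs
    # Strided 1D convolutions take the role of the filterbank + downsampling
    for filters in (16, 32, 64, 128):
        x = layers.Conv1D(filters, kernel_size=9, strides=4,
                          padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = raw_audio_cnn()
model.summary()
```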
@@ -1593,13 +1593,13 @@ they had accuracy scores that were 6.1 and 10.2 percentage points lower, respect
 
 `FIXME: change confusion matrix color scale to show nuances in 0-20% range`
 
-`TODO: plot MAC versus compute time`
+<!-- TODO: plot MAC versus compute time -->
 
 ![Confusion matrix on Urbansound8k](./results/confusion_test.png){ height=30% }
 
 ![Confusion matrix in reduced groups with only foreground sounds](./results/grouped_confusion_test_foreground.png){ height=30% }
 
-`TODO: add error analysis plots`
+<!-- TODO: add error analysis plots -->
 
 
 <!-- MAYBE: plot training curves over epochs -->
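
One possible way to address the color-scale FIXME above is to cap the colormap at 20%, so that small off-diagonal confusions remain distinguishable while the diagonal is allowed to saturate. A matplotlib sketch; `cm` and `classes` are placeholder names for the confusion-matrix counts and the label names.

```python
# Sketch: row-normalized confusion matrix with the color scale capped at 20%.
# `cm` is an (n_classes, n_classes) numpy array of counts, `classes` the label names.
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion(cm, classes):
    cm_norm = cm / cm.sum(axis=1, keepdims=True)  # fraction of each true class
    fig, ax = plt.subplots(figsize=(6, 6))
    im = ax.imshow(cm_norm, vmin=0.0, vmax=0.2, cmap="viridis")  # diagonal saturates
    fig.colorbar(im, ax=ax, label="fraction of true class (capped at 0.2)")
    ax.set_xticks(range(len(classes)))
    ax.set_xticklabels(classes, rotation=90)
    ax.set_yticks(range(len(classes)))
    ax.set_yticklabels(classes)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    fig.tight_layout()
    return fig
```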
@@ -1616,15 +1616,17 @@ they had accuracy scores that were 6.1 and 10.2 percentage points lower, respect
 
 The model used on device (with 16kHz model with 30 mel filters)
 scored 72% on the associated validation-set, fold 9.
+When running on the device, the model execution took 43 ms per analysis window,
+while preprocessing of the mel-spectrogram took approximately 60 ms.
 
 Figure \ref{figure:demo} shows a closeup of the on-device testing scenario.
 When playing back a few sounds the system was able to
 correctly classify classes such as "dog barking" most of the time.
 The classes "jackhammer" and "drilling" were confused several times (in both directions),
 but these were often hard to distinguish by ear also.
 The system seemed to struggle with the "children playing" class.
-When not playing any sound, the GPU fan noise from the nearby computer
-was classified as "air conditioner" - which sounded pretty close.
+When not playing any sound, the GPU fan noise from the nearby machine-learning rig
+was classified as "air conditioner" - which the author can agree sounded pretty close.
 
 
 \newpage
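
As a back-of-the-envelope check of these timings, assuming the CPU only wakes once per 720 ms analysis window to compute the mel-spectrogram and run inference, and is otherwise idle:

```python
# Rough duty-cycle estimate for the on-device model, using the measured times above.
window_ms = 720      # analysis window length (real-time requirement)
preprocess_ms = 60   # measured mel-spectrogram preprocessing time
inference_ms = 43    # measured model execution time

busy = (preprocess_ms + inference_ms) / window_ms
print(f"CPU busy: {busy:.1%}, potential sleep: {1 - busy:.1%}")
# CPU busy: 14.3%, potential sleep: 85.7%
```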
@@ -1663,53 +1665,107 @@ This hyperparameter was set to a seemingly conservative reduction of 2x
 It is possible that this choice of hyperparameter is critical and that other values
 would have performed better, but this has not been investigated.
 
-
 Of the models compared it looks like the Strided-DS family of models
 give the highest accuracy relative to model compute requirements.
 The largest model, Strided-DS-24, was able to achieve near Baseline performance while utilizing 12x less CPU.
 The CPU usage of this model is 11%, well within the 50% set as a requirement,
-allowing the microcontroller to sleep for up to 7/ amounts of time even
-when classifications are performed for every 720ms block (real-time).
+allowing the microcontroller to sleep for up to 80% of the time even
+when classifications are performed for every 720 ms window (real-time).
 
 The smaller models (with 20,16,12 filters) in the family with less compute requirements had correspondingly
 lower accuracies, suggesting that a tradeoff between model requirements and performance is possible.
-
 The Strided-DS-3x3 variation with 4 layers with 3x3 convolutions instead
 was close in performance to the Strided-DS models with 3 layers of 5x5.
+This could be investigated more closely; there may exist variations of this 3x3 model
+that would perform better than the 5x5 models.
 
-The on-device model which was trained on 16kHz with 30 mel filters (on a single fold),
-looked to perform similarly to those with the full 22kHz and 60 mel-filters.
+From a one-fold spot check, the on-device model trained on
+a 16kHz sample rate with 30 mel filters looked to perform similarly to those with the full 22kHz and 60 mel-filters.
 This may suggest that perhaps the feature representation (and thus compute requirements)
 can be reduced even further without much reduction in performance.
 
+## Spectrogram processing time
+
+Interestingly, the mel-feature preprocessing took 60 ms on device,
+which is on the same order as the efficient models during inference (38-81 ms).
+This means that the CPU bottleneck is not just the model inference time,
+but that the spectrogram calculation must also be optimized to reach even lower power consumption.
+In the FP-SENSING1 example used, the spectrogram computation already uses ARM-specific
+optimized codepaths from CMSIS, albeit with floating-point and not fixed-point arithmetic.
+
+This is an opportunity for end-to-end models
+that take raw audio as input instead of requiring preprocessed spectrograms
+(ref section \ref{section:audio-waveform-models}),
+as they might be able to do this more efficiently.
+When low-power hardware accelerators for Convolutional Neural Networks become available,
+an end-to-end CNN model will become extra interesting,
+as it would allow also the filterbank processing to be offloaded to the CNN co-processor.
+
+
+## Practical evaluation
+
+Deployments of noise monitoring systems,
+and especially systems with noise classification capabilities, are still rare.
+Of the 5 large-scale research projects mentioned in the introduction,
+only the SONYC deployment seems to have some level of noise classification capability.
+Therefore, answering the question of whether the 70% accuracy achieved on Urbansound8k,
+or even the state-of-the-art accuracy of 83%,
+is sufficient for a useful real-world noise classification system is hard.
+
+From a critical perspective, 70.9% on Urbansound8k is likely below human-level performance.
+While no studies have been done on human-level performance on Urbansound8k directly,
+it is estimated to be 81.3% for ESC-50 and 95.7% for ESC-10[@ESC-50, ch 3.1],
+and PiczakCNN, which scored 73% on Urbansound8k, scored only 62% on ESC-50 and 81% on ESC-10.
+
+From an optimistic perspective, today the vast majority of cities do not use widespread
+noise monitoring equipment.
+So *any* sensor with sound-level monitoring and even rudimentary classification capabilities
+would be adding new information that could potentially be of use.
+The key to successful application is to design a system and practices
+which make use of this, taking into account the limitations of the information.
+
+From table \ref{table:results} it can be seen that the accuracy for
+foreground sounds is around 5 percentage points better than the overall accuracy,
+reaching above 75%.
+Background sounds, on the other hand, have a much lower accuracy,
+with the best models under 62%, an 8 percentage point drop (or more).
+This is expected since the signal to noise ratio is lower.
+If the information of interest is the predominant sound in an area
+close to the sensor, one could maybe take this into account by only
+classifying loud (and probably closer) sounds,
+in order to achieve higher precision.
+
+In Urbansound8k classification is done on 4 second intervals.
+In a noise monitoring situation this granularity of information is possibly not needed.
+For example, to understand temporal patterns across a day or week,
+information about the predominant noise source(s) per 15 minutes or even per hour might
+be a more suitable time-scale.
+For sound sources with a relatively long duration (much more than 4 seconds),
+such as children playing, drilling or street music, it should
+be possible to achieve higher accuracy by combining many predictions over time.
+However, this is unlikely to help for short, intermittent sounds ("events")
+such as a car honk or a gun-shot.
 
-## Practical implications
-
-Accuracy when considering only foreground sounds improved significantly.
-Median improvement.
-
-Far from the state-of-the-art when not considering performance constraints
-Probably below human-level accuracy. Ref ESC-50
+<!--
+TODO: include error analysis
+-->
 
 Before deployment in the field, a more systematic validation of on-device must be performed.
 
-Classification is done on 4 second intervals (as that is what is available in Urbansound8k)
-In a noise monitoring situation this is probably way too fine grained.
-A detailed time-line might desire per minute resolution.
-Overall picture might be OK with 15 minute or maybe even hourly summarization.
-predominant sound source
-
 
 Is the easiest-to-classify sound is the loudest
 / contributing the most to the increased sound level?
 
 
+<!--
 
 When considering the reduced 5-group classification.
 Some misclassifications are within a group of classes, and this increases accuracy.
 Example...
 However still have significant confusion for some groups...
 
+-->
+
 
 <!--
 Almost reaching level of PiczakCNN[@SB-CNN] with data augmentation,
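
A sketch of the aggregation idea from the practical evaluation above: summarize per-window predictions into a single predominant sound source for a longer reporting period, optionally keeping only loud (foreground-like) windows. The data layout and the 60 dB threshold are illustrative assumptions, not taken from the report.

```python
# Sketch: reduce per-window classifications to the predominant class of a period
# (e.g. 15 minutes), optionally discarding quiet windows first.
# `windows` is assumed to be a list of (sound_level_db, class_probabilities) tuples,
# one per 720 ms analysis window; the 60 dB threshold is purely illustrative.
import numpy as np

def predominant_class(windows, level_threshold_db=60.0):
    kept = [probs for level, probs in windows if level >= level_threshold_db]
    if not kept:
        return None  # nothing loud enough to classify in this period
    mean_probs = np.mean(kept, axis=0)  # average probabilities over kept windows
    return int(np.argmax(mean_probs))
```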
@@ -1757,15 +1813,6 @@ Utilizing larger amounts of training data might be able to increase performance
 Possible techniques for this are transfer learning[@PretrainingSpeechCommandRecognition],
 or applying stronger data augmentation techniques (such as Mixup[Mixup] or SpecAugment[@SpecAugment]).
 
-<!--
-Low-power hardware accelerators for Convolutional Neural Networks will hopefully
-become available over the next few years.
-This may enable larger models at the same power budget,
-or to reduce power consumption at a given predictive performance level.
-End-to-end CNN models using raw audio as input becomes extra interesting with such a co-processor,
-since it allows also the filterbank processing to be offloaded from the general purpose CPU.
--->
-
 In a practical deployment of on-sensor classification, it is still desirable to
 collect *some* data for evaluation of performance and further training.
 This could be sampled at random, but an on-sensor implementation
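
One hypothetical way to collect a bounded, uniformly random sample of analysis windows on-sensor is reservoir sampling; a sketch, not necessarily the implementation the report has in mind.

```python
# Sketch: reservoir sampling keeps a uniform random subset of all windows seen so far
# while never storing more than `capacity` of them - one possible fit for a fixed
# on-sensor storage budget. Purely illustrative.
import random

class Reservoir:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0

    def offer(self, window):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(window)
        else:
            j = random.randrange(self.seen)  # uniform index over all windows seen
            if j < self.capacity:
                self.items[j] = window
```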
