average inference time per sample on the STM32L476 microcontroller.
This accounts for potential variations in the number of MACC/second for different models,
which would be ignored if only relying on the theoretical MACC number.

Finally, the trained models were tested running on the microcontroller, using live audio from the microphone.
The on-device test used example code from the ST FP-SENSING1[@FP-AI-SENSING1] function pack as a base,
with modifications made to send the model predictions out over USB.
The example code unfortunately only supports mel-spectrogram preprocessing
with a 16 kHz sample rate, 30 mel filters and a 1024-sample FFT window with 512-sample hop,
using max-normalization of the analysis windows.
Therefore a Strided-DS model was trained on folds 1-8 to match these feature settings.

The on-device testing was done ad hoc with a few samples from Freesound.org,
as a sanity check that the model remained functional when run on the microcontroller.
No systematic measurements of performance were made.
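
For reference, the feature extraction corresponding to these settings can be expressed with librosa.
This is a minimal sketch of the host-side equivalent, not the firmware code;
the input file name is hypothetical and `ref=np.max` is one interpretation of the max-normalization:

```python
import numpy as np
import librosa

def melspec_features(samples, sr=16000):
    """Log mel-spectrogram matching the FP-SENSING1 feature settings:
    16 kHz sample rate, 30 mel filters, 1024-sample FFT window, 512-sample hop."""
    mels = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=30,
                                          n_fft=1024, hop_length=512)
    # Max-normalization: express levels relative to the loudest bin in the window
    return librosa.power_to_db(mels, ref=np.max)

# Hypothetical input file, for illustration
audio, sr = librosa.load('dog_bark.wav', sr=16000, mono=True)
features = melspec_features(audio, sr)
```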

\newpage
# Results

\begin{figure}[h]
\centering
\includegraphics[width=1.0\textwidth]{./results/models_accuracy.png}
\caption[Test accuracy of the different models]{Test accuracy of the different models.
State-of-the-art averages (SB-CNN/LD-CNN and D-CNN) marked with green dots.
No-information rate marked with black dots.}
\label{figure:models-accuracy}
\end{figure}

\begin{table}[h]
FG=Foreground samples only, BG=Background samples only.}
\label{table:results}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[width=1.0\textwidth]{./results/models_efficiency.png}
\caption[Accuracy versus compute of different models]{Accuracy versus compute of different models.
Variations of the same model family have the same color.
Strided- has been shortened to S- for readability.}
\label{figure:model-efficiency}
\end{figure}

As seen in Table \ref{table:results} and Figure \ref{figure:models-accuracy},
the Baseline model achieves 72.3% mean accuracy.
This is the same level as SB-CNN and PiczakCNN without data-augmentation (73%)[@SB-CNN],
but significantly below the 79% of SB-CNN and LD-CNN with data-augmentation.
As expected, the Baseline uses more CPU than our requirements allow,
with a 971 ms classification time per 720 ms analysis window.

Strided-DS with 70.9% mean accuracy is able to get quite close to the Baseline performance,
despite having (from Table \ref{table:models}) $10185/477 \approx 21x$ fewer multiply-add operations (MACC).
The practical efficiency gain in CPU usage is however only $971/81 \approx 12x$.

Strided-BTLN-DS and Strided-Effnet performed very poorly in comparison.
This can be seen most clearly in Figure \ref{figure:model-efficiency}.
Despite almost the same computational requirements as Strided-DS-24,
they had accuracy scores that were 6.1 and 10.2 percentage points lower, respectively.

`FIXME: change confusion matrix color scale to show nuances in 0-20% range`

`TODO: plot MAC versus compute time`

![Confusion matrix on Urbansound8k](./results/confusion_test.png){height=30%}

![Confusion matrix in reduced groups with only foreground sounds](./results/grouped_confusion_test_foreground.png){height=30%}
\label{figure:demo}
\end{figure}

The model used on-device (trained at 16 kHz with 30 mel filters)
scored 72% on the associated validation set, fold 9.

Figure \ref{figure:demo} shows a close-up of the on-device testing scenario.
When playing back a few sounds the system was able to
correctly classify classes such as "dog barking" most of the time.
The classes "jackhammer" and "drilling" were confused several times (in both directions),
but these were often hard to distinguish by ear as well.
The system seemed to struggle with the "children playing" class.
When not playing any sound, the GPU fan noise from a nearby computer
was classified as "air conditioner" - which indeed sounded quite similar.
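
On the host side, logging the predictions can be as simple as reading lines from the serial port.
A minimal sketch using pyserial; the port name and line-based output format depend
on the firmware modifications and are assumptions here:

```python
import serial  # pyserial

# Assumed port name and message format for the modified FP-SENSING1 firmware
with serial.Serial('/dev/ttyACM0', baudrate=115200, timeout=1.0) as port:
    while True:
        line = port.readline().decode('ascii', errors='replace').strip()
        if line:
            print(line)  # e.g. one predicted class per analysis window
```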

\newpage
# Discussion


## Model comparison

The lower performance of our Baseline relative to SB-CNN/LD-CNN
may be a result of the reduced feature representation,
or the reduced number of predictions for one clip.
Compared to LD-CNN the delta-mel-spectrogram features are missing;
these might have made it easier to learn some patterns,
but at a cost of twice the RAM and CPU for the first layer.
Compared to SB-CNN the analysis window is shorter (720 ms versus 1765 ms),
also a 2x reduction in RAM and CPU.
Since no overlap is used there are only 6 analysis windows and predictions to be aggregated over a 4 second clip.
In LD-CNN and SB-CNN, `FIXME: find out how much overlap they use`.
However it is possible that with a more powerful training setup,
such as transfer learning or a stronger data augmentation scheme,
this gap could be reduced.
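
For clarity, the aggregation of per-window predictions into one clip-level prediction
can be sketched as follows. Mean-pooling of class probabilities (probability voting)
is assumed here; the exact voting scheme is not restated in this section:

```python
import numpy as np

def clip_prediction(window_probs):
    """Aggregate per-window class probabilities into one clip prediction.
    window_probs: shape (n_windows, n_classes), e.g. (6, 10) for a
    4 second Urbansound8k clip split into 6 non-overlapping windows."""
    mean_probs = np.mean(window_probs, axis=0)  # probability voting
    return int(np.argmax(mean_probs))
```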

<!--
Strided-DS-24 is essentially a combination of the
two model reductions tested individually in Baseline-DS and Stride.
Therefore it is somewhat surprising that Strided-DS-24 has a slightly higher mean than these two.
However since the amount of variation in accuracy across the folds is large,
and the hyperparameters were chosen by testing on Strided-DS models,
we cannot conclude that this is a significant effect.
-->

The poorly performing Strided-BTLN-DS and Strided-Effnet both have a bottleneck 1x1
convolution at the start of each block, reducing the number of channels used in the spatial convolution.
This hyperparameter was set to a seemingly conservative reduction of 2x
(the original Effnet used 8x[@Effnet], ShuffleNet used 4x[@Shufflenet], albeit on much bigger models).
It is possible that this choice of hyperparameter is critical and that other values
would have performed better, but this has not been investigated.
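
To illustrate the structural difference, the two block types can be sketched in Keras.
Activations and normalization layers are simplified; the exact layer configuration
of the thesis models is not reproduced here:

```python
from tensorflow.keras import layers

def strided_ds_block(x, filters):
    """Depthwise-separable 5x5 convolution with stride 2 (Strided-DS style)."""
    x = layers.DepthwiseConv2D(kernel_size=5, strides=2, padding='same')(x)
    x = layers.Conv2D(filters, kernel_size=1, activation='relu')(x)  # pointwise
    return x

def bottleneck_ds_block(x, filters, reduction=2):
    """Bottlenecked variant (Strided-BTLN-DS style): a 1x1 convolution first
    reduces channels by `reduction`, so the spatial convolution sees fewer channels."""
    x = layers.Conv2D(filters // reduction, kernel_size=1, activation='relu')(x)
    x = layers.DepthwiseConv2D(kernel_size=5, strides=2, padding='same')(x)
    x = layers.Conv2D(filters, kernel_size=1, activation='relu')(x)
    return x
```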

Of the models compared, the Strided-DS family of models
gives the highest accuracy relative to model compute requirements.
The largest model, Strided-DS-24, was able to achieve near-Baseline performance while utilizing 12x less CPU.
The CPU usage of this model is 11%, well within the 50% set as a requirement,
allowing the microcontroller to sleep almost 90% of the time even
when classifications are performed for every 720 ms block (real-time).

The smaller models in the family (with 20, 16 and 12 filters) required correspondingly
less compute and had lower accuracies, suggesting that a tradeoff between model requirements and performance is possible.

The Strided-DS-3x3 variation, with 4 layers of 3x3 convolutions instead,
was close in performance to the Strided-DS models with 3 layers of 5x5 convolutions.

The on-device model, which was trained at 16 kHz with 30 mel filters (on a single fold),
appeared to perform similarly to those using the full 22 kHz and 60 mel filters.
This may suggest that the feature representation (and thus the compute requirements)
can be reduced even further without much reduction in performance.

## Practical implications

Accuracy when considering only foreground sounds improved significantly.
Median improvement.

Performance is far from the state-of-the-art when compute constraints are not considered,
and probably below human-level accuracy (ref. ESC-50).

Before deployment in the field, a more systematic validation of the on-device performance must be performed.

Classification is done on 4-second intervals (as that is what is available in Urbansound8k).
In a noise monitoring situation this is probably too fine-grained.
predominant sound source
Is the easiest-to-classify sound the loudest,
contributing the most to the increased sound level?

When considering the reduced 5-group classification,
some misclassifications are within a group of related classes, and this increases accuracy.
Example...
However there is still significant confusion for some groups...

<!--
Almost reaching level of PiczakCNN[@SB-CNN] with data augmentation,
and better than without data augmentation[@PiczakCNN].
With estimated 88M MAC/s, a factor 200x more.
Indicator of huge differences in efficiency between different CNN architectures
-->

# Conclusions

Based on the need for wireless sensor systems that can monitor and classify environmental noise,
this project has investigated performing noise classification directly on microcontroller-based sensor hardware.
On-sensor classification makes it possible to reduce the power consumption and privacy issues
associated with transmitting raw audio or detailed audio fingerprints to a cloud system for classification.

Several different Convolutional Neural Networks were designed for the
STM32L476 low-power microcontroller using the vendor-provided X-CUBE-AI inference engine.
The models were evaluated on the Environmental Sound Classification
task using the standard Urbansound8k dataset, and briefly validated for real-time classification on device.
The best models used Depthwise-Separable convolutions with striding,
and were able to reach up to 70.9% mean accuracy while consuming only 11% CPU,
staying within the predefined 50% RAM and FLASH storage budgets.
To our knowledge, this is the highest reported performance on Urbansound8k on a microcontroller.

`FIXME: one sentence about perf level`

This indicates that it is computationally feasible to classify environmental sound
on affordable low-power microcontrollers,
possibly enabling advanced noise monitoring sensor networks with low costs and high density.
Further investigations into the power consumption and practical considerations
of on-edge Environmental Sound Classification using microcontrollers are warranted.

## Further work

Applying quantization to the models should reduce CPU, RAM and FLASH usage.
This could be used to fit slightly larger models, or to make existing models more efficient.
A first step could be to make use of the optimized CMSIS-NN library[@CMSIS-NN],
which utilizes 8-bit integer operations and the SIMD unit in the ARM Cortex-M4F.
However there are also promising results showing that CNNs can be
effectively implemented with as little as 2 bits[@andri2016yodann][@miyashita2016convolutional][@IncrementalNetworkQuantization],
and without using any multiplications[@leng2018extremely][@cintra2018low].
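
As an illustration of the basic idea, symmetric linear int8 quantization of a weight
tensor can be sketched in a few lines. This is a generic scheme for illustration only;
the actual CMSIS-NN Q-format conversion differs in detail:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of float32 weights to int8."""
    scale = max(np.max(np.abs(w)), 1e-8) / 127.0
    q = np.round(w / scale).clip(-127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(8, 8).astype(np.float32)
q, s = quantize_int8(w)
max_error = np.max(np.abs(dequantize(q, s) - w))  # bounded by s/2
```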

Utilizing larger amounts of training data might increase the performance of the models.
Possible techniques for this are transfer learning[@PretrainingSpeechCommandRecognition],
or applying stronger data augmentation techniques (such as Mixup[@Mixup] or SpecAugment[@SpecAugment]).

<!--
Low-power hardware accelerators for Convolutional Neural Networks will hopefully
become available over the next few years.
since it allows also the filterbank processing to be offloaded from the general
-->

In a practical deployment of on-sensor classification, it is still desirable to
collect *some* data for evaluation of performance and further training.
This could be sampled at random, but an on-sensor implementation
of Active Learning[@ActiveLearningSonyc][@SemiSupervisedActiveLearning]
could make this process more power-efficient.

<!--
Normally such training and evaluation data is transferred as raw PCM audio,
Could low-power audio coding be applied to compress the data,
while still enabling reliable human labeling and use as evaluation/training data?
-->

It is critical for overall power consumption to reduce how often on-sensor classification is performed.
This should also benefit from an adaptive sampling strategy,
for example to primarily do classification for time periods which exceed
a sound level threshold, or to sample less often when the sound source changes slowly.
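
A minimal sketch of such a level-triggered strategy, assuming framed audio and
a placeholder threshold value:

```python
import numpy as np

def frame_level_db(frame, floor=1e-10):
    """RMS level of an audio frame in dB relative to full scale."""
    rms = np.sqrt(np.mean(np.square(frame)))
    return 20.0 * np.log10(max(rms, floor))

def classify_if_loud(frame, classify, threshold_db=-40.0):
    """Run the (expensive) classifier only when the sound level exceeds a
    threshold, letting the sensor sleep through quiet periods.
    The -40 dBFS threshold is an arbitrary placeholder."""
    if frame_level_db(frame) >= threshold_db:
        return classify(frame)
    return None
```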