
Commit 2ea1b27

report: Most of conclusion in
1 parent be250a0 commit 2ea1b27

1 file changed

report/report.md

Lines changed: 123 additions & 99 deletions
@@ -1532,20 +1532,28 @@ average inference time per sample on the STM32L476 microcontroller.
This accounts for potential variations in number of MACC/second for different models,
which would be ignored if only relying on the theoretical MACC number.

Finally, the trained models were tested running on the microcontroller using live audio from the microphone.
The on-device test used example code from the ST FP-SENSING1[@FP-AI-SENSING1] function pack as a base,
with modifications made to send the model predictions out over USB.
The example code unfortunately only supports mel-spectrogram preprocessing
with a 16 kHz sample rate, 30 filters and a 1024-sample FFT window with 512-sample hop,
using max-normalization for the analysis windows.
Therefore a Strided-DS model was trained on folds 1-8 to match these feature settings.
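
For reference, the feature extraction can be expressed as follows on the training side.
This is a minimal sketch using librosa; the exact scaling of the FP-SENSING1 DSP pipeline is an assumption.

```python
# Sketch of the mel-spectrogram settings used for the on-device model,
# expressed with librosa for training-side feature extraction.
# The exact scaling of the FP-SENSING1 DSP code is an assumption here.
import numpy as np
import librosa

def extract_features(y, sr=16000):
    mels = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_mels=30,        # 30 mel filters
        n_fft=1024,       # 1024-sample FFT window
        hop_length=512,   # 512-sample hop
    )
    # Max-normalization per analysis window
    return mels / np.max(mels)
```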

The on-device testing was done ad-hoc with a few samples from Freesound.org,
as a sanity-check that the model remained functional when run on the microcontroller.
No systematic measurements of performance were performed.
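
A hypothetical sketch of a host-side listener for the predictions sent over USB
(the serial port name, baud rate and message format are assumptions):

```python
# Hypothetical host-side listener for the predictions the modified
# firmware sends over USB. Port name and message format are assumed.
import serial

with serial.Serial('/dev/ttyACM0', 115200, timeout=1.0) as port:
    while True:
        line = port.readline().decode('utf-8', errors='replace').strip()
        if line:
            print(line)  # e.g. one predicted class per analysis window
```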

\newpage
# Results

\begin{figure}[h]
\centering
\includegraphics[width=1.0\textwidth]{./results/models_accuracy.png}
\caption[Test accuracy of the different models]{Test accuracy of the different models.
State-of-the-art averages (SB-CNN/LD-CNN and D-CNN) marked with green dots.
No-information rate marked with black dots.}
\label{figure:models-accuracy}
\end{figure}
\begin{table}[h]
@@ -1558,33 +1566,35 @@ FG=Foreground samples only, BG=Background samples only.}
\label{table:results}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[width=1.0\textwidth]{./results/models_efficiency.png}
\caption[Accuracy versus compute of different models]{Accuracy versus compute of different models.
Variations of the same model family have the same color.
Strided- has been shortened to S- for readability.}
\label{figure:model-efficiency}
\end{figure}

As seen in Table \ref{table:results} and Figure \ref{figure:models-accuracy},
the Baseline model gets 72.3% mean accuracy.
This is the same level as SB-CNN and PiczakCNN without data-augmentation (73%)[@SB-CNN],
but significantly below the 79% of SB-CNN and LD-CNN with data-augmentation.
As expected, the Baseline uses more CPU than our requirements allow,
with 971 ms classification time per 720 ms analysis window.

Strided-DS-24 with 70.9% mean accuracy is able to get quite close to the baseline performance,
despite having (from Table \ref{table:models}) $10185/477 = 21x$ fewer multiply-add operations (MACC).
The practical efficiency gain in CPU usage is, however, only $971/81 = 12x$.

Strided-BTLN-DS and Strided-Effnet performed very poorly in comparison.
This can be seen most clearly in Figure \ref{figure:model-efficiency}.
Despite almost the same computational requirements as Strided-DS-24,
they had accuracy scores that were 6.1 and 10.2 percentage points lower, respectively.

`FIXME: change confusion matrix color scale to show nuances in 0-20% range`
`TODO: plot MAC versus compute time`
![Confusion matrix on Urbansound8k](./results/confusion_test.png){ height=30% }
![Confusion matrix in reduced groups with only foreground sounds](./results/grouped_confusion_test_foreground.png){ height=30% }
@@ -1604,73 +1614,84 @@ Strided- has been shortened to S- for readability.}
\label{figure:demo}
\end{figure}

The model used on device (16 kHz sample rate, 30 mel filters)
scored 72% on the associated validation set, fold 9.

Figure \ref{figure:demo} shows a closeup of the on-device testing scenario.
When playing back a few sounds, the system was able to
correctly classify classes such as "dog barking" most of the time.
The classes "jackhammer" and "drilling" were confused several times (in both directions),
but these were often hard to distinguish by ear as well.
The system seemed to struggle with the "children playing" class.
When not playing any sound, the GPU fan noise from the nearby computer
was classified as "air conditioner" - which did sound quite similar.

\newpage
# Discussion


## Model comparison

The lower performance of our Baseline relative to SB-CNN/LD-CNN
may be a result of the reduced feature representation,
or the reduced number of predictions for one clip.
Compared to LD-CNN, the delta-mel-spectrogram features are missing;
these might make it easier to learn some patterns of fluctuation,
at a cost of twice the RAM and CPU for the first layer.
Compared to SB-CNN, the analysis window is shorter (720 ms versus 1765 ms),
also roughly a 2x reduction in RAM and CPU.
Since no overlap is used, there are only 6 analysis windows and predictions to be aggregated over a 4 second clip.
In LD-CNN and SB-CNN, `FIXME: find out how much overlap they use`.
However it is possible that this gap could be reduced with a more powerful training setup,
such as transfer learning or a stronger data augmentation scheme.
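
To illustrate the aggregation step mentioned above, a minimal sketch follows.
Mean-voting over the per-window probabilities is an assumption; other aggregation functions are possible.

```python
# Sketch of aggregating per-window predictions over a clip.
# Mean-voting is an assumption; the aggregation function could differ.
import numpy as np

def predict_clip(window_probabilities):
    # window_probabilities: shape (n_windows, n_classes), e.g. (6, 10)
    # for six 720 ms analysis windows over a 4 second Urbansound8k clip
    mean_probs = np.mean(window_probabilities, axis=0)
    return int(np.argmax(mean_probs))
```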
<!--
Strided-DS-24 is essentially a combination of the
two model reductions tested individually in Baseline-DS and Stride.
Therefore it is somewhat surprising that Strided-DS-24 has a slightly higher mean than these two.
However since the amount of variation in accuracy across the folds is large,
and the hyperparameters were chosen by testing on Strided-DS models,
we cannot conclude that this is a significant effect.
-->
The poorly performing Strided-BTLN-DS and Strided-Effnet both have a bottleneck 1x1
convolution at the start of each block, reducing the number of channels used in the spatial convolution.
This hyperparameter was set to a seemingly conservative reduction of 2x
(the original Effnet used 8x[@Effnet], ShuffleNet used 4x[@Shufflenet], albeit on much bigger models).
It is possible that this choice of hyperparameter is critical and that other values
would have performed better, but this has not been investigated.
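
For concreteness, minimal Keras sketches of a plain strided Depthwise-Separable block
and the bottleneck variant are shown below.
Filter counts, kernel sizes and exact layer ordering are illustrative assumptions, not the exact architectures.

```python
# Illustrative Keras sketches of a strided depthwise-separable block
# (Strided-DS style) and its bottleneck variant (Strided-BTLN-DS style).
# Normalization and activation placement are assumptions.
from tensorflow.keras import layers

def strided_ds_block(x, filters=24, kernel=(5, 5), strides=(2, 2)):
    # Depthwise spatial convolution with striding (replaces pooling)
    x = layers.DepthwiseConv2D(kernel, strides=strides, padding='same')(x)
    # Pointwise 1x1 convolution mixes the channels
    x = layers.Conv2D(filters, (1, 1), padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def bottleneck_ds_block(x, filters=24, bottleneck=2, kernel=(5, 5), strides=(2, 2)):
    # 1x1 bottleneck halves the channels before the spatial convolution
    x = layers.Conv2D(filters // bottleneck, (1, 1), padding='same')(x)
    x = layers.DepthwiseConv2D(kernel, strides=strides, padding='same')(x)
    # 1x1 convolution expands the channels back
    x = layers.Conv2D(filters, (1, 1), padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```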
Of the models compared, it looks like the Strided-DS family of models
gives the highest accuracy relative to model compute requirements.
The largest model, Strided-DS-24, was able to achieve near-Baseline performance while utilizing 12x less CPU.
The CPU usage of this model is 11%, well within the 50% set as a requirement,
allowing the microcontroller to sleep for up to 89% of the time even
when classifications are performed for every 720 ms block (real-time).
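
The CPU load follows directly from the measured inference time per analysis window
(81 ms for Strided-DS-24, from the 12x gain over the 971 ms Baseline):

$$ \text{CPU load} = \frac{81\ \text{ms}}{720\ \text{ms}} \approx 11\% $$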
The smaller models in the family (with 20, 16 and 12 filters) had lower compute requirements and correspondingly
lower accuracies, suggesting that a tradeoff between model requirements and performance is possible.
The Strided-DS-3x3 variation, with 4 layers of 3x3 convolutions instead of 3 layers of 5x5,
was close in performance to the Strided-DS models.
The on-device model, which was trained on 16 kHz audio with 30 mel filters (on a single fold),
appeared to perform similarly to those using the full 22 kHz and 60 mel filters.
This may suggest that the feature representation (and thus the compute requirements)
can be reduced even further without much reduction in performance.
## Practical implications

Accuracy when considering only foreground sounds improved significantly
(median improvement).
The overall accuracy is far from the state-of-the-art when not considering performance constraints,
and probably below human-level accuracy (ref. ESC-50).
Before deployment in the field, a more systematic validation of on-device performance must be performed.
Classification is done on 4 second intervals (as that is what is available in Urbansound8k).
In a noise monitoring situation this is probably far too fine-grained.
@@ -1682,57 +1703,60 @@ predominant sound source
Is the easiest-to-classify sound the loudest,
i.e. the one contributing the most to the increased sound level?
When considering the reduced 5-group classification,
some misclassifications fall within a group of classes, which increases accuracy.
Example...
However there is still significant confusion for some groups...
<!--
Almost reaching the level of PiczakCNN[@SB-CNN] with data augmentation,
and better than without data augmentation[@PiczakCNN].
With an estimated 88M MACC/s, a factor 200x more.
An indicator of huge differences in efficiency between different CNN architectures.
-->

# Conclusions

Based on the need for wireless sensor systems that can monitor and classify environmental noise,
this project has investigated performing noise classification directly on microcontroller-based sensor hardware.
This on-sensor classification makes it possible to reduce the power consumption and privacy issues
associated with transmitting raw audio or detailed audio fingerprints to a cloud system for classification.
Several different Convolutional Neural Networks were designed for the
STM32L476 low-power microcontroller using the vendor-provided X-CUBE-AI inference engine.
The models were evaluated on the Environmental Sound Classification
task using the standard Urbansound8k dataset, and briefly validated for real-time classification on device.
The best models used Depthwise-Separable convolutions with striding,
and were able to reach up to 70.9% mean accuracy while consuming only 11% CPU
and staying within the predefined budgets of 50% RAM and FLASH storage.
To our knowledge, this is the highest reported performance on Urbansound8k on a microcontroller.
`FIXME: one sentence about perf level`
This indicates that it is computationally feasible to classify environmental sound
on affordable low-power microcontrollers,
possibly enabling advanced noise monitoring sensor networks with low costs and high density.
Further investigations into the power consumption and practical considerations
of on-edge Environmental Sound Classification using microcontrollers are warranted.

## Further work

Applying quantization to the models should reduce CPU, RAM and FLASH usage.
This could be used to fit slightly larger models, or to make existing models more efficient.
A first step could be to make use of the optimized CMSIS-NN library[@CMSIS-NN],
which utilizes 8-bit integer operations and the SIMD unit in the ARM Cortex-M4F.
However, there are also promising results showing that CNNs can be
effectively implemented with as little as 2 bits[@andri2016yodann][@miyashita2016convolutional][@IncrementalNetworkQuantization],
and without using any multiplications[@leng2018extremely][@cintra2018low].
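
As one possible route to 8-bit integer inference, a hypothetical sketch using TensorFlow Lite
post-training quantization is shown below. The project used X-CUBE-AI for deployment,
so this toolchain and the `model`/`representative_windows` names are assumptions.

```python
# Hypothetical sketch of post-training 8-bit quantization with TensorFlow Lite.
# The project deployed with X-CUBE-AI; this only illustrates the general technique.
import tensorflow as tf

# 'model' is an already-trained Keras model; 'representative_windows' is a
# generator yielding [input] batches of typical mel-spectrogram windows.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_windows
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
```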

Utilizing larger amounts of training data might increase the performance of the models.
Possible techniques for this are transfer learning[@PretrainingSpeechCommandRecognition],
or applying stronger data augmentation techniques (such as Mixup[@Mixup] or SpecAugment[@SpecAugment]).

<!--
Low-power hardware accelerators for Convolutional Neural Networks will hopefully
become available over the next few years.
@@ -1744,9 +1768,9 @@ since it allows also the filterbank processing to be offloaded from the general

In a practical deployment of on-sensor classification, it is still desirable to
collect *some* data for evaluation of performance and further training.
This could be sampled at random, but an on-sensor implementation
of Active Learning[@ActiveLearningSonyc][@SemiSupervisedActiveLearning]
could make this process more power-efficient.

<!--
Normally such training and evaluation data is transferred as raw PCM audio,
@@ -1755,7 +1779,7 @@ Could low-power audio coding be applied to compress the data,
while still enabling reliable human labeling and use as evaluation/training data?
-->

It is critical for overall power consumption to reduce how often on-sensor classification is performed.
This should also benefit from an adaptive sampling strategy,
for example to primarily do classification for time periods which exceed
a sound level threshold, or to sample less often when the sound source changes slowly.
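
A minimal sketch of such a threshold-gated policy follows; the threshold value and function names are hypothetical.

```python
# Sketch of a threshold-gated adaptive sampling policy (assumed, not from the report):
# run the (expensive) CNN inference only when the sound level warrants it.
TRIGGER_LEVEL_DB = 65.0  # hypothetical sound level threshold

def maybe_classify(level_db, audio_window, classify):
    # classify() wraps the on-device CNN inference
    if level_db >= TRIGGER_LEVEL_DB:
        return classify(audio_window)
    return None  # below threshold: skip inference, stay in low-power sleep
```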
