
Commit ac94b35

mlodyjesienin, chmjkb and msluszniak authored
Voice activity detection model (#625)
## Description

This PR introduces the voice activity detection (VAD) feature into the React Native ExecuTorch library.

**This PR is not ready for merge yet,** however it is ready for the review of C++ code. Things that are missing:

- Documentation
- Exported model on the official SWM Hugging Face
- Benchmarks
- Maybe example usage in one app?

### Introduces a breaking change?

- [ ] Yes
- [x] No

### Type of change

- [ ] Bug fix (change which fixes an issue)
- [x] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [ ] Other (chores, tests, code style improvements etc.)

### Tested on

- [x] iOS
- [x] Android

### Related issues

Closes #547

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [ ] My changes generate no new warnings

---------

Co-authored-by: Jakub Chmura <[email protected]>
Co-authored-by: Mateusz Sluszniak <[email protected]>
Co-authored-by: chmjkb <[email protected]>
1 parent c902882 commit ac94b35

File tree: 19 files changed (+655, −4 lines)

.cspell-wordlist.txt

Lines changed: 3 additions & 0 deletions

```diff
@@ -80,3 +80,6 @@ setpriority
 errno
 ifdef
 elif
+FSMN
+fsmn
+subarray
```
docs/docs/02-hooks/01-natural-language-processing/useVAD.md

Lines changed: 194 additions & 0 deletions

---
title: useVAD
---

Voice Activity Detection (VAD) is the task of analyzing an audio signal to identify the time segments that contain human speech, separating them from non-speech sections such as silence and background noise.

:::caution
It is recommended to use the models provided by us, which are available at our [Hugging Face repository](https://huggingface.co/software-mansion/react-native-executorch-fsmn-vad). You can also use the [constants](https://github.com/software-mansion/react-native-executorch/blob/main/packages/react-native-executorch/src/constants/modelUrls.ts) shipped with our library.
:::

## Reference

You can obtain the waveform from audio in whatever way suits you best; in the snippet below we use the `react-native-audio-api` library to process an `.mp3` file.

```typescript
import { useVAD, FSMN_VAD } from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';
import * as FileSystem from 'expo-file-system';

const model = useVAD({
  model: FSMN_VAD,
});

const { uri } = await FileSystem.downloadAsync(
  'https://some-audio-url.com/file.mp3',
  FileSystem.cacheDirectory + 'audio_file'
);

const audioContext = new AudioContext({ sampleRate: 16000 });
const decodedAudioData = await audioContext.decodeAudioDataSource(uri);
const audioBuffer = decodedAudioData.getChannelData(0);

try {
  // NOTE: to obtain segments in seconds, divide the start / end
  // of each segment by the sampling rate (16k)
  const speechSegments = await model.forward(audioBuffer);
  console.log(speechSegments);
} catch (error) {
  console.error('Error while running the VAD model', error);
}
```

### Arguments

**`model`** - Object containing the model source.

- **`modelSource`** - A string that specifies the location of the model binary.

**`preventLoad?`** - Boolean that can prevent automatic model loading (and downloading the data if you load it for the first time) after running the hook (see the sketch below).

For more information on loading resources, take a look at the [loading models](../../01-fundamentals/02-loading-models.md) page.
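A minimal sketch of deferred loading, assuming the hook picks up a change of `preventLoad` on re-render (the wrapper hook below is ours, not part of the library):

```tsx
import { useState } from 'react';
import { useVAD, FSMN_VAD } from 'react-native-executorch';

// Keep the model (and its download) on hold until the caller opts in.
function useDeferredVAD() {
  const [allowLoad, setAllowLoad] = useState(false);
  const model = useVAD({ model: FSMN_VAD, preventLoad: !allowLoad });
  return { model, startLoading: () => setAllowLoad(true) };
}
```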
### Returns

| Field              | Type                                             | Description                                                                                                                                                     |
| ------------------ | ------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `forward`          | `(waveform: Float32Array) => Promise<Segment[]>` | Executes the model's forward pass, where the input array should be a waveform sampled at 16 kHz. Returns a promise resolving to an array of `Segment` objects. |
| `error`            | <code>string &#124; null</code>                  | Contains the error message if the model failed to load.                                                                                                         |
| `isGenerating`     | `boolean`                                        | Indicates whether the model is currently processing an inference.                                                                                               |
| `isReady`          | `boolean`                                        | Indicates whether the model has successfully loaded and is ready for inference.                                                                                 |
| `downloadProgress` | `number`                                         | Represents the download progress as a value between 0 and 1.                                                                                                    |

<details>
<summary>Type definitions</summary>

```typescript
interface Segment {
  start: number;
  end: number;
}
```

</details>
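For instance, a minimal readiness gate built from these fields (the component below is illustrative, not part of the library):

```tsx
import React from 'react';
import { Text } from 'react-native';
import { useVAD, FSMN_VAD } from 'react-native-executorch';

// Illustrative: surface load errors and download progress until the model is ready.
function VADStatus() {
  const { error, isReady, downloadProgress } = useVAD({ model: FSMN_VAD });

  if (error) return <Text>Failed to load VAD model: {error}</Text>;
  if (!isReady)
    return <Text>Downloading model… {Math.round(downloadProgress * 100)}%</Text>;
  return <Text>VAD model ready</Text>;
}
```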
## Running the model

Before running the model's `forward` method, make sure to extract the audio waveform you want to process. You'll need to handle this step yourself, ensuring the audio is sampled at 16 kHz. Once you have the waveform, pass it as an argument to the `forward` method. The method returns a promise that resolves to an array of detected speech segments.

:::info
Timestamps in the returned speech segments correspond to indices of the input array (waveform).
:::
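Since the timestamps are sample indices, converting a segment to seconds is a single division by the 16 kHz sampling rate. A small helper (the name is ours, not part of the library) could look like:

```typescript
const SAMPLE_RATE = 16000; // the model expects 16 kHz input

// Convert a Segment expressed in sample indices to seconds.
const toSeconds = (segment: Segment) => ({
  start: segment.start / SAMPLE_RATE,
  end: segment.end / SAMPLE_RATE,
});
```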
## Example

```tsx
import React from 'react';
import { Button, Text, SafeAreaView } from 'react-native';
import { useVAD, FSMN_VAD } from 'react-native-executorch';
import { AudioContext } from 'react-native-audio-api';
import * as FileSystem from 'expo-file-system';

export default function App() {
  const model = useVAD({
    model: FSMN_VAD,
  });

  const audioURL = 'https://some-audio-url.com/file.mp3';

  const handleAudio = async () => {
    if (!model.isReady) {
      console.error('VAD model is not loaded yet.');
      return;
    }

    console.log('Processing URL:', audioURL);

    try {
      const { uri } = await FileSystem.downloadAsync(
        audioURL,
        FileSystem.cacheDirectory + 'vad_example.tmp'
      );

      const audioContext = new AudioContext({ sampleRate: 16000 });
      const originalDecodedBuffer =
        await audioContext.decodeAudioDataSource(uri);
      const originalChannelData = originalDecodedBuffer.getChannelData(0);

      const segments = await model.forward(originalChannelData);
      if (segments.length === 0) {
        console.log('No speech segments were found.');
        return;
      }
      console.log(`Found ${segments.length} speech segments.`);

      // Total number of samples across all speech segments
      const totalLength = segments.reduce(
        (sum, seg) => sum + (seg.end - seg.start),
        0
      );
      const newAudioBuffer = audioContext.createBuffer(
        1, // Mono
        totalLength,
        originalDecodedBuffer.sampleRate
      );
      const newChannelData = newAudioBuffer.getChannelData(0);

      // Copy each speech segment into the new buffer, back to back
      let offset = 0;
      for (const segment of segments) {
        const slice = originalChannelData.subarray(segment.start, segment.end);
        newChannelData.set(slice, offset);
        offset += slice.length;
      }

      // Play the processed audio
      const source = audioContext.createBufferSource();
      source.buffer = newAudioBuffer;
      source.connect(audioContext.destination);
      source.start();
    } catch (error) {
      console.error('Error processing audio data:', error);
    }
  };

  return (
    <SafeAreaView>
      <Text>
        Press the button to process and play speech from a sample file.
      </Text>
      <Button onPress={handleAudio} title="Run VAD Example" />
    </SafeAreaView>
  );
}
```

## Supported models

- [fsmn-vad](https://huggingface.co/funasr/fsmn-vad)

## Benchmarks

### Model size

| Model    | XNNPACK [MB] |
| -------- | :----------: |
| FSMN_VAD |     1.83     |

### Memory usage

| Model    | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| -------- | :--------------------: | :----------------: |
| FSMN_VAD |           97           |        45.9        |

### Inference time

<!-- TODO: MEASURE INFERENCE TIME FOR SAMSUNG GALAXY S24 WHEN POSSIBLE -->

:::warning
Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
:::

Inference times were measured on a 60 s audio clip, which can be found [here](https://models.silero.ai/vad_models/en.wav).

| Model    | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| -------- | :--------------------------: | :------------------------------: | :------------------------: | :-----------------------: |
| FSMN_VAD |             151              |               171                |            180             |            109            |
Lines changed: 64 additions & 0 deletions

---
title: VADModule
---

TypeScript API implementation of the [useVAD](../../02-hooks/01-natural-language-processing/useVAD.md) hook.

## Reference

```typescript
import { VADModule, FSMN_VAD } from 'react-native-executorch';

const model = new VADModule();
await model.load(FSMN_VAD, (progress) => {
  console.log(progress);
});

await model.forward(waveform);
```

### Methods

| Method    | Type                                                                                                                | Description                                                                                                                                                                                 |
| --------- | ------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `load`    | `(model: { modelSource: ResourceSource }, onDownloadProgressCallback?: (progress: number) => void): Promise<void>`  | Loads the model, where `modelSource` is a string that specifies the location of the model binary. To track the download progress, supply a callback function `onDownloadProgressCallback`. |
| `forward` | `(waveform: Float32Array): Promise<Segment[]>`                                                                       | Executes the model's forward pass, where the input array should be a waveform sampled at 16 kHz. Returns a promise resolving to an array of `Segment` objects.                             |
| `delete`  | `(): void`                                                                                                           | Releases the memory held by the module. Calling `forward` afterwards is invalid.                                                                                                            |
<details>
<summary>Type definitions</summary>

```typescript
type ResourceSource = string | number | object;
```

```typescript
interface Segment {
  start: number;
  end: number;
}
```

</details>

## Loading the model

To load the model, create a new instance of the module and use the `load` method on it. It accepts:

**`model`** - Object containing the model source.

- **`modelSource`** - A string that specifies the location of the model binary.

**`onDownloadProgressCallback`** - (Optional) Function called on download progress.

This method returns a promise that resolves once loading completes, and rejects with an error if loading fails.

For more information on loading resources, take a look at the [loading models](../../01-fundamentals/02-loading-models.md) page.
## Running the model

To run the model, use the `forward` method on the module object. Before calling it, make sure to extract the audio waveform you want to process. You'll need to handle this step yourself, ensuring the audio is sampled at 16 kHz. Once you have the waveform, pass it as an argument to the `forward` method. The method returns a promise that resolves to an array of detected speech segments, as in the sketch below.
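A minimal sketch, assuming `waveform` is a 16 kHz `Float32Array` you have already extracted (e.g. decoded with `react-native-audio-api` as in the useVAD docs):

```typescript
import { VADModule, FSMN_VAD } from 'react-native-executorch';

declare const waveform: Float32Array; // your 16 kHz audio, obtained elsewhere

const model = new VADModule();
await model.load(FSMN_VAD);

const segments = await model.forward(waveform);
for (const { start, end } of segments) {
  // Timestamps are sample indices; divide by 16000 to get seconds.
  console.log(`Speech: ${(start / 16000).toFixed(2)}s - ${(end / 16000).toFixed(2)}s`);
}
```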
## Managing memory

The module is a regular JavaScript object, so its lifespan is managed by the garbage collector. In most cases this is enough, and you should not need to free the module's memory yourself. However, if you want to release the memory occupied by the module before the garbage collector steps in, call `delete()` on a module object you no longer use. Note that you cannot use `forward` after `delete` unless you load the module again.
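For example (a sketch; `waveform` stands for audio you obtained yourself):

```typescript
const model = new VADModule();
await model.load(FSMN_VAD);
const segments = await model.forward(waveform);

// Done with the module: free its native memory ahead of garbage collection.
model.delete();

// model.forward(...) would now be invalid until load() is called again.
```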

docs/docs/04-benchmarks/inference-time.md

Lines changed: 11 additions & 1 deletion

```diff
@@ -62,7 +62,7 @@ Times presented in the tables are measured as consecutive runs of the model. Ini

 ❌ - Insufficient RAM.

-### Streaming mode
+## Streaming mode

 Notice that for the `Whisper` model, which takes 30-second audio chunks as input (shorter audio is automatically padded with silence to 30 seconds), `fast` mode has the lowest latency (time from starting transcription to the first returned token, caused by the streaming algorithm) but the slowest speed. If you believe that this might be a problem for you, prefer `balanced` mode instead.

@@ -119,3 +119,13 @@ Average time for generating one image of size 256×256 in 10 inference steps.
 | Model                 | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
 | --------------------- | :--------------------------: | :------------------------------: | :-------------------: | :-------------------------------: | :-----------------------: |
 | BK_SDM_TINY_VPRED_256 | 19100 | 25000 ||| 23100 |
+
+## Voice Activity Detection (VAD)
+
+Average time for processing 60 s of audio.
+
+<!-- TODO: MEASURE INFERENCE TIME FOR SAMSUNG GALAXY S24 WHEN POSSIBLE -->
+
+| Model    | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
+| -------- | :--------------------------: | :------------------------------: | :------------------------: | :-----------------------: |
+| FSMN_VAD | 151 | 171 | 180 | 109 |
```

docs/docs/04-benchmarks/memory-usage.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -75,3 +75,9 @@ title: Memory Usage
 | --------------------- | ---------------------- | ------------------ |
 | BK_SDM_TINY_VPRED_256 | 2900 | 2800 |
 | BK_SDM_TINY_VPRED | 6700 | 6560 |
+
+## Voice Activity Detection (VAD)
+
+| Model    | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
+| -------- | :--------------------: | :----------------: |
+| FSMN_VAD | 97 | 45.9 |
```

docs/docs/04-benchmarks/model-size.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -88,3 +88,9 @@ title: Model Size
 | Model             | Text encoder (XNNPACK) [MB] | UNet (XNNPACK) [MB] | VAE decoder (XNNPACK) [MB] |
 | ----------------- | --------------------------- | ------------------- | -------------------------- |
 | BK_SDM_TINY_VPRED | 492 | 1290 | 198 |
+
+## Voice Activity Detection (VAD)
+
+| Model    | XNNPACK [MB] |
+| -------- | :----------: |
+| FSMN_VAD | 1.83 |
```

packages/react-native-executorch/common/rnexecutorch/RnExecutorchInstaller.cpp

Lines changed: 32 additions & 3 deletions

```diff
@@ -6,16 +6,23 @@
 #include <rnexecutorch/models/embeddings/image/ImageEmbeddings.h>
 #include <rnexecutorch/models/embeddings/text/TextEmbeddings.h>
 #include <rnexecutorch/models/image_segmentation/ImageSegmentation.h>
-#include <rnexecutorch/models/text_to_image/TextToImage.h>
 #include <rnexecutorch/models/llm/LLM.h>
 #include <rnexecutorch/models/object_detection/ObjectDetection.h>
 #include <rnexecutorch/models/ocr/OCR.h>
 #include <rnexecutorch/models/speech_to_text/SpeechToText.h>
 #include <rnexecutorch/models/style_transfer/StyleTransfer.h>
+#include <rnexecutorch/models/text_to_image/TextToImage.h>
 #include <rnexecutorch/models/vertical_ocr/VerticalOCR.h>
+#include <rnexecutorch/models/voice_activity_detection/VoiceActivityDetection.h>
 #include <rnexecutorch/threads/GlobalThreadPool.h>
 #include <rnexecutorch/threads/utils/ThreadUtils.h>

+#if defined(__ANDROID__) && defined(__aarch64__)
+#include <executorch/extension/threadpool/cpuinfo_utils.h>
+#include <executorch/extension/threadpool/threadpool.h>
+#include <rnexecutorch/Log.h>
+#endif
+
 namespace rnexecutorch {

 // This function fetches data from a url address. It is implemented in
@@ -51,8 +58,9 @@ void RnExecutorchInstaller::injectJSIBindings(

   jsiRuntime->global().setProperty(
       *jsiRuntime, "loadObjectDetection",
-      RnExecutorchInstaller::loadModel<models::object_detection::ObjectDetection>(
-          jsiRuntime, jsCallInvoker, "loadObjectDetection"));
+      RnExecutorchInstaller::loadModel<
+          models::object_detection::ObjectDetection>(jsiRuntime, jsCallInvoker,
+                                                     "loadObjectDetection"));

   jsiRuntime->global().setProperty(
       *jsiRuntime, "loadExecutorchModule",
@@ -92,9 +100,30 @@ void RnExecutorchInstaller::injectJSIBindings(
       *jsiRuntime, "loadSpeechToText",
       RnExecutorchInstaller::loadModel<models::speech_to_text::SpeechToText>(
           jsiRuntime, jsCallInvoker, "loadSpeechToText"));
+  jsiRuntime->global().setProperty(
+      *jsiRuntime, "loadVAD",
+      RnExecutorchInstaller::loadModel<
+          models::voice_activity_detection::VoiceActivityDetection>(
+          jsiRuntime, jsCallInvoker, "loadVAD"));

   threads::utils::unsafeSetupThreadPool();
   threads::GlobalThreadPool::initialize();
+
+#if defined(__ANDROID__) && defined(__aarch64__)
+  auto num_of_perf_cores =
+      ::executorch::extension::cpuinfo::get_num_performant_cores();
+  log(LOG_LEVEL::Info, "Detected ", num_of_perf_cores, " performant cores");
+  // Set num_of_cores to floor(num_of_perf_cores / 2) + 1: depending on the
+  // CPU architecture, we want to leave at least two performant cores free
+  // for other tasks when possible (using more actually degrades performance).
+  // On older devices (e.g. Samsung S22) this resolves to 3 cores, and on
+  // newer ones (like the OnePlus 12) to 4, which gave the highest throughput
+  // when benchmarked.
+  auto num_of_cores = static_cast<uint32_t>(num_of_perf_cores / 2) + 1;
+  ::executorch::extension::threadpool::get_threadpool()
+      ->_unsafe_reset_threadpool(num_of_cores);
+  log(LOG_LEVEL::Info, "Configuring xnnpack for ", num_of_cores, " threads");
+#endif
 }

 } // namespace rnexecutorch
```
