Commit d2a0b36

New Audio Pipelines, Improved binaries download workflow (#61)
2 parents 5183a00 + 26959ca commit d2a0b36

133 files changed: 8758 additions and 1755 deletions


.github/workflows/release.yml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+name: Build and Release Libraries
+
+permissions:
+  contents: write
+  packages: read
+
+on:
+  release:
+    types:
+      - published
+
+  workflow_dispatch:
+    inputs:
+      tag:
+        description: 'Release Tag'
+        required: true
+
+
+jobs:
+  add-libs:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Log in to GHCR
+        uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Build Libraries
+        run: |
+          TAG=${{ startsWith(github.ref, 'refs/tags/') && github.ref_name || github.event.inputs.tag }}
+          docker run --rm -v ./libs:/libs -e TAG=$TAG ghcr.io/codewithkyrian/transformers-php:latest
+          ls libs
+
+      - name: Add Libraries to Release
+        uses: softprops/action-gh-release@v2
+        with:
+          files: |
+            libs/*

.gitignore

Lines changed: 15 additions & 5 deletions
@@ -1,12 +1,22 @@
-/.phpunit.cache
-/.php-cs-fixer.cache
-/.php-cs-fixer.php
-/composer.lock
+.phpunit.cache
+.phpunit.result.cache
+.php-cs-fixer.cache
+.php-cs-fixer.php
+
+composer.lock
 /vendor/
+
+.DS_Store
+Thumbs.db
+
 *.swp
 *.swo
 playground/*
+
 .idea
+.fleet
+.vscode
+
 .transformers-cache/*
 tests/models/*
-dist
+dist

VERSION

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+0.4.4

composer.json

Lines changed: 8 additions & 3 deletions
@@ -16,17 +16,22 @@
         "php": "^8.1",
         "ext-ffi": "*",
         "codewithkyrian/jinja-php": "^1.0",
-        "codewithkyrian/transformers-libsloader": "^1.0",
+        "codewithkyrian/transformers-libsloader": "^2.0",
         "imagine/imagine": "^1.3",
-        "rokka/imagine-vips": "^0.31.0",
         "rindow/rindow-math-matrix": "^2.0",
         "rindow/rindow-matlib-ffi": "^1.0",
         "rindow/rindow-openblas-ffi": "^1.0",
         "symfony/console": "^6.4|^7.0"
     },
     "require-dev": {
         "pestphp/pest": "^2.31",
-        "symfony/var-dumper": "^7.0"
+        "symfony/var-dumper": "^7.0",
+        "rokka/imagine-vips": "^0.31.0"
+    },
+    "suggest": {
+        "ext-imagick": "Required to use the Imagick Driver for image processing",
+        "ext-gd": "Required to use the GD Driver for image processing",
+        "rokka/imagine-vips": "Required to use the VIPS Driver for image processing"
     },
     "license": "Apache-2.0",
     "autoload": {
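
Since this hunk moves `rokka/imagine-vips` out of `require` and lists three optional image drivers under `suggest`, an application can no longer assume any particular driver is installed. The following is a minimal sketch of a runtime check, not code from this commit. `availableImageDrivers` is a hypothetical helper, and the class name `Imagine\Vips\Imagine` is an assumption about what `rokka/imagine-vips` provides.

```php
<?php

/**
 * Returns the image drivers whose requirements appear to be installed.
 *
 * Hypothetical helper for illustration only. The two callables are
 * injectable so the check can be exercised without changing the
 * PHP environment.
 */
function availableImageDrivers(?callable $extensionLoaded = null, ?callable $classExists = null): array
{
    $extensionLoaded ??= fn (string $name): bool => extension_loaded($name);
    $classExists ??= fn (string $class): bool => class_exists($class);

    $drivers = [];

    // ext-imagick and ext-gd are PHP extensions named in "suggest".
    if ($extensionLoaded('imagick')) {
        $drivers[] = 'imagick';
    }
    if ($extensionLoaded('gd')) {
        $drivers[] = 'gd';
    }

    // Assumption: rokka/imagine-vips exposes the class Imagine\Vips\Imagine.
    if ($classExists('Imagine\\Vips\\Imagine')) {
        $drivers[] = 'vips';
    }

    return $drivers;
}

// Lists whichever drivers are usable on the current machine.
print_r(availableImageDrivers());
```

An empty result would mean none of the suggested packages or extensions is present, so image pipelines would need one of them installed first.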

docs/.vitepress/config.mts

Lines changed: 8 additions & 0 deletions
@@ -69,6 +69,14 @@ export default defineConfig({
                         {text: 'Image To Text', link: '/image-to-text'},
                         {text: 'Image To Image', link: '/image-to-image'},
                     ]
+                },
+                {
+                    text: 'Audio Tasks',
+                    collapsed: true,
+                    items: [
+                        {text: 'Audio Classification', link: '/audio-classification'},
+                        {text: 'Automatic Speech Recognition', link: '/automatic-speech-recognition'},
+                    ]
                 }
             ]
         },

docs/audio-classification.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
+---
+outline: deep
+---
+
+# Audio Classification <Badge type="tip" text="^0.5.0" />
+
+Audio classification involves assigning a label or class to an audio input. It can be used to recognize commands,
+identify speakers, or detect emotions in speech. The model processes the audio and returns a classification label with a
+corresponding confidence score.
+
+## Task ID
+
+- `audio-classification`
+
+## Default Model
+
+- `Xenova/wav2vec2-base-superb-ks`
+
+## Use Cases
+
+Audio classification models have a wide range of applications, including:
+
+- **Command Recognition:** Classifying utterances into a predefined set of commands, often done on-device for fast
+  response times.
+- **Language Identification:** Detecting the language spoken in the audio.
+- **Emotion Recognition:** Analyzing speech to identify the emotion expressed by the speaker.
+- **Speaker Identification:** Determining the identity of the speaker from a set of known voices.
+
+## Running an Inference Session
+
+Here's how to perform audio classification using the pipeline:
+
+```php
+use function Codewithkyrian\Transformers\Pipelines\pipeline;
+
+$classifier = pipeline('audio-classification', 'Xenova/ast-finetuned-audioset-10-10-0.4593');
+
+$audioUrl = __DIR__ . '/../sounds/cat_meow.wav';
+
+$output = $classifier($audioUrl, topK: 4);
+```
+
+::: details Click to view output
+
+```php
+[
+    ['label' => 'Cat Meow', 'score' => 0.8456],
+    ['label' => 'Domestic Animal', 'score' => 0.1234],
+    ['label' => 'Pet', 'score' => 0.0987],
+    ['label' => 'Mammal', 'score' => 0.0567]
+]
+```
+
+:::
+
+## Pipeline Input Options
+
+When running the `audio-classification` pipeline, you can use the following options:
+
+- ### `inputs` *(string)*
+  The audio file(s) to classify. It can be a local file path, a file resource, a URL to an audio file (local or remote),
+  or an array of these inputs. It's the first argument, so there's no need to pass it as a named argument.
+
+  ```php
+  $output = $classifier('https://example.com/audio.wav');
+  ```
+
+- ### `topK` *(int)*
+  The number of top labels to return. The default is `1`.
+
+  ```php
+  $output = $classifier('https://example.com/audio.wav', topK: 4);
+  ```
+
+  ::: details Click to view output
+
+  ```php
+  [
+      ['label' => 'Cat Meow', 'score' => 0.8456],
+      ['label' => 'Domestic Animal', 'score' => 0.1234],
+      ['label' => 'Pet', 'score' => 0.0987],
+      ['label' => 'Mammal', 'score' => 0.0567]
+  ]
+  ```
+
+  :::
+
+## Pipeline Outputs
+
+The output of the pipeline is an array containing the classification label and the confidence score. The confidence
+score is a value between 0 and 1, with 1 being the highest confidence.
+
+Since the actual labels depend on the model, it's crucial to consult the model's documentation for the specific labels
+it uses. Here are examples demonstrating how outputs might differ:
+
+For a single audio file:
+
+```php
+['label' => 'Dog Barking', 'score' => 0.9321]
+```
+
+For multiple audio files:
+
+```php
+[
+    ['label' => 'Dog Barking', 'score' => 0.9321],
+    ['label' => 'Car Horn', 'score' => 0.8234],
+    ['label' => 'Siren', 'score' => 0.7123]
+]
+```
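
The new page documents two output shapes: a single `['label' => ..., 'score' => ...]` row and a list of such rows. The sketch below shows one way calling code might treat both uniformly and keep only labels above a confidence threshold. It is an illustration, not part of the commit: `normalizeResults` and `labelsAbove` are hypothetical helpers, and the sample arrays are copied from the documented outputs rather than produced by running a model.

```php
<?php

/**
 * Wraps a single result row in a list so callers always iterate rows.
 * Hypothetical helper; assumes a single result is an associative array
 * with a 'label' key, as shown in the documented output.
 */
function normalizeResults(array $output): array
{
    return array_key_exists('label', $output) ? [$output] : array_values($output);
}

/**
 * Returns the labels whose score is at least $minScore, in input order.
 */
function labelsAbove(array $output, float $minScore): array
{
    $labels = [];
    foreach (normalizeResults($output) as $row) {
        if ($row['score'] >= $minScore) {
            $labels[] = $row['label'];
        }
    }

    return $labels;
}

// Sample values taken from the documentation above; no model is run here.
$single = ['label' => 'Dog Barking', 'score' => 0.9321];
$topFour = [
    ['label' => 'Cat Meow', 'score' => 0.8456],
    ['label' => 'Domestic Animal', 'score' => 0.1234],
    ['label' => 'Pet', 'score' => 0.0987],
    ['label' => 'Mammal', 'score' => 0.0567],
];

print_r(labelsAbove($single, 0.5));  // one label: Dog Barking
print_r(labelsAbove($topFour, 0.1)); // two labels: Cat Meow, Domestic Animal
```

Normalizing first keeps the threshold logic independent of whether `topK` was left at its default of `1` or raised.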
Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
+---
+outline: deep
+---
+
+# Automatic Speech Recognition <Badge type="tip" text="^0.5.0" />
+
+Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing audio into text. It
+has various applications, such as voice user interfaces, caption generation, and virtual assistants.
+
+## Task ID
+
+- `automatic-speech-recognition`
+- `asr`
+
+## Default Model
+
+- `Xenova/whisper-tiny.en`
+
+## Use Cases
+
+Automatic Speech Recognition is widely used in several domains, including:
+
+- **Caption Generation:** Automatically generates captions for live-streamed or recorded videos, enhancing accessibility
+  and aiding in content interpretation for non-native language speakers.
+- **Virtual Speech Assistants:** Embedded in devices to recognize voice commands, facilitating tasks like dialing a
+  phone number, answering general questions, or scheduling meetings.
+- **Multilingual ASR:** Converts audio inputs in multiple languages into transcripts, often with language identification
+  for improved performance. Examples include models like Whisper.
+
+## Running an Inference Session
+
+Here's how to perform automatic speech recognition using the pipeline:
+
+```php
+use function Codewithkyrian\Transformers\Pipelines\pipeline;
+
+$transcriber = pipeline('automatic-speech-recognition', 'onnx-community/whisper-tiny.en');
+
+$audioUrl = __DIR__ . '/preamble.wav';
+$output = $transcriber($audioUrl, maxNewTokens: 256);
+```
+
+## Pipeline Input Options
+
+When running the `automatic-speech-recognition` pipeline, you can use the following options:
+
+- ### `inputs` *(string)*
+
+  The audio file to transcribe. It can be a local file path, a file resource, or a URL to an audio file (local or
+  remote). It's the first argument, so there's no need to pass it as a named argument.
+
+  ```php
+  $output = $transcriber('https://example.com/audio.wav');
+  ```
+
+- ### `returnTimestamps` *(bool|string)*
+
+  Determines whether to return timestamps with the transcribed text.
+    - If set to `true`, the model will return the start and end timestamps for each chunk of text, with the chunks
+      determined by the model itself.
+    - If set to `'word'`, the model will return timestamps for individual words. Note that word-level timestamps require
+      models exported with `output_attentions=True`.
+
+- ### `chunkLengthSecs` *(int)*
+
+  The length of audio chunks to process in seconds. This is essential for models like Whisper that can only process a
+  maximum of 30 seconds at a time. Setting this option will chunk the audio, process each chunk individually, and then
+  merge the results into a single output.
+
+- ### `strideLengthSecs` *(int)*
+
+  The length of overlap between consecutive audio chunks in seconds. If not provided, this defaults
+  to `chunkLengthSecs / 6`. Overlapping ensures smoother transitions and more accurate transcriptions, especially for
+  longer audio segments.
+
+- ### `forceFullSequences` *(bool)*
+
+  Whether to force the output to be in full sequences. This is set to `false` by default.
+
+- ### `language` *(string)*
+
+  The source language of the audio. By default, this is `null`, meaning the language will be auto-detected. Specifying
+  the language can improve performance if the source language is known.
+
+- ### `task` *(string)*
+
+  The specific task to perform. By default, this is `null`, meaning it will be auto-detected. Possible values
+  are `'transcribe'` for transcription and `'translate'` for translating the audio content.
+
+Please note that using the streamer option with this task is not yet supported.
+
+## Pipeline Outputs
+
+The output of the pipeline is an array containing the transcribed text and, optionally, the timestamps. The timestamps
+can be provided either at the chunk level or word level, depending on the `returnTimestamps` setting.
+
+- **Default Output (without timestamps):**
+
+  ```php
+  [
+      "text" => "We, the people of the United States, in order to form a more perfect union, establish justice, ensure domestic tranquility, provide for the common defense, promote the general welfare, and secure the blessings of liberty to ourselves and our posterity, to ordain and establish this constitution for the United States of America."
+  ]
+  ```
+
+- **Output with Chunk-Level Timestamps:**
+
+  ```php
+  [
+      "text" => "We, the people of the United States, in order to form a more perfect union...",
+      "chunks" => [
+          [
+              "timestamp" => [0.0, 5.12],
+              "text" => "We, the people of the United States, in order to form a more perfect union, establish"
+          ],
+          [
+              "timestamp" => [5.12, 10.4],
+              "text" => " justice, ensure domestic tranquility, provide for the common defense, promote the general"
+          ],
+          [
+              "timestamp" => [10.4, 15.2],
+              "text" => " welfare, and secure the blessings of liberty to ourselves and our posterity, to ordain"
+          ],
+          ...
+      ]
+  ]
+  ```
+
+- **Output with Word-Level Timestamps:**
+
+  ```php
+  [
+      "text" => "...",
+      "chunks" => [
+          ["text" => "We,", "timestamp" => [0.6, 0.94]],
+          ["text" => "the", "timestamp" => [0.94, 1.3]],
+          ["text" => "people", "timestamp" => [1.3, 1.52]],
+          ["text" => "of", "timestamp" => [1.52, 1.62]],
+          ["text" => "the", "timestamp" => [1.62, 1.82]],
+          ["text" => "United", "timestamp" => [1.82, 2.52]],
+          ["text" => "States", "timestamp" => [2.52, 2.72]],
+          ["text" => "in", "timestamp" => [2.72, 2.88]],
+          ["text" => "order", "timestamp" => [2.88, 3.1]],
+          ...
+      ]
+  ]
+  ```
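
The page names caption generation as a use case and documents a `chunks` array of `timestamp` and `text` pairs. As a rough illustration of how that output might be consumed, the sketch below turns chunk-level results into SubRip (SRT) captions. `srtTime` and `chunksToSrt` are hypothetical helpers that are not part of this commit, and the sketch assumes every chunk has both a start and an end timestamp in seconds. The sample chunks are copied from the documented output.

```php
<?php

/**
 * Formats a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
 */
function srtTime(float $seconds): string
{
    $ms = (int) round($seconds * 1000);

    return sprintf(
        '%02d:%02d:%02d,%03d',
        intdiv($ms, 3600000),
        intdiv($ms % 3600000, 60000),
        intdiv($ms % 60000, 1000),
        $ms % 1000
    );
}

/**
 * Builds an SRT document from the 'chunks' entry of a transcription result.
 * Assumes each chunk is ['timestamp' => [start, end], 'text' => string].
 */
function chunksToSrt(array $chunks): string
{
    $cues = [];
    foreach (array_values($chunks) as $i => $chunk) {
        [$start, $end] = $chunk['timestamp'];
        // Cue layout: index, time range, caption text (leading space trimmed).
        $cues[] = sprintf(
            "%d\n%s --> %s\n%s\n",
            $i + 1,
            srtTime($start),
            srtTime($end),
            trim($chunk['text'])
        );
    }

    // Cues are separated by a blank line.
    return implode("\n", $cues);
}

// Sample chunks copied from the documented chunk-level output.
$chunks = [
    ['timestamp' => [0.0, 5.12], 'text' => 'We, the people of the United States, in order to form a more perfect union, establish'],
    ['timestamp' => [5.12, 10.4], 'text' => ' justice, ensure domestic tranquility, provide for the common defense, promote the general'],
];

echo chunksToSrt($chunks);
```

The same function should work on word-level output, since those entries use the same two keys, although one cue per word is rarely what a caption file wants.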
