Commit ed2493b

Improve README and mention whisper
1 parent 9644f7f commit ed2493b

3 files changed: +110 -30 lines changed

README.md

Lines changed: 49 additions & 29 deletions
@@ -4,6 +4,22 @@ Transform slides and speaker notes into video.

 [![Demo video](https://transformrs.github.io/trv/demo.png)](https://transformrs.github.io/trv/demo.mp4)

+## Features
+
+- 🔒 Fully offline generation of audio via the Kokoro text-to-speech model.
+- 🛠️ Version control friendly - store your video source in git.
+- 🚀 Caching of audio files to avoid redundant API calls.
+- 🚀 Caching of video files for quick re-builds.
+- 🚀 A development mode with a built-in web server for fast feedback.
+- 🌐 Support for multiple languages and voices.
+- 🚀 Small file sizes for easy sharing and hosting.
+
+## Installation
+
+```raw
+$ cargo install trv
+```
+
 ## Usage

 This tool is designed to work with [Typst](https://github.com/typst/typst) presentations.
@@ -29,7 +45,15 @@ To create a video, create a Typst presentation with speaker notes (we show only
 ]
 ```

-Next, run the following command:
+Next, we can work on the video with the following command:
+
+```raw
+$ trv watch examples/first.typ
+```
+
+This will start a local web server that will automatically update the video as you make changes to the presentation.
+
+Once everything looks good, we can build the final video with the following command:

 ```raw
 $ trv build examples/first.typ
@@ -63,10 +87,10 @@ $ trv --input=presentation.typ


 To create a video without an API key or an internet connection, you can self-host [Kokoros](https://github.com/lucasjinreal/Kokoros).
-See the [Offline section](#offline) for more information.
+See the [Kokoros section](#kokoros) for more information.
 Or for a state-of-the-art model with voice cloning capabilities, see the [Zyphra Zonos section](#zyphra-zonos).

-## Offline
+## Kokoros

 To use Kokoros locally, the easiest way is to use the Docker image.

@@ -82,13 +106,26 @@ $ docker run -it --rm -p 3000:3000 kokoros openai

 Then, you can use the Docker image as the provider:

+```typ
+#import "@preview/polylux:0.4.0": *
+
+// --- trv config:
+// provider = "openai-compatible(localhost:3000)"
+// model = "tts-1"
+// voice = "af_sky"
+// audio_format = "wav"
+// ---
+
+...
+```
+
 ```raw
-$ trv --input=presentation.typ --provider=openai-compatible(localhost:3000)
+$ trv build presentation.typ
 ```

-## Via Google
+## Google

-Google has some high-quality voices available via their API:
+My favourite text-to-speech engine is the one from Google.

 ```raw
 $ export GOOGLE_KEY="<YOUR KEY>"
@@ -98,11 +135,6 @@ $ trv build examples/google.typ

 [![Google demo video](https://transformrs.github.io/trv/google.png)](https://transformrs.github.io/trv/google.mp4)

-See the [Google section](#google) for more information about the Google API.
-
-Google, meanwhile, has the best text-to-speech engine that I've found as part of Gemini 2.0 Flash Experimental.
-However, audio output is not yet available via the API.
-
 ## Zyphra Zonos

 To use the Zyphra Zonos model, you need 8 GB of VRAM.
@@ -132,7 +164,7 @@ So in practice, the Kokoro model is probably the better option for now.

 To create a portrait video, like a YouTube Short, you can set the page to

-```typst
+```typ
 #set page(width: 259.2pt, height: 460.8pt)
 ```

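As a rough check of the page size above, the following sketch converts Typst points to pixels. It assumes a render density of 300 pixels per inch, which the visible text does not state but which is consistent with the 1080 x 1920 resolution this README mentions. `pt_to_px` is a hypothetical helper for illustration, not part of `trv`.

```sh
# Convert a length in typographic points (1 pt = 1/72 inch) to pixels.
# The 300 pixels-per-inch default is an assumption, not a documented trv setting.
pt_to_px() {
  awk -v pt="$1" -v ppi="${2:-300}" 'BEGIN { printf "%d\n", pt / 72 * ppi + 0.5 }'
}

pt_to_px 259.2   # page width
pt_to_px 460.8   # page height
```

Under that assumption, the two calls print 1080 and 1920, matching the portrait resolution.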
@@ -141,24 +173,12 @@ This will automatically create slides with 1080 x 1920 resolution since Typst is
 Next, ffmpeg will automatically scale the video to a height of 1920p, so in this case the height will not be changed.
 For landscape videos, it might scale the image down to 1920p.

-## About Audio
-
-Audio is generated using the [transformrs](https://github.com/transformrs/transformrs) crate.
-It supports multiple providers, including DeepInfra, OpenAI, and Google.
-
-So `trv` should also work with providers other than DeepInfra.
-However, during testing, I got the best results with Kokoros or DeepInfra for the lowest price.
+## Subtitles

-For example, OpenAI text-to-speech requires any video to contain a "clear disclosure" that the voice they are hearing is AI-generated.
+To add subtitles to the video, you can use OpenAI's [`whisper`](https://github.com/openai/whisper):

-## Installation
-
-```sh
-cargo install trv
+```raw
+$ whisper _out/out.mp4 -f srt --model small --language=en
 ```

-Or with [`cargo binstall`](https://github.com/cargo-bins/cargo-binstall):
-
-```sh
-cargo binstall trv
-```
+This will create an `out.srt` file with the subtitles.
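
To sanity-check a generated subtitle file, a small sketch like the following can count the cues and report where the last one ends. The sample file holds the first two cues of the `out.srt` added in this commit. `sample.srt` and `srt_summary` are hypothetical names used only for illustration, and the sketch assumes the usual SRT layout with `start --> end` on one line.

```sh
# Write a two-cue sample in SRT format (text copied from the out.srt in this commit).
cat > sample.srt <<'EOF'
1
00:00:00,000 --> 00:00:03,280
This video was created with the Google Text to Speech API.

2
00:00:03,280 --> 00:00:06,600
As an example, we can explain the following math problem.
EOF

# Count the timing lines (one per cue) and remember the end timestamp of the last one.
srt_summary() {
  awk '/-->/ { n++; end = $3 } END { printf "%d cues, ends at %s\n", n, end }' "$1"
}

srt_summary sample.srt
```

For the sample this prints `2 cues, ends at 00:00:06,600`. Pointing `srt_summary` at the real `out.srt` should give a quick indication of whether the subtitles cover the whole video.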

examples/google.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@

 export GOOGLE_KEY=$(cat keys.env | grep GOOGLE_KEY | cut -d '=' -f 2)

-trv build examples/google.typ
+trv build examples/google.typ --audio-codec aac_at

out.srt

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+1
+00:00:00,000 --> 00:00:03,280
+This video was created with the Google Text to Speech API.
+
+2
+00:00:03,280 --> 00:00:06,600
+As an example, we can explain the following math problem.
+
+3
+00:00:06,600 --> 00:00:08,680
+Two plus two equals two X.
+
+4
+00:00:08,680 --> 00:00:10,280
+What is X in this equation?
+
+5
+00:00:10,280 --> 00:00:13,000
+To solve it, we can move the two X to the left.
+
+6
+00:00:13,000 --> 00:00:15,560
+Or in other words, we put everything on the left side
+
+7
+00:00:15,560 --> 00:00:17,240
+of the equation on the right side,
+
+8
+00:00:17,240 --> 00:00:18,320
+and everything on the right side
+
+9
+00:00:18,320 --> 00:00:19,620
+of the equation on the left side.
+
+10
+00:00:19,620 --> 00:00:22,220
+Now we move the two to the right.
+
+11
+00:00:22,220 --> 00:00:24,080
+This can be done by dividing both sides
+
+12
+00:00:24,080 --> 00:00:26,200
+of the equation by two.
+
+13
+00:00:26,200 --> 00:00:30,160
+Now we have X is two plus two divided by two.
+
+14
+00:00:30,160 --> 00:00:33,800
+This gives us X is four divided by two.
+
+15
+00:00:33,800 --> 00:00:35,120
+So the answer is two.
+
