-
Notifications
You must be signed in to change notification settings - Fork 56
feat: text to speech (#546) #710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please apply changes to all applicable places, I didn't repeated these comments. But overall, it looks really cool, these suggestions are just very small nits ;)
packages/react-native-executorch/common/rnexecutorch/data_processing/Sequential.h
Outdated
Show resolved
Hide resolved
packages/react-native-executorch/common/rnexecutorch/data_processing/Sequential.h
Outdated
Show resolved
Hide resolved
packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Decoder.cpp
Outdated
Show resolved
Hide resolved
packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Decoder.h
Show resolved
Hide resolved
packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Constants.h
Show resolved
Hide resolved
...act-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/DurationPredictor.cpp
Show resolved
Hide resolved
...act-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/DurationPredictor.cpp
Outdated
Show resolved
Hide resolved
...ges/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Partitioner.cpp
Outdated
Show resolved
Hide resolved
packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Utils.cpp
Show resolved
Hide resolved
packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Utils.cpp
Outdated
Show resolved
Hide resolved
571fada to
9eb0629
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some nits for the moment, I feel like we need to add docs to have a solid understanding of the public API layer. However, this looks like really a lot of solid work, congrats!
Haven't finished reviewing tho :D
| const [readyToGenerate, setReadyToGenerate] = useState(false); | ||
|
|
||
| const audioContextRef = useRef<AudioContext | null>(null); | ||
| const sourceRef = useRef<any>(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we type this?
| iosOptions: ['defaultToSpeaker'], | ||
| }); | ||
|
|
||
| // Initialize context once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Initialize context once |
|
|
||
| const onEnd = async () => { | ||
| setIsPlaying(false); | ||
| setReadyToGenerate(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesnt the useEffect above handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this deleted?
| export enum TextToSpeechLanguage { | ||
| EN_US = 0, | ||
| EN_GB = 1, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
S2T shared this as string literal union and I recommend using it instead, so we have a unified approach:
export type SpeechToTextLanguage =
| 'af'
| 'sq'
| 'ar'
// this continues| // Voice configuration | ||
| // So far in Kokoro, each voice is directly associated with a language. | ||
| // The 'data' field corresponds to (usually) binary file with voice tensor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use JSDoc for this kind of comments?
For example:
**
* Voice configuration
* * So far in Kokoro, each voice is directly associated with a language.
* The 'data' field corresponds to (usually) binary file with voice tensor.
*/
Overall I think that this is a good approach for everything we share to the public API as it is easy for the user to figure out what the thing is doing without the docs.
| export const URL_PREFIX = | ||
| 'https://huggingface.co/software-mansion/react-native-executorch'; | ||
| const VERSION_TAG = 'resolve/v0.6.0'; | ||
| // const NEXT_VERSION_TAG = 'resolve/v0.7.0'; | ||
| export const VERSION_TAG = 'resolve/v0.6.0'; | ||
| export const NEXT_VERSION_TAG = 'resolve/v0.7.0'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be exported? This makes it possible for the users to import this, which is unnecessary
Removed unnecessary Log.h include from RnExecutorchInstaller.h
Removed unnecessary iostream include from BaseModel.cpp
9eb0629 to
2b633eb
Compare
| export interface VoiceConfig { | ||
| language: TextToSpeechLanguage; | ||
| data: ResourceSource; | ||
| extra?: Record<string, unknown>; | ||
| } | ||
|
|
||
| // Individual model configurations | ||
| // - Kokoro Configuration (including Phonemis tagger resource) | ||
| export interface KokoroConfig { | ||
| durationPredictorSource: ResourceSource; | ||
| f0nPredictorSource: ResourceSource; | ||
| textEncoderSource: ResourceSource; | ||
| textDecoderSource: ResourceSource; | ||
| } | ||
|
|
||
| // Model + voice configurations | ||
| export interface TextToSpeechConfig { | ||
| model: KokoroConfig; // ... add other model types in the future | ||
| voice?: VoiceConfig; | ||
| options?: any; // A completely optional model-specific configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to find a way to not use any or unknown here? See the comment in TextToSpeechModule.ts
| const uri = (config.model as any)[anySourceKey]; | ||
| if (uri.includes('kokoro')) { | ||
| await this.loadKokoro( | ||
| config.model, | ||
| config.voice!, | ||
| onDownloadProgressCallback, | ||
| config.options | ||
| ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't do it this way. What if someone changes the name of this file, or uses require()?
I would recommend doing something like this:
// 1. Define specific option types first for reusability
interface KokoroOptions {
temperature: number;
contextSize: number;
}
interface OtherOptions {
language: string;
beamSize: number;
}
// 2. Define the Configs with a strict 'type' discriminator
interface KokoroConfig {
type: 'kokoro';
modelPath: string;
options: KokoroOptions;
}
interface OtherModelConfig {
type: 'other';
modelPath: string;
options: OtherOptions;
}
type ModelConfig = KokoroConfig | OtherModelConfig;
class GenericModel {
public async load(config: ModelConfig): Promise<void> {
switch (config.type) {
case 'kokoro':
// TypeScript implies config is KokoroConfig here
await this.loadKokoro(config.modelPath, config.options);
break;
case 'other':
// TypeScript implies config is OtherModelConfig here
await this.loadOther(config.modelPath, config.options);
break;
}
}
private async loadKokoro(path: string, options: KokoroOptions) {
console.log(`Loading Kokoro with temp: ${options.temperature}`);
// implementation...
}
private async loadOther(path: string, options: OtherOptions) {
console.log(`Loading Other with lang: ${options.language}`);
// implementation...
}
}| if (!voice.extra || !voice.extra.tagger || !voice.extra.lexicon) { | ||
| throw new Error( | ||
| 'Kokoro: voice config is missing required extra fields: tagger and/or lexicon.' | ||
| ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| voice.extra!.tagger, | ||
| voice.extra!.lexicon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not use non-null assertion operators
| return await this.nativeModule.generate(text, speed); | ||
| } | ||
|
|
||
| public async stream({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could make this a generator? See the STT implementation
| } | ||
|
|
||
| int32_t nTokens = shape[0]; | ||
| int64_t *durationsPtr = durations.data_ptr<int64_t>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use const_data_ptr() or mutable_data_ptr()
| explicit DurationPredictor(const std::string &modelSource, | ||
| std::shared_ptr<react::CallInvoker> callInvoker); | ||
|
|
||
| // Returns a tuple (d, indices, effectiveDuration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice if there was some explaination what exactly the returned stuff is
| explicit Encoder(const std::string &modelSource, | ||
| std::shared_ptr<react::CallInvoker> callInvoker); | ||
|
|
||
| Result<std::vector<EValue>> generate(const std::string &method, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, a comment would be nice
| class Encoder : public BaseModel { | ||
| public: | ||
| explicit Encoder(const std::string &modelSource, | ||
| std::shared_ptr<react::CallInvoker> callInvoker); | ||
|
|
||
| Result<std::vector<EValue>> generate(const std::string &method, | ||
| const Configuration &inputConfig, | ||
| std::span<Token> tokens, | ||
| std::span<bool> textMask, | ||
| std::span<float> pred_aln_trg); | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, high-level comments would be nice
| auto croppedAudio = | ||
| utils::stripAudio(audio, constants::kSamplesPerMilisecond * 50); | ||
|
|
||
| std::vector<float> result(croppedAudio.begin(), croppedAudio.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can return span instead of vector to avoid copying?

Description
Introduces a breaking change?
Type of change
Tested on
Testing instructions
Screenshots
Related issues
Checklist
Additional notes
There are still a few side things to be done on this feature.