feat: text to speech (#546) #710

IgorSwat · 2026-01-09T08:46:21Z

Description

Introduces a breaking change?

Yes
No

Type of change

Bug fix (change which fixes an issue)
New feature (change which adds functionality)
Documentation update (improves or adds clarity to existing documentation)
Other (chores, tests, code style improvements etc.)

Tested on

iOS
Android

Testing instructions

Screenshots

Related issues

Checklist

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have updated the documentation accordingly
My changes generate no new warnings

Additional notes

There are still a few side things to be done on this feature.

msluszniak

Please apply changes to all applicable places; I didn't repeat these comments. But overall, it looks really cool, these suggestions are just very small nits ;)

packages/react-native-executorch/common/rnexecutorch/data_processing/Sequential.h

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Decoder.cpp

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Decoder.h

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Constants.h

...act-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/DurationPredictor.cpp

...ges/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Partitioner.cpp

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Utils.cpp

msluszniak · 2026-01-09T14:10:03Z

Since you added a demo app in this PR, you should also update README.md in the following section:

chmjkb

Left some nits for the moment. I feel like we need to add docs to get a solid understanding of the public API layer. However, this really looks like a lot of solid work, congrats!

Haven't finished reviewing tho :D

chmjkb · 2026-01-09T14:30:32Z

apps/speech/screens/TextToSpeechScreen.tsx

+  const [readyToGenerate, setReadyToGenerate] = useState(false);
+
+  const audioContextRef = useRef<AudioContext | null>(null);
+  const sourceRef = useRef<any>(null);


Can we type this?
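As a sketch of one option, the ref could be typed against the source-node type of whichever audio library the screen uses, or against a narrow interface listing only the methods the screen actually calls. `PlayableSource`, `SourceRef` and `stopPlayback` below are hypothetical names used for illustration; in the component this would presumably become `useRef<PlayableSource | null>(null)`.

```typescript
// Hypothetical minimal shape: only what the screen needs from the source node.
interface PlayableSource {
  start(): void;
  stop(): void;
}

// Mirrors the `current` field of the object returned by
// React's useRef<PlayableSource | null>(null).
interface SourceRef {
  current: PlayableSource | null;
}

// Stops and clears the current source; returns false if nothing was playing.
function stopPlayback(ref: SourceRef): boolean {
  if (ref.current === null) {
    return false;
  }
  ref.current.stop();
  ref.current = null;
  return true;
}
```

With a typed ref, a misspelled method or a missing null check is reported by the compiler instead of surfacing at runtime.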

chmjkb · 2026-01-09T14:30:49Z

apps/speech/screens/TextToSpeechScreen.tsx

+      iosOptions: ['defaultToSpeaker'],
+    });
+
+    // Initialize context once


Suggested change

// Initialize context once

chmjkb · 2026-01-09T14:33:29Z

apps/speech/screens/TextToSpeechScreen.tsx

+
+      const onEnd = async () => {
+        setIsPlaying(false);
+        setReadyToGenerate(true);


Doesn't the useEffect above handle this?

chmjkb · 2026-01-09T14:42:00Z

packages/react-native-executorch/scripts/create-package.sh

Why is this deleted?

chmjkb · 2026-01-09T14:45:44Z

packages/react-native-executorch/src/types/tts.ts

+export enum TextToSpeechLanguage {
+  EN_US = 0,
+  EN_GB = 1,
+}


S2T exposes this as a string literal union, and I recommend doing the same here so we have a unified approach:

export type SpeechToTextLanguage = | 'af' | 'sq' | 'ar' // this continues
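A minimal sketch of what the string-literal-union approach could look like for TTS. The tags `'en-us'` and `'en-gb'`, the lookup table, and `toNativeLanguageId` are assumptions for illustration; the numeric values simply mirror the enum in the diff above.

```typescript
// Hypothetical language tags; the naming chosen in the PR may differ.
type TextToSpeechLanguage = 'en-us' | 'en-gb';

// If the native layer still expects numeric IDs (as the enum provided),
// the mapping can stay private to the module instead of leaking into the API.
const NATIVE_LANGUAGE_ID: Record<TextToSpeechLanguage, number> = {
  'en-us': 0,
  'en-gb': 1,
};

function toNativeLanguageId(language: TextToSpeechLanguage): number {
  return NATIVE_LANGUAGE_ID[language];
}
```

Because `Record<TextToSpeechLanguage, number>` requires a key for every member of the union, adding a new language without a native ID becomes a compile-time error.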

chmjkb · 2026-01-09T14:48:00Z

packages/react-native-executorch/src/types/tts.ts

+// Voice configuration
+// So far in Kokoro, each voice is directly associated with a language.
+// The 'data' field corresponds to (usually) binary file with voice tensor.


Can we use JSDoc for this kind of comment?
For example:

/**
 * Voice configuration
 *
 * So far in Kokoro, each voice is directly associated with a language.
 * The 'data' field corresponds to (usually) binary file with voice tensor.
 */

Overall, I think this is a good approach for everything we expose in the public API, since it lets users figure out what something does without opening the docs.

chmjkb · 2026-01-09T14:50:18Z

packages/react-native-executorch/src/constants/modelUrls.ts

+export const URL_PREFIX =
  'https://huggingface.co/software-mansion/react-native-executorch';
-const VERSION_TAG = 'resolve/v0.6.0';
-// const NEXT_VERSION_TAG = 'resolve/v0.7.0';
+export const VERSION_TAG = 'resolve/v0.6.0';
+export const NEXT_VERSION_TAG = 'resolve/v0.7.0';


Does this need to be exported? Exporting it makes it possible for users to import it, which is unnecessary.

Removed unnecessary Log.h include from RnExecutorchInstaller.h

Removed unnecessary iostream include from BaseModel.cpp

chmjkb · 2026-01-12T08:40:41Z

packages/react-native-executorch/src/types/tts.ts

+export interface VoiceConfig {
+  language: TextToSpeechLanguage;
+  data: ResourceSource;
+  extra?: Record<string, unknown>;
+}
+
+// Individual model configurations
+// - Kokoro Configuration (including Phonemis tagger resource)
+export interface KokoroConfig {
+  durationPredictorSource: ResourceSource;
+  f0nPredictorSource: ResourceSource;
+  textEncoderSource: ResourceSource;
+  textDecoderSource: ResourceSource;
+}
+
+// Model + voice configurations
+export interface TextToSpeechConfig {
+  model: KokoroConfig; // ... add other model types in the future
+  voice?: VoiceConfig;
+  options?: any; // A completely optional model-specific configuration


Would it be possible to find a way to not use any or unknown here? See the comment in TextToSpeechModule.ts

chmjkb · 2026-01-12T08:58:24Z

packages/react-native-executorch/src/modules/natural_language_processing/TextToSpeechModule.ts

+    const uri = (config.model as any)[anySourceKey];
+    if (uri.includes('kokoro')) {
+      await this.loadKokoro(
+        config.model,
+        config.voice!,
+        onDownloadProgressCallback,
+        config.options
+      );
+    }


We shouldn't do it this way. What if someone changes the name of this file, or uses require()?

I would recommend doing something like this:

// 1. Define specific option types first for reusability
interface KokoroOptions {
  temperature: number;
  contextSize: number;
}

interface OtherOptions {
  language: string;
  beamSize: number;
}

// 2. Define the Configs with a strict 'type' discriminator
interface KokoroConfig {
  type: 'kokoro';
  modelPath: string;
  options: KokoroOptions;
}

interface OtherModelConfig {
  type: 'other';
  modelPath: string;
  options: OtherOptions;
}

type ModelConfig = KokoroConfig | OtherModelConfig;

class GenericModel {
  public async load(config: ModelConfig): Promise<void> {
    switch (config.type) {
      case 'kokoro':
        // TypeScript implies config is KokoroConfig here
        await this.loadKokoro(config.modelPath, config.options);
        break;
      case 'other':
        // TypeScript implies config is OtherModelConfig here
        await this.loadOther(config.modelPath, config.options);
        break;
    }
  }

  private async loadKokoro(path: string, options: KokoroOptions) {
    console.log(`Loading Kokoro with temp: ${options.temperature}`);
    // implementation...
  }

  private async loadOther(path: string, options: OtherOptions) {
    console.log(`Loading Other with lang: ${options.language}`);
    // implementation...
  }
}
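One possible addition to the pattern above is an exhaustiveness check, so that adding a new model type to the union without handling it fails at compile time. This is only a sketch: `assertNever`, `describeModel` and the trimmed-down config shapes are hypothetical and not part of the PR.

```typescript
// Trimmed-down versions of the configs from the suggestion above.
interface KokoroConfig {
  type: 'kokoro';
  modelPath: string;
}

interface OtherModelConfig {
  type: 'other';
  modelPath: string;
}

type ModelConfig = KokoroConfig | OtherModelConfig;

// Accepts only `never`, so it type-checks solely when every case is handled.
function assertNever(value: never): never {
  throw new Error(`Unhandled model config: ${JSON.stringify(value)}`);
}

function describeModel(config: ModelConfig): string {
  switch (config.type) {
    case 'kokoro':
      return `kokoro model at ${config.modelPath}`;
    case 'other':
      return `other model at ${config.modelPath}`;
    default:
      // If a new member is added to ModelConfig, this line stops compiling.
      return assertNever(config);
  }
}
```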

chmjkb · 2026-01-12T08:59:07Z

packages/react-native-executorch/src/modules/natural_language_processing/TextToSpeechModule.ts

+    if (!voice.extra || !voice.extra.tagger || !voice.extra.lexicon) {
+      throw new Error(
+        'Kokoro: voice config is missing required extra fields: tagger and/or lexicon.'
+      );
+    }


https://github.com/software-mansion/react-native-executorch/pull/710/changes#r2681356511

chmjkb · 2026-01-12T08:59:20Z

packages/react-native-executorch/src/modules/natural_language_processing/TextToSpeechModule.ts

+      voice.extra!.tagger,
+      voice.extra!.lexicon


Let's not use non-null assertion operators
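A sketch of one way to drop the `!` operators: validate the optional fields once and return a narrowed object that the loader can use directly. `KokoroVoiceExtra` and `requireKokoroExtra` are hypothetical names, and treating `tagger` and `lexicon` as strings is an assumption made to keep the example self-contained.

```typescript
// Hypothetical narrowed shape of the Kokoro-specific voice extras.
interface KokoroVoiceExtra {
  tagger: string;
  lexicon: string;
}

// Validates the loosely typed `extra` bag once and returns a fully typed object,
// so callers never need `voice.extra!.tagger`.
function requireKokoroExtra(extra?: Record<string, unknown>): KokoroVoiceExtra {
  const tagger = extra?.['tagger'];
  const lexicon = extra?.['lexicon'];
  if (typeof tagger !== 'string' || typeof lexicon !== 'string') {
    throw new Error(
      'Kokoro: voice config is missing required extra fields: tagger and/or lexicon.'
    );
  }
  return { tagger, lexicon };
}
```

A stricter alternative is to make the fields required on a Kokoro-specific voice type, which removes the runtime check altogether.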

chmjkb · 2026-01-12T09:04:19Z

packages/react-native-executorch/src/modules/natural_language_processing/TextToSpeechModule.ts

+    return await this.nativeModule.generate(text, speed);
+  }
+
+  public async stream({


Maybe we could make this a generator? See the STT implementation
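For illustration, a callback-based streaming call can be wrapped in an async generator roughly as follows. This is a sketch under assumptions: `NativeStream` is a hypothetical stand-in for the native streaming method (its real signature is not shown in this thread), and `streamAudio` is an illustrative name, not the STT implementation.

```typescript
// Hypothetical shape of the native call: invokes onChunk for every audio chunk
// and resolves once generation has finished.
type NativeStream = (
  text: string,
  onChunk: (chunk: Float32Array) => void
) => Promise<void>;

async function* streamAudio(
  nativeStream: NativeStream,
  text: string
): AsyncGenerator<Float32Array> {
  const queue: Float32Array[] = [];
  let finished = false;
  let failure: unknown = null;
  let wake: (() => void) | null = null;

  // Wakes the consumer loop if it is currently waiting for data.
  const notify = () => {
    wake?.();
    wake = null;
  };

  nativeStream(text, (chunk) => {
    queue.push(chunk);
    notify();
  }).then(
    () => {
      finished = true;
      notify();
    },
    (error: unknown) => {
      failure = error ?? new Error('Streaming failed');
      finished = true;
      notify();
    }
  );

  while (true) {
    const chunk = queue.shift();
    if (chunk !== undefined) {
      yield chunk;
      continue;
    }
    if (failure !== null) throw failure;
    if (finished) return;
    // Nothing buffered yet: sleep until the next chunk or completion.
    await new Promise<void>((resolve) => {
      wake = resolve;
    });
  }
}
```

A caller could then consume the chunks with `for await (const chunk of streamAudio(nativeStream, text))` instead of passing callbacks.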

chmjkb · 2026-01-12T11:24:06Z

...act-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/DurationPredictor.cpp

+  }
+
+  int32_t nTokens = shape[0];
+  int64_t *durationsPtr = durations.data_ptr<int64_t>();


Use const_data_ptr() or mutable_data_ptr() instead of data_ptr().

chmjkb · 2026-01-12T11:27:24Z

...react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/DurationPredictor.h

+  explicit DurationPredictor(const std::string &modelSource,
+                             std::shared_ptr<react::CallInvoker> callInvoker);
+
+  // Returns a tuple (d, indices, effectiveDuration)


I think it would be nice to have some explanation of what exactly the returned values are.

chmjkb · 2026-01-12T11:28:23Z

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Encoder.h

+  explicit Encoder(const std::string &modelSource,
+                   std::shared_ptr<react::CallInvoker> callInvoker);
+
+  Result<std::vector<EValue>> generate(const std::string &method,


Same here, a comment would be nice.

chmjkb · 2026-01-12T11:34:35Z

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Encoder.h

+class Encoder : public BaseModel {
+public:
+  explicit Encoder(const std::string &modelSource,
+                   std::shared_ptr<react::CallInvoker> callInvoker);
+
+  Result<std::vector<EValue>> generate(const std::string &method,
+                                       const Configuration &inputConfig,
+                                       std::span<Token> tokens,
+                                       std::span<bool> textMask,
+                                       std::span<float> pred_aln_trg);
+};


Same here, high-level comments would be nice

chmjkb · 2026-01-12T11:38:22Z

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Kokoro.cpp

+  auto croppedAudio =
+      utils::stripAudio(audio, constants::kSamplesPerMilisecond * 50);
+
+  std::vector<float> result(croppedAudio.begin(), croppedAudio.end());


Maybe we can return a span instead of a vector to avoid copying?

IgorSwat and others added 27 commits January 9, 2026 09:27

implement Kokoro components

15728a7

restructurize kokoro submodules

13da30f

main model logic

3a4a787

more progress...

9278ae0

kokoro main inference implemented

427fca8

text to speech prototype

d30a309

chore: remove create-package.sh

6799333

various fixes & improvements

66730cb

Temporary testing screen

2f233a9

reorganize DurationPredictor data flow

ca8a1b2

text-to-speech mvp

a31a20c

add ios support

98dfda9

implement fallback phonemization (US)

b3a38b4

fix 'ed' phonemization bug

3f706d7

add british support

dc52f73

add cropping audio vector

1a98997

small refactor

0415505

demo app finished

d78c936

update model input variants

d65e87a

implement input partitioning

a38a1c1

native side streaming

6cfad82

audio streaming fixed

bb27b40

enable additional model options

8299362

fix reload bug

e4025b4

update phonemis binaries

18c5da3

reduce phonemis android binaries size & other small fixes

050cbc7

implement a demo quiz app

f9ff668

IgorSwat requested review from benITo47, chmjkb and msluszniak January 9, 2026 08:46

IgorSwat requested a review from mkopcins January 9, 2026 08:46

msluszniak reviewed Jan 9, 2026

IgorSwat force-pushed the @is/text-to-speech branch from 571fada to 9eb0629 Compare January 9, 2026 13:58

chmjkb requested changes Jan 9, 2026

chmjkb and others added 6 commits January 12, 2026 10:55

fix: revert legacy imports

e1f4d0b

chore: remove create-package.sh

bd9d83f

small code refactor

79eea5f

Remove Log.h include from RnExecutorchInstaller.h

ca76a5c

Removed unnecessary Log.h include from RnExecutorchInstaller.h

Remove unused iostream include

8f594d1

Removed unnecessary iostream include from BaseModel.cpp

remove doubled commits

2b633eb

IgorSwat force-pushed the @is/text-to-speech branch from 9eb0629 to 2b633eb Compare January 12, 2026 10:01

chmjkb requested changes Jan 12, 2026


4 participants
