Unofficial Swift SDK for Google's LiteRT-LM on-device inference engine. Not affiliated with or endorsed by Google.
Run Gemma 4 and other LLMs entirely on-device — text, vision, audio, and tool calling with zero cloud dependency.
Screenshots: Model Setup · Text Chat · Vision Response · Audio Response
```mermaid
graph TB
    subgraph App["Your Application"]
        UI[SwiftUI / UIKit]
    end
    subgraph SDK["LiteRTLM Swift SDK"]
        direction TB
        DL[LiteRTLMDownloader<br><i>Download / Pause / Resume</i>]
        subgraph Core["LiteRTLM"]
            Engine["LMEngine (Actor)<br><i>Model Lifecycle</i>"]
            Session["LMSession<br><i>Text Streaming</i>"]
            Conv["LMConversation<br><i>Multi-turn + Multimodal</i>"]
            Tools["Tool Calling<br><i>OpenAPI Schema</i>"]
        end
    end
    subgraph Runtime["CLiteRTLM.xcframework"]
        CAPI["C API Bridge"]
        LiteRT["Google LiteRT-LM<br><i>On-device Inference</i>"]
    end
    subgraph HW["Hardware"]
        CPU[CPU]
        GPU["GPU (Metal)"]
    end
    UI --> DL
    UI --> Engine
    Engine --> Session
    Engine --> Conv
    Conv --> Tools
    Session --> CAPI
    Conv --> CAPI
    CAPI --> LiteRT
    LiteRT --> CPU
    LiteRT --> GPU
    style App fill:#e8f5e9,stroke:#2e7d32
    style SDK fill:#e3f2fd,stroke:#1565c0
    style Runtime fill:#fff3e0,stroke:#e65100
    style HW fill:#f3e5f5,stroke:#6a1b9a
```
| Feature | Description |
|---|---|
| Text Generation | Streaming and blocking generation with Gemma 4 prompt templates |
| Vision | Send images (JPEG/PNG) alongside text for multimodal understanding |
| Audio | Process raw WAV audio input — passed directly to the model |
| Multi-turn Conversations | Automatic KV-cache and history management |
| Tool Calling | Register functions the model can invoke (automatic or manual execution) |
| GPU Inference | Metal accelerator bundled for GPU-accelerated text generation |
| Model Management | Download, pause/resume, cancel .litertlm model files with progress |
| Streaming | AsyncSequence-based token streaming with clean text output |
| Benchmark Metrics | Prefill/decode speed, time-to-first-token, per-turn breakdowns |
| Actor Concurrency | Thread-safe engine access via Swift actors |
Add to your `Package.swift`:

```swift
dependencies: [
    .package(
        url: "https://github.com/Luxshan2000/LiteRTLM-Swift-SDK.git",
        from: "0.1.0"
    )
]
```

Add the products to your target:

```swift
.target(
    name: "YourApp",
    dependencies: [
        .product(name: "LiteRTLM", package: "LiteRTLM-Swift-SDK"),
        .product(name: "LiteRTLMDownloader", package: "LiteRTLM-Swift-SDK"),
    ]
)
```

Or in Xcode: File > Add Package Dependencies > paste the repository URL.
Entitlements (iOS) — models consume ~4 GB of RAM:

```xml
<key>com.apple.developer.kernel.increased-memory-limit</key>
<true/>
<key>com.apple.developer.kernel.extended-virtual-addressing</key>
<true/>
```

Codesign — the xcframework includes companion dylibs that Xcode may not auto-sign. Add a Run Script build phase:
```sh
for DYLIB in \
  "${BUILT_PRODUCTS_DIR}/${FRAMEWORKS_FOLDER_PATH}/CLiteRTLM.framework/libGemmaModelConstraintProvider.dylib" \
  "${BUILT_PRODUCTS_DIR}/${FRAMEWORKS_FOLDER_PATH}/CLiteRTLM.framework/libLiteRtMetalAccelerator.dylib"; do
  if [ -f "$DYLIB" ]; then
    /usr/bin/codesign --force --sign "${EXPANDED_CODE_SIGN_IDENTITY}" "$DYLIB"
  fi
done
```

```swift
import LiteRTLM
import LiteRTLMDownloader

// 1. Download the model (~2.4 GB, cached after first download)
let downloader = ModelDownloader()
await downloader.download(model: .gemma4E2B)

// 2. Create and load the engine
let config = EngineConfiguration(modelPath: downloader.modelPath(for: .gemma4E2B)!)
    .backend(.cpu) // or .gpu for Metal acceleration
    .visionBackend(.cpu)
    .audioBackend(.cpu)
    .maxTokens(4096)

let engine = LMEngine(configuration: config)
try await engine.load()

// 3. Chat
let conversation = try await engine.createConversation()
let reply = try await conversation.send("Hello! What can you do?")
print(reply)

// 4. Cleanup
conversation.close()
await engine.unload()
```

```swift
let conversation = try await engine.createConversation(
    configuration: ConversationConfiguration()
        .maxOutputTokens(1024)
        .sampler(SamplerConfiguration(temperature: 0.7, topK: 40, topP: 0.95, samplerType: .topP))
)

// Stream tokens as they generate
let stream = try conversation.sendStream("Explain quantum computing simply")
for try await token in stream {
    print(token, terminator: "")
}

// Or blocking
let response = try await conversation.send("What is Swift?")
```

```swift
let photoData = UIImage(named: "cat")!.jpegData(compressionQuality: 0.8)!
let description = try await conversation.send(
    "What's in this image?",
    images: [photoData]
)
```

Images are automatically saved to temp files and passed as file paths to the C API. They are resized to fit the maximum dimension (default 1024 px).
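That resizing step is plain aspect-ratio math. A hedged sketch of the idea (the helper below is illustrative, not the SDK's internal implementation):

```swift
import Foundation

/// Scales (width, height) down so the longer side fits within `maxDimension`,
/// preserving the aspect ratio. Images already within the limit are returned unchanged.
func fittedSize(width: Double, height: Double, maxDimension: Double = 1024) -> (width: Double, height: Double) {
    let longest = max(width, height)
    guard longest > maxDimension else { return (width, height) }
    let scale = maxDimension / longest
    return ((width * scale).rounded(), (height * scale).rounded())
}
```

For example, a 4032×3024 photo scales to 1024×768, while an 800×600 image passes through untouched.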
```swift
let audioData = try Data(contentsOf: recordingURL) // 16 kHz mono WAV
let response = try await conversation.send(
    "What did they say?",
    audio: [audioData]
)
```

Conversations maintain the KV-cache across turns, so earlier context is not re-prefilled. The first turn takes roughly 15-30 s; follow-ups are much faster.
```swift
let reply1 = try await conversation.send("Tell me about Tokyo")
let reply2 = try await conversation.send("What about the food scene?")
let reply3 = try await conversation.send("Give me a 3-day itinerary")

// History tracked automatically
print(conversation.history.count) // 6 messages
```

```swift
let weatherTool = Tool(
    name: "get_weather",
    description: "Get current weather for a location",
    parameters: [
        .init(name: "city", type: .string, description: "City name", required: true),
    ]
) { args in
    let city = args["city"] as? String ?? "unknown"
    return ["temperature": 22, "condition": "sunny", "city": city]
}

let conversation = try await engine.createConversation(
    configuration: ConversationConfiguration()
        .tools([weatherTool])
        .toolExecution(.automatic) // SDK calls the tool and feeds the result back
)

let response = try await conversation.send("What's the weather in Tokyo?")
// → "It's currently 22°C and sunny in Tokyo!"
```

```swift
// CPU — works on all devices
let config = EngineConfiguration(modelPath: modelURL).backend(.cpu)

// GPU — Metal acceleration for faster text inference
// Vision and audio backends must stay on CPU (model constraint)
let config = EngineConfiguration(modelPath: modelURL)
    .backend(.gpu)
    .visionBackend(.cpu)
    .audioBackend(.cpu)
```

```swift
let config = EngineConfiguration(modelPath: modelURL)
    .benchmarkEnabled(true)

if let metrics = conversation.benchmarkInfo() {
    print("Time to first token: \(metrics.timeToFirstToken)s")
    print("Decode speed: \(metrics.averageDecodeSpeed) tok/s")
}
```

| Method | Default | Description |
|---|---|---|
| `.backend(.cpu/.gpu)` | `.cpu` | Primary inference backend |
| `.visionBackend(.cpu)` | `nil` | Vision encoder backend |
| `.audioBackend(.cpu)` | `nil` | Audio encoder backend |
| `.maxTokens(4096)` | `nil` | Max context length |
| `.cacheDirectory(url)` | `nil` | Compiled model cache |
| `.benchmarkEnabled(true)` | `false` | Enable timing metrics |
| `.logLevel(.warning)` | `.warning` | Log verbosity |
| Method | Default | Description |
|---|---|---|
| `.maxOutputTokens(1024)` | `1024` | Max tokens per response |
| `.sampler(...)` | `.balanced` | Sampling strategy |
| `.tools([...])` | `[]` | Registered tools |
| `.toolExecution(.automatic)` | `.automatic` | Tool execution mode |
| `.maxImageDimension(1024)` | `1024` | Max image resize dimension |
| Model | Size | Registry |
|---|---|---|
| Gemma 4 E2B | ~2.4 GB | `ModelRegistry.gemma4E2B` |
| Gemma 4 E4B | ~3.4 GB | `ModelRegistry.gemma4E4B` |
See Examples/ChatDemo for a complete iOS chat app demonstrating:
- CPU/GPU backend selection
- Text chat with streaming responses
- Image attachment via PhotosPicker
- Raw audio recording passed directly to model
- Tool calling with sample tools (weather, calculator, dice roll)
- Model download with speed/ETA display
- Stop/cancel generation
- Gemma 4 turn tag stripping
- Modern SwiftUI layout with keyboard avoidance
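The turn-tag stripping listed above can be approximated with plain string filtering. A sketch, assuming Gemma-style control tags (`<start_of_turn>`, `<end_of_turn>`); the demo's actual helper may differ:

```swift
import Foundation

/// Removes Gemma-style control tags (e.g. "<start_of_turn>model", "<end_of_turn>")
/// from raw model output, leaving only the user-visible text.
func strippingTurnTags(_ raw: String) -> String {
    var text = raw
    // Longer variants first so the bare "<start_of_turn>" pass catches any leftovers.
    for tag in ["<start_of_turn>user", "<start_of_turn>model", "<start_of_turn>", "<end_of_turn>"] {
        text = text.replacingOccurrences(of: tag, with: "")
    }
    return text.trimmingCharacters(in: .whitespacesAndNewlines)
}
```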
```sh
cd Examples/ChatDemo
xcodegen generate
open ChatDemo.xcodeproj
# Set your Development Team in Signing & Capabilities, then build & run
```

| Module | Purpose | Dependencies |
|---|---|---|
| LiteRTLM | Public API — engine, sessions, conversations, streaming, tools | CLiteRTLM |
| LiteRTLMDownloader | Model download management — progress, pause/resume, registry | None |
| CLiteRTLM | Pre-built xcframework binary from Google's LiteRT-LM | None |
Key design decisions:
- Actor isolation on `LMEngine` — serializes access to mutable C pointers without manual locking
- `@unchecked Sendable` on `LMSession` / `LMConversation` — an internal serial `DispatchQueue` provides thread safety for C API callback compatibility
- Builder-pattern configs — immutable value types with copy-on-write, safe to share across threads
- Conversation streaming — the C API sends full-text JSON snapshots per callback; the SDK parses each snapshot and yields only the text delta as clean tokens
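The snapshot-to-delta step can be illustrated in isolation. A minimal sketch of the idea (not the SDK's internal parser), assuming each callback delivers the full accumulated text so far:

```swift
import Foundation

/// Tracks the last snapshot and yields only the newly appended suffix,
/// mimicking how full-text snapshots become incremental tokens.
struct DeltaExtractor {
    private var previous = ""

    mutating func delta(from snapshot: String) -> String {
        // Guard against non-monotonic snapshots by falling back to the full text.
        guard snapshot.hasPrefix(previous) else {
            previous = snapshot
            return snapshot
        }
        let suffix = String(snapshot.dropFirst(previous.count))
        previous = snapshot
        return suffix
    }
}
```

Feeding successive snapshots "Hel", "Hello", "Hello, world" yields "Hel", then "lo", then ", world".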
| Requirement | Detail |
|---|---|
| iOS / iPadOS | 17.0+ |
| Swift | 5.9+ |
| Device | iPhone 12+ / iPad with A14+ chip |
| RAM | 6 GB+ available |
| Model | .litertlm format |
The Swift SDK code is licensed under the MIT License — see LICENSE.
This SDK bundles pre-built binaries from Google, which remain under the Apache License 2.0:
| Binary | Source | License |
|---|---|---|
| `CLiteRTLM.xcframework` | LiteRT-LM | Apache 2.0 |
| `libGemmaModelConstraintProvider.dylib` | LiteRT-LM | Apache 2.0 |
| `libLiteRtMetalAccelerator.dylib` | LiteRT | Apache 2.0 |
See NOTICE for full attribution.
This is not an official Google product. "LiteRT", "Gemma", and "Google" are trademarks of Google LLC.



