Support Local Usage of Google's Gemma Open-Source Models via Terminal #5945
I completely agree: having support for running open-source Gemma models locally in the CLI would be a fantastic addition. The ideas you've outlined (a setup command, hardware capability checks, caching, and GPU/CPU fallback) all make a lot of sense. This would make the CLI far more versatile and research-friendly.
FYI, I maintain a downstream fork of gemini-cli that keeps up with their main branch: https://github.com/acoliver/llxprt-code. It adds support for local models; you would run `llxprt --provider openai --baseurl 127.0.0.1:1234/v1/ --model gemma-3n-it`.

We also offer configurable prompts. The gemini-cli prompt is REALLY long, so we put the default set in `~/.llxprt/prompts/`, and you can edit them or override them per provider/model. You can also save a profile with `/profile save "mysetup"` and then just run `llxprt --profile-load mysetup` with everything preconfigured.

The Gemini Code Assist team developing gemini-cli has responded to this multiple times that they intend to focus exclusively on Gemini models "for now." I think this is the right decision, as it is a massive undertaking and they have a lot to develop. Community forks downstream can give developers more choice and control, including local models. If you come downstream, you can enjoy all of the features of gemini-cli along with your local models, other providers, and features like Claude Code style todo lists!
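For reference, here is that workflow collected into one session, using only the flags mentioned above and assuming an OpenAI-compatible server is already serving the model at `127.0.0.1:1234/v1/`:

```sh
# Point llxprt at a locally hosted, OpenAI-compatible endpoint serving Gemma.
llxprt --provider openai --baseurl 127.0.0.1:1234/v1/ --model gemma-3n-it

# Inside the session, save the current configuration as a reusable profile.
/profile save "mysetup"

# Later, start with everything preconfigured.
llxprt --profile-load mysetup
```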
It would be great if this terminal tool could support running some of Google's open-source AI models like gemma3n:e2b, gemma3n:e4b, and other Gemma variants locally. These models are freely available and would enable users to leverage powerful LLM capabilities directly from their own machine, improving privacy, speed, and flexibility. Integrating local Gemma model support could make the CLI much more versatile for academic and research workflows.
Why this matters:
- Privacy: prompts and outputs stay on the user's own machine, with no remote calls.
- Speed: responses come from local inference rather than a network round-trip.
- Flexibility: the freely available Gemma variants fit academic and research workflows that need local execution.
Requested capabilities (MVP → nice-to-have):
- A setup command (e.g. `cli gemma setup --model gemma3n:e2b`).
- A local mode (`--local` flag or config `mode=local`).

Stretch:
Possible implementation outline:
- A `backends/` abstraction: `RemoteBackend`, `GemmaLocalBackend` (rough sketch below).
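To make the outline concrete, here is a minimal TypeScript sketch of what that abstraction could look like. Everything here is hypothetical: the interface name, method signatures, endpoint shape, and defaults are illustrative assumptions, not existing gemini-cli APIs.

```typescript
// Hypothetical backend abstraction for illustration only; none of these
// types exist in gemini-cli today.

export interface GenerationBackend {
  /** Human-readable name, e.g. "remote-gemini" or "local-gemma". */
  readonly name: string;
  /** Produce a completion for a plain-text prompt. */
  generate(prompt: string): Promise<string>;
}

/** Current behavior: delegate to the hosted Gemini API (details elided here). */
export class RemoteBackend implements GenerationBackend {
  readonly name = "remote-gemini";
  async generate(prompt: string): Promise<string> {
    // Placeholder for the existing remote call path.
    throw new Error("Remote path not shown in this sketch");
  }
}

/** Proposed: call a locally served Gemma model over an OpenAI-compatible HTTP endpoint. */
export class GemmaLocalBackend implements GenerationBackend {
  readonly name = "local-gemma";
  constructor(
    private readonly baseUrl = "http://127.0.0.1:1234/v1", // assumed local server
    private readonly model = "gemma3n:e2b",
  ) {}

  async generate(prompt: string): Promise<string> {
    // Assumes a local server exposing the OpenAI chat-completions shape is
    // already running; no API key or network egress is required.
    const res = await fetch(`${this.baseUrl}/chat/completions`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: this.model,
        messages: [{ role: "user", content: prompt }],
      }),
    });
    if (!res.ok) throw new Error(`Local backend error: ${res.status}`);
    const data = (await res.json()) as {
      choices?: { message?: { content?: string } }[];
    };
    return data.choices?.[0]?.message?.content ?? "";
  }
}

/** Pick a backend based on the proposed --local flag / mode=local config. */
export function selectBackend(localMode: boolean): GenerationBackend {
  return localMode ? new GemmaLocalBackend() : new RemoteBackend();
}
```

The point is only that call sites would depend on `GenerationBackend`, so adding a local path would not have to touch the existing remote code.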
Risks / considerations:
Acceptance (initial):
- `cli --local --model gemma3n:e2b "Hello"` returns coherent text without remote calls (a quick check is sketched below).
- `--local` off reverts to current remote behavior unchanged.

Let me know if you'd like a smaller-scoped first PR (e.g. just the backend abstraction + one Gemma variant).
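If it helps when reviewing a first PR, here is one rough, hypothetical way the first acceptance item could be scripted; `cli` and `--local` are the proposed names from this request, not something that exists today.

```typescript
// Hypothetical acceptance check for the proposed --local flag.
// Run with network access disabled (e.g. in an offline sandbox) so that any
// accidental remote call fails instead of silently succeeding.
import { execSync } from "node:child_process";

const out = execSync('cli --local --model gemma3n:e2b "Hello"', {
  encoding: "utf8",
});

if (out.trim().length === 0) {
  throw new Error("Expected a non-empty local completion");
}
console.log("Local run produced output:", out.trim().slice(0, 80));
```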