DRAFT: Add `embedfile` for all-in-one embeddings CLI tool #644

asg017 · 2024-11-27T20:57:41Z

embedfile is a CLI tool that bundles llama.cpp / llamafile, the SQLite CLI, sqlite-vec, sqlite-lembed, and a few other SQLite extensions into a comprehensive and performant tool for generating text embeddings from CSV, JSON, NDJSON, txt, or SQLite database files.

Just like llamafile and whisperfile, you can embed a .gguf embeddings model file into an embedfile, removing the need to manage weights yourself.

Model	embedfile	Size (f16 quant)
sentence-transformers/all-MiniLM-L6-v2	`all-MiniLM-L6-v2.f16.embedfile`	`56MB`
mixedbread-ai/mxbai-embed-xsmall-v1	`mxbai-embed-xsmall-v1-f16.embedfile`	`61MB`
nomic-ai/nomic-embed-text-v1.5	`nomic-embed-text-v1.5.f16.embedfile`	`273MB`
snowflake-arctic-embed-m-v1.5	`snowflake-arctic-embed-m-v1.5-f16.embedfile`	`221MB`
-	`embedfile` (no embedded model)	`12MB`

Here's an example, using MixedBread's xsmall model:

$ wget https://huggingface.co/asg017/embedfile/resolve/main/mxbai-embed-xsmall-v1-f16.embedfile
$ chmod u+x mxbai-embed-xsmall-v1-f16.embedfile 
$ ./mxbai-embed-xsmall-v1-f16.embedfile --version
embedfile 0.0.1-alpha.1, llamafile 0.8.16, SQLite 3.47.0, sqlite-vec=v0.1.6, sqlite-lembed=v0.0.1-alpha.8

This executable file already has sqlite-vec, sqlite-lembed, and the embeddings model pre-configured. Test that embeddings work with:

./mxbai-embed-xsmall-v1-f16.embedfile embed 'hello!'
[-0.058174,0.043776,0.030660,...]

You can embed data from CSV, JSON, NDJSON, and .txt files and save the results to a SQLite database. Here we are embedding the text column in the dbpedia.min.csv file, outputting to a dbpedia.db database.

$ ./mxbai-embed-xsmall-v1-f16.embedfile import --embed text dbpedia.min.csv dbpedia.db
INSERT INTO vec_items SELECT rowid, lembed("text") FROM temp.source;
100%|████████████████████| 10000/10000 [02:00<00:00, 83/s]
✔ dbpedia.min.csv imported into dbpedia.db, 10000 items

That was 10,000 rows with 820,604 tokens. I got 83 embeddings per second on my older 2019 Intel MacBook. On my M1 Mac Mini I get 173 embeddings/second, and I'm sure it's faster on newer Macs.
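As a quick sanity check on those numbers, the arithmetic can be sketched in Python. This uses only the figures quoted above (10,000 rows, 820,604 tokens, the 02:00 shown in the progress bar, and 173 embeddings/second on the M1); the elapsed time is rounded to the second by the progress bar, so the derived rates are approximate.

```python
# Figures quoted in the import run above.
rows = 10_000
tokens = 820_604
elapsed_s = 2 * 60  # "02:00" from the progress bar, rounded to the second

rows_per_s = rows / elapsed_s      # embeddings per second
tokens_per_s = tokens / elapsed_s  # token throughput
tokens_per_row = tokens / rows     # average input length

# Projected wall time for the same file at the M1 Mac Mini rate.
m1_elapsed_s = rows / 173

print(f"{rows_per_s:.1f} rows/s, {tokens_per_s:.0f} tokens/s")
print(f"{tokens_per_row:.1f} tokens/row, about {m1_elapsed_s:.0f} s on the M1")
```

So the Intel run works out to roughly 6,800 tokens per second, and the same file should take just under a minute at the M1 rate.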

Once indexed, you can search with the search command:

$ ./mxbai-embed-xsmall-v1-f16.embedfile search dbpedia.db 'global warming'
3240 0.852299 Attribution of recent climate change is the effort to scientifically ascertain mechanisms ...
6697 0.904844 The global warming controversy concerns the public debate over whether global warming is occurring, how ...
...
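If you want to consume these results from a script, here is a minimal Python sketch that parses the output shown above. It assumes each result line is `rowid`, distance, then text, separated by whitespace, exactly as displayed here; the real separator may differ, and `SearchHit` and `parse_search_output` are hypothetical names, not part of embedfile.

```python
from typing import NamedTuple


class SearchHit(NamedTuple):
    rowid: int
    distance: float
    text: str


def parse_search_output(output: str) -> list[SearchHit]:
    """Parse 'rowid distance text' lines; skip anything that does not fit."""
    hits = []
    for line in output.splitlines():
        # Split off the first two fields only, so the text keeps its spaces.
        parts = line.split(maxsplit=2)
        if len(parts) < 3:
            continue
        try:
            hits.append(SearchHit(int(parts[0]), float(parts[1]), parts[2]))
        except ValueError:
            continue  # not a result line
    return hits


sample = """\
3240 0.852299 Attribution of recent climate change is the effort to scientifically ascertain mechanisms ...
6697 0.904844 The global warming controversy concerns the public debate over whether global warming is occurring, how ...
...
"""
hits = parse_search_output(sample)
for hit in hits:
    print(hit.rowid, hit.distance)
```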

At any point, if you want to "eject" and run SQL scripts yourself, the sh command will fire up the sqlite3 CLI with all extensions and embeddings models pre-configured.

$ ./mxbai-embed-xsmall-v1-f16.embedfile sh
SQLite version 3.47.0 2024-10-21 16:30:22
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .mode qbox
sqlite> select sqlite_version(), vec_version(), lembed_version();
┌──────────────────┬───────────────┬──────────────────┐
│ sqlite_version() │ vec_version() │ lembed_version() │
├──────────────────┼───────────────┼──────────────────┤
│ '3.47.0'         │ 'v0.1.6'      │ 'v0.0.1-alpha.8' │
└──────────────────┴───────────────┴──────────────────┘
sqlite> select vec_to_json(vec_slice(lembed('hello!'), 0, 8)) as sample;
┌──────────────────────────────────────────────────────────────┐
│                            sample                            │
├──────────────────────────────────────────────────────────────┤
│ '[-0.058174,0.043776,0.030660,0.047412,-0.059377,-0.036267,0 │
│ .038117,0.005184]'                                           │
└──────────────────────────────────────────────────────────────┘

Status

This was really fun to put together, and I'd love to see this (or something like this) as part of the llamafile project. I totally get it if it's out-of-scope or not a priority, I'd be happy to maintain an experimental fork if needed.

Though as-is this branch isn't quite ready yet, there's a few things I want to fix:

Code is under llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?
llama.cpp/embedfile/BUILD.mk is a bit messy, I had trouble compiling .c files in the subdirectory so I manually added those builds. Would love some help cleaning that up!
I made manual changes to the vendored-in sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.
Include licenses/notices
A ton of assert()'s that fail on any error

TODO

Metadata + auxiliary column options in import
Better TUI for search results. Maybe REPL?
--k and other search options
--prefix option for nomic-like embeddings, ex --prefix 'search_document:'
Better perf
More embeddings models uploaded to HF

Build yourself

./make o//llama.cpp/embedfile/embedfile
make -f embedfile.mk all


jart · 2024-11-29T04:10:51Z

Hi Alex. Thanks for sending this. This would make an awesome addition to the project.

Code is under llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?

I recommend putting it in the root of the repo, for better visibility.

llama.cpp/embedfile/BUILD.mk is a bit messy, I had trouble compiling .c files in the subdirectory so I manually added those builds. Would love some help cleaning that up!

I've checked-in SQLite to third party. Your build rule can now simply depend on o/$(MODE)/third_party/sqlite/sqlite.a. The build is configured to have all the compile-time options you specified in this change, e.g. FTS5, FTS3, etc.

I made manual changes to the vendored-in sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.

I only needed to change the zlib include in sqlite3.c. If you need any other local changes, please feel free to make them to the new third_party location.

Include licenses/notices

It's recommended that you declare them like this:

__notice(mbedtls_notice, "\
Mbed TLS (Apache 2.0)\n\
Copyright ARM Limited\n\
Copyright The Mbed TLS Contributors");

In any one of your .c or .cpp files. This will ensure your copyright notice is distributed inside any binaries that are built with it.

A ton of assert()'s that fail on any error

Tell me more? Maybe I can help.

make -f embedfile.mk all

Could you incorporate this into the monolithic Makefile? While the default make rule needs to be hermetic, you can do whatever you want in manually-run rules. For example, under llamafile, there's a lot of manually-run CUDA stuff. You could have manual rules that package your standard embedfiles.

Here's some feedback:

Thank you for taking the time to write a man page.
Thank you for using the new cosmo_args() API.

Here's some suggestions / action items:

Could you update make install so it installs embedfile and its man page?
Please use the new third_party/sqlite/ package. Be sure to update #include lines to say #include "third_party/sqlite/sqlite3.h" etc.
Consider adding a .clang-format file to your package directory, with your preferred style (use Mozilla style in llamafile/highlight/.clang-format if you don't have a preference) and then run clang-format -i on your sources.

lbarasc · 2025-01-26T12:22:41Z

Hi, I just discovered your embedfile tool, and this is really huge!
I want to use it, but I have some questions.

For example, to add a text file, I run:
all-minilm-l6-v2.f16.embedfile.exe import --embed text mytest.txt mybase.db

Can you tell me the expected characteristics of the .txt file (encoding: UTF-8? line breaks: CR+LF? ...)?

To add a CSV file:
all-minilm-l6-v2.f16.embedfile.exe import --embed text mytest.csv mybase.db

Can you specify the expected CSV format (separator character, encoding, number of columns, column names, ...)?

Last question: how do I create my own embedfile.exe with a .gguf model added? Can I simply concatenate embedfile.exe and the .gguf file, or is something else needed?

Thank you for your help, and your tool.

Sincerely,

Lionel.

niutech · 2025-08-06T18:22:58Z

Great idea! Will there be any updates for embedfile?

niutech · 2025-08-11T20:35:09Z

Here is a man page for embedfile generated by ChatGPT based on the analysis of its source code:

`embedfile`(1) — Command-line tool for embedding and searching text data

NAME

embedfile — A comprehensive CLI tool for importing, embedding, and searching text data using SQLite and LLM-based vector embeddings.

SYNOPSIS

embedfile [OPTIONS] COMMAND [ARGS...]

Commands:

embedfile embed [TEXT]
embedfile sh
embedfile import [--embed COLUMN] [--table NAME] SOURCE_FILE INDEX_DB
embedfile search INDEX_DB QUERY

Options:

--model FILE, -m FILE
Specify path to .gguf model file to use for embedding.
--version, -v
Show version information of embedfile and its components.
--help, -h
Show help information.

DESCRIPTION

embedfile is a high-performance, self-contained CLI tool that turns raw text or structured data (CSV, JSON, NDJSON, TXT) into vector embeddings using LLMs (via llamafile and sqlite-lembed). It stores and indexes this data in SQLite databases for efficient similarity search.

Typical Use Cases:

Generate embeddings for raw text or structured files.
Import and index data into a local SQLite vector database.
Search for semantically similar results.
Launch an interactive shell (sh) to query the database manually.

COMMANDS

`embedfile embed [TEXT]`

Generate an embedding for a single string or stream of text.

If TEXT is provided, prints a JSON array embedding for the string.
If no argument is given, reads from stdin line-by-line and prints embeddings.

Example:

embedfile embed "hello world"
echo "hello world" | embedfile embed

`embedfile sh`

Launch an interactive SQLite shell with all extensions (sqlite-vec, sqlite-lembed, etc.) preloaded.

Example:

embedfile sh
embedfile sh < commands.sql

`embedfile import [OPTIONS] SOURCE_FILE INDEX_DB`

Import a structured or plain text file into a SQLite vector database, embedding a specific column.

Supported file types:

.csv
.json
.ndjson
.txt

Options:

--embed COLUMN
Column name from the source file to embed. Required for all formats except .txt.
--table NAME, -t NAME
Table name (required if the source file is a SQLite .db).

Positional Arguments:

SOURCE_FILE
Path to CSV, JSON, NDJSON, TXT, or SQLite DB file.
INDEX_DB
Target SQLite database where data and embeddings will be stored.

Example:

embedfile import --embed text dbpedia.min.csv dbpedia.db

`embedfile search INDEX_DB QUERY`

Search the vector database for entries similar to the input query.

Uses the vector index and returns the top 10 matches.
Prints the rowid, the distance, and the matched text for each result.

Example:

embedfile search dbpedia.db "search query"

EMBEDDING MODEL

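The bare embedfile binary ships with no embedding model, so you must specify a .gguf model with --model or -m. Builds with an embedded model, such as mxbai-embed-xsmall-v1-f16.embedfile, come pre-configured.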

embedfile --model ./model.gguf import --embed text input.csv output.db

EXAMPLES

Embed text from stdin:

echo "Paris is the capital of France." | embedfile embed

Import CSV and embed a column:

embedfile --model ./model.gguf import --embed description products.csv products.db

Search for similar entries:

embedfile search products.db "wireless earbuds"

Interactive shell with embedded extensions:

embedfile sh

VERSION

Show versions of embedfile and its dependencies:

embedfile --version

Output format:

embedfile 0.0.1-alpha.1, llamafile 0.8.16, SQLite 3.47.0, sqlite-vec=v0.1.6, sqlite-lembed=v0.0.1-alpha.8

DEPENDENCIES

This tool is bundled with the following embedded components:

llamafile (for LLM inference)
sqlite-vec (vector similarity search)
sqlite-lembed (text to vector embedding)
sqlite-csv, sqlite-lines, sqlite-json (data import)

No Python, servers, or external dependencies required.

EXIT STATUS

0: Success
1: General error
Non-zero exit codes may also indicate SQLite or model loading issues.

AUTHOR

Created by Alex Garcia.

niutech · 2025-08-11T20:46:28Z

@lbarasc answering your questions using ChatGPT based on the source code:

TXT File Input Characteristics

Encoding:

UTF-8 is required. The source uses lines_read(?) internally via sqlite-lines, which assumes UTF-8 encoded text.

Line Breaks:

lines_read handles both LF (\n) and CR+LF (\r\n)

Format:

One logical entry per line.
No special header or metadata required.
Empty lines may be treated as empty strings (not skipped unless handled downstream).
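To produce a .txt file that matches the points above, a minimal Python sketch could look like this. It writes UTF-8 with LF line endings, keeps one entry per line by collapsing embedded line breaks, and drops entries that would become empty lines. `write_entries` and the file name are hypothetical, and the claims about `lines_read` come from the analysis above and have not been re-checked here.

```python
import tempfile
from pathlib import Path


def write_entries(path: Path, entries: list[str]) -> int:
    """Write one entry per line as UTF-8 with LF endings; return the line count."""
    lines = []
    for entry in entries:
        # Collapse whitespace runs, including embedded CR/LF, so that
        # every entry stays on a single line.
        flat = " ".join(entry.split())
        if flat:  # skip entries that would turn into empty lines
            lines.append(flat)
    data = "".join(line + "\n" for line in lines)
    # write_bytes avoids newline translation to CR+LF on Windows.
    path.write_bytes(data.encode("utf-8"))
    return len(lines)


out = Path(tempfile.mkdtemp()) / "mytest.txt"
count = write_entries(out, ["first entry", "second\r\nentry, wrapped", "   ", "café"])
print(count, out)
```

The resulting file can then be passed to `import` in place of mytest.txt.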

CSV File Input Characteristics

Encoding:

Same as TXT: UTF-8. The CSV virtual table (sqlite-csv) reads from file directly using filename, and there's no transcoding.

Character Separator:

Default: comma (,). The code does not specify a custom separator, so only standard comma-separated CSV is supported.

Header:

Required: "CREATE VIRTUAL TABLE temp.source USING csv(filename=\"%w\", header=yes)". If your CSV lacks a header, import will fail or misinterpret the first row.

Column Names:

Must be valid SQLite identifiers (letters, digits, underscores).
Avoid duplicate column names.
Required to match the --embed COLUMN name (SQLite itself compares column names case-insensitively, but matching the case exactly is safest).

Column Count:

No hard limit, but embedfile uses SELECT * FROM temp.source, so the number of columns must match consistently across all rows.
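A CSV that follows the constraints listed above can be generated with Python's standard `csv` module. This is only a sketch: `write_import_csv` and the file name are hypothetical, and the identifier check simply mirrors the column-name guidance above rather than anything embedfile is known to enforce.

```python
import csv
import re
import tempfile
from pathlib import Path

# Letters, digits, underscores, not starting with a digit.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def write_import_csv(path: Path, columns: list[str], rows: list[list]) -> None:
    """Write a UTF-8, comma-separated CSV with a header row."""
    bad = [c for c in columns if not IDENTIFIER.fullmatch(c)]
    if bad:
        raise ValueError(f"column names are not plain identifiers: {bad}")
    if len(set(columns)) != len(columns):
        raise ValueError("duplicate column names")
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(columns)  # header row
        for row in rows:
            if len(row) != len(columns):
                raise ValueError(f"row has {len(row)} fields, expected {len(columns)}")
            writer.writerow(row)  # commas and quotes are escaped automatically


csv_path = Path(tempfile.mkdtemp()) / "mytest.csv"
write_import_csv(csv_path, ["id", "text"], [[1, "Hello, world"], [2, 'She said "hi"']])
print(csv_path.read_text(encoding="utf-8"))
```

With a file like this, `--embed text` refers to the `text` column named in the header row.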

Creating Your Own `embedfile` With a Custom Model

You can use a process similar to llamafile:

Option 1: `zipalign` approach (like llamafile)

zipalign -j0 embedfile model.gguf .args

.args should contain CLI arguments like:

-m
model.gguf
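For scripted builds, the `.args` file shown above can be generated with a few lines of Python. This sketch assumes the one-argument-per-line layout from the example; `write_args_file` is a hypothetical helper, and the `zipalign` step itself still has to be run separately.

```python
import tempfile
from pathlib import Path


def write_args_file(directory: Path, args: list[str]) -> Path:
    """Write a .args file with one CLI argument per line."""
    path = directory / ".args"
    data = "".join(arg + "\n" for arg in args)
    # write_bytes keeps LF line endings on every platform.
    path.write_bytes(data.encode("utf-8"))
    return path


args_path = write_args_file(Path(tempfile.mkdtemp()), ["-m", "model.gguf"])
print(args_path.read_text(encoding="utf-8"))
```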

Option 2: Environment or CLI flag

You can also just do:

embedfile -m ./my-model.gguf import --embed text input.csv output.db

This is equivalent, but less portable than a self-contained binary.

Summary Table

Format	Encoding	Line Breaks	Special Notes
`.txt`	UTF-8	LF or CR+LF	One entry per line. Used with `lines_read()`
`.csv`	UTF-8	LF or CR+LF	Comma-separated, header required. No support for custom delimiters.
`.json`/`.ndjson`	UTF-8	LF or CR+LF	Structured parsing via `json_each()` and `lines_read()`.
`.db`	SQLite DB	—	You must provide `--table NAME`. Currently not implemented

niutech · 2025-08-11T23:20:37Z

@asg017 I've made 2 pull requests to embedfile that add a new man page, proper error handling, a -k NUM (top k results) search parameter, and SQLite DB file import.

Here is the gzipped binary file: embedfile.gz

github-actions bot added the llama.cpp label Nov 27, 2024

jart added a commit that referenced this pull request Nov 29, 2024

Introduce sqlite

d8123c7

See #644

asg017 added 16 commits November 30, 2024 10:29

initial pass

0d08588

sqlite-lembed

4b4665f

add sqlite.org csv

ae178b8

progress

9eeceec

sqlite-lines

3c3c103

"index" cmd, fixup a few things

92065a5

rename to embedfile

581f64d

include embedfile in dist

2c45550

bestlineover readline in shell

d1aed34

comso_dlopen for loadable SQLite extensions

ddf73f6

snapshot tests

c54c950

more sqlite compile time options

80b3693

import and search commands

7ed9101

0.0.1-alpha.1

d9a2d7f

depend on third_party/sqlite instead

542cd2e

llama.cpp/embedfile -> embedfile

6947bfa

asg017 force-pushed the embedfile-init branch from 1cc8e3e to 6947bfa Compare November 30, 2024 18:49

asg017 added 4 commits November 30, 2024 17:12

fix include

30d4e69

clang-format embedfile

481f3ef

small build fixes, error handling

61b718d

small man changes

81845d5

asg017 mentioned this pull request Dec 20, 2024

DRAFT: Add jamfile, a JavaScript runtime for creating scripts/CLIs on top of llamafile #661

Draft

niutech mentioned this pull request Aug 5, 2025

Support uploading more file formats #149

Open
