asg017 commented Nov 27, 2024

embedfile is a CLI tool that bundles llama.cpp / llamafile, the SQLite CLI, sqlite-vec, sqlite-lembed, and a few other SQLite extensions into a comprehensive and performant tool for generating text embeddings from CSV, JSON, NDJSON, txt, or SQLite database files.

Just like llamafile and whisperfile, you can embed a .gguf embeddings model file into an embedfile, removing the need to manage weights yourself.

| Model | embedfile | Size (f16 quant) |
| --- | --- | --- |
| sentence-transformers/all-MiniLM-L6-v2 | all-MiniLM-L6-v2.f16.embedfile | 56MB |
| mixedbread-ai/mxbai-embed-xsmall-v1 | mxbai-embed-xsmall-v1-f16.embedfile | 61MB |
| nomic-ai/nomic-embed-text-v1.5 | nomic-embed-text-v1.5.f16.embedfile | 273MB |
| snowflake-arctic-embed-m-v1.5 | snowflake-arctic-embed-m-v1.5-f16.embedfile | 221MB |
| - | embedfile (no embedded model) | 12MB |

Here's an example, using MixedBread's xsmall model:

$ wget https://huggingface.co/asg017/embedfile/resolve/main/mxbai-embed-xsmall-v1-f16.embedfile
$ chmod u+x mxbai-embed-xsmall-v1-f16.embedfile 
$ ./mxbai-embed-xsmall-v1-f16.embedfile --version
embedfile 0.0.1-alpha.1, llamafile 0.8.16, SQLite 3.47.0, sqlite-vec=v0.1.6, sqlite-lembed=v0.0.1-alpha.8

This executable file already has sqlite-vec, sqlite-lembed, and the embeddings model pre-configured. Test that embeddings work with:

./mxbai-embed-xsmall-v1-f16.embedfile embed 'hello!'
[-0.058174,0.043776,0.030660,...]

You can embed data from CSV, JSON, NDJSON, and .txt files and save the results to a SQLite database. Here we are embedding the text column in the dbpedia.min.csv file, outputting to a dbpedia.db database.

$ ./mxbai-embed-xsmall-v1-f16.embedfile import --embed text dbpedia.min.csv dbpedia.db
INSERT INTO vec_items SELECT rowid, lembed("text") FROM temp.source;
100%|████████████████████| 10000/10000 [02:00<00:00, 83/s]
✔ dbpedia.min.csv imported into dbpedia.db, 10000 items

That was 10,000 rows with 820,604 tokens. I got 83 embeddings per second on my older 2019 Intel MacBook. On my M1 Mac Mini I get 173 embeddings/second, and I'm sure it's faster on newer Macs.
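
As a quick sanity check on those figures, the arithmetic can be worked through as below. The only inputs are numbers quoted above (10,000 rows, 820,604 tokens, the 02:00 progress-bar time, and 173 embeddings/second on the M1); the variable names are made up for illustration.

```python
# Back-of-the-envelope throughput check using only figures quoted in this post.
rows = 10_000
tokens = 820_604
elapsed_s = 2 * 60  # the progress bar reported 02:00

rows_per_s = rows / elapsed_s        # about 83 embeddings/second
tokens_per_s = tokens / elapsed_s    # about 6,800 tokens/second
tokens_per_row = tokens / rows       # about 82 tokens per row

m1_rows_per_s = 173                  # figure quoted for the M1 Mac Mini
m1_elapsed_s = rows / m1_rows_per_s  # just under a minute for the same file

print(f"{rows_per_s:.0f} rows/s, {tokens_per_s:.0f} tokens/s, "
      f"{tokens_per_row:.0f} tokens/row, M1 estimate {m1_elapsed_s:.0f}s")
```

So the average row is short (roughly 82 tokens), and the same file should take a little under a minute at the M1 rate.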

Once indexed, you can search with the search command:

$ ./mxbai-embed-xsmall-v1-f16.embedfile search dbpedia.db 'global warming'
3240 0.852299 Attribution of recent climate change is the effort to scientifically ascertain mechanisms ...
6697 0.904844 The global warming controversy concerns the public debate over whether global warming is occurring, how ...
...
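
Conceptually, a search like this embeds the query and returns the stored rows whose vectors are closest to it, smallest distance first (which matches the ascending second column above). A minimal brute-force sketch of that idea in Python follows. It assumes Euclidean (L2) distance and uses tiny made-up 3-dimensional vectors; `l2_distance` and `knn` are hypothetical helpers, not part of embedfile or sqlite-vec.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, items, k=10):
    """Return the k (rowid, distance) pairs closest to query, nearest first.

    items maps rowid -> vector. This is a brute-force scan over every item.
    """
    scored = [(rowid, l2_distance(query, vec)) for rowid, vec in items.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Toy data: made-up vectors standing in for real embeddings.
items = {
    1: [1.0, 0.0, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.9, 0.1, 0.0],
}
results = knn([1.0, 0.0, 0.0], items, k=2)
for rowid, dist in results:
    print(rowid, f"{dist:.6f}")
```

A real vector index avoids recomputing everything in Python, but the ordering contract is the same: lower distance means a closer match.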

At any point, if you want to "eject" and run SQL scripts yourself, the sh command will fire up the sqlite3 CLI with all extensions and embeddings models pre-configured.

$ ./mxbai-embed-xsmall-v1-f16.embedfile sh
SQLite version 3.47.0 2024-10-21 16:30:22
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .mode qbox
sqlite> select sqlite_version(), vec_version(), lembed_version();
┌──────────────────┬───────────────┬──────────────────┐
│ sqlite_version() │ vec_version() │ lembed_version() │
├──────────────────┼───────────────┼──────────────────┤
│ '3.47.0'         │ 'v0.1.6'      │ 'v0.0.1-alpha.8' │
└──────────────────┴───────────────┴──────────────────┘
sqlite> select vec_to_json(vec_slice(lembed('hello!'), 0, 8)) as sample;
┌──────────────────────────────────────────────────────────────┐
│                            sample                            │
├──────────────────────────────────────────────────────────────┤
│ '[-0.058174,0.043776,0.030660,0.047412,-0.059377,-0.036267,0 │
│ .038117,0.005184]'                                           │
└──────────────────────────────────────────────────────────────┘
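
To make the last query a bit more concrete: my understanding is that a float32 vector in sqlite-vec is a packed array of 4-byte floats, so slicing elements [0, 8) and printing them as JSON amounts to the following. This is a sketch under that assumption with made-up values; the helper names are hypothetical and only mirror what the SQL functions appear to do.

```python
import json
import struct

def pack_f32(values):
    """Pack floats into a blob of little-endian 4-byte floats."""
    return struct.pack(f"<{len(values)}f", *values)

def slice_f32(blob, start, end):
    """Return elements [start, end) of a packed float32 blob."""
    return blob[start * 4:end * 4]

def f32_to_json(blob):
    """Decode a packed float32 blob and render it as a JSON array string."""
    count = len(blob) // 4
    values = struct.unpack(f"<{count}f", blob)
    return json.dumps([round(v, 6) for v in values], separators=(",", ":"))

# Made-up 4-dimensional vector; the values are exactly representable in float32.
blob = pack_f32([0.5, -0.25, 2.0, 1.0])
print(len(blob))                           # 16 bytes: 4 floats x 4 bytes
print(f32_to_json(slice_f32(blob, 0, 2)))  # first two elements only
```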

Status

This was really fun to put together, and I'd love to see this (or something like this) as part of the llamafile project. I totally get it if it's out of scope or not a priority; I'd be happy to maintain an experimental fork if needed.

Though as-is this branch isn't quite ready yet, there's a few things I want to fix:

  • Code is under llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?
  • llama.cpp/embedfile/BUILD.mk is a bit messy, I had trouble compiling .c files in the subdirectory so I manually added those builds. Would love some help cleaning that up!
  • I made manual changes to the vendored sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.
  • Include licenses/notices
  • A ton of assert()'s that fail on any error

TODO

  • Metadata + auxiliary column options in import
  • Better TUI for search results. Maybe REPL?
  • --k and other search options
  • --prefix option for nomic-like embeddings, ex --prefix 'search_document:'
  • Better perf
  • More embeddings model uploaded to HF
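
For context on the --prefix item: nomic-style models expect a task prefix such as `search_document:` on indexed text and `search_query:` on queries. Here is a hypothetical sketch of what such an option might do to text before it reaches the model. The option doesn't exist yet, so the function name and the single-space separator are assumptions.

```python
def apply_prefix(text, prefix=None):
    """Prepend a task prefix to text, separated by one space.

    Returns text unchanged when no prefix is given.
    """
    if not prefix:
        return text
    return f"{prefix.rstrip()} {text}"

docs = ["Paris is the capital of France."]
prefixed = [apply_prefix(d, "search_document:") for d in docs]
query = apply_prefix("capital of France", "search_query:")
print(prefixed[0])
print(query)
```

The important part is that documents and queries get different prefixes, which is why a single global setting probably isn't enough.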

Build yourself

./make o//llama.cpp/embedfile/embedfile
make -f embedfile.mk all

jart added a commit that referenced this pull request Nov 29, 2024

jart (Collaborator) commented Nov 29, 2024

Hi Alex. Thanks for sending this. This would make an awesome addition to the project.

Code is under llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?

I recommend putting it in the root of the repo, for better visibility.

llama.cpp/embedfile/BUILD.mk is a bit messy, I had trouble compiling .c files in the subdirectory so I manually added those builds. Would love some help cleaning that up!

I've checked-in SQLite to third party. Your build rule can now simply depend on o/$(MODE)/third_party/sqlite/sqlite.a. The build is configured to have all the compile-time options you specified in this change, e.g. FTS5, FTS3, etc.

I made manual changes to the vendored sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.

I only needed to change the zlib include in sqlite3.c. If you need any other local changes, please feel free to make them to the new third_party location.

Include licenses/notices

It's recommended that you declare them like this:

__notice(mbedtls_notice, "\
Mbed TLS (Apache 2.0)\n\
Copyright ARM Limited\n\
Copyright The Mbed TLS Contributors");

In any one of your .c or .cpp files. This will ensure your copyright notice is distributed inside any binaries that are built with it.

A ton of assert()'s that fail on any error

Tell me more? Maybe I can help.

make -f embedfile.mk all

Could you incorporate this into the monolithic Makefile? While the default make rule needs to be hermetic, you can do whatever you want in manually-run rules. For example, under llamafile, there's a lot of manually-run CUDA stuff. You could have manual rules that package your standard embedfiles.


Here's some feedback:

  1. Thank you for taking the time to write a man page.
  2. Thank you for using the new cosmo_args() API.

Here's some suggestions / action items:

  1. Could you update make install so it installs embedfile and its man page?
  2. Please use the new third_party/sqlite/ package. Be sure to update #include lines to say #include "third_party/sqlite/sqlite3.h" etc.
  3. Consider adding a .clang-format file to your package directory, with your preferred style (use Mozilla style in llamafile/highlight/.clang-format if you don't have a preference) and then run clang-format -i on your sources.

lbarasc commented Jan 26, 2025

Hi, I just discovered your embedfile tool, and this is really huge!
I want to use it, but I have some questions.

For example, if I want to add a text file, I run:
all-minilm-l6-v2.f16.embedfile.exe import --embed text mytest.txt mybase.db

Can you tell me the characteristics of the .txt file (encoding: UTF-8? line breaks: CR+LF? ...)?

If I want to add a CSV file:
all-minilm-l6-v2.f16.embedfile.exe import --embed text mytest.csv mybase.db

Can you specify the format of the CSV (separator character, encoding, number of columns, column names, ...)?

Last question: how do I create my own embedfile.exe with an added .gguf? Can I simply binary-copy embedfile.exe + the .gguf together, or is something else needed?

Thank you for your help, and your tool.

Sincerely,

Lionel.

niutech commented Aug 6, 2025

Great idea! Will there be any updates for embedfile?

niutech commented Aug 11, 2025

Here is a man page for embedfile generated by ChatGPT based on the analysis of its source code:

embedfile(1) — Command-line tool for embedding and searching text data

NAME

embedfile — A comprehensive CLI tool for importing, embedding, and searching text data using SQLite and LLM-based vector embeddings.

SYNOPSIS

embedfile [OPTIONS] COMMAND [ARGS...]

Commands:

  • embedfile embed [TEXT]
  • embedfile sh
  • embedfile import [--embed COLUMN] [--table NAME] SOURCE_FILE INDEX_DB
  • embedfile search INDEX_DB QUERY

Options:

  • --model FILE, -m FILE
    Specify path to .gguf model file to use for embedding.

  • --version, -v
    Show version information of embedfile and its components.

  • --help, -h
    Show help information.


DESCRIPTION

embedfile is a high-performance, self-contained CLI tool that turns raw text or structured data (CSV, JSON, NDJSON, TXT) into vector embeddings using LLMs (via llamafile and sqlite-lembed). It stores and indexes this data in SQLite databases for efficient similarity search.

Typical Use Cases:

  • Generate embeddings for raw text or structured files.
  • Import and index data into a local SQLite vector database.
  • Search for semantically similar results.
  • Launch an interactive shell (sh) to query the database manually.

COMMANDS

embedfile embed [TEXT]

Generate an embedding for a single string or stream of text.

  • If TEXT is provided, prints a JSON array embedding for the string.
  • If no argument is given, reads from stdin line-by-line and prints embeddings.

Example:

embedfile embed "hello world"
echo "hello world" | embedfile embed

embedfile sh

Launch an interactive SQLite shell with all extensions (sqlite-vec, sqlite-lembed, etc.) preloaded.

Example:

embedfile sh
embedfile sh < commands.sql

embedfile import [OPTIONS] SOURCE_FILE INDEX_DB

Import a structured or plain text file into a SQLite vector database, embedding a specific column.

Supported file types:

  • .csv
  • .json
  • .ndjson
  • .txt

Options:

  • --embed COLUMN
    Column name from the source file to embed. Required for all formats except .txt.

  • --table NAME, -t NAME
    Table name (required if the source file is a SQLite .db).

Positional Arguments:

  • SOURCE_FILE
    Path to CSV, JSON, NDJSON, TXT, or SQLite DB file.

  • INDEX_DB
    Target SQLite database where data and embeddings will be stored.

Example:

embedfile import --embed text dbpedia.min.csv dbpedia.db

embedfile search INDEX_DB QUERY

Search the vector database for entries similar to the input query.

  • Uses the vector index and returns top 10 matches with distances.
  • Prints: rowid, distance, and matched column text (lower distance means a closer match).

Example:

embedfile search dbpedia.db "search query"

EMBEDDING MODEL

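The bare embedfile binary (no embedded model) loads no embedding model by default, so you must specify a .gguf model with --model or -m. Builds with a bundled model, such as the .embedfile downloads listed earlier in this thread, come pre-configured.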

embedfile --model ./model.gguf import --embed text input.csv output.db

EXAMPLES

Embed text from stdin:

echo "Paris is the capital of France." | embedfile embed

Import CSV and embed a column:

embedfile --model ./model.gguf import --embed description products.csv products.db

Search for similar entries:

embedfile search products.db "wireless earbuds"

Interactive shell with embedded extensions:

embedfile sh

VERSION

Show versions of embedfile and its dependencies:

embedfile --version

Output format:

embedfile 0.0.1-alpha.1, llamafile 0.8.16, SQLite 3.47.0, sqlite-vec=v0.1.6, sqlite-lembed=v0.0.1-alpha.8
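
If you want to check component versions from a script, the line above is easy to split. A small sketch, assuming the format stays as shown (comma-separated entries, each either `name version` or `name=version`); `parse_version_line` is a hypothetical helper, not something embedfile ships.

```python
import re

def parse_version_line(line):
    """Split a comma-separated version line into a {component: version} dict.

    Each entry is expected to be 'name version' or 'name=version'.
    """
    versions = {}
    for entry in line.strip().split(","):
        entry = entry.strip()
        if not entry:
            continue
        parts = re.split(r"[=\s]+", entry, maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"unrecognised entry: {entry!r}")
        name, version = parts
        versions[name] = version
    return versions

line = ("embedfile 0.0.1-alpha.1, llamafile 0.8.16, SQLite 3.47.0, "
        "sqlite-vec=v0.1.6, sqlite-lembed=v0.0.1-alpha.8")
versions = parse_version_line(line)
print(versions["sqlite-vec"])
```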

DEPENDENCIES

This tool is bundled with the following embedded components:

  • llamafile (for LLM inference)
  • sqlite-vec (vector similarity search)
  • sqlite-lembed (text to vector embedding)
  • sqlite-csv, sqlite-lines, sqlite-json (data import)

No Python, servers, or external dependencies required.


EXIT STATUS

  • 0: Success
  • 1: General error
  • Non-zero exit codes may also indicate SQLite or model loading issues.

AUTHOR

Created by Alex Garcia.


niutech commented Aug 11, 2025

@lbarasc answering your questions using ChatGPT based on the source code:

TXT File Input Characteristics

Encoding:

UTF-8 is required. The source uses lines_read(?) internally via sqlite-lines, which assumes UTF-8 encoded text.

Line Breaks:

lines_read handles both LF (\n) and CR+LF (\r\n)

Format:

  • One logical entry per line.
  • No special header or metadata required.
  • Empty lines may be treated as empty strings (not skipped unless handled downstream).
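
If you'd like to check a .txt file against these points before importing, something along these lines should do. It is only a sketch of the rules listed above (UTF-8, LF or CR+LF, one entry per line), not code taken from embedfile, and `check_txt_bytes` is a made-up name.

```python
def check_txt_bytes(data):
    """Validate raw .txt bytes and return (lines, empty_line_count).

    Raises UnicodeDecodeError if the data is not valid UTF-8.
    """
    text = data.decode("utf-8")
    # Normalise CR+LF to LF so both conventions split identically.
    lines = text.replace("\r\n", "\n").split("\n")
    # A trailing newline leaves one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    empty = sum(1 for line in lines if line == "")
    return lines, empty

lines, empty = check_txt_bytes(b"first entry\r\n\r\nsecond entry\n")
print(len(lines), empty)
```

A non-zero empty-line count is worth a look, since those lines may end up embedded as empty strings.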

CSV File Input Characteristics

Encoding:

Same as TXT: UTF-8. The CSV virtual table (sqlite-csv) reads from file directly using filename, and there's no transcoding.

Character Separator:

Default: comma (,). The code does not specify a custom separator, so only standard comma-separated CSV is supported.

Header:

Required: "CREATE VIRTUAL TABLE temp.source USING csv(filename=\"%w\", header=yes)". If your CSV lacks a header, import will fail or misinterpret the first row.

Column Names:

  • Must be valid SQLite identifiers (letters, digits, underscores).
  • Avoid duplicate column names.
  • Required to match the --embed COLUMN name (SQLite itself compares column names case-insensitively, but matching the header's exact spelling is safest).

Column Count:

No hard limit, but embedfile uses SELECT * FROM temp.source, so the number of columns must match consistently across all rows.
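
The CSV points above can be checked up front with Python's csv module. This is a sketch of those rules only (header present, no duplicate names, the --embed column exists, consistent field counts); it is not how embedfile parses the file, and `check_csv_text` is a hypothetical helper.

```python
import csv
import io

def check_csv_text(text, embed_column):
    """Return the number of data rows, or raise ValueError on a problem.

    Checks: header present, no duplicate column names, embed_column present,
    and every row has as many fields as the header.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        raise ValueError("file is empty; a header row is required")
    if len(set(header)) != len(header):
        raise ValueError("duplicate column names in header")
    if embed_column not in header:
        raise ValueError(f"column {embed_column!r} not found in header {header}")
    count = 0
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"row {row_no} has {len(row)} fields, expected {len(header)}")
        count += 1
    return count

sample = 'id,text\n1,hello\n2,"quoted, with comma"\n'
print(check_csv_text(sample, "text"))
```

For a real file, read it with `open(path, encoding="utf-8", newline="")` and pass the contents in; a decoding error at that point means the file isn't UTF-8.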

Creating Your Own embedfile With a Custom Model

You can use a process similar to llamafile:

Option 1: zipalign approach (like llamafile)

zipalign -j0 embedfile model.gguf .args

.args should contain CLI arguments like:

-m
model.gguf
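
The .args file is plain text with one argument per line. A small sketch that writes it (Python here only for illustration; the model name is the placeholder from the example above, and the zipalign step still has to be run separately):

```python
import os
import tempfile

def write_args_file(directory, args):
    """Write one CLI argument per line to <directory>/.args and return the path."""
    path = os.path.join(directory, ".args")
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for arg in args:
            f.write(arg + "\n")
    return path

# Placeholder model name, matching the example above.
with tempfile.TemporaryDirectory() as tmp:
    args_path = write_args_file(tmp, ["-m", "model.gguf"])
    with open(args_path, encoding="utf-8") as f:
        contents = f.read()
print(repr(contents))
```

Forcing `newline="\n"` keeps the file identical on Windows, which matters if the .exe is being packaged there.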

Option 2: Environment or CLI flag

You can also just do:

embedfile -m ./my-model.gguf import --embed text input.csv output.db

This is equivalent, but less portable than a self-contained binary.


Summary Table

| Format | Encoding | Line Breaks | Special Notes |
| --- | --- | --- | --- |
| .txt | UTF-8 | LF or CR+LF | One entry per line. Used with lines_read() |
| .csv | UTF-8 | LF or CR+LF | Comma-separated, header required. No support for custom delimiters. |
| .json/.ndjson | UTF-8 | LF or CR+LF | Structured parsing via json_each() and lines_read(). |
| .db | SQLite DB | | You must provide --table NAME. Currently not implemented |

niutech commented Aug 11, 2025

@asg017 I've made 2 pull requests to embedfile with a new man page, proper error handling, added -k NUM (top k results) search parameter and SQLite DB file import.

Here is the gzipped binary file: embedfile.gz
