DRAFT: Add embedfile for all-in-one embeddings CLI tool #644
base: main
Hi Alex. Thanks for sending this. This would make an awesome addition to the project.
I recommend putting it in the root of the repo, for better visibility.
I've checked in SQLite to third party. Your build rule can now simply depend on
I only needed to change the zlib include in sqlite3.c. If you need any other local changes, please feel free to make them to the new third_party location.
It's recommended that you declare them like this:
__notice(mbedtls_notice, "\
Mbed TLS (Apache 2.0)\n\
Copyright ARM Limited\n\
Copyright The Mbed TLS Contributors");
in any one of your .c or .cpp files. This will ensure your copyright notice is distributed inside any binaries that are built with it.
Tell me more? Maybe I can help.
Could you incorporate this into the monolithic Makefile? While the default
Here's some feedback:
Here are some suggestions / action items:
1cc8e3e to 6947bfa
Hi, I just discovered your Embedfile tool, and this is really huge! For example, if I want to add a text file, I do:
Can you tell me the characteristics of the .txt file (encoding: UTF-8? line breaks: CR+LF? ...)? If I want to add a CSV file, can you specify the format of the CSV (separator character, encoding, number of columns, column names, ...)? Last question: how do I create my own Embedfile.exe with a .gguf added? Can I simply binary-copy embedfile.exe + gguf, or is something else needed? Thank you for your help, and for your tool. Sincerely, Lionel.
Great idea! Will there be any updates for embedfile?
Here is a man page for embedfile generated by ChatGPT based on the analysis of its source code:
@lbarasc answering your questions using ChatGPT based on the source code:
TXT File Input Characteristics
Encoding: UTF-8 is required. The source uses
Line Breaks:
Format:
CSV File Input Characteristics
Encoding: Same as TXT: UTF-8. The CSV virtual table (
Character Separator: Default: Comma
Header: Required:
Column Names:
Column Count: No hard limit, but
Creating Your Own

| Format | Encoding | Line Breaks | Special Notes | 
|---|---|---|---|
| .txt | UTF-8 | LF or CR+LF | One entry per line. Used with lines_read() | 
| .csv | UTF-8 | LF or CR+LF | Comma-separated, header required. No support for custom delimiters. | 
| .json/.ndjson | UTF-8 | LF or CR+LF | Structured parsing via json_each() and lines_read(). | 
| .db | SQLite DB | — | You must provide --table NAME. Currently not implemented | 
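
As a rough illustration of the .txt row in the table above, the sketch below reads a UTF-8 file with one entry per line and accepts both LF and CR+LF endings. It is plain Python, not embedfile's own code: the function name read_entries and the decision to skip blank lines are assumptions made for this example only.

```python
import os
import tempfile


def read_entries(path):
    """Return one entry per line from a UTF-8 text file (LF or CR+LF)."""
    entries = []
    # newline="" disables newline translation, so "\r\n" is handled explicitly.
    with open(path, "r", encoding="utf-8", newline="") as f:
        for raw in f.read().split("\n"):
            line = raw.rstrip("\r")
            if line:  # assumption: blank lines are not treated as entries
                entries.append(line)
    return entries


def demo():
    # Write a small file mixing both line-ending styles, then read it back.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write("first entry\r\nsecond entry\ncafé\r\n".encode("utf-8"))
    try:
        return read_entries(tmp.name)
    finally:
        os.remove(tmp.name)


if __name__ == "__main__":
    print(demo())
```

A file saved in another encoding (for example UTF-16 or Latin-1) would raise UnicodeDecodeError in this sketch, which is one quick way to check an input file before handing it to a tool that expects UTF-8.
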
@asg017 I've made 2 pull requests to embedfile with a new man page, proper error handling, added
Here is the gzipped binary file: embedfile.gz
embedfile is a CLI tool that bundles llama.cpp / llamafile, the SQLite CLI, sqlite-vec, sqlite-lembed, and a few other SQLite extensions into a comprehensive and performant tool for generating text embeddings from CSV, JSON, NDJSON, txt, or SQLite database files.
Just like llamafile and whisperfile, you can embed a .gguf embeddings model file into an embedfile, removing the need for managing weights yourself.

| File | Size |
|---|---|
| all-MiniLM-L6-v2.f16.embedfile | 56MB |
| mxbai-embed-xsmall-v1-f16.embedfile | 61MB |
| nomic-embed-text-v1.5.f16.embedfile | 273MB |
| snowflake-arctic-embed-m-v1.5-f16.embedfile | 221MB |
| embedfile (no embedded model) | 12MB |

Here's an example, using MixedBread's xsmall model:
This executable file already has sqlite-vec, sqlite-lembed, and the embeddings model pre-configured. Test that embeddings work with:
You can embed data from CSV, JSON, NDJSON, and .txt files and save the results to a SQLite database. Here we are embedding the text column in the dbpedia.min.csv file, outputting to a dbpedia.db database.
That was 10,000 rows with 820,604 tokens. I got 83 embeddings per second on my older 2019 Intel MacBook. On my M1 Mac Mini I get 173 embeddings/second, and I'm sure it's faster on newer Macs.
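
For a sense of scale, the figures quoted above can be turned into rough wall-clock estimates. The sketch below only does arithmetic on the numbers from that paragraph (10,000 rows, 820,604 tokens, 83 and 173 embeddings per second). It assumes one embedding per row and a constant rate, neither of which is stated above, and the function names are made up for this example.

```python
ROWS = 10_000
TOKENS = 820_604


def estimate_seconds(rows, embeddings_per_second):
    """Estimated run time, assuming one embedding per row at a constant rate."""
    return rows / embeddings_per_second


def tokens_per_row(tokens, rows):
    """Average number of tokens in each embedded row."""
    return tokens / rows


if __name__ == "__main__":
    for name, rate in [("2019 Intel MacBook", 83), ("M1 Mac Mini", 173)]:
        print(f"{name}: ~{estimate_seconds(ROWS, rate):.0f} s")
    print(f"average tokens per row: {tokens_per_row(TOKENS, ROWS):.1f}")
```

Under those assumptions the run works out to about two minutes on the Intel machine and about one minute on the M1, with roughly 82 tokens per row.
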
Once indexed, you can search with the search command:
At any point, if you want to "eject" and run SQL scripts yourself, the sh command will fire up the sqlite3 CLI with all extensions and embeddings models pre-configured.

Status
This was really fun to put together, and I'd love to see this (or something like this) as part of the llamafile project. I totally get it if it's out of scope or not a priority; I'd be happy to maintain an experimental fork if needed.
That said, as-is this branch isn't quite ready yet. There are a few things I want to fix:
- llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?
- llama.cpp/embedfile/BUILD.mk is a bit messy. I had trouble compiling .c files in the subdirectory, so I manually added those builds. Would love some help cleaning that up!
- sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.
- assert()'s that fail on any error

TODO
- --k and other search options
- --prefix option for nomic-like embeddings, ex --prefix 'search_document:'

Build yourself