
Misaligned tokenizer on embeddings? #391

@eugeniosegala

Issue description

The nomic-embed-text-v1.5-GGUF model does not seem to compute embeddings correctly.

Expected Behavior

No errors or warnings, and embeddings are computed correctly. As of now, I'm getting irrelevant results from similarity searches.

It works okay with Ollama and the Python bindings of llama.cpp.

Actual Behavior

These errors/warnings are produced when loading the model:

[node-llama-cpp] llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
[node-llama-cpp] Using this model ("./nomic-embed-text-v1.5.f16.gguf") to tokenize text and then detokenize it resulted in a different text. There might be an issue with the model or the tokenizer implementation. Using this model may not work as intended

As you will notice, the embeddings are generated, but I suspect the way they are calculated is wrong, as similarity searches on the vectors do not return relevant content.
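The detokenization warning above can be observed directly with a round-trip check. The sketch below assumes the loaded model exposes `tokenize(text)` and `detokenize(tokens)` methods, as node-llama-cpp models do; the helper name `checkTokenizerRoundTrip` is mine, not part of the library:

```javascript
// Round-trip check: tokenize a string, detokenize the tokens, and compare.
// `model` is expected to expose tokenize(text) -> tokens and
// detokenize(tokens) -> text (as a node-llama-cpp model does).
function checkTokenizerRoundTrip(model, text) {
  const tokens = model.tokenize(text);
  const roundTripped = model.detokenize(tokens);
  return {
    ok: roundTripped === text,
    original: text,
    roundTripped,
  };
}
```

If `ok` is `false` for ordinary input text, the warning is accurate and the tokenizer mapping really is lossy for this model file.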

Steps to reproduce

Download the model https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF

And then run it using node-llama-cpp:

import { getLlama } from "node-llama-cpp";
import path from "path";
import { fileURLToPath } from "url";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

// Embeds each document using the module-level embedding `context` defined below
async function embedDocuments(documents) {
  const embeddings = new Map();

  await Promise.all(
    documents.map(async (document) => {
      const embedding = await context.getEmbeddingFor(document);
      embeddings.set(document, embedding);

      console.debug(
        `${embeddings.size}/${documents.length} documents embedded`
      );
    })
  );

  return embeddings;
}

// Returns document texts sorted by cosine similarity to `embedding`, most similar first
function findSimilarDocuments(embedding, documentEmbeddings) {
  const similarities = new Map();
  for (const [otherDocument, otherDocumentEmbedding] of documentEmbeddings)
    similarities.set(
      otherDocument,
      embedding.calculateCosineSimilarity(otherDocumentEmbedding)
    );

  return Array.from(similarities.keys())
    .sort((a, b) => similarities.get(b) - similarities.get(a));
}

const llama = await getLlama();

// https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF
const model = await llama.loadModel({
  modelPath: path.join(__dirname, "nomic-embed-text-v1.5.f16.gguf"),
});

const context = await model.createEmbeddingContext();

const response = await context.getEmbeddingFor("crime");
const documentEmbeddings = await embedDocuments([
  "The sky is clear and blue today",
  "I love eating pizza with extra cheese",
  "Dogs love to play fetch with their owners",
  "The capital of France is Paris",
  "Drinking water is important for staying hydrated",
  "Mount Everest is the tallest mountain in the world",
  "A warm cup of tea is perfect for a cold winter day",
  "Painting is a form of creative expression",
  "Not all the things that shine are made of gold",
  "Cleaning the house is a good way to keep it tidy"
]);

const query = "Do you like pizza?";
const queryEmbedding = await context.getEmbeddingFor(query);

const similarDocuments = findSimilarDocuments(
  queryEmbedding,
  documentEmbeddings
);
const topSimilarDocument = similarDocuments[0];

console.log("query:", query);
console.log("Document:", topSimilarDocument); // Drinking water is important for staying hydrated

The top-ranked document will be:

"Drinking water is important for staying hydrated"

But it should be:

"I love eating pizza with extra cheese"
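To rule out the ranking code itself, the similarity values can be cross-checked with a plain cosine computation over the raw vectors (a minimal sketch; it assumes the embedding objects expose their raw numbers, e.g. via node-llama-cpp's `embedding.vector`):

```javascript
// Plain cosine similarity over two numeric arrays, for cross-checking
// the values returned by calculateCosineSimilarity().
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

If the hand-computed similarities also rank "Drinking water..." above the pizza sentence, the problem is in the embedding vectors themselves (consistent with the tokenizer warning), not in the similarity or sorting logic.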

My Environment

OS: macOS 24.1.0 (arm64)
Node: 20.18.1 (arm64)
TypeScript: 5.6.3
node-llama-cpp: 3.3.0

Metal: available

Metal device: Apple M3 Pro
Metal used VRAM: 0% (80KB/27GB)
Metal free VRAM: 99.99% (27GB/27GB)
Metal unified memory: 27GB (100%)

CPU model: Apple M3 Pro
Math cores: 6
Used RAM: 96.66% (34.8GB/36GB)
Free RAM: 3.33% (1.2GB/36GB)
Used swap: 0% (0B/0B)
Max swap size: dynamic

Labels: bug (Something isn't working), released