granite embedding small support (ModernBert arch) #15641
base: master
Conversation
…orted yet but working on getting conversion to work for encoder only
…ated gate split with views; GEGLU is now used, which does exactly this (see the sketch below)
…when building attention keeps failing; running llama-embedding with --ubatch-size 1 makes it work, but this needs to be looked into more
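A minimal NumPy sketch of the GEGLU gate split referenced in the gate-split commit above (not the llama.cpp implementation; the shapes and the gate/value ordering are assumptions and should be checked against the actual tensors):

import numpy as np

def gelu(x):
    # tanh approximation of GELU, commonly used in transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu(up_proj_out):
    # up_proj_out has shape (n_tokens, 2 * n_ff); split it into two halves (views, no copy),
    # apply GELU to the gate half and multiply elementwise with the value half
    gate, value = np.split(up_proj_out, 2, axis=-1)
    return gelu(gate) * value

x = np.random.randn(4, 8)   # hypothetical (n_tokens=4, 2*n_ff=8)
print(geglu(x).shape)       # -> (4, 4)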
@gabe-l-hart thanks in advance :)
also realizing this a little late haha, but should I change all of the ModernBert stuff to a Granite embedding macro like LLM_ARCH_GRANITE_EMBD, or keep it as is?
You may want to check out an earlier attempt at ModernBert in #14014
Thanks for getting this together @ryan-mangeno and thanks for pointing out the previous work @CISC. Ryan, let me know if/when you've looked over that PR and found anything to fix and I'll take a pass at review.
In general, we want to keep things as generic as possible, so since this uses the ModernBert architecture it makes sense to keep the generic ModernBert naming.
will do
@gabe-l-hart I'm looking into the ModernBERT research paper, and I can't find any mention of symmetric sliding window attention, only local sliding window attention, so I am going to use LLAMA_SWA_TYPE_LOCAL instead of the LLAMA_SWA_TYPE_SYMMETRIC used in the previous attempt. It also uses global attention every third layer, so I am going to implement this and then it should be ready for a review :)
@ryan-mangeno That sounds good! I haven't unpacked any of those mechanics myself, but I can try to get into it if you get stuck.
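For what it's worth, a small Python illustration of the layer pattern being described (not llama.cpp code; the every-third-layer period and the 128-token window come from my reading of the ModernBERT paper/config and should be verified, including whether the pattern starts at layer 0):

GLOBAL_ATTN_EVERY_N_LAYERS = 3   # per ModernBERT: every third layer uses global attention
SLIDING_WINDOW = 128             # local layers attend only to a 128-token window (verify in config)

def is_global_layer(layer_idx: int) -> bool:
    return layer_idx % GLOBAL_ATTN_EVERY_N_LAYERS == 0

def effective_window(layer_idx: int, n_ctx: int) -> int:
    # global layers see the full context, local layers only the sliding window
    return n_ctx if is_global_layer(layer_idx) else SLIDING_WINDOW

for il in range(6):
    kind = "global" if is_global_layer(il) else "local"
    print(il, kind, "window =", effective_window(il, 8192))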
… per previous attempt, added local sliding window attention that alternates every third layer
ok 👍, made some changes but I'm not sure if it's fully ready yet. I will ping you when I think it's ready, if that's ok.
Status update: I found out that ModernBERT uses an alternating RoPE method, per https://arxiv.org/pdf/2412.13663. I am currently figuring out how to implement this.
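In case it's useful, a sketch of what the alternating RoPE means in practice: the global-attention layers use a different RoPE frequency base than the local sliding-window layers, so the base has to be picked per layer. The 160k/10k values are the ones reported for ModernBERT but should be double-checked against the model's config.json:

GLOBAL_ROPE_THETA = 160_000.0   # rope frequency base on global-attention layers (verify in config)
LOCAL_ROPE_THETA  = 10_000.0    # rope frequency base on local sliding-window layers

def rope_freq_base_for_layer(layer_idx: int, period: int = 3) -> float:
    # same every-third-layer alternation as the attention pattern
    return GLOBAL_ROPE_THETA if layer_idx % period == 0 else LOCAL_ROPE_THETA

print([rope_freq_base_for_layer(il) for il in range(6)])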
Co-authored-by: Sigbjørn Skjæret <[email protected]>
LLM_KV_ROPE_DIMENSION_SECTIONS,
LLM_KV_ROPE_FREQ_BASE,
LLM_KV_ROPE_SCALE_LINEAR,
LLM_KV_ROPE_FREQ_BASE_SWA,
NIT: Seems like this should be one line up so it's next to LLM_KV_ROPE_FREQ_BASE?
Thanks for the insight and suggestions! I also added support to convert the ModernBERT base model to GGUF.
Co-authored-by: Gabe Goodhart <[email protected]>
I'm still seeing pretty substantial differences between running this branch and running the same input with Sentence Transformers:

from sentence_transformers import SentenceTransformer, util

model_path = "/Users/ghart/models/ibm-granite/granite-embedding-small-english-r2/"
model = SentenceTransformer(model_path)

input_queries = ["hello world"]
embedding = model.encode(input_queries)

print("Embedding shape:", embedding.shape)
print("Embedding vector:", embedding)
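To put a number on the difference, one option is to compare the two vectors directly; a small helper, with placeholder vectors standing in for the Sentence Transformers output above and whatever llama-embedding prints for the same input (the 384-dim size is also just a placeholder):

import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# replace these with embedding[0] from the snippet above and the llama.cpp vector
st_embedding = np.random.randn(384).astype(np.float32)
llama_embedding = np.random.randn(384).astype(np.float32)

print("cosine similarity:", cosine_similarity(st_embedding, llama_embedding))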
LLM_TYPE_47M,
LLM_TYPE_60M,
LLM_TYPE_70M,
LLM_TYPE_80M,
LLM_TYPE_109M,
LLM_TYPE_137M,
LLM_TYPE_140M,
LLM_TYPE_149M,
Add the descriptions too:
Lines 33 to 40 in 3581b68
case LLM_TYPE_33M: return "33M";
case LLM_TYPE_60M: return "60M";
case LLM_TYPE_70M: return "70M";
case LLM_TYPE_80M: return "80M";
case LLM_TYPE_109M: return "109M";
case LLM_TYPE_137M: return "137M";
case LLM_TYPE_140M: return "140M";
case LLM_TYPE_160M: return "160M";
# rename custom "head" layers to standard bert "cls.predictions" names for compatibility
if name == "head.norm.weight":
    name = "cls.predictions.transform.LayerNorm.weight"
elif name == "head.norm.bias":
    name = "cls.predictions.transform.LayerNorm.bias"
elif name == "head.dense.weight":
    name = "cls.predictions.transform.dense.weight"
elif name == "head.dense.bias":
    name = "cls.predictions.transform.dense.bias"
You forgot to commit the mapping?
Originally I had added support for Granite embedding small, and it was using the ModernBert arch under the hood.
Yeah, I was getting differences too, but I wasn't sure if they can be attributed to something in the graph build.
Adding support to run granite-embedding-small, which primarily pulls from the ModernBert architecture: https://huggingface.co/ibm-granite/granite-embedding-small-english-r2. I'm still working on it: I haven't figured out the pre-tokenizer type, or whether I need to implement it. Also, the ubatch-size assert in llama-graph.cpp fails when building attention; I hacked it to accept a ubatch size of 1 for testing, but it keeps failing there and I'm not sure why.
If I comment out that line in llama-graph.cpp, then it works.
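On the pre-tokenizer question above: a hedged sketch of the usual pattern, where the converter hashes how the HF tokenizer encodes a fixed sample string and maps that hash to a pre-tokenizer name that llama.cpp implements. The sample text, the hash, and the "modern-bert" name below are placeholders; the real hash entries are normally generated via convert_hf_to_gguf_update.py:

from hashlib import sha256
from transformers import AutoTokenizer

# hash how the tokenizer encodes a fixed sample string (the real script uses a much longer one)
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-embedding-small-english-r2")
chktxt = "Hello World! 123 ..."
chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
print("chkhsh:", chkhsh)

# a new branch in get_vocab_base_pre() would then map this hash to a name, e.g.:
#     if chkhsh == "<hash printed above>":
#         res = "modern-bert"   # placeholder; must match a pre-tokenizer regex llama.cpp knows about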