llava : Add Granite Vision Support (#11794)
* Add super wip scripts for multimodal granite gguf
Signed-off-by: Alex-Brooks <[email protected]>
* Add example for converting mmgranite to gguf
Signed-off-by: Alex-Brooks <[email protected]>
* remove hardcoded path
Signed-off-by: Alex-Brooks <[email protected]>
* Add vision feature layer to gguf params
Signed-off-by: Alex-Brooks <[email protected]>
* Clean up llava surgery and remove name substitution hacks
Signed-off-by: Alex-Brooks <[email protected]>
* Add transformers llava next tensor name mapping
Signed-off-by: Alex-Brooks <[email protected]>
* Make siglip / openclip mutually exclusive
Signed-off-by: Alex-Brooks <[email protected]>
* Fix projector linear substitution
Signed-off-by: Alex-Brooks <[email protected]>
* Fix linear 2 substitution index
Signed-off-by: Alex-Brooks <[email protected]>
* Increase max flattened gridpoints to 64
Signed-off-by: Alex-Brooks <[email protected]>
* Fix hardcoded concat for multiple feature layers
Signed-off-by: Alex-Brooks <[email protected]>
* Pull vision feature layers out of gguf keys
Signed-off-by: Alex-Brooks <[email protected]>
* Fix num gridpoints and use all layers
Signed-off-by: Alex-Brooks <[email protected]>
* Avoid dropping last image encoder layer in llava models
Signed-off-by: Alex-Brooks <[email protected]>
* Use 10 for max number of patches
Signed-off-by: Alex-Brooks <[email protected]>
* Standardize vision feature layers
Signed-off-by: Alex-Brooks <[email protected]>
* Cleanup logs
Signed-off-by: Alex-Brooks <[email protected]>
* Update comment for vision feature layer init
Signed-off-by: Alex-Brooks <[email protected]>
* Update notes for alternative to legacy llm conversion script
Signed-off-by: Alex-Brooks <[email protected]>
* Fix notes rendering
Signed-off-by: Alex-Brooks <[email protected]>
* Add v prefix to vision feature layer log
Signed-off-by: Alex-Brooks <[email protected]>
* Use current defaults for feature layer
Signed-off-by: Alex-Brooks <[email protected]>
* Use constant for max gridpoints / feat layers, style fixes
Signed-off-by: Alex-Brooks <[email protected]>
* clarify non-negative feature layers
Signed-off-by: Alex-Brooks <[email protected]>
* Remove CLIP_API from func signature
Signed-off-by: Alex-Brooks <[email protected]>
* Use MAX_IMAGE_FEATURE_LAYERS const in layer calc
Signed-off-by: Alex-Brooks <[email protected]>
* Clarify feature layers are non negative ints and not uint
Signed-off-by: Alex-Brooks <[email protected]>
* Fix condition for reading feature layers
Signed-off-by: Alex-Brooks <[email protected]>
* pop last llava layer when feature layers are unset
Signed-off-by: Alex-Brooks <[email protected]>
* Fix unset vision layer 0
Signed-off-by: Alex-Brooks <[email protected]>
* Update examples/llava/clip.cpp
Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Reenable assertion for out of bounds get_rows
Signed-off-by: Alex-Brooks <[email protected]>
* Use std vector for gridpoints and feature layers
Signed-off-by: Alex-Brooks <[email protected]>
* Calculate max feature layer at load time
Signed-off-by: Alex-Brooks <[email protected]>
* Include base patch for granite vision allocation
Signed-off-by: Alex-Brooks <[email protected]>
* Fix trailing whitespace
Signed-off-by: Alex-Brooks <[email protected]>
* Add max num patches = 10 back for minicpmv
Signed-off-by: Alex-Brooks <[email protected]>
* Use unordered set to store feature layers
Co-authored-by: Xuan-Son Nguyen <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
* Use max feature layer for postnorm
Signed-off-by: Alex-Brooks <[email protected]>
* Apply suggestions from code review
---------
Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>