[GGUF] Serialize Generated OV Model for Faster LLMPipeline Init (#2218)
**Details:**
This PR caches the OV model generated from a GGUF model on disk, enabling faster subsequent LLMPipeline initialization together with the OpenVINO model cache.
- The OV model generated from the GGUF model by the GGUF Reader is serialized when the property `ov::genai::enable_save_ov_model` is enabled (default value is `false`).
- Users can check whether an OV model already exists in the same folder as the GGUF model and load it directly, instead of regenerating it from the GGUF model with the GGUF Reader.
- If the GGUF model is updated, the user is responsible for cache invalidation and must regenerate the OV model with the GGUF Reader.
- The `OPENVINO_LOG_LEVEL` environment variable controls the verbosity of GGUF-related debug output; for details, refer to
[DEBUG_LOG.md](https://github.com/openvinotoolkit/openvino.genai/blob/master/src/docs/DEBUG_LOG.md)
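A minimal usage sketch of enabling the new property (assuming the `LLMPipeline` constructor that accepts an `ov::AnyMap` of properties and the `ov::genai::max_new_tokens` generation property; the model path matches the sample below):

```cpp
#include <iostream>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Enable serialization of the OV model generated from the GGUF file,
    // so subsequent runs can load openvino_model.xml from the same folder.
    ov::AnyMap properties{ov::genai::enable_save_ov_model(true)};
    ov::genai::LLMPipeline pipe(
        "gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf",  // GGUF model path
        "CPU",
        properties);
    std::cout << pipe.generate("Who are you?", ov::genai::max_new_tokens(100)) << '\n';
    return 0;
}
```

On the first run the pipeline unpacks the GGUF file and writes `openvino_model.xml` next to it; later runs pointed at the same folder can skip GGUF conversion entirely.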
**Expected behavior:**
- Set environment variable: `export OPENVINO_LOG_LEVEL=3`
- First run with the GGUF model:
  - `build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?"`
  > [GGUF Reader]: Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
  > [GGUF Reader]: Loading and unpacking model done. Time: 196ms
  > [GGUF Reader]: Start generating OpenVINO model...
  > [GGUF Reader]: Save generated OpenVINO model to: gguf_models/openvino_model.xml done. Time: 466 ms
  > [GGUF Reader]: Model generation done. Time: 757ms
  > I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and
- Second run with the serialized OV model:
  - `build/samples/cpp/text_generation/greedy_causal_lm gguf_models "Who are you?"`
  > I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and
---------
Co-authored-by: Andrei Kochin <[email protected]>
Co-authored-by: Copilot <[email protected]>
The changed hunk in the GGUF Reader (reconstructed from the diff view) routes the timing message through the logger's `stringstream` instead of printing to `std::cout` directly:

```diff
 auto [config, consts, qtypes] = load_gguf(model_path);
 auto load_finish_time = std::chrono::high_resolution_clock::now();
-std::cout << "Loading and unpacking model done. Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(load_finish_time - start_time).count() << "ms" << std::endl;
-std::cout << "Start generating OV model..." << std::endl;
-
-std::shared_ptr<ov::Model> model;
+ss.str("");
+ss << "Loading and unpacking model done. Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(load_finish_time - start_time).count() << "ms";
```

If serialization fails, the exception is rethrown with a hint to disable the feature:

```cpp
OPENVINO_THROW("Exception during model serialization ", e.what(),
               ", user can disable it by setting 'ov::genai::enable_save_ov_model' property to false");
```