
Commit 035b591

Xiake Sun, andrei-kochin, and Copilot authored
[GGUF] Serialize Generated OV Model for Faster LLMPipeline Init (#2218)
**Details:** This PR aims to cache the OV model generated from a GGUF model on disk, for faster subsequent LLMPipeline initialization with the OpenVINO model cache.

- Serialize the OV model generated from the GGUF model by the GGUF Reader via the property `ov::genai::enable_save_ov_model` (default value is `false`).
- Users can check whether an OV model already exists in the same folder as the GGUF model and load the OV model directly instead of re-creating it with the GGUF Reader.
- If the GGUF model is updated, the user is responsible for cache invalidation and must re-generate the OV model with the GGUF Reader.
- Use the `OPENVINO_LOG_LEVEL` environment variable to control the verbosity of GGUF-related debug information; for details, refer to [DEBUG_LOG.md](https://github.com/openvinotoolkit/openvino.genai/blob/master/src/docs/DEBUG_LOG.md).

**Expected behavior:**

- Set the environment variable: `export OPENVINO_LOG_LEVEL=3`
- First run, with the GGUF model:
  - `build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?"`

  > [GGUF Reader]: Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
  > [GGUF Reader]: Loading and unpacking model done. Time: 196ms
  > [GGUF Reader]: Start generating OpenVINO model...
  > [GGUF Reader]: Save generated OpenVINO model to: gguf_models/openvino_model.xml done. Time: 466 ms
  > [GGUF Reader]: Model generation done. Time: 757ms
  > I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and
- Second run, with the generated OV model:
  - `build/samples/cpp/text_generation/greedy_causal_lm gguf_models "Who are you?"`

  > I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and

---------

Co-authored-by: Andrei Kochin <[email protected]>
Co-authored-by: Copilot <[email protected]>
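As a minimal usage sketch of the new property (the model path and prompt are illustrative, mirroring the sample above): the first run enables serialization, and a later run points at the folder that now holds the serialized OV model.

```cpp
#include <iostream>
#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // First run: unpack the GGUF file and serialize the generated
    // openvino_model.xml/.bin into the same folder as the .gguf file.
    ov::genai::LLMPipeline pipe(
        "gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf", "CPU",
        ov::genai::enable_save_ov_model(true));
    std::cout << pipe.generate("Who are you?", ov::genai::max_new_tokens(100)) << "\n";

    // Subsequent runs: load the cached OV model directly from the folder,
    // skipping GGUF unpacking and model generation.
    ov::genai::LLMPipeline cached_pipe("gguf_models", "CPU");
    std::cout << cached_pipe.generate("Who are you?", ov::genai::max_new_tokens(100)) << "\n";
}
```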
1 parent 23bf529 commit 035b591

File tree

13 files changed: +181 −54 lines changed

src/cpp/include/openvino/genai/llm_pipeline.hpp

Lines changed: 8 additions & 0 deletions

```diff
@@ -328,5 +328,13 @@ static constexpr ov::Property<SchedulerConfig> scheduler_config{"scheduler_confi
  */
 static constexpr ov::Property<bool> prompt_lookup{"prompt_lookup"};
 
+/**
+ * @brief The enable_save_ov_model property serves to serialize the OV model (xml/bin) generated from a GGUF model on disk for re-use.
+ * Set `true` to activate this mode.
+ * Then create an LLMPipeline instance with this config.
+ */
+static constexpr ov::Property<bool> enable_save_ov_model{"enable_save_ov_model"};
+
+
 } // namespace genai
 } // namespace ov
```
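Since this is a standard `ov::Property<bool>`, it can equally be passed through an `ov::AnyMap` when configuration is assembled dynamically; a brief sketch (path illustrative):

```cpp
// Equivalent to passing ov::genai::enable_save_ov_model(true) directly.
ov::AnyMap properties{{ov::genai::enable_save_ov_model.name(), true}};
ov::genai::LLMPipeline pipe("gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf", "CPU", properties);
```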

src/cpp/src/continuous_batching/pipeline.cpp

Lines changed: 12 additions & 11 deletions

```diff
@@ -52,12 +52,12 @@ ContinuousBatchingPipeline::ContinuousBatchingPipeline( const std::filesystem::p
     const ov::AnyMap& tokenizer_properties,
     const ov::AnyMap& vision_encoder_properties) {
     auto start_time = std::chrono::steady_clock::now();
-
     auto properties_without_draft_model = properties;
     auto draft_model_desr = extract_draft_model_from_config(properties_without_draft_model);
     auto is_prompt_lookup_enabled = extract_prompt_lookup_from_config(properties_without_draft_model);
 
     auto model = utils::read_model(models_path, properties);
+    auto [properties_without_draft_model_without_gguf, enable_save_ov_model] = utils::extract_gguf_properties(properties_without_draft_model);
     auto tokenizer = ov::genai::Tokenizer(models_path, tokenizer_properties);
     auto generation_config = utils::from_config_json_if_exists(models_path);
 
@@ -69,16 +69,16 @@ ContinuousBatchingPipeline::ContinuousBatchingPipeline( const std::filesystem::p
     if (is_prompt_lookup_enabled) {
         OPENVINO_ASSERT(draft_model_desr.model == nullptr, "Speculative decoding and prompt lookup decoding are mutually exclusive");
         OPENVINO_ASSERT(embedder == nullptr, "Prompt lookup decoding is not supported for models with embeddings");
-        m_impl = std::make_shared<PromptLookupImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model, generation_config);
+        m_impl = std::make_shared<PromptLookupImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
     } else if (draft_model_desr.model != nullptr) {
         OPENVINO_ASSERT(embedder == nullptr, "Speculative decoding is not supported for models with embeddings");
-        auto main_model_descr = ov::genai::ModelDesc(model, tokenizer, device, properties_without_draft_model, scheduler_config, generation_config);
+        auto main_model_descr = ov::genai::ModelDesc(model, tokenizer, device, properties_without_draft_model_without_gguf, scheduler_config, generation_config);
         m_impl = std::make_shared<SpeculativeDecodingImpl>(main_model_descr, draft_model_desr);
     } else if (embedder) {
-        m_impl = std::make_shared<ContinuousBatchingImpl>(model, embedder, tokenizer, scheduler_config, device, properties_without_draft_model, generation_config);
+        m_impl = std::make_shared<ContinuousBatchingImpl>(model, embedder, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
     }
     else {
-        m_impl = std::make_shared<ContinuousBatchingImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model, generation_config);
+        m_impl = std::make_shared<ContinuousBatchingImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
     }
 
     m_impl->m_load_time_ms = get_load_time(start_time);
@@ -91,31 +91,32 @@ ContinuousBatchingPipeline::ContinuousBatchingPipeline(
     const std::string& device,
     const ov::AnyMap& properties) {
     auto start_time = std::chrono::steady_clock::now();
-
     auto properties_without_draft_model = properties;
     auto draft_model_desr = extract_draft_model_from_config(properties_without_draft_model);
     auto is_prompt_lookup_enabled = extract_prompt_lookup_from_config(properties_without_draft_model);
 
     auto model = utils::read_model(models_path, properties_without_draft_model);
+    auto [properties_without_draft_model_without_gguf, enable_save_ov_model] = utils::extract_gguf_properties(properties_without_draft_model);
+
     auto generation_config = utils::from_config_json_if_exists(models_path);
 
     std::shared_ptr<InputsEmbedder> embedder;
     if (std::filesystem::exists(models_path / "openvino_text_embeddings_model.xml")) {
-        embedder = std::make_shared<InputsEmbedder>(models_path, device, properties_without_draft_model);
+        embedder = std::make_shared<InputsEmbedder>(models_path, device, properties_without_draft_model_without_gguf);
     }
 
     if (is_prompt_lookup_enabled) {
         OPENVINO_ASSERT(draft_model_desr.model == nullptr, "Speculative decoding and prompt lookup decoding are mutually exclusive");
         OPENVINO_ASSERT(embedder == nullptr, "Prompt lookup decoding is not supported for models with embeddings");
-        m_impl = std::make_shared<PromptLookupImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model, generation_config);
+        m_impl = std::make_shared<PromptLookupImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
     } else if (draft_model_desr.model != nullptr) {
         OPENVINO_ASSERT(embedder == nullptr, "Speculative decoding is not supported for models with embeddings");
-        auto main_model_descr = ov::genai::ModelDesc(model, tokenizer, device, properties_without_draft_model, scheduler_config, generation_config);
+        auto main_model_descr = ov::genai::ModelDesc(model, tokenizer, device, properties_without_draft_model_without_gguf, scheduler_config, generation_config);
         m_impl = std::make_shared<SpeculativeDecodingImpl>(main_model_descr, draft_model_desr);
     } else if (embedder) {
-        m_impl = std::make_shared<ContinuousBatchingImpl>(model, embedder, tokenizer, scheduler_config, device, properties_without_draft_model, generation_config);
+        m_impl = std::make_shared<ContinuousBatchingImpl>(model, embedder, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
     } else {
-        m_impl = std::make_shared<ContinuousBatchingImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model, generation_config);
+        m_impl = std::make_shared<ContinuousBatchingImpl>(model, tokenizer, scheduler_config, device, properties_without_draft_model_without_gguf, generation_config);
     }
 
     m_impl->m_load_time_ms = get_load_time(start_time);
```

src/cpp/src/gguf_utils/gguf_modeling.cpp

Lines changed: 22 additions & 8 deletions

```diff
@@ -14,7 +14,7 @@
 
 #include "gguf_utils/building_blocks.hpp"
 #include "gguf_utils/gguf_modeling.hpp"
-
+#include "utils.hpp"
 
 using namespace ov;
 using namespace ov::op::v13;
@@ -152,24 +152,38 @@ std::shared_ptr<ov::Model> create_language_model(
 
 } // namespace
 
-std::shared_ptr<ov::Model> create_from_gguf(const std::string& model_path) {
+std::shared_ptr<ov::Model> create_from_gguf(const std::string& model_path, const bool enable_save_ov_model) {
     auto start_time = std::chrono::high_resolution_clock::now();
-    std::cout << "Loading and unpacking model from: " << model_path << std::endl;
+    std::stringstream ss;
+    ss << "Loading and unpacking model from: " << model_path;
+    ov::genai::utils::print_gguf_debug_info(ss.str());
     auto [config, consts, qtypes] = load_gguf(model_path);
     auto load_finish_time = std::chrono::high_resolution_clock::now();
-    std::cout << "Loading and unpacking model done. Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(load_finish_time - start_time).count() << "ms" << std::endl;
-    std::cout << "Start generating OV model..." << std::endl;
-
-    std::shared_ptr<ov::Model> model;
 
+    ss.str("");
+    ss << "Loading and unpacking model done. Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(load_finish_time - start_time).count() << "ms";
+    ov::genai::utils::print_gguf_debug_info(ss.str());
+
+    std::shared_ptr<ov::Model> model;
     const std::string model_arch = std::get<std::string>(config.at("architecture"));
+    ss.str("");
+    ss << "Start generating OpenVINO model...";
+    ov::genai::utils::print_gguf_debug_info(ss.str());
     if (!model_arch.compare("llama") || !model_arch.compare("qwen2") || !model_arch.compare("qwen3")) {
         model = create_language_model(config, consts, qtypes);
+        if (enable_save_ov_model){
+            std::filesystem::path gguf_model_path(model_path);
+            std::filesystem::path save_path = gguf_model_path.parent_path() / "openvino_model.xml";
+            ov::genai::utils::save_openvino_model(model, save_path.string(), true);
+        }
     } else {
         OPENVINO_THROW("Unsupported model architecture '", model_arch, "'");
     }
     auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
         std::chrono::high_resolution_clock::now() - load_finish_time).count();
-    std::cout << "Model generation done. Time: " << duration << "ms" << std::endl;
+    ss.str("");
+    ss << "Model generation done. Time: " << duration << "ms";
+    ov::genai::utils::print_gguf_debug_info(ss.str());
+
     return model;
 }
```
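Since the PR leaves cache invalidation to the user, a caller can guard the serialized model with a timestamp check before reusing it. A hypothetical sketch (the helper name is illustrative, not part of this PR):

```cpp
#include <filesystem>

// Hypothetical helper: reuse the serialized OV model only if it is newer
// than the GGUF file it was generated from; otherwise return the GGUF path
// so the GGUF Reader regenerates (and re-serializes) the model.
std::filesystem::path choose_model_source(const std::filesystem::path& gguf_path) {
    namespace fs = std::filesystem;
    const fs::path cached = gguf_path.parent_path() / "openvino_model.xml";
    if (fs::exists(cached) &&
        fs::last_write_time(cached) >= fs::last_write_time(gguf_path)) {
        return gguf_path.parent_path();  // load the cached OV model directly
    }
    return gguf_path;  // stale or missing cache: regenerate from GGUF
}
```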

src/cpp/src/gguf_utils/gguf_modeling.hpp

Lines changed: 1 addition & 1 deletion

```diff
@@ -7,4 +7,4 @@
 
 #include "openvino/openvino.hpp"
 
-std::shared_ptr<ov::Model> create_from_gguf(const std::string& model_path);
+std::shared_ptr<ov::Model> create_from_gguf(const std::string& model_path, const bool enable_save_ov_model);
```
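For reference, a direct call to the updated entry point might look like this (the path is illustrative):

```cpp
// true => also serialize the generated model as openvino_model.xml/.bin
// into the folder containing the .gguf file.
auto model = create_from_gguf("gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf",
                              /*enable_save_ov_model=*/true);
```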

src/cpp/src/llm/pipeline_continuous_batching_adapter.hpp

Lines changed: 1 addition & 1 deletion

```diff
@@ -58,7 +58,7 @@ class ContinuousBatchingAdapter final : public LLMPipelineImplBase {
         const SchedulerConfig& scheduler_config,
         const std::string& device,
         const ov::AnyMap& plugin_config
-    ): LLMPipelineImplBase{Tokenizer(models_path), GenerationConfig()} {
+    ): LLMPipelineImplBase{Tokenizer(models_path, plugin_config), GenerationConfig()} {
         auto mutable_plugin_config = plugin_config;
         mutable_plugin_config["sampler_num_threads"] = 1;
         m_impl = std::make_unique<ContinuousBatchingPipeline>(models_path, m_tokenizer, scheduler_config, device, mutable_plugin_config);
```

src/cpp/src/llm/pipeline_stateful.cpp

Lines changed: 3 additions & 2 deletions

```diff
@@ -65,7 +65,8 @@ StatefulLLMPipeline::StatefulLLMPipeline(
     if (!m_use_full_chat_history)
         m_kv_cache_state.seq_length_axis = kv_pos.seq_len;
 
-    auto filtered_properties = extract_adapters_from_properties(properties, &m_generation_config.adapters);
+    auto [filtered_properties_without_gguf, enable_save_ov_model] = utils::extract_gguf_properties(properties);
+    auto filtered_properties = extract_adapters_from_properties(filtered_properties_without_gguf, &m_generation_config.adapters);
     if (m_generation_config.adapters) {
         m_generation_config.adapters->set_tensor_name_prefix("base_model.model.");
         m_adapter_controller = AdapterController(model, *m_generation_config.adapters, device); // TODO: Make the prefix name configurable
@@ -93,7 +94,7 @@ StatefulLLMPipeline::StatefulLLMPipeline(
     const std::filesystem::path& models_path,
     const std::string& device,
     const ov::AnyMap& plugin_config)
-    : StatefulLLMPipeline{models_path, Tokenizer(models_path), device, plugin_config} {}
+    : StatefulLLMPipeline{models_path, Tokenizer(models_path, plugin_config), device, plugin_config} {}
 
 DecodedResults StatefulLLMPipeline::generate(
     StringInputs inputs,
```

src/cpp/src/tokenizer/tokenizer.cpp

Lines changed: 25 additions & 5 deletions

```diff
@@ -247,7 +247,9 @@ class Tokenizer::TokenizerImpl {
 
     std::shared_ptr<ov::Model> ov_tokenizer = nullptr;
     std::shared_ptr<ov::Model> ov_detokenizer = nullptr;
-
+    auto [filtered_properties, enable_save_ov_model] = utils::extract_gguf_properties(properties);
+    // Pass no additional properties to the tokenizer/detokenizer models since they are not used by default
+    filtered_properties = {};
     if (is_gguf_model(models_path)) {
         std::map<std::string, GGUFMetaData> tokenizer_config{};
         const char* ov_tokenizer_path = getenv(ScopedVar::ENVIRONMENT_VARIABLE_NAME);
@@ -272,15 +274,33 @@ class Tokenizer::TokenizerImpl {
             m_chat_template = patch_gguf_chat_template(m_chat_template);
         }
 
-        setup_tokenizer(std::make_pair(ov_tokenizer, ov_detokenizer), properties);
+        if (enable_save_ov_model){
+            std::filesystem::path gguf_model_path(models_path);
+            std::filesystem::path save_ov_tokenizer_path = gguf_model_path.parent_path() / "openvino_tokenizer.xml";
+            std::filesystem::path save_ov_detokenizer_path = gguf_model_path.parent_path() / "openvino_detokenizer.xml";
+            ov_tokenizer->set_rt_info(m_pad_token_id, "pad_token_id");
+            ov_tokenizer->set_rt_info(m_bos_token_id, "bos_token_id");
+            ov_tokenizer->set_rt_info(m_eos_token_id, "eos_token_id");
+            ov_tokenizer->set_rt_info(m_chat_template, "chat_template");
+
+            ov_detokenizer->set_rt_info(m_pad_token_id, "pad_token_id");
+            ov_detokenizer->set_rt_info(m_bos_token_id, "bos_token_id");
+            ov_detokenizer->set_rt_info(m_eos_token_id, "eos_token_id");
+            ov_detokenizer->set_rt_info(m_chat_template, "chat_template");
+
+            ov::genai::utils::save_openvino_model(ov_tokenizer, save_ov_tokenizer_path.string(), false);
+            ov::genai::utils::save_openvino_model(ov_detokenizer, save_ov_detokenizer_path.string(), false);
+        }
+
+        setup_tokenizer(std::make_pair(ov_tokenizer, ov_detokenizer), filtered_properties);
         return;
     }
     if (std::filesystem::exists(models_path / "openvino_tokenizer.xml")) {
-        ov_tokenizer = core.read_model(models_path / "openvino_tokenizer.xml", {}, properties);
+        ov_tokenizer = core.read_model(models_path / "openvino_tokenizer.xml", {}, filtered_properties);
     }
 
     if (std::filesystem::exists(models_path / "openvino_detokenizer.xml")) {
-        ov_detokenizer = core.read_model(models_path / "openvino_detokenizer.xml", {}, properties);
+        ov_detokenizer = core.read_model(models_path / "openvino_detokenizer.xml", {}, filtered_properties);
     }
 
     read_config(models_path);
@@ -290,7 +310,7 @@ class Tokenizer::TokenizerImpl {
     parse_if_exists(models_path / "tokenizer_config.json", m_chat_template);
     parse_if_exists(models_path / "processor_config.json", m_chat_template);
     parse_if_exists(models_path / "chat_template.json", m_chat_template);
-    setup_tokenizer(std::make_pair(ov_tokenizer, ov_detokenizer), properties);
+    setup_tokenizer(std::make_pair(ov_tokenizer, ov_detokenizer), filtered_properties);
 }
 
 void setup_tokenizer(const std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>>& models, const ov::AnyMap& properties) {
```
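Because the special-token IDs and chat template are stashed in each model's rt_info, they survive the round trip to disk. A sketch of inspecting a saved tokenizer afterwards (assuming serialization succeeded and the token IDs are stored as 64-bit integers):

```cpp
#include "openvino/openvino.hpp"

ov::Core core;
auto saved = core.read_model("gguf_models/openvino_tokenizer.xml");
// rt_info entries written by the GGUF path above; int64_t is an assumption
// about how the token IDs are stored.
auto eos_token_id  = saved->get_rt_info<int64_t>("eos_token_id");
auto chat_template = saved->get_rt_info<std::string>("chat_template");
```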

src/cpp/src/utils.cpp

Lines changed: 39 additions & 4 deletions

```diff
@@ -301,17 +301,45 @@ ov::Core singleton_core() {
 
 
 namespace {
-
 bool is_gguf_model(const std::filesystem::path& file_path) {
     return file_path.extension() == ".gguf";
 }
 
 } // namespace
 
-std::shared_ptr<ov::Model> read_model(const std::filesystem::path& model_dir, const ov::AnyMap& config) {
+std::pair<ov::AnyMap, bool> extract_gguf_properties(const ov::AnyMap& external_properties) {
+    bool enable_save_ov_model = false;
+    ov::AnyMap properties = external_properties;
+
+    auto it = properties.find(ov::genai::enable_save_ov_model.name());
+    if (it != properties.end()) {
+        enable_save_ov_model = it->second.as<bool>();
+        properties.erase(it);
+    }
+
+    return {properties, enable_save_ov_model};
+}
+
+void save_openvino_model(const std::shared_ptr<ov::Model>& model, const std::string& save_path, bool compress_to_fp16) {
+    try {
+        auto serialize_start_time = std::chrono::high_resolution_clock::now();
+        ov::save_model(model, save_path, compress_to_fp16);
+        auto serialize_finish_time = std::chrono::high_resolution_clock::now();
+        auto serialize_duration = std::chrono::duration_cast<std::chrono::milliseconds>(serialize_finish_time - serialize_start_time).count();
+        std::stringstream ss;
+        ss << "Save generated OpenVINO model to: " << save_path << " done. Time: " << serialize_duration << " ms";
+        ov::genai::utils::print_gguf_debug_info(ss.str());
+    }
+    catch (const ov::Exception& e) {
+        OPENVINO_THROW("Exception during model serialization ", e.what(), ", user can disable it by setting 'ov::genai::enable_save_ov_model' property to false");
+    }
+}
+
+std::shared_ptr<ov::Model> read_model(const std::filesystem::path& model_dir, const ov::AnyMap& properties) {
+    auto [filtered_properties, enable_save_ov_model] = extract_gguf_properties(properties);
     if (is_gguf_model(model_dir)) {
 #ifdef ENABLE_GGUF
-        return create_from_gguf(model_dir.string());
+        return create_from_gguf(model_dir.string(), enable_save_ov_model);
 #else
         OPENVINO_ASSERT("GGUF support is switched off. Please, recompile with 'cmake -DENABLE_GGUF=ON'");
 #endif
@@ -326,7 +354,7 @@ std::shared_ptr<ov::Model> read_model(const std::filesystem::path& model_dir, c
         OPENVINO_THROW("Could not find a model in the directory '", model_dir, "'");
     }
 
-    return singleton_core().read_model(model_path, {}, config);
+    return singleton_core().read_model(model_path, {}, filtered_properties);
 }
 }
 
@@ -467,6 +495,13 @@ void print_compiled_model_properties(ov::CompiledModel& compiled_Model, const ch
 }
 }
 
+void print_gguf_debug_info(const std::string &debug_info) {
+    if (!env_setup_for_print_debug_info()) {
+        return;
+    }
+    std::cout << "[GGUF Reader]: " << debug_info << std::endl;
+}
+
 std::pair<ov::CompiledModel, KVDesc>
 compile_decoder_for_npu(const std::shared_ptr<ov::Model>& model,
                         const ov::AnyMap& config,
```
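The extract-and-erase pattern keeps the GenAI-only key from ever reaching device plugins. A minimal sketch of how a caller of this internal helper consumes the result via structured bindings:

```cpp
ov::AnyMap props{{ov::genai::enable_save_ov_model.name(), true},
                 {"PERFORMANCE_HINT", "LATENCY"}};
// The GenAI-only key is removed from the map and returned separately as a
// bool, so only plugin-understood properties remain.
auto [plugin_props, enable_save_ov_model] = ov::genai::utils::extract_gguf_properties(props);
// plugin_props now contains only PERFORMANCE_HINT; enable_save_ov_model == true.
```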

src/cpp/src/utils.hpp

Lines changed: 6 additions & 0 deletions

```diff
@@ -102,6 +102,8 @@ void apply_gather_before_matmul_transformation(std::shared_ptr<ov::Model> model)
 
 ov::Core singleton_core();
 
+std::pair<ov::AnyMap, bool> extract_gguf_properties(const ov::AnyMap& external_properties);
+
 std::shared_ptr<ov::Model> read_model(const std::filesystem::path& model_dir, const ov::AnyMap& config);
 
 void release_core_plugin(const std::string& device);
@@ -147,6 +149,8 @@ bool env_setup_for_print_debug_info();
 
 void print_compiled_model_properties(ov::CompiledModel& compiled_Model, const char* model_title);
 
+void print_gguf_debug_info(const std::string& debug_info);
+
 struct KVDesc {
     uint32_t max_prompt_len;
     uint32_t min_response_len;
@@ -246,6 +250,8 @@ bool explicitly_requires_paged_attention(const ov::AnyMap& properties);
 
 std::pair<ov::AnyMap, std::string> extract_attention_backend(const ov::AnyMap& external_properties);
 
+void save_openvino_model(const std::shared_ptr<ov::Model>& model, const std::string& save_path, bool compress_to_fp16);
+
 } // namespace utils
 } // namespace genai
 } // namespace ov
```

src/docs/DEBUG_LOG.md

Lines changed: 14 additions & 2 deletions

````diff
@@ -42,7 +42,7 @@ the properties of the compiled model are printed as follows:
 CPU: Intel(R) Xeon(R) Platinum 8468
 ```
 
-When Speculative Decoding ot Prompt Lookup pipeline is executed, performance metrics will be also printed.
+When a Speculative Decoding or Prompt Lookup pipeline is executed, performance metrics will also be printed.
 
 For example:
 
@@ -64,4 +64,16 @@ Generated tokens: 100
 Accepted token rate, %: 51
 ===============================
 Request_id: 0 ||| 40 0 40 20 0 0 40 40 0 20 20 20 0 40 0 0 20 80 0 80 20 0 0 0 40 80 0 40 60 40 80 0 0 0 0 40 20 20 0 40 20 40 0 20 0 0 0
-```
+```
+
+
+When a GGUF model is passed to LLMPipeline, detailed debug info will also be printed.
+
+For example:
+```
+[GGUF Reader]: Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
+[GGUF Reader]: Loading and unpacking model done. Time: 196ms
+[GGUF Reader]: Start generating OpenVINO model...
+[GGUF Reader]: Save generated OpenVINO model to: gguf_models/openvino_model.xml done. Time: 466 ms
+[GGUF Reader]: Model generation done. Time: 757ms
+```
````
