Model loads into system RAM, VRAM stays unused, and CPU usage hits 100% when asking a question in chat #39

@githust66

Description

xllamacpp version: 0.1.20+rocm6.4.1
xinference version: 1.7.0.post1
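
Before reading the log, it may be worth confirming that the HIP device is visible to this Python environment at all. A minimal check, assuming the ROCm build of PyTorch that the amdsmi warning in the log below implies (this only verifies PyTorch's view of the GPU, not the llama.cpp backend's):

```python
# Sanity check: is a HIP/ROCm device visible to this environment?
# Assumes a ROCm build of PyTorch, as suggested by the amdsmi warning below.
import torch

print(torch.version.hip)          # ROCm version string on a ROCm build, None otherwise
print(torch.cuda.is_available())  # True only if a HIP device is actually visible
print(torch.cuda.device_count())  # number of visible devices
```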

Log:
(xinf) root@DESKTOP-ESRGKIB:/usr/local# VLLM_USE_TRITON_FLASH_ATTN=0 Environment="XINFERENCE_MODEL_SRC=modelscope" HF_ENDPOINT=https://hf-mirror.com PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:512,expandable_segments:True xinference-local --host 0.0.0.0 --port 9997
2025-06-16 17:50:31.238092: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/cuda/__init__.py:736: UserWarning: Can't initialize amdsmi - Error code: 34
warnings.warn(f"Can't initialize amdsmi - Error code: {e.err_code}")
is_rocm: True
is_rocm: True
INFO 06-16 17:50:34 [__init__.py:257] Automatically detected platform rocm.
2025-06-16 17:50:38,173 xinference.core.supervisor 2339 INFO Xinference supervisor 0.0.0.0:62075 started
2025-06-16 17:50:38,197 xinference.core.worker 2339 INFO Starting metrics export server at 0.0.0.0:None
2025-06-16 17:50:38,199 xinference.core.worker 2339 INFO Checking metrics export server...
2025-06-16 17:50:40,031 xinference.core.worker 2339 INFO Metrics server is started at: http://0.0.0.0:46238
2025-06-16 17:50:40,032 xinference.core.worker 2339 INFO Purge cache directory: /root/.xinference/cache
2025-06-16 17:50:40,034 xinference.core.worker 2339 INFO Connected to supervisor as a fresh worker
2025-06-16 17:50:40,050 xinference.core.worker 2339 INFO Xinference worker 0.0.0.0:62075 started
2025-06-16 17:50:45,185 xinference.api.restful_api 2301 INFO Starting Xinference at endpoint: http://0.0.0.0:9997
2025-06-16 17:50:45,329 uvicorn.error 2301 INFO Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)
2025-06-16 17:51:37,844 xinference.core.worker 2339 INFO [request 7eecc442-4a97-11f0-8d12-58cdc986b77d] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7fb23a110d60>, kwargs: model_uid=deepseek-r1-distill-qwen-0,model_name=deepseek-r1-distill-qwen,model_size_in_billions=14,model_format=ggufv2,quantization=Q4_K_M,model_engine=llama.cpp,model_type=LLM,n_gpu=1,request_limits=None,peft_model_config=None,gpu_idx=None,download_hub=None,model_path=/usr/local/models/ds-ai/DeepSeek-R1-Distill-Qwen-14B,xavier_config=None,reasoning_content=False
/root/miniconda3/envs/xinf/lib/python3.10/site-packages/torch/cuda/__init__.py:736: UserWarning: Can't initialize amdsmi - Error code: 34
warnings.warn(f"Can't initialize amdsmi - Error code: {e.err_code}")
is_rocm: True
is_rocm: True
INFO 06-16 17:51:52 [__init__.py:257] Automatically detected platform rocm.
2025-06-16 17:51:56,341 xinference.core.model 2353 INFO Start requests handler.
build: 1 (e54b394) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: loading model
srv load_model: loading model '/usr/local/models/ds-ai/DeepSeek-R1-Distill-Qwen-14B/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf'
llama_model_loader: loaded meta data with 27 key-value pairs and 579 tensors from /usr/local/models/ds-ai/DeepSeek-R1-Distill-Qwen-14B/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 14B
llama_model_loader: - kv 3: general.organization str = Deepseek Ai
llama_model_loader: - kv 4: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 5: general.size_label str = 14B
llama_model_loader: - kv 6: qwen2.block_count u32 = 48
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 13824
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 15
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 8.37 GiB (4.87 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 13824
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 14B
print_info: model params = 14.77 B
print_info: general.name = DeepSeek R1 Distill Qwen 14B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: CPU model buffer size = 2457.29 MiB
load_tensors: CPU_REPACK model buffer size = 6108.75 MiB
...........................................................................................
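
Note that both `load_tensors` buffer lines above name CPU backends: plain `CPU` plus `CPU_REPACK` (the CPU backend's weight-repacking path). When layers are actually offloaded, llama.cpp typically also prints a device buffer line (e.g. a `ROCm0 model buffer size` entry), so the weights here never left system RAM. For comparison, a minimal sketch of requesting full offload through llama-cpp-python, a sibling binding of the same llama.cpp core (the binding choice here is an illustrative assumption, not the xllamacpp API):

```python
# Illustrative sketch using llama-cpp-python (NOT xllamacpp) built with HIP:
# n_gpu_layers controls how many transformer layers llama.cpp offloads.
from llama_cpp import Llama

llm = Llama(
    model_path="/usr/local/models/ds-ai/DeepSeek-R1-Distill-Qwen-14B/"
               "DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf",
    n_gpu_layers=-1,  # -1 offloads all 48 layers; 0 keeps everything on CPU
    n_ctx=16384,
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```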
llama_context: constructing llama_context
llama_context: n_seq_max = 8
llama_context: n_ctx = 131072
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 4.64 MiB
llama_kv_cache_unified: CPU KV buffer size = 24576.00 MiB
llama_kv_cache_unified: size = 24576.00 MiB (131072 cells, 48 layers, 8 seqs), K (f16): 12288.00 MiB, V (f16): 12288.00 MiB
llama_context: CPU compute buffer size = 10536.01 MiB
llama_context: graph nodes = 1878
llama_context: graph splits = 1
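
For reference, the 24576 MiB KV-cache figure above follows directly from the parameters printed in this log, which is why the context alone dwarfs the 8.37 GiB of weights:

```python
# Reproduce the 24576 MiB KV-cache size from the values logged above.
n_layer   = 48      # qwen2.block_count
n_embd_kv = 1024    # n_embd_k_gqa = n_embd_v_gqa = n_head_kv (8) * 128
n_ctx     = 131072  # n_seq_max (8) * n_ctx_per_seq (16384)
f16_bytes = 2       # K and V are stored as f16

per_tensor_mib = n_layer * n_embd_kv * n_ctx * f16_bytes / 2**20
print(per_tensor_mib)      # 12288.0 MiB for K, same again for V
print(2 * per_tensor_mib)  # 24576.0 MiB total, matching the log
```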
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 8
slot init: id 0 | task -1 | new slot n_ctx_slot = 16384
slot init: id 1 | task -1 | new slot n_ctx_slot = 16384
slot init: id 2 | task -1 | new slot n_ctx_slot = 16384
slot init: id 3 | task -1 | new slot n_ctx_slot = 16384
slot init: id 4 | task -1 | new slot n_ctx_slot = 16384
slot init: id 5 | task -1 | new slot n_ctx_slot = 16384
slot init: id 6 | task -1 | new slot n_ctx_slot = 16384
slot init: id 7 | task -1 | new slot n_ctx_slot = 16384
init: model loaded
init: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' in message %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- else %}{{'<|Assistant|>' + message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' not in message %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|><think>\n'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
init: starting the main loop
srv update_slots: all slots are idle
2025-06-16 17:52:47,905 xinference.core.model 2353 INFO ModelActor(deepseek-r1-distill-qwen-0) loaded
2025-06-16 17:52:48,010 xinference.core.worker 2339 INFO [request 7eecc442-4a97-11f0-8d12-58cdc986b77d] Leave launch_builtin_model, elapsed time: 70 s
2025-06-16 17:52:48,048 xinference.core.worker 2339 INFO [request a8c44f06-4a97-11f0-8d12-58cdc986b77d] Enter wait_for_load, args: <xinference.core.worker.WorkerActor object at 0x7fb23a110d60>,deepseek-r1-distill-qwen-0, kwargs:
2025-06-16 17:52:48,064 xinference.core.worker 2339 INFO [request a8c44f06-4a97-11f0-8d12-58cdc986b77d] Leave wait_for_load, elapsed time: 0 s
2025-06-16 17:53:24,474 xinference.core.worker 2339 INFO [request be7b246e-4a97-11f0-8d12-58cdc986b77d] Enter terminate_model, args: <xinference.core.worker.WorkerActor object at 0x7fb23a110d60>, kwargs: model_uid=deepseek-r1-distill-qwen-0
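
One more observation for anyone reproducing this: the launch request above passes n_gpu=1, yet zero layers were offloaded. A hedged sketch of relaunching through the Xinference Python client while forwarding n_gpu_layers explicitly; Client and launch_model are real xinference APIs, but whether this extra kwarg reaches the llama.cpp engine in 1.7.0.post1 is an assumption to verify:

```python
# Hypothetical relaunch sketch: forwarding n_gpu_layers to the llama.cpp engine.
from xinference.client import Client

client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="deepseek-r1-distill-qwen",
    model_engine="llama.cpp",
    model_format="ggufv2",
    model_size_in_billions=14,
    quantization="Q4_K_M",
    n_gpu_layers=-1,  # assumption: forwarded to llama.cpp for full offload
)
print(model_uid)
```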
