Describe the bug
There is a misleading error when deploying models into small MIG partitions: TGIS surfaces a low-level PyTorch internal assert instead of a clear message about why the model cannot be loaded.
To Reproduce
- Deploy TGIS in OpenShift AI.
- Enable MIG (1g.5gb partitions).
- Deploy a Granite 3B model in standalone TGIS (a minimal manifest sketch follows below).
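For reference, a minimal sketch of the kind of InferenceService used to reproduce this. All names, the runtime, and the storageUri are illustrative placeholders, and the MIG resource name assumes the NVIDIA GPU Operator's default naming under the mixed MIG strategy:

```yaml
# Hypothetical InferenceService sketch; name, runtime, and storageUri are
# placeholders, not taken from the actual cluster.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-3b
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      runtime: tgis-runtime            # standalone TGIS ServingRuntime
      storageUri: s3://models/granite-3b
      resources:
        requests:
          nvidia.com/mig-1g.5gb: "1"   # single 1g.5gb MIG partition
        limits:
          nvidia.com/mig-1g.5gb: "1"
```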
Expected output
The inference service runs, or TGIS reports a detailed error explaining why the model cannot be served.
Actual error
2024-06-24T10:33:54.484964Z  INFO text_generation_launcher: TGIS Commit hash:
2024-06-24T10:33:54.484984Z  INFO text_generation_launcher: Launcher args: Args { model_name: "/mnt/models/", revision: None, deployment_framework: "hf_transformers", dtype: None, dtype_str: None, quantize: None, num_shard: None, max_concurrent_requests: 512, max_sequence_length: Some(448), max_new_tokens: 384, max_batch_size: 64, max_prefill_padding: 0.2, batch_safety_margin: 20, max_waiting_tokens: 24, port: 3000, grpc_port: 8033, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, json_output: false, tls_cert_path: None, tls_key_path: None, tls_client_ca_cert_path: None, output_special_tokens: false, cuda_process_memory_fraction: 1.0, default_include_stop_seqs: true, otlp_endpoint: None, otlp_service_name: None }
2024-06-24T10:33:54.484997Z  INFO text_generation_launcher: Inferring num_shard = 1 from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
2024-06-24T10:33:54.485049Z  INFO text_generation_launcher: Saving fast tokenizer for `/mnt/models/` to `/tmp/74657ff2-73b1-45f2-b8d5-a7302a63f862`
/opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
2024-06-24T10:33:56.397996Z  INFO text_generation_launcher: Using configured max_sequence_length: 448
2024-06-24T10:33:56.398022Z  INFO text_generation_launcher: Setting PYTORCH_CUDA_ALLOC_CONF to default value: expandable_segments:True
2024-06-24T10:33:56.398340Z  INFO text_generation_launcher: Starting shard 0
Shard 0: /opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
Shard 0: warnings.warn(
Shard 0: HAS_BITS_AND_BYTES=False, HAS_GPTQ_CUDA=True, EXLLAMA_VERSION=2, GPTQ_CUDA_TYPE=exllama
Shard 0: supports_causal_lm = True, supports_seq2seq_lm = False
Shard 0: Traceback (most recent call last):
Shard 0:
Shard 0: File "/opt/tgis/bin/text-generation-server", line 8, in <module>
Shard 0: sys.exit(app())
Shard 0: ^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 75, in serve
Shard 0: raise e
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 56, in serve
Shard 0: server.serve(
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 388, in serve
Shard 0: asyncio.run(
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 190, in run
Shard 0: return runner.run(main)
Shard 0: ^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 118, in run
Shard 0: return self._loop.run_until_complete(task)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
Shard 0: return future.result()
Shard 0: ^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 267, in serve_inner
Shard 0: model = get_model(
Shard 0: ^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 126, in get_model
Shard 0: return CausalLM(model_name, revision, deployment_framework, dtype, quantize, model_config, max_sequence_length)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/causal_lm.py", line 558, in __init__
Shard 0: inference_engine = get_inference_engine_class(deployment_framework)(
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/inference_engine/hf_transformers.py", line 76, in __init__
Shard 0: self.model = model_class.from_pretrained(**kwargs).requires_grad_(False).eval()
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
Shard 0: return model_class.from_pretrained(
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3375, in from_pretrained
Shard 0: model = cls(config, *model_args, **model_kwargs)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1103, in __init__
Shard 0: self.model = LlamaModel(config)
Shard 0: ^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
Shard 0: [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
Shard 0: [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 701, in __init__
Shard 0: self.mlp = LlamaMLP(config)
Shard 0: ^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 219, in __init__
Shard 0: self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 98, in __init__
Shard 0: self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/torch/utils/_device.py", line 77, in __torch_function__
Shard 0: return func(*args, **kwargs)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^
Shard 0:
Shard 0: RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":830, please report a bug to PyTorch.
Shard 0:
2024-06-24T10:34:00.379801Z ERROR text_generation_launcher: Shard 0 failed: ExitStatus(unix_wait_status(256))
2024-06-24T10:34:00.400918Z  INFO text_generation_launcher: Shutting down shards
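A likely culprit, offered as an educated guess rather than a confirmed diagnosis: the launcher defaults PYTORCH_CUDA_ALLOC_CONF to expandable_segments:True (visible in the log above), and PyTorch's expandable-segments allocator has been reported not to work on MIG devices, failing with exactly this kind of NVML internal assert from CUDACachingAllocator.cpp. If the launcher honors a pre-set value rather than forcing its default, overriding the variable on the predictor container might replace the cryptic assert with a clearer failure (or an ordinary CUDA OOM). A sketch under that assumption:

```yaml
# Hypothetical env override for the TGIS container; assumes the launcher
# keeps a user-provided PYTORCH_CUDA_ALLOC_CONF instead of forcing its
# expandable_segments:True default.
env:
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "max_split_size_mb:512"  # any valid setting other than expandable_segments:True
```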
Workaround
Deploy the model in a larger MIG partition. Note that for a 3B-parameter model, fp16 weights alone take roughly 6 GB, which already exceeds the 5 GB of a 1g.5gb slice, so a larger slice is needed in any case (see the sketch below).
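For example (the resource name again assumes the GPU Operator's mixed MIG strategy; pick whichever slice actually fits the model):

```yaml
# Hypothetical resource request for a larger MIG slice.
resources:
  requests:
    nvidia.com/mig-3g.20gb: "1"  # ~20 GB of device memory instead of 5 GB
  limits:
    nvidia.com/mig-3g.20gb: "1"
```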