diff --git a/content/learning-paths/servers-and-cloud-computing/vLLM-quant/1-overview.md b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/1-overview.md
new file mode 100644
index 0000000000..fd0766b7b6
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/1-overview.md
@@ -0,0 +1,112 @@
+---
+title: Overview and Environment Setup
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Overview
+
+[vLLM](https://github.com/vllm-project/vllm) is an open-source, high-throughput inference engine designed to serve large language models (LLMs) efficiently. It offers an OpenAI-compatible API, supports dynamic batching, and is optimized for low latency — making it suitable for both real-time and batch inference workloads.
+
+This learning path walks through how to combine vLLM with INT8 quantization to reduce memory usage and improve inference speed, enabling large models such as Llama 3.1 to run effectively on Arm-based CPUs.
+
+The model featured in this guide — [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) — is sourced from Hugging Face, quantized with `llmcompressor`, and deployed using vLLM.
+
+Testing for this learning path was performed on an AWS Graviton4 instance (c8g.16xlarge). The instructions are intended for Arm-based servers running Ubuntu 24.04 LTS.
+
+
+## Learning Path Setup
+
+This learning path uses a Python virtual environment (`venv`) to manage dependencies in an isolated workspace. This approach keeps the environment clean, avoids version conflicts, and makes it easy to reproduce results — especially when using custom-built packages like `vLLM` and `PyTorch`.
+
+### Set up the Python environment
+
+To get started, create a virtual environment and activate it as shown below:
+
+```bash
+sudo apt update
+sudo apt install -y python3 python3-venv
+python3 -m venv vllm_env
+source vllm_env/bin/activate
+pip install --upgrade pip
+```
+This creates a local Python environment named `vllm_env` and upgrades pip to the latest version.
+
+### Install system dependencies
+
+These packages are needed to build libraries like OpenBLAS and to manage system-level performance:
+
+```bash
+sudo apt-get update -y
+sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-pip
+sudo apt install -y python-is-python3
+```
+Set the system default compilers to version 12:
+
+```bash
+sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
+  --slave /usr/bin/g++ g++ /usr/bin/g++-12
+```
+Next, install [`tcmalloc`](https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html?device=arm), a memory allocator that helps improve performance during inference:
+
+```bash
+sudo apt-get install -y libtcmalloc-minimal4
+```
+This library will be preloaded during model serving to reduce latency and improve memory efficiency.
+
+### Install OpenBLAS
+
+OpenBLAS is an optimized linear algebra library that speeds up the matrix-heavy operations common in LLM inference. To get the best performance on Arm CPUs, it's recommended to build OpenBLAS from source.
+
+Run these commands to clone OpenBLAS and check out a known-good commit:
+```bash
+git clone https://github.com/OpenMathLib/OpenBLAS.git
+cd OpenBLAS
+git checkout ef9e3f715
+```
+{{% notice Note %}}
+This commit is known to work reliably with Arm CPU optimizations (BF16, OpenMP) and has been tested in this learning path. Using it ensures consistent behavior.
+You can try `main`, but newer commits may introduce changes that haven't been validated here.
+{{% /notice %}}
+
+Now build and install OpenBLAS with BF16 and OpenMP support enabled:
+
+```bash
+make -j$(nproc) BUILD_BFLOAT16=1 USE_OPENMP=1 NO_SHARED=0 DYNAMIC_ARCH=1 TARGET=ARMV8 CFLAGS=-O3
+make -j$(nproc) BUILD_BFLOAT16=1 USE_OPENMP=1 NO_SHARED=0 DYNAMIC_ARCH=1 TARGET=ARMV8 CFLAGS=-O3 PREFIX=/home/ubuntu/OpenBLAS/dist install
+```
+This builds and installs OpenBLAS into `/home/ubuntu/OpenBLAS/dist` with optimizations for Arm CPUs.
+
+### Install Python dependencies
+
+Once the system libraries are in place, install the Python packages required for model quantization and serving. You’ll use prebuilt CPU wheels for vLLM and PyTorch, and install additional tools like `llmcompressor` and `torchvision`.
+
+Before proceeding, make sure the following wheel files are present in your home directory. They are required to complete the installation and model quantization steps:
+
+- `vllm-0.7.3.dev151+gfaee222b.cpu-cp312-cp312-linux_aarch64.whl`
+- `torch-2.7.0.dev20250306-cp312-cp312-manylinux_2_28_aarch64.whl`
+
+Now, navigate to your home directory:
+```bash
+cd /home/ubuntu/
+```
+
+Install the vLLM wheel. This wheel contains the CPU-optimized version of `vLLM`, built specifically for the Arm architecture. Installing it from a local `.whl` file ensures compatibility with the rest of your environment and avoids potential conflicts from nightly or default pip installations.
+
+```bash
+pip install vllm-0.7.3.dev151+gfaee222b.cpu-cp312-cp312-linux_aarch64.whl --force-reinstall
+```
+Install `llmcompressor`, which is used to quantize the model:
+```bash
+pip install llmcompressor
+```
+Install torchvision (nightly version for CPU):
+```bash
+pip install --force-reinstall torchvision==0.22.0.dev20250213 --extra-index-url https://download.pytorch.org/whl/nightly/cpu
+```
+Finally, install the custom PyTorch CPU wheel.
+This wheel is prebuilt for Arm CPU architectures and includes the necessary optimizations for running inference. Installing it locally ensures compatibility with your environment and avoids conflicts with default pip packages:
+```bash
+pip install torch-2.7.0.dev20250306-cp312-cp312-manylinux_2_28_aarch64.whl --force-reinstall --no-deps
+```
+
+You’re now ready to quantize the model and start serving it with `vLLM` on an Arm-based system.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/vLLM-quant/2-quantize-model.md b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/2-quantize-model.md
new file mode 100644
index 0000000000..b1c6390807
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/2-quantize-model.md
@@ -0,0 +1,178 @@
+---
+title: Quantize and Launch the vLLM server
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Access the Model from Hugging Face
+
+Before quantizing, authenticate with Hugging Face using a personal access token. You can generate one from your [Hugging Face Hub](https://huggingface.co/) account under Access Tokens. Store the token in the `hf_token` environment variable, then log in:
+
+```bash
+huggingface-cli login --token $hf_token
+```
+## Quantization Script Template
+
+Create the `vllm_quantize_model.py` script shown below to quantize the model:
+```python
+import argparse
+import os
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+from llmcompressor.modifiers.quantization import QuantizationModifier
+from compressed_tensors.quantization import QuantizationScheme
+from compressed_tensors.quantization.quant_args import (
+    QuantizationArgs,
+    QuantizationStrategy,
+    QuantizationType,
+)
+from llmcompressor.transformers import oneshot
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Quantize a model using LLM Compressor with customizable mode, scheme, and group size."
+    )
+    parser.add_argument(
+        "model_id",
+        type=str,
+        help="Model identifier or path (e.g., 'meta-llama/Llama-2-13b-chat-hf' or '/path/to/model')",
+    )
+    parser.add_argument(
+        "--mode",
+        type=str,
+        choices=["int4", "int8"],
+        required=True,
+        help="Quantization mode: int4 or int8",
+    )
+    parser.add_argument(
+        "--scheme",
+        type=str,
+        choices=["channelwise", "groupwise"],
+        required=True,
+        help="Quantization scheme for weights (groupwise is only supported for int4)",
+    )
+    parser.add_argument(
+        "--groupsize",
+        type=int,
+        default=32,
+        help="Group size for groupwise quantization (only used when scheme is groupwise). Defaults to 32."
+    )
+    args = parser.parse_args()
+
+    # Validate unsupported configuration
+    if args.mode == "int8" and args.scheme == "groupwise":
+        raise ValueError("Groupwise int8 is unsupported. Please use channelwise for int8.")
+
+    # Extract a base model name from the model id or path for the output directory
+    if "/" in args.model_id:
+        base_model_name = args.model_id.split("/")[-1]
+    else:
+        base_model_name = os.path.basename(args.model_id)
+
+    # Determine output directory based on mode and scheme
+    if args.mode == "int4":
+        output_dir = f"{base_model_name}-w4a8-{args.scheme}"
+    else:  # int8
+        output_dir = f"{base_model_name}-w8a8-{args.scheme}"
+
+    print(f"Loading model '{args.model_id}'...")
+    model = AutoModelForCausalLM.from_pretrained(
+        args.model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
+    )
+    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
+
+    # Define quantization arguments based on mode and chosen scheme.
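+    # In int8 mode (used in this learning path) this produces a W8A8 recipe:
+    # weights are quantized to 8-bit signed integers, symmetric and per channel,
+    # while activations (configured further below) use 8-bit asymmetric, dynamic
+    # per-token quantization. In int4 mode the weights drop to 4 bits (W4A8),
+    # either per channel or in groups of `--groupsize` columns.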
+ if args.mode == "int8": + # Only channelwise is supported for int8. + weights_args = QuantizationArgs( + num_bits=8, + type=QuantizationType.INT, + strategy=QuantizationStrategy.CHANNEL, + symmetric=True, + dynamic=False, + ) + else: # int4 mode + if args.scheme == "channelwise": + strategy = QuantizationStrategy.CHANNEL + weights_args = QuantizationArgs( + num_bits=4, + type=QuantizationType.INT, + strategy=strategy, + symmetric=True, + dynamic=False, + ) + else: # groupwise + strategy = QuantizationStrategy.GROUP + weights_args = QuantizationArgs( + num_bits=4, + type=QuantizationType.INT, + strategy=strategy, + group_size=args.groupsize, + symmetric=True, + dynamic=False + ) + + # Activation quantization remains the same for both modes. + activations_args = QuantizationArgs( + num_bits=8, + type=QuantizationType.INT, + strategy=QuantizationStrategy.TOKEN, + symmetric=False, + dynamic=True, + observer=None, + ) + + # Create a quantization scheme for Linear layers. + scheme = QuantizationScheme( + targets=["Linear"], + weights=weights_args, + input_activations=activations_args, + ) + + # Create a quantization modifier. We ignore the "lm_head" layer. + modifier = QuantizationModifier(config_groups={"group_0": scheme}, ignore=["lm_head"]) + + # Apply quantization and save the quantized model. + oneshot( + model=model, + recipe=modifier, + tokenizer=tokenizer, + output_dir=output_dir, + ) + print(f"Quantized model saved to: {output_dir}") + + +if __name__ == "__main__": + main() + + +``` +Then run the quantization script using `vllm_quantize_model.py`. This generates an INT8 quantized version of the model using channelwise precision, which reduces memory usage while maintaining model accuracy: + +```bash +cd /home/ubuntu/ +python vllm_quantize_model.py meta-llama/Llama-3.1-8B-Instruct --mode int8 --scheme channelwise +``` +The output model will be saved locally at: +`/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise`. + +## Launch the vLLM server + +The vLLM server supports the OpenAI-compatible `/v1/chat/completions` API. This is used in this learning path for single-prompt testing with `curl` and for batch testing using a custom Python script that simulates multiple concurrent requests. + +Once the model is quantized, launch the vLLM server to enable CPU-based inference. This configuration uses `tcmalloc` and the optimized `OpenBLAS` build to improve performance and reduce latency: + +```bash +LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/home/ubuntu/OpenBLAS/libopenblas.so \ +ONEDNN_DEFAULT_FPMATH_MODE=BF16 \ +VLLM_TARGET_DEVICE=cpu \ +VLLM_CPU_KVCACHE_SPACE=32 \ +VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc) - 1))" \ +vllm serve /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise \ +--dtype float32 --swap-space 16 +``` +This command starts the vLLM server using the quantized model. It preloads `tcmalloc` for efficient memory allocation and uses OpenBLAS for accelerated matrix operations. Thread binding is dynamically set based on the number of available cores to maximize parallelism on Arm CPUs. 
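+
+On CPU, loading and warming up the model can take a few minutes, so the server may not accept requests immediately. If you prefer to script the wait rather than watch the logs, the sketch below (a hypothetical helper saved as `wait_for_server.py`, assuming the server is listening on the default port 8000 used in the next section) polls the OpenAI-compatible `/v1/models` endpoint until the server responds:
+
+```python
+# wait_for_server.py: minimal readiness probe for the vLLM OpenAI-compatible server.
+# Assumes the `vllm serve` command above is running on localhost:8000.
+import time
+import requests
+
+URL = "http://localhost:8000/v1/models"
+
+for _ in range(60):  # wait up to ~10 minutes
+    try:
+        resp = requests.get(URL, timeout=5)
+        if resp.status_code == 200:
+            served = [m["id"] for m in resp.json().get("data", [])]
+            print("Server is ready, serving:", served)
+            break
+    except requests.ConnectionError:
+        pass  # server not accepting connections yet
+    time.sleep(10)
+else:
+    print("Server did not become ready; check the vllm serve logs.")
+```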
+ diff --git a/content/learning-paths/servers-and-cloud-computing/vLLM-quant/3-run-benchmark.md b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/3-run-benchmark.md new file mode 100644 index 0000000000..f6eae52a1a --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/3-run-benchmark.md @@ -0,0 +1,195 @@ +--- +title: vLLM Inference Test +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Run Single Inference + +Once the server is running, start by verifying it with a basic single-prompt request using `curl`. This confirms the server is running correctly and that the OpenAI-compatible /v1/chat/completions API is responding as expected: + +```bash +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise", + "temperature": "0.0", + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "tell me a funny story"} + ] + }' +``` +If the setup is working correctly, you'll receive a streaming response from the vLLM server. + +The server logs will show that the request was processed successfully. You'll also see prompt and generation throughput metrics, which provide a lightweight benchmark of the model’s performance in your environment. + +The following log output was generated from a single-prompt test run using the steps in this learning path: + +```output +INFO: Started server process [201749] +INFO: Waiting for application startup. +INFO: Application startup complete. +INFO 04-10 18:13:14 chat_utils.py:332] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this. +INFO 04-10 18:13:14 logger.py:39] Received request chatcmpl-a71fae48603c4d90a5d9aa6efd740fec: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ntell me a funny story<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131026, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None. +INFO 04-10 18:13:14 engine.py:275] Added request chatcmpl-a71fae48603c4d90a5d9aa6efd740fec. +WARNING 04-10 18:13:15 cpu.py:143] Pin memory is not supported on CPU. +INFO 04-10 18:13:17 metrics.py:455] Avg prompt throughput: 9.2 tokens/s, Avg generation throughput: 11.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%. +INFO 04-10 18:13:22 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%. +INFO 04-10 18:13:27 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. 
+INFO: 127.0.0.1:45986 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+
+```
+
+These results confirm that the model is running efficiently on CPU, with stable prompt and generation throughput — a solid baseline before scaling to batch inference.
+
+## Run Batch Inference
+
+After confirming single-prompt inference, run batch testing to simulate concurrent load and measure server performance at scale.
+
+Use the following Python script to simulate concurrent user interactions, and save it as `batch_test.py`:
+```python
+import requests
+import json
+import os
+import time
+import multiprocessing
+import argparse
+
+class bcolors:
+    HEADER = '\033[95m'
+    OKBLUE = '\033[94m'
+    OKCYAN = '\033[96m'
+    OKGREEN = '\033[92m'
+    WARNING = '\033[93m'
+    FAIL = '\033[91m'
+    ENDC = '\033[0m'
+    BOLD = '\033[1m'
+    UNDERLINE = '\033[4m'
+
+# prompts (duplicate questions)
+# https://github.com/ggml-org/llama.cpp/blob/b4753/examples/parallel/parallel.cpp#L42-L52
+prompts = [
+    #"Tell me a joke about AI.",
+    "What is the meaning of life?",
+    "Tell me an interesting fact about llamas.",
+    "What is the best way to cook a steak?",
+    "Are you familiar with the Special Theory of Relativity and can you explain it to me?",
+    "Recommend some interesting books to read.",
+    "What is the best way to learn a new language?",
+    "How to get a job at Google?",
+    "If you could have any superpower, what would it be?",
+    "I want to learn how to play the piano.",
+    "What is the meaning of life?",
+    "Tell me an interesting fact about llamas.",
+    "What is the best way to cook a steak?",
+    "Are you familiar with the Special Theory of Relativity and can you explain it to me?",
+    "Recommend some interesting books to read.",
+    "What is the best way to learn a new language?",
+    "How to get a job at Google?",
+]
+
+def get_stream(url, prompt, index):
+    s = requests.Session()
+    print(bcolors.OKGREEN, "Sending request #{}".format(index), bcolors.ENDC)
+    with s.post(url, headers=None, json=prompt, stream=True) as resp:
+        print(bcolors.WARNING, "Waiting for the reply #{} to the prompt '".format(index) + prompt["messages"][0]["content"] + "'", bcolors.ENDC)
+        for line in resp.iter_lines():
+            if line:
+                print(line)
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # this is a mandatory parameter
+    parser.add_argument("server", help="server IP or DNS address", type=str)
+    parser.add_argument("port", help="server port", type=int)
+    parser.add_argument("-s", "--stream", help="stream the reply", action="store_true")
+    parser.add_argument("-b", "--batch", help="concurrent request batch size", type=int, default=1)
+    parser.add_argument("--max_tokens", help="maximum output tokens", type=int, default=128)
+    parser.add_argument("--schema", help="endpoint schema (http/https)", type=str, default="http", choices=["http", "https"])
+    parser.add_argument("-m", "--model", help="model name", type=str)
+    args = parser.parse_args()
+
+    # by default, OpenAI-compatible API is used for the tests, which is supported by both llama.cpp and vllm
+    openAPI_endpoint = "/v1/chat/completions"
+    server = args.schema + "://" + args.server + ":" + str(args.port) + openAPI_endpoint
+
+    print(server)
+    start = time.time()
+
+    proc = []
+    for i in range(args.batch):
+        prompt = {
+            "messages": [
+                {"role": "user", "content": prompts[i % len(prompts)]}  # wrap around if batch > len(prompts)
+            ],
+            "model": args.model,
+            "temperature": 0,
+            "max_tokens": args.max_tokens,  # for vllm, it ignores n_predict
+            "n_predict": args.max_tokens,  # for llama.cpp (will be ignored by vllm)
+            "stream": False  # streaming
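+            # Note: the -s/--stream flag parsed above is not wired up here; set this
+            # field to args.stream (or True) to exercise streaming responses instead.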
+ } + + proc.append(multiprocessing.Process(target=get_stream, args=(server, prompt, i))) + + # start the processes + for p in proc: + p.start() + + # wait for all the processes to finish + for p in proc: + p.join() + + end = time.time() + print("done!") + print(end - start) +``` +Then, run it using: + +```bash +python batch_test.py localhost 8000 --schema http --batch 16 -m /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise +``` +This simulates multiple users interacting with the model in parallel and helps validate server-side performance under load. +You can modify the number of requests using the --batch flag or review/edit batch_test.py to customize prompt content and concurrency logic. + +When the test completes, server logs will display a summary including average prompt throughput and generation throughput. This helps benchmark how well the model performs under concurrent load on your Arm-based system. + +### Sample Output +Your logs should display successful responses and performance stats, confirming the model handles concurrent requests as expected. + +The following log output was generated from a batch inference run using the steps in this learning path: + +```output +INFO 04-10 18:20:55 metrics.py:455] Avg prompt throughput: 144.4 tokens/s, Avg generation throughput: 153.4 tokens/s, Running: 16 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.2%, CPU KV cache usage: 0.0%. +INFO 04-10 18:21:00 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 239.9 tokens/s, Running: 16 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.1%, CPU KV cache usage: 0.0%. +INFO: 127.0.0.1:57558 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57574 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57586 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57600 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57604 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57620 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57634 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57638 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57644 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57654 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57660 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57676 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57684 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57696 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57712 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO: 127.0.0.1:57718 - "POST /v1/chat/completions HTTP/1.1" 200 OK +INFO 04-10 18:21:10 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.7 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0 +``` + +This output confirms the server is handling concurrent requests effectively, with consistent generation throughput across 16 requests — a strong indication of stable multi-request performance on CPU. + +### Go Beyond: Power Up Your vLLM Workflow +Now that you’ve successfully quantized and served a model using vLLM on Arm, here are some further ways to explore: + +* **Try different models:** Apply the same steps to other [Hugging Face models](https://huggingface.co/models) like Qwen or Gemma. 
+ +* **Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui) or explore [OpenAI-compatible clients](https://github.com/topics/openai-api-client). diff --git a/content/learning-paths/servers-and-cloud-computing/vLLM-quant/_index.md b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/_index.md new file mode 100644 index 0000000000..7cebfcee55 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/_index.md @@ -0,0 +1,59 @@ +--- +title: Quantize and Run a Large Language Model using vLLM on Arm Servers + + +minutes_to_complete: 45 + +who_is_this_for: This learning path is intended for software developers and AI engineers interested in optimizing and deploying large language models using vLLM on Arm-based servers. It’s ideal for those looking to explore CPU-based inference and model quantization techniques. + +learning_objectives: + - Build and configure OpenBLAS to optimize LLM performance. + - Set up vLLM and PyTorch using builds optimized for Arm CPUs. + - Download and quantize a large language model using INT8 techniques. + - Launch a vLLM server to serve the quantized model. + - Run single-prompt and batch inference using the vLLM OpenAI-compatible API. + + +prerequisites: + - An Arm-based server or cloud instance running with at least 32 CPU cores, 64 GB RAM and 80 GB of available disk space. + - Familiarity with Python and machine learning concepts. + - An active Hugging Face account with access to the target model. + +author: Rani Chowdary Mandepudi + +### Tags +skilllevels: Introductory +subjects: ML +armips: + - Neoverse +operatingsystems: + - Linux +tools_software_languages: + - vLLM + - LLM + - GenAI + - Python + + +further_reading: + - resource: + title: vLLM Documentation + link: https://docs.vllm.ai/ + type: documentation + - resource: + title: vLLM GitHub Repository + link: https://github.com/vllm-project/vllm + type: github + - resource: + title: Hugging Face Model Hub + link: https://huggingface.co/models + type: website + + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/vLLM-quant/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/vLLM-quant/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. 
+--- diff --git a/data/stats_weekly_data.yml b/data/stats_weekly_data.yml index c17633959e..c9a26074a7 100644 --- a/data/stats_weekly_data.yml +++ b/data/stats_weekly_data.yml @@ -5674,6 +5674,7 @@ pranay-bakre: 5 preema-merlin-dsouza: 1 przemyslaw-wirkus: 2 + rani-chowdary-mandepudi: 1 ravi-malhotra: 1 rin-dobrescu: 1 roberto-lopez-mendez: 2 @@ -5776,10 +5777,10 @@ pranay-bakre: 5 preema-merlin-dsouza: 1 przemyslaw-wirkus: 2 + rani-chowdary-mandepudi: 1 ravi-malhotra: 1 rin-dobrescu: 1 roberto-lopez-mendez: 2 - ronan-synnott: 45 shuheng-deng: 1 thirdai: 1 tianyu-li: 1