
Commit 538e8c9

fix hfoptions blocks in unit 1 8.mdx
1 parent 896777d · commit 538e8c9

File tree

1 file changed: +41 -5 lines changed
  • chapters/en/chapter2


chapters/en/chapter2/8.mdx

Lines changed: 41 additions & 5 deletions
@@ -15,16 +15,19 @@ TGI, vLLM, and llama.cpp serve similar purposes but have distinct characteristic
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png" alt="Flash Attention" />

 <Tip title="How Flash Attention Works">
+
 Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [Chapter 1.8](/course/chapter1/8), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.

 The key innovation is in how it manages memory transfers between High Bandwidth Memory (HBM) and faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating bottlenecks by leaving the GPU idle. Flash Attention loads data once into SRAM and performs all calculations there, minimizing expensive memory transfers.

 While the benefits are most significant during training, Flash Attention's reduced VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
+
 </Tip>

 **vLLM** takes a different approach by using PagedAttention. Just like how a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever system means it can handle different-sized requests more flexibly and doesn't waste memory space. It's particularly good at sharing memory between different requests and reduces memory fragmentation, which makes the whole system more efficient.

 <Tip title="How PagedAttention Works">
+
 PagedAttention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. As discussed in [Chapter 1.8](/course/chapter1/8), during text generation, the model stores attention keys and values (KV cache) for each generated token to reduce redundant computations. The KV cache can become enormous, especially with long sequences or multiple concurrent requests.

 vLLM's key innovation lies in how it manages this cache:
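As a point of reference for the Flash Attention tip in the hunk above: in Transformers, Flash Attention 2 can be switched on when loading a model. The snippet below is an illustrative sketch, not part of the changed file; it assumes a CUDA GPU, an installed `flash-attn` package, and an arbitrary instruct model id.

```python
# Illustrative sketch (not part of this diff): enabling Flash Attention 2 in Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # assumed model; any FA2-compatible model works

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FA2 kernels require fp16/bf16 weights
    attn_implementation="flash_attention_2",  # use the fused, SRAM-friendly attention kernel
    device_map="auto",
)

inputs = tokenizer("Flash Attention keeps attention tiles in fast SRAM", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```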
@@ -35,11 +38,13 @@ vLLM's key innovation lies in how it manages this cache:
 4. **Memory Sharing**: For operations like parallel sampling, pages storing the KV cache for the prompt can be shared across multiple sequences.

 The PagedAttention approach can lead to up to 24x higher throughput compared to traditional methods, making it a game-changer for production LLM deployments. If you want to go really deep into how PagedAttention works, you can read [the guide from the vLLM documentation](https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html).
+
 </Tip>

 **llama.cpp** is a highly optimized C/C++ implementation originally designed for running LLaMA models on consumer hardware. It focuses on CPU efficiency with optional GPU acceleration and is ideal for resource-constrained environments. llama.cpp uses quantization techniques to reduce model size and memory requirements while maintaining good performance. It implements optimized kernels for various CPU architectures and supports basic KV cache management for efficient token generation.

 <Tip title="How llama.cpp Quantization Works">
+
 Quantization in llama.cpp reduces the precision of model weights from 32-bit or 16-bit floating point to lower precision formats like 8-bit integers (INT8), 4-bit, or even lower. This significantly reduces memory usage and improves inference speed with minimal quality loss.

 Key quantization features in llama.cpp include:
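To make the PagedAttention description above concrete, the sketch below (again, not part of the changed file) starts vLLM's offline engine; paging of the KV cache happens automatically, and `gpu_memory_utilization` bounds the VRAM pool those pages come from. The model id and parameter values are assumptions.

```python
# Illustrative sketch (not part of this diff): vLLM allocates the KV cache in fixed-size
# pages; gpu_memory_utilization caps the VRAM pool that backs those pages.
from vllm import LLM, SamplingParams

llm = LLM(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",  # assumed model id
    gpu_memory_utilization=0.85,  # fraction of GPU memory reserved, mostly for the paged KV cache
    max_model_len=4096,           # caps how many KV-cache pages a single sequence can claim
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```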
@@ -49,9 +54,8 @@ Key quantization features in llama.cpp include:
 4. **Hardware-Specific Optimizations**: Includes optimized code paths for various CPU architectures (AVX2, AVX-512, NEON)

 This approach enables running billion-parameter models on consumer hardware with limited memory, making it perfect for local deployments and edge devices.
-</Tip>
-

+</Tip>

 ### Deployment and Integration

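For the quantization discussion above, a minimal llama-cpp-python example (not part of the changed file) loads an already-quantized GGUF file; the 4-bit weights are what keep the memory footprint small. The model path is a placeholder.

```python
# Illustrative sketch (not part of this diff): running a pre-quantized 4-bit (Q4_K_M) GGUF
# model with llama-cpp-python on CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",  # placeholder path to a 4-bit GGUF file
    n_ctx=2048,                             # context window
    n_gpu_layers=0,                         # pure CPU; raise this to offload layers to a GPU
)

output = llm("Why does 4-bit quantization save memory?", max_tokens=64)
print(output["choices"][0]["text"])
```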
@@ -63,8 +67,6 @@ Let's move on to the deployment and integration differences between the framewor

 **llama.cpp** prioritizes simplicity and portability. Its server implementation is lightweight and can run on a wide range of hardware, from powerful servers to consumer laptops and even some high-end mobile devices. With minimal dependencies and a simple C/C++ core, it's easy to deploy in environments where installing Python frameworks would be challenging. The server provides an OpenAI-compatible API while maintaining a much smaller resource footprint than other solutions.

-
-
 ## Getting Started

 Let's explore how to use these frameworks for deploying LLMs, starting with installation and basic setup.
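Since the paragraph above points out that llama.cpp's server speaks an OpenAI-compatible API, the standard `openai` client can be used against it. The sketch below is illustrative only; the host, port, and model name are assumptions.

```python
# Illustrative sketch (not part of this diff): querying a local llama.cpp server through
# its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed address of the llama.cpp server
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="local-model",  # the server answers for whichever model it was started with
    messages=[{"role": "user", "content": "Say hello from a laptop deployment."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```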
@@ -146,7 +148,9 @@ response = client.chat.completions.create(
 )
 print(response.choices[0].message.content)
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 llama.cpp is easy to install and use, requiring minimal dependencies and supporting both CPU and GPU inference.
@@ -235,7 +239,9 @@ response = client.chat.completions.create(
 )
 print(response.choices[0].message.content)
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 vLLM is easy to install and use, with both OpenAI API compatibility and a native Python interface.
@@ -306,6 +312,7 @@ response = client.chat.completions.create(
 )
 print(response.choices[0].message.content)
 ```
+
 </hfoption>

 </hfoptions>
@@ -333,6 +340,7 @@ docker run --gpus all \
 ```

 Use the InferenceClient for flexible text generation:
+
 ```python
 from huggingface_hub import InferenceClient

@@ -380,7 +388,9 @@ response = client.chat.completions.create(
 )
 print(response.choices[0].message.content)
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 For llama.cpp, you can set advanced parameters when launching the server:
@@ -487,7 +497,9 @@ output = llm(

 print(output["choices"][0]["text"])
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 For advanced usage with vLLM, you can use the InferenceClient:
@@ -579,6 +591,7 @@ formatted_prompt = llm.get_chat_template()(chat_prompt) # Uses model's chat tem
 outputs = llm.generate(formatted_prompt, sampling_params)
 print(outputs[0].outputs[0].text)
 ```
+
 </hfoption>

 </hfoptions>
@@ -610,7 +623,9 @@ client.generate(
     repetition_penalty=1.1, # Reduce repetition
 )
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 ```python
@@ -635,7 +650,9 @@ output = llm(
     repeat_penalty=1.1,
 )
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 ```python
@@ -648,6 +665,7 @@ params = SamplingParams(
 )
 llm.generate("Write a creative story", sampling_params=params)
 ```
+
 </hfoption>

 </hfoptions>
@@ -659,14 +677,17 @@ Both frameworks provide ways to prevent repetitive text generation:
 <hfoptions id="inference-frameworks" >

 <hfoption value="tgi" label="TGI">
+
 ```python
 client.generate(
     "Write a varied text",
     repetition_penalty=1.1, # Penalize repeated tokens
     no_repeat_ngram_size=3, # Prevent 3-gram repetition
 )
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 ```python
@@ -686,7 +707,9 @@ output = llm(
     presence_penalty=0.5, # Additional presence penalty
 )
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 ```python
@@ -695,6 +718,7 @@ params = SamplingParams(
     frequency_penalty=0.1, # Penalize token frequency
 )
 ```
+
 </hfoption>

 </hfoptions>
@@ -706,6 +730,7 @@ You can control generation length and specify when to stop:
 <hfoptions id="inference-frameworks" >

 <hfoption value="tgi" label="TGI">
+
 ```python
 client.generate(
     "Generate a short paragraph",
@@ -714,7 +739,9 @@ client.generate(
     stop_sequences=["\n\n", "###"],
 )
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 ```python
@@ -729,7 +756,9 @@ response = client.completions.create(
 # Via direct library
 output = llm("Generate a short paragraph", max_tokens=100, stop=["\n\n", "###"])
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 ```python
@@ -741,6 +770,7 @@ params = SamplingParams(
     skip_special_tokens=True,
 )
 ```
+
 </hfoption>

 </hfoptions>
@@ -752,6 +782,7 @@ Both frameworks implement advanced memory management techniques for efficient in
 <hfoptions id="inference-frameworks" >

 <hfoption value="tgi" label="TGI">
+
 TGI uses Flash Attention 2 and continuous batching:

 ```sh
@@ -763,7 +794,9 @@ docker run --gpus all -p 8080:80 \
     --max-batch-total-tokens 8192 \
     --max-input-length 4096
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 llama.cpp uses quantization and optimized memory layout:
@@ -789,7 +822,9 @@ For models too large for your GPU, you can use CPU offloading:
     --n-gpu-layers 20 \ # Keep first 20 layers on GPU
     --threads 8 # Use more CPU threads for CPU layers
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 vLLM uses PagedAttention for optimal memory management:
@@ -806,6 +841,7 @@ engine_args = AsyncEngineArgs(

 llm = LLM(engine_args=engine_args)
 ```
+
 </hfoption>

 </hfoptions>
@@ -818,4 +854,4 @@ llm = LLM(engine_args=engine_args)
 - [vLLM GitHub Repository](https://github.com/vllm-project/vllm)
 - [PagedAttention Paper](https://arxiv.org/abs/2309.06180)
 - [llama.cpp GitHub Repository](https://github.com/ggerganov/llama.cpp)
-- [llama-cpp-python Repository](https://github.com/abetlen/llama-cpp-python)
+- [llama-cpp-python Repository](https://github.com/abetlen/llama-cpp-python)
