|
3 | 3 | # cmd: ["python", "06_gpu_and_ml/llm-serving/lfm_snapshot.py"]
4 | 4 | # --- |
5 | 5 |
|
6 | | -# # Low Latency, Serverless LFM 2 with vLLM and Modal |
| 6 | +# # Low Latency, Serverless LFM2 with vLLM and Modal |
7 | 7 |
|
8 | | -# In this example, we show how to serve Liquid AI's [LFM 2 models](https://www.liquid.ai/liquid-foundation-models) |
| 8 | +# In this example, we show how to serve Liquid AI's [LFM2 models](https://www.liquid.ai/liquid-foundation-models) |
9 | 9 | # with [vLLM](https://docs.vllm.ai), achieving low latency and fast cold starts on Modal.
10 | 10 |
|
11 | | -# The LFM 2 models are not vanilla Transformers -- they have a hybrid architecture, |
| 11 | +# The LFM2 models are not vanilla Transformers -- they have a hybrid architecture, |
12 | 12 | # discovered via an architecture search that optimized for quality, latency, and memory footprint. |
13 | 13 | # Check out their [technical report](https://arxiv.org/abs/2511.23404v1) |
14 | 14 | # for more details. |
15 | 15 |
|
| 16 | +# Here, we run the [24B-A2B variant](https://huggingface.co/LiquidAI/LFM2-24B-A2B) of LFM2, |
| 17 | +# described [here](https://www.liquid.ai/blog/lfm2-24b-a2b). This variant is designed |
| 18 | +# for efficient inference and includes instruction tuning. |
| 19 | +# It is released under the weights-available [LFM 1.0 License](https://huggingface.co/LiquidAI/LFM2-24B-A2B/blob/main/LICENSE), |
| 20 | +# which restricts commercial use for entities with over $10M in revenue. |
| 21 | + |
16 | 22 | # This example demonstrates techniques to run inference at high efficiency, |
17 | 23 | # including advanced features of both vLLM and Modal. |
18 | 24 | # For a simpler introduction to LLM serving, see |
|
22 | 28 | # which uses a new, low-latency routing service on Modal designed for latency-sensitive inference workloads. |
23 | 29 | # This gives us more control over routing, but with increased power comes increased responsibility. |
24 | 30 |
|
25 | | -# We also include instructions for cutting cold start times by an order of magnitude using Modal's |
| 31 | +# We also include instructions for cutting cold start times using Modal's |
26 | 32 | # [CPU + GPU memory snapshots](https://modal.com/docs/guide/memory-snapshot). |
27 | 33 |
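
The memory-snapshot technique referenced above can be sketched roughly as follows. This is a minimal illustration outside the diff, not the example's actual code: `enable_memory_snapshot=True` and `@modal.enter(snap=True)` are real Modal APIs, but the app name, class, and loading details here are hypothetical placeholders.

```python
# Hypothetical sketch of Modal CPU memory snapshots (not part of this diff).
import modal

app = modal.App("lfm-snapshot-sketch")  # hypothetical app name
image = modal.Image.debian_slim().pip_install("vllm")


@app.cls(
    image=image,
    gpu="H100",
    enable_memory_snapshot=True,  # snapshot CPU memory after setup
)
class Server:
    @modal.enter(snap=True)
    def load(self):
        # Work done in a snap=True hook (e.g. reading weights into CPU
        # memory) is captured in the snapshot, so subsequent cold starts
        # restore the saved state instead of redoing this work.
        ...

    @modal.method()
    def generate(self, prompt: str) -> str:
        ...
```

Extending snapshots to GPU memory additionally requires opting in via the experimental GPU-snapshot option on the class decorator; the memory-snapshot guide linked above covers the details and caveats.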
|
28 | 34 | # Fast cold starts are particularly useful for LLM inference applications |
|
50 | 56 |
|
51 | 57 | MINUTES = 60 |
52 | 58 |
|
53 | | -MODEL_NAME = os.environ.get("MODEL_NAME", "LiquidAI/LFM2-8B-A1B") |
| 59 | +MODEL_NAME = os.environ.get("MODEL_NAME", "LiquidAI/LFM2-24B-A2B") |
54 | 60 | print(f"Running deployment script for model: {MODEL_NAME}") |
55 | 61 |
|
56 | 62 | vllm_image = ( |
|