Commit 9d9669d

chore: update docs for v1.9.0 (#1071)
Co-authored-by: Andrei Stoian <95410270+andrei-stoian-zama@users.noreply.github.com>
1 parent 2c5f1ab commit 9d9669d

File tree

14 files changed, +1605 −1391 lines changed


docs/SUMMARY.md

Lines changed: 5 additions & 1 deletion

```diff
@@ -18,14 +18,18 @@
 - [Encrypted dataframe](built-in-models/encrypted_dataframe.md)
 - [Encrypted training](built-in-models/training.md)
+
+## LLMs
+
+- [Inference](llm/inference.md)
+- [Encrypted fine-tuning](llm/lora_training.md)

 ## Deep Learning

 - [Using Torch](deep-learning/torch_support.md)
 - [Using ONNX](deep-learning/onnx_support.md)
 - [Step-by-step guide](deep-learning/fhe_friendly_models.md)
 - [Debugging models](deep-learning/fhe_assistant.md)
 - [Optimizing inference](deep-learning/optimizing_inference.md)
-- [Encrypted fine-tuning](deep-learning/lora_training.md)

 ## Guides
```

docs/built-in-models/linear.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -26,6 +26,10 @@ In addition to predicting on encrypted data, the following models support train
 | [SGDClassifier](../references/api/concrete.ml.sklearn.linear_model.md#class-sgdclassifier) | [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) |

+## Ciphertext format compatibility
+
+These models only support _Concrete_ ciphertexts. See [the ciphertexts format](../getting-started/concepts.md#ciphertext-formats) documentation for more details.
+
 ## Quantization parameters

 The `n_bits` parameter controls the bit-width of the inputs and weights of linear models. Linear models do not use table lookups and thus allow weights and inputs to be high-precision integers.
```
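The effect of `n_bits` can be illustrated with plain-integer symmetric quantization, a scheme commonly used for model weights. This is a simplified sketch, not Concrete ML's actual quantizer; `quantize_symmetric` is a hypothetical helper:

```python
def quantize_symmetric(values, n_bits):
    """Map floats to signed integers of at most n_bits (illustrative sketch)."""
    q_max = 2 ** (n_bits - 1) - 1  # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / q_max
    # Round to the nearest integer and clamp into the representable range
    return [max(-q_max, min(q_max, round(v / scale))) for v in values], scale

q, scale = quantize_symmetric([-1.0, 0.0, 0.5, 1.0], n_bits=8)
# Each dequantized value q[i] * scale approximates the original within one scale step
```

With a larger `n_bits`, `scale` shrinks and the rounding error decreases, which is why linear models, having no table lookups, can afford high-precision integers.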

docs/built-in-models/nearest-neighbors.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -6,6 +6,10 @@ This document introduces the nearest neighbors non-parametric classification mod
 | :---------------------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------------------------------------- |
 | [KNeighborsClassifier](../references/api/concrete.ml.sklearn.neighbors.md#class-kneighborsclassifier) | [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) |

+## Ciphertext format compatibility
+
+These models only support _Concrete_ ciphertexts. See [the ciphertexts format](../getting-started/concepts.md#ciphertext-formats) documentation for more details.
+
 ## Example
```

docs/built-in-models/neural-networks.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -23,6 +23,10 @@ Good quantization parameter values are critical to make models [respect FHE con
 Using `nn.ReLU` as the activation function benefits from an optimization where [quantization uses powers-of-two scales](../explanations/quantization.md#quantization-special-cases). This results in much faster inference times in FHE, thanks to a TFHE primitive that performs fast division by powers of two.
 {% endhint %}

+## Ciphertext format compatibility
+
+These models only support _Concrete_ ciphertexts. See [the ciphertexts format](../getting-started/concepts.md#ciphertext-formats) documentation for more details.
+
 ## Example

 To create an instance of a Fully Connected Neural Network (FCNN), you need to instantiate one of the `NeuralNetClassifier` and `NeuralNetRegressor` classes and configure a number of parameters that are passed to their constructor.
```
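The benefit of powers-of-two scales mentioned in the hunk above can be seen on plain integers: re-scaling an accumulator then amounts to a right shift, the cheap operation the fast TFHE primitive implements, instead of a generic division that would require a table lookup. A hypothetical sketch, not the library's internals:

```python
def requantize_shift(acc, shift):
    """Re-scale an integer accumulator by 2**shift with a right shift (cheap in TFHE)."""
    return acc >> shift

def requantize_generic(acc, scale):
    """Generic re-scaling; a non-power-of-two scale needs a costly PBS in FHE."""
    return round(acc / scale)

# For this input, both paths give the same result: 62
print(requantize_shift(1000, 4), requantize_generic(1000, 16))
```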

docs/built-in-models/training.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -17,6 +17,10 @@ See the [deployment](#deployment) section for more details.
 Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML can import models trained through federated learning using 3rd party tools. All model types are supported - linear, tree-based and neural networks - through the [`from_sklearn_model` function](linear.md#pre-trained-models) and the [`compile_torch_model`](../deep-learning/torch_support.md) function.
 {% endhint %}

+## Ciphertext format compatibility
+
+These models only support _Concrete_ ciphertexts. See [the ciphertexts format](../getting-started/concepts.md#ciphertext-formats) documentation for more details.
+
 ## Example

 The [logistic regression training](../advanced_examples/LogisticRegressionTraining.ipynb) example shows logistic regression training on encrypted data in action.
```

docs/built-in-models/tree.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -26,6 +26,12 @@ For a formal explanation of the mechanisms that enable FHE-compatible decision t
 Increasing the maximum depth parameter of decision trees and tree-ensemble models strongly increases the number of nodes in the trees. Therefore, we recommend using the XGBoost models, which achieve better performance with lower depth.
 {% endhint %}

+## Ciphertext format compatibility
+
+The `DecisionTreeClassifier`, `RandomForestClassifier`, and `XGBClassifier` models support [TFHE-rs radix ciphertexts](../getting-started/concepts.md#ciphertext-formats) when `n_bits` is set to 8. The other tree-based models, as well as other `n_bits` configurations, only support _Concrete_ ciphertexts.
+
+To compile a model to use _TFHE-rs ciphertexts_ as inputs and outputs, set `ciphertext_format=CiphertextFormat.TFHE_RS` in the `compile` call.
+
 ## Pre-trained models

 You can convert an already trained scikit-learn tree-based model to a Concrete ML one by using the [`from_sklearn_model`](../references/api/concrete.ml.sklearn.base.md#classmethod-from_sklearn_model) method.
```
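The depth warning above has a simple arithmetic basis: a binary decision tree of depth `d` has at most `2**(d + 1) - 1` nodes, so each extra level roughly doubles the work done in FHE. A small illustrative sketch (`max_nodes` is not a Concrete ML API):

```python
def max_nodes(depth):
    """Upper bound on node count for a binary decision tree of the given depth."""
    return 2 ** (depth + 1) - 1

# Each additional level roughly doubles the node count
sizes = {d: max_nodes(d) for d in (4, 6, 8, 10)}
print(sizes)  # {4: 31, 6: 127, 8: 511, 10: 2047}
```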

docs/getting-started/README.md

Lines changed: 3 additions & 1 deletion

```diff
@@ -5,11 +5,13 @@
 Concrete ML is an open source, privacy-preserving, machine learning framework based on Fully Homomorphic Encryption (FHE). It enables data scientists without any prior knowledge of cryptography to perform:

 - **Automatic model conversion**: Use familiar APIs from scikit-learn and PyTorch to convert machine learning models to their FHE equivalent. This is applicable for [linear models](../built-in-models/linear.md), [tree-based models](../built-in-models/tree.md), and [neural networks](../built-in-models/neural-networks.md).
-- **Encrypted data training**: [Train models](../built-in-models/training.md) directly on encrypted data to maintain privacy.
+- **Encrypted data training**: [Train linear models](../built-in-models/training.md) or [fine-tune LLMs](../llm/lora_training.md) directly on encrypted data to maintain privacy.
 - **Encrypted data pre-processing**: [Pre-process encrypted data](../built-in-models/encrypted_dataframe.md) using a DataFrame paradigm.

 ## Key features

+- **Model inference on encrypted data**: Concrete ML converts models such as decision trees, LLMs, and neural networks to predict on encrypted data. These models can be trained either on clear data or on encrypted data.
+
 - **Training on encrypted data**: FHE is an encryption technique that allows computing directly on encrypted data, without needing to decrypt it. With FHE, you can build private-by-design applications without compromising on features. Learn more about FHE in [this introduction](https://www.zama.ai/post/tfhe-deep-dive-part-1) or join the [FHE.org](https://fhe.org) community.

 - **Federated learning**: Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML can import all model types (linear, tree-based, and neural networks) trained through federated learning, using the [`from_sklearn_model` function](../built-in-models/linear.md#pre-trained-models) and the [`compile_torch_model`](../deep-learning/torch_support.md) function.
```

docs/getting-started/concepts.md

Lines changed: 31 additions & 0 deletions

```diff
@@ -4,6 +4,13 @@ This document explains the essential cryptographic terms and the important conce
 Concrete ML is built on top of Concrete, which enables the conversion from NumPy programs into FHE circuits.

+## Table of Contents
+
+1. [Lifecycle of a Concrete ML model](#lifecycle-of-a-concrete-ml-model)
+1. [Cryptography concepts](#cryptography-concepts)
+1. [Ciphertext formats](#ciphertext-formats)
+1. [Model accuracy considerations under FHE constraints](#model-accuracy-considerations-under-fhe-constraints)
+
 ## Lifecycle of a Concrete ML model

 With Concrete ML, you can train a model on clear or encrypted data, then deploy it to predict on encrypted inputs. During deployment, data can be pre-processed while being encrypted. Therefore, data stays encrypted during the entire lifecycle of the machine learning model, with some limitations.
@@ -39,6 +46,8 @@ You can find examples of the model development workflow [here](../tutorials/ml_e
 - A private encryption key to encrypt/decrypt their data and results
 - A public evaluation key for the model's FHE evaluation on the server.

+1. **Ciphertext formats**: The server-side application can be configured to accept different types of ciphertexts from the client, depending on the type of application. See [Ciphertext formats](#ciphertext-formats) for more details.
+
 You can find an example of the model deployment workflow [here](../advanced_examples/ClientServer.ipynb).

 ## Cryptography concepts
@@ -59,8 +68,30 @@ Concrete ML and Concrete abstract the details of the underlying cryptography sch
 - **Programmable Bootstrapping (PBS)**: Programmable Bootstrapping enables the homomorphic evaluation of any function of a ciphertext, with a controlled level of noise. Learn more about PBS in [this paper](https://eprint.iacr.org/2021/091).

+- **Ciphertext formats**: To represent encrypted values, Concrete ML offers two options: the default Concrete ciphertext format, which is supported by all ML models and highly optimized for performance, or the block-based TFHE-rs radix format, which supports larger values, is forward-compatible, and is suitable for blockchain applications, but is limited to certain types of ML models.
+
 For a deeper understanding of the cryptography behind the Concrete stack, refer to the [whitepaper on TFHE and Programmable Bootstrapping](https://whitepaper.zama.ai/) or [this series of blogs](https://www.zama.ai/post/tfhe-deep-dive-part-1).

+## Ciphertext formats
+
+Concrete ML can use two types of ciphertexts for model inputs and outputs.
+
+1. _Concrete_ LWE ciphertexts (default):
+
+   By default, Concrete ML uses Concrete LWE ciphertexts with crypto-system parameters that are tailored to each ML model. These parameters may vary between versions of Concrete ML, so the encryption crypto-parameters may change at any point. Some implications are:
+
+   - Typically, a server-side application provides the client with its encryption cryptographic parameters.
+   - When the application is updated, the client downloads the new cryptographic parameters.
+   - Ciphertexts encrypted with one set of cryptographic parameters cannot be reused with a model compiled with different cryptographic parameters.
+
+1. _TFHE-rs radix_ ciphertexts:
+
+   Concrete ML also supports _TFHE-rs radix_ ciphertexts, which rely on a universal and forward-compatible parameter set. Therefore:
+
+   - Ciphertexts encrypted with the universal cryptographic parameters can be used at any point in the future with any ML model.
+   - In this setting, a conversion layer is added to the ML model, which may imply a 4-5x slowdown in model latency.
+
 ## Model accuracy considerations under FHE constraints

 FHE requires all inputs, constants, and intermediate values to be integers of maximum 16 bits. To make machine learning models compatible with FHE, Concrete ML implements some techniques with accuracy considerations:
```
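The block-based idea behind the TFHE-rs radix format described above can be sketched on plain integers: a large value is split into small fixed-width blocks, each of which would be encrypted separately. This is only an illustration of the encoding, not actual encryption, and the helper names are hypothetical:

```python
def to_radix_blocks(value, block_bits, num_blocks):
    """Split an unsigned integer into little-endian blocks of block_bits each."""
    mask = (1 << block_bits) - 1
    return [(value >> (i * block_bits)) & mask for i in range(num_blocks)]

def from_radix_blocks(blocks, block_bits):
    """Recombine radix blocks into the original integer."""
    return sum(b << (i * block_bits) for i, b in enumerate(blocks))

blocks = to_radix_blocks(182, block_bits=2, num_blocks=4)
print(blocks)  # [2, 1, 3, 2]
assert from_radix_blocks(blocks, block_bits=2) == 182
```

Because every block uses the same small bit-width, larger values are handled by adding blocks rather than by changing parameters, which mirrors the forward-compatibility property described above.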

docs/guides/client_server.md

Lines changed: 38 additions & 1 deletion

````diff
@@ -62,7 +62,7 @@ import numpy as np
 fhe_directory = '/tmp/fhe_client_server_files/'

 # Initialize the Decision Tree model
-model = DecisionTreeClassifier()
+model = DecisionTreeClassifier(n_bits=8)

 # Generate some random data for training
 X = np.random.rand(100, 20)
@@ -102,6 +102,43 @@ result = client.deserialize_decrypt_dequantize(encrypted_result)
 These objects are serialized into bytes to streamline the data transfer between the client and server.

+#### Ciphertext formats and keys
+
+Two ciphertext formats are [available in Concrete ML](../getting-started/concepts.md#ciphertext-formats), and both can be used for deployment. To use the _TFHE-rs radix_ format, pass the `ciphertext_format` option to the compilation call as follows:
+
+<!--pytest-codeblocks:cont-->
+
+```python
+from concrete.ml.common.utils import CiphertextFormat
+model.compile(X, ciphertext_format=CiphertextFormat.TFHE_RS)
+
+fhe_directory = '/tmp/fhe_client_server_files_tfhers/'
+
+# Setup the development environment
+dev = FHEModelDev(path_dir=fhe_directory, model=model)
+dev.save()
+
+# Setup the client
+client = FHEModelClient(path_dir=fhe_directory, key_dir="/tmp/keys_client_tfhers")
+serialized_evaluation_keys, tfhers_evaluation_keys = client.get_serialized_evaluation_keys(include_tfhers_key=True)
+
+# Client pre-processes new data
+X_new = np.random.rand(1, 20)
+encrypted_data = client.quantize_encrypt_serialize(X_new)
+
+# Setup the server
+server = FHEModelServer(path_dir=fhe_directory)
+server.load()
+
+# Server processes the encrypted data
+encrypted_result = server.run(encrypted_data, serialized_evaluation_keys)
+
+# Client decrypts the result
+result = client.deserialize_decrypt_dequantize(encrypted_result[0])
+```
+
+In the example above, a second evaluation key is obtained in the `tfhers_evaluation_keys` variable. This key can be loaded by TFHE-rs Rust programs to perform further computation on the model output ciphertexts.
+
 ## Serving

 The client-side deployment of a secured inference machine learning model is illustrated as follows:
````

docs/guides/using_gpu.md

Lines changed: 17 additions & 4 deletions

````diff
@@ -7,10 +7,10 @@ a model is compiled for CUDA, executing it on a non-CUDA-enabled machine will ra
 ## Support

-| Feature     | Built-in models | Custom models | Deployment | DataFrame |
-| ----------- | --------------- | ------------- | ---------- | --------- |
-| GPU support |||||
-| | | | | |
+| Feature     | Built-in models | Deep NNs and LLMs | Deployment | DataFrame |
+| ----------- | --------------- | ----------------- | ---------- | --------- |
+| GPU support || |||
+| | | | | |

 {% hint style="warning" %}
 When compiling a model for GPU, the model is assigned GPU-specific crypto-system parameters. These parameters are more constrained than the CPU-specific ones.
@@ -29,6 +29,10 @@ on a desktop CPU.
 ## Prerequisites

+### Built-in models and deep NNs
+
+This section pertains to models that are compiled using the `sklearn`-style built-in model classes, or with `compile_torch_model` or `compile_brevitas_qat_model`.
+
 To use the CUDA-enabled backend, install the GPU-enabled Concrete compiler:

 ```bash
@@ -65,3 +69,12 @@ To compile a model for CUDA, simply supply the `device='cuda'` argument to its c

 - For built-in models, use the `.compile` function.
 - For custom models, use either `compile_torch_model` or `compile_brevitas_qat_model`.
+
+## LLMs
+
+This section pertains to models that are compiled with `HybridFHEModel`.
+
+The models compiled as described in [the LLM section](../llm/inference.md) will use GPU acceleration if a GPU is available on the machine where the models are executed. No specific compilation configuration is required to enable GPU execution for these models.
````
