A generative, conversational workflow and multi-agent system using PDL and mlx
- Introduction
- Design Patterns
- Core Concepts
- Usage
- Caching
- Service Deployment
- Practical Applications
- Extensions
- Examples
For a detailed explanation of design patterns enabled by Generative Redfoot, see the Design Patterns documentation.
Generative Redfoot is built around several core concepts that enable declarative, composable AI workflows:
The Prompt Declaration Language (PDL) is a YAML-based declarative approach to prompt programming, where prompts are at the forefront. It facilitates model evaluation chaining and tool use, abstracting away the plumbing needed for composition and hyperparameter setting and management. Generative Redfoot is a PDL-based declarative approach to creating orchestrated, generative AI workflows and services, conceived as prompt programming composition.
Generative Redfoot originally was meant as a modern replacement for the original RDFLIb Redfoot, an extensible RDF service architecture for building a distributed, Semantic Web of peer-to-peer nodes with modern generative AI systems and technologies in mind.
Generative Redfoot recasts this paradigm further with firm roots in declarative and RESTful principles in how generative AI actions and capabilities can be composed and orchestrated. Beyond RDFLib Redfoot, the framework it facilitates was mainly motivated by supporting this use case from the PDL documentation/paper:
The system takes a Prompt Declaration Language (PDL) file and generates a finite state generative machine as Python objects for a subset of the PDL language. These objects (the "programs" in particular) can be executed and served as web services that take parameters and return results.
The Model class can be extended and incorporated into how a dispatcher creates the PDL Python objects from a PDL file to incorporate the functionality for evaluating the prompts against the models specified in PDL. This is done using any accumulated conversational context, prompts, and generation parameters (sampling parameters, for example), (optionally) updating the context as the program execution continues and how mlx is used to implement the model loading and inference.
The language of the PDL file can be extended with additional custom functionality, and other LLM systems can handle the evaluation.
It depends on the PyYaml and click third-party Python libraries as well as mlx and can be run this way, where document.pdl is a PDL file.
% Usage: generative_redfoot [OPTIONS] PDL_FILE
Options:
-t, --temperature FLOAT
-rp, --repetition-penalty FLOAT
The penalty factor for repeating tokens
(none if not used)
--top-k INTEGER Sampling top-k
--top-p INTEGER Sampling top-p
--max-tokens INTEGER Max tokens
--min-p FLOAT Sampling min-p
--verbose / --no-verbose
-v, --variables <TEXT TEXT>...
--help Show this message and exit.
generative_redfoot.py document.pdl
The main argument is a PDL program (a YAML file), possibly with extensions of the language implemented by generative_redfoot.
You can also specify default values for sampling parameters for the LLM calls during the execution of the programs using mlx.
The model parameters directive in PDL can be used to specify the following mlx generation parameters: temperature, top_k, min_p, max_tokens, and top_p:
description: ...
text:
- read:
message: |
What is your query?
contribute: [context]
- model: .. model ..
parameters:
temperature: 0.6
min_p: .03
max_tokens: 200When a PDL file contains a server section, generative_redfoot will automatically start a FastAPI web service in which to execute the program.
Generative Redfoot includes advanced caching capabilities for efficient inference:
Generative Redfoot supports two types of prompt caching:
- Internal Caching: Uses mlx-lm's built-in KV cache for efficiency during program execution
- External Caching: Creates and reuses persistent prompt cache files for faster subsequent executions
To use basic internal caching, specify a top-level cache parameter in your PDL file:
cache: "*"
description: ...
text:
- model: mlx-community/Llama-3.2-3B-Instruct-4bit
# ... rest of your programFor persistent caching, you can create cache files using the content_model directive:
cache:
- text:
- role: system
content: "You are a helpful assistant"
- content_model: mlx-community/Llama-3.2-3B-Instruct-4bit
name: my_cache
file: my_prompt_cache.safetensorsThe content_model directive creates prompt caches that can be reused across multiple executions. This is particularly useful for prompts that don't change frequently but are expensive to process.
cache:
- text:
- role: system
read_from_wordloom: prompt_library.toml
items: helpful_chatbot_system_prompt
- content_model: mlx-community/Llama-3.2-3B-Instruct-4bit
name: chat_cache
file: chat_prompt_cache.safetensors
kv_group_size: 64
kv_bits: 4
max_kv_size: 10000Parameters for content_model:
name: Identifier for referencing the cache in model blocksfile: Path to save the prompt cache filekv_group_size: Group size for KV cache quantization (default: 64)kv_bits: Number of bits for KV cache quantization (default: 4)quantized_kv_start: When kv_bits is set, start quantizing the KV cache from this step onwards (default: 5000)max_kv_size: Maximum key-value cache size (default: 10000)prefix_marker: Wordloom marker indicating the end of the common prefix for caching
These caches can be referenced in model blocks:
text:
- model: chat_cache
parameters:
temperature: 0.7
max_tokens: 500PDL programs can be deployed as web services using FastAPI. Add a server section to your PDL file:
server:
host: 0.0.0.0
port: 8000
request_body_marker: document_text
path: /generate
content_type: text/plain
log_level: info
log_path: /tmp/service.log
description: ...
text:
- model: mlx-community/Llama-3.2-3B-Instruct-4bit
# ... rest of your programWhen you run the PDL file with generative_redfoot, it will automatically start a FastAPI service instead of executing locally.
Server configuration parameters:
host: Host address to bind the service toport: Port number for the servicerequest_body_marker: Context variable name for the incoming request body (acts as a wordloom marker for request binding)path: API endpoint pathcontent_type: Expected content type of incoming requests (optional - if omitted, any content type is accepted)log_level: Logging level (debug, info, warning, error)log_path: Path to log file
The service accepts POST requests with the specified content type and processes the request body using the PDL program. The result is returned as plain text.
The request_body_marker is used to bind incoming request content to a context variable that can then be referenced throughout the PDL program using variable syntax like { $document_text }.
Generative Redfoot supports dynamic variable references using the syntax { $variable_name } to bind values from the execution context. This is particularly useful for:
- Service Protocol Parameters: Binding request body content to variables (e.g.,
{ $file }for file uploads) - Contextual Content: Referencing values from the execution context in PDL blocks
- Parameter Injection: Dynamically injecting values based on runtime conditions
Example of variable reference in PDF processing:
text:
- text:
- PDF_raw_read_ocr: { $file }
contribute: [context]A PDL program can use mlx-lm's Prompt caching
by specifying a top-level cache parameter at the top of the PDL document. If the value of this parameter is '*',
then an internal, rotating K/V cache is used for the duration of the PDL evaluation for efficiency.
Otherwise, the value is expected to be the path to a previously saved cache, created using mlx_lm.cache_prompt for
example (or the cache preparation capabilities described above), which is treated as a cached prompt that is a prefix to
any prompts specified in the program.
Generative Redfoot defines a number of extensions from PDL
Generative Redfoot now supports four distinct PDF reading modes using PyMuPDF, each designed for different input scenarios:
PDF_raw_read_ocr: Extract text using OCR from raw PDF contentPDF_raw_read_txt: Extract text directly from raw PDF contentPDF_filename_ocr: Extract text from a PDF file using OCRPDF_filename_txt: Extract text directly from a PDF file
The choice of which extension to use depends on your input format and processing requirements:
File-based reading (PDF_filename_ocr and PDF_filename_txt):
Use these when you have a file path available. The _ocr variant is for scanned documents, while the _txt variant is for documents with selectable text.
Raw content reading (PDF_raw_read_ocr and PDF_raw_read_txt):
Use these when working with PDF content as raw bytes or when the PDF is provided as part of a web request (e.g., file uploads in the service deployment context).
Here's an example that reads PDF content from a run-time specified filename using direct text extraction and uses it as context for a model evaluation:
description: autocoding_from_pdf
text:
- text:
- PDF_filename_txt: { $context_file }
contribute: [context]
- |
Provide a list of the 5 ICD-10 codes mentioned in the document above
contribute: [context]
- model: mlx-community/medgemma-4b-it-4bitFor OCR-based extraction from a file:
description: autocoding_from_scanned_pdf
text:
- text:
- PDF_filename_ocr: { $context_file }
contribute: [context]
- |
Provide a list of the 5 ICD-10 codes mentioned in the document above
contribute: [context]
- model: mlx-community/medgemma-4b-it-4bitFor processing PDF content provided as raw bytes (useful in service deployment contexts):
description: pdf_processing_from_upload
text:
- text:
- PDF_raw_read_txt: { $uploaded_pdf_content }
contribute: [context]
- |
Summarize the key points from the document
contribute: [context]
- model: mlx-community/medgemma-4b-it-4bitThe enhanced PDF processing includes robust error handling and content validation:
- File Path Validation: Checks for existence and validity of PDF file paths
- Raw Content Processing: Handles both file paths and raw PDF byte content
- UploadFile Support: Direct integration with FastAPI's UploadFile objects for web services
- Content Type Detection: Proper handling of different PDF content types
- OCR and Text Extraction: Supports both direct text extraction and OCR processing for scanned documents
- Temporary File Handling: Fallback mechanism for problematic PDF files that PyMuPDF can't process directly
Toolio can be used for structured output by specifying a structured_output
block like so (from Toolio algebra tutor demo):
description: structured_output
text:
- structured_output: mlx-community/Llama-3.2-3B-Instruct-4bit
insert_schema: true
schema_file: ToolioGit/demo/algebra_tutor.schema.json
parameters:
temperature: 0.6
max_tokens: 512
input: |
solve 8x + 31 = 2. Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#"Beyond the approach above, its input can be specified in all the ways a PDL model can.
Generative Redfoot supports speculative decoding using draft models for faster inference. Add the draft_model parameter to your model block:
description: draft_model_example
text:
- model: mlx-community/Llama-3.2-3B-Instruct-4bit
draft_model: mlx-community/Llama-3.2-1B-Instruct-4bit
parameters:
temperature: 0.6
max_tokens: 500Draft models must have the same vocabulary size as the main model and can significantly improve generation speed.
AlphaOne (α) reasoning modulation can be used with
a supported, reasoning model using Alpha One MLX via providing
a alpha_one block within a model:
description: alpha_one_reasoning
text:
- model: mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit-AWQ
parameters:
temperature: 0.6
max_tokens: 1200
repetition_penalty: 1.4
top_k: 20
top_p: 0.95
alpha_one:
thinking_token_length: 250
alpha: 1.4
wait_words: ["Wait"]
input: |
What question has an answer of "42?"This algorithm scales the average thinking phase token length (specified by the thinking_token_length parameter and
with a default of 2,650) by the alpha parameter (which defaults to 1.4 per the paper). After
exiting this phase, it suppresses attempts to transition to slow thinking by replacing references to 'Wait' words
with the "</think>" token. The specific list of these words can be provided via the wait_words parameter or the
defaults specified by alpha-one-mlx for each model type will be used.
Prompt snippets can be loaded for use in a PDL file via a word loom library, providing a clean separation of concerns between prompt language management, prompt construction, and LLM workflow management and orchestration. This requires OgbujiPT and below is an example that constructs the system and user prompt from a word loom (TOML) file:
description: wordloom_prompt_example
text:
- text:
- role: system
read_from_wordloom: prompt_library.toml
items: helpful_chatbot_system_prompt
contribute: [context]
- text:
- read_from_wordloom: prompt_library.toml
items: hello_prompt
contribute: [context]
- model: mlx-community/Llama-3.2-3B-Instruct-4bitThe read_from_wordloom block indicates the use of this extension mechanism and its value is a path to a Word loom.
The items parameter on the block is a space-separated list of language item names in the word loom. Their values are
joined together with \n and returned as a value for use (as an example from above) to construct the message for
conversations used by downstream PDL directives.
In the example above, the helpful_chatbot_system_prompt language item from the prompt_library.toml word loom is used
as the system prompt and the hello_prompt language item from the same file is used as the user prompt:
[
{"role": "system", "content": ".. helpful_chatbot_system_prompt language item .. "},
{"role": "user", "content": ".. hello_prompt language item .. "}
]If specified within a model block, the cot_prefix parameter takes a path to a file that captures COT few shot content
as a file in the LLM chat conversation JSON format. This
will be incorporated into the conversation structure to ensure this files content is used as few shot / COT for the
model generation
Below is an example showing a PDL file constructing message contexts for prompts to chained LLM calls from fragments in a Wordloom library, providing a clean separation of concerns between prompt language management, prompt construction, and LLM workflow management and orchestration. The keys in the YAML file in black use the PDL language. Those in red are generative_redfoot extensions shown in order of appearance: (mlx) prefix caching, COT few-shot loading, reading from a wordloom file, using Google's google/gemma-7b-aps-it model to perform "abstractive proposition segmentation" from LLM output, etc.:

