A Proposal for Context Engineering for Quarkus Langchain4j #1979
Replies: 32 comments 89 replies
-
On the implementation side, I am not yet sure how
-
So first, if you start having things like this, you're doing recursive templates (a template inside a template string expression), and that's horrible: don't go there, don't do that. If you want to pass a variable value to a template, use the proper template syntax:
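For illustration, the Qute way is to reference a value with an ordinary value expression rather than rendering a template inside another template, for example:

```
Answer the following user question:
{userMessage}
```

(The `{userMessage}` expression is a standard Qute value expression; no nested template rendering is involved.)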
-
Second: the option you're listing here is via subtypes:

```java
public interface ContextFragmentProvider<T extends QuerySpec> {

    // This is super weird: you take a QuerySpec and you return a subtype of it?
    // Might as well build the proper subtype from the start, no?
    T buildQuerySpec(QuerySpec qs);

    // This is a bit annoying because it forces us to separate the list of parameters
    // from the method which is going to use them. It can be useful, but it can also be isolating.
    Uni<List<ContextFragment>> read(T spec, ContextBudget budget);
}
```

This seems like a problem we've already solved in Quarkus in many places:

```java
// I'm not even sure the interface is required anymore
public class FaqRagProvider implements ContextFragmentProvider {

    // This defines a fragment provider and all its parameters:
    // names, types, default values, required or not
    @FragmentProvider(/* not sure what this does but I've seen it in the spec */ required = true
        // the name could be a parameter here too, instead of a method on ContextFragmentProvider
    )
    Uni<List<ContextFragment>> read(String query,
            /* primitive implies required perhaps */ float minScore,
            @Required Integer maxResults,
            @DefaultsTo("true") boolean optionalParam,
            List<CategoryEnum> filter,
            ContextBudget budget) {
        // Now we have all the useful parameter values we can use
    }
}
```

Here I'm talking about the user-facing API. Of course, we will need to extract all this annotation/signature info into an API for implementation consumption. Now, if you do keep the
-
As to the Qute aspects:

```java
@RegisterAiService
public interface CustomerSupportBot {

    @SystemMessage("""
            You are a customer support assistant for Acme Corp.

            {#context slot="policies"
                      provider="static-policies"
                      maxTokens=500
                      category="customer-service"
                      required=true /}
            {#context slot="history"
                      provider="chat-memory"
                      maxTokens=2000
                      conversationId
                      maxMessages=20 /}
            {#context slot="knowledge"
                      provider="faq-rag"
                      maxTokens=1000
                      query=userMessage
                      minScore=0.7 /}
            {#context slot="user-profile"
                      provider="user-preferences"
                      maxTokens=300
                      userId /}

            ## Company Policies
            {context:policies}

            ## Conversation History
            {context:history}

            ## Relevant Knowledge Base Articles
            {context:knowledge}

            ## User Preferences
            {context:user-profile}

            Now respond to the user's question.
            """)
    String chat(@UserMessage String userMessage,
                @MemoryId String conversationId,
                String userId);
}
```

NOTE: in Qute we can simplify this. Is there any value in separating the context declaration from its insertion in the template? This looks awfully like we're defining variables (what you call "slots" for a reason I cannot immediately understand) and only using them once. If we can use them more than once, then fine. If not, we can simplify:

```java
@RegisterAiService
public interface CustomerSupportBot {

    @SystemMessage("""
            You are a customer support assistant for Acme Corp.

            ## Company Policies
            {#context provider="static-policies"
                      maxTokens=500
                      category="customer-service"
                      required=true /}

            ## Conversation History
            {#context provider="chat-memory"
                      maxTokens=2000
                      conversationId
                      maxMessages=20 /}

            ## Relevant Knowledge Base Articles
            {#context provider="faq-rag"
                      maxTokens=1000
                      query=userMessage
                      minScore=0.7 /}

            ## User Preferences
            {#context provider="user-preferences"
                      maxTokens=300
                      userId /}

            Now respond to the user's question.
            """)
    String chat(@UserMessage String userMessage,
                @MemoryId String conversationId,
                String userId);
}
```

If we absolutely must define new variables because we can reuse them, or we may want to call functions on them or whatever, then let's be clear about them being variables:

```java
@RegisterAiService
public interface CustomerSupportBot {

    @SystemMessage("""
            You are a customer support assistant for Acme Corp.

            {#define name="policies"
                     provider="static-policies"
                     maxTokens=500
                     category="customer-service"
                     required=true /}
            {#define name="history"
                     provider="chat-memory"
                     maxTokens=2000
                     conversationId
                     maxMessages=20 /}
            {#define name="knowledge"
                     provider="faq-rag"
                     maxTokens=1000
                     query=userMessage
                     minScore=0.7 /}
            {#define name="userProfile"
                     provider="user-preferences"
                     maxTokens=300
                     userId /}

            ## Company Policies
            {policies.toLowerCase}

            ## Conversation History
            {history.or("No history")}

            ## Relevant Knowledge Base Articles
            {knowledge}

            ## User Preferences
            {userProfile}

            Now respond to the user's question.
            """)
    String chat(@UserMessage String userMessage,
                @MemoryId String conversationId,
                String userId);
}
```

And finally, given that we know statically all the template fragment providers, we could just as well auto-declare their Qute tags and write:

```java
@RegisterAiService
public interface CustomerSupportBot {

    @SystemMessage("""
            You are a customer support assistant for Acme Corp.

            {#define-static-policies name="policies"
                     maxTokens=500
                     category="customer-service"
                     required=true /}
            {#define-chat-memory name="history"
                     maxTokens=2000
                     conversationId
                     maxMessages=20 /}
            {#define-faq-rag name="knowledge"
                     maxTokens=1000
                     query=userMessage
                     minScore=0.7 /}
            {#define-user-preferences name="userProfile"
                     maxTokens=300
                     userId /}

            ## Company Policies
            {policies.toLowerCase}

            ## Conversation History
            {history.or("No history")}

            ## Relevant Knowledge Base Articles
            {knowledge}

            ## User Preferences
            {userProfile}

            Now respond to the user's question.
            """)
    String chat(@UserMessage String userMessage,
                @MemoryId String conversationId,
                String userId);
}
```

But again, if all your fragment providers produce

```java
@RegisterAiService
public interface CustomerSupportBot {

    @SystemMessage("""
            You are a customer support assistant for Acme Corp.

            ## Company Policies
            {#context-static-policies
                     maxTokens=500
                     category="customer-service"
                     required=true /}

            ## Conversation History
            {#context-chat-memory
                     maxTokens=2000
                     conversationId
                     maxMessages=20 /}

            ## Relevant Knowledge Base Articles
            {#context-faq-rag
                     maxTokens=1000
                     query=userMessage
                     minScore=0.7 /}

            ## User Preferences
            {#context-user-preferences
                     maxTokens=300
                     userId /}

            Now respond to the user's question.
            """)
    String chat(@UserMessage String userMessage,
                @MemoryId String conversationId,
                String userId);
}
```
-
Your input is very much appreciated here @FroMage! I will take a close look on Monday.
-
I like the idea of aggregating various sources and of the budget control. I'm a bit skeptical that the prompt is the control center, though. It feels like an über-thing that could have multiple prompts as input.
-
You mention auditability, but that means logging not only the fragment composition but all fragments and the final prompt. Is that what you had in mind?
-
Using the system message for all that "seems wrong", as it gives equal priority to all contexts. I find it strange to set everything in the SystemMessage and have a UserMessage seemingly disconnected below. You might want to embed the user message in the final message (e.g. in the middle).
-
Could it be that one might want to compress a composed prompt? How would you do that?
-
Will prompts be externalised by some teams and thus be "untypesafe"?
-
Should maxTokens have the capacity to ask for compression (a wrapped ContextFragment)?
-
How are the attributes of a context fragment used to alter behavior? In Qute code?
-
I don't really understand the UpdateSpec behavior. When is it called?
-
Is the prompt the root of all things in a given request/operation? I doubt it.
-
Will the prompt composer be the same/similar and duplicated in a lot of places?
-
I am trying to compare and reconcile this with the current RetrievalAugmentor stuff that we have right now, because there are many similarities, and I'm wondering whether we should create something completely new and separate, or build on top of existing things.
I could envision using this template-based approach that would, at build time, automatically transform the template into an instance of
That way, we could offer the typesafe templating as proposed by Clement, but build on top of existing components instead of creating something completely new.
-
As for multi-modal payloads, I'm not sure how this fits into the templating approach. If you have some non-text content, it has to be submitted to the LLM as a separate
-
Nice! I especially like the idea of explicitly specifying "slots" and controlling token budgets.
-
Tools and structured outputs are missing in this proposal, but they are core LLM features and are included in the "final" prompt that the LLM eventually "sees" before generating the answer. Token budgets, priorities, ordering, observability, etc. are also applicable to them.
-
IIUC, this solution seems to assume that only two messages are always sent in the request to the LLM:
Not sure how tools will work here as well. Many LLM providers expect a specific ordering of specific message types, for example:
-
Some context providers might depend on others. For example, a RAG provider might need to know the whole conversational history and the user preferences in order to provide the most relevant pieces of information.
-
We need to make sure not to put questionable content in the system message (also in examples) to reduce the chances of prompt injection. Especially not conversation history (which contains user queries), and maybe not even RAG.
-
I would also include "efficiently taking advantage of prompt caching" in the list of requirements, as it is an important LLM feature and can dramatically impact latency and cost.
-
One thing that I just realized could be a nice thing about this work: @angelozerr is working on integrating the Qute debugger into Quarkus LangChain4j 😉
-
I'm preparing a revamped version of the proposal. I hope to be able to publish it tomorrow.
-
I've posted a new version of the proposal. It's hard to keep everything in sync, so please review. The main difference is the switch to a multi-message approach:
Previous Approach (Single-Message)
New Approach (Multi-Message)
-
I'm still seeing confusion as to the Qute syntax, on all the points I raised in the proposal. I also still don't see any reason to separate the
As for the caching, I am not seeing any reason why the caching should be defined in the
-
When applying multiple annotations of the same type (
-
Closing in favor of #2060.
-
Context Engineering Proposal
Version: 0.2
Executive Summary
This issue describes a proposed architecture for context engineering in Quarkus LangChain4j: a system for managing AI context in a composable, deterministic, and observable manner.
Core Vision
Context engineering is the practice of identifying and providing the most relevant information from the surrounding system, data, and interaction history to an LLM, so that inference is more accurate, reliable, and aligned with user intent.
This removes hidden magic, makes context assembly explicit and auditable, and places complete control in developers’ hands.
Characteristics
What Problems Does This Solve?
Introduction & Motivation
The Context Engineering Challenge
Modern AI applications need to combine context from multiple sources:
Existing langchain4j features (ChatMemory, ContentRetriever) handle individual sources well but lack a unified model for:
Three Mental Models for Context Engineering
Before diving into the solution, it's essential to understand the different perspectives on how context relates to prompts:
Model A: Context is Part of the Prompt
Mental Model: The prompt template is the container; context providers inject dynamic data into it.
Developer Perspective: "I'm writing a prompt, and I need to pull in some contextual data."
Structure:
Example:
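As an illustration of Model A, a template using the `{#context}` tag defined later in this proposal might look like:

```
You are a customer support assistant for Acme Corp.

{#context slot="knowledge" provider="faq-rag" maxTokens=1000 query=userMessage /}

## Relevant Knowledge Base Articles
{context:knowledge}
```

The template is the container; the `faq-rag` provider injects dynamic data into the `knowledge` slot.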
Characteristics:
Model B: Prompt is Part of the Context
Mental Model: Everything sent to the LLM is "context" (instructions, data, history, etc.)
Developer Perspective: "Context engineering encompasses everything the LLM sees."
Structure:
Example:
Characteristics:
Model C: Separate but Complementary
Mental Model: Prompt = static structure/instructions; Context = dynamic data
Developer Perspective: "Prompts define the task; context provides the information"
Clear separation between what the developer writes versus what's retrieved.
Example:
Characteristics:
In this proposal, we use Model A with a multi-message extension.
This proposal adopts Model A (context is part of the prompt) because it provides:
However, we extend it with multi-message support to enable:
Model B could be built as a higher-level API on top of this foundation in the future.
Why Multi-Message Prompt-First? (And Yes, I’m bad at naming things)
Traditional approaches hide context assembly in code or configuration:
Multi-message prompt-first design makes everything explicit:
Benefits:
Design Philosophy
Core Principles
1. Multi-Message Prompt-First Design
Principle: Message templates are the control plane for context engineering.
2. Determinism & Reproducibility
Principle: Same inputs → same context → same prompt (modulo time-based queries).
3. Composability
Principle: Context sources are independent, orthogonal building blocks.
4. Observability
Principle: Full visibility into what context was used and why.
5. Type Safety & Validation
(That's the hard one)
Principle: Provider parameters validated at build time when possible.
Architectural Overview
High-Level Architecture
```mermaid
flowchart TD
    A["Prompt Template (Qute)<br/>{#context slot='X' provider='Y' maxTokens=N query='...' /}"]
    B["Context Resolution Engine<br/>• Parse {#context} tags<br/>• Create QuerySpec from tag parameters<br/>• Validate parameters against provider schema<br/>• Route to appropriate ContextFragmentProvider"]
    C["ContextFragmentProvider (CDI Beans)"]
    C1["Chat Memory<br/>Provider"]
    C2["RAG<br/>Provider"]
    C3["User Prefs<br/>Provider"]
    C4["..."]
    D["ContextFragment[]<br/>• id, type, format, payload<br/>• attributes (score, metadata)<br/>• source (provenance)"]
    E["Budget Enforcement<br/>• Count tokens in fragments<br/>• Apply maxTokens limit<br/>• Truncate or reject overflow<br/>• Log budget decisions"]
    F["Rendered Context String<br/>• Injected into template as {context:slotName}<br/>• Final prompt sent to LLM"]
    A --> B
    B --> C
    C --> C1
    C --> C2
    C --> C3
    C --> C4
    C1 --> D
    C2 --> D
    C3 --> D
    C4 --> D
    D --> E
    E --> F
```

Component Responsibilities

Core Abstractions (incomplete list)
ContextFragment
A ContextFragment is the fundamental unit of context—an immutable, self-describing piece of information with provenance.
ContextFragmentProvider / ContextProvider
A ContextFragmentProvider/ContextProvider is a pluggable source/sink of context fragments, discovered via CDI.
QuerySpec
A QuerySpec encapsulates query parameters extracted from template tags.
Note that each context provider will provide an implementation of `QuerySpec` (and a factory method).
Creation from Template Tag:
Becomes:
This default query spec will be passed to the `buildQuerySpec` method of the provider to validate and create a custom `QuerySpec` object. This mechanism should allow build-time checking of the attributes.
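As a sketch of this mechanism (all class and method names below are illustrative, not an existing API): the engine builds a generic, stringly-typed spec from the tag parameters, and the provider's factory validates and converts it into a typed spec:

```java
import java.util.Map;

// Hypothetical: a generic spec holding the raw tag parameters as strings.
record QuerySpec(Map<String, String> params) {}

// Hypothetical provider-specific spec with typed, validated fields.
record RagQuerySpec(String query, double minScore, int maxResults) {

    // Validates the stringly-typed tag parameters and converts them to typed fields.
    static RagQuerySpec from(QuerySpec raw) {
        Map<String, String> p = raw.params();
        if (!p.containsKey("query")) {
            throw new IllegalArgumentException("'query' is required");
        }
        return new RagQuerySpec(
                p.get("query"),
                Double.parseDouble(p.getOrDefault("minScore", "0.0")),
                Integer.parseInt(p.getOrDefault("maxResults", "5")));
    }
}
```

A build-time check could run the same validation against the statically known tag parameters of each `{#context}` occurrence.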
ContextBudget
A ContextBudget represents resource limits for context operations.
Budget Creation from Template:
Creates:
`ContextBudget.ofTokens(2000)`
Budget Enforcement:
The context provider enforces budgets by:
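A minimal sketch of budget enforcement, assuming a whitespace-based token estimate (a real implementation would use the model's tokenizer); all names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical fragment: just an id and a text payload.
record ContextFragment(String id, String text) {}

class BudgetEnforcer {

    // Crude token estimate; a real implementation would use the model's tokenizer.
    static int estimateTokens(String text) {
        return text.isBlank() ? 0 : text.trim().split("\\s+").length;
    }

    // Keeps fragments in order until the budget is exhausted; drops the rest.
    static List<ContextFragment> enforce(List<ContextFragment> fragments, int maxTokens) {
        List<ContextFragment> kept = new ArrayList<>();
        int used = 0;
        for (ContextFragment f : fragments) {
            int cost = estimateTokens(f.text());
            if (used + cost > maxTokens) {
                break; // overflow: drop this and the remaining fragments (could also truncate)
            }
            kept.add(f);
            used += cost;
        }
        return kept;
    }
}
```

Whether overflow should truncate the offending fragment or drop it entirely is a policy decision the real implementation would have to make (and log).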
Source
A Source captures provenance metadata for fragments.
Multi-Message Prompt-First Design
The Central Principle
Message templates are the control plane for context engineering.
Instead of configuring context assembly in code or YAML files, developers declare their context requirements directly in message templates using Qute custom sections. Multiple `@SystemMessage` and `@UserMessage` annotations define the structure of the message sequence sent to the LLM.
Qute Integration: The `{#context}` Tag
The `{#context}` tag is the primary mechanism for declaring context requirements.
Syntax:
Key Attributes:
- `slot`: Unique identifier for this context within the template
- `provider`: Name of the ContextFragmentProvider to query
- `maxTokens`: Maximum tokens this context can consume
- `cacheable`: (Optional) Hint that this context is static and can be cached by the LLM
- `required`: (Optional) Whether this context must be present (see semantics below)
Template Variable Interpolation:
Tag parameters can reference template variables:
Complete Example
How this works:
- the `@ConversationMemory` annotation
The conversation history can be empty. It will not contain the second message. If empty, the second message and the current turn are merged (appended).
Observability with Prompt-First Design
Every `{#context}` tag invocation is observable:
Evolution from Single-Message Design
This proposal evolved from an initial single-message approach to the current multi-message model.
Original Approach: Single Message with Embedded History
Initial Concept:
Problems Identified:
Conversation History as Text: Chat history rendered as text rather than proper message structures
No Caching Strategy: Everything in one message means no clear separation between static and dynamic content
Poor Context Positioning: All context in one blob
Single-Turn Bias: Design worked for simple queries but broke down for conversations
Why Multi-Message is Better
The multi-message approach addresses all these issues:
Benefits:
- `@ConversationMemory`
LLM Prompt Caching Strategy
Modern LLM providers (Anthropic, OpenAI) offer prompt caching to reduce costs and latency by caching portions of the prompt that don't change between requests.
How LLM Prompt Caching Works
Caching is prefix-based: The LLM caches everything up to a cache breakpoint.
Example with Anthropic:
Cache hits:
Cost savings:
Designing for Caching
Principle: Structure messages so static content precedes dynamic content.
Bad (no caching benefit):
Good (caching optimized):
The `cacheable` Attribute
Usage: Mark contexts that are static across requests.
Semantics:
- `cacheable=true`: Hint to the framework that this context is static and should be included in the cached prefix (when it can be controlled programmatically)
Note: This is a future optimization. The initial implementation may not leverage caching, but the API is designed to support it.
Context Ordering and Positioning
The order and position of context in the message sequence significantly affects LLM behavior.
Known LLM Biases
1. Primacy Bias: LLMs pay more attention to content at the beginning
2. Recency Bias: LLMs pay more attention to content at the end
3. Lost in the Middle: LLMs pay less attention to content in the middle of long contexts
Recommended Message Structure
Attention Levels:
Mitigating "Lost in the Middle"
Strategy 1: Keep history concise
Strategy 2: Position critical info at boundaries
Strategy 3: Semantic filtering (future)
Context Rotting
Problem: Old information in cached messages becomes stale.
Example:
If cached, stock prices from the first call persist for the whole interaction.
Solutions:
- `{#context slot="stock-prices" provider="prices" cacheable=false /}`
- `{#context slot="policies" provider="policies" cacheable=true /}` ← Good (policies don't change)
- `{#context slot="user-prefs" provider="prefs" cacheable=true cacheTTL="1h" /}`
Guidelines:
- `cacheable=true`
Conversation Memory: The `@ConversationMemory` Annotation
The `@ConversationMemory` annotation controls how conversation history is injected into the message sequence.
Basic Usage
How It Works
Placement: The `@ConversationMemory` annotation marks where conversation history should be injected in the message sequence.
Turn 1:
Turn 2:
Turn 3:
Configuration Options
Parameters:
- `maxTokens`: Maximum token budget for conversation history
- `maxTurns`: Maximum number of turns to include
- `strategy` (future): How to select which turns to include
  - `"recent"`: Most recent turns (default)
  - `"semantic-similarity"`: Most relevant to the current query (future)
  - `"importance"`: Based on turn importance scoring (future)
Interaction with ChatMemory
The `@ConversationMemory` annotation works with the existing `ChatMemoryStore`:
- `ChatMemoryStore`
- `@ConversationMemory` retrieves relevant turns from the store
Budget Management
The framework enforces the token budget by:
- Fetching recent turns from the `ChatMemoryStore` (up to `maxTurns`)
- Dropping the oldest turns if `maxTokens` would be exceeded
Example:
Included: Turns 2, 3, 4 (1700 tokens total)
Dropped: Turn 1 (would exceed 2000)
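The selection sketched in this example (walk backwards from the most recent turn, keep turns while they fit, preserve chronological order) could look like the following; the `Turn` record and token counts are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical turn with a precomputed token count.
record Turn(int number, int tokens) {}

class MemoryBudget {

    // Walks backwards from the most recent turn, keeping turns while they fit the budget.
    static List<Turn> select(List<Turn> history, int maxTokens) {
        Deque<Turn> kept = new ArrayDeque<>();
        int used = 0;
        for (int i = history.size() - 1; i >= 0; i--) {
            Turn t = history.get(i);
            if (used + t.tokens() > maxTokens) {
                break; // the oldest turns are dropped first
            }
            kept.addFirst(t); // preserve chronological order
            used += t.tokens();
        }
        return List.copyOf(kept);
    }
}
```

With a 2000-token budget and four turns, this reproduces the shape of the example above: the most recent turns are included and the oldest one is dropped.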
Positioning Guidelines
Where to place `@ConversationMemory`?
Option 1: Before the current turn (recommended)
Benefits:
Option 2: After system message
Benefits:
Context Providers
This section describes the four context providers that can be used to better understand the configuration mechanism.
1. Episodic Long-Term Memory Provider
Purpose: Provides access to timestamped event history (what happened when).
Name:
`episodic-memory`
Behavior:
Configuration in Template:
Event Structure:
Events are stored in a structured format:
```json
{
  "id": "evt-12345",
  "userId": "user-789",
  "eventType": "task-completion",
  "timestamp": "2025-11-28T10:30:00Z",
  "data": {
    "taskId": "task-456",
    "outcome": "success",
    "duration": "5m"
  }
}
```
Fragment Format:
Or as formatted text:
2. User Preferences Provider (Long-Term Memory)
Purpose: Stores and retrieves user-specific preferences, settings, and profile data.
Name:
`user-preferences`
Behavior:
Configuration in Template:
Preference Structure:
Preferences stored as key-value pairs:
```json
{
  "userId": "user-789",
  "preferences": {
    "language": "en-US",
    "timezone": "America/New_York",
    "notification-settings": {
      "email": true,
      "sms": false
    },
    "theme": "dark"
  }
}
```
Fragment Format (text):
Write Behavior:
3. RAG Provider (Document Retrieval)
Purpose: Retrieves relevant documents from a vector store based on semantic similarity.
Name:
`rag` (or domain-specific names like `faq-rag`, `docs-rag`)
Behavior:
Configuration in Template:
Read Behavior:
Fragment Format:
Type Safety & Validation
The Challenge
Template tag parameters are inherently stringly-typed:
How do we:
Proposal: Type-Safe QuerySpec Subclasses
Instead of using a generic schema-based validation approach, each provider defines its own QuerySpec subclass with strongly-typed parameters.
This provides compile-time type safety for provider implementations while maintaining flexibility for template-based configuration.
// TODO It's still unclear how this will allow build-time validation. We need to extract some sort of schema.
Advanced Topics and Implementation Details
Special Context Providers
Toolbox Provider
Purpose: Provides the list of available tools/functions for the LLM to call.
Name:
`toolbox`
Usage:
How it works:
- `@Tool`-annotated methods in the application
Integration with Function Calling:
`{context:toolbox}`
Dynamic Tool Selection (future):
`{#context slot="toolbox" provider="tools" category="weather" cacheable=false /}`
Only include tools matching certain criteria.
Structured Output Provider
Purpose: Provides the JSON schema for structured output responses.
Name:
`structured-output`
Usage:
How it works:
(e.g., `Person.class`)
Build-time Generation:
The `required` Attribute Semantics
The `required` attribute controls what happens when a context provider returns empty or no results.
Syntax:
Semantics:
`required=true` (fail-fast)
Behavior:
Example:
`{#context slot="policies" provider="compliance-policies" required=true /}`
If the compliance policies can't be loaded:
- a `RequiredContextMissingException` is thrown
Use cases:
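A minimal sketch of the fail-fast semantics; the exception name mirrors the proposal, while the `RequiredCheck` helper is illustrative:

```java
import java.util.List;

// Exception named in the proposal: thrown when a required context yields no fragments.
class RequiredContextMissingException extends RuntimeException {
    RequiredContextMissingException(String slot) {
        super("Required context '" + slot + "' produced no fragments");
    }
}

class RequiredCheck {

    // Fail-fast when required=true and the provider returned nothing;
    // optional contexts simply render as empty.
    static List<String> resolve(String slot, List<String> fragments, boolean required) {
        if (required && fragments.isEmpty()) {
            throw new RequiredContextMissingException(slot);
        }
        return fragments;
    }
}
```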
CDI Scopes for Context Providers
Context providers are CDI beans, and their scope affects lifecycle and state management.
Recommended Scopes
`@ApplicationScoped` (default recommendation)
Use for:
Benefits:
`@RequestScoped`
Use for:
- `@RequestScoped` beans
Scope Considerations
Thread Safety:
- `@ApplicationScoped` providers must be thread-safe
Caching:
- `@ApplicationScoped` providers can maintain application-wide caches
- `@RequestScoped` providers can cache within a request
- `@CacheResult` for method-level caching
Provider Dependencies and Injection
Providers can depend on each other and inject services via CDI.
Injecting Other Providers
Use cases:
Verifying Context Provider Inclusion
How do you verify that a context provider was actually executed and included in the prompt?
Observability Events
Every context provider invocation emits telemetry:
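For illustration, such a telemetry event could carry fields like these (the record and field names are hypothetical, not an existing Quarkus LangChain4j API):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical telemetry event emitted per {#context} resolution.
record ContextResolutionEvent(
        String slot,
        String provider,
        int fragmentsReturned,
        int tokensUsed,
        int tokensBudgeted,
        boolean truncated,
        Duration latency,
        Instant timestamp) {

    // A compact one-line summary suitable for structured logging.
    String summary() {
        return "slot=" + slot + " provider=" + provider
                + " fragments=" + fragmentsReturned
                + " tokens=" + tokensUsed + "/" + tokensBudgeted
                + " truncated=" + truncated;
    }
}
```

An observer (e.g. a CDI event listener or an OpenTelemetry span processor) could then assert in tests that the expected slots were resolved.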
Testing with Mocks
Unit test with mock provider:
Rendered Prompt Inspection (Dev Mode)
Dev Mode Feature (future): Render and display the final prompt before sending to LLM.
Output:
Passing Fragments as Method Parameters
Use case: Manually retrieve context fragments and pass them to the AI service method.
Why: Advanced scenarios where you want full control over fragment retrieval.
Example
Future Directions
Composer & Operator Pipeline
Idea: Allow cross-provider composition and transformation of fragments.
Features:
Example (Future):
The composer would:
Advanced Budget Management
Dynamic Allocation:
Quality-Based Allocation:
Caching & Performance Optimization
Fragment Caching:
Embedding Caching:
Multi-Modal Context
Idea: Support non-text context (images, audio, video).
Example:
Challenges:
Provider Stereotypes
Idea: Define reusable "stereotypes" that aggregate multiple context providers with predefined configuration.
Problem: Repetition across similar AI services:
Solution: Stereotypes - Named configurations that can be referenced:
Define a stereotype:
Use the stereotype:
Expansion: At build time, `@CustomerSupportBase` expands to the full message sequence.