diff --git a/examples/partners/temporal_agents_with_knowledge_graphs/Appendix.ipynb b/examples/partners/temporal_agents_with_knowledge_graphs/Appendix.ipynb new file mode 100644 index 0000000000..308fe7f973 --- /dev/null +++ b/examples/partners/temporal_agents_with_knowledge_graphs/Appendix.ipynb @@ -0,0 +1,671 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "dd2b3250-1764-4cef-b3e0-b5fa1be924e9", + "metadata": {}, + "source": [ + "
\n",
+ " Clearly define core entity types (e.g., Person
, Organization
, Event
) and relationships. Design the schema with versioning and flexibility in mind, enabling future schema evolution with minimal downtime.\n",
+ "
\n", + " Use high-cardinality fields (such as timestamps or unique entity IDs) for partitioning to preserve query performance as data volume grows. This is particularly important for temporally-aware data. For example:\n", + "
\n", + " \n", + " ```sql \n", + " CREATE TABLE statements (\n", + " statement_id UUID PRIMARY KEY,\n", + " entity_id UUID NOT NULL,\n", + " text TEXT NOT NULL,\n", + " valid_from TIMESTAMP NOT NULL,\n", + " valid_to TIMESTAMP,\n", + " status VARCHAR(16) DEFAULT 'active',\n", + " embedding VECTOR(1536),\n", + " ...\n", + " ) PARTITION BY RANGE (valid_from);\n", + " ```\n", + "\n",
+ " Avoid deleting or overwriting records. Instead mark outdated facts as inactive by setting a status
(e.g., inactive
).\n",
+ "
\n",
+ " Index temporal fields (valid_from
, valid_to
) to support efficient querying of both current and historical states.\n",
+ "
\n",
+ " Set up a background task that periodically queries for records with e.g., valid_to < NOW() - INTERVAL 'X days'
and moves them to an archival table for long-term storage.\n",
+ "
\n", + " Tailor retention durations by data source or entity type. For example, high-authority sources like government publications may warrant longer retention than less reliable data such as scraped news headlines or user-generated content.\n", + "
\n", + "\n",
+ " Introduce a numeric relevance_score
column (or columns) incorporating metrics such as recency, source trustworthiness, and production query frequency.\n",
+ "
\n", + " Schedule a routine job to prune or archive facts falling below a predefined relevance threshold.\n", + "
\n", + "\n",
+ " Compute a composite relevance_score
, for example:\n",
+ "
relevance_score = β1 * recency_score + β2 * source_trust_score + β3 * retrieval_count
\n",
+ " \n", + " Where:\n", + "
\n", + "recency_score
: exponential decay from valid_from
source_trust_score
: source-domain trust valueretrieval_count
: production query frequency\n", + " Log and compare outputs from the pruned vs. original graph before routing production traffic.\n", + "
\n", + "\n", + " Recompute relevance (e.g., nightly) to ensure new or frequently accessed facts surface back to the top.\n", + "
\n", + "\n",
+ " Begin by collecting documents in batches of e.g., 100–500 using a job queue like Redis or Amazon SQS. Process these documents in parallel, splitting each into their respective chunks. The chunking stage should often optimize for I/O parallelization as document reading is often the bottleneck. You can then store the chunks and their respective metadata in your chunk_store
table, using bulk insert operations to minimize overhead.\n",
+ "
\n", + " Pull chunks in batches of e.g., 50–100 and send them to your chosen LLM (e.g., GPT-4.1-mini) using parallel API requests. Implement rate limiting with semaphores or other methods to stay safely within OpenAI's API limits whilst maximizing your throughputs. We've covered rate limiting in more detail in our cookbook on How to handle rate limits. Once extracted, you can then write these to the relevant table in your database.\n", + "
\n", + "\n", + " You can then similarly group the statements we've just extracted into batches, and run the entity extraction processes in a similar vein before storing them.\n", + "
\n", + "\n",
+ " Group extracted statement IDs by their associated entity clusters (e.g., all statements related to a specific entity like “Acme Corp.”). Send each cluster to your LLM (e.g., GPT-4.1-mini) in parallel to assess which statements are outdated or superseded. Use the model’s output to update the status
field in your statements
table—e.g., setting status = 'inactive'
. Parallelize invalidation jobs for performance and consider scheduling periodic sweeps for consistency.\n",
+ "
\n",
+ " Take batches of newly extracted entity mentions and compute embeddings using your model’s embedding endpoint. Insert these into your entity_registry
table, assigning each a provisional or canonical entity_id
. Perform approximate nearest-neighbor (ANN) searches using pgvector
to identify near-duplicates or aliases. You can then update the entities
table with resolved canonical IDs, ensuring downstream tasks reference unified representations.\n",
+ "
\n", + " For multi-hop questions, the Controller can prompt a model (e.g., GPT-4.1) with the partial subgraph and ask “Which next edge should I traverse?” This allows dynamic, context-aware traversal rather than blind breadth-first search.\n", + "
\n", + "\n",
+ " For frequently asked queries or subgraph patterns, cache the results (e.g., in Redis or a Postgres Materialized View) with a TTL equal to the fact’s valid_to
date, so that subsequent requests hit the cache instead of re-traversing.\n",
+ "
\n", + " Deploy the Traversal Worker Agents in a Kubernetes cluster with Horizontal Pod Autoscalers. Use CPU and memory metrics (and average queue length) to scale out during peak usage.\n", + "
\n", + "\n", + " Introducing a persona to the model is an effective way to drive performance. Once you have narrowed down the specialism of the component you are developing the prompt for, you can create a persona in the system prompt that helps to shape the model's behaviour. We used this in our planner model to create a system prompt like this:\n", + "
\n", + "initial_planner_system_prompt = (\n",
+ " \"You work for the leading financial firm, ABC Incorporated, one of the largest financial firms in the world. \"\n",
+ " \"Due to your long and esteemed tenure at the firm, various equity research teams will often come to you \"\n",
+ " \"for guidance on research tasks they are performing. Your expertise is particularly strong in the area of \"\n",
+ " \"ABC Incorporated's proprietary knowledge base of earnings call transcripts. This contains details that have been \"\n",
+ " \"extracted from the earnings call transcripts of various companies with labelling for when these statements are, or \"\n",
+ " \"were, valid. You are an expert at providing instructions to teams on how to use this knowledge graph to answer \"\n",
+ " \"their research queries. \\n\"\n",
+ ")
\n",
+ " \n", + " Persona prompts can become much more developed and specific than this, but this should provide an insight into what this looks like in practice.\n", + "
\n", + "\n", + " For extraction-related tasks, such as statement extraction, a concise few-shot prompt (2–5 examples) will typically deliver higher precision than a zero-shot prompt at a marginal increase in cost.\n", + "
\n", + "\n", + " For e.g., temporal reconciliation tasks, chain-of-thought methods where you guide the model through comparison logic are more appropriate. This can look like:\n", + "
\n", + "Example 1: [Old fact], [New fact] → Invalidate\n",
+ "Example 2: [Old fact], [New fact] → Coexist\n",
+ "Now: [Old fact], [New fact] →
\n",
+ " \n",
+ " You can also lean on other LLMs or more structured methods to prune and prepare material that will be dynamically passed to prompts. We saw an example of this when building the tools for our retriever above, where the timeline_generation
tool sorts the retrieved material before passing it back to the central orchestrator.\n",
+ "
\n", + " Steps to clean up the context or compress it mid-run can also be highly effective for longer-running queries.\n", + "
\n", + "\n",
+ " Maintain a set of prompt templates in a version-controlled directory (e.g., prompts/statement_extraction.json
, prompts/entity_extraction.json
) to enable you to audit past changes and revert if necessary. You can utilize OpenAI's reusable prompts for this. In the OpenAI dashboard, you can develop reusable prompts to use in API requests. This enables you to build and evaluate your prompts, deploying updated and improved versions without ever changing the code.\n",
+ "
\n", + " Automate A/B testing by periodically sampling extracted facts from the pipeline, re-running them through alternative prompts, and comparing performance scores (you can track this in a separate evaluation harness).\n", + "
\n", + "\n", + " Track key performance indicators (KPIs) such as extraction latency, error rates, and invalidation accuracy.\n", + "
\n", + "\n", + " If any metric drifts beyond a threshold (e.g., invalidation accuracy drops below 90%), trigger an alert and roll back to a previous prompt version.\n", + "
\n", + "\n", + " A key challenge in developing knowledge-driven AI systems is maintaining a database that stays current and relevant. While much attention is given to boosting retrieval accuracy with techniques like semantic similarity and re-ranking, this guide focuses on a fundamental—yet frequently overlooked—aspect: systematically updating and validating your knowledge base as new data arrives.\n", + "
\n", + "\n", + " No matter how advanced your retrieval algorithms are, their effectiveness is limited by the quality and freshness of your database. This cookbook demonstrates how to routinely validate and update knowledge graph entries as new data arrives, helping ensure that your knowledge base remains accurate and up to date.\n", + "
\n", + "\n", + " Learn how to combine OpenAI models (such as o3, o4-mini, GPT-4.1, and GPT-4.1-mini) with structured graph queries via tool calls, enabling the model to traverse your graph in multiple steps across entities and relationships.\n", + "
\n", + "\n", + " This method lets your system answer complex, multi-faceted questions that require reasoning over several linked facts, going well beyond what single-hop retrieval can accomplish.\n", + "
\n", + "\n", + " Traditional knowledge graphs treat facts as static, but real-world information evolves constantly. What was true last quarter may be outdated today, risking errors or misinformed decisions if the graph does not capture change over time. Temporal knowledge graphs allow you to precisely answer questions like “What was true on a given date?” or analyse how facts and relationships have shifted, ensuring decisions are always based on the most relevant context.\n", + "
\n", + "\n", + " A Temporal Agent is a pipeline component that ingests raw data and produces time-stamped triplets for your knowledge graph. This enables precise time-based querying, timeline construction, trend analysis, and more.\n", + "
\n", + "\n", + " The pipeline starts by semantically chunking your raw documents. These chunks are decomposed into statements ready for our Temporal Agent, which then creates time-aware triplets. An Invalidation Agent can then perform temporal validity checks, spotting and handling any statements that are invalidated by new statements that are incident on the graph.\n", + "
\n", + "\n", + " Direct, single-hop queries frequently miss salient facts distributed across a graph's topology. Multi-step (multi-hop) retrieval enables iterative traversal, following relationships and aggregating evidence across several hops. This methodology surfaces complex dependencies and latent connections that would remain hidden with one-shot lookups, providing more comprehensive and nuanced answers to sophisticated queries.\n", + "
\n", + "\n", + " Planners orchestrate the retrieval process. Task-orientated planners decompose queries into concrete, sequential subtasks. Hypothesis-orientated planners, by contrast, propose claims to confirm, refute, or evolve. Choosing the optimal strategy depends on where the problem lies on the spectrum from deterministic reporting (well-defined paths) to exploratory research (open-ended inference).\n", + "
\n", + "\n", + " Tool design spans a continuum: Fixed tools provide consistent, predictable outputs for specific queries (e.g., a service that always returns today’s weather for San Francisco). At the other end, Free-form tools offer broad flexibility, such as code execution or open-ended data retrieval. Semi-structured tools fall between these extremes, restricting certain actions while allowing tailored flexibility—specialized sub-agents are a typical example. Selecting the appropriate paradigm is a trade-off between control, adaptability, and complexity.\n", + "
\n", + "\n", + " High-fidelity evaluation hinges on expert-curated \"golden\" answers, though these are costly and labor-intensive to produce. Automated judgments, such as those from LLMs or tool traces, can be quickly generated to supplement or pre-screen, but may lack the precision of human evaluation. As your system matures, transition towards leveraging real user feedback to measure and optimize retrieval quality in production.\n", + "
\n", + "\n", + " A proven workflow: Start with synthetic tests, benchmark on your curated human-annotated \"golden\" dataset, and iteratively refine using live user feedback and ratings.\n", + "
\n", + "\n", + " Established archival policies and assign numeric relevance scores to each edge (e.g., recency x trust x query-frequency). Automate the archival or sparsification of low-value nodes and edges, ensuring only the most critical and frequently accessed facts remain for rapid retrieval.\n", + "
\n", + "\n", + " Transition from a linear document → chunk → extraction → resolution pipeline to a staged, asynchronous architecture. Assign each processing phase its own queue and dedicated worker pool. Apply clustering or network-based batching for invalidation jobs to maximize efficiency. Batch external API requests (e.g., OpenAI) and database writes wherever possible. This design increases throughput, introduces backpressure for reliability, and allows you to scale each pipeline stage independently.\n", + "
\n", + "\n", + " Enforce rigorous output validation: standardise temporal fields (e.g., ISO-8601 date formatting), constrain entity types to your controlled vocabulary, and apply lightweight model-based sanity checks for output consistency. Employ structured logging with traceable identifiers and monitor real-time quality and performance metrics in real lime to proactively detect data drift, regressions, or pipeline anomalised before they impact downstream applications.\n", + "
\n", + "\n", + " Build a pipeline that extracts entities and relations from unstructured text, resolves temporal conflicts, and keeps your graph up-to-date as new information arrives.\n", + "
\n", + "\n", + " Use structured queries and language model reasoning to chain multiple hops across your graph and answer complex questions.\n", + "
\n", + "\n", + " Move from experimentation to deployment. This section covers architectural tips, integration patterns, and considerations for scaling reliably.\n", + "
\n", + "Industry | \n", + "Example question | \n", + "Risk if database is not temporal | \n", + "
---|---|---|
Financial Services | \n", + "\"How has Moody’s long‑term rating for Bank YY evolved since Feb 2023?\" | \n", + "Mispricing credit risk by mixing historical & current ratings | \n", + "
\"Who was the CFO of Retailer ZZ when the FY‑22 guidance was issued?\" | \n", + "Governance/insider‑trading analysis may blame the wrong executive | \n", + "|
\"Was Fund AA sanctioned under Article BB at the time it bought Stock CC in Jan 2024?\" | \n", + "Compliance report could miss an infraction if rules changed later | \n", + "|
Manufacturing / Automotive | \n", + "\"Which ECU firmware was deployed in model Q3 cars shipped between 2022‑05 and 2023‑03?\" | \n", + "Misdiagnosing field failures due to firmware drift | \n", + "
\"Which robot‑controller software revision ran on Assembly Line 7 during Lot 8421?\" | \n", + "Root‑cause analysis may blame the wrong software revision | \n", + "|
\"What torque specification applied to steering‑column bolts in builds produced in May 2024?\" | \n", + "Safety recall may miss affected vehicles | \n", + "
\n", + " Builds on Graphiti's prompt design to identify temporal spans and episodic context without requiring auxiliary reference statements.\n", + "
\n", + "\n", + " Introduces bidirectionality checks and constrains comparisons by episodic type. This retains Zep's non-lossy approach while reducing unnecessary evaluations.\n", + "
\n", + "\n",
+ " Differentiates between Fact
, Opinion
, Prediction
, as well as between temporal classes Static
, Dynamic
, Atemporal
.\n",
+ "
\n", + " Handles compound sentences and nested date references in a single pass.\n", + "
\n", + "\n", + " Labels each statement as Atemporal, Static, or Dynamic:\n", + "
\n", + "\n", + " Identifies relative or partial dates (e.g., “Tuesday”, “three months ago”) and resolves them to an absolute date using the document timestamp or fallback heuristics (e.g., default to the 1st or last of the month if only the month is known).\n", + "
\n", + "\n",
+ " Ensures every statement includes a t_created
timestamp and, when applicable, a t_expired
timestamp. The agent then compares the candidate triplet to existing knowledge graph entries to:\n",
+ "
t_invalid
invalidated_by
\n", + " Maximize correctness and reduce prompt-debug time while you build out the core pipeline logic.\n", + "
\n", + "\n", + " Once prompts and logic are stable, switch to smaller variants for lower latency and cost-effective inference.\n", + "
\n", + "\n", + " Use OpenAI's Model Distillation to train smaller models with high-quality outputs from a larger 'teacher' model such as GPT-4.1, preserving (or even improving) performance relative to GPT-4.1.\n", + "
\n", + "definition
\n", + " Provides a concise description of what the label represents. It establishes the conceptual boundaries of the statement or temporal type and ensures consistency in interpretation across examples.\n", + "
\n", + "date_handling_guidance
\n",
+ " Explains how to interpret the temporal validity of a statement associated with the label. It describes how the valid_at
and invalid_at
dates should be derived when processing instances of that label.\n",
+ "
date_handling_examples
\n", + " Includes illustrative examples of how real-world statements would be labelled and temporally annotated under this label. These will be used as few-shot examples to the LLMs downstream.\n", + "
\n", + "\n", + " Extract statements that can stand on their own as complete subject-predicate-object expressions without relying on surrounding context.\n", + "
\n", + "\n", + " Break down complex or compound sentences into minimal, indivisible factual units, each expressing a single relationship.\n", + "
\n", + "\n", + " Replace pronouns or abstract references (e.g., \"he\" or \"The Company\") with specific entities (e.g., \"John Smith\", \"AMD\") using the main subject for disambiguation.\n", + "
\n", + "\n", + " Retain explicit dates, durations, and quantities to anchor each fact precisely in time and scale.\n", + "
\n", + "\n",
+ " Every statement is annotated with a StatementType
and a TemporalType
.\n",
+ "
\n",
+ " We instruct the assistant to behave like a domain expert in finance and clearly define the two subtasks: (i) extracting atomic, declarative statements, and (ii) labelling each with a statement_type
and a temporal_type
.\n",
+ "
\n", + " The rules for extraction help to enforce consistency and clarity. Statements must:\n", + "
\n", + "\n",
+ " The {% if definitions %}
block makes it easy to inject structured definitions such as statement categories, temporal types, and domain-specific terms.\n",
+ "
\n", + " We provide an annotated example chunk and the corresponding JSON output to demonstrate to the model how it should behave.\n", + "
\n", + "\n",
+ " The prompt instructs our model to determine when a statement became true (valid_at
) and optionally when it stopped being true (invalid_at
).\n",
+ "
\n",
+ " By dynamically incorporating {{ inputs.temporal_type }}
and {{ inputs.statement_type }}
, the prompt guides the model in interpreting temporal nuances based on the nature of each statement (like distinguishing facts from predictions or static from dynamic contexts).\n",
+ "
\n", + " To maintain clarity and consistency, the prompt requires all dates to be converted into standardized ISO 8601 date-time formats, normalized to UTC. It explicitly anchors relative expressions (like \"last quarter\") to known publication dates, making temporal information precise and reliable.\n", + "
\n", + "\n", + " Recognizing the practical need for quarter-based reasoning common in business and financial contexts, the prompt can interpret and calculate temporal ranges based on business quarters, minimizing ambiguity.\n", + "
\n", + "\n",
+ " Specific rules ensure the semantic integrity of statements—for example, opinions might only have a start date (valid_at
) reflecting the moment they were expressed, while predictions will clearly define their forecast window using an end date (invalid_at
).\n",
+ "
\n",
+ " The agent is specifically instructed to ignore temporal relationships, as these are captured separately within the TemporalValidityRange
.\n",
+ " Defined Predicates
are deliberately designed to be time-neutral—for instance, HAS_A
covers both present (HAS_A
) and past (HAD_A
) contexts.\n",
+ "
\n",
+ " The prompt yields structured RawExtraction
outputs, supported by detailed examples that clearly illustrate:\n",
+ "
Statement
Entities
with corresponding Triplets
values
Triplets
involving the same Entity
primary
and secondary
—for event comparison. The assessment checks if the primary
event is invalidated by the secondary
event.primary
and secondary
) include timestamp details (valid_at
and invalid_at
) along with semantic context through either Statement
, Triplet
, or both. This context ensures accurate and relevant comparisons.\n",
+ " We start by identifying when events happen with get_incoming_temporal_bounds()
. This function checks the event's valid_at
and, if it's dynamic, its invalid_at
. Atemporal events aren't included here.\n",
+ "
\n",
+ " We use select_events_temporally()
to filter events by:\n",
+ "
invalid_at
, or events with various overlaps.\n",
+ " Then, filter_by_embedding_similarity()
compares events based on semantic similarity:\n",
+ "
_similarity_threshold = 0.5
) are filtered out._top_k = 10
).\n",
+ " With select_temporally_relevant_events_for_invalidation()
, we:\n",
+ "
\n",
+ " The LLM-based invalidation_step()
(powered by GPT-4.1-mini) determines whether the incoming event invalidates another event:\n",
+ "
invalid_at
to match the secondary event's valid_at
.expired_at
with the current timestamp.invalidated_by
with the ID of the secondary event.\n",
+ " We use bi_directional_event_invalidation()
to check:\n",
+ "
\n",
+ " Lastly, resolve_duplicate_invalidations()
ensures clean invalidation:\n",
+ "
\n", + " | id | \n", + "text | \n", + "company | \n", + "date | \n", + "quarter | \n", + "
---|---|---|---|---|---|
0 | \n", + "f2f5aa4c-ad2b-4ed5-9792-bcbddbc4e207 | \n", + "\\n\\nRefinitiv StreetEvents Event Transcript\\nE... | \n", + "NVDA | \n", + "2020-08-19T00:00:00 | \n", + "Q2 2021 | \n", + "
1 | \n", + "74d42583-b614-4771-80c8-1ddf964a4f1c | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2016-07-21T00:00:00 | \n", + "Q2 2016 | \n", + "
2 | \n", + "26e523aa-7e15-4741-986a-6ec0be034a33 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2016-11-10T00:00:00 | \n", + "Q3 2017 | \n", + "
3 | \n", + "74380d19-203a-48f6-a1c8-d8df33aae362 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2018-05-10T00:00:00 | \n", + "Q1 2019 | \n", + "
4 | \n", + "7d620d30-7b09-4774-bc32-51b00a80badf | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2017-07-25T00:00:00 | \n", + "Q2 2017 | \n", + "
5 | \n", + "1ba2fc55-a121-43d4-85d7-e221851f2c7f | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2017-01-31T00:00:00 | \n", + "Q4 2016 | \n", + "
6 | \n", + "db1925df-b5a5-4cb2-862b-df269f53be7e | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2017-11-09T00:00:00 | \n", + "Q3 2018 | \n", + "
7 | \n", + "fe212bc0-9b3d-44ed-91ca-bfb856b21aa6 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2019-02-14T00:00:00 | \n", + "Q4 2019 | \n", + "
8 | \n", + "7c0a6f9c-9279-4714-b25e-8be20ae8fb99 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2019-04-30T00:00:00 | \n", + "Q1 2019 | \n", + "
9 | \n", + "10f95617-e5b2-4525-a207-cec9ae9a3211 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2019-01-29T00:00:00 | \n", + "Q4 2018 | \n", + "
10 | \n", + "aab926b2-5a23-4b39-a29c-c1e7ceef5a55 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2020-04-28T00:00:00 | \n", + "Q1 2020 | \n", + "
11 | \n", + "6d45f413-3aa5-4c76-b3cf-d0fdb0a03787 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2019-08-15T00:00:00 | \n", + "Q2 2020 | \n", + "
12 | \n", + "ad10e284-d209-42f1-8a7c-8c889af0914e | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2019-10-29T00:00:00 | \n", + "Q3 2019 | \n", + "
13 | \n", + "a30da2d4-3327-432e-9ce0-b57795a0fe26 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2018-04-25T00:00:00 | \n", + "Q1 2018 | \n", + "
14 | \n", + "038e0986-a689-4374-97d2-651b05bdfae8 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2018-11-15T00:00:00 | \n", + "Q3 2019 | \n", + "
15 | \n", + "6ff24a98-ad3b-4013-92eb-45ac5b0f214d | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2016-02-17T00:00:00 | \n", + "Q4 2016 | \n", + "
16 | \n", + "34d010f1-7221-4ed4-92f4-c69c4a3fd779 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2020-02-13T00:00:00 | \n", + "Q4 2020 | \n", + "
17 | \n", + "e5e31dd4-2587-40af-8f8c-56a772831acd | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2017-10-24T00:00:00 | \n", + "Q3 2017 | \n", + "
18 | \n", + "60e56971-9ab8-4ebd-ac2a-e9fce301ca33 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2016-08-11T00:00:00 | \n", + "Q2 2017 | \n", + "
19 | \n", + "1d4b2c13-4bf0-4c0f-90fe-a48c6e03c73a | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2018-08-16T00:00:00 | \n", + "Q2 2019 | \n", + "
20 | \n", + "b6b5df13-4736-4ecd-9c41-cf62f4639a4a | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2016-04-21T00:00:00 | \n", + "Q1 2016 | \n", + "
21 | \n", + "43094307-3f8f-40a2-886b-f4f1da64312c | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2017-05-01T00:00:00 | \n", + "Q1 2017 | \n", + "
22 | \n", + "e6902113-4b71-491d-b7de-8ff347b481cd | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2018-07-25T00:00:00 | \n", + "Q2 2018 | \n", + "
23 | \n", + "dbaa7a7c-1db2-4b0c-9130-8ca48f10be6f | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2017-02-09T00:00:00 | \n", + "Q4 2017 | \n", + "
24 | \n", + "6ec75a2d-d449-4f52-bb93-17b1770dbf6c | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2018-02-08T00:00:00 | \n", + "Q4 2018 | \n", + "
25 | \n", + "bcf360a8-0784-4c31-8a09-ca824a26264f | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2017-05-09T00:00:00 | \n", + "Q1 2018 | \n", + "
26 | \n", + "01d2252f-10a2-48f7-8350-ffe17bb8e18d | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2016-05-12T00:00:00 | \n", + "Q1 2017 | \n", + "
27 | \n", + "d4c10451-d7b2-4c13-8f15-695596e49144 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2016-10-20T00:00:00 | \n", + "Q3 2016 | \n", + "
28 | \n", + "6c832314-d5ef-42cd-9fa0-914c5480d7be | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2016-01-19T00:00:00 | \n", + "Q4 2015 | \n", + "
29 | \n", + "1207115e-20ed-479c-a903-e28dfda52ebd | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2018-01-30T00:00:00 | \n", + "Q4 2017 | \n", + "
30 | \n", + "259fe893-9d28-4e4d-bc55-2edf646e150b | \n", + "\\n\\nRefinitiv StreetEvents Event Transcript\\nE... | \n", + "AMD | \n", + "2020-07-28T00:00:00 | \n", + "Q2 2020 | \n", + "
31 | \n", + "02b1212b-cd3f-4c19-8505-8d1aea6d3ae2 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "NVDA | \n", + "2020-05-21T00:00:00 | \n", + "Q1 2021 | \n", + "
32 | \n", + "fa199b2c-1f58-4663-af8c-29c531fc97d6 | \n", + "\\n\\nThomson Reuters StreetEvents Event Transcr... | \n", + "AMD | \n", + "2019-07-30T00:00:00 | \n", + "Q2 2019 | \n", + "
\n", + " A planner utilising GPT 4.1 will decompose the user's question into a small sequence of proposed graph operations. This is then passed to the orchestrator to execute\n", + "
\n", + "\n", + " Considering the user query and the plan, the Orchestrator (o4-mini) makes a series of initial tool calls to retrieve information from the knowledge graph\n", + "
\n", + "\n", + " The responses to the tool calls are fed back to the Orchestrator which can then decide to either make more queries to the graph or answer the user's question\n", + "
\n", + "\n", + " The planner outlines the concrete subtasks the downstream agentic blocks should execute. The tasks are phrased in an action-orientated sense such as \"1. Extract information on R&D activities of Company IJK between 2018–2020.\" These planners are typically preferred when the goal is mostly deterministic and the primary risk is skipping or duplicating work.\n", + "
\n", + "\n", + " Example tasks where this approach is useful:\n", + "
\n", + "\n", + " The plan is framed as a set of hypotheses the system can confirm, reject, or refine in response to the user's question. Each step represents a testable claim, optionally paired with suggested actions. This approach excels in open-ended research tasks where new information can significantly reshape the solution space.\n", + "
\n", + "\n", + " Example tasks where this approach is useful:\n", + "
\n", + "\n",
+ " If the provided predicate is not found in the PREDICATE_DEFINITIONS
dictionary, this step uses GPT-4.1-nano to coerce it into a valid predicate.\n", + "
+ "
\n", + " Performs fuzzy matching to identify the corresponding entity nodes within the networkx graph\n", + "
\n", + "\n", + " Retrieves both inbound and outbound edges associated with the identified entity nodes\n", + "
\n", + "\n", + " Structures the collected information into a well-formatted response that is easy for the orchestrator to consume\n", + "
\n", + "\n", + " The traditional baseline for retrieval evaluation: a curated set of query → gold answer pairs,\n", + " vetted by domain experts. \n", + " Metrics such as precision@k or recall@k are computed by matching retrieved passages\n", + " against these gold spans.\n", + "
\n", + "\n",
+ " Pros: Highest reliability, clear pass/fail thresholds, excellent for regression testing
\n",
+ " Cons: Expensive to create, slow to update, narrow coverage (quickly becomes stale\n",
+ " when the knowledge base evolves)\n",
+ "
\n", + " Use an LLM to generate reference answers or judgments, enabling rapid, low-cost expansion\n", + " of the evaluation set. Three common pathways:\n", + "
\n", + "\n",
+ " Pros: Fast, infinitely scalable, easier to keep pace with a dynamic application specification
\n",
+ " Cons: Judgement quality is typically of lower quality than expert human-annotated solutions\n",
+ "
\n", + " Collect ratings directly from end-users or domain reviewers (thumbs-up/down, five-star scores, pairwise\n", + " comparisons). Can be in-the-loop (model trains continuously on live feedback) or\n", + " offline (periodic eval rounds).\n", + "
\n", + "\n",
+ " Pros: Captures real-world utility, surfaces edge-cases synthetic tests miss
\n",
+ " Cons: Noisy and subjective; requires thoughtful aggregation (e.g., ELO\n",
+ " scoring), risk of user biases becoming incorporated in the model\n",
+ "
\n", + " Appendix section A.1. \"Storing and Retrieving High-Volume Graph Data\"\n", + "
\n", + "\n", + " Manage scalability through thoughtful schema design, sharding, and partitioning. Clearly define entities, relationships, and ensure schema flexibility for future evolution. Use high-cardinality fields like timestamps for efficient data partitioning.\n", + "
\n", + "\n", + " Appendix section A.1.2. \"Temporal Validity & Versioning\"\n", + "
\n", + "\n", + " Include temporal markers (valid_from, valid_to) for each statement. Maintain historical records non-destructively by marking outdated facts as inactive and indexing temporal fields for efficient queries.\n", + "
\n", + "\n", + " Appendix section A.1.3. \"Indexing & Semantic Search\"\n", + "
\n", + "\n", + " Utilize B-tree indexes for efficient temporal querying. Leverage PostgreSQL’s pgvector extension for semantic search with approximate nearest-neighbor algorithms like ivfflat, ivfpq, and hnsw to optimize query speed and memory usage.\n", + "
\n", + "\n", + " Appendix section A.2. \"Managing and Pruning Datasets\"\n", + "
\n", + "\n", + " Establish TTL and archival policies for data retention based on source reliability and relevance. Implement automated archival tasks and intelligent pruning with relevance scoring to optimize graph size.\n", + "
\n", + "\n", + " Appendix section A.3. \"Implementing Concurrency in the Ingestion Pipeline\"\n", + "
\n", + "\n", + " Implement batch processing with separate, scalable pipeline stages for chunking, extraction, invalidation, and entity resolution. Optimize throughput and parallelism to manage ingestion bottlenecks.\n", + "
\n", + "\n", + " Appendix section A.4. \"Minimizing Token Cost\"\n", + "
\n", + "\n", + " Use caching strategies to avoid redundant API calls. Adopt service tiers like OpenAI's flex option to reduce costs and replace expensive model queries with efficient embedding and nearest-neighbor search.\n", + "
\n", + "\n", + " Appendix section A.5. \"Scaling and Productionizing our Retrieval Agent\"\n", + "
\n", + "\n", + " Use a controller and traversal workers architecture to handle multi-hop queries. Implement parallel subgraph extraction, dynamic traversal with chained reasoning, caching, and autoscaling for high performance.\n", + "
\n", + "\n", + " Appendix section A.6. \"Safeguards\"\n", + "
\n", + "\n", + " Deploy multi-layered output verification, structured logging, and monitoring to ensure data integrity and operational reliability. Track critical metrics and perform regular audits.\n", + "
\n", + "\n", + " Appendix section A.7. \"Prompt Optimization\"\n", + "
\n", + "\n", + " Optimize LLM interactions with personas, few-shot prompts, chain-of-thought methods, dynamic context management, and automated A/B testing of prompt variations for continuous performance improvement.\n", + "
\n", + "