This document provides a deeper dive into the technical implementation, Looker-specific SQL considerations, and design choices for the langchain-looker-agent.
- `LookerSQLDatabase` class (`src/langchain_looker_agent/agent.py`):
  - Purpose: Acts as a Pythonic wrapper around the Looker Open SQL Interface (OSQI) using its Avatica-based JDBC driver. It mimics the interface of `langchain_community.utilities.SQLDatabase` to be compatible with LangChain's SQL agent tooling.
  - Connectivity (`_connect`):
    - Uses `jaydebeapi` and `JPype1` to establish a JDBC connection.
    - Requires the Looker Avatica JDBC driver JAR (e.g., `avatica-<version>-looker.jar`).
    - Driver Class: `org.apache.calcite.avatica.remote.looker.LookerDriver`.
    - JDBC URL Format: `jdbc:looker:url=https://<your_looker_instance_url>`.
    - Authentication: Uses Looker API3 Client ID (as JDBC `user`) and Client Secret (as JDBC `password`).
  - Dialect (`dialect` property): Returns `"calcite"` to inform the LLM, as Looker's OSQI uses a Calcite SQL parser.
  - Metadata Retrieval:
    - Relies on the standard `java.sql.DatabaseMetaData` interface, accessed via `connection.jconn.getMetaData()`.
    - `get_usable_table_names()`: Calls `DatabaseMetaData.getTables()` using the `lookml_model_name` (provided at initialization) as the `schemaPattern` to list available Looker Explores (which are treated as queryable "tables").
    - `get_table_info()`:
      - For each Explore, calls `DatabaseMetaData.getColumns()` (again, using `lookml_model_name` as `schemaPattern` and Explore name as `tableNamePattern`) to fetch field details.
      - Extracts standard JDBC column info (`COLUMN_NAME`, `TYPE_NAME`).
      - Crucially, it also extracts Looker-specific metadata columns returned by this driver, such as `HIDDEN`, `FIELD_LABEL`, `FIELD_ALIAS`, `FIELD_DESCRIPTION`, and `FIELD_CATEGORY`.
      - Filters out fields where `HIDDEN` is true.
      - Formats the schema as a `CREATE TABLE` string using Looker's required backtick notation: `` CREATE TABLE `model_name`.`explore_name` ( ... ) ``.
      - Enriches column definitions with Looker metadata as SQL comments: `` `view_name.field_name` VARCHAR -- label: 'User-Friendly Label'; category: DIMENSION; description: '...' ``.
      - Optionally fetches sample rows by executing a `SELECT` query for the first few visible columns with a `LIMIT` clause (avoids problematic `SELECT *`).
  - Query Execution (`run` and `_run_query_internal`):
    - Takes a SQL command string.
    - Pre-processing: Automatically strips trailing semicolons (`;`) and common markdown code fences (triple backticks, optionally tagged `sql`) from the input command before execution, as the Looker JDBC driver expects single statements without these.
    - Executes the cleaned SQL using a `jaydebeapi` cursor.
    - Formats results (column names and rows) into a string for the LLM.
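The connectivity, schema-formatting, and pre-processing steps described for `LookerSQLDatabase` can be sketched roughly as follows. This is a minimal illustration and not the package's actual code: the function names `connect`, `clean_sql`, and `format_create_table` are hypothetical, and the sketch assumes column metadata has already been read from `DatabaseMetaData.getColumns()` into plain Python dicts with boolean `HIDDEN` values.

```python
import re
from typing import Dict, List

DRIVER_CLASS = "org.apache.calcite.avatica.remote.looker.LookerDriver"
FENCE = "`" * 3  # markdown code-fence marker


def connect(instance_url: str, client_id: str, client_secret: str, jar_path: str):
    """Open a JDBC connection to Looker's Open SQL Interface (needs a JVM)."""
    import jaydebeapi  # imported lazily so the helpers below work without Java

    jdbc_url = f"jdbc:looker:url={instance_url}"
    # The API3 Client ID and Client Secret act as the JDBC user and password.
    return jaydebeapi.connect(
        DRIVER_CLASS, jdbc_url, [client_id, client_secret], jars=jar_path
    )


def clean_sql(command: str) -> str:
    """Strip markdown code fences and trailing semicolons from LLM output."""
    sql = command.strip()
    sql = re.sub(rf"^{FENCE}[A-Za-z]*\s*", "", sql)  # opening fence, optional tag
    sql = re.sub(rf"\s*{FENCE}$", "", sql)           # closing fence
    return sql.strip().rstrip(";").strip()


def format_create_table(model: str, explore: str, columns: List[Dict[str, object]]) -> str:
    """Render the visible fields of an Explore as a commented CREATE TABLE string."""
    visible = [col for col in columns if not col.get("HIDDEN")]
    lines = []
    for i, col in enumerate(visible):
        notes = []
        if col.get("FIELD_LABEL"):
            notes.append(f"label: '{col['FIELD_LABEL']}'")
        if col.get("FIELD_CATEGORY"):
            notes.append(f"category: {col['FIELD_CATEGORY']}")
        if col.get("FIELD_DESCRIPTION"):
            notes.append(f"description: '{col['FIELD_DESCRIPTION']}'")
        separator = "," if i < len(visible) - 1 else ""
        line = f"  `{col['COLUMN_NAME']}` {col['TYPE_NAME']}{separator}"
        if notes:
            line += " -- " + "; ".join(notes)
        lines.append(line)
    return f"CREATE TABLE `{model}`.`{explore}` (\n" + "\n".join(lines) + "\n)"
```

Keeping `clean_sql` and `format_create_table` free of JDBC calls means they can be unit-tested without a Looker instance, while `connect` is the only piece that requires the driver JAR.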
- `LookerSQLToolkit(BaseToolkit)`:
  - Takes an instance of `LookerSQLDatabase`.
  - Uses `Tool.from_function` to create LangChain `Tool` objects for:
    - `sql_db_list_tables` (wraps `db.get_usable_table_names`, joins list into a string).
    - `sql_db_schema` (wraps `db.get_table_info` via a helper `_get_table_info_wrapper` to handle string parsing of table names).
    - `sql_db_query` (wraps `db.run`).
  - Tool descriptions are carefully crafted to guide the LLM on input format and Looker specifics (e.g., Explores, backticks, no semicolons).
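A rough sketch of how the three tools might wrap the database methods is shown below. The helper names (`list_tables`, `table_info_from_string`, `build_tools`), the `LookerDB` protocol, and the tool descriptions are illustrative assumptions rather than the toolkit's real code. The sketch also assumes `get_table_info` accepts a list of Explore names, as LangChain's `SQLDatabase.get_table_info` does.

```python
from typing import List, Protocol


class LookerDB(Protocol):
    """The subset of the database wrapper that the tools rely on."""

    def get_usable_table_names(self) -> List[str]: ...
    def get_table_info(self, table_names: List[str]) -> str: ...
    def run(self, command: str) -> str: ...


def list_tables(db: LookerDB, _tool_input: str = "") -> str:
    """Join the Explore names into one comma-separated string for the LLM."""
    return ", ".join(db.get_usable_table_names())


def table_info_from_string(db: LookerDB, table_names: str) -> str:
    """Parse a comma-separated string of Explore names, then fetch their schema."""
    names = [n.strip().strip("`") for n in table_names.split(",") if n.strip()]
    return db.get_table_info(names)


def build_tools(db: LookerDB):
    """Wrap the helpers as LangChain tools (requires langchain-core)."""
    from langchain_core.tools import Tool  # lazy import keeps the helpers testable

    return [
        Tool.from_function(
            func=lambda tool_input="": list_tables(db, tool_input),
            name="sql_db_list_tables",
            description="Input is an empty string. Returns the available Looker Explores.",
        ),
        Tool.from_function(
            func=lambda names: table_info_from_string(db, names),
            name="sql_db_schema",
            description="Input is a comma-separated list of Explore names. Returns their fields.",
        ),
        Tool.from_function(
            func=db.run,
            name="sql_db_query",
            description="Input is one SELECT statement with backticked identifiers and no semicolon.",
        ),
    ]
```

Because ReAct agents pass a single string as the action input, the string parsing lives in the wrapper rather than in the database class.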
- `create_looker_sql_agent(llm, toolkit, ...)`:
  - A factory function to create a LangChain ReAct agent.
  - Prompt Engineering: This is critical. It combines:
    - `LOOKER_SQL_SYSTEM_INSTRUCTIONS_TEMPLATE`: Detailed instructions to the LLM about the Calcite SQL dialect, Looker's data structure (Model as schema, Explore as table, `view.field` as column), mandatory backtick syntax for all identifiers, use of `AGGREGATE()` for LookML measures, restrictions (no `JOIN`s, subqueries, window functions, DML), and the "no semicolon" rule. It also guides the LLM on how to use the schema information provided by the `sql_db_schema` tool to construct valid queries.
    - `REACT_CORE_PROMPT_STRUCTURE`: A standard template for the ReAct agent's operational loop (Thought, Action, Action Input, Observation).
  - Uses `langchain.agents.create_react_agent` with the combined prompt and the tools from `LookerSQLToolkit`.
  - Returns an `AgentExecutor`, which can be configured with memory (e.g., `ConversationBufferMemory`).
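To illustrate how the two prompt parts might be combined, here is a minimal sketch. `SYSTEM_INSTRUCTIONS` and `REACT_STRUCTURE` are heavily abbreviated stand-ins for the package's real templates, and `build_prompt_text` and `build_agent` are hypothetical names. The ReAct portion follows the commonly used ReAct prompt layout, which `create_react_agent` expects to contain the `tools`, `tool_names`, and `agent_scratchpad` variables.

```python
# Abbreviated stand-in for the Looker-specific system instructions (assumption).
SYSTEM_INSTRUCTIONS = (
    "You query Looker through its Open SQL Interface. The SQL dialect is Calcite.\n"
    "The LookML model is the schema, each Explore is a table, and each column is "
    "named `view.field`.\n"
    "Enclose every identifier in backticks and wrap LookML measures in AGGREGATE().\n"
    "Do not use JOINs, subqueries, window functions, DML, or trailing semicolons."
)

# Stand-in for the ReAct loop structure (Thought, Action, Action Input, Observation).
REACT_STRUCTURE = """You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}"""


def build_prompt_text() -> str:
    """Place the Looker instructions ahead of the ReAct loop structure."""
    return SYSTEM_INSTRUCTIONS + "\n\n" + REACT_STRUCTURE


def build_agent(llm, tools):
    """Create a ReAct agent and executor (requires langchain and langchain-core)."""
    from langchain.agents import AgentExecutor, create_react_agent
    from langchain_core.prompts import PromptTemplate

    prompt = PromptTemplate.from_template(build_prompt_text())
    agent = create_react_agent(llm, tools, prompt)
    # handle_parsing_errors lets the agent recover from malformed LLM output.
    return AgentExecutor(agent=agent, tools=tools, handle_parsing_errors=True)
```

Note that the instruction text must not contain literal curly braces, since `PromptTemplate.from_template` would treat them as input variables.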
The LangChain agent (and the LLM it uses) must generate SQL that adheres to the following specifics of Looker's OSQI (Avatica/Calcite):
- Query Type: Only `SELECT` statements are supported.
- Identifiers (Backticks `` ` ``):
  - LookML Model Name (Schema): `` `your_model_name` ``
  - LookML Explore Name (Table): `` `your_explore_name` ``
  - LookML Field Name (Column): `` `view_name.field_name` ``
  - All these identifiers MUST be enclosed in backticks in SQL queries.
  - FROM Clause: `` FROM `model_name`.`explore_name` ``
- LookML Measures:
  - Must be queried using the `` AGGREGATE(`view_name.measure_name`) `` function.
  - The `view_name` within `AGGREGATE()` must be the view where the measure is defined.
  - Measures (fields wrapped in `AGGREGATE()`) cannot be used in a `GROUP BY` clause. Only dimensions can.
- Standard SQL Aggregates: Functions like `COUNT(*)`, `` SUM(`dimension_name`) ``, `` AVG(`dimension_name`) `` can be used on dimension fields. Do not wrap dimensions in `AGGREGATE()`.
- Joins: No explicit `JOIN` operators (e.g., `LEFT JOIN`, `INNER JOIN`). Joins between views are pre-defined within the LookML Explores. Query fields from all joined views as if they are part of one large table for the given Explore.
- Unsupported SQL Features:
  - Subqueries (nested `SELECT` statements).
  - SQL window functions (e.g., `ROW_NUMBER() OVER (...)`, `RANK()`, `LAG()`).
  - DML (`INSERT`, `UPDATE`, `DELETE`) and DDL (`CREATE TABLE`, etc.).
- Semicolons (`;`): SQL statements sent programmatically via JDBC should NOT end with a semicolon. The `LookerSQLDatabase` class attempts to strip these.
- Filters (`always_filter`, `conditionally_filter`, Filter-Only Fields): The Looker documentation details how Explores with these LookML parameters require corresponding `WHERE`/`HAVING` clauses or special JSON syntax for filter-only fields. While this agent prototype doesn't explicitly build logic to automatically satisfy these, the LLM might be able to construct valid queries if given enough context or if it learns from errors. Queries failing due to unsatisfied mandatory filters are a known limitation if not handled by LLM prompting or by querying Explores without such strict requirements.
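The rules above can be illustrated with an example query and a lightweight checker. The model, Explore, and field names in `EXAMPLE_QUERY` are made up, and `check_looker_sql` is a hypothetical, regex-based heuristic (not a SQL parser and not part of the package). It only shows how the listed restrictions translate into concrete checks.

```python
import re
from typing import List

# A query shaped the way the rules require (all names are illustrative).
EXAMPLE_QUERY = (
    "SELECT `users.state`, AGGREGATE(`orders.total_revenue`) "
    "FROM `ecommerce`.`order_items` "
    "WHERE `orders.status` = 'complete' "
    "GROUP BY `users.state` "
    "LIMIT 10"
)


def check_looker_sql(sql: str) -> List[str]:
    """Return a list of rule violations found by simple pattern matching."""
    issues = []
    # Blank out string literals and backticked identifiers so that keywords
    # appearing inside them are not mistaken for SQL syntax.
    bare = re.sub(r"'[^']*'|`[^`]*`", " ", sql)
    if not re.match(r"\s*SELECT\b", bare, re.IGNORECASE):
        issues.append("only SELECT statements are supported")
    if bare.rstrip().endswith(";"):
        issues.append("remove the trailing semicolon")
    if re.search(r"\bJOIN\b", bare, re.IGNORECASE):
        issues.append("explicit JOINs are not supported")
    if re.search(r"\bOVER\s*\(", bare, re.IGNORECASE):
        issues.append("window functions are not supported")
    if len(re.findall(r"\bSELECT\b", bare, re.IGNORECASE)) > 1:
        issues.append("subqueries are not supported")
    if not re.search(r"\bFROM\s+`[^`]+`\.`[^`]+`", sql, re.IGNORECASE):
        issues.append("FROM must reference `model`.`explore` in backticks")
    # Look only at the GROUP BY clause, up to the next clause or end of query.
    group_by = re.search(
        r"\bGROUP\s+BY\b(.*?)(?=\bHAVING\b|\bORDER\s+BY\b|\bLIMIT\b|$)",
        bare,
        re.IGNORECASE | re.DOTALL,
    )
    if group_by and re.search(r"\bAGGREGATE\s*\(", group_by.group(1), re.IGNORECASE):
        issues.append("measures wrapped in AGGREGATE() cannot appear in GROUP BY")
    return issues
```

A check like this could run before `sql_db_query` so that the agent receives a readable hint instead of a driver error. It does not cover mandatory filters, which depend on the LookML of each Explore.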
A key design decision for this prototype was to implement a custom `LookerSQLDatabase` class that mimics the interface of LangChain's `SQLDatabase` utility, rather than attempting to create a new SQLAlchemy dialect for Looker's Open SQL Interface. The primary reasons for this approach are:
- Nature of Looker's SQL Interface: It's an abstraction layer accessed via a specific JDBC driver, not a standard relational database directly supported by Python DB-API v2.0 drivers that SQLAlchemy typically uses. We use `JayDeBeApi` to bridge Python to Java's JDBC.
- Complexity of Full SQLAlchemy Dialect: Creating a SQLAlchemy dialect is a significant undertaking, requiring deep knowledge of SQLAlchemy internals and complex mapping for SQL compilation, type handling, and introspection. This complexity is largely unnecessary for an agent primarily executing LLM-generated SQL strings.
- LangChain's `SQLDatabase` Utility: This provides a simpler, targeted interface (methods for dialect, listing tables, getting schema, running queries) sufficient for LangChain SQL agents. Our custom class implements this interface.
- Focus and Reusability: This approach allows us to leverage existing LangChain agent infrastructure (`create_react_agent`, `AgentExecutor`) and tools (by wrapping our methods with `Tool.from_function`) with minimal friction.
- Minimal Dependencies: Avoids making SQLAlchemy a hard dependency for this specific Looker integration.