Malloy Project Overview

Malloy is a language for describing data relationships and transformations. It's both a semantic modeling language and a query language that uses existing SQL engines (BigQuery, Snowflake, PostgreSQL, MySQL, Trino, Presto, DuckDB) to execute queries. The project includes a VS Code extension for building Malloy data models and creating visualizations.

This is a monorepo managed with npm workspaces and Lerna, containing multiple interconnected packages that form the complete Malloy ecosystem.

Architecture Overview

Query Execution Pipeline

Malloy Source → Parser → AST → IR Generation → SQL Compilation → Database → Results → Renderer

Two-Phase Compilation Architecture

The Malloy compiler is split into two distinct parts:

  1. Translator (packages/malloy/src/lang/) - See packages/malloy/CONTEXT.md

    • Uses ANTLR-generated parser to create parse tree
    • Generates Abstract Syntax Tree (AST) from parse tree
    • Transforms AST into Intermediate Representation (IR)
    • IR is a serializable data format that fully describes the semantic model
  2. Compiler (packages/malloy/src/model/) - See packages/malloy/CONTEXT.md

    • Takes IR and translates it to SQL queries
    • Produces SQL + metadata needed to feed query results back into Malloy or render them with Malloy semantics.

Language Structures

Sources and Queries

At its simplest, a source is anything you can hand to a SQL database and get a schema back - either a table name or a SELECT statement. The initial "fields" of a source are the columns in that schema.

However, Malloy lets you extend sources by adding other types of fields:

  • Joins: Model the graph structure of data as a property of the source (not the query, unlike SQL)
  • Dimensions: Treated like columns, but are expressions referencing other columns or dimensions
  • Measures: Aggregate expressions like sum(x + y) computed from a set of rows
  • Calculations: Like measures, but implemented with window functions
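As a sketch (the table, column, and field names here are hypothetical), a source with these kinds of extensions might look like:

```malloy
source: flights is duckdb.table('flights.parquet') extend {
  // join modeled on the source, not on the query (unlike SQL)
  join_one: carriers with carrier
  // dimension: an expression over columns, treated like a column
  dimension: distance_km is distance * 1.609
  // measure: an aggregate computed from a set of rows
  measure: flight_count is count()
}
```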

Symmetric Aggregates

Malloy uses symmetric aggregates to handle joined data correctly. Aggregation paths like line_items.amount.sum() specify which grain to aggregate at. This lets you query normalized (joined) data as if it were denormalized and get correct results - Malloy avoids double-counting automatically.
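For example (a hypothetical model where orders has join_many: line_items), both aggregates below are computed at the correct grain even though the join fans out rows:

```malloy
run: orders -> {
  group_by: customer_id
  // counts each order once, even after the join fans out rows
  aggregate: order_count is count()
  // sums at the line_items grain - no double-counting
  aggregate: revenue is line_items.amount.sum()
}
```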

Nested Data Access

The Malloy language uses dotted path notation to access nested data. Nested data might actually be part of a row, through a record data type (or an array, or an array of records), or it might live in a separate table where the nesting is hidden by "normalizing" the nested portion of the data, which is then joined onto the current table. Unlike SQL, the access path to nested data is identical no matter which way the nesting is stored in the database.
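Assuming a hypothetical orders source where items is either a repeated-record column or a joined table, the access path is the same either way:

```malloy
run: orders -> {
  // identical path whether items is nested in the row or joined
  group_by: items.category
  aggregate: item_count is items.count()
}
```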

Annotations and Tags

Objects in Malloy (sources, queries, joins, measures, dimensions, group_by, aggregate, etc.) can have metadata attached via annotations.

Annotation syntax:

  • # marks the beginning of an annotation
  • An annotation continues to end-of-line
  • Annotations apply to objects declared below them
  • In block declarations, block-level annotations apply to all items, and each item can have its own
  • ## marks model-level annotations that apply to the entire model

Annotations are just text - the design intentionally leaves room for multiple DSLs. Each application extracts its annotations via pattern matching and defines its own syntax. For details on the Malloy Tag Language used for parsing annotations, see packages/malloy-tag/CONTEXT.md.
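A sketch of annotation placement (the tag names here are illustrative; their meaning belongs to whatever application parses them, such as the renderer):

```malloy
-- model-level annotation: applies to the entire model
## hypothetical_model_tag

-- annotation applying to the query declared below it
# bar_chart
query: by_carrier is flights -> {
  group_by: carrier
  # currency
  aggregate: revenue is ticket_price.sum()
}
```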

Data Model and Type System

Malloy Data Types

In the Malloy language, the data types are: string, boolean, number, timestamp, timestamptz, date, json, "sql native", array and record.

Malloy reads the schema of any table referenced and creates a StructDef whose fields[] array maps each column in the database to a Malloy type. Types not supported by Malloy become "sql native", which allows limited operations in the Malloy language.

Type System Hierarchy

The type system distinguishes between:

  • BasicAtomicType — Simple types whose TypeDef is fully described by just the type name: string | number | boolean | date | timestamp | timestamptz | json | sql native | error. The corresponding guard function is isBasicAtomicType().
  • AtomicTypeDef — Union of BasicAtomicTypeDef | BasicArrayTypeDef | RecordTypeDef | RepeatedRecordTypeDef. This is the general type for any atomic value including compound types.
  • Expression-only types — Types like null, error, duration, filter expression that arise during expression evaluation but never appear as column types in a table schema.

In the Malloy language, compound types can be written using the syntax type[] for arrays, {name :: type, ...} for records, and these nest arbitrarily: {x :: number, y :: string[]}[].

Field Types

  • Atomic Field: Can be stored in a single database column (includes arrays and records)
  • Basic Field: Atomic field with a single value (string, number, etc.)
  • Compound Field: Records, arrays, and arrays of records
  • Joined Field: References another SQL query or joined table (not stored in the current table)

Structure Definitions

  • StructDef: Any namespace-containing object (records, arrays, table schemas, query schemas)
  • SourceDef: A StructDef that can be used as query input (tables, queries, but not plain records/arrays)
  • FieldSpace: Used by translator to construct and comprehend StructDefs

Special Handling

  • Arrays are treated as records with one entry named "value" or "each" (SQL heritage)
  • Nested queries produce arrays of records, accessed via un-nested joins
  • Historical note: nested queries are called "turtles" in the source code; "turtle" was once their user-facing name
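A hypothetical nested query; each nest: result is an array-of-records field in the output:

```malloy
run: flights -> {
  group_by: carrier
  aggregate: flight_count is count()
  // produces an array of records (a "turtle") per carrier row
  nest: top_destinations is {
    group_by: destination
    aggregate: flight_count is count()
    limit: 3
  }
}
```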

Malloy Query Structure

A Malloy query consists of two main components:

  • Source: A SourceDef with a schema defined by a field list
  • Pipeline: An array of query operations (similar to SELECT statements with grouping/filtering)

Query execution flows through the pipeline: source → first operation → second operation → etc.

Since query output is table-shaped, a query can also be a source. This is how pipelining works: take a source, transform it with a query operation, and use that output as input to the next operation.
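A sketch of a two-stage pipeline (source and field names are hypothetical); each stage's table-shaped output feeds the next:

```malloy
run: flights
  -> {
    group_by: carrier
    aggregate: flight_count is count()
  }
  -> {
    select: carrier
    limit: 5
  }
```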

Multi-Database Support

The system uses a Dialect pattern where each database adapter implements database-specific SQL generation while sharing the same semantic model. Database connections are abstracted through a common Connection interface.

The SQL-writing portion of each Dialect is implemented in packages/malloy/dialect.

Database Adapters

Each database has its own package with connection handling and dialect-specific optimizations:

  • malloy-db-bigquery/ - Google BigQuery adapter
  • malloy-db-duckdb/ - DuckDB adapter (includes WASM support)
  • malloy-db-postgres/ - PostgreSQL adapter
  • malloy-db-mysql/ - MySQL adapter
  • malloy-db-snowflake/ - Snowflake adapter
  • malloy-db-trino/ - Trino/Presto adapter
  • malloy-db-publisher/ - Publishing/caching layer

Supporting Libraries

  • malloy-interfaces/ - TypeScript interfaces and Thrift-generated types
  • malloy-render/ - Data visualization and rendering (see packages/malloy-render/CONTEXT.md)
  • malloy-syntax-highlight/ - Language syntax highlighting
  • malloy-filter/ - Query filtering utilities
  • malloy-tag/ - Tagged template literal support (see packages/malloy-tag/CONTEXT.md)
  • malloy-query-builder/ - Programmatic query building
  • malloy-malloy-sql/ - SQL integration utilities

Package Dependencies

The packages form a dependency graph where:

  • malloy-interfaces is the foundation (no dependencies)
  • malloy depends on interfaces, filter, and tag packages
  • Database adapters depend on the core malloy package
  • malloy-render depends on core malloy packages for data processing
  • The root manages all packages through npm workspaces

When making changes, build order matters: interfaces → core → database adapters → render components.

Common Development Commands

Setup and Building

npm install                    # Install dependencies for all packages
npm run dev                    # Fast build: codegen + tsc (for iterating)
npm run build                  # Full build: codegen + tsc + flow types + render
npm run clean                  # Clean build artifacts from all packages
npm run watch                  # Watch for TypeScript changes across the repo

NOTE FOR TOOLS RUNNING BUILD: build output is long; save it to a file:

npm run build > /tmp/build.log 2>&1 && echo Build OK || (tail -50 /tmp/build.log; exit 1)

Dev vs Build

  • npm run dev — Runs codegen (ANTLR, peggy) then tsc --build for each package. This is the fast command you run repeatedly while debugging. It skips the vite render build since tests don't need it.
  • npm run build — Everything in dev, plus the vite render bundle. Run this when you need fully built packages (e.g. for npm link).

When to rebuild

If you're editing code and running tests in the same package, you don't need to rebuild — just run npx jest directly on the test file. Changes to .ts files are picked up by ts-jest.

If you make changes in a different package than the test (or you're running tests from test/ and change any package), run npm run dev at the repo root first. It's fast — codegen is content-hash cached and tsc is incremental.

Codegen and femto-build

Some packages have codegen steps that generate source files from grammars or configs:

  • packages/malloy — ANTLR4 parser from .g4 grammar files
  • packages/malloy-filter — Peggy parsers from .peggy grammar files
  • packages/malloy-malloy-sql — Peggy parsers from .pegjs grammar files
  • packages/malloy-render — Vite bundle from TypeScript/Solid sources

These use scripts/femto-build.js, a tiny content-hash-based build caching tool. Each package with codegen has a femto-config.motly with named targets specifying input globs and commands. femto-build hashes the inputs and skips the commands if nothing changed. Targets can depend on other targets via deps. This survives git operations (unlike Make's timestamp-based approach).

To add codegen to a new package: create a femto-config.motly in the package directory:

targetName: {
  inputs = ["src/grammar/*.g4"]
  commands = ["mkdir -p out", "tool -o out src/grammar/File.g4"]
}

dependent-target: {
  deps = [targetName]
  inputs = ["src/other/*.g4"]
  commands = ["tool -o out src/other/File.g4"]
}

Then add to package.json: "codegen": "node ../../scripts/femto-build.js targetName"

Testing

IMPORTANT: Malloy has a large test suite which cannot run on a development machine. A CI run is needed to fully verify a change.

NOTES ON TOOLS RUNNING TESTS:

DO NOT run npm run test without restrictions - it requires active database connections for every database, will take a very long time, and will never succeed.

NEVER run npm run test -- filename - this will also take a very long time and will never succeed.

Running Individual Tests

The typical path when working on a fix is to run just the one test file containing the test, and a test pattern to identify the test. For example, to run the translator's source test:

npx jest packages/malloy/src/lang/test/source.spec.ts -t "TEST NAME PATTERN"

Database-Specific Tests

Some tests loop over all testable databases (for example, all tests in test/src/databases/all). For these it is important to restrict the databases under test to one that is available. Most developers use duckdb:

MALLOY_DATABASE=duckdb npx jest test/src/databases/all/TEST.spec.ts -t "TEST NAME PATTERN"

Comprehensive Local Testing

The most comprehensive test you might run as a developer before letting CI build your code:

npm run test-duckdb  # Runs all tests, but only checks the duckdb dialect

Every developer can run this, and it makes a good sanity check.

Other Test Commands

npm run test-publisher        # Test with publisher database (all tests, publisher dialect only)
npm run ci-core              # CI: Core tests (malloy-core, malloy-render)
npm run ci-duckdb            # CI: DuckDB-specific tests
npm run ci-bigquery          # CI: BigQuery-specific tests
npm run ci-postgres          # CI: PostgreSQL-specific tests

For more details on test organization and infrastructure, see test/CONTEXT.md.

Code Quality

npm run lint                  # Run ESLint on all packages
npm run lint-fix              # Fix ESLint issues automatically

VS Code Integration

The VS Code extension is in a separate repository (malloy-vscode-extension), but this repo contains the language server and core functionality it depends on.

Copyright

For new files, this is the current correct copyright text (here in C/Java/Javascript style):

/*
 * Copyright Contributors to the Malloy project
 * SPDX-License-Identifier: MIT
 */

Commit and PR Guidelines

Do not include AI attribution (e.g., "Generated with Claude Code", "Co-Authored-By: Claude") in commits or pull requests.

Subsystem Context

For deeper context on specific subsystems, see the CONTEXT.md files referenced throughout this document: packages/malloy/CONTEXT.md, packages/malloy-render/CONTEXT.md, packages/malloy-tag/CONTEXT.md, and test/CONTEXT.md.

Maintaining the CONTEXT Tree

This repository uses the CONTEXT.md convention for LLM-friendly documentation.

The idea is that for any file of interest, an LLM can walk up the directory tree reading CONTEXT.md files to gather layered context - from specific to general - without loading all context files at once.

Verification command: "Read the CONTEXT tree and verify it is up to date"