Malloy is a language for describing data relationships and transformations. It's both a semantic modeling language and a query language that uses existing SQL engines (BigQuery, Snowflake, PostgreSQL, MySQL, Trino, Presto, DuckDB) to execute queries. The project includes a VS Code extension for building Malloy data models and creating visualizations.
This is a monorepo managed with npm workspaces and Lerna, containing multiple interconnected packages that form the complete Malloy ecosystem.
Malloy Source → Parser → AST → IR Generation → SQL Compilation → Database → Results → Renderer
The Malloy compiler is split into two distinct parts:
- Translator (`packages/malloy/src/lang/`) - see `packages/malloy/CONTEXT.md`
  - Uses the ANTLR-generated parser to create a parse tree
  - Generates an Abstract Syntax Tree (AST) from the parse tree
  - Transforms the AST into an Intermediate Representation (IR)
  - The IR is a serializable data format that fully describes the semantic model
- Compiler (`packages/malloy/src/model/`) - see `packages/malloy/CONTEXT.md`
  - Takes the IR and translates it to SQL queries
  - Produces SQL plus the metadata needed to feed query results back into Malloy or render them with Malloy semantics
At its simplest, a source is anything you can hand to a SQL database and get a schema back - either a table name or a SELECT statement. The initial "fields" of a source are the columns in that schema.
However, Malloy lets you extend sources by adding other types of fields:
- Joins: Model the graph structure of data as a property of the source (not the query, unlike SQL)
- Dimensions: Treated like columns, but are expressions referencing other columns or dimensions
- Measures: Aggregate expressions like `sum(x + y)`, computed from a set of rows
- Calculations: Like measures, but implemented with window functions
Malloy uses symmetric aggregates to handle joined data correctly. Aggregation paths like line_items.amount.sum() specify which grain to aggregate at. This lets you query normalized (joined) data as if it were denormalized and get correct results - Malloy avoids double-counting automatically.
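These source extensions and aggregation paths can be sketched in Malloy. Everything below (the table files, columns, and the `orders`/`line_items` names) is hypothetical, for illustration only:

```malloy
source: line_items is duckdb.table('line_items.parquet')

source: orders is duckdb.table('orders.parquet') extend {
  // The join is a property of the source, not of any one query
  join_many: line_items on line_items.order_id = id

  // Dimension: an expression over columns, usable like a column
  dimension: order_year is year(order_date)

  // Measure: the path line_items.amount.sum() aggregates at the
  // line_items grain, so the join cannot cause double-counting
  measure: total_amount is line_items.amount.sum()
}

run: orders -> { group_by: order_year; aggregate: total_amount }
```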
The Malloy language uses dotted path notation to access nested data. Nested data might actually be part of a row through a record data type (or an array, or an array of records), or it might live in a separate table where the nesting is hidden by "normalizing" the nested portion of the data, which is then joined onto the current table. Unlike SQL, the access path to nested data is identical no matter which way the nesting is stored in the database.
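A sketch of what this looks like, assuming a hypothetical `orders` source with nested `items` data:

```malloy
// Whether 'items' is a repeated-record column in the row or a separate
// normalized table joined onto orders, the query is written identically
run: orders -> {
  group_by: items.product
  aggregate: total_price is items.price.sum()
}
```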
Objects in Malloy (sources, queries, joins, measures, dimensions, group_by, aggregate, etc.) can have metadata attached via annotations.
Annotation syntax:
- `#` marks the beginning of an annotation
- An annotation continues to end-of-line
- Annotations apply to objects declared below them
- In block declarations, block-level annotations apply to all items, and each item can have its own
- `##` marks model-level annotations that apply to the entire model
Annotations are just text - the design intentionally leaves room for multiple DSLs. Each application extracts its annotations via pattern matching and defines its own syntax. For details on the Malloy Tag Language used for parsing annotations, see packages/malloy-tag/CONTEXT.md.
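A sketch of annotation placement (the tag names and the `flights` source are assumptions for illustration; `# bar_chart` is in the style of the renderer's tags):

```malloy
// '##' attaches a model-level annotation
## hypothetical_model_tag

// A '#' annotation applies to the object declared below it
# bar_chart
query: by_carrier is flights -> {
  group_by: carrier
  // Items inside a block can carry their own annotations
  # hypothetical_item_tag
  aggregate: flight_count is count()
}
```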
In the Malloy language, the data types are: string, boolean, number, timestamp, timestamptz, date, json, "sql native", array and record.
Malloy reads the schema of any table referenced and creates a StructDef whose fields[] array maps each database column to its Malloy type. Types not supported by Malloy become "sql native", which allows limited operation in the Malloy language.
The type system distinguishes between:
- `BasicAtomicType`: Simple types whose TypeDef is fully described by just the type name: `string | number | boolean | date | timestamp | timestamptz | json | sql native | error`. The corresponding guard function is `isBasicAtomicType()`.
- `AtomicTypeDef`: Union of `BasicAtomicTypeDef | BasicArrayTypeDef | RecordTypeDef | RepeatedRecordTypeDef`. This is the general type for any atomic value, including compound types.
- Expression-only types: Types like `null`, `error`, `duration`, and `filter expression` that arise during expression evaluation but never appear as column types in a table schema.
In the Malloy language, compound types can be written using the syntax `type[]` for arrays and `{name :: type, ...}` for records; these nest arbitrarily: `{x :: number, y :: string[]}[]`.
- Atomic Field: Can be stored in a single database column (includes arrays and records)
- Basic Field: Atomic field with a single value (string, number, etc.)
- Compound Field: Records, arrays, and arrays of records
- Joined Field: References another SQL query or joined table (not in current table)
- StructDef: Any namespace-containing object (records, arrays, table schemas, query schemas)
- SourceDef: A StructDef that can be used as query input (tables, queries, but not plain records/arrays)
- FieldSpace: Used by translator to construct and comprehend StructDefs
- Arrays are treated as records with one entry named "value" or "each" (SQL heritage)
- Nested queries produce arrays of records, accessed via un-nested joins
- Historical note: nested queries are called "turtles" in the source code; that was once their user-facing name.
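As a sketch (assuming a hypothetical `flights` source), a nested query produces an array-of-records field in each output row:

```malloy
run: flights -> {
  group_by: carrier
  // 'by_destination' becomes an array of records (a "turtle")
  nest: by_destination is {
    group_by: destination
    aggregate: flight_count is count()
  }
}
```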
A Malloy query consists of two main components:
- Source: A SourceDef, which has a schema defined by a field list
- Pipeline: Array of query operations (similar to SELECT statements with grouping/filtering)
- Query execution flows through the pipeline: source → first operation → second operation → etc.
Since query output is table-shaped, a query can also be a source. This is how pipelining works: take a source, transform it with a query operation, and use that output as input to the next operation.
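A sketch of a two-stage pipeline (again assuming a hypothetical `flights` source): the first stage's table-shaped output is the input to the second:

```malloy
run: flights -> {
  group_by: carrier
  aggregate: flight_count is count()
} -> {
  select: carrier, flight_count
  where: flight_count > 1000
  order_by: flight_count desc
}
```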
The system uses a Dialect pattern where each database adapter implements database-specific SQL generation while sharing the same semantic model. Database connections are abstracted through a common Connection interface.
The actual SQL-writing portion of a Dialect is implemented in `packages/malloy/dialect`.
Each database has its own package with connection handling and dialect-specific optimizations:
- `malloy-db-bigquery/`: Google BigQuery adapter
- `malloy-db-duckdb/`: DuckDB adapter (includes WASM support)
- `malloy-db-postgres/`: PostgreSQL adapter
- `malloy-db-mysql/`: MySQL adapter
- `malloy-db-snowflake/`: Snowflake adapter
- `malloy-db-trino/`: Trino/Presto adapter
- `malloy-db-publisher/`: Publishing/caching layer
- `malloy-interfaces/`: TypeScript interfaces and Thrift-generated types
- `malloy-render/`: Data visualization and rendering (see `packages/malloy-render/CONTEXT.md`)
- `malloy-syntax-highlight/`: Language syntax highlighting
- `malloy-filter/`: Query filtering utilities
- `malloy-tag/`: Malloy Tag Language for annotation parsing (see `packages/malloy-tag/CONTEXT.md`)
- `malloy-query-builder/`: Programmatic query building
- `malloy-malloy-sql/`: SQL integration utilities
The packages form a dependency graph where:
- `malloy-interfaces` is the foundation (no dependencies)
- `malloy` depends on the interfaces, filter, and tag packages
- Database adapters depend on the core `malloy` package
- `malloy-render` depends on core malloy packages for data processing
- The root manages all packages through npm workspaces
When making changes, build order matters: interfaces → core → database adapters → render components.
npm install # Install dependencies for all packages
npm run dev # Fast build: codegen + tsc (for iterating)
npm run build # Full build: codegen + tsc + flow types + render
npm run clean # Clean build artifacts from all packages
npm run watch # Watch for TypeScript changes across the repo

NOTE FOR TOOLS RUNNING BUILD: build output is long, save it to a file:
npm run build > /tmp/build.log 2>&1 && echo Build OK || (tail -50 /tmp/build.log; exit 1)
- `npm run dev`: Runs codegen (ANTLR, peggy) then `tsc --build` for each package. This is the fast command you run repeatedly while debugging. It skips the vite render build since tests don't need it.
- `npm run build`: Everything in `dev`, plus the vite render bundle. Run this when you need fully built packages (e.g. for `npm link`).
If you're editing code and running tests in the same package, you don't need to rebuild — just run npx jest directly on the test file. Changes to .ts files are picked up by ts-jest.
If you make changes in a different package than the test (or you're running tests from test/ and change any package), run npm run dev at the repo root first. It's fast — codegen is content-hash cached and tsc is incremental.
Some packages have codegen steps that generate source files from grammars or configs:
- `packages/malloy`: ANTLR4 parser from `.g4` grammar files
- `packages/malloy-filter`: Peggy parsers from `.peggy` grammar files
- `packages/malloy-malloy-sql`: Peggy parsers from `.pegjs` grammar files
- `packages/malloy-render`: Vite bundle from TypeScript/Solid sources
These use `scripts/femto-build.js`, a tiny content-hash-based build caching tool. Each package with codegen has a `femto-config.motly` with named targets specifying input globs and commands. femto-build hashes the inputs and skips the commands if nothing changed. Targets can depend on other targets via `deps`. This survives git operations (unlike Make's timestamp-based approach).
To add codegen to a new package: create a femto-config.motly in the package directory:
```
targetName: {
  inputs = ["src/grammar/*.g4"]
  commands = ["mkdir -p out", "tool -o out src/grammar/File.g4"]
}
dependent-target: {
  deps = [targetName]
  inputs = ["src/other/*.g4"]
  commands = ["tool -o out src/other/File.g4"]
}
```
Then add to `package.json`: `"codegen": "node ../../scripts/femto-build.js targetName"`
IMPORTANT: Malloy has a large test suite which cannot run on a development machine. A CI run is needed to fully verify a change.
NOTES ON TOOLS RUNNING TESTS:
- DO NOT run `npm run test` without restrictions: it requires active connections for every supported database, takes a very long time, and will never succeed on a development machine.
- NEVER run `npm run test -- filename`: this also takes a very long time and will never succeed.
The typical path when working on a fix is to run just the one test file containing the test, and a test pattern to identify the test. For example, to run the translator's source test:
npx jest packages/malloy/src/lang/test/source.spec.ts -t "TEST NAME PATTERN"

Some tests loop over all testable databases (for example, all tests in test/src/databases/all). For these it is important to restrict the databases under test to one that is available. Most developers use duckdb:
MALLOY_DATABASE=duckdb npx jest test/src/databases/all/TEST.spec.ts -t "TEST NAME PATTERN"

The most comprehensive test you might run as a developer before letting CI build your code:
npm run test-duckdb # Runs all tests, but only checks the duckdb dialect

Every developer will be able to run this, and it is a good sanity check.
npm run test-publisher # Test with publisher database (all tests, publisher dialect only)
npm run ci-core # CI: Core tests (malloy-core, malloy-render)
npm run ci-duckdb # CI: DuckDB-specific tests
npm run ci-bigquery # CI: BigQuery-specific tests
npm run ci-postgres # CI: PostgreSQL-specific tests

For more details on test organization and infrastructure, see test/CONTEXT.md.
npm run lint # Run ESLint on all packages
npm run lint-fix # Fix ESLint issues automatically

The VS Code extension is in a separate repository (malloy-vscode-extension), but this repo contains the language server and core functionality it depends on.
For new files, this is the current correct copyright text (here in C/Java/Javascript style):
/*
* Copyright Contributors to the Malloy project
* SPDX-License-Identifier: MIT
*/
Do not include AI attribution (e.g., "Generated with Claude Code", "Co-Authored-By: Claude") in commits or pull requests.
For deeper context on specific subsystems, see:
- packages/malloy/CONTEXT.md - Core language package (translator and compiler)
- packages/malloy/src/api/CONTEXT.md - API layers (Foundation, Stateless, Sessioned, Async)
- packages/malloy/src/connection/CONTEXT.md - Connection registry, config format, backend properties
- packages/malloy-tag/CONTEXT.md - Tag language for annotation parsing
- packages/malloy-render/CONTEXT.md - Data visualization and rendering
- test/CONTEXT.md - Test organization and infrastructure
This repository uses the CONTEXT.md convention for LLM-friendly documentation.
The idea is that for any file of interest, an LLM can walk up the directory tree reading CONTEXT.md files to gather layered context - from specific to general - without loading all context files at once.
Verification command: "Read the CONTEXT tree and verify it is up to date"