
Commit 71c92d8

Create pydough from_string API (#383)
The from_string API parses a PyDough source code string and transforms it into a PyDough collection. You can then perform operations like explain(), to_sql(), or to_df() on the result.
1 parent 0b84cbf commit 71c92d8


5 files changed

+419
-4
lines changed


documentation/usage.md

Lines changed: 186 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,11 +13,13 @@ This document describes how to set up & interact with PyDough. For instructions
1313
- [Evaluation APIs](#evaluation-apis)
1414
* [`pydough.to_sql`](#pydoughto_sql)
1515
* [`pydough.to_df`](#pydoughto_df)
16+
- [Transformation APIs](#transformation-apis)
17+
* [`pydough.from_string`](#pydoughfrom_string)
1618
- [Exploration APIs](#exploration-apis)
1719
* [`pydough.explain_structure`](#pydoughexplain_structure)
1820
* [`pydough.explain`](#pydoughexplain)
1921
* [`pydough.explain_term`](#pydoughexplain_term)
20-
- [Logging] (#logging)
22+
- [Logging](#logging)
2123

2224
<!-- TOC end -->
2325

@@ -476,6 +478,187 @@ pydough.to_df(result, columns={"name": "name", "n_custs": "n"})
476478

477479
See the [demo notebooks](../demos/notebooks/1_introduction.ipynb) for more instances of how to use the `to_df` API.
478480

481+
<!-- TOC --><a name="transformation-apis"></a>
482+
## Transformation APIs
483+
484+
This section describes the APIs you can use to transform PyDough source code into a result that can be passed to the evaluation or exploration APIs.
485+
486+
<!-- TOC --><a name="pydoughfrom_string"></a>
487+
### `pydough.from_string`
488+
489+
The `from_string` API parses a PyDough source code string and transforms it into a PyDough collection. You can then perform operations like `explain()`, `to_sql()`, or `to_df()` on the result.
490+
491+
#### Syntax
492+
```python
493+
def from_string(
494+
    source: str,
495+
    answer_variable: str | None = None,
496+
    metadata: GraphMetadata | None = None,
497+
    environment: dict[str, Any] | None = None,
498+
) -> UnqualifiedNode:
499+
```
500+
501+
The first argument, `source`, is the source code string. It can be a single PyDough statement or multi-line PyDough code with intermediate results stored in variables. The function also accepts the following optional arguments:
502+
503+
- `answer_variable`: The name of the variable that stores the final result of the PyDough code. If not provided, the API expects the final result to be in a variable named `result`. The API returns the PyDough collection stored in this variable. The code string must assign valid PyDough code to a variable with this name; otherwise, a `PyDoughUnqualifiedException` is raised.
504+
- `metadata`: The PyDough knowledge graph to use for the transformation. If omitted, `pydough.active_session.metadata` is used.
505+
- `environment`: A dictionary of additional variables made available to the PyDough code. It serves as the local namespace in which the code is executed. If omitted, an empty dictionary is used.
506+
507+
Below are examples of using `pydough.from_string`, along with the SQL that could be generated by calling `pydough.to_sql` on the output. All these examples use the TPC-H dataset that can be downloaded [here](https://github.com/lovasoa/TPCH-sqlite/releases) with the [graph used in the demos directory](../demos/metadata/tpch_demo_graph.json).
508+
509+
This first example is Python code that uses `pydough.from_string` to generate SQL counting the customers in the market segment `"AUTOMOBILE"`. The result is stored in a variable named `pydough_query` instead of the default `result`, and the market segment `"AUTOMOBILE"` is passed in through the environment variable `SEG`:
510+
```py
511+
import pydough
512+
513+
# Setup demo metadata. Make sure you have the TPC-H dataset downloaded locally.
514+
graph = pydough.active_session.load_metadata_graph("demos/metadata/tpch_demo_graph.json", "TPCH")
515+
pydough.active_session.connect_database("sqlite", database="tpch.db")
516+
517+
# Example of a single line pydough code snippet
518+
pydough_code = "pydough_query = TPCH.CALCULATE(n=COUNT(customers.WHERE(market_segment == SEG)))"
519+
# Transform the pydough code and get the result from pydough_query
520+
query = pydough.from_string(pydough_code, "pydough_query", graph, {"SEG":"AUTOMOBILE"})
521+
sql = pydough.to_sql(query)
522+
```
523+
524+
The value of `sql` is the following SQL query text as a Python string:
525+
```sql
526+
SELECT
527+
COUNT(*) AS n
528+
FROM main.customer
529+
WHERE
530+
c_mktsegment = 'AUTOMOBILE'
531+
```
532+
533+
This next example is of Python code to generate SQL to get the top 5 suppliers with the highest revenue. The code snippet uses variables provided in the environment context to filter by nation, ship mode and year (`TARGET_NATION`, `DESIRED_SHIP_MODE` and `REQUESTED_SHIP_YEAR`):
534+
```py
535+
# Example of a multi-line pydough code snippet with intermediate results
536+
nation_name = "JAPAN"
537+
ship_mode = "TRUCK"
538+
ship_year = 1996
539+
env = {"TARGET_NATION" : nation_name, "DESIRED_SHIP_MODE" : ship_mode, "REQUESTED_SHIP_YEAR" : ship_year}
540+
541+
pydough_code="""
542+
# The supply records for the supplier that were from a part whose name starts with "coral"
543+
selected_records = supply_records.WHERE(STARTSWITH(part.name, "coral")).CALCULATE(supply_cost)
544+
545+
# The revenue generated by a specific lineitem
546+
line_revenue = extended_price * (1 - discount) * (1 - tax) - quantity * supply_cost
547+
548+
# The lineitem purchases for each record that were shipped in REQUESTED_SHIP_YEAR via DESIRED_SHIP_MODE
549+
lines_year = selected_records.lines.WHERE((YEAR(ship_date) == REQUESTED_SHIP_YEAR) & (ship_mode == DESIRED_SHIP_MODE)).CALCULATE(rev=line_revenue)
550+
551+
# For each supplier, list their name & selected revenue from REQUESTED_SHIP_YEAR
552+
selected_suppliers = suppliers.WHERE((nation.name == TARGET_NATION) & HAS(lines_year))
553+
supplier_info = selected_suppliers.CALCULATE(name, revenue_year=ROUND(SUM(lines_year.rev), 2))
554+
555+
# Pick the 5 suppliers with the highest revenue from REQUESTED_SHIP_YEAR
556+
result = supplier_info.TOP_K(5, by=revenue_year.DESC())
557+
"""
558+
# Transform the PyDough code and read the answer from the default `result` variable
559+
query = pydough.from_string(pydough_code, environment=env)
560+
sql = pydough.to_sql(query)
561+
```
562+
563+
The value of `sql` is the following SQL query text as a Python string:
564+
```sql
565+
WITH _s7 AS (
566+
SELECT
567+
ROUND(
568+
COALESCE(
569+
SUM(
570+
lineitem.l_extendedprice * (
571+
1 - lineitem.l_discount
572+
) * (
573+
1 - lineitem.l_tax
574+
) - lineitem.l_quantity * partsupp.ps_supplycost
575+
),
576+
0
577+
),
578+
2
579+
) AS revenue_year,
580+
partsupp.ps_suppkey
581+
FROM main.partsupp AS partsupp
582+
JOIN main.part AS part
583+
ON part.p_name LIKE 'coral%' AND part.p_partkey = partsupp.ps_partkey
584+
JOIN main.lineitem AS lineitem
585+
ON CAST(STRFTIME('%Y', lineitem.l_shipdate) AS INTEGER) = 1996
586+
AND lineitem.l_partkey = partsupp.ps_partkey
587+
AND lineitem.l_shipmode = 'TRUCK'
588+
AND lineitem.l_suppkey = partsupp.ps_suppkey
589+
GROUP BY
590+
partsupp.ps_suppkey
591+
)
592+
SELECT
593+
supplier.s_name AS name,
594+
_s7.revenue_year
595+
FROM main.supplier AS supplier
596+
JOIN main.nation AS nation
597+
ON nation.n_name = 'JAPAN' AND nation.n_nationkey = supplier.s_nationkey
598+
JOIN _s7 AS _s7
599+
ON _s7.ps_suppkey = supplier.s_suppkey
600+
ORDER BY
601+
revenue_year DESC
602+
LIMIT 5
603+
```
604+
605+
This final example is Python code that generates a SQL query using `datetime.date`, which is passed in through the environment:
606+
```py
607+
# For every customer, how many urgent orders have they made in year 1996 with a
608+
# total price over 100000, and what is the sum of the total prices of all such
609+
# orders they made? Sort the result by the sum from highest to lowest, and only
610+
# include customers with at least one such order
611+
612+
# This query puts two entries in the environment: the year 1996 (as `YEAR`) and
613+
# the `date` class from `datetime`. Alternatively, the PyDough code string
614+
# could import `datetime.date` itself
615+
import datetime
616+
env = {"date" : datetime.date, "YEAR" : 1996}
617+
618+
pydough_code="""
619+
selected_orders=orders.WHERE((order_priority == '1-URGENT') & (total_price > 100000) &
620+
(order_date >= date(YEAR, 1, 1)) & (order_date < date(YEAR + 1, 1, 1)))
621+
622+
result = (
623+
customers
624+
.WHERE(HAS(selected_orders))
625+
.CALCULATE(name, n_orders=COUNT(selected_orders), total=SUM(selected_orders.total_price))
626+
.ORDER_BY(total.DESC())
627+
)
628+
"""
629+
630+
# Transform the PyDough code and read the answer from the default `result` variable
631+
query = pydough.from_string(pydough_code, environment=env)
632+
sql = pydough.to_sql(query)
633+
```
634+
635+
The value of `sql` is the following SQL query text as a Python string:
636+
```sql
637+
WITH _s1 AS (
638+
SELECT
639+
COALESCE(SUM(o_totalprice), 0) AS total,
640+
COUNT(*) AS n_rows,
641+
o_custkey
642+
FROM main.orders
643+
WHERE
644+
o_orderdate < '1997-01-01'
645+
AND o_orderdate >= '1996-01-01'
646+
AND o_orderpriority = '1-URGENT'
647+
AND o_totalprice > 100000
648+
GROUP BY
649+
o_custkey
650+
)
651+
SELECT
652+
customer.c_name AS name,
653+
_s1.n_rows AS n_orders,
654+
_s1.total
655+
FROM main.customer AS customer
656+
JOIN _s1 AS _s1
657+
ON _s1.o_custkey = customer.c_custkey
658+
ORDER BY
659+
total DESC
660+
```
661+
479662
<!-- TOC --><a name="exploration-apis"></a>
480663
## Exploration APIs
481664

@@ -882,6 +1065,8 @@ This term is singular with regards to the collection, meaning it can be placed i
8821065
For example, the following is valid:
8831066
TPCH.nations.WHERE(region.name == 'EUROPE').CALCULATE(AVG(customers.acctbal))
8841067
```
1068+
1069+
<!-- TOC --><a name="logging"></a>
8851070
## Logging
8861071

8871072
Logging is enabled and set to INFO level by default. We can change the log level by setting the environment variable `PYDOUGH_LOG_LEVEL` to the standard levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
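
As an illustration of the setting above, the sketch below shows one way such a variable could be validated and mapped to a `logging` level. `level_from_env` is a hypothetical helper written for this example; it is not PyDough's own logger setup, and when PyDough actually reads the variable is not covered here, so setting it before importing `pydough` is the cautious assumption.

```python
import logging
import os

STANDARD_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def level_from_env(variable: str = "PYDOUGH_LOG_LEVEL", default: str = "INFO") -> int:
    """Hypothetical helper: map the environment variable to a logging level."""
    name = os.environ.get(variable, default).upper()
    # Reject anything that is not one of the standard level names.
    if name not in STANDARD_LEVELS:
        raise ValueError(f"{variable} must be one of {STANDARD_LEVELS}, got {name!r}")
    # The logging module exposes each standard level as an integer constant.
    return getattr(logging, name)


# Set the variable for this process, then resolve it to a numeric level.
os.environ["PYDOUGH_LOG_LEVEL"] = "DEBUG"
level = level_from_env()
```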

pydough/__init__.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
"explain",
99
"explain_structure",
1010
"explain_term",
11+
"from_string",
1112
"get_logger",
1213
"init_pydough_context",
1314
"parse_json_metadata_from_file",
@@ -20,7 +21,7 @@
2021
from .exploration import explain, explain_structure, explain_term
2122
from .logger import get_logger
2223
from .metadata import parse_json_metadata_from_file
23-
from .unqualified import display_raw, init_pydough_context
24+
from .unqualified import display_raw, from_string, init_pydough_context
2425

2526
# Create a default session for the user to interact with.
2627
# In most situations users will just use this session and

pydough/unqualified/__init__.py

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@
2020
"UnqualifiedWhere",
2121
"UnqualifiedWindow",
2222
"display_raw",
23+
"from_string",
2324
"init_pydough_context",
2425
"qualify_node",
2526
"qualify_term",
@@ -44,4 +45,9 @@
4445
UnqualifiedWindow,
4546
display_raw,
4647
)
47-
from .unqualified_transform import init_pydough_context, transform_cell, transform_code
48+
from .unqualified_transform import (
49+
from_string,
50+
init_pydough_context,
51+
transform_cell,
52+
transform_code,
53+
)

pydough/unqualified/unqualified_transform.py

Lines changed: 84 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,14 +3,18 @@
33
variables with unqualified nodes by prepending with `_ROOT.`.
44
"""
55

6-
__all__ = ["init_pydough_context", "transform_cell", "transform_code"]
6+
__all__ = ["from_string", "init_pydough_context", "transform_cell", "transform_code"]
77

88
import ast
99
import inspect
1010
import types
11+
from typing import Any
1112

1213
from pydough.metadata import GraphMetadata
1314

15+
from .errors import PyDoughUnqualifiedException
16+
from .unqualified_node import UnqualifiedNode
17+
1418

1519
class AddRootVisitor(ast.NodeTransformer):
1620
"""
@@ -413,6 +417,85 @@ def transform_cell(cell: str, graph_name: str, known_names: set[str]) -> str:
413417
return ast.unparse(new_tree)
414418

415419

420+
def from_string(
421+
    source: str,
422+
    answer_variable: str | None = None,
423+
    metadata: GraphMetadata | None = None,
424+
    environment: dict[str, Any] | None = None,
425+
) -> UnqualifiedNode:
426+
    """
427+
    Parses and transforms a PyDough source string, returning an unqualified node
428+
    on which operations like `explain()`, `to_sql()`, or `to_df()` can be
429+
    called.
430+
431+
    Args:
432+
        `source`: a valid PyDough code string that will be executed to define
433+
        the PyDough code.
434+
        `answer_variable`: The name of the variable that holds the result of the
435+
        PyDough code. If not provided, assumes the answer is `result`.
436+
        `metadata`: The metadata graph to use. If not provided,
437+
        `active_session.metadata` will be used.
438+
        `environment`: A dictionary of variables that will be available
439+
        in the environment where the PyDough code is executed. If not provided,
440+
        uses an empty dictionary.
441+
442+
    Returns:
443+
        A PyDough UnqualifiedNode object representing the result of the
444+
        transformed PyDough code.
445+
    """
446+
    import pydough
447+
448+
    # Verify if graph is provided. Otherwise use pydough.active_session.metadata
449+
    if metadata is None:
450+
        metadata = pydough.active_session.metadata
451+
        if metadata is None:
452+
            raise ValueError(
453+
                "No active graph set in PyDough session."
454+
                " Please set a graph using"
455+
                " pydough.active_session.load_metadata_graph(...)"
456+
            )
457+
    # Verify if environment is provided
458+
    if environment is None:
459+
        environment = {}
460+
461+
    # Verify if answer_variable is provided
462+
    if answer_variable is None:
463+
        answer_variable = "result"
464+
465+
    # Transform PyDough code into valid Python code
466+
    known_names: set[str] = set(environment.keys())
467+
    visitor: ast.NodeTransformer = AddRootVisitor("graph", known_names)
468+
    try:
469+
        tree: ast.AST = ast.parse(source)
470+
    except SyntaxError as e:
471+
        raise ValueError(f"Syntax error in source PyDough code:\n{str(e)}") from e
472+
    assert isinstance(tree, ast.AST)
473+
    new_tree: ast.AST = ast.fix_missing_locations(visitor.visit(tree))
474+
    assert isinstance(new_tree, ast.AST)
475+
476+
    # Execute the transformed PyDough code to get the UnqualifiedNode answer
477+
    transformed_code: str = ast.unparse(new_tree)
478+
    try:
479+
        compile_ast = compile(transformed_code, filename="<ast>", mode="exec")
480+
    except SyntaxError as e:
481+
        raise ValueError(f"Syntax error in transformed PyDough code:\n{str(e)}") from e
482+
    execution_context: dict[str, Any] = environment | {"graph": metadata}
483+
    exec(compile_ast, {}, execution_context)
484+
485+
    # Check if answer_variable exists in execution_context after code execution
486+
    if answer_variable not in execution_context:
487+
        raise PyDoughUnqualifiedException(
488+
            f"PyDough code expected to store the answer in a variable named '{answer_variable}'."
489+
        )
490+
    ret_val = execution_context[answer_variable]
491+
    # Check if answer is an UnqualifiedNode
492+
    if not isinstance(ret_val, UnqualifiedNode):
493+
        raise PyDoughUnqualifiedException(
494+
            f"Expected variable {answer_variable!r} in the text to store PyDough code, instead found {ret_val.__class__.__name__!r}."
495+
        )
496+
    return ret_val
497+
498+
416499
def init_pydough_context(graph: GraphMetadata):
417500
"""
418501
Decorator that wraps around a PyDough function and transforms its body into
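
The rewriting step that `from_string` delegates to `AddRootVisitor` can be illustrated with a much smaller transformer. The sketch below is an assumption-laden simplification, not the visitor from this module: `PrefixUnknownNames` and `rewrite` are hypothetical names, the `_ROOT` prefix is taken from the module docstring, and the sketch ignores builtins, function definitions, imports, and other cases a real visitor has to handle. It only shows the core idea: names that are read but are neither in the environment nor assigned earlier in the snippet are turned into attribute accesses on a root object.

```python
import ast


class PrefixUnknownNames(ast.NodeTransformer):
    """Hypothetical, simplified stand-in for a root-prefixing visitor."""

    def __init__(self, root_name: str, known_names: set[str]) -> None:
        self.root_name = root_name
        self.known_names = set(known_names)

    def visit_Assign(self, node: ast.Assign) -> ast.AST:
        # Rewrite the right-hand side first, then remember the assigned names
        # so that later references to them are left untouched.
        node.value = self.visit(node.value)
        for target in node.targets:
            if isinstance(target, ast.Name):
                self.known_names.add(target.id)
        return node

    def visit_Name(self, node: ast.Name) -> ast.AST:
        # Only names that are read and not otherwise known get the prefix.
        if isinstance(node.ctx, ast.Load) and node.id not in self.known_names:
            return ast.Attribute(
                value=ast.Name(id=self.root_name, ctx=ast.Load()),
                attr=node.id,
                ctx=ast.Load(),
            )
        return node


def rewrite(source: str, known_names: set[str]) -> str:
    """Parse, prefix unknown names, and turn the tree back into source text."""
    visitor = PrefixUnknownNames("_ROOT", known_names)
    new_tree = ast.fix_missing_locations(visitor.visit(ast.parse(source)))
    return ast.unparse(new_tree)


snippet = "big = orders.WHERE(total_price > LIMIT)\nresult = big.TOP_K(5)"
rewritten = rewrite(snippet, {"LIMIT"})
print(rewritten)
# big = _ROOT.orders.WHERE(_ROOT.total_price > LIMIT)
# result = big.TOP_K(5)
```

`LIMIT` stays as-is because it is listed as a known name (the role the `environment` keys play in `from_string`), and `big` stays as-is because the snippet assigns it before using it.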
