Commit 207f36b

committed
merge
2 parents aaaada1 + 71c92d8 commit 207f36b

94 files changed

+7769
-295
lines changed

documentation/functions.md

Lines changed: 41 additions & 0 deletions
@@ -27,6 +27,7 @@ Below is the list of every function/operator currently supported in PyDough as a
2727
* [STRIP](#strip)
2828
* [REPLACE](#replace)
2929
* [STRCOUNT](#strcount)
30+
* [GETPART](#getpart)
3031
- [Datetime Functions](#datetime-functions)
3132
* [DATETIME](#datetime)
3233
* [YEAR](#year)
@@ -438,6 +439,46 @@ Customers.CALCULATE(count_substring= STRCOUNT(name, "")) # returns 0 by default
438439
| `'Alex Rodriguez'`| `STRCOUNT('Alex Rodriguez', 'e')`| `2` |
439440
| `'Hello World'`| `STRCOUNT('Hello World', 'll')` | `1` |
440441

442+
443+
<!-- TOC --><a name="getpart"></a>
444+
445+
### GETPART
446+
447+
The `GETPART` function extracts the N-th part from a string, splitting it by a specified delimiter.
448+
449+
- The first argument is the input string to split.
450+
- The second argument is the delimiter string.
451+
- The third argument is the index of the part to extract. This index can be positive (counting from the start, 1-based) or negative (counting from the end, -1 is the last part).
452+
453+
If the index is out of range, `GETPART` returns `None`. If the delimiter is an empty string, the function will not split the input string and the first part will be the entire string.
454+
455+
```py
456+
# Extracts the first name from a full name
457+
Customers.CALCULATE(first_name = GETPART(name, " ", 1))
458+
459+
# Extracts the last name from a full name
460+
Customers.CALCULATE(last_name = GETPART(name, " ", -1))
461+
462+
# Extracts the second part from a hyphen-separated string
463+
Parts.CALCULATE(second_code = GETPART(code, "-", 2))
464+
```
465+
466+
| **Input String** | **Delimiter** | **Index** | **GETPART Result** |
467+
|---------------------- |-------------- |-----------|--------------------|
468+
| `"Alex Rodriguez"` | `" "` | `1` | `"Alex"` |
469+
| `"Alex Rodriguez"` | `" "` | `0` | `"Alex"` |
470+
| `"Alex Rodriguez"` | `" "` | `2` | `"Rodriguez"` |
471+
| `"Alex Rodriguez"` | `" "` | `-1` | `"Rodriguez"` |
472+
| `"Alex Rodriguez"` | `""` | `1` | `"Alex Rodriguez"` |
473+
| `"a-b-c-d"` | `"-"` | `3` | `"c"` |
474+
| `"a-b-c-d"` | `"-"` | `-2` | `"c"` |
475+
| `"a-b-c-d"` | `"-"` | `5` | `None` |
476+
| `"a-b-c-d"` | `"-"` | `-5` | `None` |
477+
478+
> [!NOTE]
479+
> - Indexing is one-based from the start and negative indices count from the end.
480+
> - The 0 index will be treated as 1, returning the first part.
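
The indexing rules above can be mirrored in plain Python. The following is a minimal sketch, not PyDough's implementation: `getpart_sketch` is a hypothetical helper that only reproduces the documented behavior (one-based positive indices, `0` treated as `1`, negative indices counted from the end, `None` when the index is out of range, and no splitting on an empty delimiter).

```python
from typing import Optional


def getpart_sketch(text: str, delimiter: str, index: int) -> Optional[str]:
    """Hypothetical helper mirroring the documented GETPART rules."""
    # An empty delimiter means no splitting: the whole string is the only part.
    parts = [text] if delimiter == "" else text.split(delimiter)
    # Index 0 is treated the same as index 1.
    if index == 0:
        index = 1
    # Convert the one-based or negative index into a zero-based list position.
    position = index - 1 if index > 0 else len(parts) + index
    if 0 <= position < len(parts):
        return parts[position]
    # Out-of-range indices yield None.
    return None


first_name = getpart_sketch("Alex Rodriguez", " ", 1)
last_name = getpart_sketch("Alex Rodriguez", " ", -1)
```

Converting every index to a zero-based position first keeps the range check in one place, which makes the out-of-range cases for positive and negative indices symmetric.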
481+
441482
<!-- TOC --><a name="datetime-functions"></a>
442483

443484
## Datetime Functions

documentation/metadata.md

Lines changed: 249 additions & 5 deletions
Large diffs are not rendered by default.

documentation/usage.md

Lines changed: 186 additions & 1 deletion
@@ -13,11 +13,13 @@ This document describes how to set up & interact with PyDough. For instructions
1313
- [Evaluation APIs](#evaluation-apis)
1414
* [`pydough.to_sql`](#pydoughto_sql)
1515
* [`pydough.to_df`](#pydoughto_df)
16+
- [Transformation APIs](#transformation-apis)
17+
* [`pydough.from_string`](#pydoughfrom_string)
1618
- [Exploration APIs](#exploration-apis)
1719
* [`pydough.explain_structure`](#pydoughexplain_structure)
1820
* [`pydough.explain`](#pydoughexplain)
1921
* [`pydough.explain_term`](#pydoughexplain_term)
20-
- [Logging] (#logging)
22+
- [Logging](#logging)
2123

2224
<!-- TOC end -->
2325

@@ -476,6 +478,187 @@ pydough.to_df(result, columns={"name": "name", "n_custs": "n"})
476478

477479
See the [demo notebooks](../demos/notebooks/1_introduction.ipynb) for more instances of how to use the `to_df` API.
478480

481+
<!-- TOC --><a name="transformation-apis"></a>
482+
## Transformation APIs
483+
484+
This section describes the APIs you can use to transform PyDough source code into a result that can be used as input for the evaluation or exploration APIs.
485+
486+
<!-- TOC --><a name="pydoughfrom_string"></a>
487+
### `pydough.from_string`
488+
489+
The `from_string` API parses a PyDough source code string and transforms it into a PyDough collection. You can then perform operations like `explain()`, `to_sql()`, or `to_df()` on the result.
490+
491+
#### Syntax
492+
```python
493+
def from_string(
494+
source: str,
495+
answer_variable: str | None = None,
496+
metadata: GraphMetadata | None = None,
497+
environment: dict[str, Any] | None = None,
498+
) -> UnqualifiedNode:
499+
```
500+
501+
The first argument, `source`, is the source code string. It can be a single PyDough command or multi-line PyDough code with intermediate results stored in variables. The API also accepts the following optional arguments:
502+
503+
- `answer_variable`: The name of the variable that stores the final result of the PyDough code. If not provided, the API expects the final result to be in a variable named `result`. The API returns a PyDough collection holding this value. The PyDough code string must define a variable with the same name as `answer_variable` whose value is valid PyDough code; otherwise the API raises an exception.
504+
- `metadata`: The PyDough knowledge graph to use for the transformation. If omitted, `pydough.active_session.metadata` is used.
505+
- `environment`: A dictionary representing additional environment context. This serves as the local namespace where the PyDough code will be executed.
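
As a rough mental model of how `answer_variable` and `environment` interact, the plain-Python sketch below executes a code string in a namespace seeded from an environment dictionary and then looks up the answer variable. This is an analogy only, assuming ordinary `exec` semantics are close enough to illustrate the idea; `run_and_fetch` is a hypothetical helper and is not part of PyDough. The real `from_string` also uses the metadata graph and returns a PyDough collection rather than a plain Python value.

```python
from typing import Any, Optional


def run_and_fetch(
    source: str,
    answer_variable: str = "result",
    environment: Optional[dict[str, Any]] = None,
) -> Any:
    """Hypothetical helper: run `source` and return one named variable."""
    # Copy the environment so the caller's dictionary is not mutated.
    namespace: dict[str, Any] = dict(environment or {})
    # The copied environment acts as the local namespace of the executed code.
    exec(source, {}, namespace)
    if answer_variable not in namespace:
        raise KeyError(f"{answer_variable!r} is not defined by the code string")
    return namespace[answer_variable]


# BASE is supplied through the environment; the answer is read from `total`.
value = run_and_fetch("total = BASE + 1", "total", {"BASE": 41})
```

Raising when the answer variable is missing matches the documented contract that the code string must define it.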
506+
507+
Below are examples of using `pydough.from_string`, along with the SQL that could be generated by calling `pydough.to_sql` on the output. All of these examples use the TPC-H dataset, which can be downloaded [here](https://github.com/lovasoa/TPCH-sqlite/releases), with the [graph used in the demos directory](../demos/metadata/tpch_demo_graph.json).
508+
509+
The first example uses `pydough.from_string` to generate SQL that counts the customers in the market segment `"AUTOMOBILE"`. The result is read from a variable named `pydough_query` instead of the default `result`, and the market segment `"AUTOMOBILE"` is passed in through the environment variable `SEG`:
510+
```py
511+
import pydough
512+
513+
# Setup demo metadata. Make sure you have the TPC-H dataset downloaded locally.
514+
graph = pydough.active_session.load_metadata_graph("demos/metadata/tpch_demo_graph.json", "TPCH")
515+
pydough.active_session.connect_database("sqlite", database="tpch.db")
516+
517+
# Example of a single line pydough code snippet
518+
pydough_code = "pydough_query = TPCH.CALCULATE(n=COUNT(customers.WHERE(market_segment == SEG)))"
519+
# Transform the pydough code and get the result from pydough_query
520+
query = pydough.from_string(pydough_code, "pydough_query", graph, {"SEG":"AUTOMOBILE"})
521+
sql = pydough.to_sql(query)
522+
```
523+
524+
The value of `sql` is the following SQL query text as a Python string:
525+
```sql
526+
SELECT
527+
COUNT(*) AS n
528+
FROM main.customer
529+
WHERE
530+
c_mktsegment = 'AUTOMOBILE'
531+
```
532+
533+
The next example generates SQL to get the 5 suppliers with the highest revenue. The code snippet uses variables provided in the environment context (`TARGET_NATION`, `DESIRED_SHIP_MODE` and `REQUESTED_SHIP_YEAR`) to filter by nation, ship mode and year:
534+
```py
535+
# Example of a multi-line pydough code snippet with intermediate results
536+
nation_name = "JAPAN"
537+
ship_mode = "TRUCK"
538+
ship_year = 1996
539+
env = {"TARGET_NATION" : nation_name, "DESIRED_SHIP_MODE" : ship_mode, "REQUESTED_SHIP_YEAR" : ship_year}
540+
541+
pydough_code="""
542+
# The supply records for the supplier that were from a part whose name starts with "coral"
543+
selected_records = supply_records.WHERE(STARTSWITH(part.name, "coral")).CALCULATE(supply_cost)
544+
545+
# The revenue generated by a specific lineitem
546+
line_revenue = extended_price * (1 - discount) * (1 - tax) - quantity * supply_cost
547+
548+
# The lineitem purchases for each record that were shipped in REQUESTED_SHIP_YEAR via DESIRED_SHIP_MODE
549+
lines_year = selected_records.lines.WHERE((YEAR(ship_date) == REQUESTED_SHIP_YEAR) & (ship_mode == DESIRED_SHIP_MODE)).CALCULATE(rev=line_revenue)
550+
551+
# For each supplier, list their name & selected revenue from REQUESTED_SHIP_YEAR
552+
selected_suppliers = suppliers.WHERE((nation.name == TARGET_NATION) & HAS(lines_year))
553+
supplier_info = selected_suppliers.CALCULATE(name, revenue_year=ROUND(SUM(lines_year.rev), 2))
554+
555+
# Pick the 5 suppliers with the highest revenue from REQUESTED_SHIP_YEAR
556+
result = supplier_info.TOP_K(5, by=revenue_year.DESC())
557+
"""
558+
# Transform the pydough code and get the result from result
559+
query = pydough.from_string(pydough_code, environment=env)
560+
sql = pydough.to_sql(query)
561+
```
562+
563+
The value of `sql` is the following SQL query text as a Python string:
564+
```sql
565+
WITH _s7 AS (
566+
SELECT
567+
ROUND(
568+
COALESCE(
569+
SUM(
570+
lineitem.l_extendedprice * (
571+
1 - lineitem.l_discount
572+
) * (
573+
1 - lineitem.l_tax
574+
) - lineitem.l_quantity * partsupp.ps_supplycost
575+
),
576+
0
577+
),
578+
2
579+
) AS revenue_year,
580+
partsupp.ps_suppkey
581+
FROM main.partsupp AS partsupp
582+
JOIN main.part AS part
583+
ON part.p_name LIKE 'coral%' AND part.p_partkey = partsupp.ps_partkey
584+
JOIN main.lineitem AS lineitem
585+
ON CAST(STRFTIME('%Y', lineitem.l_shipdate) AS INTEGER) = 1996
586+
AND lineitem.l_partkey = partsupp.ps_partkey
587+
AND lineitem.l_shipmode = 'TRUCK'
588+
AND lineitem.l_suppkey = partsupp.ps_suppkey
589+
GROUP BY
590+
partsupp.ps_suppkey
591+
)
592+
SELECT
593+
supplier.s_name AS name,
594+
_s7.revenue_year
595+
FROM main.supplier AS supplier
596+
JOIN main.nation AS nation
597+
ON nation.n_name = 'JAPAN' AND nation.n_nationkey = supplier.s_nationkey
598+
JOIN _s7 AS _s7
599+
ON _s7.ps_suppkey = supplier.s_suppkey
600+
ORDER BY
601+
revenue_year DESC
602+
LIMIT 5
603+
```
604+
605+
The final example generates a SQL query using `datetime.date`, which is passed in through the environment:
606+
```py
607+
# For every customer, how many urgent orders have they made in year 1996 with a
608+
# total price over 100000, and what is the sum of the total prices of all such
609+
# orders they made? Sort the result by the sum from highest to lowest, and only
610+
# include customers with at least one such order
611+
612+
# For this query we set 1 environment variable and 1 function, the year 1996
613+
# and date function from datetime. Optionally, we could import datetime.date
614+
# from inside the pydough code string
615+
import datetime
616+
env = {"date" : datetime.date, "YEAR" : 1996}
617+
618+
pydough_code="""
619+
selected_orders=orders.WHERE((order_priority == '1-URGENT') & (total_price > 100000) &
620+
(order_date >= date(YEAR, 1, 1)) & (order_date < date(YEAR + 1, 1, 1)))
621+
622+
result = (
623+
customers
624+
.WHERE(HAS(selected_orders))
625+
.CALCULATE(name, n_orders=COUNT(selected_orders), total=SUM(selected_orders.total_price))
626+
.ORDER_BY(total.DESC())
627+
)
628+
"""
629+
630+
# Transform the pydough code and get the result from result
631+
query = pydough.from_string(pydough_code, environment=env)
632+
sql = pydough.to_sql(query)
633+
```
634+
635+
The value of `sql` is the following SQL query text as a Python string:
636+
```sql
637+
WITH _s1 AS (
638+
SELECT
639+
COALESCE(SUM(o_totalprice), 0) AS total,
640+
COUNT(*) AS n_rows,
641+
o_custkey
642+
FROM main.orders
643+
WHERE
644+
o_orderdate < '1997-01-01'
645+
AND o_orderdate >= '1996-01-01'
646+
AND o_orderpriority = '1-URGENT'
647+
AND o_totalprice > 100000
648+
GROUP BY
649+
o_custkey
650+
)
651+
SELECT
652+
customer.c_name AS name,
653+
_s1.n_rows AS n_orders,
654+
_s1.total
655+
FROM main.customer AS customer
656+
JOIN _s1 AS _s1
657+
ON _s1.o_custkey = customer.c_custkey
658+
ORDER BY
659+
total DESC
660+
```
661+
479662
<!-- TOC --><a name="exploration-apis"></a>
480663
## Exploration APIs
481664

@@ -882,6 +1065,8 @@ This term is singular with regards to the collection, meaning it can be placed i
8821065
For example, the following is valid:
8831066
TPCH.nations.WHERE(region.name == 'EUROPE').CALCULATE(AVG(customers.acctbal))
8841067
```
1068+
1069+
<!-- TOC --><a name="logging"></a>
8851070
## Logging
8861071

8871072
Logging is enabled and set to INFO level by default. We can change the log level by setting the environment variable `PYDOUGH_LOG_LEVEL` to one of the standard levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
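
A minimal sketch of setting the level from Python. The text above does not say when the variable is read, so this assumes it must be in place before PyDough creates its logger and sets it before any `import pydough`. The lookup against the standard `logging` module is only a sanity check for illustration.

```python
import logging
import os

# Assumption: set the variable before PyDough is imported so it is picked up.
os.environ["PYDOUGH_LOG_LEVEL"] = "DEBUG"

# Sanity check: the chosen name is one of the standard logging level constants.
level_number = getattr(logging, os.environ["PYDOUGH_LOG_LEVEL"])
```

Setting the variable in the shell before launching Python achieves the same effect without touching the code.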

pydough/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -8,6 +8,7 @@
88
"explain",
99
"explain_structure",
1010
"explain_term",
11+
"from_string",
1112
"get_logger",
1213
"init_pydough_context",
1314
"parse_json_metadata_from_file",
@@ -21,7 +22,7 @@
2122
from .exploration import explain, explain_structure, explain_term
2223
from .logger import get_logger
2324
from .metadata import parse_json_metadata_from_file
24-
from .unqualified import display_raw, init_pydough_context
25+
from .unqualified import display_raw, from_string, init_pydough_context
2526
from .user_collections import range_collection
2627

2728
# Create a default session for the user to interact with.

pydough/configs/session.py

Lines changed: 11 additions & 11 deletions
@@ -54,7 +54,7 @@ def metadata(self) -> GraphMetadata | None:
5454
Get the active metadata graph.
5555
5656
Returns:
57-
GraphMetadata: The active metadata graph.
57+
The active metadata graph.
5858
"""
5959
return self._metadata
6060

@@ -64,7 +64,7 @@ def metadata(self, graph: GraphMetadata | None) -> None:
6464
Set the active metadata graph.
6565
6666
Args:
67-
graph (GraphMetadata | None): The metadata graph to set.
67+
`graph`: The metadata graph to set.
6868
"""
6969
self._metadata = graph
7070

@@ -74,7 +74,7 @@ def config(self) -> PyDoughConfigs:
7474
Get the active PyDough configuration.
7575
7676
Returns:
77-
PyDoughConfigs: The active PyDough configuration.
77+
The active PyDough configuration.
7878
"""
7979
return self._config
8080

@@ -84,7 +84,7 @@ def config(self, config: PyDoughConfigs) -> None:
8484
Set the active PyDough configuration.
8585
8686
Args:
87-
config (PyDoughConfigs): The PyDough configuration to set.
87+
`config`: The PyDough configuration to set.
8888
"""
8989
self._config = config
9090

@@ -94,7 +94,7 @@ def database(self) -> DatabaseContext:
9494
Get the active database context.
9595
9696
Returns:
97-
DatabaseContext: The active database context.
97+
The active database context.
9898
"""
9999
return self._database
100100

@@ -104,7 +104,7 @@ def database(self, context: DatabaseContext) -> None:
104104
Set the active database context.
105105
106106
Args:
107-
context (DatabaseContext): The database context to set.
107+
`context`: The database context to set.
108108
"""
109109
self._database = context
110110

@@ -114,13 +114,13 @@ def connect_database(self, database_name: str, **kwargs) -> DatabaseContext:
114114
the corresponding context in case the user wants/needs to modify it.
115115
116116
Args:
117-
database_name (str): The name of the database to connect to.
117+
`database_name`: The name of the database to connect to.
118118
**kwargs: Additional keyword arguments to pass to the connection.
119119
All arguments must be accepted using the supported connect API
120120
for the dialect. Most likely the database path will be required.
121121
122122
Returns:
123-
DatabaseContext: The newly created database context.
123+
The newly created database context.
124124
"""
125125
context: DatabaseContext = load_database_context(database_name, **kwargs)
126126
self.database = context
@@ -135,13 +135,13 @@ def load_metadata_graph(self, graph_path: str, graph_name: str) -> GraphMetadata
135135
property directly later.
136136
137137
Args:
138-
graph_path (str): The path to load the graph. At this time this must be on
138+
`graph_path`: The path to load the graph. At this time this must be on
139139
the user's local file system.
140-
graph_name (str): The name under which to load the graph from the file. This
140+
`graph_name`: The name under which to load the graph from the file. This
141141
is to allow loading multiple graphs from the same json file.
142142
143143
Returns:
144-
GraphMetadata: The loaded metadata graph.
144+
The loaded metadata graph.
145145
"""
146146
graph: GraphMetadata = parse_json_metadata_from_file(graph_path, graph_name)
147147
self.metadata = graph
