diff --git a/docs/guide/formats/index.md b/docs/guide/formats/index.md deleted file mode 100644 index 574114cdc4..0000000000 --- a/docs/guide/formats/index.md +++ /dev/null @@ -1,29 +0,0 @@ ---- -title: Data Formats -rank: 5 ---- - -# Data Formats - -Sail supports various data formats for reading and writing. - -You can use the `SparkSession.read`, `DataFrame.write`, and `DataFrame.writeTo()` API to load and save data in different -formats. -You can also use the `CREATE TABLE` SQL statement to create a table that refers to data stored in a specific format. - -Here is a summary of the supported (:white_check_mark:) and unsupported (:x:) data formats for reading and writing data. -There are also features that are planned in our roadmap (:construction:). - -| Format | Read Support | Write Support | -| ---------------------- | ---------------------------- | ---------------------------- | -| [Delta Lake](./delta) | :white_check_mark: (partial) | :white_check_mark: (partial) | -| [Iceberg](./iceberg) | :white_check_mark: (partial) | :white_check_mark: (partial) | -| Parquet | :white_check_mark: | :white_check_mark: | -| Binary (any file type) | :white_check_mark: | :x: | -| CSV | :white_check_mark: | :white_check_mark: | -| JSON | :white_check_mark: | :white_check_mark: | -| Text | :white_check_mark: | :white_check_mark: | -| Avro | :white_check_mark: | :white_check_mark: | -| Protocol Buffers | :construction: | :construction: | -| Hudi | :construction: | :construction: | -| ORC | :x: | :x: | diff --git a/docs/guide/integrations/jdbc.md b/docs/guide/integrations/jdbc.md deleted file mode 100644 index 1b66f04b68..0000000000 --- a/docs/guide/integrations/jdbc.md +++ /dev/null @@ -1,139 +0,0 @@ ---- -title: JDBC -rank: 3 ---- - -# JDBC Datasource - -Sail provides a database connector exposed under the `"jdbc"` format name for API parity with -vanilla PySpark — no actual JDBC driver or JVM is involved. 
- -## Installation - -```bash -pip install pysail[jdbc] -``` - -## Quick Start - -Register the datasource once per Spark session, then read using the standard PySpark API. - -```python -from pysail.spark.datasource.jdbc import JdbcDataSource - -spark.dataSource.register(JdbcDataSource) - -# Using format("jdbc") — full option control -df = ( - spark.read.format("jdbc") - .option("url", "jdbc:postgresql://localhost:5432/mydb") - .option("dbtable", "public.users") - .option("user", "alice") - .option("password", "secret") - .load() -) - -# Using spark.read.jdbc() shorthand -df = spark.read.jdbc( - "jdbc:postgresql://localhost:5432/mydb", - "public.users", - properties={"user": "alice", "password": "secret"}, -) -df.show() -``` - -## Supported Options - -Options are consistent with the [PySpark JDBC documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html). - -| Option | Required | Default | Description | -| ------------------- | --------- | ------- | --------------------------------------------------------------------------------------------------- | -| `url` | **Yes** | | JDBC URL: `jdbc:://:/` | -| `dbtable` | **Yes\*** | | Table name, optionally schema-qualified (`"schema.table"`). Mutually exclusive with `query`. | -| `query` | **Yes\*** | | Arbitrary SQL SELECT. Mutually exclusive with `dbtable`. | -| `user` | No | | Database username. Can also be passed in `properties` dict. | -| `password` | No | | Database password. Can also be passed in `properties` dict. | -| `partitionColumn` | No | | Numeric column for range-stride partitioning. Requires `lowerBound`, `upperBound`, `numPartitions`. | -| `lowerBound` | No | | Lower bound of partition stride (inclusive). | -| `upperBound` | No | | Upper bound of partition stride (inclusive on last partition). | -| `numPartitions` | No | `1` | Number of parallel read partitions. | -| `fetchsize` | No | `0` | Advisory rows-per-round-trip hint. 
| -| `pushDownPredicate` | No | `true` | Push `WHERE` filters to the database. Set to `false` to disable. | -| `customSchema` | No | | Spark DDL string to override inferred column types (e.g. `"id DECIMAL(38,0), name STRING"`). | - -\* Exactly one of `dbtable` or `query` is required. - -## Reading a Custom SQL Query - -Use `query` instead of `dbtable` to run arbitrary SQL: - -```python -df = ( - spark.read.format("jdbc") - .option("url", "jdbc:postgresql://localhost:5432/mydb") - .option("query", "SELECT id, name FROM users WHERE active = TRUE") - .option("user", "alice") - .option("password", "secret") - .load() -) -``` - -> **Note:** `query` and `partitionColumn` are mutually exclusive. To partition -> a custom query, wrap it in `dbtable` as a subquery: -> -> ```python -> .option("dbtable", "(SELECT * FROM events WHERE type='click') AS t") -> .option("partitionColumn", "user_id") -> ``` - -## Parallel Reads (Range Partitioning) - -Provide `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` together to -split the read into parallel range strides — consistent with PySpark JDBC semantics: - -```python -df = ( - spark.read.format("jdbc") - .option("url", "jdbc:postgresql://localhost:5432/mydb") - .option("dbtable", "events") - .option("partitionColumn", "id") - .option("lowerBound", "1") - .option("upperBound", "10000000") - .option("numPartitions", "8") - .option("user", "alice") - .option("password", "secret") - .load() -) -``` - -Or equivalently: - -```python -df = spark.read.jdbc( - "jdbc:postgresql://localhost:5432/mydb", - "events", - column="id", - lowerBound=1, - upperBound=10_000_000, - numPartitions=8, - properties={"user": "alice", "password": "secret"}, -) -``` - -## Schema Override - -Use `customSchema` to override column types after reading: - -```python -df = ( - spark.read.format("jdbc") - .option("url", "jdbc:postgresql://localhost:5432/mydb") - .option("dbtable", "orders") - .option("customSchema", "amount DECIMAL(18,2), status 
STRING") - .option("user", "alice") - .option("password", "secret") - .load() -) -``` - -Columns not listed in `customSchema` retain their inferred types. diff --git a/docs/guide/formats/delta.md b/docs/guide/sources/delta/examples.md similarity index 89% rename from docs/guide/formats/delta.md rename to docs/guide/sources/delta/examples.md index 546a61566f..48eb62acdf 100644 --- a/docs/guide/formats/delta.md +++ b/docs/guide/sources/delta/examples.md @@ -1,18 +1,13 @@ --- -title: Delta Lake +title: Examples rank: 1 --- -# Delta Lake +# Examples -You can use the `delta` format in Sail to work with [Delta Lake](https://delta.io/). -You can use the Spark DataFrame API or Spark SQL to read and write Delta tables. + -## Examples - - - -### Basic Usage +## Basic Usage ::: code-group @@ -44,7 +39,7 @@ SELECT * FROM users; ::: -### Data Partitioning +## Data Partitioning You can work with partitioned Delta tables using the Spark DataFrame API. Partitioned Delta tables organize data into directories based on the values of one or more columns. @@ -78,7 +73,7 @@ SELECT * FROM metrics WHERE year > 2024; ::: -### Schema Evolution +## Schema Evolution Delta Lake handles schema evolution gracefully. By default, if you try to write data with a different schema than the one of the existing Delta table, an error will occur. @@ -96,7 +91,7 @@ But this works only if you set the write mode to `overwrite`. df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(path) ``` -### Time Travel +## Time Travel You can use the time travel feature to query historical versions of a Delta table. @@ -107,7 +102,7 @@ df = spark.read.format("delta").option("timestampAsOf", "2025-01-02T03:04:05.678 Time travel is not available for Spark SQL in Sail yet, but we plan to support it soon. -### Column Mapping +## Column Mapping You can write Delta tables with column mapping enabled. The supported column mapping modes are `name` and `id`. 
You must write to a new Delta table to enable column mapping. @@ -118,7 +113,7 @@ df.write.format("delta").option("columnMappingMode", "id").save(path) Existing Delta tables with column mapping can be read as usual. -### More Features +## More Features We will continue adding more examples for advanced Delta Lake features as they become available in Sail. In the meantime, feel free to reach out to us on [Slack](https://lakesail.com/slack) or [GitHub Discussions](https://github.com/lakehq/sail/discussions) if you have questions! diff --git a/docs/guide/sources/delta/features.md b/docs/guide/sources/delta/features.md new file mode 100644 index 0000000000..cabf21622c --- /dev/null +++ b/docs/guide/sources/delta/features.md @@ -0,0 +1,56 @@ +--- +title: Supported Features +rank: 2 +--- + +# Supported Features + +## Core Table Operations + +| Feature | Supported | +| ------------------------------------------- | ------------------ | +| Read | :white_check_mark: | +| Write (append) | :white_check_mark: | +| Write (overwrite) | :white_check_mark: | +| Data skipping (partition pruning) | :white_check_mark: | +| Data skipping (pruning via file statistics) | :white_check_mark: | +| Schema validation | :white_check_mark: | +| Schema evolution | :white_check_mark: | +| Time travel (by version) | :white_check_mark: | +| Time travel (by timestamp) | :white_check_mark: | + +Both non-partitioned and partitioned tables are supported for reading and writing. + +## DML Operations + +| Feature | Supported | +| ------------------------ | ------------------ | +| `DELETE` (copy-on-write) | :white_check_mark: | +| `MERGE` (copy-on-write) | :white_check_mark: | +| `DELETE` (merge-on-read) | :construction: | +| `MERGE` (merge-on-read) | :construction: | +| `UPDATE` | :construction: | + +The "merge-on-read" mode refers to updating the table with deletion vectors. 
This reduces the amount of data that needs to be rewritten during DML operations, but incurs additional read overhead when querying the table. + +## Table Maintenance Operations + +| Feature | Supported | +| ---------- | -------------- | +| `VACUUM` | :construction: | +| `OPTIMIZE` | :construction: | +| `RESTORE` | :construction: | + +## Protocol Internals + +| Feature | Supported | +| -------------------------------- | ------------------ | +| Checkpointing | :white_check_mark: | +| Log clean-up | :white_check_mark: | +| Column mapping | :white_check_mark: | +| Deletion vectors | :construction: | +| Constraints | :construction: | +| Identity columns | :construction: | +| Generated columns | :construction: | +| Transaction (conflict detection) | :construction: | +| Change data feed | :construction: | diff --git a/docs/guide/sources/delta/index.data.ts b/docs/guide/sources/delta/index.data.ts new file mode 100644 index 0000000000..191f6432dd --- /dev/null +++ b/docs/guide/sources/delta/index.data.ts @@ -0,0 +1,5 @@ +import { createContentLoader } from "vitepress"; + +export default createContentLoader([ + "/guide/sources/delta/!(index|_*/**|**/_*/**).md", +]); diff --git a/docs/guide/sources/delta/index.md b/docs/guide/sources/delta/index.md new file mode 100644 index 0000000000..16f3624428 --- /dev/null +++ b/docs/guide/sources/delta/index.md @@ -0,0 +1,18 @@ +--- +title: Delta Lake +rank: 1 +--- + +# Delta Lake + +You can use the `delta` format in Sail to work with [Delta Lake](https://delta.io/). +You can use the Spark DataFrame API or Spark SQL to read and write Delta tables. 
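To build intuition for the topics below, it helps to picture what the `delta` format manages on disk: a directory of Parquet data files plus a `_delta_log` directory of JSON commits. The sketch below is a toy illustration of transaction-log replay, not Sail's implementation; the file names and commit contents are hypothetical.

```python
import json

def live_files(commits):
    """Replay `add` and `remove` actions from transaction log commits in order."""
    files = set()
    for commit in commits:
        for line in commit.splitlines():  # each commit is newline-delimited JSON
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

# Commit 0 adds one data file; commit 1 rewrites it (copy-on-write).
commit_0 = '{"add": {"path": "part-00000.parquet"}}'
commit_1 = (
    '{"remove": {"path": "part-00000.parquet"}}\n'
    '{"add": {"path": "part-00001.parquet"}}'
)
print(live_files([commit_0, commit_1]))  # {'part-00001.parquet'}
```

Features such as schema evolution and time travel build on this same log: a schema change is just another commit, and reading an older version amounts to replaying fewer commits.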
+ +## Topics + + + + diff --git a/docs/guide/formats/iceberg.md b/docs/guide/sources/iceberg/examples.md similarity index 86% rename from docs/guide/formats/iceberg.md rename to docs/guide/sources/iceberg/examples.md index 2d9ae3a418..f68b0d5b27 100644 --- a/docs/guide/formats/iceberg.md +++ b/docs/guide/sources/iceberg/examples.md @@ -1,18 +1,13 @@ --- -title: Iceberg -rank: 2 +title: Examples +rank: 1 --- -# Iceberg +# Examples -You can use the `iceberg` format in Sail to work with [Apache Iceberg](https://iceberg.apache.org/). -You can use the Spark DataFrame API or Spark SQL to read and write Iceberg tables. + -## Examples - - - -### Basic Usage +## Basic Usage ::: code-group @@ -44,7 +39,7 @@ SELECT * FROM users; ::: -### Data Partitioning +## Data Partitioning You can work with partitioned Iceberg tables using the Spark DataFrame API. Partitioned Iceberg tables organize data into directories based on the values of one or more columns. @@ -78,7 +73,7 @@ SELECT * FROM metrics WHERE year > 2024; ::: -### Time Travel +## Time Travel You can use the time travel feature to query tags, branches, or historical versions of an Iceberg table. @@ -90,7 +85,7 @@ df = spark.read.format("iceberg").option("branch", "main").load(path) Time travel is not available for Spark SQL in Sail yet, but we plan to support it soon. -### More Features +## More Features We will continue adding more examples for advanced Iceberg features as they become available in Sail. In the meantime, feel free to reach out to us on [Slack](https://lakesail.com/slack) or [GitHub Discussions](https://github.com/lakehq/sail/discussions) if you have questions! 
diff --git a/docs/guide/sources/iceberg/features.md b/docs/guide/sources/iceberg/features.md new file mode 100644 index 0000000000..41c8231524 --- /dev/null +++ b/docs/guide/sources/iceberg/features.md @@ -0,0 +1,62 @@ +--- +title: Supported Features +rank: 2 +--- + +# Supported Features + +## Overview + +Here is a high-level overview of the features supported by Sail for Iceberg tables. + +| Feature | Supported | +| ----------------- | ------------------ | +| Read | :white_check_mark: | +| Write (append) | :white_check_mark: | +| Write (overwrite) | :white_check_mark: | +| `DELETE` | :white_check_mark: | +| `MERGE` | :construction: | +| `UPDATE` | :construction: | + +Both non-partitioned and partitioned tables are supported for reading and writing. + +The write operations currently follow "copy-on-write" semantics. +We plan to support delete files and deletion vectors, which would enable "merge-on-read" write operations in the future. + +## Version-specific Features + +We classify the supported features according to the [Iceberg specification](https://iceberg.apache.org/spec/). + +### Version 1: Analytic Data Tables + +| Feature | Supported | +| --------------------- | ------------------ | +| Metadata | :white_check_mark: | +| Manifest list | :white_check_mark: | +| File format (Parquet) | :white_check_mark: | +| File format (Avro) | :white_check_mark: | +| File format (ORC) | :construction: | +| Schema evolution | :white_check_mark: | +| Partition evolution | :construction: | +| Time travel | :white_check_mark: | +| Column statistics | :white_check_mark: | + +Reading existing branches and tags is supported (time travel). +We plan to support creating branches and tags in DDL operations in the future. 
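Time travel by timestamp can be pictured with the snapshot log kept in Iceberg table metadata. The sketch below is a toy illustration of the selection rule, not Sail's code; the `timestamp-ms` and `snapshot-id` field names come from the Iceberg specification, while the example IDs are made up.

```python
def snapshot_as_of(snapshot_log, ts_ms):
    """Pick the latest snapshot committed at or before `ts_ms`."""
    chosen = None
    for entry in snapshot_log:  # the snapshot log is append-only, in timestamp order
        if entry["timestamp-ms"] <= ts_ms:
            chosen = entry["snapshot-id"]
    return chosen

log = [
    {"timestamp-ms": 1_000, "snapshot-id": 101},
    {"timestamp-ms": 2_000, "snapshot-id": 102},
    {"timestamp-ms": 3_000, "snapshot-id": 103},
]
print(snapshot_as_of(log, 2_500))  # 102
```

A timestamp earlier than the first snapshot is an error in a real table (the sketch returns `None` instead), and branches and tags are simply named references that resolve to snapshot IDs on top of the same mechanism.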
+ +### Version 2: Row-Level Deletes + +| Feature | Supported | +| ------------------- | ------------------ | +| Delete files | :construction: | +| Sequence numbers | :white_check_mark: | +| Manifest extensions | :construction: | + +### Version 3: Extended Types and Capabilities + +| Feature | Supported | +| --------------------- | -------------- | +| Deletion vectors | :construction: | +| Row lineage | :construction: | +| Column default values | :construction: | +| Encryption keys | :construction: | diff --git a/docs/guide/sources/iceberg/index.data.ts b/docs/guide/sources/iceberg/index.data.ts new file mode 100644 index 0000000000..b0b718411c --- /dev/null +++ b/docs/guide/sources/iceberg/index.data.ts @@ -0,0 +1,5 @@ +import { createContentLoader } from "vitepress"; + +export default createContentLoader([ + "/guide/sources/iceberg/!(index|_*/**|**/_*/**).md", +]); diff --git a/docs/guide/sources/iceberg/index.md b/docs/guide/sources/iceberg/index.md new file mode 100644 index 0000000000..5ec82a610e --- /dev/null +++ b/docs/guide/sources/iceberg/index.md @@ -0,0 +1,18 @@ +--- +title: Iceberg +rank: 2 +--- + +# Iceberg + +You can use the `iceberg` format in Sail to work with [Apache Iceberg](https://iceberg.apache.org/). +You can use the Spark DataFrame API or Spark SQL to read and write Iceberg tables. + +## Topics + + + + diff --git a/docs/guide/sources/index.md b/docs/guide/sources/index.md new file mode 100644 index 0000000000..fa2d53d150 --- /dev/null +++ b/docs/guide/sources/index.md @@ -0,0 +1,30 @@ +--- +title: Data Sources +rank: 5 +--- + +# Data Sources + +Sail supports various data sources for reading and writing. + +You can use the `SparkSession.read`, `DataFrame.write`, and `DataFrame.writeTo()` API to load and save data in different +formats. +You can also use the `CREATE TABLE` SQL statement to create a table that refers to a specific data source. 
+ +Here is a summary of the supported (:white_check_mark:) and unsupported (:x:) data sources for reading and writing data. +There are also features that are planned in our roadmap (:construction:). + +| Format | Read Support | Write Support | +| ---------------------- | ------------------ | ------------------ | +| [Delta Lake](./delta/) | :white_check_mark: | :white_check_mark: | +| [Iceberg](./iceberg/) | :white_check_mark: | :white_check_mark: | +| Files (Parquet) | :white_check_mark: | :white_check_mark: | +| Files (CSV) | :white_check_mark: | :white_check_mark: | +| Files (JSON) | :white_check_mark: | :white_check_mark: | +| Files (Binary) | :white_check_mark: | :x: | +| Files (Text) | :white_check_mark: | :white_check_mark: | +| Files (Avro) | :white_check_mark: | :white_check_mark: | +| [Python](./python/) | :white_check_mark: | :white_check_mark: | +| [JDBC](./jdbc/) | :white_check_mark: | :construction: | +| Hudi | :construction: | :construction: | +| Files (ORC) | :construction: | :construction: | diff --git a/docs/guide/sources/jdbc/index.md b/docs/guide/sources/jdbc/index.md new file mode 100644 index 0000000000..365d484cf6 --- /dev/null +++ b/docs/guide/sources/jdbc/index.md @@ -0,0 +1,161 @@ +--- +title: JDBC +rank: 4 +--- + +# JDBC Data Source + +Sail provides a database connector exposed under the `jdbc` format name for API parity with vanilla PySpark. +The implementation is based on the Python `connectorx` library. +No actual JDBC driver or JVM is involved. + + + +## Installation + +You need to install the `pysail` package with the `jdbc` extra to use the JDBC data source. + +```bash +pip install pysail[jdbc] +``` + +## Quick Start + +Register the datasource once per Spark session. + +```python +from pysail.spark.datasource.jdbc import JdbcDataSource + +spark.dataSource.register(JdbcDataSource) +``` + +Then read from a database using the standard PySpark API. 
+ +```python +df = ( + spark.read.format("jdbc") + .option("url", "jdbc:postgresql://localhost:5432/mydb") + .option("dbtable", "public.users") + .option("user", "alice") + .option("password", "secret") + .load() +) +``` + +Alternatively, you can use the `spark.read.jdbc()` shorthand method. + +```python +df = spark.read.jdbc( + "jdbc:postgresql://localhost:5432/mydb", + "public.users", + properties={"user": "alice", "password": "secret"}, +) +``` + +## Options + +The data source options are consistent with +the [PySpark JDBC documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html). + +| Name | Required | Default | Description | | ------------------- | -------- | ------- | ---------------------------------------------------------------------------------------------------------------- | | `url` | Yes | | The JDBC URL in the form of `jdbc:<vendor>://<host>:<port>/<database>`. | | `dbtable` | Yes | | The table name, optionally schema-qualified (`<schema>.<table>`). This is mutually exclusive with `query`. | | `query` | Yes | | An arbitrary SQL `SELECT` statement. This is mutually exclusive with `dbtable`. | | `user` | No | | The database username. | | `password` | No | | The database password. | | `partitionColumn` | No | | The numeric column for range-stride partitioning. This requires `lowerBound`, `upperBound`, and `numPartitions`. | | `lowerBound` | No | | The lower bound of partition stride (inclusive). | | `upperBound` | No | | The upper bound of partition stride (inclusive on last partition). | | `numPartitions` | No | `1` | The number of parallel read partitions. | | `fetchsize` | No | `0` | An advisory rows-per-round-trip hint. | | `pushDownPredicate` | No | `true` | Whether to push `WHERE` filters to the database. | | `customSchema` | No | | A Spark DDL string to override inferred column types. | + +::: info +Exactly one of the `dbtable` or `query` options is required. 
+::: + +## Examples + +### Custom SQL Queries + +Use `query` instead of `dbtable` to run arbitrary SQL queries: + +```python +df = ( + spark.read.format("jdbc") + .option("url", "jdbc:postgresql://localhost:5432/mydb") + .option("query", "SELECT id, name FROM users WHERE active = TRUE") + .option("user", "alice") + .option("password", "secret") + .load() +) +``` + +The `query` and `partitionColumn` options are mutually exclusive. To partition a custom query, wrap it in `dbtable` as a +subquery: + +```python{4-5} +df = ( + spark.read.format("jdbc") + .option("url", "jdbc:postgresql://localhost:5432/mydb") + .option("dbtable", "(SELECT * FROM events WHERE type='click') AS t") + .option("partitionColumn", "user_id") + .option("user", "alice") + .option("password", "secret") + .load() +) +``` + +### Parallel Reads with Range Partitioning + +Provide `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` together to split the read into parallel +range strides: + +```python +df = ( + spark.read.format("jdbc") + .option("url", "jdbc:postgresql://localhost:5432/mydb") + .option("dbtable", "events") + .option("partitionColumn", "id") + .option("lowerBound", "1") + .option("upperBound", "10000000") + .option("numPartitions", "8") + .option("user", "alice") + .option("password", "secret") + .load() +) +``` + +Or equivalently, you can use the `spark.read.jdbc()` method with the same options: + +```python +df = spark.read.jdbc( + "jdbc:postgresql://localhost:5432/mydb", + "events", + column="id", + lowerBound=1, + upperBound=10_000_000, + numPartitions=8, + properties={"user": "alice", "password": "secret"}, +) +``` + +### Schema Override + +Use `customSchema` to override column types after reading: + +```python +df = ( + spark.read.format("jdbc") + .option("url", "jdbc:postgresql://localhost:5432/mydb") + .option("dbtable", "orders") + .option("customSchema", "amount DECIMAL(18,2), status STRING") + .option("user", "alice") + .option("password", "secret") + .load() 
+) +``` + +Columns not listed in `customSchema` retain their inferred types. diff --git a/docs/guide/sources/python/index.md b/docs/guide/sources/python/index.md new file mode 100644 index 0000000000..c142d3a754 --- /dev/null +++ b/docs/guide/sources/python/index.md @@ -0,0 +1,29 @@ +--- +title: Python +rank: 3 +--- + +# Python Data Sources + +The Python data source allows you to extend the `SparkSession.read` and `DataFrame.write` APIs to support custom formats and external system integrations. +It optionally supports Arrow for zero-copy data exchange between the Python process and the Sail execution engine. This gives you flexibility in data source implementations without incurring performance penalties. + +You can define a Python class that inherits from the `pyspark.sql.datasource.DataSource` abstract class, and register it to the Spark session to create a custom data source that can be used in the standard PySpark API. The `DataSource` class provides methods for defining the name and schema of the data source, as well as methods for creating readers and writers. + +Currently, Sail supports Python data sources for batch reading and writing. + +## Examples + + + +### Batch Reader + +<<< @/../python/pysail/tests/spark/test_python_datasource_read.txt{python-console} + +### Batch Arrow Reader + +<<< @/../python/pysail/tests/spark/test_python_datasource_read_arrow.txt{python-console} + +### More Examples + +Please refer to the [Spark documentation](https://spark.apache.org/docs/latest/api/python/tutorial/sql/python_data_source.html) for more Python data source examples, including how to define a batch writer. We will also add more examples to this guide in the future. Stay tuned! diff --git a/docs/guide/sql/features.md b/docs/guide/sql/features.md index a1f629b64e..09ebe339b4 100644 --- a/docs/guide/sql/features.md +++ b/docs/guide/sql/features.md @@ -31,7 +31,7 @@ The following table lists the supported clauses in the `SELECT` statement. 
| Clause | Supported | | --------------------------------- | ------------------ | | `FROM ` | :white_check_mark: | -| `FROM .` (files) | :construction: | +| `FROM .` (files) | :white_check_mark: | | `WHERE` | :white_check_mark: | | `GROUP BY` | :white_check_mark: | | `HAVING` | :white_check_mark: | @@ -53,7 +53,7 @@ The following table lists the supported clauses in the `SELECT` statement. | `UNPIVOT` | :construction: | | `LATERAL VIEW` | :white_check_mark: | | `LATERAL ` | :construction: | -| `TABLESAMPLE` | :construction: | +| `TABLESAMPLE` | :white_check_mark: | | `TRANSFORM` | :construction: | The `EXPLAIN` statement is also supported, but the output shows the Sail logical and physical plan. @@ -91,7 +91,7 @@ But some extensions support these statements for lakehouse tables (e.g., Delta L | `CREATE VIEW` | :construction: | | `DESCRIBE DATABASE` | :construction: | | `DESCRIBE FUNCTION` | :construction: | -| `DESCRIBE TABLE` | :construction: | +| `DESCRIBE TABLE` | :white_check_mark: | | `DROP DATABASE` | :white_check_mark: | | `DROP FUNCTION` | :construction: | | `DROP TABLE` | :white_check_mark: | diff --git a/python/pysail/data/compatibility/functions/scalar/datetime.json b/python/pysail/data/compatibility/functions/scalar/datetime.json index 4cb477ae26..f004355a10 100644 --- a/python/pysail/data/compatibility/functions/scalar/datetime.json +++ b/python/pysail/data/compatibility/functions/scalar/datetime.json @@ -293,17 +293,17 @@ { "module": "pyspark.sql.functions", "function": "try_make_timestamp", - "status": "planned" + "status": "supported" }, { "module": "pyspark.sql.functions", "function": "try_make_timestamp_ltz", - "status": "planned" + "status": "supported" }, { "module": "pyspark.sql.functions", "function": "try_make_timestamp_ntz", - "status": "planned" + "status": "supported" }, { "module": "pyspark.sql.functions", diff --git a/python/pysail/tests/spark/datasource/test_jdbc.py b/python/pysail/tests/spark/datasource/test_jdbc.py index 
265888b025..bfd55e6caf 100644 --- a/python/pysail/tests/spark/datasource/test_jdbc.py +++ b/python/pysail/tests/spark/datasource/test_jdbc.py @@ -13,6 +13,10 @@ from pysail.tests.spark.utils import pyspark_version +# We skip all the tests in this module for now since testcontainers has some issues +# on macOS and Windows. +pytest.skip("testcontainers has issues on macOS and Windows", allow_module_level=True) + if pyspark_version() < (4, 1): pytest.skip("Python data source requires Spark 4.1+", allow_module_level=True) diff --git a/python/pysail/tests/spark/test_python_datasource_read.txt b/python/pysail/tests/spark/test_python_datasource_read.txt index 9f2f58e763..b6e6dfdece 100644 --- a/python/pysail/tests/spark/test_python_datasource_read.txt +++ b/python/pysail/tests/spark/test_python_datasource_read.txt @@ -21,6 +21,7 @@ ... def read(self, partition: InputPartition) -> Iterator[Tuple]: ... yield ("Alice", 20) ... yield ("Bob", 30) +>>> >>> spark.dataSource.register(SimpleDataSource) >>> spark.read.format("simple").load().show() +-----+---+