
Commit 3c01503

Add Parquet documentation (#1448)
* Add Parquet documentation with Apache Arrow integration details. Added a new topic on reading Parquet files via Apache Arrow to the documentation. Updated the data sources index and table of contents to include the Parquet guide. The documentation covers method overloads, examples, requirements, performance tips, and current limitations.
* Rewrite a Guide
* Add examples and tests for reading Parquet files via various input types
1 parent e8cb692 commit 3c01503

File tree

8 files changed: +247 −1 lines changed

docs/StardustDocs/d.tree

Lines changed: 1 addition & 0 deletions
@@ -209,6 +209,7 @@
 <toc-element topic="CSV-TSV.md"/>
 <toc-element topic="Excel.md"/>
 <toc-element topic="ApacheArrow.md"/>
+<toc-element topic="Parquet.md"/>
 <toc-element topic="SQL.md">
     <toc-element topic="PostgreSQL.md"/>
     <toc-element topic="MySQL.md"/>

docs/StardustDocs/topics/dataSources/ApacheArrow.md

Lines changed: 4 additions & 0 deletions
@@ -24,6 +24,10 @@ and in [`%use dataframe`](SetupKotlinNotebook.md#integrate-kotlin-dataframe) for
 > when using Java 9+.
 > {style="warning"}
 
+> Structured (nested) Arrow types such as Struct are not yet supported in Kotlin DataFrame.
+> See the issue: [Add inner / Struct type support in Arrow](https://github.com/Kotlin/dataframe/issues/536)
+> {style="warning"}
+
 ## Read
 
 [`DataFrame`](DataFrame.md) supports both the

docs/StardustDocs/topics/dataSources/Data-Sources.md

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ Below you'll find a list of supported sources along with instructions on how to
 - [CSV / TSV](CSV-TSV.md)
 - [Excel](Excel.md)
 - [Apache Arrow](ApacheArrow.md)
+- [Parquet](Parquet.md)
 - [SQL](SQL.md):
   - [PostgreSQL](PostgreSQL.md)
   - [MySQL](MySQL.md)
docs/StardustDocs/topics/dataSources/Parquet.md

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
# Parquet

<web-summary>
Read Parquet files via Apache Arrow in Kotlin DataFrame — high‑performance columnar storage for analytics.
</web-summary>

<card-summary>
Use Kotlin DataFrame to read Parquet datasets using Apache Arrow for fast, typed, columnar I/O.
</card-summary>

<link-summary>
Kotlin DataFrame can read Parquet files through Apache Arrow’s Dataset API. Learn how and when to use it.
</link-summary>

Kotlin DataFrame supports reading [Apache Parquet](https://parquet.apache.org/) files through the Apache Arrow integration.

This requires the [`dataframe-arrow` module](Modules.md#dataframe-arrow), which is included by default in the general [`dataframe`](Modules.md#dataframe-general) artifact and when using `%use dataframe` in Kotlin Notebook.

> Kotlin DataFrame currently supports only *reading* Parquet via Apache Arrow; writing Parquet is not supported.
> {style="note"}

> Apache Arrow is not supported on Android, so reading Parquet files is not available on Android.
> {style="warning"}

> Structured (nested) Arrow types such as Struct are not yet supported in Kotlin DataFrame.
> See the issue: [Add inner / Struct type support in Arrow](https://github.com/Kotlin/dataframe/issues/536)
> {style="warning"}

## Reading Parquet Files

Kotlin DataFrame provides four `readParquet()` overloads that read from different source types.
All overloads accept optional `nullability` inference settings and a `batchSize` for Arrow scanning.

```kotlin
// 1) URLs
public fun DataFrame.Companion.readParquet(
    vararg urls: URL,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 2) Strings (interpreted as file paths or URLs, e.g., "data/file.parquet", "file://", or "http(s)://")
public fun DataFrame.Companion.readParquet(
    vararg strUrls: String,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 3) Paths
public fun DataFrame.Companion.readParquet(
    vararg paths: Path,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 4) Files
public fun DataFrame.Companion.readParquet(
    vararg files: File,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame
```

These overloads are defined in the `dataframe-arrow` module and internally use `FileFormat.PARQUET` from Apache Arrow’s
Dataset API to scan the data and materialize it as a Kotlin `DataFrame`.

### Examples

```kotlin
// Read from file paths (as strings)
val df1 = DataFrame.readParquet("data/sales.parquet")
```
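
Because string arguments may also be URLs, the same overload can read a remote file (the URL below is purely illustrative):

```kotlin
// Hypothetical remote dataset; any reachable http(s) URL
// pointing at a Parquet file works the same way.
val remote = DataFrame.readParquet("https://example.org/data/sales.parquet")
```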

<!---FUN readParquetFilePath-->

```kotlin
// Read from Path objects
val path = Paths.get("data/sales.parquet")
val df = DataFrame.readParquet(path)
```

<!---END-->

<!---FUN readParquetURL-->

```kotlin
// Read from URLs
val df = DataFrame.readParquet(url)
```

<!---END-->

<!---FUN readParquetFile-->

```kotlin
// Read from File objects
val file = File("data/sales.parquet")
val df = DataFrame.readParquet(file)
```

<!---END-->

<!---FUN readParquetFileWithParameters-->

```kotlin
// Read with explicit nullability and batch-size settings
val file = File("data/sales.parquet")

val df = DataFrame.readParquet(
    file,
    nullability = NullabilityOptions.Infer,
    batchSize = 64L * 1024
)
```

<!---END-->
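
The `nullability` parameter controls how column nullability is resolved when reading. Assuming the same `NullabilityOptions` values as the Arrow reader (`Infer`, `Checking`, `Widening`; see [](ApacheArrow.md)), a stricter read might look like this sketch (the file name is illustrative):

```kotlin
// Sketch: NullabilityOptions.Checking is assumed to behave as in the Arrow reader,
// failing if the data contradicts the nullability declared in the file's schema.
val strict = DataFrame.readParquet(
    File("data/sales.parquet"),
    nullability = NullabilityOptions.Checking,
)
```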

If you want to see a complete, realistic data‑engineering example using Spark and Parquet with Kotlin DataFrame,
check out the [example project](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe).

### Multiple Files

It's possible to read multiple Parquet files:

<!---FUN readMultipleParquetFiles-->

```kotlin
val file = File("data/sales.parquet")
val file1 = File("data/sales1.parquet")
val file2 = File("data/sales2.parquet")

val df = DataFrame.readParquet(file, file1, file2)
```

<!---END-->

**Requirements:**

- All files must have compatible schemas
- Files are vertically concatenated (union of rows)
- Column types must match exactly
- Missing columns in some files will result in null values
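
For example, to read every Parquet file in a directory (a sketch; the `data/` directory and its file layout are hypothetical), you can collect the files and pass them to the `vararg` overload:

```kotlin
// Gather all Parquet files in a directory and read them as one
// vertically concatenated DataFrame (schemas must be compatible).
val parquetFiles = File("data")
    .listFiles { f -> f.extension == "parquet" }
    .orEmpty()

val combined = DataFrame.readParquet(*parquetFiles)
```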

### Performance tips

- **Column selection**: `readParquet` reads all columns, so apply operations like `select()` immediately after reading to reduce memory usage in later operations (see the sketch after this list)
- **Predicate pushdown**: Currently not supported; filtering happens after the data is loaded into memory
- Use Arrow-compatible JVMs as documented in
  [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility).
- Adjust `batchSize` if you read huge files and need to tune throughput vs. memory.
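
A minimal sketch of the column-selection tip (the path and column names are illustrative):

```kotlin
// Keep only the columns needed downstream so later operations
// work on a smaller DataFrame.
val sales = DataFrame.readParquet("data/sales.parquet")
    .select("region", "amount")
```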

### See also

- [](ApacheArrow.md) — reading and writing Arrow IPC formats
- [Parquet official site](https://parquet.apache.org/)
- Example: [Spark + Parquet + Kotlin DataFrame](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe)
- [](Data-Sources.md) — overview of all supported formats

docs/StardustDocs/topics/guides/Guide-for-backend-SQL-developers.md

Lines changed: 1 addition & 1 deletion
@@ -201,7 +201,7 @@ ORDER BY total DESC LIMIT 5;
 ```kotlin
 sales.filter { amount > 0 }
     .groupBy { region }
-    .aggregate { sum(amount).into("total") }
+    .aggregate { sum { amount } into "total" }
     .sortByDesc { total }
     .take(5)
 ```
samples/src/test/kotlin/org/jetbrains/kotlinx/dataframe/samples/io/Parquet.kt

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
```kotlin
package org.jetbrains.kotlinx.dataframe.samples.io

import io.kotest.matchers.shouldBe
import java.io.File
import java.nio.file.Paths
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.NullabilityOptions
import org.jetbrains.kotlinx.dataframe.io.readParquet
import org.jetbrains.kotlinx.dataframe.testParquet
import org.junit.Test

class Parquet {
    @Test
    fun readParquetURL() {
        val url = testParquet("sales")

        // SampleStart
        // Read from URLs
        val df = DataFrame.readParquet(url)
        // SampleEnd
        df.rowsCount() shouldBe 300
        df.columnsCount() shouldBe 20
    }

    @Test
    fun readParquetFilePath() {
        val url = testParquet("sales")
        val path = Paths.get(url.toURI())
        // SampleStart
        val df = DataFrame.readParquet(path)
        // SampleEnd
        df.rowsCount() shouldBe 300
        df.columnsCount() shouldBe 20
    }

    @Test
    fun readParquetFile() {
        val url = testParquet("sales")
        val file = File(url.toURI())

        // SampleStart
        // Read from File objects
        val df = DataFrame.readParquet(file)
        // SampleEnd
        df.rowsCount() shouldBe 300
        df.columnsCount() shouldBe 20
    }

    @Test
    fun readParquetFileWithParameters() {
        val url = testParquet("sales")
        val file = File(url.toURI())

        // SampleStart
        val df = DataFrame.readParquet(
            file,
            nullability = NullabilityOptions.Infer,
            batchSize = 64L * 1024
        )
        // SampleEnd
        df.rowsCount() shouldBe 300
        df.columnsCount() shouldBe 20
    }

    @Test
    fun readMultipleParquetFiles() {
        val url = testParquet("sales")
        val file = File(url.toURI())
        val file1 = File(url.toURI())
        val file2 = File(url.toURI())

        // SampleStart
        val df = DataFrame.readParquet(file, file1, file2)
        // SampleEnd
        df.rowsCount() shouldBe 900
        df.columnsCount() shouldBe 20
    }
}
```

samples/src/test/kotlin/org/jetbrains/kotlinx/dataframe/testResource.kt

Lines changed: 2 additions & 0 deletions
@@ -9,3 +9,5 @@ fun testCsv(csvName: String) = testResource("$csvName.csv")
 fun testJson(jsonName: String) = testResource("$jsonName.json")
 
 fun testArrowFeather(name: String) = testResource("$name.feather")
+
+fun testParquet(name: String) = testResource("$name.parquet")
sales.parquet test resource: 27.5 KB (binary file not shown)
