
Commit 3c01503

Add Parquet documentation (#1448)
* Add Parquet documentation with Apache Arrow integration details. Added a new topic on reading Parquet files via Apache Arrow to the documentation. Updated the data sources index and table of contents to include the Parquet guide. The documentation covers method overloads, examples, requirements, performance tips, and current limitations.
* Rewrite a Guide
* Add examples and tests for reading Parquet files via various input types
1 parent e8cb692 commit 3c01503

File tree

8 files changed: +247 −1 lines changed

docs/StardustDocs/d.tree

Lines changed: 1 addition & 0 deletions
@@ -209,6 +209,7 @@
 <toc-element topic="CSV-TSV.md"/>
 <toc-element topic="Excel.md"/>
 <toc-element topic="ApacheArrow.md"/>
+<toc-element topic="Parquet.md"/>
 <toc-element topic="SQL.md">
     <toc-element topic="PostgreSQL.md"/>
     <toc-element topic="MySQL.md"/>

docs/StardustDocs/topics/dataSources/ApacheArrow.md

Lines changed: 4 additions & 0 deletions
@@ -24,6 +24,10 @@ and in [`%use dataframe`](SetupKotlinNotebook.md#integrate-kotlin-dataframe) for
 > when using Java 9+.
 > {style="warning"}
 
+> Structured (nested) Arrow types such as Struct are not yet supported in Kotlin DataFrame.
+> See the issue: [Add inner / Struct type support in Arrow](https://github.com/Kotlin/dataframe/issues/536)
+> {style="warning"}
+
 ## Read
 
 [`DataFrame`](DataFrame.md) supports both the

docs/StardustDocs/topics/dataSources/Data-Sources.md

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ Below you'll find a list of supported sources along with instructions on how to
 - [CSV / TSV](CSV-TSV.md)
 - [Excel](Excel.md)
 - [Apache Arrow](ApacheArrow.md)
+- [Parquet](Parquet.md)
 - [SQL](SQL.md):
   - [PostgreSQL](PostgreSQL.md)
   - [MySQL](MySQL.md)
docs/StardustDocs/topics/dataSources/Parquet.md

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
# Parquet

<web-summary>
Read Parquet files via Apache Arrow in Kotlin DataFrame — high‑performance columnar storage for analytics.
</web-summary>

<card-summary>
Use Kotlin DataFrame to read Parquet datasets using Apache Arrow for fast, typed, columnar I/O.
</card-summary>

<link-summary>
Kotlin DataFrame can read Parquet files through Apache Arrow’s Dataset API. Learn how and when to use it.
</link-summary>

Kotlin DataFrame supports reading [Apache Parquet](https://parquet.apache.org/) files through the Apache Arrow integration.

This requires the [`dataframe-arrow` module](Modules.md#dataframe-arrow), which is included by default in the general [`dataframe`](Modules.md#dataframe-general) artifact and when using `%use dataframe` in Kotlin Notebook.

> Kotlin DataFrame currently supports only *reading* Parquet via Apache Arrow; writing Parquet is not supported.
> {style="note"}

> Apache Arrow is not supported on Android, so reading Parquet files is not available on Android.
> {style="warning"}

> Structured (nested) Arrow types such as Struct are not yet supported in Kotlin DataFrame.
> See the issue: [Add inner / Struct type support in Arrow](https://github.com/Kotlin/dataframe/issues/536)
> {style="warning"}

## Reading Parquet Files

Kotlin DataFrame provides four `readParquet()` overloads that read from different source types.
All overloads accept optional `nullability` inference settings and a `batchSize` for Arrow scanning.

```kotlin
// 1) URLs
public fun DataFrame.Companion.readParquet(
    vararg urls: URL,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 2) Strings (interpreted as file paths or URLs, e.g., "data/file.parquet", "file://", or "http(s)://")
public fun DataFrame.Companion.readParquet(
    vararg strUrls: String,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 3) Paths
public fun DataFrame.Companion.readParquet(
    vararg paths: Path,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 4) Files
public fun DataFrame.Companion.readParquet(
    vararg files: File,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame
```

These overloads are defined in the `dataframe-arrow` module and internally use `FileFormat.PARQUET` from Apache Arrow’s
Dataset API to scan the data and materialize it as a Kotlin `DataFrame`.

### Examples

```kotlin
// Read from file paths (as strings)
val df1 = DataFrame.readParquet("data/sales.parquet")
```
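
Because string arguments may also be URLs, the same overload can read a remote file (the URL below is purely illustrative):

```kotlin
// Hypothetical remote dataset; any reachable http(s) URL
// pointing at a Parquet file works the same way.
val remote = DataFrame.readParquet("https://example.org/data/sales.parquet")
```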

<!---FUN readParquetFilePath-->

```kotlin
// Read from Path objects
val path = Paths.get("data/sales.parquet")
val df = DataFrame.readParquet(path)
```

<!---END-->

<!---FUN readParquetURL-->

```kotlin
// Read from URLs
val df = DataFrame.readParquet(url)
```

<!---END-->

<!---FUN readParquetFile-->

```kotlin
// Read from File objects
val file = File("data/sales.parquet")
val df = DataFrame.readParquet(file)
```

<!---END-->

<!---FUN readParquetFileWithParameters-->

```kotlin
// Read with explicit nullability and batch-size settings
val file = File("data/sales.parquet")

val df = DataFrame.readParquet(
    file,
    nullability = NullabilityOptions.Infer,
    batchSize = 64L * 1024
)
```

<!---END-->
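
The `nullability` parameter controls how column nullability is resolved when reading. Assuming the same `NullabilityOptions` values as the Arrow reader (`Infer`, `Checking`, `Widening`; see [](ApacheArrow.md)), a stricter read might look like this sketch (the file name is illustrative):

```kotlin
// Sketch: NullabilityOptions.Checking is assumed to behave as in the Arrow reader,
// failing if the data contradicts the nullability declared in the file's schema.
val strict = DataFrame.readParquet(
    File("data/sales.parquet"),
    nullability = NullabilityOptions.Checking,
)
```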

If you want to see a complete, realistic data‑engineering example using Spark and Parquet with Kotlin DataFrame,
check out the [example project](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe).

### Multiple Files

It's possible to read multiple Parquet files:

<!---FUN readMultipleParquetFiles-->

```kotlin
val file = File("data/sales.parquet")
val file1 = File("data/sales1.parquet")
val file2 = File("data/sales2.parquet")

val df = DataFrame.readParquet(file, file1, file2)
```

<!---END-->

**Requirements:**

- All files must have compatible schemas
- Files are vertically concatenated (union of rows)
- Column types must match exactly
- Missing columns in some files will result in null values
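
For example, to read every Parquet file in a directory (a sketch; the `data/` directory and its file layout are hypothetical), you can collect the files and pass them to the `vararg` overload:

```kotlin
// Gather all Parquet files in a directory and read them as one
// vertically concatenated DataFrame (schemas must be compatible).
val parquetFiles = File("data")
    .listFiles { f -> f.extension == "parquet" }
    .orEmpty()

val combined = DataFrame.readParquet(*parquetFiles)
```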

### Performance tips

- **Column selection**: `readParquet` reads all columns, so apply operations like `select()` immediately after reading to reduce memory usage in later operations (see the sketch after this list)
- **Predicate pushdown**: Currently not supported; filtering happens after the data is loaded into memory
- Use Arrow-compatible JVMs as documented in
  [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility).
- Adjust `batchSize` if you read huge files and need to tune throughput vs. memory.
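
A minimal sketch of the column-selection tip (the path and column names are illustrative):

```kotlin
// Keep only the columns needed downstream so later operations
// work on a smaller DataFrame.
val sales = DataFrame.readParquet("data/sales.parquet")
    .select("region", "amount")
```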

### See also

- [](ApacheArrow.md) — reading and writing Arrow IPC formats
- [Parquet official site](https://parquet.apache.org/)
- Example: [Spark + Parquet + Kotlin DataFrame](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe)
- [](Data-Sources.md) — overview of all supported formats

docs/StardustDocs/topics/guides/Guide-for-backend-SQL-developers.md

Lines changed: 1 addition & 1 deletion
@@ -201,7 +201,7 @@ ORDER BY total DESC LIMIT 5;
 ```kotlin
 sales.filter { amount > 0 }
     .groupBy { region }
-    .aggregate { sum(amount).into("total") }
+    .aggregate { sum { amount } into "total" }
     .sortByDesc { total }
     .take(5)
 ```
samples/src/test/kotlin/org/jetbrains/kotlinx/dataframe/samples/io/Parquet.kt

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
```kotlin
package org.jetbrains.kotlinx.dataframe.samples.io

import io.kotest.matchers.shouldBe
import java.io.File
import java.nio.file.Paths
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.NullabilityOptions
import org.jetbrains.kotlinx.dataframe.io.readParquet
import org.jetbrains.kotlinx.dataframe.testParquet
import org.junit.Test

class Parquet {
    @Test
    fun readParquetURL() {
        val url = testParquet("sales")

        // SampleStart
        // Read from URLs
        val df = DataFrame.readParquet(url)
        // SampleEnd
        df.rowsCount() shouldBe 300
        df.columnsCount() shouldBe 20
    }

    @Test
    fun readParquetFilePath() {
        val url = testParquet("sales")
        val path = Paths.get(url.toURI())
        // SampleStart
        val df = DataFrame.readParquet(path)
        // SampleEnd
        df.rowsCount() shouldBe 300
        df.columnsCount() shouldBe 20
    }

    @Test
    fun readParquetFile() {
        val url = testParquet("sales")
        val file = File(url.toURI())

        // SampleStart
        // Read from File objects
        val df = DataFrame.readParquet(file)
        // SampleEnd
        df.rowsCount() shouldBe 300
        df.columnsCount() shouldBe 20
    }

    @Test
    fun readParquetFileWithParameters() {
        val url = testParquet("sales")
        val file = File(url.toURI())

        // SampleStart
        val df = DataFrame.readParquet(
            file,
            nullability = NullabilityOptions.Infer,
            batchSize = 64L * 1024
        )
        // SampleEnd
        df.rowsCount() shouldBe 300
        df.columnsCount() shouldBe 20
    }

    @Test
    fun readMultipleParquetFiles() {
        val url = testParquet("sales")
        val file = File(url.toURI())
        val file1 = File(url.toURI())
        val file2 = File(url.toURI())

        // SampleStart
        val df = DataFrame.readParquet(file, file1, file2)
        // SampleEnd
        df.rowsCount() shouldBe 900
        df.columnsCount() shouldBe 20
    }
}
```

samples/src/test/kotlin/org/jetbrains/kotlinx/dataframe/testResource.kt

Lines changed: 2 additions & 0 deletions
@@ -9,3 +9,5 @@ fun testCsv(csvName: String) = testResource("$csvName.csv")
 fun testJson(jsonName: String) = testResource("$jsonName.json")
 
 fun testArrowFeather(name: String) = testResource("$name.feather")
+
+fun testParquet(name: String) = testResource("$name.parquet")
sales.parquet test resource: 27.5 KB (binary file not shown)
