| 1 | +# Kotlin DataFrame for SQL & Backend Developers |
| 2 | + |
| 3 | +<web-summary> |
| 4 | +Quickly transition from SQL to Kotlin DataFrame: load your datasets, perform essential transformations, and visualize your results — directly within a Kotlin Notebook. |
| 5 | +</web-summary> |
| 6 | + |
| 7 | +<card-summary> |
| 8 | +Switching from SQL? Kotlin DataFrame makes it easy to load, process, analyze, and visualize your data — fully interactive and type-safe! |
| 9 | +</card-summary> |
| 10 | + |
| 11 | +<link-summary> |
| 12 | +Explore Kotlin DataFrame as a SQL or ORM user: read your data, transform columns, group or join tables, and build insightful visualizations with Kotlin Notebook. |
| 13 | +</link-summary> |
| 14 | + |
| 15 | +This guide helps Kotlin backend developers with SQL experience quickly adapt to **Kotlin DataFrame**, mapping familiar |
| 16 | +SQL and ORM operations to DataFrame concepts. |
| 17 | + |
| 18 | +If you plan to work on a Gradle project without a Kotlin Notebook, |
| 19 | +we recommend installing the library together with our [**experimental Kotlin compiler plugin**](Compiler-Plugin.md) (available since Kotlin 2.2.*). |
| 20 | +This plugin generates type-safe schemas at compile time, |
| 21 | +tracking schema changes throughout your data pipeline. |
| 22 | + |
| 23 | +## Add Kotlin DataFrame Gradle dependency |
| 24 | + |
| 25 | +For details on setting up the Gradle build, see the [Gradle Setup Guide](SetupGradle.md). |
| 26 | + |
| 27 | +In your Gradle build file (`build.gradle` or `build.gradle.kts`), add the Kotlin DataFrame library as a dependency: |
| 28 | + |
| 29 | +<tabs> |
| 30 | +<tab title="Kotlin DSL"> |
| 31 | + |
| 32 | +```kotlin |
| 33 | +dependencies { |
| 34 | + implementation("org.jetbrains.kotlinx:dataframe:%dataFrameVersion%") |
| 35 | +} |
| 36 | +``` |
| 37 | + |
| 38 | +</tab> |
| 39 | + |
| 40 | +<tab title="Groovy DSL"> |
| 41 | + |
| 42 | +```groovy |
| 43 | +dependencies { |
| 44 | + implementation 'org.jetbrains.kotlinx:dataframe:%dataFrameVersion%' |
| 45 | +} |
| 46 | +``` |
| 47 | + |
| 48 | +</tab> |
| 49 | +</tabs> |
| 50 | + |
| 51 | +--- |
| 52 | + |
| 53 | +## 1. What is a dataframe? |
| 54 | + |
| 55 | +If you’re used to SQL, a **dataframe** is conceptually like a **table**: |
| 56 | + |
| 57 | +- **Rows**: ordered records of data |
| 58 | +- **Columns**: named, typed fields |
| 59 | +- **Schema**: a mapping of column names to types |
| 60 | + |
| 61 | +Kotlin DataFrame also supports [**hierarchical, JSON-like data**](hierarchical.md) — |
| 62 | +columns can contain *[nested dataframes](DataColumn.md#framecolumn)* or *column groups*, |
| 63 | +allowing you to represent and transform tree-like structures without flattening. |
| 64 | + |
| 65 | +Unlike a relational DB table: |
| 66 | + |
| 67 | +- A DataFrame object **lives in memory** — there’s no storage engine or transaction log |
| 68 | +- It’s **immutable** — each operation produces a *new* DataFrame |
| 69 | +- There is **no concept of foreign keys or relations** between DataFrames |
| 70 | +- It can be created from |
| 71 | + *any* [source](Data-Sources.md): [CSV](CSV-TSV.md), [JSON](JSON.md), [SQL tables](SQL.md), [Apache Arrow](ApacheArrow.md), |
| 72 | + in-memory objects |
| 73 | + |
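The in-memory and immutable points above can be sketched in a few lines. This is a minimal illustration rather than part of the guide's own examples: the `samplePeople` helper and its data are hypothetical, and columns are accessed through the String API so the snippet does not depend on a notebook or the compiler plugin.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data; any in-memory source would do.
fun samplePeople(): DataFrame<*> =
    dataFrameOf("name", "age")(
        "Alice", 31,
        "Bob", 17,
    )

fun main() {
    val people = samplePeople()
    // filter returns a new DataFrame; `people` itself is not modified.
    val adults = people.filter { "age"<Int>() >= 18 }
    println(people.rowsCount()) // 2
    println(adults.rowsCount()) // 1
}
```

Because every operation returns a new object, intermediate results can be kept and reused without defensive copies.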
| 74 | +--- |
| 75 | + |
| 76 | +## 2. Reading Data From SQL |
| 77 | + |
| 78 | +Kotlin DataFrame integrates with JDBC, so you can bring SQL data into memory for analysis. |
| 79 | + |
| 80 | +| Approach | Example | |
| 81 | +|----------------------------------|---------------------------------------------------------------------| |
| 82 | +| **From a table** | `val df = DataFrame.readSqlTable(dbConfig, "customers")` | |
| 83 | +| **From a SQL query** | `val df = DataFrame.readSqlQuery(dbConfig, "SELECT * FROM orders")` | |
| 84 | +| **From a JDBC Connection** | `val df = connection.readDataFrame("SELECT * FROM orders")` | |
| 85 | +| **From a ResultSet (extension)** | `val df = resultSet.readDataFrame(connection)` | |
| 86 | + |
| 87 | +```kotlin |
| 88 | +import org.jetbrains.kotlinx.dataframe.DataFrame |
| 89 | +import org.jetbrains.kotlinx.dataframe.io.* // DbConnectionConfig, readSqlTable, readSqlQuery, readDataFrame |
| 90 | +val dbConfig = DbConnectionConfig( |
| 91 | + url = "jdbc:postgresql://localhost:5432/mydb", |
| 92 | + user = "postgres", |
| 93 | + password = "secret" |
| 94 | +) |
| 95 | + |
| 96 | +// Table |
| 97 | +val customers = DataFrame.readSqlTable(dbConfig, "customers") |
| 98 | + |
| 99 | +// Query |
| 100 | +val salesByRegion = DataFrame.readSqlQuery( |
| 101 | + dbConfig, """ |
| 102 | + SELECT region, SUM(amount) AS total |
| 103 | + FROM sales |
| 104 | + GROUP BY region |
| 105 | +""" |
| 106 | +) |
| 107 | + |
| 108 | +// From a JDBC connection (`connection` is an already opened java.sql.Connection) |
| 109 | +val orders = connection.readDataFrame("SELECT * FROM orders") |
| 110 | + |
| 111 | +// From a ResultSet |
| 112 | +val rs = connection.createStatement().executeQuery("SELECT * FROM orders") |
| 113 | +val ordersFromResultSet = rs.readDataFrame(connection) |
| 114 | +``` |
| 115 | + |
| 116 | +For more details, see the [guide to reading from SQL databases](readSqlDatabases.md). |
| 117 | + |
| 118 | +## 3. Why It’s Not an ORM |
| 119 | + |
| 120 | +Frameworks like **[Hibernate](https://hibernate.org/orm/)** or **[Exposed](https://github.com/JetBrains/Exposed)**: |
| 121 | + |
| 122 | +- Map DB tables to Kotlin objects (entities) |
| 123 | +- Track object changes and sync them back to the database |
| 124 | +- Focus on **persistence** and **transactions** |
| 125 | + |
| 126 | +Kotlin DataFrame: |
| 127 | + |
| 128 | +- Has no persistence layer |
| 129 | +- Doesn’t try to map rows to mutable entities |
| 130 | +- Focuses on **in-memory analytics**, **transformations**, and **type-safe pipelines** |
| 131 | +- The **main idea** is that the schema *changes together with your transformations*, and the |
| 132 | + [**Compiler Plugin**](Compiler-Plugin.md) updates the type-safe API automatically under the hood. |
| 133 | + - You don’t have to manually define or recreate schemas every time — the plugin infers them dynamically from the data or |
| 134 | + transformations. |
| 135 | +- By contrast, in ORMs the mapping layer is **frozen**: schema changes require manual model edits and migrations. |
| 136 | + |
| 137 | +Think of Kotlin DataFrame as a **data analysis/ETL tool**, not an ORM. |
| 138 | + |
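To make "the schema changes together with your transformations" concrete, here is a small sketch. The `withProfit` helper and the sample rows are hypothetical. The String API is used so the snippet stays self-contained; with extension properties or the compiler plugin, the new column would be reachable as a typed property instead of by name.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: derives a new column, which also extends the schema.
fun withProfit(sales: DataFrame<*>): DataFrame<*> =
    sales.add("profit") { "revenue"<Int>() - "cost"<Int>() }

fun main() {
    // Made-up sample data for illustration only.
    val sales = dataFrameOf("region", "revenue", "cost")(
        "North", 100, 60,
        "South", 250, 200,
    )
    val enriched = withProfit(sales)
    println(sales.columnNames())    // [region, revenue, cost]
    println(enriched.columnNames()) // [region, revenue, cost, profit]
}
```

No entity class or migration had to be edited: the extra column exists only on the derived dataframe.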
| 139 | +--- |
| 140 | + |
| 141 | +## 4. Key Differences from SQL & ORMs |
| 142 | + |
| 143 | +| Feature / Concept | SQL Databases (PostgreSQL, MySQL…) | ORM (Hibernate, Exposed…) | Kotlin DataFrame | |
| 144 | +|----------------------------|------------------------------------|------------------------------------|---------------------------------------------------------------------| |
| 145 | +| **Storage** | Persistent | Persistent | In-memory only | |
| 146 | +| **Schema definition** | `CREATE TABLE` DDL | Defined in entity classes | Derived from data or transformations or defined manually | |
| 147 | +| **Schema change** | `ALTER TABLE` | Manual migration of entity classes | Automatic via transformations + Compiler Plugin or defined manually | |
| 148 | +| **Relations** | Foreign keys | Mapped via annotations | Not applicable | |
| 149 | +| **Transactions** | Yes | Yes | Not applicable | |
| 150 | +| **DB Indexes** | Yes | Yes (via DB) | Not applicable | |
| 151 | +| **Data manipulation** | SQL DML (`INSERT`, `UPDATE`) | CRUD mapped to DB | Transformations only (immutable) | |
| 152 | +| **Joins** | `JOIN` keyword | Eager/lazy loading | [`.join()` / `.leftJoin()` DSL](join.md) | |
| 153 | +| **Grouping & aggregation** | `GROUP BY` | DB query with groupBy | [`.groupBy().aggregate()`](groupBy.md) | |
| 154 | +| **Filtering** | `WHERE` | Criteria API / query DSL | [`.filter { ... }`](filter.md) | |
| 155 | +| **Permissions** | `GRANT` / `REVOKE` | DB-level permissions | Not applicable | |
| 156 | +| **Execution** | On DB engine | On DB engine | In JVM process | |
| 157 | + |
| 158 | +--- |
| 159 | + |
| 160 | +## 5. SQL → Kotlin DataFrame Cheatsheet |
| 161 | + |
| 162 | +### DDL Analogues |
| 163 | + |
| 164 | +| SQL DDL Command / Example | Kotlin DataFrame Equivalent | |
| 165 | +|---------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------| |
| 166 | +| **Create table:**<br>`CREATE TABLE person (name text, age int);` | `@DataSchema`<br>`interface Person {`<br>` val name: String`<br>` val age: Int`<br>`}` | |
| 167 | +| **Add column:**<br>`ALTER TABLE sales ADD COLUMN profit numeric GENERATED ALWAYS AS (revenue - cost) STORED;` | `.add("profit") { revenue - cost }` | |
| 168 | +| **Rename column:**<br>`ALTER TABLE sales RENAME COLUMN old_name TO new_name;` | `.rename { old_name }.into("new_name")` | |
| 169 | +| **Drop column:**<br>`ALTER TABLE sales DROP COLUMN old_col;` | `.remove { old_col }` | |
| 170 | +| **Modify column type:**<br>`ALTER TABLE sales ALTER COLUMN amount TYPE numeric;` | `.convert { amount }.to<Double>()` | |
| 171 | + |
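The DDL analogues in the table can be chained into a single pipeline. The sketch below is an assumption-laden illustration: the `migrate` helper and the sample data are hypothetical, and the String API overloads stand in for the extension-property calls shown in the table.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper chaining three of the DDL analogues from the table.
fun migrate(sales: DataFrame<*>): DataFrame<*> =
    sales
        .rename("old_name").into("new_name") // ALTER TABLE ... RENAME COLUMN
        .remove("old_col")                   // ALTER TABLE ... DROP COLUMN
        .convert("amount").to<Double>()      // ALTER TABLE ... ALTER COLUMN ... TYPE

fun main() {
    // Made-up sample data for illustration only.
    val sales = dataFrameOf("old_name", "old_col", "amount")(
        "a", true, 10,
        "b", false, 20,
    )
    val migrated = migrate(sales)
    println(migrated.columnNames()) // [new_name, amount]
}
```

Unlike `ALTER TABLE`, none of these steps touches the source: `sales` keeps its original columns and types.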
| 172 | +--- |
| 173 | + |
| 174 | +### DML Analogues |
| 175 | + |
| 176 | +| SQL DML Command / Example | Kotlin DataFrame Equivalent | |
| 177 | +|--------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------| |
| 178 | +| `SELECT col1, col2` | `df.select { col1 and col2 }` | |
| 179 | +| `WHERE amount > 100` | `df.filter { amount > 100 }` | |
| 180 | +| `ORDER BY amount DESC` | `df.sortByDesc { amount }` | |
| 181 | +| `GROUP BY region` | `df.groupBy { region }` | |
| 182 | +| `SUM(amount)` | `.aggregate { sum { amount } }` | |
| 183 | +| `JOIN` | `.join(otherDf) { id match right.id }` | |
| 184 | +| `LIMIT 5` | `.take(5)` | |
| 185 | +| **Pivot:** <br>`SELECT * FROM crosstab('SELECT region, year, SUM(amount) FROM sales GROUP BY region, year') AS ct(region text, y2023 int, y2024 int);` | `.groupBy { region }.pivot { year }.sum { amount }` | |
| 186 | +| **Explode array column:** <br>`SELECT id, unnest(tags) AS tag FROM products;` | `.explode { tags }` | |
| 187 | +| **Update column:** <br>`UPDATE sales SET amount = amount * 1.2;` | `.update { amount }.with { it * 1.2 }` | |
| 188 | + |
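As a rough illustration of the `JOIN` row, the sketch below joins two small dataframes on a shared key column. The `ordersWithNames` helper, the column names, and the data are all hypothetical. It uses the overload that takes column names, which assumes the key has the same name on both sides; the `match` DSL from the table covers the case where the names differ.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: inner join on a key column present in both dataframes.
fun ordersWithNames(customers: DataFrame<*>, orders: DataFrame<*>): DataFrame<*> =
    customers.join(orders, "customerId")

fun main() {
    // Made-up sample data for illustration only.
    val customers = dataFrameOf("customerId", "name")(
        1, "Alice",
        2, "Bob",
    )
    val orders = dataFrameOf("customerId", "amount")(
        1, 120,
        1, 80,
        2, 200,
    )
    val joined = ordersWithNames(customers, orders)
    println(joined.columnNames()) // [customerId, name, amount]
    println(joined.rowsCount())   // 3
}
```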
| 189 | +## 6. Example: SQL vs. DataFrame Side-by-Side |
| 190 | + |
| 191 | +**SQL (PostgreSQL):** |
| 192 | + |
| 193 | +```sql |
| 194 | +SELECT region, SUM(amount) AS total |
| 195 | +FROM sales |
| 196 | +WHERE amount > 0 |
| 197 | +GROUP BY region |
| 198 | +ORDER BY total DESC LIMIT 5; |
| 199 | +``` |
| 200 | + |
| 201 | +```kotlin |
| 202 | +sales.filter { amount > 0 } |
| 203 | + .groupBy { region } |
| 204 | + .aggregate { sum { amount } into "total" } |
| 205 | + .sortByDesc { total } |
| 206 | + .take(5) |
| 207 | +``` |
| 208 | + |
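The Kotlin pipeline above relies on generated column accessors (`amount`, `region`, `total`), which are available in a Kotlin Notebook or with the compiler plugin. For a plain script, a self-contained variant might look like the sketch below. It assumes the String API, and the `topRegions` helper and the sample rows are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper mirroring the SQL query: WHERE, GROUP BY, SUM, ORDER BY ... DESC, LIMIT.
fun topRegions(sales: DataFrame<*>, limit: Int = 5): DataFrame<*> =
    sales
        .filter { "amount"<Int>() > 0 }
        .groupBy("region")
        .aggregate { sum { "amount"<Int>() } into "total" }
        .sortByDesc("total")
        .take(limit)

fun main() {
    // Made-up sample data for illustration only.
    val sales = dataFrameOf("region", "amount")(
        "North", 100,
        "South", 250,
        "North", 50,
        "East", -20,
    )
    val top = topRegions(sales)
    println(top["region"].toList()) // [South, North]
}
```

The `East` row is dropped by the filter before grouping, just as the `WHERE` clause removes it before `GROUP BY` in the SQL version.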
| 209 | +## In Conclusion |
| 210 | + |
| 211 | +- Kotlin DataFrame keeps the familiar SQL-style workflow (select → filter → group → aggregate) but makes it |
| 212 | + **type-safe** and fully integrated into Kotlin. |
| 213 | +- The main focus is **readability** and schema change safety via |
| 214 | + the [Compiler Plugin](Compiler-Plugin.md). |
| 215 | +- It is neither a database nor an ORM: the Kotlin DataFrame library does not store data or manage transactions, but works as an in-memory |
| 216 | + layer for analytics and transformations. |
| 217 | +- It does not provide some SQL features (permissions, transactions, indexes) — but offers convenient tools for working |
| 218 | + with JSON-like structures and combining multiple data sources. |
| 219 | +- Use Kotlin DataFrame as a **type-safe DSL** for post-processing, merging data sources, and analytics directly on the |
| 220 | + JVM, while keeping your code easily refactorable and IDE-assisted. |
| 221 | +- Use Kotlin DataFrame for small and medium-sized datasets; for large datasets, consider using a |
| 222 | + **more performant** database engine instead. |
| 223 | + |
| 224 | +## What's Next? |
| 225 | + |
| 226 | +If you're ready to go through a complete example, we recommend our **[Quickstart Guide](quickstart.md)**: |
| 227 | +you'll learn the basics of reading data, transforming it, and creating visualizations step by step. |
| 228 | + |
| 229 | +Ready to go deeper? Check out what’s next: |
| 230 | + |
| 231 | +- 📘 **[Explore in-depth guides and various examples](Guides-And-Examples.md)** with different datasets, |
| 232 | + API usage examples, and practical scenarios that help you understand the main features of Kotlin DataFrame. |
| 233 | + |
| 234 | +- 🛠️ **[Browse the operations overview](operations.md)** to learn what Kotlin DataFrame can do. |
| 235 | + |
| 236 | +- 🧠 **Understand the design** and core concepts in the [library overview](concepts.md). |
| 237 | + |
| 238 | +- 🔤 **[Learn more about Extension Properties](extensionPropertiesApi.md)** |
| 239 | + and make working with your data both convenient and type-safe. |
| 240 | + |
| 241 | +- 💡 **[Use Kotlin DataFrame Compiler Plugin](Compiler-Plugin.md)** |
| 242 | + for auto-generated column access in your IntelliJ IDEA projects. |
| 243 | + |
| 244 | +- 📊 **Master Kandy** for stunning and expressive DataFrame visualizations |
| 245 | + with the [Kandy documentation](https://kotlin.github.io/kandy). |