---
tags:
  - Enterprise Option
displayed_sidebar: docsEnglish
---

# Getting Started with ScalarDB Analytics

import WarningLicenseKeyContact from "/src/components/en-us/_warning-license-key-contact.mdx";

This getting-started tutorial explains how to set up ScalarDB Analytics and run federated queries across different databases, including PostgreSQL, MySQL, and Cassandra. For an overview of ScalarDB Analytics and its key benefits, refer to the [ScalarDB Overview](../overview.mdx) and [ScalarDB Design](../design.mdx) pages.

## What you'll build

In this tutorial, you'll set up a sample e-commerce analytics environment where:

- Customer data resides in PostgreSQL
- Order data is managed by ScalarDB in MySQL
- Line item details are stored in Cassandra and updated through ScalarDB transactions

You'll run analytical queries that join data across all three databases to gain business insights. The source code is available at [https://github.com/scalar-labs/scalardb-samples/tree/main/scalardb-analytics-sample](https://github.com/scalar-labs/scalardb-samples/tree/main/scalardb-analytics-sample).

## Prerequisites

- [Docker](https://www.docker.com/get-started/) 20.10 or later with [Docker Compose](https://docs.docker.com/compose/install/) V2 or later

<WarningLicenseKeyContact product="ScalarDB Analytics" />

## Step 1: Set up the environment

This section describes how to set up a ScalarDB Analytics environment.

### Clone the repository

Open **Terminal**, and clone the ScalarDB samples repository:

```console
git clone https://github.com/scalar-labs/scalardb-samples
cd scalardb-samples/scalardb-analytics-sample
```

### Configure your license

To add your ScalarDB Analytics license, open `config/scalardb-analytics-server.properties`. Then, uncomment and update the license configuration lines, replacing `<YOUR_LICENSE_KEY>` and `<YOUR_LICENSE_CERT_PEM>` with your actual license information:

```properties
# License configuration (required for production)
scalar.db.analytics.server.licensing.license_key=<YOUR_LICENSE_KEY>
scalar.db.analytics.server.licensing.license_check_cert_pem=<YOUR_LICENSE_CERT_PEM>
```

## Step 2: Set up the sample databases

To set up the sample databases, run the following command:

```console
docker compose up -d --wait
```

This command starts the following services locally:

- **ScalarDB Analytics components:**
  - **ScalarDB Analytics server:** Manages metadata about all data sources and provides a unified interface for querying.
- **Sample databases:**
  - **PostgreSQL:** Used as a non-ScalarDB-managed database (accessed directly)
  - **Cassandra and MySQL:** Used as ScalarDB-managed databases (accessed through ScalarDB's transaction layer)
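
The `--wait` flag makes `docker compose up` return only after the containers report a running or healthy state. If you want to check the status of these services yourself at any point, you can use the standard Docker Compose status command:

```console
docker compose ps
```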

In this guide, PostgreSQL is referred to as a **non-ScalarDB-managed database** because it is not managed by ScalarDB transactions, while Cassandra and MySQL are referred to as **ScalarDB-managed databases** because they are managed by ScalarDB transactions.

The sample data is automatically loaded into all databases during the initial setup. After completing the setup, the following tables should be available:

- In PostgreSQL:
  - `sample_ns.customer`
- In ScalarDB (backed by Cassandra):
  - `cassandrans.lineitem`
- In ScalarDB (backed by MySQL):
  - `mysqlns.orders`
As shown above, within ScalarDB, the `cassandrans` and `mysqlns` namespaces are mapped to Cassandra and MySQL, respectively.
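
This namespace-to-storage mapping comes from the sample's ScalarDB configuration. As a rough sketch only (the property values below are illustrative and may not match the sample's actual configuration file), a ScalarDB multi-storage setup typically expresses such a mapping like this:

```properties
# Illustrative sketch, not the sample's actual configuration.
# Use the multi-storage adapter so that one ScalarDB instance spans several databases.
scalar.db.storage=multi-storage
scalar.db.multi_storage.storages=cassandra,mysql

# Map each ScalarDB namespace to the storage that physically holds its tables.
scalar.db.multi_storage.namespace_mapping=cassandrans:cassandra,mysqlns:mysql
scalar.db.multi_storage.default_storage=mysql
```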

For details about the table schema, including column definitions and data types, refer to [Schema details](#schema-details). Ensure that the sample data has been successfully loaded into these tables.

## Step 3: Register data sources by using the ScalarDB Analytics CLI

Before running analytical queries, you need to register the data sources with the ScalarDB Analytics server. You can do this by using the ScalarDB Analytics CLI.

### Create a catalog

First, create a new catalog to organize your data sources:

```console
docker compose run --rm scalardb-analytics-cli catalog create --catalog sample_catalog
```

### Register ScalarDB as a data source

Register the ScalarDB-managed databases:

```console
docker compose run --rm scalardb-analytics-cli data-source register \
  --data-source-json /config/data-sources/scalardb.json
```

This registers tables from both Cassandra and MySQL, which are managed by ScalarDB.

### Register PostgreSQL as a data source

Register the PostgreSQL database:

```console
docker compose run --rm scalardb-analytics-cli data-source register \
  --data-source-json /config/data-sources/postgres.json
```

## Step 4: Launch the Spark SQL console

To launch the Spark SQL console, run the following command:

```console
docker compose run --rm spark-sql
```

When the Spark SQL console launches, the ScalarDB Analytics catalog is initialized with the configuration in `spark-defaults.conf` and is registered as a Spark catalog named `sample_catalog`.
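
Under the hood, this uses Spark's standard catalog plugin mechanism: a `spark.sql.catalog.<name>` entry selects the catalog implementation, and any `spark.sql.catalog.<name>.*` entries are passed to that implementation as options. The sketch below only illustrates that pattern; the implementation class and option names are placeholders and assumptions, so refer to the sample's actual `spark-defaults.conf` for the real values:

```properties
# Illustrative pattern only; the class and option names below are placeholders, not the sample's actual settings.
# Register a Spark catalog named "sample_catalog" backed by the ScalarDB Analytics catalog implementation.
spark.sql.catalog.sample_catalog  <SCALARDB_ANALYTICS_CATALOG_IMPLEMENTATION_CLASS>
# Entries of the form spark.sql.catalog.sample_catalog.* are handed to that implementation as options,
# for example to point it at the ScalarDB Analytics server started by Docker Compose.
spark.sql.catalog.sample_catalog.server.host  <ANALYTICS_SERVER_HOST>
```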

### Namespace mapping

The following tables in the configured data sources are mapped to Spark SQL tables, allowing seamless querying across different data sources (you can verify the mapping by using the standard Spark SQL commands shown after this list):

- For PostgreSQL:
  - `sample_catalog.postgres.sample_ns.customer`
- For ScalarDB (backed by Cassandra):
  - `sample_catalog.scalardb.cassandrans.lineitem`
- For ScalarDB (backed by MySQL):
  - `sample_catalog.scalardb.mysqlns.orders`
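
Assuming the catalog initialized successfully, the following standard Spark SQL commands should let you confirm these mappings before running the analytical queries:

```sql
-- List the namespaces exposed by the ScalarDB Analytics catalog.
SHOW NAMESPACES IN sample_catalog;

-- List the tables mapped under the ScalarDB-backed Cassandra namespace.
SHOW TABLES IN sample_catalog.scalardb.cassandrans;

-- Peek at a few customer rows from the PostgreSQL-backed table.
SELECT * FROM sample_catalog.postgres.sample_ns.customer LIMIT 5;
```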

## Step 5: Run analytical queries

Now that you've set up your ScalarDB Analytics environment, you can run analytical queries on the sample data by using the Spark SQL console.

### Query 1: Analyze shipping performance and returns

The SQL query below demonstrates basic analytical capabilities by examining line item data from Cassandra. The query helps answer business questions like:

- What percentage of items are returned versus shipped successfully?
- What's the financial impact of returns?
- How does pricing vary between different order statuses?

The query calculates key metrics grouped by return status and line status:

```sql
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    sample_catalog.scalardb.cassandrans.lineitem
WHERE
    to_date(l_shipdate, 'yyyy-MM-dd') <= date '1998-12-01' - 3
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
```

You should see the following output:

```console
A F 1519 2374824.6560278563 1387364.2207725341 1962763.4654265852 26.649122807017545 41663.590456629056 0.41501802923479575 57
N F 98 146371.2295412012 85593.96776336085 121041.55837332775 32.666666666666664 48790.409847067065 0.40984706454007996 3
N O 5374 8007373.247086477 4685647.785126835 6624210.945739046 24.427272727272726 36397.15112312035 0.4147594809559689 220
R F 1461 2190869.9676265526 1284178.4378283697 1814151.2807494882 25.189655172413794 37773.62013149229 0.41323493790730753 58
```

### Query 2: Cross-database analysis for revenue optimization

The following SQL query showcases the key capability of ScalarDB Analytics: joining data across different databases without data movement. Specifically, this query joins the customer table in PostgreSQL, the orders table in MySQL, and the line item table in Cassandra without requiring data movement through, for example, ETL pipelines. The query helps answer business questions like:

- To prioritize fulfillment, what are the high-value orders from specific customer segments that haven't shipped yet?

The query finds unshipped orders placed by customers in the AUTOMOBILE segment, ranked by revenue:

```sql
SELECT
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate,
    o_shippriority
FROM
    sample_catalog.postgres.sample_ns.customer,
    sample_catalog.scalardb.mysqlns.orders,
    sample_catalog.scalardb.cassandrans.lineitem
WHERE
    c_mktsegment = 'AUTOMOBILE'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < '1995-03-15'
    AND l_shipdate > '1995-03-15'
GROUP BY
    l_orderkey,
    o_orderdate,
    o_shippriority
ORDER BY
    revenue DESC,
    o_orderdate,
    l_orderkey
LIMIT 10;
```

You should see the following output:

```console
1071617 128186.99915996166 1995-03-10 0
1959075 33104.51278645416 1994-12-23 0
430243 19476.115819260962 1994-12-24 0
```

The result indicates that the order with order key `1071617` should be prioritized for shipment.

:::note

You can also run any arbitrary query that Apache Spark and Spark SQL support on the imported tables in this sample tutorial. Since ScalarDB Analytics supports all queries that Spark SQL supports, you can use not only selection (filtering), joins, aggregation, and ordering, as shown in the examples above, but also window functions, lateral joins, and various other operations, as in the sketch below.
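
For example, the following window-function query, a quick sketch that reuses the same line item table, ranks the line items within each order by their extended price:

```sql
SELECT
    l_orderkey,
    l_linenumber,
    l_extendedprice,
    -- Rank line items from most to least expensive within each order.
    rank() OVER (PARTITION BY l_orderkey ORDER BY l_extendedprice DESC) AS price_rank
FROM
    sample_catalog.scalardb.cassandrans.lineitem
LIMIT 10;
```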

To see which types of queries Spark SQL supports, see the [Spark SQL documentation](https://spark.apache.org/docs/latest/sql-ref.html).

:::

## Step 6: Stop the sample application

To stop the sample application and remove all associated volumes, run the following command. This action shuts down all services and deletes any persisted data stored in the volumes, resetting the application state:

```console
docker compose down -v
```

## Reference

### Schema details

The following entity relationship diagram illustrates the relationships between the tables across PostgreSQL, MySQL, and Cassandra, with foreign keys linking customers, orders, and line items.

```mermaid
erDiagram
  "postgres.sample_ns.customer" ||--|{ "scalardb.mysqlns.orders" : "custkey"
  "postgres.sample_ns.customer" {
    int c_custkey
    text c_name
    text c_address
    int c_nationkey
    text c_phone
    double c_acctbal
    text c_mktsegment
    text c_comment
  }
  "scalardb.mysqlns.orders" ||--|{ "scalardb.cassandrans.lineitem" : "orderkey"
  "scalardb.mysqlns.orders" {
    int o_orderkey
    int o_custkey
    text o_orderstatus
    double o_totalprice
    text o_orderdate
    text o_orderpriority
    text o_clerk
    int o_shippriority
    text o_comment
  }
  "scalardb.cassandrans.lineitem" {
    int l_orderkey
    int l_partkey
    int l_suppkey
    int l_linenumber
    double l_quantity
    double l_extendedprice
    double l_discount
    double l_tax
    text l_returnflag
    text l_linestatus
    text l_shipdate
    text l_commitdate
    text l_receiptdate
    text l_shipinstruct
    text l_shipmode
    text l_comment
  }
```

- `postgres.sample_ns.customer` comes from PostgreSQL and is not managed by ScalarDB.
- `scalardb.mysqlns.orders` and `scalardb.cassandrans.lineitem` come from ScalarDB and are backed by MySQL and Cassandra, respectively.

The following are brief descriptions of the tables:

- **`postgres.sample_ns.customer`.** A table that represents information about customers. This table includes attributes like customer key, name, address, phone number, and account balance.
- **`scalardb.mysqlns.orders`.** A table that contains information about orders that customers have placed. This table includes attributes like order key, customer key, order status, order date, and order priority.
- **`scalardb.cassandrans.lineitem`.** A table that represents line items associated with orders. This table includes attributes such as order key, part key, supplier key, quantity, price, and shipping date.
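
If you want to check these column definitions from the Spark SQL console yourself, the standard Spark SQL `DESCRIBE TABLE` command shows the columns and data types that ScalarDB Analytics exposes for each mapped table. For example:

```sql
-- Show the columns and data types exposed for the line item table.
DESCRIBE TABLE sample_catalog.scalardb.cassandrans.lineitem;
```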