diff --git a/docs/scalardb-analytics-spark/README.mdx b/docs/scalardb-analytics-spark/README.mdx index 79a294ea..523992da 100644 --- a/docs/scalardb-analytics-spark/README.mdx +++ b/docs/scalardb-analytics-spark/README.mdx @@ -4,21 +4,17 @@ tags: - Public Preview --- -# ScalarDB Analytics with Spark +# ScalarDB Analytics import WarningLicenseKeyContact from '/src/components/en-us/_warning-license-key-contact.mdx'; -ScalarDB, as a universal transaction manager, targets mainly transactional workloads and therefore supports limited subsets of relational queries. +**ScalarDB Analytics** is the analytical component of ScalarDB. Similar to ScalarDB, it unifies diverse data sources - ranging from RDBMSs like PostgreSQL and MySQL to NoSQL databases such as Cassandra and DynamoDB - into a single logical database. While ScalarDB focuses on operational workloads with strong transactional consistency across multiple databases, ScalarDB Analytics is optimized for analytical workloads. It supports a wide range of queries, including complex joins, aggregations, and window functions. ScalarDB Analytics operates seamlessly on both ScalarDB-managed data sources and non-ScalarDB-managed ones, enabling advanced analytical queries across various datasets. -ScalarDB Analytics with Spark extends the functionality of ScalarDB to process analytical queries on ScalarDB-managed data by using Apache Spark and Spark SQL. - -Since ScalarDB Analytics with Spark is provided as a Spark catalog plugin, you can read externally managed data sources with its data schema. By using this plugin, you can read data from ScalarDB tables as Spark SQL tables with the same schema. +The current version of ScalarDB Analytics leverages **Apache Spark** as its execution engine. It provides a unified view of ScalarDB-managed and non-ScalarDB-managed data sources by utilizing a Spark custom catalog. Using ScalarDB Analytics, you can treat tables from these data sources as native Spark tables. 
This allows you to execute arbitrary Spark SQL queries seamlessly. For example, you can join a table stored in Cassandra with a table in PostgreSQL to perform a cross-database analysis with ease. ## Further reading -* To run ad-hoc analytical queries or development applications by using ScalarDB Analytics with Spark, see [Getting Started with ScalarDB Analytics with Spark](getting-started.mdx). -* For tutorials on how to use ScalarDB Analytics with Spark by using a sample dataset and application, see [Run Analytical Queries on Sample Data by Using ScalarDB Analytics with Spark](../scalardb-samples/scalardb-analytics-spark-sample/README.mdx). -* For details on how to configure ScalarDB Analytics with Spark, see [Configuration of ScalarDB Analytics with Spark](configuration.mdx). +* For tutorials on how to use ScalarDB Analytics by using a sample dataset and application, see [Getting Started with ScalarDB Analytics](../scalardb-samples/scalardb-analytics-spark-sample/README.mdx). * For supported Spark and Scala versions, see [Version Compatibility of ScalarDB Analytics with Spark](version-compatibility.mdx) diff --git a/docs/scalardb-analytics-spark/version-compatibility.mdx b/docs/scalardb-analytics-spark/version-compatibility.mdx index 41e0e0cb..c4b346fe 100644 --- a/docs/scalardb-analytics-spark/version-compatibility.mdx +++ b/docs/scalardb-analytics-spark/version-compatibility.mdx @@ -13,5 +13,6 @@ Regarding the Java version, ScalarDB Analytics with Spark supports Java 8 or lat The following is a list of Spark and Scalar versions supported by each version of ScalarDB Analytics with Spark. 
| ScalarDB Analytics with Spark Version | ScalarDB Version | Spark Versions Supported | Scala Versions Supported | Minimum Java Version | -|:---------------------------------------|:------------------|:--------------------------|:--------------------------|:----------------------| +|:--------------------------------------|:-----------------|:-------------------------|:-------------------------|:---------------------| +| 3.14 | 3.14 | 3.5, 3.4 | 2.13, 2.12 | 8 | | 3.12 | 3.12 | 3.5, 3.4 | 2.13, 2.12 | 8 | diff --git a/docs/scalardb-samples/scalardb-analytics-spark-sample/README.mdx b/docs/scalardb-samples/scalardb-analytics-spark-sample/README.mdx index e15647cc..0b59ba06 100644 --- a/docs/scalardb-samples/scalardb-analytics-spark-sample/README.mdx +++ b/docs/scalardb-samples/scalardb-analytics-spark-sample/README.mdx @@ -1,33 +1,32 @@ --- tags: - Enterprise Option - - Private Preview + - Public Preview --- -# Run Analytical Queries on Sample Data by Using ScalarDB Analytics with Spark +# Getting Started with ScalarDB Analytics import WarningLicenseKeyContact from '/src/components/en-us/_warning-license-key-contact.mdx'; -This tutorial describes how to run analytical queries on sample data by using ScalarDB Analytics with Spark. The source code is available at https://github.com/scalar-labs/scalardb-samples/scalardb-analytics-spark-sample. +This tutorial describes how to run analytical queries on sample data by using ScalarDB Analytics. The source code is available at https://github.com/scalar-labs/scalardb-samples/scalardb-analytics-spark-sample. -## What you can do in this sample application +ScalarDB Analytics in its current version leverages Apache Spark as its execution engine. It provides a unified view of ScalarDB-managed and non-ScalarDB-managed data sources by using a Spark custom catalog. By using ScalarDB Analytics, you can treat tables from these data sources as native Spark tables. This allows you to execute arbitrary Spark SQL queries seamlessly. 
For example, you can join a table stored in Cassandra with a table in PostgreSQL to perform cross-database analysis with ease. -This sample tutorial shows how you can run interactive analysis in the Spark shell by using ScalarDB Analytics with Spark. In particular, you'll learn how to run the following two types of queries: +## Overview of the Sample Application -- Read data and calculate summaries. -- Join tables that span multiple storages. +This sample tutorial demonstrates how to configure Spark to enable ScalarDB Analytics and perform interactive analyses using `spark-sql` on tables provided by ScalarDB Analytics. -## Prerequisites +## Prerequisites for this sample application - [Docker](https://www.docker.com/get-started/) 20.10 or later with [Docker Compose](https://docs.docker.com/compose/install/) V2 or later - + -## Set up ScalarDB Analytics with Spark +## Step 1: Set up ScalarDB Analytics ### Clone the ScalarDB samples repository -Open **Terminal**, then clone the ScalarDB samples repository by running the following command: +Open **Terminal**, and clone the ScalarDB samples repository by running the following command: ```console git clone https://github.com/scalar-labs/scalardb-samples @@ -39,198 +38,266 @@ Then, go to the directory that contains the sample application by running the fo cd scalardb-samples/scalardb-analytics-spark-sample ``` -### Add your license certificate to the sample directory +### Set your license information -Copy your license certificate (`cert.pem`) to the sample directory by running the following command, replacing `` with the path to your license: +ScalarDB Analytics requires valid license information to be specified in the Spark configuration. You can provide your license details in the **spark-defaults.conf** file. + + +Open the **spark-defaults.conf** file located in the **conf** directory of your Spark installation. Then, replace `` with your license key and `` with the PEM-encoded contents of your license certificate. 
```console -cp //cert.pem cert.pem +spark.sql.catalog.test_catalog.license.key +spark.sql.catalog.test_catalog.license.cert_pem ``` -### Set up the sample underlying databases in ScalarDB +For additional configuration details required in the **spark-defaults.conf** file for setting up ScalarDB Analytics, refer to [ScalarDB Analytics configuration](#scalardb-analytics-configuration). + +## Step 2: Set up the sample databases -To set up the sample underlying databases for ScalarDB, run the following command: +To set up the sample databases, run the following command: ```console docker compose up -d --wait ``` -This command starts three services locally for PostgreSQL, Cassandra, and DynamoDB. +This command starts three services locally: PostgreSQL, Cassandra, and MySQL. -Then, set up the sample databases on those services by running the following command: +- **PostgreSQL:** Used independently (non-ScalarDB-managed). +- **Cassandra and MySQL:** Used as backend databases for ScalarDB (ScalarDB-managed). + +In this guide, PostgreSQL is referred to as a **non-ScalarDB-managed database**, which is not managed by ScalarDB transactions, while Cassandra and MySQL are referred to as **ScalarDB-managed databases**, which are managed by ScalarDB transactions. + +For non-ScalarDB-managed databases, sample data is automatically loaded when the Docker container is initialized, so no additional steps are required. For ScalarDB-managed databases, run the following command to load the sample data after starting the containers: ```console docker compose run --rm sample-data-loader ``` -This command creates `postgresns`, `cassandrans`, and `dynamons` namespaces, which are mapped to the local PostgreSQL, Cassandra, and DyanmoDB services respectively, creates `postgresns.orders`, `cassandrans.lineitem`, and `dynamons.customer` tables, and loads the sample data into those tables. For details about the table schema, see [Schema details](#schema-details). 
+After completing the setup, the following tables should be available: -### Set up ScalarDB Analytics with Spark in the Spark shell +- In PostgreSQL: + - `sample_ns.customer` +- In ScalarDB (backed by Cassandra): + - `cassandrans.lineitem` +- In ScalarDB (backed by MySQL): + - `mysqlns.orders` -To launch the Spark shell, run the following command: +As shown above, within ScalarDB, `cassandrans` and `mysqlns` are mapped to Cassandra and MySQL, respectively. -```console -docker compose run --rm spark-shell -``` +For details about the table schema, including column definitions and data types, refer to [Schema details](#schema-details). Ensure that the sample data has been successfully loaded into these tables. -As you can see in `docker-compose.yml`, this command executes the `spark-shell` command with the `--packages com.scalar-labs:scalardb-analytics-spark-_:` option. With this option, `spark-shell` automatically downloads ScalarDB Analytics with Spark from the Maven Central Repository and add it to the classpath of `spark-shell`. +## Step 3: Launch the Spark SQL console -In the Spark shell console, you can set up ScalarDB Analytics with Spark by running the following commands: +To launch the Spark SQL console, run the following command: ```console -scala> import com.scalar.db.analytics.spark.implicits._ -scala> spark.setupScalarDbAnalytics( - | configPath = "/etc/scalardb.properties", - | namespaces = Set("postgresns", "cassandrans", "dynamons"), - | license = License.certPath("""{"your":"license", "key":"in", "json":"format"}""", "/etc/cert.pem") - | ) +docker compose run --rm spark-sql ``` -:::warning +While launching the Spark SQL console, the ScalarDB Analytics catalog is initialized with the configuration in **spark-defaults.conf** and is registered as a Spark catalog named `test_catalog`. 
-Remember that you must have copied your license certificate to the sample directory as described in [Add your license certificate to the sample directory](#add-your-license-certificate-to-the-sample-directory), since the license is referenced in the JSON string. +### Namespace mapping -::: +The following tables in the configured data sources are mapped to Spark SQL tables, allowing seamless querying across different data sources: -Now, you should have tables for `postgresns.orders`, `cassandrans.lineitem`, and `dynamons.customer` on the Spark side that are equivalent to the tables in ScalarDB. For example: +- For PostgreSQL: + - `test_catalog.postgresql.sample_ns.customer` +- For ScalarDB (backed by Cassandra): + - `test_catalog.scalardb.cassandrans.lineitem` +- For ScalarDB (backed by MySQL): + - `test_catalog.scalardb.mysqlns.orders` -```console -scala> sql("DESCRIBE postgresns.orders").show() -+---------------+---------+-------+ -| col_name|data_type|comment| -+---------------+---------+-------+ -| o_orderkey| int| NULL| -| o_custkey| int| NULL| -| o_orderstatus| string| NULL| -| o_totalprice| double| NULL| -| o_orderdate| string| NULL| -|o_orderpriority| string| NULL| -| o_clerk| string| NULL| -| o_shippriority| int| NULL| -| o_comment| string| NULL| -+---------------+---------+-------+ -``` +For more details about how tables are mapped to Spark SQL tables, refer to [Namespace-mapping details](#namespace-mapping-details). + +Additionally, ScalarDB Analytics offers WAL-interpreted views for ScalarDB tables, simplifying common use cases. In this sample application, you have the following WAL-interpreted views available: -## Run analytical queries +- For ScalarDB (backed by Cassandra): + - `test_catalog.view.scalardb.cassandrans.lineitem` +- For ScalarDB (backed by MySQL): + - `test_catalog.view.scalardb.mysqlns.orders` -The following sections describe how to read data, calculate summaries, and join tables that span multiple storages. 
+In most cases, WAL-interpreted views are preferred over raw tables. In this tutorial, we will use the WAL-interpreted views for the ScalarDB tables. For detailed information on WAL-interpreted views, including their use cases and benefits, see [WAL-interpreted views for ScalarDB tables](#wal-interpreted-views-for-scalardb-tables). + +## Step 4: Run analytical queries + +Now, everything is set up, and you can run analytical queries on the sample data using the Spark SQL console. ### Read data and calculate summaries -You can run a query that reads data from `cassandrans.lineitem`, with the actual data stored in Cassandra, and calculates several summaries of the ordered line items by aggregating the data. - -To run the query, run the following command in the Spark shell console: - -```scala -scala> sql(""" - SELECT - l_returnflag, - l_linestatus, - sum(l_quantity) AS sum_qty, - sum(l_extendedprice) AS sum_base_price, - sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price, - sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge, - avg(l_quantity) AS avg_qty, - avg(l_extendedprice) AS avg_price, - avg(l_discount) AS avg_disc, - count(*) AS count_order - FROM - cassandrans.lineitem - WHERE - to_date(l_shipdate, 'yyyy-MM-dd') <= date '1998-12-01' - 3 - GROUP BY - l_returnflag, - l_linestatus - ORDER BY - l_returnflag, - l_linestatus; - """).show() +You can run the following query to retrieve data from the WAL-interpreted view `test_catalog.view.scalardb.cassandrans.lineitem`, which is backed by Cassandra, and calculate aggregated metrics, including total quantity, average price, and total revenue for line items grouped by their return flag and line status. 
+ +```sql +SELECT + l_returnflag, + l_linestatus, + sum(l_quantity) AS sum_qty, + sum(l_extendedprice) AS sum_base_price, + sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price, + sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge, + avg(l_quantity) AS avg_qty, + avg(l_extendedprice) AS avg_price, + avg(l_discount) AS avg_disc, + count(*) AS count_order +FROM + test_catalog.view.scalardb.cassandrans.lineitem +WHERE + to_date(l_shipdate, 'yyyy-MM-dd') <= date '1998-12-01' - 3 +GROUP BY + l_returnflag, + l_linestatus +ORDER BY + l_returnflag, + l_linestatus; ``` You should see the following output: ```console -+------------+------------+-------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------+ -|l_returnflag|l_linestatus|sum_qty| sum_base_price| sum_disc_price| sum_charge| avg_qty| avg_price| avg_disc|count_order| -+------------+------------+-------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------+ -| A| F| 1519|2374824.6560278563|1387364.2207725341|1962763.4654265852|26.649122807017545|41663.590456629056|0.41501802923479575| 57| -| N| F| 98| 146371.2295412012| 85593.96776336085|121041.55837332775|32.666666666666664|48790.409847067065|0.40984706454007996| 3| -| N| O| 5374| 8007373.247086477| 4685647.785126835| 6624210.945739046|24.427272727272726| 36397.15112312035| 0.4147594809559689| 220| -| R| F| 1461|2190869.9676265526|1284178.4378283697|1814151.2807494882|25.189655172413794| 37773.62013149229|0.41323493790730753| 58| -+------------+------------+-------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------+ +A F 1519 2374824.6560278563 1387364.2207725341 1962763.4654265852 26.649122807017545 41663.590456629056 0.41501802923479575 57 +N F 98 146371.2295412012 85593.96776336085 121041.55837332775 32.666666666666664 
48790.409847067065 0.40984706454007996 3 +N O 5374 8007373.247086477 4685647.785126835 6624210.945739046 24.427272727272726 36397.15112312035 0.4147594809559689 220 +R F 1461 2190869.9676265526 1284178.4378283697 1814151.2807494882 25.189655172413794 37773.62013149229 0.41323493790730753 58 ``` -### Join tables that span multiple databases - -You can also run a query to join tables that are connected to the three back-end databases and calculate the unshipped orders with the highest revenue on a particular date. - -To run the query, run the following command in the Spark shell console: - -```scala -scala> sql(""" - SELECT - l_orderkey, - sum(l_extendedprice * (1 - l_discount)) AS revenue, - o_orderdate, - o_shippriority - FROM - dynamons.customer, - postgresns.orders, - cassandrans.lineitem - WHERE - c_mktsegment = 'AUTOMOBILE' - AND c_custkey = o_custkey - AND l_orderkey = o_orderkey - AND o_orderdate < '1995-03-15' - AND l_shipdate > '1995-03-15' - GROUP BY - l_orderkey, - o_orderdate, - o_shippriority - ORDER BY - revenue DESC, - o_orderdate, - l_orderkey - LIMIT 10; - """).show() +### Join tables that span multiple data sources + +You can also run the following query to join tables from multiple data sources, including both ScalarDB-managed and non-ScalarDB-managed tables. This query joins customer, order, and line item data from PostgreSQL, MySQL, and Cassandra, identifying the top unshipped orders with the highest revenue on a specific date. This analysis helps prioritize shipments for maximum financial impact. 
+ +```sql +SELECT + l_orderkey, + sum(l_extendedprice * (1 - l_discount)) AS revenue, + o_orderdate, + o_shippriority +FROM + test_catalog.postgresql.sample_ns.customer, + test_catalog.scalardb.mysqlns.orders, + test_catalog.scalardb.cassandrans.lineitem +WHERE + c_mktsegment = 'AUTOMOBILE' + AND c_custkey = o_custkey + AND l_orderkey = o_orderkey + AND o_orderdate < '1995-03-15' + AND l_shipdate > '1995-03-15' +GROUP BY + l_orderkey, + o_orderdate, + o_shippriority +ORDER BY + revenue DESC, + o_orderdate, + l_orderkey +LIMIT 10; ``` You should see the following output: ```console -+----------+------------------+-----------+--------------+ -|l_orderkey| revenue|o_orderdate|o_shippriority| -+----------+------------------+-----------+--------------+ -| 1071617|128186.99915996166| 1995-03-10| 0| -| 1959075| 33104.51278645416| 1994-12-23| 0| -| 430243|19476.115819260962| 1994-12-24| 0| -+----------+------------------+-----------+--------------+ +1071617 128186.99915996166 1995-03-10 0 +1959075 33104.51278645416 1994-12-23 0 +430243 19476.115819260962 1994-12-24 0 ``` :::note -You can also run any arbitrary query that Apache Spark and Spark SQL support on the imported tables in this sample tutorial. Since ScalarDB Analytics with Spark supports all queries that Spark SQL supports, you can use not only join, aggregation, filtering, and ordering as shown in the example but also the window function, lateral join, and various analytical operations. +You can also run any arbitrary query that Apache Spark and Spark SQL support on the imported tables in this sample tutorial. Since ScalarDB Analytics supports all queries that Spark SQL supports, you can do not only selections (filtering), joins, aggregations, and ordering, as shown in the example, but also window functions, lateral joins, and other various operations. To see which types of queries Spark SQL supports, see the [Spark SQL documentation](https://spark.apache.org/docs/latest/sql-ref.html). 
::: -## Stop the sample application +## Step 5: Stop the sample application + +To stop the sample application and remove all associated volumes, run the following command. This action shuts down all services and deletes any persisted data stored in the volumes, resetting the application state: + +```console +docker compose down -v +``` + +## Reference + +This section contains other details related to ScalarDB Analytics, like configurations and schema details. + +### ScalarDB Analytics configuration + +You can configure ScalarDB Analytics in the Spark configuration, such as in the `spark-defaults.conf` file. This section contains brief explanations of the configurations for ScalarDB Analytics in this sample application. + +#### Common configurations -To stop the sample application, stop the Docker container by running the following command: +The following are common configurations for ScalarDB Analytics: ```console -docker compose down +spark.sql.catalog.test_catalog com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog +spark.sql.extensions com.scalar.db.analytics.spark.extension.ScalarDbAnalyticsExtensions +``` + +The first line specifies the Spark catalog plugin implementation class. You must always set this to `com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog` to enable the ScalarDB Analytics catalog in Spark SQL. + +:::note + +You can set an arbitrary string as the catalog name, which is `test_catalog` in this example. The configured catalog name will be used as a part of the table identifier in the Spark SQL query. + +::: + +The second line specifies the Spark SQL extension implementation class. You must always set this to `com.scalar.db.analytics.spark.extension.ScalarDbAnalyticsExtensions`, along with the `spark.sql.catalog.test_catalog` configuration, mentioned above. 
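
As a rough illustration of how the catalog name becomes part of the configuration key, the following Python sketch renders the two common lines for an arbitrary catalog name. The helper `common_config_lines` is hypothetical and is not part of ScalarDB Analytics; only the two class names are taken from the sample configuration above.

```python
from typing import List

# Class names copied from the sample configuration shown above.
CATALOG_CLASS = "com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog"
EXTENSIONS_CLASS = "com.scalar.db.analytics.spark.extension.ScalarDbAnalyticsExtensions"


def common_config_lines(catalog_name: str) -> List[str]:
    """Return the two common spark-defaults.conf lines for a catalog name.

    Hypothetical helper for illustration only: the catalog name is embedded
    in the first key, while the extensions key does not depend on it.
    """
    return [
        f"spark.sql.catalog.{catalog_name} {CATALOG_CLASS}",
        f"spark.sql.extensions {EXTENSIONS_CLASS}",
    ]


if __name__ == "__main__":
    # Render the lines used in this sample, where the catalog is test_catalog.
    print("\n".join(common_config_lines("test_catalog")))
```

Passing a different name changes only the first key. Because the catalog name is part of the table identifier, queries would then need to use that name as the leading part of each identifier.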
+ +#### License information + +The following are the license configurations for ScalarDB Analytics: + +```apacheconf +spark.sql.catalog.test_catalog.license.key +spark.sql.catalog.test_catalog.license.cert_pem +``` + +These lines provide the license information for ScalarDB Analytics. As explained above, you must replace the placeholders with your license information before launching the Spark SQL console. + +#### Data source configurations for ScalarDB-managed databases + +The following are the data source configurations for ScalarDB-managed databases for ScalarDB Analytics: + +```apacheconf +spark.sql.catalog.test_catalog.data_source.scalardb.type scalardb +spark.sql.catalog.test_catalog.data_source.scalardb.config_path /etc/scalardb.properties +``` + +The first line specifies the data source type. You must always set this to `scalardb` to configure the data source for ScalarDB-managed databases. The second line specifies the path to the configuration file for the ScalarDB data source, which is the only required configuration for the ScalarDB data source. + +:::note + +You can set an arbitrary string as the data source name, which is `scalardb` in this example. The configured data source names will be used as a part of the table identifier in the Spark SQL query. 
+ +::: + +#### Data source configurations for non-ScalarDB-managed databases + +The following are the data source configurations for non-ScalarDB-managed databases for ScalarDB Analytics: + +```apacheconf +spark.sql.catalog.test_catalog.data_source.postgresql.type postgresql +spark.sql.catalog.test_catalog.data_source.postgresql.host postgres +spark.sql.catalog.test_catalog.data_source.postgresql.port 5432 +spark.sql.catalog.test_catalog.data_source.postgresql.username postgres +spark.sql.catalog.test_catalog.data_source.postgresql.password postgres +spark.sql.catalog.test_catalog.data_source.postgresql.database sampledb ``` -## Reference - Schema details +These lines configure PostgreSQL as a non-ScalarDB-managed data source. The first line specifies the data source type, and the remaining lines specify the data source-specific configurations, which are the connection details for the PostgreSQL data source. The data source-specific configurations may vary depending on the data source type. + +:::note + +You can set an arbitrary string as the data source name, which is `postgresql` in this example, in the same way as the ScalarDB data source. + +::: + +### Schema details -In this sample tutorial, you have tables with the following schema in the underlying databases of ScalarDB: +The following entity relationship diagram illustrates the relationships between the tables across PostgreSQL, MySQL, and Cassandra, with foreign keys linking customers, orders, and line items. 
```mermaid erDiagram - "dynamons.customer" ||--|{ "postgresns.orders" : "custkey" - "dynamons.customer" { + "postgresql.sample_ns.customer" ||--|{ "scalardb.mysqlns.orders" : "custkey" + "postgresql.sample_ns.customer" { int c_custkey text c_name text c_address @@ -240,8 +307,8 @@ erDiagram text c_mktsegment text c_comment } - "postgresns.orders" ||--|{ "cassandrans.lineitem" : "orderkey" - "postgresns.orders" { + "scalardb.mysqlns.orders" ||--|{ "scalardb.cassandrans.lineitem" : "orderkey" + "scalardb.mysqlns.orders" { int o_orderkey int o_custkey text o_orderstatus @@ -252,7 +319,7 @@ erDiagram int o_shippriority text o_comment } - "cassandrans.lineitem" { + "scalardb.cassandrans.lineitem" { int l_orderkey int l_partkey int l_suppkey @@ -272,9 +339,117 @@ erDiagram } ``` -For reference, this diagram shows the following: +- `postgresql.sample_ns.customer` comes from PostgreSQL, which is not managed by ScalarDB. +- `scalardb.mysqlns.orders` and `scalardb.cassandrans.lineitem` come from ScalarDB, which are backed by MySQL and Cassandra, respectively. + +The following are brief descriptions of the tables: + +- **`postgresql.sample_ns.customer`.** A table that represents information about customers. This table includes attributes like customer key, name, address, phone number, and account balance. +- **`scalardb.mysqlns.orders`.** A table that contains information about orders that customers have placed. This table includes attributes like order key, customer key, order status, order date, and order priority. +- **`scalardb.cassandrans.lineitem`.** A table that represents line items associated with orders. This table includes attributes such as order key, part key, supplier key, quantity, price, and shipping date. + +### Namespace-mapping details + +The tables of each configured data source are mapped to the Spark SQL identifier by using the following format: + +```console +...`. 
+``` + +The following explains each part of the table identifier: + +- **``.** The catalog name configured in spark-defaults.conf. This identifies the ScalarDB Analytics catalog in Spark SQL. +- **``.** The data source name configured in spark-defaults.conf, representing the specific type of data source, such as postgresql or scalardb. +- **``.** The namespace name in the data source. For example: + - In an RDBMS like PostgreSQL or MySQL, this corresponds to the schema. + - In NoSQL databases like Cassandra, this may refer to a keyspace. +- **``.** The name of the table in the namespace. + +In this example, the following tables are available: + +- For PostgreSQL: + - test_catalog.postgresql.sample_ns.customer +- For ScalarDB (backed by Cassandra): + - test_catalog.scalardb.cassandrans.lineitem +- For ScalarDB (backed by MySQL): + - test_catalog.scalardb.mysqlns.orders -- `dynamons`, `postgresns`, and `cassandrans`. Namespaces that are mapped to the back-end storages of DynamoDB, PostgreSQL, and Cassandra, respectively. -- `dynamons.customer`. A table that represents information about customers. This table includes attributes like customer key, name, address, phone number, and account balance. -- `postgresns.orders`. A table that contains information about orders that customers have placed. This table includes attributes like order key, customer key, order status, order date, and order priority. -- `cassandrans.lineitem`. A table that represents line items associated with orders. This table includes attributes such as order key, part key, supplier key, quantity, price, and shipping date. +This mapping allows you to access and query tables from different data sources seamlessly by using Spark SQL. + +### WAL-interpreted views for ScalarDB tables + +ScalarDB tables that are transaction-enabled include transaction metadata columns in the raw tables stored in the underlying data sources. 
Since ScalarDB Analytics maps these raw tables directly to Spark SQL tables, you will see transaction metadata columns when describing these tables in Spark SQL. You can see these columns by running the following command: + +```sql +DESCRIBE test_catalog.scalardb.mysqlns.orders; +``` + +You should see the following output: + +```console +o_orderkey int +o_custkey int +o_orderstatus string +o_totalprice double +o_orderdate string +o_orderpriority string +o_clerk string +o_shippriority int +o_comment string +tx_id string +tx_state int +tx_version int +tx_prepared_at bigint +tx_committed_at bigint +before_tx_id string +before_tx_state int +before_tx_version int +before_tx_prepared_at bigint +before_tx_committed_at bigint +before_o_orderstatus string +before_o_clerk string +before_o_orderdate string +before_o_shippriority int +before_o_custkey int +before_o_totalprice double +before_o_comment string +before_o_orderpriority string +``` + +In many cases, you may not need the transaction metadata columns in your queries. To simplify this, ScalarDB Analytics provides WAL-interpreted views. WAL-interpreted views hide transaction metadata columns and expose only user-defined columns, simplifying queries. For example, use WAL-interpreted views when performing read-only analytics or when transaction metadata is not needed for analysis. Additionally, WAL-interpreted views guarantee read-committed consistency by interpreting the transaction metadata columns internally. + +#### WAL-interpreted view naming convention in Spark SQL + +WAL-interpreted views are prefixed with `view.` before the data source part of the table identifier. 
For example, the following WAL-interpreted views are available for ScalarDB tables: + +- For ScalarDB (backed by Cassandra): + - test_catalog.view.scalardb.cassandrans.lineitem +- For ScalarDB (backed by MySQL): + - test_catalog.view.scalardb.mysqlns.orders + +For example, to see the WAL-interpreted view for the ScalarDB table backed by Cassandra, run the following command: + +```sql +DESCRIBE test_catalog.view.scalardb.cassandrans.lineitem; +``` + +You should see the following output: + +```console +l_orderkey int +l_linenumber int +l_comment string +l_commitdate string +l_discount double +l_extendedprice double +l_linestatus string +l_partkey int +l_quantity int +l_receiptdate string +l_returnflag string +l_shipdate string +l_shipinstruct string +l_shipmode string +l_suppkey int +l_tax double +```
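
The naming convention described above can be sketched as a pair of small Python helpers. This is a minimal illustration under the assumption that identifiers are plain dot-separated parts, as in the examples in this tutorial; the function names `table_identifier` and `view_identifier` are hypothetical and are not provided by ScalarDB Analytics.

```python
def table_identifier(catalog: str, data_source: str, namespace: str, table: str) -> str:
    """Build a raw-table identifier from catalog, data source, namespace, and table."""
    return ".".join([catalog, data_source, namespace, table])


def view_identifier(catalog: str, data_source: str, namespace: str, table: str) -> str:
    """Build a WAL-interpreted view identifier.

    Follows the convention in this tutorial: `view` is placed before the
    data source part of the raw-table identifier.
    """
    return ".".join([catalog, "view", data_source, namespace, table])


if __name__ == "__main__":
    # Identifiers used in this sample application.
    print(table_identifier("test_catalog", "scalardb", "mysqlns", "orders"))
    print(view_identifier("test_catalog", "scalardb", "cassandrans", "lineitem"))
```

In this sample, views exist only for the ScalarDB-managed tables, so `view_identifier` would apply to the `scalardb` data source and not to `postgresql`.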