diff --git a/docs/scalardb-analytics-spark/version-compatibility.mdx b/docs/scalardb-analytics-spark/version-compatibility.mdx deleted file mode 100644 index 48a17c94..00000000 --- a/docs/scalardb-analytics-spark/version-compatibility.mdx +++ /dev/null @@ -1,18 +0,0 @@ ---- -tags: - - Enterprise Option -displayed_sidebar: docsEnglish ---- - -# Version Compatibility of ScalarDB Analytics with Spark - -Since Spark and Scala may be incompatible among different minor versions, ScalarDB Analytics with Spark offers different artifacts for various Spark and Scala versions, named in the format `scalardb-analytics-spark-_`. Make sure that you select the artifact matching the Spark and Scala versions you're using. For example, if you're using Spark 3.5 with Scala 2.13, you must specify `scalardb-analytics-spark-3.5_2.13`. - -Regarding the Java version, ScalarDB Analytics with Spark supports Java 8 or later. - -The following is a list of Spark and Scalar versions supported by each version of ScalarDB Analytics with Spark. - -| ScalarDB Analytics with Spark Version | ScalarDB Version | Spark Versions Supported | Scala Versions Supported | Minimum Java Version | -|:--------------------------------------|:-----------------|:-------------------------|:-------------------------|:---------------------| -| 3.14 | 3.14 | 3.5, 3.4 | 2.13, 2.12 | 8 | -| 3.12 | 3.12 | 3.5, 3.4 | 2.13, 2.12 | 8 | diff --git a/docs/scalardb-analytics-spark/README.mdx b/docs/scalardb-analytics/README.mdx similarity index 100% rename from docs/scalardb-analytics-spark/README.mdx rename to docs/scalardb-analytics/README.mdx diff --git a/docs/scalardb-analytics/deployment.mdx b/docs/scalardb-analytics/deployment.mdx new file mode 100644 index 00000000..b92c102a --- /dev/null +++ b/docs/scalardb-analytics/deployment.mdx @@ -0,0 +1,220 @@ +--- +tags: + - Enterprise Option + - Public Preview +displayed_sidebar: docsEnglish +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Deploy ScalarDB Analytics in Public Cloud Environments + +This guide explains how to deploy ScalarDB Analytics in a public cloud environment. ScalarDB Analytics currently uses Apache Spark as an execution engine and supports managed Spark services provided by public cloud providers, such as Amazon EMR and Databricks. + +## Supported managed Spark services and their application types + +ScalarDB Analytics supports the following managed Spark services and application types. + +| Public Cloud Service | Spark Driver | Spark Connect | JDBC | +| -------------------------- | ------------ | ------------- | ---- | +| Amazon EMR (EMR on EC2) | ✅ | ✅ | ❌ | +| Databricks | ✅ | ❌ | ✅ | + +## Configure and deploy + +Select your public cloud environment, and follow the instructions to set up and deploy ScalarDB Analytics. + + + + +

<Tabs groupId="cloud-service" queryString>
<TabItem value="emr" label="Amazon EMR">

<h3>Use Amazon EMR</h3>

You can use Amazon EMR (EMR on EC2) to run analytical queries through ScalarDB Analytics. For the basics of launching an EMR cluster, refer to the [AWS EMR on EC2 documentation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan.html).

<h4>ScalarDB Analytics configuration</h4>

To enable ScalarDB Analytics, you need to add the following configuration to the Software setting when you launch an EMR cluster. Be sure to replace the content in the angle brackets:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars.packages": "com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>",
      "spark.sql.catalog.<CATALOG_NAME>": "com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog",
      "spark.sql.extensions": "com.scalar.db.analytics.spark.extension.ScalarDbAnalyticsExtensions",
      "spark.sql.catalog.<CATALOG_NAME>.license.cert_pem": "<YOUR_LICENSE_CERT_PEM>",
      "spark.sql.catalog.<CATALOG_NAME>.license.key": "<YOUR_LICENSE_KEY>",

      // Add your data source configuration below
    }
  }
]
```

The following describes what you should change the content in the angle brackets to:

- `<SPARK_VERSION>`: The version of Spark.
- `<SCALA_VERSION>`: The version of Scala used to build Spark.
- `<SCALARDB_ANALYTICS_VERSION>`: The version of ScalarDB Analytics.
- `<CATALOG_NAME>`: The name of the catalog.
- `<YOUR_LICENSE_CERT_PEM>`: The PEM-encoded license certificate.
- `<YOUR_LICENSE_KEY>`: The license key.

For more details, refer to [Set up ScalarDB Analytics in the Spark configuration](development.mdx#set-up-scalardb-analytics-in-the-spark-configuration).
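As a reference, the following sketch shows the kind of entries you would add inside `Properties` for a data source. The data source name `mysql_source` and all connection values here are hypothetical; the full set of data source keys is described in [Set up ScalarDB Analytics in the Spark configuration](development.mdx#set-up-scalardb-analytics-in-the-spark-configuration):

```json
"spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.type": "mysql",
"spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.host": "mysql.example.com",
"spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.port": "3306",
"spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.username": "analytics",
"spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.password": "<YOUR_PASSWORD>",
"spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.database": "mydb"
```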

<h4>Run analytical queries via the Spark driver</h4>

After the EMR Spark cluster has launched, you can use SSH to connect to the primary node of the EMR cluster and run your Spark application there. For details on how to create a Spark driver application, refer to [Spark Driver application](development.mdx?spark-application-type=spark-driver#spark-driver-application).
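For example, connecting to the primary node and submitting a prebuilt application JAR file might look like the following. The key pair file, hostname, and JAR file name are placeholders for illustration:

```console
ssh -i <YOUR_KEY_PAIR>.pem hadoop@<PRIMARY_NODE_PUBLIC_DNS>
spark-submit --class MyApp my-spark-application-all.jar
```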

<h4>Run analytical queries via Spark Connect</h4>

You can use Spark Connect to run your Spark application remotely by using the EMR cluster that you launched.

You first need to configure the Software setting in the same way as for the [Spark Driver application](development.mdx?spark-application-type=spark-driver#spark-driver-application). You also need to set the following configuration to enable Spark Connect.
<h5>Allow inbound traffic for a Spark Connect server</h5>
1. Create a security group to allow inbound traffic for a Spark Connect server (port 15001 is the default).
2. Allow the "Amazon EMR service role" role to attach the security group to the primary node of the EMR cluster.
3. Add the security group to the primary node of the EMR cluster as "Additional security groups" when you launch the EMR cluster.
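For reference, creating such a security group with the AWS CLI might look like the following sketch. The VPC ID, security group ID, and client CIDR range are placeholders, and you can achieve the same result in the AWS Management Console:

```console
aws ec2 create-security-group \
  --group-name spark-connect-sg \
  --description "Inbound traffic for the Spark Connect server" \
  --vpc-id <YOUR_VPC_ID>

aws ec2 authorize-security-group-ingress \
  --group-id <SECURITY_GROUP_ID> \
  --protocol tcp \
  --port 15001 \
  --cidr <YOUR_CLIENT_CIDR>
```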
<h5>Launch the Spark Connect server via a bootstrap action</h5>
1. Create a script file to launch the Spark Connect server as follows:

```bash
#!/usr/bin/env bash

set -eu -o pipefail

cd /var/lib/spark

sudo -u spark /usr/lib/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_<SCALA_VERSION>:<SPARK_FULL_VERSION>,com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>
```

The following describes what you should change the content in the angle brackets to:

- `<SCALA_VERSION>`: The major and minor version of Scala that matches your Spark installation (such as 2.12 or 2.13)
- `<SPARK_FULL_VERSION>`: The full version of Spark you are using (such as 3.5.3)
- `<SPARK_VERSION>`: The major and minor version of Spark you are using (such as 3.5)
- `<SCALARDB_ANALYTICS_VERSION>`: The version of ScalarDB Analytics

2. Upload the script file to S3 (see the example after this list).
3. Allow the role of "EC2 instance profile for Amazon EMR" to access the uploaded script file in S3.
4. Add the uploaded script file to "Bootstrap actions" when you launch the EMR cluster.
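For example, uploading the script file with the AWS CLI might look like the following, assuming a hypothetical bucket name:

```console
aws s3 cp start-connect-server.sh s3://<YOUR_BUCKET>/bootstrap/start-connect-server.sh
```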
<h5>Run analytical queries</h5>
You can run your Spark application via Spark Connect from anywhere by using the remote URL of the Spark Connect server, which is `sc://<PRIMARY_NODE_PUBLIC_HOSTNAME>:15001`.

For details on how to create a Spark application by using Spark Connect, refer to [Spark Connect application](development.mdx?spark-application-type=spark-connect#spark-connect-application).
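As a quick connectivity check, you can also open an interactive PySpark session against the server. This assumes that you have a local Spark 3.4 or later installation; the hostname is a placeholder:

```console
pyspark --remote "sc://<PRIMARY_NODE_PUBLIC_HOSTNAME>:15001"
```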

</TabItem>
<TabItem value="databricks" label="Databricks">

<h3>Use Databricks</h3>

You can use Databricks to run analytical queries through ScalarDB Analytics.

:::note

Databricks provides a modified version of Apache Spark, which works differently from open-source Apache Spark.

:::

<h4>Launch Databricks cluster</h4>

ScalarDB Analytics works with all-purpose and jobs-compute clusters on Databricks. When you launch the cluster, you need to configure the cluster to enable ScalarDB Analytics as follows:

1. Store the license certificate and license key in the cluster by using the Databricks CLI.

```console
databricks secrets create-scope scalardb-analytics-secret # You can use any secret scope name.
cat license_key.json | databricks secrets put-secret scalardb-analytics-secret license-key
cat license_cert.pem | databricks secrets put-secret scalardb-analytics-secret license-cert
```

:::note

For details on how to install and use the Databricks CLI, refer to the [Databricks CLI documentation](https://docs.databricks.com/en/dev-tools/cli/index.html).

:::

2. Select "No isolation shared" for the cluster mode. (This is required. ScalarDB Analytics works only with this cluster mode.)
3. Select an appropriate Databricks runtime version that supports Spark 3.4 or later.
4. Configure "Advanced Options" > "Spark config" as follows, replacing `<CATALOG_NAME>` with the name of the catalog that you want to use:

```
spark.sql.catalog.<CATALOG_NAME> com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog
spark.sql.extensions com.scalar.db.analytics.spark.extension.ScalarDbAnalyticsExtensions
spark.sql.catalog.<CATALOG_NAME>.license.key {{secrets/scalardb-analytics-secret/license-key}}
spark.sql.catalog.<CATALOG_NAME>.license.cert_pem {{secrets/scalardb-analytics-secret/license-cert}}
```

:::note

You also need to configure the data source. For details, refer to [Set up ScalarDB Analytics in the Spark configuration](development.mdx#set-up-scalardb-analytics-in-the-spark-configuration).

:::

:::note

If you specified different secret names in the previous step, be sure to replace the secret names in the configuration above.

:::

5. Add the ScalarDB Analytics library to the launched cluster as a Maven dependency. For details on how to add the library, refer to the [Databricks cluster libraries documentation](https://docs.databricks.com/en/libraries/cluster-libraries.html).
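For reference, the data source part of the Spark config in step 4 might look like the following sketch, which registers a hypothetical MySQL server as a data source named `mysql_source`. All connection values here are illustrative; the available keys are described in [Set up ScalarDB Analytics in the Spark configuration](development.mdx#set-up-scalardb-analytics-in-the-spark-configuration):

```
spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.type mysql
spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.host mysql.example.com
spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.port 3306
spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.username analytics
spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.password <YOUR_PASSWORD>
spark.sql.catalog.<CATALOG_NAME>.data_source.mysql_source.database mydb
```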

<h4>Run analytical queries via the Spark driver</h4>

You can run your Spark application on the properly configured Databricks cluster with Databricks Notebook or Databricks Jobs to access the tables in ScalarDB Analytics. To run the Spark application, you can migrate your PySpark, Scala, or Spark SQL application to Databricks Notebook, or use Databricks Jobs to run your Spark application. ScalarDB Analytics works with task types for Notebook, Python, JAR, and SQL.

For more details on how to use Databricks Jobs, refer to the [Databricks Jobs documentation](https://docs.databricks.com/en/jobs/index.html).
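For example, a notebook cell or a SQL task might run a query like the following. The catalog, data source, namespace, and table names are illustrative; the identifier format is described in [Catalog information mapping](development.mdx#catalog-information-mapping):

```sql
SELECT * FROM my_catalog.my_data_source.my_namespace.my_table LIMIT 10;
```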

<h4>Run analytical queries via the JDBC driver</h4>

Databricks supports JDBC to run SQL jobs on the cluster. You can use this feature to run your Spark application in SQL with ScalarDB Analytics by configuring extra settings as follows:

1. Download the ScalarDB Analytics library JAR file from the Maven repository.
2. Upload the JAR file to the Databricks workspace.
3. Add the JAR file to the cluster as a library, instead of the Maven dependency.
4. Create an init script as follows, replacing `<PATH_TO_YOUR_JAR_FILE>` with the path to your JAR file in the Databricks workspace:

```bash
#!/bin/bash

# Target directories
TARGET_DIRECTORIES=("/databricks/jars" "/databricks/hive_metastore_jars")
JAR_PATH="<PATH_TO_YOUR_JAR_FILE>"

# Copy the JAR file to the target directories
for TARGET_DIR in "${TARGET_DIRECTORIES[@]}"; do
  mkdir -p "$TARGET_DIR"
  cp "$JAR_PATH" "$TARGET_DIR/"
done
```

5. Upload the init script to the Databricks workspace.
6. Add the init script in "Advanced Options" > "Init scripts" when you launch the cluster.

After the cluster is launched, you can get the JDBC URL of the cluster on the "Advanced Options" > "JDBC/ODBC" tab on the cluster details page.

To connect to the Databricks cluster by using JDBC, you need to add the Databricks JDBC driver to your application dependencies. For example, if you are using Gradle, you can add the following dependency to your `build.gradle` file:

```groovy
implementation("com.databricks:databricks-jdbc:0.9.6-oss")
```

Then, you can connect to the Databricks cluster by using JDBC with the JDBC URL (`<YOUR_CLUSTERS_JDBC_URL>`), as is common with JDBC applications:

```java
Class.forName("com.databricks.client.jdbc.Driver");
String url = "<YOUR_CLUSTERS_JDBC_URL>";
Connection conn = DriverManager.getConnection(url);
```

For more details on how to use JDBC with Databricks, refer to the [Databricks JDBC Driver documentation](https://docs.databricks.com/en/integrations/jdbc/index.html).
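The following is a minimal sketch of running a query over such a connection. The JDBC URL and the table identifier are placeholders; this is an illustration rather than a complete application:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("com.databricks.client.jdbc.Driver");
    String url = "<YOUR_CLUSTERS_JDBC_URL>";

    // Run a query against a ScalarDB Analytics table and print the first column.
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT * FROM my_catalog.my_data_source.my_namespace.my_table LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}
```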
</TabItem>
</Tabs>
diff --git a/docs/scalardb-analytics/design.mdx b/docs/scalardb-analytics/design.mdx new file mode 100644 index 00000000..5534c293 --- /dev/null +++ b/docs/scalardb-analytics/design.mdx @@ -0,0 +1,393 @@ +--- +tags: + - Enterprise Option + - Public Preview +displayed_sidebar: docsEnglish +--- + +# ScalarDB Analytics Design + +import Tabs from "@theme/Tabs"; +import TabItem from "@theme/TabItem"; + +ScalarDB Analytics is the analytical component of ScalarDB. Similar to ScalarDB, it unifies diverse data sources—ranging from RDBMSs like PostgreSQL and MySQL to NoSQL databases like Cassandra and DynamoDB—into a single logical database. This enables you to perform analytical queries across multiple databases seamlessly. + +ScalarDB Analytics consists of two main components: a universal data catalog and a query engine: + +- **Universal data catalog.** The universal data catalog is a flexible metadata management system that handles multiple catalog spaces. Each catalog space provides an independent logical grouping of data sources and views, enabling organized management of diverse data environments. +- **Query engine.** The query engine executes queries against the universal data catalog. ScalarDB Analytics provides appropriate data connectors to interface with the underlying data sources. + +ScalarDB Analytics employs a decoupled architecture where the data catalog and query engine are separate components. This design allows for integration with various existing query engines through an extensible architecture. As a result, you can select different query engines to execute queries against the same data catalog based on your specific requirements. + +## Universal data catalog + +The universal data catalog is composed of several levels and is structured as follows: + +```mermaid +graph TD + C[Catalog] --> D[Data Source] + C[Catalog] --> D2[Data Source] + subgraph " " + D --> N[Namespace] + D --> N2[Namespace] + N --> T[Table] + N --> T2[Table] + T --> TC[Column] + T --> TC2[Column] + D2 + end + + C --> VN[View Namespace] + C --> VN2[View Namespace] + subgraph " " + VN --> V[View] + VN --> V2[View] + V --> VC[Column] + V --> VC2[Column] + VN2 + end +``` + +The following are definitions for those levels: + +- **Catalog** is a folder that contains all your data source information. For example, you might have one catalog called `analytics_catalog` for your analytics data and another called `operational_catalog` for your day-to-day operations. +- **Data source** represents each data source you connect to. For each data source, we store important information like: + - What kind of data source it is (PostgreSQL, Cassandra, etc.) + - How to connect to it (connection details and passwords) + - Special features the data source supports (like transactions) +- **Namespace** is like a subfolder within your data source that groups related tables together. In PostgreSQL these are called schemas, in Cassandra they're called keyspaces. You can have multiple levels of namespaces, similar to having folders within folders. +- **Table** is where your actual data lives. For each table, we keep track of: + - What columns it has + - What type of data each column can store + - Whether columns can be empty (null) +- **View namespace** is a special folder for views. Unlike regular namespaces that are tied to one data source, view namespaces can work with multiple data sources at once. 
- **View** is like a virtual table that can:
  - Show your data in a simpler way (like hiding technical columns in ScalarDB tables)
  - Combine data from different sources using SQL queries

  Like tables, each view has its own columns with specific types and rules about empty values.

### Supported data types

ScalarDB Analytics supports a wide range of data types across different data sources. The universal data catalog maps these data types to a common set of types to ensure compatibility and consistency across sources. The following list shows the supported data types in ScalarDB Analytics:

- `BYTE`
- `SMALLINT`
- `INT`
- `BIGINT`
- `FLOAT`
- `DOUBLE`
- `DECIMAL`
- `TEXT`
- `BLOB`
- `BOOLEAN`
- `DATE`
- `TIME`
- `DATETIME`
- `TIMESTAMP`
- `DURATION`
- `INTERVAL`

### Catalog information mappings by data source

When registering a data source to ScalarDB Analytics, the catalog information of the data source, that is, its namespaces, tables, and columns, is resolved and registered to the universal data catalog. To resolve the catalog information of a data source, particular objects on the data source side are mapped to universal data catalog objects. This mapping consists of two parts: catalog-level mappings and data-type mappings. In the following sections, we describe how ScalarDB Analytics maps the catalog level and data type from each data source into the universal data catalog.

#### Catalog-level mappings

The catalog-level mappings are the mappings of the namespace names, table names, and column names from the data sources to the universal data catalog. To see the catalog-level mappings in each data source, select a data source.

<Tabs groupId="data-source" queryString>
<TabItem value="scalardb" label="ScalarDB">

The catalog information of ScalarDB is automatically resolved by ScalarDB Analytics. The catalog-level objects are mapped as follows:

- The ScalarDB namespace is mapped to the namespace. Therefore, the namespace of the ScalarDB data source is always single level, consisting of only the namespace name.
- The ScalarDB table is mapped to the table.
- The ScalarDB column is mapped to the column.

</TabItem>
<TabItem value="postgresql" label="PostgreSQL">

The catalog information of PostgreSQL is automatically resolved by ScalarDB Analytics. The catalog-level objects are mapped as follows:

- The PostgreSQL schema is mapped to the namespace. Therefore, the namespace of the PostgreSQL data source is always single level, consisting of only the schema name.
  - Only user-defined schemas are mapped to namespaces. The following system schemas are ignored:
    - `information_schema`
    - `pg_catalog`
- The PostgreSQL table is mapped to the table.
- The PostgreSQL column is mapped to the column.

</TabItem>
<TabItem value="mysql" label="MySQL">

The catalog information of MySQL is automatically resolved by ScalarDB Analytics. The catalog-level objects are mapped as follows:

- The MySQL database is mapped to the namespace. Therefore, the namespace of the MySQL data source is always single level, consisting of only the database name.
  - Only user-defined databases are mapped to namespaces. The following system databases are ignored:
    - `mysql`
    - `sys`
    - `information_schema`
    - `performance_schema`
- The MySQL table is mapped to the table.
- The MySQL column is mapped to the column.

</TabItem>
<TabItem value="oracle" label="Oracle">

The catalog information of Oracle is automatically resolved by ScalarDB Analytics. The catalog-level objects are mapped as follows:

- The Oracle schema is mapped to the namespace. Therefore, the namespace of the Oracle data source is always single level, consisting of only the schema name.
  - Only user-defined schemas are mapped to namespaces. The following system schemas are ignored:
    - `ANONYMOUS`
    - `APPQOSSYS`
    - `AUDSYS`
    - `CTXSYS`
    - `DBSNMP`
    - `DGPDB_INT`
    - `DBSFWUSER`
    - `DVF`
    - `DVSYS`
    - `GGSYS`
    - `GSMADMIN_INTERNAL`
    - `GSMCATUSER`
    - `GSMROOTUSER`
    - `GSMUSER`
    - `LBACSYS`
    - `MDSYS`
    - `OJVMSYS`
    - `ORDDATA`
    - `ORDPLUGINS`
    - `ORDSYS`
    - `OUTLN`
    - `REMOTE_SCHEDULER_AGENT`
    - `SI_INFORMTN_SCHEMA`
    - `SYS`
    - `SYS$UMF`
    - `SYSBACKUP`
    - `SYSDG`
    - `SYSKM`
    - `SYSRAC`
    - `SYSTEM`
    - `WMSYS`
    - `XDB`
    - `DIP`
    - `MDDATA`
    - `ORACLE_OCM`
    - `XS$NULL`
- The Oracle table is mapped to the table.
- The Oracle column is mapped to the column.

</TabItem>
<TabItem value="sqlserver" label="SQL Server">

The catalog information of SQL Server is automatically resolved by ScalarDB Analytics. The catalog-level objects are mapped as follows:

- The SQL Server database and schema are mapped to the namespace together. Therefore, the namespace of the SQL Server data source is always two-level, consisting of the database name and the schema name.
- Only user-defined databases are mapped to namespaces. The following system databases are ignored:
  - `master`
  - `model`
  - `msdb`
  - `tempdb`
- Only user-defined schemas are mapped to namespaces. The following system schemas are ignored:
  - `sys`
  - `guest`
  - `INFORMATION_SCHEMA`
  - `db_accessadmin`
  - `db_backupoperator`
  - `db_datareader`
  - `db_datawriter`
  - `db_ddladmin`
  - `db_denydatareader`
  - `db_denydatawriter`
  - `db_owner`
  - `db_securityadmin`
- The SQL Server table is mapped to the table.
- The SQL Server column is mapped to the column.

</TabItem>
<TabItem value="dynamodb" label="DynamoDB">

Since DynamoDB is schema-less, you need to specify the catalog information explicitly when registering a DynamoDB data source by using JSON in the following format:

```json
{
  "namespaces": [
    {
      "name": "<NAMESPACE_NAME>",
      "tables": [
        {
          "name": "<TABLE_NAME>",
          "columns": [
            {
              "name": "<COLUMN_NAME>",
              "type": "<COLUMN_TYPE>"
            },
            ...
          ]
        },
        ...
      ]
    },
    ...
  ]
}
```

In the specified JSON, you can use any arbitrary namespace names, but the table names must match the table names in DynamoDB, and the column names and types must match the field names and types in DynamoDB.

</TabItem>
</Tabs>

#### Data-type mappings

The native data types of the underlying data sources are mapped to the data types in ScalarDB Analytics. To see the data-type mappings in each data source, select a data source.
<Tabs groupId="data-source" queryString>
<TabItem value="scalardb" label="ScalarDB">

| **ScalarDB Data Type** | **ScalarDB Analytics Data Type** |
|:------------------------------|:---------------------------------|
| `BOOLEAN` | `BOOLEAN` |
| `INT` | `INT` |
| `BIGINT` | `BIGINT` |
| `FLOAT` | `FLOAT` |
| `DOUBLE` | `DOUBLE` |
| `TEXT` | `TEXT` |
| `BLOB` | `BLOB` |
| `DATE` | `DATE` |
| `TIME` | `TIME` |
| `DATETIME` | `DATETIME` |
| `TIMESTAMP` | `TIMESTAMP` |
| `TIMESTAMPTZ` | `TIMESTAMPTZ` |

</TabItem>
<TabItem value="postgresql" label="PostgreSQL">

| **PostgreSQL Data Type** | **ScalarDB Analytics Data Type** |
|:------------------------------|:---------------------------------|
| `integer` | `INT` |
| `bigint` | `BIGINT` |
| `real` | `FLOAT` |
| `double precision` | `DOUBLE` |
| `smallserial` | `SMALLINT` |
| `serial` | `INT` |
| `bigserial` | `BIGINT` |
| `char` | `TEXT` |
| `varchar` | `TEXT` |
| `text` | `TEXT` |
| `bpchar` | `TEXT` |
| `boolean` | `BOOLEAN` |
| `bytea` | `BLOB` |
| `date` | `DATE` |
| `time` | `TIME` |
| `time with time zone` | `TIME` |
| `time without time zone` | `TIME` |
| `timestamp` | `DATETIME` |
| `timestamp with time zone` | `TIMESTAMP` |
| `timestamp without time zone` | `DATETIME` |

</TabItem>
<TabItem value="mysql" label="MySQL">

| **MySQL Data Type** | **ScalarDB Analytics Data Type** |
|:-----------------------|:---------------------------------|
| `bit` | `BOOLEAN` |
| `bit(1)` | `BOOLEAN` |
| `bit(x)` if *x >= 2* | `BLOB` |
| `tinyint` | `SMALLINT` |
| `tinyint(1)` | `BOOLEAN` |
| `boolean` | `BOOLEAN` |
| `smallint` | `SMALLINT` |
| `smallint unsigned` | `INT` |
| `mediumint` | `INT` |
| `mediumint unsigned` | `INT` |
| `int` | `INT` |
| `int unsigned` | `BIGINT` |
| `bigint` | `BIGINT` |
| `float` | `FLOAT` |
| `double` | `DOUBLE` |
| `real` | `DOUBLE` |
| `char` | `TEXT` |
| `varchar` | `TEXT` |
| `text` | `TEXT` |
| `binary` | `BLOB` |
| `varbinary` | `BLOB` |
| `blob` | `BLOB` |
| `date` | `DATE` |
| `time` | `TIME` |
| `datetime` | `DATETIME` |
| `timestamp` | `TIMESTAMP` |

</TabItem>
<TabItem value="oracle" label="Oracle">

| **Oracle Data Type** | **ScalarDB Analytics Data Type** |
|:-----------------------------------|:---------------------------------|
| `NUMBER` if *scale = 0* | `BIGINT` |
| `NUMBER` if *scale > 0* | `DOUBLE` |
| `FLOAT` if *precision ≤ 53* | `DOUBLE` |
| `BINARY_FLOAT` | `FLOAT` |
| `BINARY_DOUBLE` | `DOUBLE` |
| `CHAR` | `TEXT` |
| `NCHAR` | `TEXT` |
| `VARCHAR2` | `TEXT` |
| `NVARCHAR2` | `TEXT` |
| `CLOB` | `TEXT` |
| `NCLOB` | `TEXT` |
| `BLOB` | `BLOB` |
| `BOOLEAN` | `BOOLEAN` |
| `DATE` | `DATE` |
| `TIMESTAMP` | `TIMESTAMP` |
| `TIMESTAMP WITH TIME ZONE` | `TIMESTAMP` |
| `TIMESTAMP WITH LOCAL TIME ZONE` | `DATETIME` |
| `RAW` | `BLOB` |

</TabItem>
<TabItem value="sqlserver" label="SQL Server">

| **SQL Server Data Type** | **ScalarDB Analytics Data Type** |
|:---------------------------|:---------------------------------|
| `bit` | `BOOLEAN` |
| `tinyint` | `SMALLINT` |
| `smallint` | `SMALLINT` |
| `int` | `INT` |
| `bigint` | `BIGINT` |
| `real` | `FLOAT` |
| `float` | `DOUBLE` |
| `float(n)` if *n ≤ 24* | `FLOAT` |
| `float(n)` if *n ≥ 25* | `DOUBLE` |
| `binary` | `BLOB` |
| `varbinary` | `BLOB` |
| `char` | `TEXT` |
| `varchar` | `TEXT` |
| `nchar` | `TEXT` |
| `nvarchar` | `TEXT` |
| `ntext` | `TEXT` |
| `text` | `TEXT` |
| `date` | `DATE` |
| `time` | `TIME` |
| `datetime` | `DATETIME` |
| `datetime2` | `DATETIME` |
| `smalldatetime` | `DATETIME` |
| `datetimeoffset` | `TIMESTAMP` |

</TabItem>
<TabItem value="dynamodb" label="DynamoDB">

| **DynamoDB Data Type** | **ScalarDB Analytics Data Type** |
|:-------------------------|:---------------------------------|
| `Number` | `BYTE` |
| `Number` | `SMALLINT` |
| `Number` | `INT` |
| `Number` | `BIGINT` |
| `Number` | `FLOAT` |
| `Number` | `DOUBLE` |
| `Number` | `DECIMAL` |
| `String` | `TEXT` |
| `Binary` | `BLOB` |
| `Boolean` | `BOOLEAN` |

:::warning

It is important to ensure that the field values of `Number` types are parsable as the specified data type for ScalarDB Analytics. For example, if a column that corresponds to a `Number`-type field is specified as an `INT` type, its value must be an integer. If the value is not an integer, an error will occur when running a query.

:::

</TabItem>
</Tabs>

## Query engine

A query engine is an independent component alongside the universal data catalog. It is responsible for executing queries against the data sources registered in the universal data catalog and returning the results to the user. ScalarDB Analytics does not currently provide a built-in query engine. Instead, it is designed to be integrated with existing query engines, normally provided as a plugin of the query engine.

When you run a query, the ScalarDB Analytics query engine plugin works as follows:

1. Fetches the catalog metadata by calling the universal data catalog API, like the data source location, the table object identifier, and the table schema.
2. Sets up the data source connectors to the data sources by using the catalog metadata.
3. Provides the query optimization information to the query engine based on the catalog metadata.
4. Reads the data from the data sources by using the data source connectors.

ScalarDB Analytics manages these processes internally. You can simply run a query against the universal data catalog by using the query engine API in the same way that you would normally run a query.

ScalarDB Analytics currently supports Apache Spark as its query engine. For details on how to use ScalarDB Analytics with Spark, see [Run Analytical Queries Through ScalarDB Analytics](./development.mdx).

diff --git a/docs/scalardb-analytics/development.mdx b/docs/scalardb-analytics/development.mdx
new file mode 100644
index 00000000..bfafd93d
--- /dev/null
+++ b/docs/scalardb-analytics/development.mdx
@@ -0,0 +1,438 @@

---
tags:
  - Enterprise Option
  - Public Preview
displayed_sidebar: docsEnglish
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Run Analytical Queries Through ScalarDB Analytics

This guide explains how to develop ScalarDB Analytics applications. For details on the architecture and design, see [ScalarDB Analytics Design](design.mdx).

ScalarDB Analytics currently uses Spark as an execution engine and provides a Spark custom catalog plugin to provide a unified view of ScalarDB-managed and non-ScalarDB-managed data sources as Spark tables. This allows you to execute arbitrary Spark SQL queries seamlessly.

## Preparation

This section describes the prerequisites, setting up ScalarDB Analytics in the Spark configuration, and adding the ScalarDB Analytics dependency.

### Prerequisites

ScalarDB Analytics works with Apache Spark 3.4 or later. If you don't have Spark installed yet, please download the Spark distribution from [Apache's website](https://spark.apache.org/downloads.html).

:::note

Apache Spark is built with either Scala 2.12 or Scala 2.13. ScalarDB Analytics supports both versions.
You need to be sure which version you are using so that you can select the correct version of ScalarDB Analytics later. You can refer to [Version Compatibility](#version-compatibility) for more details.

:::

### Set up ScalarDB Analytics in the Spark configuration

The following sections describe all available configuration options for ScalarDB Analytics. These configurations control:

- How ScalarDB Analytics integrates with Spark
- How data sources are connected and accessed
- How license information is provided

For example configurations in a practical scenario, see [the sample application configuration](../scalardb-samples/scalardb-analytics-spark-sample/README.mdx#scalardb-analytics-configuration).

#### Spark plugin configurations

| Configuration Key | Required | Description |
|:-----------------|:---------|:------------|
| `spark.jars.packages` | No | A comma-separated list of Maven coordinates for the required dependencies. You need to include the ScalarDB Analytics package you are using; otherwise, specify it as a command-line argument when running the Spark application. For details about the Maven coordinates of ScalarDB Analytics, refer to [Add the ScalarDB Analytics dependency](#add-the-scalardb-analytics-dependency). |
| `spark.sql.extensions` | Yes | Must be set to `com.scalar.db.analytics.spark.Extensions` |
| `spark.sql.catalog.<CATALOG_NAME>` | Yes | Must be set to `com.scalar.db.analytics.spark.ScalarCatalog` |

You can specify any name for `<CATALOG_NAME>`. Be sure to use the same catalog name throughout your configuration.

#### License configurations

| Configuration Key | Required | Description |
| :--------------------------------------------------- | :------- | :---------------------------------------------------------------------------------------------------------------------------- |
| `spark.sql.catalog.<CATALOG_NAME>.license.key` | Yes | A JSON string of the license key for ScalarDB Analytics |
| `spark.sql.catalog.<CATALOG_NAME>.license.cert_pem` | Yes | A string of the PEM-encoded certificate of the ScalarDB Analytics license. Either `cert_pem` or `cert_path` must be set. |
| `spark.sql.catalog.<CATALOG_NAME>.license.cert_path` | Yes | A path to the PEM-encoded certificate of the ScalarDB Analytics license. Either `cert_pem` or `cert_path` must be set. |

#### Data source configurations

ScalarDB Analytics supports multiple types of data sources. Each type requires specific configuration parameters:

<Tabs groupId="data-source" queryString>
<TabItem value="scalardb" label="ScalarDB">

:::note

ScalarDB Analytics supports ScalarDB as a data source. This table describes how to configure ScalarDB as a data source.

:::

| Configuration Key | Required | Description |
| :---------------------------------------------------------------------------- | :------- | :---------------------------------------------- |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `scalardb` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.config_path` | Yes | The path to the configuration file for ScalarDB |

:::tip

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.
:::

</TabItem>
<TabItem value="mysql" label="MySQL">

| Configuration Key | Required | Description |
| :------------------------------------------------------------------------- | :------- | :------------------------------------- |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `mysql` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.host` | Yes | The host name of the MySQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.port` | Yes | The port number of the MySQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.username` | Yes | The username of the MySQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.password` | Yes | The password of the MySQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.database` | No | The name of the database to connect to |

:::tip

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.

:::

</TabItem>
<TabItem value="postgresql" label="PostgreSQL">

| Configuration Key | Required | Description |
| :------------------------------------------------------------------------- | :------- | :--------------------------------------- |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `postgresql` or `postgres` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.host` | Yes | The host name of the PostgreSQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.port` | Yes | The port number of the PostgreSQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.username` | Yes | The username of the PostgreSQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.password` | Yes | The password of the PostgreSQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.database` | Yes | The name of the database to connect to |

:::tip

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.

:::

</TabItem>
<TabItem value="oracle" label="Oracle">

| Configuration Key | Required | Description |
| :----------------------------------------------------------------------------- | :------- | :------------------------------------ |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `oracle` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.host` | Yes | The host name of the Oracle server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.port` | Yes | The port number of the Oracle server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.username` | Yes | The username of the Oracle server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.password` | Yes | The password of the Oracle server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.service_name` | Yes | The service name of the Oracle server |

:::tip

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.

:::

</TabItem>
<TabItem value="sqlserver" label="SQL Server">

| Configuration Key | Required | Description |
| :------------------------------------------------------------------------- | :------- | :----------------------------------------------------------------------------------------------------- |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `sqlserver` or `mssql` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.host` | Yes | The host name of the SQL Server server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.port` | Yes | The port number of the SQL Server server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.username` | Yes | The username of the SQL Server server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.password` | Yes | The password of the SQL Server server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.database` | No | The name of the database to connect to |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.secure` | No | Whether to use a secure connection to the SQL Server server. Set to `true` to use a secure connection. |

:::tip

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.
:::

</TabItem>
</Tabs>

#### Example configuration

Below is an example configuration for ScalarDB Analytics that demonstrates how to set up a catalog named `scalardb` with multiple data sources:

```conf
# Spark plugin configurations
spark.jars.packages com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>
spark.sql.extensions com.scalar.db.analytics.spark.Extensions
spark.sql.catalog.scalardb com.scalar.db.analytics.spark.ScalarCatalog

# License configurations
spark.sql.catalog.scalardb.license.key <YOUR_LICENSE_KEY>
spark.sql.catalog.scalardb.license.cert_pem <YOUR_LICENSE_CERT_PEM>

# Data source configurations
spark.sql.catalog.scalardb.data_source.scalardb.type scalardb
spark.sql.catalog.scalardb.data_source.scalardb.config_path /path/to/scalardb.properties

spark.sql.catalog.scalardb.data_source.mysql_source.type mysql
spark.sql.catalog.scalardb.data_source.mysql_source.host localhost
spark.sql.catalog.scalardb.data_source.mysql_source.port 3306
spark.sql.catalog.scalardb.data_source.mysql_source.username root
spark.sql.catalog.scalardb.data_source.mysql_source.password password
spark.sql.catalog.scalardb.data_source.mysql_source.database mydb
```

The following describes what you should change the content in the angle brackets to:

- `<YOUR_LICENSE_KEY>`: The license key for ScalarDB Analytics
- `<YOUR_LICENSE_CERT_PEM>`: The PEM-encoded certificate of the ScalarDB Analytics license
- `<SPARK_VERSION>`: The major and minor version of Spark you are using (such as 3.4)
- `<SCALA_VERSION>`: The major and minor version of Scala that matches your Spark installation (such as 2.12 or 2.13)
- `<SCALARDB_ANALYTICS_VERSION>`: The version of ScalarDB Analytics

### Add the ScalarDB Analytics dependency

ScalarDB Analytics is hosted in the Maven Central Repository. The name of the package is `scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>`, where:

- `<SPARK_VERSION>`: The major and minor version of Spark you are using (such as 3.4)
- `<SCALA_VERSION>`: The major and minor version of Scala that matches your Spark installation (such as 2.12 or 2.13)
- `<SCALARDB_ANALYTICS_VERSION>`: The version of ScalarDB Analytics

For details about version compatibility, refer to [Version Compatibility](#version-compatibility).

You can add this dependency to your project by configuring the build settings of your project. For example, if you are using Gradle, you can add the following to your `build.gradle` file:

```groovy
dependencies {
    implementation 'com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>'
}
```

:::note

If you want to bundle your application in a single fat JAR file by using plugins like the Gradle Shadow plugin or the Maven Shade plugin, you need to exclude ScalarDB Analytics from the fat JAR file by choosing the appropriate configuration, such as `provided` or `shadow`, depending on the plugin you are using.

:::

## Develop a Spark application

In this section, you will learn how to develop a Spark application that uses ScalarDB Analytics in Java.

There are three ways to develop Spark applications with ScalarDB Analytics:

1. **Spark driver application**: A traditional Spark application that runs within the cluster
2. **Spark Connect application**: A remote application that uses the Spark Connect protocol
3. **JDBC application**: A remote application that uses the JDBC interface

:::note

Depending on your environment, you may not be able to use all of the methods mentioned above. For details about supported features and deployment options, refer to [Supported managed Spark services and their application types](deployment.mdx#supported-managed-spark-services-and-their-application-types).
:::

With all of these methods, you can refer to tables in ScalarDB Analytics by using the same table identifier format. For details about how ScalarDB Analytics maps catalog information from data sources, refer to [Catalog information mappings by data source](design.mdx#catalog-information-mappings-by-data-source).

<Tabs groupId="spark-application-type" queryString>
<TabItem value="spark-driver" label="Spark driver">

<h3>Spark Driver application</h3>

You can use a commonly used `SparkSession` class for ScalarDB Analytics. Additionally, you can use any type of cluster deployment that Spark supports, such as YARN, Kubernetes, standalone, or local mode.

To read data from tables in ScalarDB Analytics, you can use the `spark.sql` or `spark.read.table` function in the same way as when reading a normal Spark table.

First, you need to set up your Java project. For example, if you are using Gradle, you can add the following to your `build.gradle` file:

```groovy
dependencies {
    implementation 'com.scalar-labs:scalardb-analytics-spark-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>'
}
```

Below is an example of a Spark Driver application:

```java
import org.apache.spark.sql.SparkSession;

public class MyApp {
  public static void main(String[] args) {
    // Create a SparkSession
    try (SparkSession spark = SparkSession.builder().getOrCreate()) {
      // Read data from a table in ScalarDB Analytics
      spark.sql("SELECT * FROM my_catalog.my_data_source.my_namespace.my_table").show();
    }
  }
}
```

Then, you can build and run your application by using the `spark-submit` command.

:::note

You may need to build a fat JAR file for your application, as is usual for normal Spark applications.

:::

```console
spark-submit --class MyApp --master local[*] my-spark-application-all.jar
```

:::tip

You can also use other CLI tools that Spark provides, such as `spark-sql` and `spark-shell`, to interact with ScalarDB Analytics tables.

:::

</TabItem>
<TabItem value="spark-connect" label="Spark Connect">

<h3>Spark Connect application</h3>

You can use [Spark Connect](https://spark.apache.org/spark-connect/) to interact with ScalarDB Analytics. By using Spark Connect, you can access a remote Spark cluster and read data in the same way as a Spark driver application. The following briefly describes how to use Spark Connect.

First, you need to start a Spark Connect server in the remote Spark cluster by running the following command:

```console
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_<SCALA_VERSION>:<SPARK_FULL_VERSION>,com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>
```

The following describes what you should change the content in the angle brackets to:

- `<SCALA_VERSION>`: The major and minor version of Scala that matches your Spark installation (such as 2.12 or 2.13)
- `<SPARK_FULL_VERSION>`: The full version of Spark you are using (such as 3.5.3)
- `<SPARK_VERSION>`: The major and minor version of Spark you are using (such as 3.5)
- `<SCALARDB_ANALYTICS_VERSION>`: The version of ScalarDB Analytics

:::note

The versions of the packages must match the versions of Spark and ScalarDB Analytics that you are using.

:::

You also need to include the Spark Connect client package in your application. For example, if you are using Gradle, you can add the following to your `build.gradle` file:

```kotlin
implementation("org.apache.spark:spark-connect-client-jvm_2.12:3.5.3")
```

Then, you can write a Spark Connect client application to connect to the server and read data.
```java
import org.apache.spark.sql.SparkSession;

public class MyApp {
  public static void main(String[] args) {
    try (SparkSession spark = SparkSession.builder()
        .remote("sc://<CONNECT_SERVER_HOSTNAME>:<PORT>")
        .getOrCreate()) {

      // Read data from a table in ScalarDB Analytics
      spark.sql("SELECT * FROM my_catalog.my_data_source.my_namespace.my_table").show();
    }
  }
}
```

You can run your Spark Connect client application as a normal Java application by running the following command:

```console
java -jar my-spark-connect-client.jar
```

For details about how you can use Spark Connect, refer to the [Spark Connect documentation](https://spark.apache.org/docs/latest/spark-connect-overview.html).

</TabItem>
<TabItem value="jdbc" label="JDBC">

<h3>JDBC application</h3>

Unfortunately, the Spark Thrift JDBC server does not support the Spark features that are necessary for ScalarDB Analytics, so you cannot use JDBC to read data from ScalarDB Analytics in your Apache Spark environment. JDBC applications are mentioned here because some managed Spark services provide different ways to interact with a Spark cluster via the JDBC interface. For more details, refer to [Supported application types](deployment.mdx#supported-managed-spark-services-and-their-application-types).

</TabItem>
</Tabs>

## Catalog information mapping

ScalarDB Analytics manages its own catalog, containing data sources, namespaces, tables, and columns. That information is automatically mapped to the Spark catalog. In this section, you will learn how ScalarDB Analytics maps its catalog information to the Spark catalog.

For details about how information in the raw data sources is mapped to the ScalarDB Analytics catalog, refer to [Catalog information mappings by data source](design.mdx#catalog-information-mappings-by-data-source).

### Catalog level mapping

Each catalog-level object in the ScalarDB Analytics catalog is mapped to a Spark catalog object. The following sections show how the catalog levels are mapped.

#### Data source tables

Tables from data sources in the ScalarDB Analytics catalog are mapped to Spark tables. The following format is used to represent the identity of the Spark tables that correspond to ScalarDB Analytics tables:

```console
<CATALOG_NAME>.<DATA_SOURCE_NAME>.<NAMESPACE_NAMES>.<TABLE_NAME>
```

The following describes what you should change the content in the angle brackets to:

- `<CATALOG_NAME>`: The name of the catalog.
- `<DATA_SOURCE_NAME>`: The name of the data source.
- `<NAMESPACE_NAMES>`: The names of the namespaces. If the namespace names are multi-level, they are concatenated with a dot (`.`) as the separator.
- `<TABLE_NAME>`: The name of the table.

For example, if you have a ScalarDB catalog named `my_catalog` that contains a data source named `my_data_source` and a schema named `my_schema`, you can refer to the table named `my_table` in that schema as `my_catalog.my_data_source.my_schema.my_table`.

#### Views

Views in ScalarDB Analytics are provided as tables in the Spark catalog, not views. The following format is used to represent the identity of the Spark tables that correspond to ScalarDB Analytics views:

```console
<CATALOG_NAME>.view.<VIEW_NAMESPACE_NAMES>.<VIEW_NAME>
```

The following describes what you should change the content in the angle brackets to:

- `<CATALOG_NAME>`: The name of the catalog.
- `<VIEW_NAMESPACE_NAMES>`: The names of the view namespaces. If the view namespace names are multi-level, they are concatenated with a dot (`.`) as the separator.
- `<VIEW_NAME>`: The name of the view.

For example, if you have a ScalarDB catalog named `my_catalog` and a view namespace named `my_view_namespace`, you can refer to the view named `my_view` in that namespace as `my_catalog.view.my_view_namespace.my_view`.
:::note

`view` is prefixed to avoid conflicts with the data source table identifiers.

:::

##### WAL-interpreted views

As explained in [ScalarDB Analytics Design](design.mdx), ScalarDB Analytics provides a functionality called WAL-interpreted views, which is a special type of view. These views are automatically created for tables of ScalarDB data sources to provide a user-friendly view of the data by interpreting the WAL metadata in the tables.

Since the data source name and the namespace names of the original ScalarDB tables are used as the view namespace names for WAL-interpreted views, if you have a ScalarDB table named `my_table` in a namespace named `my_namespace` of a data source named `my_data_source`, you can refer to the WAL-interpreted view of the table as `my_catalog.view.my_data_source.my_namespace.my_table`.

### Data-type mapping

ScalarDB Analytics maps data types in its catalog to Spark data types. The following table shows how the data types are mapped:

| ScalarDB Data Type | Spark Data Type    |
| :----------------- | :----------------- |
| `BYTE`             | `Byte`             |
| `SMALLINT`         | `Short`            |
| `INT`              | `Integer`          |
| `BIGINT`           | `Long`             |
| `FLOAT`            | `Float`            |
| `DOUBLE`           | `Double`           |
| `DECIMAL`          | `Decimal`          |
| `TEXT`             | `String`           |
| `BLOB`             | `Binary`           |
| `BOOLEAN`          | `Boolean`          |
| `DATE`             | `Date`             |
| `TIME`             | `TimestampNTZ`     |
| `DATETIME`         | `TimestampNTZ`     |
| `TIMESTAMP`        | `Timestamp`        |
| `DURATION`         | `CalendarInterval` |
| `INTERVAL`         | `CalendarInterval` |

## Version compatibility

Since Spark and Scala may be incompatible among different minor versions, ScalarDB Analytics offers different artifacts for various Spark and Scala versions, named in the format `scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>`. Make sure that you select the artifact matching the Spark and Scala versions you're using. For example, if you're using Spark 3.5 with Scala 2.13, you must specify `scalardb-analytics-spark-all-3.5_2.13`.

Regarding the Java version, ScalarDB Analytics supports Java 8 or later.

The following is a list of the Spark and Scala versions supported by each version of ScalarDB Analytics.
+ +| ScalarDB Analytics Version | ScalarDB Version | Spark Versions Supported | Scala Versions Supported | Minimum Java Version | +|:---------------------------|:-----------------|:-------------------------|:-------------------------|:---------------------| +| 3.15 | 3.15 | 3.5, 3.4 | 2.13, 2.12 | 8 | +| 3.14 | 3.14 | 3.5, 3.4 | 2.13, 2.12 | 8 | +| 3.12 | 3.12 | 3.5, 3.4 | 2.13, 2.12 | 8 | diff --git a/docs/scalardb-analytics/version-compatibility.mdx b/docs/scalardb-analytics/version-compatibility.mdx new file mode 100644 index 00000000..4107ceee --- /dev/null +++ b/docs/scalardb-analytics/version-compatibility.mdx @@ -0,0 +1,5 @@ +--- +tags: + - Enterprise Option +displayed_sidebar: docsEnglish +--- diff --git a/docusaurus.config.js b/docusaurus.config.js index 7b88d299..8c89ffb3 100644 --- a/docusaurus.config.js +++ b/docusaurus.config.js @@ -178,6 +178,10 @@ const config = { to: '/docs/latest/releases/release-support-policy', from: '/docs/releases/release-support-policy', }, + { + to: '/docs/latest/scalardb-analytics/development#version-compatibility', + from: '/docs/latest/scalardb-analytics-spark/version-compatibility', + }, { to: '/docs/3.13/run-non-transactional-storage-operations-through-primitive-crud-interface', from: '/docs/3.13/storage-abstraction', diff --git a/i18n/versioned_docs/ja-jp/docusaurus-plugin-content-docs/current/scalardb-analytics-spark/README.mdx b/i18n/versioned_docs/ja-jp/docusaurus-plugin-content-docs/current/scalardb-analytics/README.mdx similarity index 100% rename from i18n/versioned_docs/ja-jp/docusaurus-plugin-content-docs/current/scalardb-analytics-spark/README.mdx rename to i18n/versioned_docs/ja-jp/docusaurus-plugin-content-docs/current/scalardb-analytics/README.mdx diff --git a/i18n/versioned_docs/ja-jp/docusaurus-plugin-content-docs/current/scalardb-analytics-spark/version-compatibility.mdx b/i18n/versioned_docs/ja-jp/docusaurus-plugin-content-docs/current/scalardb-analytics/version-compatibility.mdx similarity index 100% rename from i18n/versioned_docs/ja-jp/docusaurus-plugin-content-docs/current/scalardb-analytics-spark/version-compatibility.mdx rename to i18n/versioned_docs/ja-jp/docusaurus-plugin-content-docs/current/scalardb-analytics/version-compatibility.mdx diff --git a/sidebars.js b/sidebars.js index ec21d643..69aa3851 100644 --- a/sidebars.js +++ b/sidebars.js @@ -173,6 +173,11 @@ const sidebars = { id: 'scalardb-cluster/getting-started-with-using-python-for-scalardb-cluster', label: 'Use Python for ScalarDB Cluster', }, + { + type: 'doc', + id: 'scalardb-analytics/design', + label: 'ScalarDB Analytics Design', + }, { type: 'doc', id: 'scalardb-analytics-postgresql/installation', @@ -261,22 +266,6 @@ const sidebars = { }, ], }, - // { - // type: 'category', - // label: 'Run Analytical Queries', - // collapsible: true, - // link: { - // type: 'doc', - // id: 'develop-run-analytical-queries-overview', - // }, - // items: [ - // { - // type: 'doc', - // id: '', - // label: '