---
tags:
  - Enterprise Option
  - Public Preview
---

# Getting Started with ScalarDB Analytics with Spark

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This guide explains how to get started with ScalarDB Analytics with Spark.

## Prerequisites

Before you can run queries with ScalarDB Analytics with Spark, you'll need to set up ScalarDB tables and install Apache Spark.

### Set up ScalarDB tables

To use ScalarDB Analytics with Spark, you need at least one underlying database in ScalarDB to run analytical queries on. If you have your own underlying database set up in ScalarDB, you can skip this section and use your database instead.

If you don't have your own database set up yet, you can set up ScalarDB with a sample underlying database by following the instructions in [Run Analytical Queries on Sample Data by Using ScalarDB Analytics with Spark](../scalardb-samples/scalardb-analytics-spark-sample/README.mdx).
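
Whichever database you use, ScalarDB Analytics with Spark reads the connection settings from a ScalarDB properties file, which you'll pass to the helper method later in this guide. As a minimal sketch, a configuration for a single PostgreSQL backend (an assumed setup; your storage type and connection details will differ) might look like this:

```properties
# A minimal sketch of a ScalarDB properties file, assuming a single
# PostgreSQL backend. Adjust the storage type and connection details
# to match your own environment.
scalar.db.storage=jdbc
scalar.db.contact_points=jdbc:postgresql://localhost:5432/sampledb
scalar.db.username=postgres
scalar.db.password=postgres
```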

### Install Apache Spark

You also need a packaged release of Apache Spark. If you already have Spark installed, you can skip this section.

If you need Spark, you can download it from the [Spark website](https://spark.apache.org/downloads.html). After downloading the compressed Spark file, you'll need to uncompress the file by running the following command, replacing `X.X.X` with the version of Spark that you downloaded:

```console
tar xf spark-X.X.X-bin-hadoop3.tgz
```

Then, enter the directory by running the following command, again replacing `X.X.X` with the version of Spark that you downloaded:

```console
cd spark-X.X.X-bin-hadoop3
```
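
You can optionally confirm that the distribution works by printing its version. The output also shows the Scala version that the distribution was built with, which you'll need in the next step:

```console
./bin/spark-submit --version
```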

## Configure the Spark shell

The following explains how to perform interactive analysis by using the Spark shell.

ScalarDB Analytics with Spark is available on the Maven Central Repository, so you can enable it in the Spark shell by using the `--packages` option, replacing `<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_WITH_SPARK_VERSION>` with the versions that you're using:

```console
./bin/spark-shell --packages com.scalar-labs:scalardb-analytics-spark-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_WITH_SPARK_VERSION>
```

:::warning

ScalarDB Analytics with Spark offers different artifacts for various Spark and Scala versions, provided in the format `scalardb-analytics-spark-<SPARK_VERSION>_<SCALA_VERSION>`. Make sure that you select the artifact matching the Spark and Scala versions you're using.

For reference, see [Version Compatibility of ScalarDB Analytics with Spark](version-compatibility.mdx).

:::
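
For illustration only, assuming Spark 3.5 built with Scala 2.12 and a hypothetical ScalarDB Analytics with Spark version 1.0.0 (check the version compatibility page for the combinations that actually apply to you), the command would look like this:

```console
./bin/spark-shell --packages com.scalar-labs:scalardb-analytics-spark-3.5_2.12:1.0.0
```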

Next, you'll need to configure the ScalarDB Analytics with Spark environment in the shell. ScalarDB Analytics with Spark provides a helper method for this purpose, which gets everything set up to run analytical queries for you:

```scala
spark-shell> import com.scalar.db.analytics.spark.implicits._
spark-shell> spark.setupScalarDbAnalytics(
           |   // ScalarDB config file
           |   configPath = "/<PATH_TO_YOUR_SCALARDB_PROPERTIES>/config.properties",
           |   // Namespaces in ScalarDB to import
           |   namespaces = Set("<YOUR_NAMESPACE_NAME_1>", "<YOUR_NAMESPACE_NAME_2>"),
           |   // License information
           |   license = License.certPath("""{"your":"license", "key":"in", "json":"format"}""", "/<PATH_TO_YOUR_LICENSE>/cert.pem")
           | )
```

Now, you can read data from the tables in the underlying databases of ScalarDB and run arbitrary analytical queries through the Spark Dataset API. For example:

```console
spark-shell> spark.sql("select * from <YOUR_NAMESPACE_NAME_1>.<YOUR_TABLE_NAME>").show()
```
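
Because the imported tables are registered in Spark's catalog, you can also work with them through the DataFrame API instead of SQL. As a sketch, with the table and the grouping column as illustrative placeholders:

```console
spark-shell> spark.table("<YOUR_NAMESPACE_NAME_1>.<YOUR_TABLE_NAME>").groupBy("<YOUR_COLUMN_NAME>").count().show()
```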

## Implement and submit a Spark application

This section explains how to implement a Spark application with ScalarDB Analytics with Spark and submit it to the Spark cluster.

You can integrate ScalarDB Analytics with Spark into your application by using a build tool like Gradle or SBT.

<Tabs groupId="implementation" queryString>
  <TabItem value="gradle" label="Gradle (Kotlin)" default>
    For Gradle projects that use the Kotlin DSL, add the following to your `build.gradle.kts` file, replacing `<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_WITH_SPARK_VERSION>` with the versions that you're using:

    ```kotlin
    implementation("com.scalar-labs:scalardb-analytics-spark-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_WITH_SPARK_VERSION>")
    ```
  </TabItem>
  <TabItem value="gradle-groovy" label="Gradle (Groovy)">
    To configure Gradle by using Groovy, add the following to your `build.gradle` file, replacing `<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_WITH_SPARK_VERSION>` with the versions that you're using:

    ```groovy
    implementation 'com.scalar-labs:scalardb-analytics-spark-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_WITH_SPARK_VERSION>'
    ```
  </TabItem>
  <TabItem value="sbt" label="SBT">
    To add the dependency to an SBT project, insert the following into your `build.sbt` file, replacing `<SPARK_VERSION>` and `<SCALARDB_ANALYTICS_WITH_SPARK_VERSION>` with the versions that you're using (with `%%`, SBT appends the Scala version to the artifact name automatically):

    ```scala
    libraryDependencies += "com.scalar-labs" %% "scalardb-analytics-spark-<SPARK_VERSION>" % "<SCALARDB_ANALYTICS_WITH_SPARK_VERSION>"
    ```
  </TabItem>
</Tabs>
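
Your application also needs the Spark libraries themselves at compile time. Since the cluster supplies Spark at runtime when you submit the application, these are typically declared with the `provided` scope. As a sketch in SBT, with `<SPARK_FULL_VERSION>` as a placeholder for the full Spark version you're targeting:

```scala
// Spark itself is on the cluster's classpath at runtime, so mark it "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "<SPARK_FULL_VERSION>" % "provided"
```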

After integrating ScalarDB Analytics with Spark into your application, you can use the same helper method explained above to configure ScalarDB Analytics with Spark in your Spark application.

<Tabs groupId="helper_method" queryString>
  <TabItem value="Scala" label="Scala" default>
    The following is a sample application that uses Scala:

    ```scala
    import org.apache.spark.sql.SparkSession
    import com.scalar.db.analytics.spark.implicits._

    object YourApp {
      def main(args: Array[String]): Unit = {
        // Initialize SparkSession as usual
        val spark = SparkSession.builder.appName("<YOUR_APPLICATION_NAME>").getOrCreate()
        // Set up ScalarDB Analytics with Spark via the helper method
        spark.setupScalarDbAnalytics(
          // ScalarDB config file
          configPath = "/<PATH_TO_YOUR_SCALARDB_PROPERTIES>/config.properties",
          // Namespaces in ScalarDB to import
          namespaces = Set("<YOUR_NAMESPACE_NAME_1>", "<YOUR_NAMESPACE_NAME_2>"),
          // License information
          license = License.certPath("""{"your":"license", "key":"in", "json":"format"}""", "/<PATH_TO_YOUR_LICENSE>/cert.pem")
        )
        // Run arbitrary queries
        spark.sql("select * from <YOUR_NAMESPACE_NAME_1>.<YOUR_TABLE_NAME>").show()
        // Stop the SparkSession
        spark.stop()
      }
    }
    ```
  </TabItem>
  <TabItem value="Java" label="Java">
    You can write a Spark application with ScalarDB Analytics with Spark in Java:

    ```java
    import org.apache.spark.sql.SparkSession;
    import com.scalar.db.analytics.spark.ScalarDbAnalyticsInitializer;

    public class YourApp {
      public static void main(String[] args) {
        // Initialize SparkSession as usual
        SparkSession spark = SparkSession.builder().appName("<YOUR_APPLICATION_NAME>").getOrCreate();
        // Set up ScalarDB Analytics with Spark via the helper class
        ScalarDbAnalyticsInitializer
          .builder()
          .spark(spark)
          .configPath("/<PATH_TO_YOUR_SCALARDB_PROPERTIES>/config.properties")
          .namespace("<YOUR_NAMESPACE_NAME_1>")
          .namespace("<YOUR_NAMESPACE_NAME_2>")
          .licenseKey("{\"your\":\"license\", \"key\":\"in\", \"json\":\"format\"}")
          .licenseCertPath("/<PATH_TO_YOUR_LICENSE>/cert.pem")
          .build()
          .run();
        // Run arbitrary queries
        spark.sql("select * from <YOUR_NAMESPACE_NAME_1>.<YOUR_TABLE_NAME>").show();
        // Stop the SparkSession
        spark.stop();
      }
    }
    ```
  </TabItem>
</Tabs>

Then, build a .jar file by using your preferred build tool, for example with `sbt package` or `./gradlew assemble`.

After building the .jar file, you can submit it to your Spark cluster with `spark-submit`, using the `--packages` option to make the ScalarDB Analytics with Spark library available on the cluster. Run the following command, replacing `<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_WITH_SPARK_VERSION>` with the versions that you're using:

```console
./bin/spark-submit \
  --class "YourApp" \
  --packages com.scalar-labs:scalardb-analytics-spark-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_WITH_SPARK_VERSION> \
  <YOUR_APP_NAME>.jar
```

For more information about general Spark application development, see the [Apache Spark documentation](https://spark.apache.org/docs/latest/).