Commit aceb649

Merge pull request #4142 from ClickHouse/clickhouse-glue-connector: ClickHouse Glue Connector

2 parents ed806a5 + 8c0e857

File tree: 5 files changed, +180 −50 lines


docs/integrations/data-ingestion/aws-glue/index.md (126 additions, 48 deletions)
````diff
@@ -3,58 +3,125 @@ sidebar_label: 'Amazon Glue'
 sidebar_position: 1
 slug: /integrations/glue
 description: 'Integrate ClickHouse and Amazon Glue'
-keywords: ['clickhouse', 'amazon', 'aws', 'glue', 'migrating', 'data']
-title: 'Integrating Amazon Glue with ClickHouse'
+keywords: ['clickhouse', 'amazon', 'aws', 'glue', 'migrating', 'data', 'spark']
+title: 'Integrating Amazon Glue with ClickHouse and Spark'
 ---
 
+import Image from '@theme/IdealImage';
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
+import notebook_connections_config from '@site/static/images/integrations/data-ingestion/aws-glue/notebook-connections-config.png';
+import dependent_jars_path_option from '@site/static/images/integrations/data-ingestion/aws-glue/dependent_jars_path_option.png';
 
-# Integrating Amazon Glue with ClickHouse
+# Integrating Amazon Glue with ClickHouse and Spark
 
 [Amazon Glue](https://aws.amazon.com/glue/) is a fully managed, serverless data integration service provided by Amazon Web Services (AWS). It simplifies the process of discovering, preparing, and transforming data for analytics, machine learning, and application development.
 
-Although there is no Glue ClickHouse connector available yet, the official JDBC connector can be leveraged to connect and integrate with ClickHouse:
+## Installation {#installation}
+
+To integrate your Glue code with ClickHouse, you can use our official Spark connector in Glue in one of two ways:
+- Installing the ClickHouse Glue connector from the AWS Marketplace (recommended).
+- Manually adding the Spark connector's JARs to your Glue job.
 
 <Tabs>
-<TabItem value="Java" label="Java" default>
+<TabItem value="AWS Marketplace" label="AWS Marketplace" default>
+
+1. <h3 id="subscribe-to-the-connector">Subscribe to the Connector</h3>
+To access the connector in your account, subscribe to the ClickHouse AWS Glue Connector from AWS Marketplace.
+
+2. <h3 id="grant-required-permissions">Grant Required Permissions</h3>
+Ensure your Glue job's IAM role has the necessary permissions, as described in the minimum privileges [guide](https://docs.aws.amazon.com/glue/latest/dg/getting-started-min-privs-job.html#getting-started-min-privs-connectors).
+
+3. <h3 id="activate-the-connector">Activate the Connector & Create a Connection</h3>
+You can activate the connector and create a connection directly by clicking [this link](https://console.aws.amazon.com/gluestudio/home#/connector/add-connection?connectorName="ClickHouse%20AWS%20Glue%20Connector"&connectorType="Spark"&connectorUrl=https://709825985650.dkr.ecr.us-east-1.amazonaws.com/clickhouse/clickhouse-glue:0.1&connectorClassName="com.clickhouse.spark.ClickHouseCatalog"), which opens the Glue connection creation page with key fields pre-filled. Give the connection a name and press create (there is no need to provide the ClickHouse connection details at this stage).
+
+4. <h3 id="use-in-glue-job">Use in Glue Job</h3>
+In your Glue job, select the `Job details` tab and expand the `Advanced properties` section. Under the `Connections` section, select the connection you just created. The connector automatically injects the required JARs into the job runtime.
+
+<Image img={notebook_connections_config} size='md' alt='Glue Notebook connections config' />
+
+:::note
+The JARs used in the Glue connector are built for `Spark 3.2`, `Scala 2`, and `Python 3`. Make sure to select these versions when configuring your Glue job.
+:::
+
+</TabItem>
````
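For step 2 of the Marketplace flow, granting the role access can also be scripted. Below is a minimal sketch using boto3, assuming a role named `MyGlueJobRole`; Marketplace-hosted Glue connectors are delivered as container images in ECR, so the sketch grants ECR pull access, but the exact action list is an assumption — treat the minimum-privileges guide linked above as authoritative.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical inline policy: pull access to the connector image in ECR.
# Verify the required actions against the minimum-privileges guide.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ecr:GetAuthorizationToken",
            "ecr:BatchGetImage",
            "ecr:GetDownloadUrlForLayer",
        ],
        "Resource": "*",
    }],
}

iam.put_role_policy(
    RoleName="MyGlueJobRole",  # assumption: your Glue job's IAM role
    PolicyName="clickhouse-glue-connector-ecr-access",
    PolicyDocument=json.dumps(policy),
)
```

The same hunk continues with the manual-installation tab: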
<TabItem value="Manual Installation" label="Manual Installation">
49+
To add the required jars manually, please follow the following:
50+
1. Upload the following jars to an S3 bucket - `clickhouse-jdbc-0.6.X-all.jar` and `clickhouse-spark-runtime-3.X_2.X-0.8.X.jar`.
51+
2. Make sure the Glue job has access to this bucket.
52+
3. Under the `Job details` tab, scroll down and expend the `Advanced properties` drop down, and fill the jars path in `Dependent JARs path`:
53+
54+
<Image img={dependent_jars_path_option} size='md' alt='Glue Notebook JAR path options' />
55+
56+
</TabItem>
57+
</Tabs>
58+
59+
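The manual route can likewise be scripted end to end. A sketch with boto3 follows, where the bucket name, role, script location, JAR file names, and Glue version are placeholders; `--extra-jars` is Glue's special job parameter behind the `Dependent JARs path` field.

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

BUCKET = "my-glue-assets"  # assumption: a bucket the Glue job role can read
JARS = [
    "clickhouse-jdbc-0.6.X-all.jar",              # placeholder versions,
    "clickhouse-spark-runtime-3.X_2.X-0.8.X.jar",  # as in the docs above
]

# Step 1: upload the connector JARs to S3.
for jar in JARS:
    s3.upload_file(jar, BUCKET, f"jars/{jar}")

# Step 3: register the JARs with the job via --extra-jars.
glue.create_job(
    Name="clickhouse-example-job",
    Role="MyGlueJobRole",  # assumption: IAM role with access to BUCKET
    Command={
        "Name": "glueetl",
        "ScriptLocation": f"s3://{BUCKET}/scripts/job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",  # assumption: pick the version matching your JARs' Spark build
    DefaultArguments={
        "--extra-jars": ",".join(f"s3://{BUCKET}/jars/{jar}" for jar in JARS),
    },
)
```

The hunk then moves on to the examples section: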
````diff
+## Examples {#example}
+<Tabs>
+<TabItem value="Scala" label="Scala" default>
 
 ```scala
-import com.amazonaws.services.glue.util.Job
-import com.amazonaws.services.glue.util.GlueArgParser
 import com.amazonaws.services.glue.GlueContext
-import org.apache.spark.SparkContext
+import com.amazonaws.services.glue.util.GlueArgParser
+import com.amazonaws.services.glue.util.Job
 import org.apache.spark.sql.SparkSession
-import org.apache.spark.sql.DataFrame
+
 import scala.collection.JavaConverters._
-import com.amazonaws.services.glue.log.GlueLogger
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.functions._
 
-// Initialize Glue job
-object GlueJob {
+object ClickHouseGlueExample {
   def main(sysArgs: Array[String]) {
-    val sc: SparkContext = new SparkContext()
-    val glueContext: GlueContext = new GlueContext(sc)
-    val spark: SparkSession = glueContext.getSparkSession
-    val logger = new GlueLogger
-    import spark.implicits._
-    // @params: [JOB_NAME]
     val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
-    Job.init(args("JOB_NAME"), glueContext, args.asJava)
 
-    // JDBC connection details
-    val jdbcUrl = "jdbc:ch://{host}:{port}/{schema}"
-    val jdbcProperties = new java.util.Properties()
-    jdbcProperties.put("user", "default")
-    jdbcProperties.put("password", "*******")
-    jdbcProperties.put("driver", "com.clickhouse.jdbc.ClickHouseDriver")
-
-    // Load the table from ClickHouse
-    val df: DataFrame = spark.read.jdbc(jdbcUrl, "my_table", jdbcProperties)
-
-    // Show the Spark df, or use it for whatever you like
-    df.show()
-
-    // Commit the job
+    val sparkSession: SparkSession = SparkSession.builder
+      .config("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")
+      .config("spark.sql.catalog.clickhouse.host", "<your-clickhouse-host>")
+      .config("spark.sql.catalog.clickhouse.protocol", "https")
+      .config("spark.sql.catalog.clickhouse.http_port", "<your-clickhouse-port>")
+      .config("spark.sql.catalog.clickhouse.user", "default")
+      .config("spark.sql.catalog.clickhouse.password", "<your-password>")
+      .config("spark.sql.catalog.clickhouse.database", "default")
+      // for ClickHouse Cloud
+      .config("spark.sql.catalog.clickhouse.option.ssl", "true")
+      .config("spark.sql.catalog.clickhouse.option.ssl_mode", "NONE")
+      .getOrCreate()
+
+    val glueContext = new GlueContext(sparkSession.sparkContext)
+    Job.init(args("JOB_NAME"), glueContext, args.asJava)
+    import sparkSession.implicits._
+
+    val url = "s3://{path_to_cell_tower_data}/cell_towers.csv.gz"
+
+    val schema = StructType(Seq(
+      StructField("radio", StringType, nullable = false),
+      StructField("mcc", IntegerType, nullable = false),
+      StructField("net", IntegerType, nullable = false),
+      StructField("area", IntegerType, nullable = false),
+      StructField("cell", LongType, nullable = false),
+      StructField("unit", IntegerType, nullable = false),
+      StructField("lon", DoubleType, nullable = false),
+      StructField("lat", DoubleType, nullable = false),
+      StructField("range", IntegerType, nullable = false),
+      StructField("samples", IntegerType, nullable = false),
+      StructField("changeable", IntegerType, nullable = false),
+      StructField("created", TimestampType, nullable = false),
+      StructField("updated", TimestampType, nullable = false),
+      StructField("averageSignal", IntegerType, nullable = false)
+    ))
+
+    val df = sparkSession.read
+      .option("header", "true")
+      .schema(schema)
+      .csv(url)
+
+    // Write to ClickHouse
+    df.writeTo("clickhouse.default.cell_towers").append()
+
+    // Read from ClickHouse
+    val dfRead = sparkSession.sql("select * from clickhouse.default.cell_towers")
     Job.commit()
   }
 }
````
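Because the connector registers ClickHouse as a Spark catalog (the `spark.sql.catalog.clickhouse` settings above), tables are addressed as `clickhouse.<database>.<table>` and can be queried with plain `spark.sql` alongside any other catalog in the session. The Python tab gets the matching treatment in the next hunks: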
````diff
@@ -70,6 +137,8 @@ from awsglue.utils import getResolvedOptions
 from pyspark.context import SparkContext
 from awsglue.context import GlueContext
 from awsglue.job import Job
+from pyspark.sql import Row
+
 
 ## @params: [JOB_NAME]
 args = getResolvedOptions(sys.argv, ['JOB_NAME'])
````
````diff
@@ -80,20 +149,29 @@ logger = glueContext.get_logger()
 spark = glueContext.spark_session
 job = Job(glueContext)
 job.init(args['JOB_NAME'], args)
-jdbc_url = "jdbc:ch://{host}:{port}/{schema}"
-query = "select * from my_table"
-# For cloud usage, please add ssl options
-df = (spark.read.format("jdbc")
-    .option("driver", 'com.clickhouse.jdbc.ClickHouseDriver')
-    .option("url", jdbc_url)
-    .option("user", 'default')
-    .option("password", '*******')
-    .option("query", query)
-    .load())
-
-logger.info("num of rows:")
-logger.info(str(df.count()))
-logger.info("Data sample:")
+
+spark.conf.set("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")
+spark.conf.set("spark.sql.catalog.clickhouse.host", "<your-clickhouse-host>")
+spark.conf.set("spark.sql.catalog.clickhouse.protocol", "https")
+spark.conf.set("spark.sql.catalog.clickhouse.http_port", "<your-clickhouse-port>")
+spark.conf.set("spark.sql.catalog.clickhouse.user", "default")
+spark.conf.set("spark.sql.catalog.clickhouse.password", "<your-password>")
+spark.conf.set("spark.sql.catalog.clickhouse.database", "default")
+spark.conf.set("spark.clickhouse.write.format", "json")
+spark.conf.set("spark.clickhouse.read.format", "arrow")
+# for ClickHouse Cloud
+spark.conf.set("spark.sql.catalog.clickhouse.option.ssl", "true")
+spark.conf.set("spark.sql.catalog.clickhouse.option.ssl_mode", "NONE")
+
+# Create DataFrame
+data = [Row(id=11, name="John"), Row(id=12, name="Doe")]
+df = spark.createDataFrame(data)
+
+# Write DataFrame to ClickHouse
+df.writeTo("clickhouse.default.example_table").append()
+
+# Read DataFrame from ClickHouse
+df_read = spark.sql("select * from clickhouse.default.example_table")
-logger.info(str(df.take(10)))
+logger.info(str(df_read.take(10)))
 
 job.commit()
````
````diff
@@ -102,4 +180,4 @@ job.commit()
 </TabItem>
 </Tabs>
 
-For more details, please visit our [Spark & JDBC documentation](/integrations/apache-spark/spark-jdbc#read-data).
+For more details, please visit our [Spark documentation](/integrations/apache-spark).
````
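Both examples append to a table that must already exist on the ClickHouse side. As a one-time setup, the table can be created through the same catalog; the sketch below is an assumption based on the upstream Spark ClickHouse connector's documented `CREATE TABLE ... USING ClickHouse` DDL with `engine` and `order_by` table properties, using the `example_table` shape from the Python tab — verify the syntax against your connector version.

```python
# Hypothetical one-time setup: create the target table through the
# `clickhouse` catalog configured earlier.
spark.sql("""
    CREATE TABLE IF NOT EXISTS clickhouse.default.example_table (
        id   BIGINT NOT NULL,
        name STRING
    ) USING ClickHouse
    TBLPROPERTIES (
        engine = 'MergeTree()',
        order_by = 'id'
    )
""")
```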

docs/integrations/index.mdx (1 addition, 1 deletion)
````diff
@@ -205,7 +205,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no
 |Amazon Kinesis|<Kinesissvg style={{width: '3rem', height: '3rem'}} alt="Kenesis logo"/> |Data ingestion|Integration with Amazon Kinesis.|[Documentation](/integrations/clickpipes/kinesis/)|
 |Amazon MSK|<Amazonmsksvg style={{width: '3rem'}} alt="Amazon MSK logo"/> |Data ingestion|Integration with Amazon Managed Streaming for Apache Kafka (MSK).|[Documentation](/integrations/kafka/cloud/amazon-msk/)|
 |Amazon S3|<S3svg style={{width: '3rem', height: '3rem'}} alt="Amazon S3 logo"/>|Data ingestion|Import from, export to, and transform S3 data in flight with ClickHouse built-in S3 functions.|[Documentation](/integrations/data-ingestion/s3/index.md)|
-|Amazon Glue|<Image img={glue_logo} size="logo" alt="Amazon Glue logo"/>|Data ingestion|Query ClickHouse over JDBC|[Documentation](/integrations/glue)|
+|Amazon Glue|<Image img={glue_logo} size="logo" alt="Amazon Glue logo"/>|Data ingestion|Query ClickHouse over Spark using our official Glue connector|[Documentation](/integrations/glue)|
 |Apache Spark|<Sparksvg alt="Amazon Spark logo" style={{width: '3rem'}}/>|Data ingestion|Spark ClickHouse Connector is a high performance connector built on top of Spark DataSource V2.|[GitHub](https://github.com/housepower/spark-clickhouse-connector),<br/>[Documentation](/integrations/data-ingestion/apache-spark/index.md)|
 |Azure Event Hubs|<Azureeventhubssvg alt="Azure Events Hub logo" style={{width: '3rem'}}/>|Data ingestion|A data streaming platform that supports Apache Kafka's native protocol|[Website](https://azure.microsoft.com/en-gb/products/event-hubs)|
 |Azure Synapse|<Image img={azure_synapse_logo} size="logo" alt="Azure Synapse logo"/>|Data ingestion|A cloud-based analytics service for big data and data warehousing.|[Documentation](/integrations/azure-synapse)|
````

scripts/aspell-ignore/en/aspell-dict.txt (53 additions, 1 deletion)
````diff
@@ -3572,4 +3572,56 @@ zlib
 znode
 znodes
 zookeeperSessionUptime
-zstd
+zstd
+Okta
+specificities
+reproducibility
+CertManager
+Istio
+LogHouse
+Tailscale
+Thanos
+ReplacingReplicatedMergeTree
+ReplacingSharedMergeTree
+SharedMergeTree
+VersionedCollapsing
+subpath
+AICPA
+restartable
+sumArray
+sumForEach
+argMaxIf
+groupArrayResample
+downsampled
+uniqArrayIf
+minSimpleState
+sumArray
+avgMerge
+avgMergeState
+timeslot
+timeslots
+groupArrayDistinct
+avgMap
+avgState
+avgIf
+quantilesTiming
+quantilesTimingIf
+quantilesTimingArrayIf
+downvotes
+sumSimpleState
+upvotes
+uniqArray
+avgResample
+countResample
+avgMerge
+avgState
+argMinIf
+minSimpleState
+maxSimpleState
+TimescaleDB
+columnstore
+TiDB
+resync
+resynchronization
+Sackmann's
+JARs
````
Two binary image files added under static/images/integrations/data-ingestion/aws-glue/ — dependent_jars_path_option.png and notebook-connections-config.png (66.6 KB and 74.1 KB).
