
Commit 904ed8c

Improve clarity + typos (#3346)
1 parent 10c1210 commit 904ed8c

File tree

3 files changed, +12 -15 lines changed


docs/integrations/data-ingestion/apache-spark/index.md

Lines changed: 2 additions & 2 deletions
@@ -14,12 +14,12 @@ import TOCInline from '@theme/TOCInline';
 
 <br/>
 
-[Apache Spark](https://spark.apache.org/) Apache Spark™ is a multi-language engine for executing data engineering, data
+[Apache Spark](https://spark.apache.org/) is a multi-language engine for executing data engineering, data
 science, and machine learning on single-node machines or clusters.
 
 There are two main ways to connect Apache Spark and ClickHouse:
 
-1. [Spark Connector](./apache-spark/spark-native-connector) - the Spark connector implements the `DataSourceV2` and has its own Catalog
+1. [Spark Connector](./apache-spark/spark-native-connector) - The Spark connector implements the `DataSourceV2` and has its own Catalog
 management. As of today, this is the recommended way to integrate ClickHouse and Spark.
 2. [Spark JDBC](./apache-spark/spark-jdbc) - Integrate Spark and ClickHouse
 using a [JDBC data source](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).

docs/integrations/data-ingestion/apache-spark/spark-jdbc.md

Lines changed: 3 additions & 5 deletions
@@ -10,7 +10,7 @@ import TabItem from '@theme/TabItem';
 import TOCInline from '@theme/TOCInline';
 
 # Spark JDBC
-One of the most used data sources supported by Spark is JDBC.
+JDBC is one of the most commonly used data sources in Spark.
 In this section, we will provide details on how to
 use the [ClickHouse official JDBC connector](/integrations/java/jdbc-driver) with Spark.
 
@@ -209,7 +209,6 @@ df.show()
 .option("dbtable", "example_table")
 .option("user", "default")
 .option("password", "123456")
-.option("SaveMode", "append")
 .save();
 
 
@@ -248,15 +247,15 @@ object WriteData extends App {
 )
 
 //---------------------------------------------------------------------------------------------------//---------------------------------------------------------------------------------------------------
-// Write the df to ClickHouse using the jdbc method// Write the df to ClickHouse using the jdbc method
+// Write the df to ClickHouse using the jdbc method
 //---------------------------------------------------------------------------------------------------//---------------------------------------------------------------------------------------------------
 
 df.write
 .mode(SaveMode.Append)
 .jdbc(jdbcUrl, "example_table", jdbcProperties)
 
 //---------------------------------------------------------------------------------------------------//---------------------------------------------------------------------------------------------------
-// Write the df to ClickHouse using the save method// Write the df to ClickHouse using the save method
+// Write the df to ClickHouse using the save method
 //---------------------------------------------------------------------------------------------------//---------------------------------------------------------------------------------------------------
 
 df.write
@@ -266,7 +265,6 @@ object WriteData extends App {
 .option("dbtable", "example_table")
 .option("user", "default")
 .option("password", "123456")
-.option("SaveMode", "append")
 .save()
 
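The removal of `.option("SaveMode", "append")` in the Java and Scala snippets follows from how Spark's `DataFrameWriter` works: the save mode is declared with `.mode(...)`, so a `"SaveMode"` writer option has no effect and only adds noise. Below is a minimal, self-contained Scala sketch of the corrected JDBC write path. The JDBC URL, driver class, and sample DataFrame are illustrative assumptions; the table name and credentials mirror the snippet in the diff.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object JdbcWriteSketch extends App {
  // Assumed connection details -- adjust host, port, and database for your deployment.
  val jdbcUrl = "jdbc:ch://localhost:8123/default"

  val spark = SparkSession.builder()
    .appName("clickhouse-jdbc-write-sketch")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._
  val df = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

  // The save mode is set once via .mode(...); no "SaveMode" option is required.
  df.write
    .mode(SaveMode.Append)
    .format("jdbc")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver") // assumed driver class from clickhouse-jdbc
    .option("url", jdbcUrl)
    .option("dbtable", "example_table")
    .option("user", "default")
    .option("password", "123456")
    .save()

  spark.stop()
}
```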

docs/integrations/data-ingestion/apache-spark/spark-native-connector.md

Lines changed: 7 additions & 8 deletions
@@ -23,7 +23,7 @@ With these external solutions, users had to register their data source tables ma
 However, since Spark 3.0 introduced the catalog concept, Spark can now automatically discover tables by registering
 catalog plugins.
 
-Spark default catalog is `spark_catalog`, and tables are identified by `{catalog name}.{database}.{table}`. With the new
+Spark's default catalog is `spark_catalog`, and tables are identified by `{catalog name}.{database}.{table}`. With the new
 catalog feature, it is now possible to add and work with multiple catalogs in a single Spark application.
 
 <TOCInline toc={toc}></TOCInline>
@@ -124,7 +124,7 @@ libraryDependencies += "com.clickhouse.spark" %% clickhouse-spark-runtime-{{ spa
 </TabItem>
 <TabItem value="Spark SQL/Shell CLI" label="Spark SQL/Shell CLI">
 
-When working with Spark's shell options (Spark SQL CLI, Spark Shell CLI, Spark Submit command), the dependencies can be
+When working with Spark's shell options (Spark SQL CLI, Spark Shell CLI, and Spark Submit command), the dependencies can be
 registered by passing the required jars:
 
 ```text
@@ -135,7 +135,7 @@ $SPARK_HOME/bin/spark-sql \
 If you want to avoid copying the JAR files to your Spark client node, you can use the following instead:
 
 ```text
---repositories https://{maven-cental-mirror or private-nexus-repo} \
+--repositories https://{maven-central-mirror or private-nexus-repo} \
 --packages com.clickhouse.spark:clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}:{{ stable_version }},com.clickhouse:clickhouse-jdbc:{{ clickhouse_jdbc_version }}:all
 ```
 
@@ -161,7 +161,7 @@ and all daily build SNAPSHOT JAR files in the [Sonatype OSS Snapshots Repository
 It's essential to include the [clickhouse-jdbc JAR](https://mvnrepository.com/artifact/com.clickhouse/clickhouse-jdbc)
 with the "all" classifier,
 as the connector relies on [clickhouse-http](https://mvnrepository.com/artifact/com.clickhouse/clickhouse-http-client)
-and [clickhouse-client](https://mvnrepository.com/artifact/com.clickhouse/clickhouse-client) —both of which are bundled
+and [clickhouse-client](https://mvnrepository.com/artifact/com.clickhouse/clickhouse-client) both of which are bundled
 in clickhouse-jdbc:all.
 Alternatively, you can add [clickhouse-client JAR](https://mvnrepository.com/artifact/com.clickhouse/clickhouse-client)
 and [clickhouse-http](https://mvnrepository.com/artifact/com.clickhouse/clickhouse-http-client) individually if you
@@ -193,7 +193,7 @@ These settings could be set via one of the following:
 * Add the configuration when initiating your context.
 
 :::important
-When working with ClickHouse cluster, you need to set a unique catalog name for each instance.
+When working with a ClickHouse cluster, you need to set a unique catalog name for each instance.
 For example:
 
 ```text
@@ -498,13 +498,13 @@ The following are the adjustable configurations available in the connector:
 
 | Key | Default | Description | Since |
 |----------------------------------------------------|--------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
-| spark.clickhouse.ignoreUnsupportedTransform | false | ClickHouse supports using complex expressions as sharding keys or partition values, e.g. `cityHash64(col_1, col_2)`, and those can not be supported by Spark now. If `true`, ignore the unsupported expressions, otherwise fail fast w/ an exception. Note, when `spark.clickhouse.write.distributed.convertLocal` is enabled, ignore unsupported sharding keys may corrupt the data. | 0.4.0 |
+| spark.clickhouse.ignoreUnsupportedTransform | false | ClickHouse supports using complex expressions as sharding keys or partition values, e.g. `cityHash64(col_1, col_2)`, which are currently not supported by Spark. If `true`, ignore the unsupported expressions, otherwise fail fast w/ an exception. Note, when `spark.clickhouse.write.distributed.convertLocal` is enabled, ignore unsupported sharding keys may corrupt the data. | 0.4.0 |
 | spark.clickhouse.read.compression.codec | lz4 | The codec used to decompress data for reading. Supported codecs: none, lz4. | 0.5.0 |
 | spark.clickhouse.read.distributed.convertLocal | true | When reading Distributed table, read local table instead of itself. If `true`, ignore `spark.clickhouse.read.distributed.useClusterNodes`. | 0.1.0 |
 | spark.clickhouse.read.fixedStringAs | binary | Read ClickHouse FixedString type as the specified Spark data type. Supported types: binary, string | 0.8.0 |
 | spark.clickhouse.read.format | json | Serialize format for reading. Supported formats: json, binary | 0.6.0 |
 | spark.clickhouse.read.runtimeFilter.enabled | false | Enable runtime filter for reading. | 0.8.0 |
-| spark.clickhouse.read.splitByPartitionId | true | If `true`, construct input partition filter by virtual column `_partition_id`, instead of partition value. There are known bugs to assemble SQL predication by partition value. This feature requires ClickHouse Server v21.6+ | 0.4.0 |
+| spark.clickhouse.read.splitByPartitionId | true | If `true`, construct input partition filter by virtual column `_partition_id`, instead of partition value. There are known issues with assembling SQL predicates by partition value. This feature requires ClickHouse Server v21.6+ | 0.4.0 |
 | spark.clickhouse.useNullableQuerySchema | false | If `true`, mark all the fields of the query schema as nullable when executing `CREATE/REPLACE TABLE ... AS SELECT ...` on creating the table. Note, this configuration requires SPARK-43390(available in Spark 3.5), w/o this patch, it always acts as `true`. | 0.8.0 |
 | spark.clickhouse.write.batchSize | 10000 | The number of records per batch on writing to ClickHouse. | 0.1.0 |
 | spark.clickhouse.write.compression.codec | lz4 | The codec used to compress data for writing. Supported codecs: none, lz4. | 0.3.0 |
@@ -520,7 +520,6 @@ The following are the adjustable configurations available in the connector:
 | spark.clickhouse.write.retryInterval | 10s | The interval in seconds between write retry. | 0.1.0 |
 | spark.clickhouse.write.retryableErrorCodes | 241 | The retryable error codes returned by ClickHouse server when write failing. | 0.1.0 |
 
-
 ## Supported Data Types {#supported-data-types}
 
 This section outlines the mapping of data types between Spark and ClickHouse. The tables below provide quick references
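To put the catalog note and the configuration table above in context, here is a minimal Scala sketch of registering two uniquely named ClickHouse catalogs (one per cluster instance) and setting one of the listed connector options at session level. The catalog class `com.clickhouse.spark.ClickHouseCatalog` and the per-catalog option keys (`host`, `protocol`, `http_port`, `user`, `password`, `database`) are assumptions about the connector's configuration surface and should be checked against the connector version in use.

```scala
import org.apache.spark.sql.SparkSession

object CatalogSketch extends App {
  // Each ClickHouse instance gets its own uniquely named catalog, as the note requires.
  // Class name and option keys are assumptions -- verify them for your connector version.
  val spark = SparkSession.builder()
    .appName("clickhouse-catalog-sketch")
    .master("local[*]")
    .config("spark.sql.catalog.clickhouse1", "com.clickhouse.spark.ClickHouseCatalog")
    .config("spark.sql.catalog.clickhouse1.host", "clickhouse-node-1")
    .config("spark.sql.catalog.clickhouse1.protocol", "http")
    .config("spark.sql.catalog.clickhouse1.http_port", "8123")
    .config("spark.sql.catalog.clickhouse1.user", "default")
    .config("spark.sql.catalog.clickhouse1.password", "")
    .config("spark.sql.catalog.clickhouse1.database", "default")
    .config("spark.sql.catalog.clickhouse2", "com.clickhouse.spark.ClickHouseCatalog")
    .config("spark.sql.catalog.clickhouse2.host", "clickhouse-node-2")
    .config("spark.sql.catalog.clickhouse2.protocol", "http")
    .config("spark.sql.catalog.clickhouse2.http_port", "8123")
    // One of the adjustable configurations from the table above, set at session level.
    .config("spark.clickhouse.write.batchSize", "10000")
    .getOrCreate()

  // Tables are addressed as {catalog name}.{database}.{table}.
  spark.sql("SELECT * FROM clickhouse1.default.example_table LIMIT 10").show()

  spark.stop()
}
```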
