@@ -8,7 +8,23 @@ The MarkLogic Spark connector allows for data to be retrieved from MarkLogic as
 [Optic DSL query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710). The
 sections below provide more detail on configuring how data is retrieved and converted into a Spark DataFrame.
 
-## Query requirements
+## Basic read operation
+
+As shown in the [Getting Started with PySpark guide](getting-started/pyspark.md), a basic read operation will define
+how the connector should connect to MarkLogic, the MarkLogic Optic query to run, and zero or more other options:
+
+```
+df = spark.read.format("com.marklogic.spark") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
+    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')") \
+    .load()
+```
+
+As shown above, `format`, `spark.marklogic.client.uri` (or the other `spark.marklogic.client` options
+that can be used to define the connection details), and `spark.marklogic.read.opticQuery` are always required. The
+following sections provide more details about these and other options that can be set.
+
+## Optic query requirements
 
 As of the 2.0 release of the connector, the Optic query must use the
 [op.fromView](https://docs.marklogic.com/op.fromView) accessor function. The query must also adhere to the
@@ -87,7 +103,7 @@ stream.stop()
 Micro-batches are constructed based on the number of partitions and user-defined batch size; more information on each
 setting can be found in the section below on tuning performance. Each request to MarkLogic that is made in "batch read"
 mode - i.e. when using Spark's `read` function instead of `readStream` - corresponds to a micro-batch when reading
-data via a stream. In the example above, which uses the connector's default batch size of 10,000 rows and 2
+data via a stream. In the example above, which uses the connector's default batch size of 100,000 rows and 2
 partitions, 2 calls are made to MarkLogic, resulting in two micro-batches.
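The micro-batch count in this example can be checked with quick arithmetic. The sketch below is illustrative only; the total row count of 200,000 is an assumption made for the example, not a value stated in the documentation:

```python
import math

# Assumed illustration values: 2 partitions, the default batch size of
# 100,000 rows, and a query matching at most 200,000 rows in total.
num_partitions = 2
batch_size = 100_000
total_rows = 200_000

# Each partition reader retrieves its share of rows in chunks of batch_size;
# each chunk is one call to MarkLogic, i.e. one micro-batch when streaming.
rows_per_partition = math.ceil(total_rows / num_partitions)
micro_batches = num_partitions * math.ceil(rows_per_partition / batch_size)
print(micro_batches)  # 2
```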
 
 The number of micro-batches can be determined by enabling info-level logging and looking for a message similar to:
@@ -169,40 +185,46 @@ correct result, please [file an issue with this project](https://github.com/mark
 
 ## Tuning performance
 
-The primary factor affecting how quickly the connector can retrieve rows is MarkLogic's ability to process your Optic
-query. The
-[MarkLogic Optic performance documentation](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_91398) can help with
-optimizing your query to maximize performance.
+The primary factor affecting connector performance when reading rows is how many requests are made to MarkLogic. In
+general, performance will be best when minimizing the number of requests to MarkLogic while ensuring that no single
+request attempts to return or process too much data.
 
-Two [configuration options](configuration.md) in the connector will also impact performance. First, the
+Two [configuration options](configuration.md) control how many requests are made. First, the
 `spark.marklogic.read.numPartitions` option controls how many partitions are created. For each partition, Spark
 will use a separate task to send requests to MarkLogic to retrieve rows matching your Optic DSL query. Second, the
 `spark.marklogic.read.batchSize` option controls approximately how many rows will be retrieved in each call to
 MarkLogic.
 
-These two options impact each other in terms of how many tasks are used to make requests to MarkLogic. For example,
-consider an Optic query that matches 1 million rows in MarkLogic, a partition count of 10, and a batch size of
-10,000 rows (the default value). This configuration will result in the connector creating 10 Spark partition readers,
-each of which will retrieve approximately 100,000 unique rows. And with a batch size of 10,000, each partition
+To understand how these options control the number of requests to MarkLogic,
+consider an Optic query that matches 10 million rows in MarkLogic, a partition count of 10, and a batch size of
+100,000 rows (the default value). This configuration will result in the connector creating 10 Spark partition readers,
+each of which will retrieve approximately 1,000,000 unique rows. And with a batch size of 100,000, each partition
 reader will make approximately 10 calls to MarkLogic to retrieve these rows, for a total of 100 calls across all
-partitions.
+partitions.
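The arithmetic above can be expressed as a small helper. This is a hypothetical sketch for estimating request counts, not part of the connector's API:

```python
import math

def estimated_requests(total_rows, num_partitions, batch_size):
    """Estimate how many calls the connector makes to MarkLogic: each of the
    num_partitions partition readers retrieves roughly an equal share of the
    matching rows, batch_size rows at a time."""
    rows_per_partition = math.ceil(total_rows / num_partitions)
    return num_partitions * math.ceil(rows_per_partition / batch_size)

# 10 million matching rows, 10 partitions, default batch size of 100,000:
# 10 partition readers making ~10 calls each, or 100 calls in total.
print(estimated_requests(10_000_000, 10, 100_000))  # 100
```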
 
-Performance can thus be tested by varying the number of partitions and the batch size. In general, increasing the
-number of partitions should help performance as the number of matching rows increases. A single partition may suffice
-for a query that returns thousands of rows or fewer, while a query that returns hundreds of millions of rows will
-benefit from dozens of partitions or more. The ideal settings will depend on your Spark and MarkLogic environments
-along with the complexity of your Optic query. Testing should be performed with different queries, partition counts,
-and batch sizes to determine the optimal settings.
+Performance should be tested by varying the number of partitions and the batch size. In general, increasing the
+number of partitions should help performance as the number of rows to return increases. Determining the optimal batch
+size depends both on the number of columns in each returned row and on what kind of Spark operations are being invoked.
+The next section describes how the connector optimizes performance when an aggregation is performed, and when the
+same kind of optimization should be applied when not many rows need to be returned.
 
 ### Optimizing for smaller result sets
 
 If your Optic query matches a set of rows whose count is a small percentage of the total number of rows in
-the view that the query runs against, you may find improved performance by setting `spark.marklogic.read.batchSize`
-to zero. Doing so ensures that for each partition, a single request is sent to MarkLogic.
-
-If the result set matching your query is particularly small - such as thousands of rows or less, or possibly tens of
-thousands of rows or less - you may find optimal performance by also setting `spark.marklogic.read.numPartitions` to
-one. This will result in the connector sending a single request to MarkLogic.
+the view that the query runs against, you should find improved performance by setting `spark.marklogic.read.batchSize`
+to zero. This setting ensures that for each partition, a single request is sent to MarkLogic.
+
+If your Spark program includes an aggregation that the connector can push down to MarkLogic, then the connector will
+automatically use a batch size of zero unless you specify a different value for `spark.marklogic.read.batchSize`. This
+optimization is typically desirable when calculating an aggregation, as MarkLogic will return far fewer rows per
+request, depending on the type of aggregation.
+
+If the result set matching your query is particularly small - such as tens of thousands of rows or less, or possibly
+hundreds of thousands of rows or less - you may find optimal performance by setting
+`spark.marklogic.read.numPartitions` to one. This will result in the connector sending a single request to MarkLogic.
+The effectiveness of this approach can be evaluated by executing the Optic query via
+[MarkLogic's qconsole application](https://docs.marklogic.com/guide/qconsole/intro), which will execute the query in
+a single request as well.
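For a result set this small, the two settings might be combined as sketched below. This reuses the connection details and query from the earlier example as placeholders, with option values given as strings:

```python
# Hypothetical options for a small result set: one partition and a batch
# size of zero, so the connector sends a single request to MarkLogic.
small_result_options = {
    "spark.marklogic.client.uri": "spark-example-user:password@localhost:8020",
    "spark.marklogic.read.opticQuery": "op.fromView('example', 'employee')",
    "spark.marklogic.read.numPartitions": "1",
    "spark.marklogic.read.batchSize": "0",
}

# These could then be applied in one call, e.g.:
# df = spark.read.format("com.marklogic.spark").options(**small_result_options).load()
print(small_result_options["spark.marklogic.read.batchSize"])  # 0
```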
 
 ### More detail on partitions
 