This repository was archived by the owner on Mar 24, 2021. It is now read-only.
In case of doing multiple operations on a dataframe (select, filter etc.),
- you should persist a dataframe. Othewise, every operation on a dataframe will load the same data from Cloudant again.
+ you should persist a dataframe. Otherwise, every operation on a dataframe will load the same data from Cloudant again.
Persisting will also speed up computation. This statement will persist an RDD in memory: `df.cache()`. Alternatively for large dbs to persist in memory & disk, use:
- By default, Spark Streaming will load all documents from a database. If you want to limit the loading to specific documents, use `selector` option of `CloudantReceiver` and specify your conditions ([Scala code](examples/scala/src/main/scala/mytest/spark/CloudantStreamingSelector.scala)):
+ By default, Spark Streaming will load all documents from a database. If you want to limit the loading to specific documents, use `selector` option of `CloudantReceiver` and specify your conditions ([CloudantStreamingSelector.scala](examples/scala/src/main/scala/mytest/spark/CloudantStreamingSelector.scala)):
@@ -283,11 +269,11 @@ cloudant.protocol|https|protocol to use to transfer data: http or https
cloudant.host||cloudant host url
cloudant.username||cloudant userid
cloudant.password||cloudant password
- jsonstore.rdd.partitions|5|the number of partitions intent used to drive JsonStoreRDD loading query result in parallel. The actual number is calculated based on total rows returned and satisfying maxInPartition and minInPartition
+ jsonstore.rdd.partitions|10|the number of partitions intent used to drive JsonStoreRDD loading query result in parallel. The actual number is calculated based on total rows returned and satisfying maxInPartition and minInPartition
jsonstore.rdd.maxInPartition|-1|the max rows in a partition. -1 means unlimited
jsonstore.rdd.minInPartition|10|the min rows in a partition.
- jsonstore.rdd.requestTimeout|100000| the request timeout in milli-second
- bulkSize|20| the bulk save size
+ jsonstore.rdd.requestTimeout|900000| the request timeout in milliseconds
+ bulkSize|200| the bulk save size
schemaSampleSize| "-1" | the sample size for RDD schema discovery. 1 means we are using only first document for schema discovery; -1 means all documents; 0 will be treated as 1; any number N means min(N, total) docs
createDBOnSave|"false"| whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.
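
The `schemaSampleSize` rule in the table above (1 uses only the first document, -1 uses all documents, 0 is treated as 1, any other N means min(N, total)) can be sketched as a small Python function. This is only an illustration of the documented rule, not the connector's actual code: the function name `effective_sample_size` is hypothetical, and rejecting values below -1 is an assumption, since the table does not say how they are handled.

```python
def effective_sample_size(schema_sample_size: int, total_docs: int) -> int:
    """Return how many documents would be examined for schema discovery.

    Illustrative only: mirrors the rule described for `schemaSampleSize`.
    """
    if schema_sample_size < -1:
        # Assumption: the table does not define values below -1, so reject them.
        raise ValueError("schemaSampleSize must be -1 or a non-negative integer")
    if total_docs <= 0:
        # An empty database has nothing to sample.
        return 0
    if schema_sample_size == -1:
        # -1 means all documents.
        return total_docs
    if schema_sample_size == 0:
        # 0 is treated as 1 (first document only).
        return 1
    # Any other N means min(N, total) documents.
    return min(schema_sample_size, total_docs)
```

For a database with 50 documents, a setting of `"-1"` would scan all 50, while a setting of `"100"` would also be capped at 50.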
@@ -304,15 +290,15 @@ view||cloudant view w/o the database name. only used for load.
index||cloudant search index w/o the database name. only used for load data with less than or equal to 200 results.
path||cloudant: as database name if database is not present
schemaSampleSize|"-1"| the sample size used to discover the schema for this temp table. -1 scans all documents
- bulkSize|20| the bulk save size
+ bulkSize|200| the bulk save size
createDBOnSave|"false"| whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.

For fast loading, views are loaded without include_docs. Thus, a derived schema will always be: `{id, key, value}`, where `value` can be a compound field. An example of loading data from a view: