
Query Test Runs


Below are the results of some query test runs using the different Spark-to-DocumentDB connection methods.

DocumentDB Configurations Used for Query Tests

Below are the DocumentDB configurations used:

  • Single-partition collection (10,000 RUs):
      • airport.codes: 512 documents
      • DepartureDelays.flights: 1.05M documents
  • Partitioned collection (250,000 RUs):
      • DepartureDelays.flights_pcoll: 1.39M documents

Apache Spark Configurations Used for Query Tests

Below are the Apache Spark configurations used:

  • Dev box: single-VM Spark cluster (one master, one worker) on an Azure DS11 v2 VM (14 GB RAM, 2 cores) running Ubuntu 16.04 LTS with Spark 2.1.
  • HDI cluster: HDI 3.5 multi-node Spark cluster (2 masters, multiple workers, 3 ZooKeeper nodes) with Spark 2.0.2.

Query and Collections Used

The queries were:

  • Q1: SELECT c.City FROM c WHERE c.State='WA'
  • Q2a: SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c
  • Q2b: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA' LIMIT 100
  • Q3: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'
  • Q4: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c

Note the slight difference between Q2a and Q2b: Q2a (via pyDocumentDB) uses a DocumentDB SQL query (i.e. TOP), while Q2b (via azure-documentdb-spark) uses a Spark SQL query (i.e. LIMIT).

These queries were run against the following collections:

  • C1: airport.codes
  • C2: DepartureDelays.flights
  • C3: DepartureDelays.flights_pcoll

The query results below are from executing df.count() on the resulting DataFrame.
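
For illustration, here is a minimal sketch of timing two consecutive df.count() runs. This is not the original test harness; it assumes a DataFrame named df produced by one of the methods described below.

```python
import time

def timed_count(df):
    """Run df.count() once and return (row_count, elapsed_seconds)."""
    start = time.time()
    rows = df.count()
    return rows, time.time() - start

# First call, then an immediate rerun of the same count.
rows, first_run = timed_count(df)   # df: a DataFrame from one of the methods below
_, second_run = timed_count(df)
print(rows, first_run, second_run)
```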

pyDocumentDB Performance
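
For context on how these numbers were produced, the pyDocumentDB approach issues the DocumentDB SQL query through the Python SDK and then builds a Spark DataFrame from the returned documents. The sketch below illustrates that pattern; it assumes an existing SparkSession named spark, the endpoint, key, and collection link are placeholders, and the code is illustrative rather than the exact test script.

```python
import pydocumentdb.document_client as document_client

# Placeholder connection details (not the actual test environment).
host = 'https://<account>.documents.azure.com:443/'
master_key = '<primary key>'
collection_link = 'dbs/DepartureDelays/colls/flights'

# pyDocumentDB runs the DocumentDB SQL query through the SDK on the Spark driver.
client = document_client.DocumentClient(host, {'masterKey': master_key})

# Q2a: a DocumentDB SQL query (note the use of TOP).
query = "SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c"
docs = list(client.QueryDocuments(collection_link, query))

# The documents are materialized on the driver before Spark sees them,
# which helps explain why large result sets (e.g. Q4) take longer via this path.
df = spark.createDataFrame(docs)
df.count()
```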

Below are the results of connecting Spark to DocumentDB via pyDocumentDB from the dev box:

Single Collection

Below are the results from querying a single collection:

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q1 | 7 | C1 | 0:00:00.225645 | 0:00:00.006784 |
| Q2a | 100 | C2 | 0:00:00.214985 | 0:00:00.009669 |
| Q3 | 14,808 | C2 | 0:00:01.498699 | 0:00:01.323917 |
| Q4 | 1,048,575 | C2 | 0:01:37.518344 | |

Partitioned Collection

Below are the results from querying a partitioned collection (25 partitions):

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q2a | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
| Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
| Q4 | 1,391,578 | C3 | 0:02:36.335267 | |


azure-documentdb-spark Performance
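
By contrast, azure-documentdb-spark reads the collection directly into a DataFrame and pushes the query down to DocumentDB, so data is fetched by the executors rather than funneled through the driver. The PySpark sketch below illustrates the pattern, again assuming an existing SparkSession named spark; the data source name and option keys follow the connector's documented configuration as best recalled (the later azure-cosmosdb-spark naming), so treat them as assumptions rather than the exact settings used in these runs.

```python
# Read configuration for the partitioned collection (C3); values are placeholders.
flights_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<primary key>",
    "Database": "DepartureDelays",
    "Collection": "flights_pcoll",
    # Q3 pushed down to DocumentDB as a custom query.
    "query_custom": ("SELECT c.date, c.delay, c.distance, c.origin, c.destination "
                     "FROM c WHERE c.origin = 'SEA'"),
}

# Assumed data source name; older azure-documentdb-spark builds may register the
# source under a different package string.
flights = (spark.read
                .format("com.microsoft.azure.cosmosdb.spark")
                .options(**flights_config)
                .load())

flights.count()
```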

Below are the results of connecting Spark to DocumentDB via azure-documentdb-spark:

Single Collection

Below are the results from querying a single collection from the dev box:

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q2b | 100 | C2 | 00:00:01.183 | 00:00:00.958 |
| Q3 | 14,808 | C2 | 00:00:01.802 | 00:00:01.558 |
| Q4 | 1,048,575 | C2 | 00:00:56.642 | 00:00:54.931 |

Partitioned Collection

Below are the results from querying a partitioned collection (25 partitions):

Dev Box Results

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q2b | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
| Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
| Q4 | 1,391,578 | C3 | 0:02:36.335267 | |

HDI Cluster: 2 workers

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q2b | 100 | C3 | 00:00:01.286 | 00:00:00.868 |
| Q3 | 23,078 | C3 | 00:00:01.582 | 00:00:01.339 |
| Q4 | 1,391,578 | C3 | 00:00:16.955 | 00:00:12.982 |

HDI Cluster: Q4 across multiple worker configurations

Q4 was re-run while scaling out the cluster to see the impact of increased query parallelism:

| # of workers | Response Time (First) | Response Time (Second) |
|--------------|-----------------------|------------------------|
| 4 | 00:00:11.129 | 00:00:09.958 |
| 6 | 00:00:10.028 | 00:00:10.495 |
| 8 | 00:00:10.323 | 00:00:09.723 |
| 12 | 00:00:08.899 | 00:00:09.153 |
| 20 | 00:00:10.210 | 00:00:10.398 |
