
Query Test Runs


Below are the results of some query test runs using the different Spark-to-DocumentDB connection methods.

DocumentDB Configurations Used for Query Tests

Below are the DocumentDB configurations used:

  • Single-partition collection (10,000 RUs):
      • airport.codes: 512 documents
      • DepartureDelays.flights: 1.05M documents
  • Partitioned collection (250,000 RUs):
      • DepartureDelays.flights_pcoll: 1.39M documents

Apache Spark Configurations Used for Query Tests

Below are the Apache Spark configurations used:

  • Dev box: single-VM Spark cluster (one master, one worker) on an Azure DS11 v2 VM (14 GB RAM, 2 cores) running Ubuntu 16.04 LTS with Spark 2.1.
  • HDI cluster: HDI 3.5 multi-node Spark cluster (2 masters, multiple workers, 3 ZooKeeper nodes) with Spark 2.0.2.

Query and Collections Used

The queries were:

  • Q1: SELECT c.City FROM c WHERE c.State='WA'
  • Q2a: SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c
  • Q2b: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA' LIMIT 100
  • Q3: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'
  • Q4: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c

Note the slight difference between Q2a and Q2b: Q2a (via pyDocumentDB) uses a DocumentDB SQL query (i.e. TOP), while Q2b (via azure-documentdb-spark) uses a Spark SQL query (i.e. LIMIT).

These queries were run against the following collections:

  • C1: airport.codes
  • C2: DepartureDelays.flights
  • C3: DepartureDelays.flights_pcoll

The query results below are from executing df.count() on the resulting DataFrame.
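
For illustration, here is a minimal sketch of timing two consecutive df.count() runs. This is not the original test harness; it assumes a DataFrame named df produced by one of the methods described below.

```python
import time

def timed_count(df):
    """Run df.count() once and return (row_count, elapsed_seconds)."""
    start = time.time()
    rows = df.count()
    return rows, time.time() - start

# First call, then an immediate rerun of the same count.
rows, first_run = timed_count(df)   # df: a DataFrame from one of the methods below
_, second_run = timed_count(df)
print(rows, first_run, second_run)
```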

pyDocumentDB Performance
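
For context on how these numbers were produced, the pyDocumentDB approach issues the DocumentDB SQL query through the Python SDK and then builds a Spark DataFrame from the returned documents. The sketch below illustrates that pattern; it assumes an existing SparkSession named spark, the endpoint, key, and collection link are placeholders, and the code is illustrative rather than the exact test script.

```python
import pydocumentdb.document_client as document_client

# Placeholder connection details (not the actual test environment).
host = 'https://<account>.documents.azure.com:443/'
master_key = '<primary key>'
collection_link = 'dbs/DepartureDelays/colls/flights'

# pyDocumentDB runs the DocumentDB SQL query through the SDK on the Spark driver.
client = document_client.DocumentClient(host, {'masterKey': master_key})

# Q2a: a DocumentDB SQL query (note the use of TOP).
query = "SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c"
docs = list(client.QueryDocuments(collection_link, query))

# The documents are materialized on the driver before Spark sees them,
# which helps explain why large result sets (e.g. Q4) take longer via this path.
df = spark.createDataFrame(docs)
df.count()
```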

Below are the results of connecting Spark to DocumentDB via pyDocumentDB from the dev box:

Single Collection

Below are the results from querying a single collection:

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q1 | 7 | C1 | 0:00:00.225645 | 0:00:00.006784 |
| Q2a | 100 | C2 | 0:00:00.214985 | 0:00:00.009669 |
| Q3 | 14,808 | C2 | 0:00:01.498699 | 0:00:01.323917 |
| Q4 | 1,048,575 | C2 | 0:01:37.518344 | |

Partitioned Collection

Below are the results from querying a partitioned collection (25 partitions):

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q2a | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
| Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
| Q4 | 1,391,578 | C3 | 0:02:36.335267 | |


azure-documentdb-spark Performance
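
By contrast, azure-documentdb-spark reads the collection directly into a DataFrame and pushes the query down to DocumentDB, so data is fetched by the executors rather than funneled through the driver. The PySpark sketch below illustrates the pattern, again assuming an existing SparkSession named spark; the data source name and option keys follow the connector's documented configuration as best recalled (the later azure-cosmosdb-spark naming), so treat them as assumptions rather than the exact settings used in these runs.

```python
# Read configuration for the partitioned collection (C3); values are placeholders.
flights_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<primary key>",
    "Database": "DepartureDelays",
    "Collection": "flights_pcoll",
    # Q3 pushed down to DocumentDB as a custom query.
    "query_custom": ("SELECT c.date, c.delay, c.distance, c.origin, c.destination "
                     "FROM c WHERE c.origin = 'SEA'"),
}

# Assumed data source name; older azure-documentdb-spark builds may register the
# source under a different package string.
flights = (spark.read
                .format("com.microsoft.azure.cosmosdb.spark")
                .options(**flights_config)
                .load())

flights.count()
```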

Below are the results of connecting Spark to DocumentDB via azure-documentdb-spark:

Single Collection

Below are the results from querying a single collection from the dev box:

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q2b | 100 | C2 | 00:00:01.183 | 00:00:00.958 |
| Q3 | 14,808 | C2 | 00:00:01.802 | 00:00:01.558 |
| Q4 | 1,048,575 | C2 | 00:00:56.642 | 00:00:54.931 |

Partitioned Collection

Below are the results from querying a partitioned collection (25 partitions):

Dev Box Results

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q2b | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
| Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
| Q4 | 1,391,578 | C3 | 0:02:36.335267 | |

HDI Cluster: 2 workers

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|-------|-----------|------------|-----------------------|------------------------|
| Q2b | 100 | C3 | 00:00:01.286 | 00:00:00.868 |
| Q3 | 23,078 | C3 | 00:00:01.582 | 00:00:01.339 |
| Q4 | 1,391,578 | C3 | 00:00:16.955 | 00:00:12.982 |

HDI Cluster: Q4 across multiple worker configurations

Q4 was re-run while scaling out the cluster to see the impact of increased query parallelism:

| # of workers | Response Time (First) | Response Time (Second) |
|--------------|-----------------------|------------------------|
| 4 | 00:00:11.129 | 00:00:09.958 |
| 6 | 00:00:10.028 | 00:00:10.495 |
| 8 | 00:00:10.323 | 00:00:09.723 |
| 12 | 00:00:08.899 | 00:00:09.153 |
| 20 | 00:00:10.210 | 00:00:10.398 |
