This repository was archived by the owner on Mar 10, 2025. It is now read-only.

Aggregations Examples

Denny Lee edited this page Mar 14, 2017 · 10 revisions

Below are some examples of how you can perform distributed aggregations and analytics using Apache Spark and Azure DocumentDB together. Note: Azure DocumentDB already has native support for aggregations (link to blog goes here), so here is how you can take them to the next level with Apache Spark.

Note: these aggregations refer to the Spark to DocumentDB Connector Notebook.

Connecting to Flights Sample Data

For these aggregation examples, we are accessing flight performance data stored in our DoctorWho DocumentDB database. To connect to it, use the following code snippet:

// Import Spark to DocumentDB Connector
import com.microsoft.azure.documentdb.spark.schema._
import com.microsoft.azure.documentdb.spark._
import com.microsoft.azure.documentdb.spark.config.Config

// Connect to DocumentDB Database
val readConfig2 = Config(Map(
  "Endpoint" -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey" -> "le1n99i1w5l7uvokJs3RT5ZAH8dc3ql7lx2CG0h0kK4lVWPkQnwpRLyAN0nwS1z4Cyd1lJgvGUfMWR3v8vkXKA==",
  "Database" -> "DepartureDelays",
  "preferredRegions" -> "Central US;East US 2;",
  "Collection" -> "flights_pcoll",
  "SamplingRatio" -> "1.0"))

// Create collection connection 
val coll = spark.sqlContext.read.DocumentDB(readConfig2)
coll.createOrReplaceTempView("c")

With this, we will also run a base query that transfers the filtered set of data we want from DocumentDB to Spark (where the latter can perform distributed aggregations). In this case, we are asking for flights departing from Seattle (SEA).

// Query flights originating from Seattle (SEA) and register a temp view
val originSEA = spark.sql("SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'")
originSEA.createOrReplaceTempView("originSEA")
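To confirm the transfer worked, you can count the filtered rows and time the query. This is a minimal sketch that assumes the `originSEA` DataFrame from the snippet above and an active Spark session; the elapsed time shown will vary with your cluster and network.

```scala
// Time a row count over the filtered Seattle flights
// (the count triggers the actual read from DocumentDB)
val start = System.nanoTime()
val rowCount = originSEA.count()
val elapsedSec = (System.nanoTime() - start) / 1e9
println(s"Rows: $rowCount, elapsed: $elapsedSec s")
```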

Running a LIMIT query

Just like you're used to in SQL / Spark SQL, let's start off with a LIMIT query:
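As a sketch of what that looks like, assuming the `originSEA` temp view registered above (the column list and the limit of 10 are illustrative choices, not fixed by the connector):

```scala
// Return only the first 10 rows of the Seattle departures view
val originSEALimit10 =
  spark.sql("SELECT originSEA.date, originSEA.delay, originSEA.distance, originSEA.origin, originSEA.destination FROM originSEA LIMIT 10")

// Materialize and display the limited result set
originSEALimit10.show()
```

Because the filtering already happened in the base query, the LIMIT here is applied by Spark over the data it has pulled from DocumentDB, not pushed down to the database.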
