|
| 1 | +<page xmlns="http://projectmallard.org/1.0/" |
| 2 | + type="topic" |
| 3 | + id="aggregation"> |
| 4 | + <info><link xref="index#aggregation" type="guide"/></info> |
| 5 | + <title>Aggregation Framework Examples</title> |
| 6 | + |
| 7 | + <p>This document provides practical examples that demonstrate the capabilities of the aggregation framework.</p> |
| 8 | + |
| 9 | + <p>The <link href="http://docs.mongodb.org/manual/tutorial/aggregation-examples/#aggregations-using-the-zip-code-data-set">Aggregations using the Zip Codes Data Set</link> examples use a publicly available data set of all zip codes and populations in the United States. The data set is available at: <link href="http://media.mongodb.org/zips.json">zips.json</link>.</p> |
| 10 | + |
| 11 | + <section id="requirements"> |
| 12 | + <title>Requirements</title> |
| 13 | + |
| 14 | + <p><link href="https://mongodb.org">MongoDB</link>, version 2.2.0 or later. <link href="https://github.com/mongodb/mongo-c-driver">MongoDB C driver</link>, version 0.94.4 or later.</p> |
| 15 | + <p>Let's check if everything is installed.</p> |
| 16 | + <p>Use the following command to load the zips.json data set into a running mongod instance:</p> |
| 17 | + |
| 18 | + <screen><input style="prompt">$ </input><input>mongoimport --drop -d test -c zipcodes zips.json</input></screen> |
| 19 | + |
| 20 | + <p>Let's use the MongoDB shell to verify that everything was imported successfully.</p> |
| 21 | + |
| 22 | + <screen><input style="prompt">$ </input><input>mongo test</input> |
| 23 | +<output>MongoDB shell version: 2.6.1 |
| 24 | +connecting to: test</output> |
| 25 | +<input style="prompt">> </input><input>db.zipcodes.count()</input> |
| 26 | +<output>29467</output> |
| 27 | +<input style="prompt">> </input><input>db.zipcodes.findOne()</input> |
| 28 | +<output><![CDATA[{ |
| 29 | + "_id" : "35004", |
| 30 | + "city" : "ACMAR", |
| 31 | + "loc" : [ |
| 32 | + -86.51557, |
| 33 | + 33.584132 |
| 34 | + ], |
| 35 | + "pop" : 6055, |
| 36 | + "state" : "AL" |
| 37 | +}]]></output></screen> |
| 38 | + </section> |
| 39 | + |
| 40 | + <section> |
| 41 | + <title>Aggregations using the Zip Codes Data Set</title> |
| 42 | + <p>Each document in this collection has the following form:</p> |
| 43 | + <synopsis><code mime="text/x-json"><![CDATA[{ |
| 44 | + "_id" : "35004", |
| 45 | + "city" : "ACMAR", |
| 46 | + "state" : "AL", |
| 47 | + "pop" : 6055, |
| 48 | + "loc" : [-86.51557, 33.584132] |
| 49 | +}]]></code></synopsis> |
| 50 | + |
| 51 | + <p>In these documents:</p> |
| 52 | + |
| 53 | + <list> |
| 54 | + <item><p>The <code>_id</code> field holds the zip code as a string.</p></item> |
| 55 | + <item><p>The <code>city</code> field holds the city name.</p></item> |
| 56 | + <item><p>The <code>state</code> field holds the two-letter state abbreviation.</p></item> |
| 57 | + <item><p>The <code>pop</code> field holds the population.</p></item> |
| 58 | + <item><p>The <code>loc</code> field holds the location as a <code>[longitude, latitude]</code> array.</p></item> |
| 59 | + </list> |
| 60 | + </section> |
| 61 | + |
| 62 | + <section> |
| 63 | + <title>States with Populations Over 10 Million</title> |
| 64 | + <p>To get all states with a population of 10 million or more, use the following aggregation pipeline:</p> |
| 65 | + <synopsis><code mime="text/x-csrc"><![CDATA[#include <mongoc.h> |
| 66 | +#include <bcon.h> |
| 67 | +#include <stdio.h> |
| 68 | + |
| 69 | +static void |
| 70 | +print_pipeline (mongoc_collection_t *collection) |
| 71 | +{ |
| 72 | + bson_t *pipeline; |
| 73 | + mongoc_cursor_t *cursor; |
| 74 | + const bson_t *doc; |
| 75 | + |
| 76 | + pipeline = BCON_NEW ("pipeline", "[", |
| 77 | + "{", "$group", "{", "_id", "$state", "total_pop", "{", "$sum", "$pop", "}", "}", "}", |
| 78 | + "{", "$match", "{", "total_pop", "{", "$gte", BCON_INT32 (10000000), "}", "}", "}", |
| 79 | + "]"); |
| 80 | + |
| 81 | + cursor = mongoc_collection_aggregate (collection, MONGOC_QUERY_NONE, pipeline, NULL, NULL); |
| 82 | + |
| 83 | + while (mongoc_cursor_next (cursor, &doc)) { |
| 84 | + char *str; |
| 85 | + |
| 86 | + str = bson_as_json (doc, NULL); |
| 87 | + printf ("%s\n", str); |
| 88 | + bson_free (str); |
| 89 | + } |
| 90 | + |
| 91 | + mongoc_cursor_destroy (cursor); |
| 92 | + bson_destroy (pipeline); |
| 93 | +}]]></code></synopsis> |
| 94 | + |
| 95 | + <p>You should see a result like the following:</p> |
| 96 | + |
| 97 | + <synopsis><code mime="text/x-json"><![CDATA[{ "_id" : "PA", "total_pop" : 11881643 } |
| 98 | +{ "_id" : "OH", "total_pop" : 10847115 } |
| 99 | +{ "_id" : "NY", "total_pop" : 17990455 } |
| 100 | +{ "_id" : "FL", "total_pop" : 12937284 } |
| 101 | +{ "_id" : "TX", "total_pop" : 16986510 } |
| 102 | +{ "_id" : "IL", "total_pop" : 11430472 } |
| 103 | +{ "_id" : "CA", "total_pop" : 29760021 }]]></code></synopsis> |
| 104 | + |
| 105 | + <p>The above aggregation pipeline is built from two pipeline operators: <code>$group</code> and <code>$match</code>.</p> |
| 106 | + |
| 107 | + <p>The <code>$group</code> pipeline operator requires an <code>_id</code> field that specifies the grouping key. Each remaining field specifies how to compute a composite value and must use one of the group aggregation functions: <code>$addToSet</code>, <code>$first</code>, <code>$last</code>, <code>$max</code>, <code>$min</code>, <code>$avg</code>, <code>$push</code>, <code>$sum</code>. The <code>$match</code> pipeline operator uses the same syntax as read operation queries.</p> |
| 108 | + |
| 109 | + <p>The <code>$group</code> operator reads all documents and creates one document for each state, for example:</p> |
| 110 | + |
| 111 | + <synopsis><code mime="text/x-json">{ "_id" : "WA", "total_pop" : 4866692 }</code></synopsis> |
| 112 | + |
| 113 | + <p>The <code>total_pop</code> field uses the <code>$sum</code> aggregation function to sum the values of the <code>pop</code> field across all source documents for that state.</p> |
| 114 | + <p>Documents created by <code>$group</code> are piped to the <code>$match</code> pipeline operator, which returns only the documents whose <code>total_pop</code> value is greater than or equal to 10 million.</p> |
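<p>To make the semantics of the two stages concrete, the following is a minimal sketch in plain C that emulates <code>$group</code> with <code>$sum</code> followed by <code>$match</code> with <code>$gte</code> over an in-memory array. It does not use the MongoDB C driver and is not how the server implements aggregation. The names <code>zip_t</code>, <code>state_total_t</code>, <code>group_sum_by_state</code>, <code>match_gte</code> and <code>MAX_STATES</code> are hypothetical and exist only for this illustration.</p>

```c
#include <assert.h>
#include <string.h>

#define MAX_STATES 64 /* assumed upper bound on distinct groups */

/* Hypothetical in-memory stand-in for one zipcodes document. */
typedef struct {
   const char *state;
   long pop;
} zip_t;

/* One output document of the $group stage: { _id, total_pop }. */
typedef struct {
   const char *id;
   long total_pop;
} state_total_t;

/* Emulates $group with $sum: keeps one accumulator per distinct state.
 * out must have room for MAX_STATES entries.
 * Returns the number of groups written to out. */
static size_t
group_sum_by_state (const zip_t *zips, size_t n, state_total_t *out)
{
   size_t n_groups = 0;
   size_t i, g;

   for (i = 0; i < n; i++) {
      /* Look for an existing group with this _id. */
      for (g = 0; g < n_groups; g++) {
         if (strcmp (out[g].id, zips[i].state) == 0) {
            break;
         }
      }
      /* First document for this state: start a new group at zero. */
      if (g == n_groups) {
         assert (n_groups < MAX_STATES);
         out[g].id = zips[i].state;
         out[g].total_pop = 0;
         n_groups++;
      }
      out[g].total_pop += zips[i].pop;
   }
   return n_groups;
}

/* Emulates $match { total_pop: { $gte: min_pop } }.
 * Compacts the matching groups to the front of the array and
 * returns how many were kept. */
static size_t
match_gte (state_total_t *groups, size_t n, long min_pop)
{
   size_t kept = 0;
   size_t i;

   for (i = 0; i < n; i++) {
      if (groups[i].total_pop >= min_pop) {
         groups[kept++] = groups[i];
      }
   }
   return kept;
}
```

<p>Calling <code>group_sum_by_state</code> and then passing its output to <code>match_gte</code> mirrors the order of the stages: the filter sees only the per-state totals, never the individual zip code documents.</p>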
| 115 | + |
| 116 | + </section> |
| 117 | + |
| 118 | + <section> |
| 119 | + <title>Average City Population by State</title> |
| 120 | + <p>To get the three states with the highest average city population, use the following aggregation:</p> |
| 121 | + |
| 122 | + <synopsis><code mime="text/x-csrc"><![CDATA[pipeline = BCON_NEW ("pipeline", "[", |
| 123 | + "{", "$group", "{", "_id", "{", "state", "$state", "city", "$city", "}", "pop", "{", "$sum", "$pop", "}", "}", "}", |
| 124 | + "{", "$group", "{", "_id", "$_id.state", "avg_city_pop", "{", "$avg", "$pop", "}", "}", "}", |
| 125 | + "{", "$sort", "{", "avg_city_pop", BCON_INT32 (-1), "}", "}", |
| 126 | + "{", "$limit", BCON_INT32 (3), "}", |
| 127 | +"]");]]></code></synopsis> |
| 128 | + |
| 129 | + <p>This aggregation pipeline produces:</p> |
| 130 | + |
| 131 | + <synopsis><code mime="text/x-json"><![CDATA[{ "_id" : "DC", "avg_city_pop" : 303450.0 } |
| 132 | +{ "_id" : "FL", "avg_city_pop" : 27942.29805615551 } |
| 133 | +{ "_id" : "CA", "avg_city_pop" : 27735.341099720412 }]]></code></synopsis> |
| 134 | + |
| 135 | + <p>The above aggregation pipeline is built from three pipeline operators: <code>$group</code>, <code>$sort</code> and <code>$limit</code>.</p> |
| 136 | + |
| 137 | + <p>The first <code>$group</code> operator creates the following documents:</p> |
| 138 | + |
| 139 | + <synopsis><code mime="text/x-json"><![CDATA[{ "_id" : { "state" : "WY", "city" : "Smoot" }, "pop" : 414 }]]></code></synopsis> |
| 140 | + |
| 141 | + <p>Note that the <code>$group</code> operator cannot use nested documents except in the <code>_id</code> field.</p> |
| 142 | + |
| 143 | + <p>The second <code>$group</code> uses these documents to create the following documents:</p> |
| 144 | + |
| 145 | + <synopsis><code mime="text/x-json"><![CDATA[{ "_id" : "FL", "avg_city_pop" : 27942.29805615551 }]]></code></synopsis> |
| 146 | + |
| 147 | + <p>These documents are sorted by the <code>avg_city_pop</code> field in descending order. Finally, the <code>$limit</code> pipeline operator returns the first 3 documents from the sorted set.</p> |
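<p>The four stages can also be traced in plain C. The sketch below is an illustration only: it works on an in-memory array rather than through the MongoDB C driver, and the names <code>city_pop_t</code>, <code>state_avg_t</code>, <code>group_by_state_city</code>, <code>avg_by_state</code>, <code>sort_and_limit</code> and <code>MAX_GROUPS</code> are hypothetical. It shows why the first <code>$group</code> is needed: a city can span several zip codes, so populations are summed per city before they are averaged per state.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUPS 64 /* assumed upper bound on distinct groups */

/* Hypothetical stand-in for a zipcodes document, reused for the output
 * of the first $group stage: { _id: { state, city }, pop }. */
typedef struct {
   const char *state;
   const char *city;
   long pop;
} city_pop_t;

/* Output of the second $group stage: { _id: state, avg_city_pop }. */
typedef struct {
   const char *state;
   long sum;      /* running total of city populations */
   long n_cities; /* number of city documents seen */
   double avg_city_pop;
} state_avg_t;

/* First $group: sum pop over all zip codes sharing a (state, city) _id. */
static size_t
group_by_state_city (const city_pop_t *zips, size_t n, city_pop_t *out)
{
   size_t n_groups = 0, i, g;

   for (i = 0; i < n; i++) {
      for (g = 0; g < n_groups; g++) {
         if (strcmp (out[g].state, zips[i].state) == 0 &&
             strcmp (out[g].city, zips[i].city) == 0) {
            break;
         }
      }
      if (g == n_groups) {
         assert (n_groups < MAX_GROUPS);
         out[g] = zips[i];
         out[g].pop = 0;
         n_groups++;
      }
      out[g].pop += zips[i].pop;
   }
   return n_groups;
}

/* Second $group: average the per-city totals within each state. */
static size_t
avg_by_state (const city_pop_t *cities, size_t n, state_avg_t *out)
{
   size_t n_groups = 0, i, g;

   for (i = 0; i < n; i++) {
      for (g = 0; g < n_groups; g++) {
         if (strcmp (out[g].state, cities[i].state) == 0) {
            break;
         }
      }
      if (g == n_groups) {
         assert (n_groups < MAX_GROUPS);
         out[g].state = cities[i].state;
         out[g].sum = 0;
         out[g].n_cities = 0;
         n_groups++;
      }
      out[g].sum += cities[i].pop;
      out[g].n_cities++;
      out[g].avg_city_pop = (double) out[g].sum / (double) out[g].n_cities;
   }
   return n_groups;
}

/* qsort comparator for $sort { avg_city_pop: -1 } (descending). */
static int
cmp_avg_desc (const void *a, const void *b)
{
   double x = ((const state_avg_t *) a)->avg_city_pop;
   double y = ((const state_avg_t *) b)->avg_city_pop;

   return (x < y) - (x > y);
}

/* $sort followed by $limit: returns how many documents survive. */
static size_t
sort_and_limit (state_avg_t *states, size_t n, size_t limit)
{
   qsort (states, n, sizeof *states, cmp_avg_desc);
   return n < limit ? n : limit;
}
```

<p>Chaining <code>group_by_state_city</code>, <code>avg_by_state</code> and <code>sort_and_limit</code> with a limit of 3 follows the same order as the pipeline stages above.</p>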
| 148 | + </section> |
| 149 | + |
| 150 | +</page> |