|
1 | 1 | ---
|
2 |
| -title: 'Graph data modeling for Azure Cosmos DB for Gremlin' |
3 |
| -description: Learn how to model a graph database by using Azure Cosmos DB for Gremlin. This article describes when to use a graph database and best practices to model entities and relationships. |
| 2 | +title: Graph data modeling with Azure Cosmos DB for Apache Gremlin |
| 3 | +description: Learn how to model a graph database by using Azure Cosmos DB for Apache Gremlin, and learn best practices to model entities and relationships. |
4 | 4 | ms.service: cosmos-db
|
5 | 5 | ms.subservice: apache-gremlin
|
6 | 6 | ms.custom: ignite-2022
|
7 | 7 | ms.topic: how-to
|
8 |
| -ms.date: 12/02/2019 |
| 8 | +ms.date: 03/14/2023 |
9 | 9 | author: manishmsfte
|
10 | 10 | ms.author: mansha
|
11 | 11 | ---
|
12 | 12 |
|
13 |
| -# Graph data modeling for Azure Cosmos DB for Gremlin |
| 13 | +# Graph data modeling with Azure Cosmos DB for Apache Gremlin |
14 | 14 | [!INCLUDE[Gremlin](../includes/appliesto-gremlin.md)]
|
15 | 15 |
|
16 |
| -The following document is designed to provide graph data modeling recommendations. This step is vital in order to ensure the scalability and performance of a graph database system as the data evolves. An efficient data model is especially important with large-scale graphs. |
| 16 | +This article provides recommendations for the use of graph data models. These best practices are vital for ensuring the scalability and performance of a graph database system as the data evolves. An efficient data model is especially important for large-scale graphs. |
17 | 17 |
|
18 | 18 | ## Requirements
|
19 | 19 |
|
20 | 20 | The process outlined in this guide is based on the following assumptions:
|
21 |
| - * The **entities** in the problem-space are identified. These entities are meant to be consumed _atomically_ for each request. In other words, the database system isn't designed to retrieve a single entity's data in multiple query requests. |
22 |
| - * There is an understanding of **read and write requirements** for the database system. These requirements will guide the optimizations needed for the graph data model. |
23 |
| - * The principles of the [Apache Tinkerpop property graph standard](https://tinkerpop.apache.org/docs/current/reference/#graph-computing) are well understood. |
| 21 | + |
| 22 | +* The *entities* in the problem-space are identified. These entities are meant to be consumed *atomically* for each request. In other words, the database system isn't designed to retrieve a single entity's data in multiple query requests. |
| 23 | +* There's an understanding of *read and write requirements* for the database system. These requirements guide the optimizations needed for the graph data model. |
| 24 | +* The principles of the [Apache TinkerPop property graph standard](https://tinkerpop.apache.org/docs/current/reference/#graph-computing) are well understood. |
24 | 25 |
|
25 | 26 | ## When do I need a graph database?
|
26 | 27 |
|
27 |
| -A graph database solution can be optimally applied if the entities and relationships in a data domain have any of the following characteristics: |
| 28 | +A graph database solution can be optimally used if the entities and relationships in a data domain have any of the following characteristics: |
28 | 29 |
|
29 |
| -* The entities are **highly connected** through descriptive relationships. The benefit in this scenario is the fact that the relationships are persisted in storage. |
30 |
| -* There are **cyclic relationships** or **self-referenced entities**. This pattern is often a challenge when using relational or document databases. |
31 |
| -* There are **dynamically evolving relationships** between entities. This pattern is especially applicable to hierarchical or tree-structured data with many levels. |
32 |
| -* There are **many-to-many relationships** between entities. |
33 |
| -* There are **write and read requirements on both entities and relationships**. |
| 30 | +* The entities are *highly connected* through descriptive relationships. The benefit in this scenario is that the relationships persist in storage. |
| 31 | +* There are *cyclic relationships* or *self-referenced entities*. This pattern is often a challenge when you use relational or document databases. |
| 32 | +* There are *dynamically evolving relationships* between entities. This pattern is especially applicable to hierarchical or tree-structured data with many levels. |
| 33 | +* There are *many-to-many relationships* between entities. |
| 34 | +* There are *write and read requirements on both entities and relationships*. |
34 | 35 |
|
35 |
| -If the above criteria is satisfied, it's likely that a graph database approach will provide advantages for **query complexity**, **data model scalability**, and **query performance**. |
| 36 | +If the above criteria are satisfied, a graph database approach likely provides advantages for *query complexity*, *data model scalability*, and *query performance*. |
36 | 37 |
|
37 |
| -The next step is to determine if the graph is going to be used for analytic or transactional purposes. If the graph is intended to be used for heavy computation and data processing workloads, it would be worth to explore the [Cosmos DB Spark connector](../nosql/quickstart-spark.md) and the use of the [GraphX library](https://spark.apache.org/graphx/). |
| 38 | +The next step is to determine if the graph is going to be used for analytic or transactional purposes. If the graph is intended to be used for heavy computation and data processing workloads, it's worth exploring the [Cosmos DB Spark connector](../nosql/quickstart-spark.md) and the [GraphX library](https://spark.apache.org/graphx/). |
38 | 39 |
|
39 | 40 | ## How to use graph objects
|
40 | 41 |
|
41 |
| -The [Apache Tinkerpop property graph standard](https://tinkerpop.apache.org/docs/current/reference/#graph-computing) defines two types of objects **Vertices** and **Edges**. |
| 42 | +The [Apache TinkerPop property graph standard](https://tinkerpop.apache.org/docs/current/reference/#graph-computing) defines two types of objects: *vertices* and *edges*. |
42 | 43 |
|
43 |
| -The following are the best practices for the properties in the graph objects: |
| 44 | +The following are best practices for the properties in the graph objects: |
44 | 45 |
|
45 | 46 | | Object | Property | Type | Notes |
|
46 | 47 | | --- | --- | --- | --- |
|
47 |
| -| Vertex | ID | String | Uniquely enforced per partition. If a value isn't supplied upon insertion, an auto-generated GUID will be stored. | |
48 |
| -| Vertex | label | String | This property is used to define the type of entity that the vertex represents. If a value isn't supplied, a default value "vertex" will be used. | |
49 |
| -| Vertex | properties | String, Boolean, Numeric | A list of separate properties stored as key-value pairs in each vertex. | |
50 |
| -| Vertex | partition key | String, Boolean, Numeric | This property defines where the vertex and its outgoing edges will be stored. Read more about [graph partitioning](partitioning.md). | |
51 |
| -| Edge | ID | String | Uniquely enforced per partition. Auto-generated by default. Edges usually don't have the need to be uniquely retrieved by an ID. | |
52 |
| -| Edge | label | String | This property is used to define the type of relationship that two vertices have. | |
53 |
| -| Edge | properties | String, Boolean, Numeric | A list of separate properties stored as key-value pairs in each edge. | |
| 48 | +| Vertex | ID | String | Uniquely enforced per partition. If a value isn't supplied upon insertion, an auto-generated GUID is stored. | |
| 49 | +| Vertex | Label | String | This property is used to define the type of entity that the vertex represents. If a value isn't supplied, a default value *vertex* is used. | |
| 50 | +| Vertex | Properties | String, boolean, numeric | A list of separate properties stored as key-value pairs in each vertex. | |
| 51 | +| Vertex | Partition key | String, boolean, numeric | This property defines where the vertex and its outgoing edges are stored. Read more about [graph partitioning](partitioning.md). | |
| 52 | +| Edge | ID | String | Uniquely enforced per partition. Auto-generated by default. Edges usually don't need to be uniquely retrieved by an ID. | |
| 53 | +| Edge | Label | String | This property is used to define the type of relationship that two vertices have. | |
| 54 | +| Edge | Properties | String, boolean, numeric | A list of separate properties stored as key-value pairs in each edge. | |
54 | 55 |
|
55 | 56 | > [!NOTE]
|
56 |
| -> Edges don't require a partition key value, since its value is automatically assigned based on their source vertex. Learn more in the [graph partitioning](partitioning.md) article. |
| 57 | +> Edges don't require a partition key value, since the value is automatically assigned based on their source vertex. Learn more in [Using a partitioned graph in Azure Cosmos DB](partitioning.md). |
57 | 58 |
|
58 | 59 | ## Entity and relationship modeling guidelines
|
59 | 60 |
|
60 |
| -The following are a set of guidelines to approach data modeling for an Azure Cosmos DB for Gremlin graph database. These guidelines assume that there's an existing definition of a data domain and queries for it. |
| 61 | +The following guidelines help you approach data modeling for an [Azure Cosmos DB for Apache Gremlin](introduction.md) graph database. These guidelines assume that there's an existing definition of a data domain and queries for it. |
61 | 62 |
|
62 | 63 | > [!NOTE]
|
63 |
| -> The steps outlined below are presented as recommendations. The final model should be evaluated and tested before its consideration as production-ready. Additionally, the recommendations below are specific to Azure Cosmos DB's Gremlin API implementation. |
| 64 | +> The following steps are presented as recommendations. You should evaluate and test the final model before considering it production-ready. Additionally, the recommendations are specific to Azure Cosmos DB's Gremlin API implementation. |
64 | 65 |
|
65 | 66 | ### Modeling vertices and properties
|
66 | 67 |
|
67 |
| -The first step for a graph data model is to map every identified entity to a **vertex object**. A one to one mapping of all entities to vertices should be an initial step and subject to change. |
| 68 | +The first step for a graph data model is to map every identified entity to a *vertex object*. A one-to-one mapping of all entities to vertices should be an initial step and subject to change. |
68 | 69 |
|
69 |
| -One common pitfall is to map properties of a single entity as separate vertices. Consider the example below, where the same entity is represented in two different ways: |
| 70 | +One common pitfall is to map properties of a single entity as separate vertices. Consider the following example, where the same entity is represented in two different ways: |
70 | 71 |
|
71 | 72 | * **Vertex-based properties**: In this approach, the entity uses three separate vertices and two edges to describe its properties. While this approach might reduce redundancy, it increases model complexity. An increase in model complexity can result in added latency, query complexity, and computation cost. This model can also present challenges in partitioning.
|
72 | 73 |
|
73 |
| -:::image type="content" source="./media/modeling/graph-modeling-1.png" alt-text="Entity model with vertices for properties." border="false"::: |
| 74 | + :::image type="content" source="./media/modeling/graph-modeling-1.png" alt-text="Diagram of entity model with vertices for properties."::: |
74 | 75 |
|
75 |
| -* **Property-embedded vertices**: This approach takes advantage of the key-value pair list to represent all the properties of the entity inside a vertex. This approach provides reduced model complexity, which will lead to simpler queries and more cost-efficient traversals. |
| 76 | +* **Property-embedded vertices**: This approach takes advantage of the key-value pair list to represent all the properties of the entity inside a vertex. This approach reduces model complexity, which leads to simpler queries and more cost-efficient traversals. |
76 | 77 |
|
77 |
| -:::image type="content" source="./media/modeling/graph-modeling-2.png" alt-text="Diagram shows the Luis vertex from the previous diagram with i d, label, and properties." border="false"::: |
| 78 | + :::image type="content" source="./media/modeling/graph-modeling-2.png" alt-text="Diagram of the Luis vertex from the previous diagram with ID, label, and properties."::: |
78 | 79 |
|
79 | 80 | > [!NOTE]
|
80 |
| -> The above examples show a simplified graph model to only show the comparison between the two ways of dividing entity properties. |
| 81 | +> The preceding diagrams show a simplified graph model that only compares the two ways of dividing entity properties. |
81 | 82 |
|
82 |
| -The **property-embedded vertices** pattern generally provides a more performant and scalable approach. The default approach to a new graph data model should gravitate towards this pattern. |
| 83 | +The property-embedded vertices pattern generally provides a more performant and scalable approach. The default approach to a new graph data model should gravitate toward this pattern. |
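
As a minimal sketch of the property-embedded vertices pattern, the following Python helper builds a single Gremlin `addV()` query string that stores all of an entity's properties as key-value pairs on one vertex. The helper names, the `pk` partition key property, and the sample property values are hypothetical and used only for illustration. The sketch only builds the query text; sending it to the service is out of scope here.

```python
def gremlin_literal(value):
    """Render a Python value as a Gremlin literal (string, boolean, or numeric)."""
    # bool must be checked before int, because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # Escape backslashes and single quotes for a single-quoted Gremlin string
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def add_vertex_query(label, vertex_id, partition_key, partition_value, properties):
    """Build one addV() query that embeds every property in the vertex itself."""
    parts = [
        f"g.addV({gremlin_literal(label)})",
        f".property('id', {gremlin_literal(vertex_id)})",
        # Assumption: the container's partition key property is passed in by name
        f".property({gremlin_literal(partition_key)}, {gremlin_literal(partition_value)})",
    ]
    for key, value in properties.items():
        parts.append(f".property({gremlin_literal(key)}, {gremlin_literal(value)})")
    return "".join(parts)


# Hypothetical entity: one vertex, no extra vertices or edges for its properties
query = add_vertex_query("person", "luis", "pk", "luis", {"age": 44, "city": "Seattle"})
print(query)
```

Because every property lives on the vertex, reading the entity takes a single vertex lookup rather than a traversal across property vertices.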
83 | 84 |
|
84 |
| -However, there are scenarios where referencing to a property might provide advantages. For example: if the referenced property is updated frequently. Using a separate vertex to represent a property that is constantly changed would minimize the amount of write operations that the update would require. |
| 85 | +However, there are scenarios where referencing a property might provide advantages, for example, when the referenced property is updated frequently. Using a separate vertex to represent a property that's constantly changing minimizes the number of write operations that each update requires. |
85 | 86 |
|
86 |
| -### Relationship modeling with edge directions |
| 87 | +### Relationship models with edge directions |
87 | 88 |
|
88 |
| -After the vertices are modeled, the edges can be added to denote the relationships between them. The first aspect that needs to be evaluated is the **direction of the relationship**. |
| 89 | +After the vertices are modeled, the edges can be added to denote the relationships between them. The first aspect that needs to be evaluated is the *direction of the relationship*. |
89 | 90 |
|
90 |
| -Edge objects have a default direction that is followed by a traversal when using the `out()` or `outE()` function. Using this natural direction results in an efficient operation, since all vertices are stored with their outgoing edges. |
| 91 | +Edge objects have a default direction that's followed by a traversal when using the `out()` or `outE()` functions. Using this natural direction results in an efficient operation, since all vertices are stored with their outgoing edges. |
91 | 92 |
|
92 |
| -However, traversing in the opposite direction of an edge, using the `in()` function, will always result in a cross-partition query. Learn more about [graph partitioning](partitioning.md). If there's a need to constantly traverse using the `in()` function, it's recommended to add edges in both directions. |
| 93 | +However, traversing in the opposite direction of an edge, using the `in()` function, always results in a cross-partition query. Learn more about [graph partitioning](partitioning.md). If there's a need to constantly traverse using the `in()` function, it's recommended to add edges in both directions. |
93 | 94 |
|
94 |
| -You can determine the edge direction by using the `.to()` or `.from()` predicates to the `.addE()` Gremlin step. Or by using the [bulk executor library for Gremlin API](bulk-executor-dotnet.md). |
| 95 | +You can determine the edge direction by using the `.to()` or `.from()` predicates with the `.addE()` Gremlin step. Or by using the [bulk executor library for Gremlin API](bulk-executor-dotnet.md). |
95 | 96 |
|
96 | 97 | > [!NOTE]
|
97 | 98 | > Edge objects have a direction by default.
|
98 | 99 |
|
99 |
| -### Relationship labeling |
| 100 | +### Relationship labels |
100 | 101 |
|
101 |
| -Using descriptive relationship labels can improve the efficiency of edge resolution operations. This pattern can be applied in the following ways: |
| 102 | +Using descriptive relationship labels can improve the efficiency of edge resolution operations. You can apply this pattern in the following ways: |
102 | 103 | * Use non-generic terms to label a relationship.
|
103 | 104 | * Associate the label of the source vertex to the label of the target vertex with the relationship name.
|
104 | 105 |
|
105 |
| -:::image type="content" source="./media/modeling/graph-modeling-3.png" alt-text="Relationship labeling examples." border="false"::: |
| 106 | +:::image type="content" source="./media/modeling/graph-modeling-3.png" alt-text="Diagram of relationship labeling examples."::: |
106 | 107 |
|
107 |
| -The more specific the label that the traverser will use to filter the edges, the better. This decision can have a significant impact on query cost as well. You can evaluate the query cost at any time [using the executionProfile step](execution-profile.md). |
| 108 | +The more specific the label that the traverser uses to filter the edges, the better. This decision can have a significant effect on query cost as well. You can evaluate the query cost at any time by using the [executionProfile step](execution-profile.md). |
108 | 109 |
|
| 110 | +## Next steps |
109 | 111 |
|
110 |
| -## Next steps: |
111 |
| -* Check out the list of supported [Gremlin steps](support.md). |
| 112 | +* Check out the list of [supported Gremlin steps](support.md). |
112 | 113 | * Learn about [graph database partitioning](partitioning.md) to deal with large-scale graphs.
|
113 |
| -* Evaluate your Gremlin queries using the [Execution Profile step](execution-profile.md). |
114 |
| -* Third-party Graph [design data model](modeling-tools.md) |
| 114 | +* Evaluate your Gremlin queries using the [execution profile step](execution-profile.md). |
| 115 | +* Explore third-party [graph data modeling tools](modeling-tools.md). |