Commit c82885d

Merge pull request #230707 from cdpark/gremlin-modeling
Freshness Pass User Story: 2036619 Graph data models
2 parents d92cc2b + 825c8bb commit c82885d

1 file changed

+50
-49
lines changed

Lines changed: 50 additions & 49 deletions
@@ -1,114 +1,115 @@
11
---
2-
title: 'Graph data modeling for Azure Cosmos DB for Gremlin'
3-
description: Learn how to model a graph database by using Azure Cosmos DB for Gremlin. This article describes when to use a graph database and best practices to model entities and relationships.
2+
title: Graph data modeling with Azure Cosmos DB for Apache Gremlin
3+
description: Learn how to model a graph database by using Azure Cosmos DB for Apache Gremlin, and learn best practices to model entities and relationships.
44
ms.service: cosmos-db
55
ms.subservice: apache-gremlin
66
ms.custom: ignite-2022
77
ms.topic: how-to
8-
ms.date: 12/02/2019
8+
ms.date: 03/14/2023
99
author: manishmsfte
1010
ms.author: mansha
1111
---
1212

13-
# Graph data modeling for Azure Cosmos DB for Gremlin
13+
# Graph data modeling with Azure Cosmos DB for Apache Gremlin
1414
[!INCLUDE[Gremlin](../includes/appliesto-gremlin.md)]
1515

16-
The following document is designed to provide graph data modeling recommendations. This step is vital in order to ensure the scalability and performance of a graph database system as the data evolves. An efficient data model is especially important with large-scale graphs.
16+
This article provides recommendations for the use of graph data models. These best practices are vital for ensuring the scalability and performance of a graph database system as the data evolves. An efficient data model is especially important for large-scale graphs.
1717

1818
## Requirements
1919

2020
The process outlined in this guide is based on the following assumptions:
21-
* The **entities** in the problem-space are identified. These entities are meant to be consumed _atomically_ for each request. In other words, the database system isn't designed to retrieve a single entity's data in multiple query requests.
22-
* There is an understanding of **read and write requirements** for the database system. These requirements will guide the optimizations needed for the graph data model.
23-
* The principles of the [Apache Tinkerpop property graph standard](https://tinkerpop.apache.org/docs/current/reference/#graph-computing) are well understood.
21+
22+
* The *entities* in the problem-space are identified. These entities are meant to be consumed *atomically* for each request. In other words, the database system isn't designed to retrieve a single entity's data in multiple query requests.
23+
* There's an understanding of *read and write requirements* for the database system. These requirements guide the optimizations needed for the graph data model.
24+
* The principles of the [Apache TinkerPop property graph standard](https://tinkerpop.apache.org/docs/current/reference/#graph-computing) are well understood.
2425

2526
## When do I need a graph database?
2627

27-
A graph database solution can be optimally applied if the entities and relationships in a data domain have any of the following characteristics:
28+
A graph database solution can be optimally used if the entities and relationships in a data domain have any of the following characteristics:
2829

29-
* The entities are **highly connected** through descriptive relationships. The benefit in this scenario is the fact that the relationships are persisted in storage.
30-
* There are **cyclic relationships** or **self-referenced entities**. This pattern is often a challenge when using relational or document databases.
31-
* There are **dynamically evolving relationships** between entities. This pattern is especially applicable to hierarchical or tree-structured data with many levels.
32-
* There are **many-to-many relationships** between entities.
33-
* There are **write and read requirements on both entities and relationships**.
30+
* The entities are *highly connected* through descriptive relationships. The benefit in this scenario is that the relationships persist in storage.
31+
* There are *cyclic relationships* or *self-referenced entities*. This pattern is often a challenge when you use relational or document databases.
32+
* There are *dynamically evolving relationships* between entities. This pattern is especially applicable to hierarchical or tree-structured data with many levels.
33+
* There are *many-to-many relationships* between entities.
34+
* There are *write and read requirements on both entities and relationships*.
3435

35-
If the above criteria is satisfied, it's likely that a graph database approach will provide advantages for **query complexity**, **data model scalability**, and **query performance**.
36+
If the above criteria are satisfied, a graph database approach likely provides advantages for *query complexity*, *data model scalability*, and *query performance*.
3637

37-
The next step is to determine if the graph is going to be used for analytic or transactional purposes. If the graph is intended to be used for heavy computation and data processing workloads, it would be worth to explore the [Cosmos DB Spark connector](../nosql/quickstart-spark.md) and the use of the [GraphX library](https://spark.apache.org/graphx/).
38+
The next step is to determine if the graph is going to be used for analytic or transactional purposes. If the graph is intended to be used for heavy computation and data processing workloads, it's worth exploring the [Cosmos DB Spark connector](../nosql/quickstart-spark.md) and the [GraphX library](https://spark.apache.org/graphx/).
3839

3940
## How to use graph objects
4041

41-
The [Apache Tinkerpop property graph standard](https://tinkerpop.apache.org/docs/current/reference/#graph-computing) defines two types of objects **Vertices** and **Edges**.
42+
The [Apache TinkerPop property graph standard](https://tinkerpop.apache.org/docs/current/reference/#graph-computing) defines two types of objects: *vertices* and *edges*.
4243

43-
The following are the best practices for the properties in the graph objects:
44+
The following are best practices for the properties in the graph objects:
4445

4546
| Object | Property | Type | Notes |
4647
| --- | --- | --- | --- |
47-
| Vertex | ID | String | Uniquely enforced per partition. If a value isn't supplied upon insertion, an auto-generated GUID will be stored. |
48-
| Vertex | label | String | This property is used to define the type of entity that the vertex represents. If a value isn't supplied, a default value "vertex" will be used. |
49-
| Vertex | properties | String, Boolean, Numeric | A list of separate properties stored as key-value pairs in each vertex. |
50-
| Vertex | partition key | String, Boolean, Numeric | This property defines where the vertex and its outgoing edges will be stored. Read more about [graph partitioning](partitioning.md). |
51-
| Edge | ID | String | Uniquely enforced per partition. Auto-generated by default. Edges usually don't have the need to be uniquely retrieved by an ID. |
52-
| Edge | label | String | This property is used to define the type of relationship that two vertices have. |
53-
| Edge | properties | String, Boolean, Numeric | A list of separate properties stored as key-value pairs in each edge. |
48+
| Vertex | ID | String | Uniquely enforced per partition. If a value isn't supplied upon insertion, an auto-generated GUID is stored. |
49+
| Vertex | Label | String | This property is used to define the type of entity that the vertex represents. If a value isn't supplied, a default value *vertex* is used. |
50+
| Vertex | Properties | String, boolean, numeric | A list of separate properties stored as key-value pairs in each vertex. |
51+
| Vertex | Partition key | String, boolean, numeric | This property defines where the vertex and its outgoing edges are stored. Read more about [graph partitioning](partitioning.md). |
52+
| Edge | ID | String | Uniquely enforced per partition. Auto-generated by default. Edges usually don't need to be uniquely retrieved by an ID. |
53+
| Edge | Label | String | This property is used to define the type of relationship that two vertices have. |
54+
| Edge | Properties | String, boolean, numeric | A list of separate properties stored as key-value pairs in each edge. |
5455

5556
> [!NOTE]
56-
> Edges don't require a partition key value, since its value is automatically assigned based on their source vertex. Learn more in the [graph partitioning](partitioning.md) article.
57+
> Edges don't require a partition key value, since the value is automatically assigned based on their source vertex. Learn more in [Using a partitioned graph in Azure Cosmos DB](partitioning.md).
5758
5859
## Entity and relationship modeling guidelines
5960

60-
The following are a set of guidelines to approach data modeling for an Azure Cosmos DB for Gremlin graph database. These guidelines assume that there's an existing definition of a data domain and queries for it.
61+
The following guidelines help you approach data modeling for an [Azure Cosmos DB for Apache Gremlin](introduction.md) graph database. These guidelines assume that there's an existing definition of a data domain and queries for it.
6162

6263
> [!NOTE]
63-
> The steps outlined below are presented as recommendations. The final model should be evaluated and tested before its consideration as production-ready. Additionally, the recommendations below are specific to Azure Cosmos DB's Gremlin API implementation.
64+
> The following steps are presented as recommendations. You should evaluate and test the final model before considering it as production-ready. Additionally, the recommendations are specific to Azure Cosmos DB's Gremlin API implementation.
6465
6566
### Modeling vertices and properties
6667

67-
The first step for a graph data model is to map every identified entity to a **vertex object**. A one to one mapping of all entities to vertices should be an initial step and subject to change.
68+
The first step for a graph data model is to map every identified entity to a *vertex object*. A one-to-one mapping of all entities to vertices should be an initial step and subject to change.
6869

69-
One common pitfall is to map properties of a single entity as separate vertices. Consider the example below, where the same entity is represented in two different ways:
70+
One common pitfall is to map properties of a single entity as separate vertices. Consider the following example, where the same entity is represented in two different ways:
7071

7172
* **Vertex-based properties**: In this approach, the entity uses three separate vertices and two edges to describe its properties. While this approach might reduce redundancy, it increases model complexity. An increase in model complexity can result in added latency, query complexity, and computation cost. This model can also present challenges in partitioning.
7273

73-
:::image type="content" source="./media/modeling/graph-modeling-1.png" alt-text="Entity model with vertices for properties." border="false":::
74+
:::image type="content" source="./media/modeling/graph-modeling-1.png" alt-text="Diagram of entity model with vertices for properties.":::
7475

75-
* **Property-embedded vertices**: This approach takes advantage of the key-value pair list to represent all the properties of the entity inside a vertex. This approach provides reduced model complexity, which will lead to simpler queries and more cost-efficient traversals.
76+
* **Property-embedded vertices**: This approach takes advantage of the key-value pair list to represent all the properties of the entity inside a vertex. This approach reduces model complexity, which leads to simpler queries and more cost-efficient traversals.
7677

77-
:::image type="content" source="./media/modeling/graph-modeling-2.png" alt-text="Diagram shows the Luis vertex from the previous diagram with i d, label, and properties." border="false":::
78+
:::image type="content" source="./media/modeling/graph-modeling-2.png" alt-text="Diagram of the Luis vertex from the previous diagram with ID, label, and properties.":::
7879

7980
> [!NOTE]
80-
> The above examples show a simplified graph model to only show the comparison between the two ways of dividing entity properties.
81+
> The preceding diagrams show a simplified graph model that only compares the two ways of dividing entity properties.
8182
82-
The **property-embedded vertices** pattern generally provides a more performant and scalable approach. The default approach to a new graph data model should gravitate towards this pattern.
83+
The property-embedded vertices pattern generally provides a more performant and scalable approach. The default approach to a new graph data model should gravitate toward this pattern.
8384

84-
However, there are scenarios where referencing to a property might provide advantages. For example: if the referenced property is updated frequently. Using a separate vertex to represent a property that is constantly changed would minimize the amount of write operations that the update would require.
85+
However, there are scenarios where referencing a property might provide advantages, for example, when the referenced property is updated frequently. Using a separate vertex to represent a property that's constantly changing minimizes the number of write operations that the update requires.
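
The trade-off between the two patterns described in this section can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the `Vertex` and `Edge` classes, the helper functions, and the sample property values are hypothetical and aren't part of Azure Cosmos DB or the Gremlin API. The sketch only counts how many graph objects each pattern creates for one entity with two properties.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """Toy vertex: an ID, a label, and key-value properties (hypothetical model)."""
    id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """Toy directed edge from a source vertex to a target vertex."""
    label: str
    source: str
    target: str


def vertex_based(entity_id, props):
    """Vertex-based properties: one extra vertex and one edge per property."""
    vertices = [Vertex(entity_id, "person")]
    edges = []
    for key, value in props.items():
        prop_id = f"{entity_id}-{key}"
        vertices.append(Vertex(prop_id, key, {"value": value}))
        edges.append(Edge(f"has_{key}", entity_id, prop_id))
    return vertices, edges


def property_embedded(entity_id, props):
    """Property-embedded vertex: all properties live inside a single vertex."""
    return [Vertex(entity_id, "person", dict(props))], []


# Sample values are made up for illustration.
sample_props = {"age": 44, "city": "Seattle"}

vb_vertices, vb_edges = vertex_based("luis", sample_props)
pe_vertices, pe_edges = property_embedded("luis", sample_props)

print(len(vb_vertices), len(vb_edges))  # objects created by the vertex-based pattern
print(len(pe_vertices), len(pe_edges))  # objects created by the property-embedded pattern
```

In this sketch, reading the whole entity in the vertex-based pattern means following every property edge, while the property-embedded pattern needs only the single vertex.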
8586

86-
### Relationship modeling with edge directions
87+
### Relationship models with edge directions
8788

88-
After the vertices are modeled, the edges can be added to denote the relationships between them. The first aspect that needs to be evaluated is the **direction of the relationship**.
89+
After the vertices are modeled, the edges can be added to denote the relationships between them. The first aspect that needs to be evaluated is the *direction of the relationship*.
8990

90-
Edge objects have a default direction that is followed by a traversal when using the `out()` or `outE()` function. Using this natural direction results in an efficient operation, since all vertices are stored with their outgoing edges.
91+
Edge objects have a default direction that's followed by a traversal when using the `out()` or `outE()` functions. Using this natural direction results in an efficient operation, since all vertices are stored with their outgoing edges.
9192

92-
However, traversing in the opposite direction of an edge, using the `in()` function, will always result in a cross-partition query. Learn more about [graph partitioning](partitioning.md). If there's a need to constantly traverse using the `in()` function, it's recommended to add edges in both directions.
93+
However, traversing in the opposite direction of an edge, using the `in()` function, always results in a cross-partition query. Learn more about [graph partitioning](partitioning.md). If there's a need to constantly traverse using the `in()` function, it's recommended to add edges in both directions.
9394

94-
You can determine the edge direction by using the `.to()` or `.from()` predicates to the `.addE()` Gremlin step. Or by using the [bulk executor library for Gremlin API](bulk-executor-dotnet.md).
95+
You can set the edge direction by using the `.to()` or `.from()` step modulators with the `.addE()` Gremlin step, or by using the [bulk executor library for Gremlin API](bulk-executor-dotnet.md).
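
The reasoning about edge direction can be illustrated with a small simulation. This is a minimal sketch, assuming only what the text states: edges are stored in the partition of their source vertex. The `ToyPartitionedGraph` class, its method names, and the vertex and partition names are hypothetical and don't reflect the actual storage engine. Each traversal returns the matching vertices together with the number of partitions it had to read.

```python
class ToyPartitionedGraph:
    """Hypothetical model: each edge is stored in its source vertex's partition."""

    def __init__(self):
        self.partition_of = {}         # vertex ID -> partition key
        self.edges_by_partition = {}   # partition key -> list of (source, label, target)

    def add_vertex(self, vertex_id, partition_key):
        self.partition_of[vertex_id] = partition_key
        self.edges_by_partition.setdefault(partition_key, [])

    def add_edge(self, source_id, label, target_id):
        # The edge inherits the partition of its source vertex.
        partition_key = self.partition_of[source_id]
        self.edges_by_partition[partition_key].append((source_id, label, target_id))

    def out(self, vertex_id):
        """Follow outgoing edges: only the vertex's own partition is read."""
        partition_key = self.partition_of[vertex_id]
        edges = self.edges_by_partition[partition_key]
        return [t for s, _, t in edges if s == vertex_id], 1

    def in_(self, vertex_id):
        """Follow incoming edges: every partition must be scanned."""
        sources = []
        for edges in self.edges_by_partition.values():
            sources.extend(s for s, _, t in edges if t == vertex_id)
        return sources, len(self.edges_by_partition)


graph = ToyPartitionedGraph()
graph.add_vertex("alice", "p1")
graph.add_vertex("bob", "p2")
graph.add_vertex("carol", "p3")
graph.add_edge("alice", "knows", "bob")

out_result = graph.out("alice")   # reads one partition
in_result = graph.in_("bob")      # scans all three partitions

# Adding an edge in the opposite direction lets the reverse lookup use out().
graph.add_edge("bob", "knownBy", "alice")
reverse_result = graph.out("bob")
```

In this toy model, the reverse edge turns a scan of every partition into a read of one partition, at the cost of storing and maintaining a second edge.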
9596

9697
> [!NOTE]
9798
> Edge objects have a direction by default.
9899
99-
### Relationship labeling
100+
### Relationship labels
100101

101-
Using descriptive relationship labels can improve the efficiency of edge resolution operations. This pattern can be applied in the following ways:
102+
Using descriptive relationship labels can improve the efficiency of edge resolution operations. You can apply this pattern in the following ways:
102103
* Use non-generic terms to label a relationship.
103104
* Associate the label of the source vertex to the label of the target vertex with the relationship name.
104105

105-
:::image type="content" source="./media/modeling/graph-modeling-3.png" alt-text="Relationship labeling examples." border="false":::
106+
:::image type="content" source="./media/modeling/graph-modeling-3.png" alt-text="Diagram of relationship labeling examples.":::
106107

107-
The more specific the label that the traverser will use to filter the edges, the better. This decision can have a significant impact on query cost as well. You can evaluate the query cost at any time [using the executionProfile step](execution-profile.md).
108+
The more specific the label that the traverser uses to filter the edges, the better. This decision can have a significant effect on query cost as well. You can evaluate the query cost at any time by using the [executionProfile step](execution-profile.md).
108109

110+
## Next steps
109111

110-
## Next steps:
111-
* Check out the list of supported [Gremlin steps](support.md).
112+
* Check out the list of [supported Gremlin steps](support.md).
112113
* Learn about [graph database partitioning](partitioning.md) to deal with large-scale graphs.
113-
* Evaluate your Gremlin queries using the [Execution Profile step](execution-profile.md).
114-
* Third-party Graph [design data model](modeling-tools.md)
114+
* Evaluate your Gremlin queries using the [execution profile step](execution-profile.md).
115+
* Explore third-party graph [data model design tools](modeling-tools.md).
