
Commit fe191b4

More documentation on sharding
1 parent ca53320 commit fe191b4

2 files changed: 37 additions & 6 deletions


docs/deployments/configuration.md

Lines changed: 2 additions & 2 deletions

@@ -900,8 +900,8 @@ The reclamation section provides configuration for the reclamation process, which
 ```yaml
 storage:
   reclamation:
-    threshold: 0.4 # Start storage reclamation efforts when free space has reached 40% of the volume space (default)
-    interval: 1h # Reclamation will run every hour (default)
+    threshold: 0.4 # Start storage reclamation efforts when free space has reached 40% of the volume space (default)
+    interval: 1h # Reclamation will run every hour (default)
     evictionFactor: 100000 # A factor used to determine how aggressively to evict cached entries (default)
 ```

docs/developers/replication/sharding.md

Lines changed: 35 additions & 4 deletions

@@ -1,6 +1,10 @@
 Harper's replication system supports various levels of replication and sharding. Harper can be configured to replicate different data to different subsets of nodes. This can be used to facilitate horizontal scalability of storage and write performance, while maintaining optimal strategies for data locality and data consistency. When sharding is configured, Harper will replicate data to only a subset of nodes, based on the sharding configuration, and can then retrieve data from the appropriate nodes as needed to fulfill requests for data.
 
-## Configuration
+There are two main ways to set up sharding in Harper. The first approach is dynamic sharding, where the location or residency of records is determined dynamically based on where the record was written and on the record's data, and records can be dynamically relocated based on where they are accessed. This residency information can be specific to each record and can vary based on where the data is written and accessed.
+
+The second approach is to define specific shards, where each node is assigned to a specific shard and each record is replicated to the nodes in that shard based on its primary key, regardless of where the data was written or accessed, or of its content. This approach is more static, but can be more efficient for certain use cases, and it means that the location of data can always be predictably determined from the primary key.
+
+## Configuration for Dynamic Sharding
 By default, Harper will replicate all data to all nodes. However, replication can easily be configured for "sharding", or storing different data in different locations or nodes. The simplest way to configure sharding and limit replication to improve performance and efficiency is to configure a replicate-to count. This will limit the number of nodes that data is replicated to. For example, to specify that writes should replicate to 2 other nodes besides the node that first stored the data, you can set `replicateTo` to 2 in the `replication` section of the `harperdb-config.yaml` file:
 ```yaml
 replication:
@@ -60,21 +64,48 @@ class MyTable extends tables.MyTable {
 }
 ```
 
+## Configuration for Static Sharding
+Alternatively, you can configure static sharding, where each node is assigned to a specific shard and each record is replicated to the nodes in that shard based on its primary key. A shard is identified by a number. To configure the shard for each node, specify the shard number in the `routes` entries of the `replication` section of the configuration:
+```yaml
+replication:
+  routes:
+    - hostname: node1
+      shard: 1
+    - hostname: node2
+      shard: 2
+```
+Or you can dynamically assign a node to a shard by including a `shard` property in an `add_node` or `set_node` operation.
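As a rough sketch of what such a dynamic assignment could look like when sent to a node's operations API, the example below posts a `set_node` operation that includes a `shard` property; the endpoint URL, port, and credentials are placeholders, and the exact request shape beyond the `operation`, `hostname`, and `shard` properties named above is an assumption:
```javascript
// Sketch: assign node2 to shard 2 at runtime with a set_node operation.
// The URL, port, and credentials below are placeholders; adjust them for your cluster.
const response = await fetch('https://node1.example.com:9925', {
	method: 'POST',
	headers: {
		'Content-Type': 'application/json',
		Authorization: 'Basic ' + btoa('admin:password'),
	},
	body: JSON.stringify({ operation: 'set_node', hostname: 'node2', shard: 2 }),
});
console.log(await response.json());
```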
+
+You can then return a shard number from the `setResidency` or `setResidencyById` functions described below.
+
 ## Custom Sharding
 You can also define a custom sharding strategy by specifying a function to compute the "residency", or the location where records should be stored and reside. To do this, we use the `setResidency` method, providing a function that will determine the residency of each record. The function you provide will be called with the record entry, and should return an array of nodes that the record should be replicated to (using their hostnames). For example, to shard records based on the value of the `id` field, you can use the following code:
 ```javascript
 MyTable.setResidency((record) => {
 	return record.id % 2 === 0 ? ['node1'] : ['node2'];
 });
 ```
-With this approach, the record metadata (which includes the residency information) and any indexed properties will be replicated to all nodes, but the full record will only be replicated to the nodes specified by the residency function.
+With this approach, the record metadata (which includes the residency information) and any indexed properties will be replicated to all nodes, but the full record will only be replicated to the nodes specified by the residency function.
+
+The `setResidency` function can alternately return a shard number, which will replicate the data to all the nodes in that shard:
+```javascript
+MyTable.setResidency((record) => {
+	return record.id % 2 === 0 ? 1 : 2;
+});
+```
 
 ### Custom Sharding By Primary Key
-Alternately, you can define a custom sharding strategy based on the primary key alone. This allows records to be retrieved without needing access to the record data or metadata. With this approach, data will only be replicated to the nodes specified by the residency function (the record metadata doesn't need to be replicated to all nodes). To do this, you can use the `setResidencyById` method, providing a function that will determine the residency of each record based on the primary key. The function you provide will be called with the primary key, and should return an array of nodes that the record should be replicated to (using their hostnames). For example, to shard records based on the value of the primary key, you can use the following code:
+Alternately, you can define a custom sharding strategy based on the primary key alone. This allows records to be retrieved without needing access to the record data or metadata. With this approach, data will only be replicated to the nodes specified by the residency function (the record metadata doesn't need to be replicated to all nodes). To do this, you can use the `setResidencyById` method, providing a function that will determine the residency or shard of each record based on the primary key. The function you provide will be called with the primary key, and should return a `shard` number or an array of nodes that the record should be replicated to (using their hostnames). For example, to shard records based on the value of the primary key, you can use the following code:
 
 ```javascript
 MyTable.setResidencyById((id) => {
-	return id % 2 === 0 ? ['node1'] : ['node2'];
+	return id % 2 === 0 ? 1 : 2; // return a shard number
+});
+```
+or
+```javascript
+MyTable.setResidencyById((id) => {
+	return id % 2 === 0 ? ['node1'] : ['node2']; // return an array of node hostnames
 });
 ```

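As a further illustration of the shard-number form, here is a minimal sketch of a `setResidencyById` function that spreads records across four shards by hashing the primary key; the hashing scheme and the shard numbers 1 through 4 are illustrative assumptions, and those shards are assumed to have been assigned to nodes through the `routes` configuration shown earlier:
```javascript
// Sketch: distribute records across 4 shards by hashing the primary key.
// Shard numbers 1-4 are assumed to be assigned to nodes in the replication routes.
const SHARD_COUNT = 4;

MyTable.setResidencyById((id) => {
	// Derive a stable non-negative integer hash from the key (works for string or numeric keys)
	let hash = 0;
	for (const char of String(id)) {
		hash = (hash * 31 + char.charCodeAt(0)) >>> 0;
	}
	return (hash % SHARD_COUNT) + 1; // shard number in the range 1..SHARD_COUNT
});
```
Because the result depends only on the primary key, any node can compute where a record resides without consulting the record's data or metadata.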
