Commit b6b27df

Applied min's feedback
1 parent 5c9cf73 commit b6b27df

1 file changed

+18
-8
lines changed

62-cell-id/cell-id.md

Lines changed: 18 additions & 8 deletions
@@ -69,9 +69,9 @@ Relaxing the field to *optional* would lead to undesirable behavior. An optional
6969

7070
#### Reason for Character Restrictions (pattern, min/max length)
7171

72-
The [RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax)](https://www.ietf.org/rfc/rfc3986.txt) defines the unreserved characters allowed for URI generation. Since IDs should be usable as referencable points in web requests, we want to restrict characters to at least these characters. Of these remaining non-alphanumeric reserved characters (`-`, `.`, `_`, and `~`) three of them have semantic meaning or are restricted in URL generation leaving only alphanumeric and `-` as legal characters we want to support. This extra restriction also helps with storage of ids in databases, where non-ascii characters in identifiers can oftentimes lead to query, storage, or application bugs when not handled correctly. Since we don't have a pre-existing strong need for such characters (`.`, `_`, and `~`) in our `id` field, we propose not introducing the additional complexity of allowing these other characters here.
72+
The [RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax)](https://www.ietf.org/rfc/rfc3986.txt) defines the unreserved characters allowed for URI generation. Since IDs should be usable as referenceable points in web requests, we want to restrict ids to at most these characters. Of the non-alphanumeric unreserved characters (`-`, `.`, `_`, and `~`), one (`_`) has semantic meaning that doesn't impact our use case, and two (`.` and `~`) are restricted in URL generation, leaving only alphanumerics, `-`, and `_` as the legal characters we want to support. This extra restriction also helps with storage of ids in databases, where non-ASCII characters in identifiers can often lead to query, storage, or application bugs when not handled correctly. Since we don't have a pre-existing strong need for the remaining characters (`.` and `~`) in our `id` field, we propose not introducing the additional complexity of allowing them here.
7373

74-
The length restrictions are there for a few reasons. First, you don't want empty strings in your ids, so enforce some natural minimum. We could use 1 or 2 for accepting bascially any id pattern, or be more restrictive with a higher minimum to reserve a wider combination of min length ids (`63^k` combinations). Second, you want a fixed max length for string identifiers for indexable ids in many database solutions for both performance and ease of implementation concerns. These will certainly be used in recall mechanisms so ease of database use should be a strong criterion. Third, a UUID string takes 36 characters to represent (with the `-` characters), and we likely want to support this as a supported identity pattern for certain applications that want this.
74+
The length restrictions are there for a few reasons. First, you don't want empty strings as ids, so we enforce some natural minimum. We could use 1 or 2 to accept basically any id pattern, or be more restrictive with a higher minimum to reserve a wider combination of minimum-length ids (`63^k` combinations). Second, many database solutions want a fixed maximum length for indexable string identifiers, for both performance and ease of implementation. These ids will certainly be used in recall mechanisms, so ease of database use should be a strong criterion. Third, a UUID string takes 36 characters to represent (with the `-` characters), and we likely want to support this identity pattern for applications that want it. Thus we choose a length range of 1-64 characters to provide flexibility and some measure of consistency.
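
As a rough illustration of how these constraints combine, the sketch below checks a candidate id against a character class and the 1-64 length range. The function name `is_valid_cell_id` and the constants are hypothetical, and the character class assumes `_` is accepted as the paragraph above describes; drop it from the class if the schema pattern omits `_`.

```python
import re

# Assumption: alphanumerics plus `-` and `_`, following the prose above.
ALLOWED_CHARS = re.compile(r"[a-zA-Z0-9_-]+")
MIN_LENGTH = 1
MAX_LENGTH = 64


def is_valid_cell_id(cell_id: str) -> bool:
    """Return True if cell_id satisfies the length and character rules."""
    if not MIN_LENGTH <= len(cell_id) <= MAX_LENGTH:
        return False
    # fullmatch avoids `$` matching just before a trailing newline.
    return ALLOWED_CHARS.fullmatch(cell_id) is not None
```

Under these assumptions a canonical 36-character UUID string passes, while an empty string, an id containing `.` or `~`, or an id longer than 64 characters is rejected.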
7575

7676
### Updating older formats
7777

@@ -118,8 +118,8 @@ index e3dedf2..4f192e6 100644
118118
+ "description": "A string field representing the identifier of this particular cell.",
119119
+ "type": "string",
120120
+ "pattern": "^[a-zA-Z0-9-]+$",
121-
+ "minLength": 2,
122-
+ "maxLength": 36
121+
+ "minLength": 1,
122+
+ "maxLength": 64
123123
+ },
124124
+
125125
"cell": {
@@ -218,11 +218,21 @@ If bookkeeping of current cell ids is not desirable, a 64-bit random id (11 char
218218

219219
```python
220220
def get_cell_id(id_length=8):
221-
# Ok technically this isn't exactly a 64-bit k-length string... but it's close and easy to implement
222-
return str(uuid.uuid4())[:id_length]
221+
    n_bytes = max(id_length * 3 // 4, 1)
222+
    # since standard base64 uses + and /, which the proposed regex excludes, we need to use urlsafe_b64encode
223+
    return urlsafe_b64encode(os.urandom(n_bytes)).decode("ascii").rstrip("=")
223224
```
224225

225-
#### Option C: Join human-readable strings from a corpus randomly
226+
#### Option C: uuid-subset
227+
228+
Basically the same as Option B, just a different flavor of random generation.
229+
230+
```python
231+
def get_cell_id(id_length=8):
232+
    return uuid.uuid4().hex[:id_length]
233+
```
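
For comparison, here is a self-contained sketch of Options B and C with the imports they need. The names `get_cell_id_b64` and `get_cell_id_hex` are hypothetical, chosen only to keep the two variants side by side. Unlike the Option B snippet above, this version rounds the byte count up and trims the result so that the returned id is exactly `id_length` characters; that is an assumption about the intended behavior, not something the proposal specifies.

```python
import math
import os
import uuid
from base64 import urlsafe_b64encode


def get_cell_id_b64(id_length=8):
    """Option B sketch: random bytes in the URL-safe base64 alphabet."""
    # Every 3 bytes encode to 4 characters, so round up to have enough.
    n_bytes = max(math.ceil(id_length * 3 / 4), 1)
    encoded = urlsafe_b64encode(os.urandom(n_bytes)).decode("ascii")
    # Drop `=` padding and trim any surplus characters.
    return encoded.rstrip("=")[:id_length]


def get_cell_id_hex(id_length=8):
    """Option C sketch: leading hex digits of a random UUID (at most 32)."""
    return uuid.uuid4().hex[:id_length]
```

One difference worth keeping in mind: each base64 character carries 6 bits and each hex digit 4 bits, so at the same length the Option B ids have more possible values. The base64 variant can also emit `_`, so it only satisfies a pattern that allows that character.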
234+
235+
#### Option D: Join human-readable strings from a corpus randomly
226236

227237
One frequently used pattern for generating human-recognizable ids is to combine common words instead of arbitrary random bits. An id like `danger-noodle` is a lot easier for a person to remember or reference than `ZGFuZ2VyLW5vb2RsZQ==`. Below is how this could be achieved, though it requires a set of names to use in id generation. There are Python packages, as well as corpus CSV files, that make this convenient, but they would have to be added to the install dependencies.
228238

@@ -233,7 +243,7 @@ def get_cell_id(num_words=2):
233243

234244
#### Preference
235245

236-
Use Option B. Option C is also viable but adds a corpus requirement to the id generation step.
246+
Use Option D for the most human-readable ids, though it adds a corpus requirement to the id generation step. If a corpus is not desired, use Option B or C.
237247

238248
## Questions
239249
