| 1 | +--- |
| 2 | +description: Storing JSON and hashes with RedisVL |
| 3 | +linkTitle: JSON vs. hash storage |
| 4 | +title: JSON vs. hash storage |
| 5 | +type: integration |
| 6 | +weight: 6 |
| 7 | +--- |
| 8 | + |
| 9 | +Out of the box, Redis provides a [variety of data structures](https://redis.com/redis-enterprise/data-structures/) that you can use for your domain-specific applications and use cases. |
| 10 | +In this document, you will learn how to use RedisVL with both [hash]({{< relref "/develop/data-types/hashes" >}}) and [JSON]({{< relref "/develop/data-types/json/" >}}) data. |
| 11 | + |
| 12 | +{{< note >}} |
| 13 | +This document is a converted form of [this Jupyter notebook](https://github.com/redis/redis-vl-python/blob/main/docs/user_guide/05_hash_vs_json.ipynb). |
| 14 | +{{< /note >}} |
| 15 | + |
| 16 | +Before beginning, be sure of the following: |
| 17 | + |
| 18 | +1. You have installed RedisVL and have that environment activated. |
| 19 | +1. You have a running Redis instance with the search and query capability. |
| 20 | + |
| 21 | +```python |
| 22 | +# import necessary modules |
| 23 | +import pickle |
| 24 | + |
| 25 | +from redisvl.redis.utils import buffer_to_array |
| 26 | +from jupyterutils import result_print, table_print |
| 27 | +from redisvl.index import SearchIndex |
| 28 | + |
| 29 | +# load in the example data and printing utils |
| 30 | +data = pickle.load(open("hybrid_example_data.pkl", "rb")) |
| 31 | +``` |
| 32 | + |
| 33 | +```python |
| 34 | +table_print(data) |
| 35 | +``` |
| 36 | + |
| 37 | +<table><tr><th>user</th><th>age</th><th>job</th><th>credit_score</th><th>office_location</th><th>user_embedding</th></tr><tr><td>john</td><td>18</td><td>engineer</td><td>high</td><td>-122.4194,37.7749</td><td>b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'</td></tr><tr><td>derrick</td><td>14</td><td>doctor</td><td>low</td><td>-122.4194,37.7749</td><td>b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'</td></tr><tr><td>nancy</td><td>94</td><td>doctor</td><td>high</td><td>-122.4194,37.7749</td><td>b'333?\xcd\xcc\xcc=\x00\x00\x00?'</td></tr><tr><td>tyler</td><td>100</td><td>engineer</td><td>high</td><td>-122.0839,37.3861</td><td>b'\xcd\xcc\xcc=\xcd\xcc\xcc>\x00\x00\x00?'</td></tr><tr><td>tim</td><td>12</td><td>dermatologist</td><td>high</td><td>-122.0839,37.3861</td><td>b'\xcd\xcc\xcc>\xcd\xcc\xcc>\x00\x00\x00?'</td></tr><tr><td>taimur</td><td>15</td><td>CEO</td><td>low</td><td>-122.0839,37.3861</td><td>b'\x9a\x99\x19?\xcd\xcc\xcc=\x00\x00\x00?'</td></tr><tr><td>joe</td><td>35</td><td>dentist</td><td>medium</td><td>-122.0839,37.3861</td><td>b'fff?fff?\xcd\xcc\xcc='</td></tr></table> |
| 38 | + |
| 39 | + |
| 40 | +## Hash or JSON - how to choose? |
| 41 | + |
| 42 | +Both storage options offer a variety of features and tradeoffs. Below, you will work through a dummy dataset to learn when and how to use both data types. |
| 43 | + |
| 44 | +### Working with hashes |
| 45 | + |
| 46 | +Hashes in Redis are simple collections of field-value pairs. Think of a hash as a mutable, single-level dictionary that contains multiple "rows": |
| 47 | + |
| 48 | +```python |
| 49 | +{ |
| 50 | + "model": "Deimos", |
| 51 | + "brand": "Ergonom", |
| 52 | + "type": "Enduro bikes", |
| 53 | + "price": 4972, |
| 54 | +} |
| 55 | +``` |
| 56 | + |
| 57 | +Hashes are best suited for use cases with the following characteristics: |
| 58 | + |
| 59 | +- Performance (speed) and storage space (memory consumption) are top concerns. |
| 60 | +- Data can be easily normalized and modeled as a single-level dictionary. |
| 61 | + |
| 62 | +> Hashes are typically the default recommendation. |
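
To illustrate what "modeled as a single-level dictionary" can mean in practice, here is a minimal sketch that flattens a nested record into field-value pairs suitable for a hash. The `flatten` helper and the `_` separator are hypothetical and are not part of RedisVL.

```python
def flatten(doc, parent_key="", sep="_"):
    """Flatten a nested dict into a single-level dict (hypothetical helper)."""
    flat = {}
    for key, value in doc.items():
        # Prefix nested keys with their parent key, e.g. "metadata_model"
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


nested = {
    "name": "bike",
    "metadata": {
        "model": "Deimos",
        "brand": "Ergonom",
        "type": "Enduro bikes",
        "price": 4972,
    },
}

flat = flatten(nested)
print(flat)
```

If a record flattens cleanly like this, a hash is likely a good fit. If flattening loses structure that you need, JSON may be the better choice.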
| 63 | + |
| 64 | +```python |
| 65 | +# define the hash index schema |
| 66 | +hash_schema = { |
| 67 | + "index": { |
| 68 | + "name": "user-hash", |
| 69 | + "prefix": "user-hash-docs", |
| 70 | + "storage_type": "hash", # default setting -- HASH |
| 71 | + }, |
| 72 | + "fields": [ |
| 73 | + {"name": "user", "type": "tag"}, |
| 74 | + {"name": "credit_score", "type": "tag"}, |
| 75 | + {"name": "job", "type": "text"}, |
| 76 | + {"name": "age", "type": "numeric"}, |
| 77 | + {"name": "office_location", "type": "geo"}, |
| 78 | + { |
| 79 | + "name": "user_embedding", |
| 80 | + "type": "vector", |
| 81 | + "attrs": { |
| 82 | + "dims": 3, |
| 83 | + "distance_metric": "cosine", |
| 84 | + "algorithm": "flat", |
| 85 | + "datatype": "float32" |
| 86 | + } |
| 87 | + } |
| 88 | + ], |
| 89 | +} |
| 90 | +``` |
| 91 | + |
| 92 | +```python |
| 93 | +# construct a search index from the hash schema |
| 94 | +hindex = SearchIndex.from_dict(hash_schema) |
| 95 | + |
| 96 | +# connect to local redis instance |
| 97 | +hindex.connect("redis://localhost:6379") |
| 98 | + |
| 99 | +# create the index (no data yet) |
| 100 | +hindex.create(overwrite=True) |
| 101 | +``` |
| 102 | + |
| 103 | +```python |
| 104 | +# show the underlying storage type |
| 105 | +hindex.storage_type |
| 106 | + |
| 107 | + <StorageType.HASH: 'hash'> |
| 108 | +``` |
| 109 | + |
| 110 | +#### Vectors as byte strings |
| 111 | + |
| 112 | +One nuance when working with hashes in Redis is that all vectorized data must be passed as a byte string (for efficient storage, indexing, and processing). An example of this can be seen below: |
| 113 | + |
| 114 | + |
| 115 | +```python |
| 116 | +# show a single entry from the data that will be loaded |
| 117 | +data[0] |
| 118 | + |
| 119 | + {'user': 'john', |
| 120 | + 'age': 18, |
| 121 | + 'job': 'engineer', |
| 122 | + 'credit_score': 'high', |
| 123 | + 'office_location': '-122.4194,37.7749', |
| 124 | + 'user_embedding': b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'} |
| 125 | +``` |
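
The byte string in this record is the packed little-endian `float32` representation of the vector. The following sketch uses only Python's standard `struct` module (an illustrative choice, not the RedisVL helper) to show the conversion in both directions:

```python
import struct

vector = [0.1, 0.1, 0.5]

# Pack each value as a little-endian 32-bit float ("<" = little-endian, "f" = float32)
as_bytes = struct.pack(f"<{len(vector)}f", *vector)
print(as_bytes)

# Unpack the byte string back into a list of Python floats
restored = list(struct.unpack(f"<{len(as_bytes) // 4}f", as_bytes))
print(restored)
```

The restored values are only approximately equal to the inputs because `float32` cannot represent `0.1` exactly.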
| 126 | + |
| 127 | +```python |
| 128 | +# load hash data |
| 129 | +keys = hindex.load(data) |
| 130 | +``` |
| 131 | + |
| 132 | +```bash |
| 133 | +$ rvl stats -i user-hash |
| 134 | + |
| 135 | + Statistics: |
| 136 | + ╭─────────────────────────────┬─────────────╮ |
| 137 | + │ Stat Key │ Value │ |
| 138 | + ├─────────────────────────────┼─────────────┤ |
| 139 | + │ num_docs │ 7 │ |
| 140 | + │ num_terms │ 6 │ |
| 141 | + │ max_doc_id │ 7 │ |
| 142 | + │ num_records │ 44 │ |
| 143 | + │ percent_indexed │ 1 │ |
| 144 | + │ hash_indexing_failures │ 0 │ |
| 145 | + │ number_of_uses │ 1 │ |
| 146 | + │ bytes_per_record_avg │ 3.40909 │ |
| 147 | + │ doc_table_size_mb │ 0.000767708 │ |
| 148 | + │ inverted_sz_mb │ 0.000143051 │ |
| 149 | + │ key_table_size_mb │ 0.000248909 │ |
| 150 | + │ offset_bits_per_record_avg │ 8 │ |
| 151 | + │ offset_vectors_sz_mb │ 8.58307e-06 │ |
| 152 | + │ offsets_per_term_avg │ 0.204545 │ |
| 153 | + │ records_per_doc_avg │ 6.28571 │ |
| 154 | + │ sortable_values_size_mb │ 0 │ |
| 155 | + │ total_indexing_time │ 0.587 │ |
| 156 | + │ total_inverted_index_blocks │ 18 │ |
| 157 | + │ vector_index_sz_mb │ 0.0202332 │ |
| 158 | + ╰─────────────────────────────┴─────────────╯ |
| 159 | +``` |
| 160 | + |
| 161 | +#### Performing queries |
| 162 | + |
| 163 | +Once the index is created and the data is loaded in the right format, you can run queries against the index: |
| 164 | + |
| 165 | +```python |
| 166 | +from redisvl.query import VectorQuery |
| 167 | +from redisvl.query.filter import Tag, Text, Num |
| 168 | + |
| 169 | +t = (Tag("credit_score") == "high") & (Text("job") % "enginee*") & (Num("age") > 17) |
| 170 | + |
| 171 | +v = VectorQuery([0.1, 0.1, 0.5], |
| 172 | + "user_embedding", |
| 173 | + return_fields=["user", "credit_score", "age", "job", "office_location"], |
| 174 | + filter_expression=t) |
| 175 | + |
| 176 | + |
| 177 | +results = hindex.query(v) |
| 178 | +result_print(results) |
| 179 | + |
| 180 | +``` |
| 181 | + |
| 182 | +<table><tr><th>vector_distance</th><th>user</th><th>credit_score</th><th>age</th><th>job</th><th>office_location</th></tr><tr><td>0</td><td>john</td><td>high</td><td>18</td><td>engineer</td><td>-122.4194,37.7749</td></tr><tr><td>0.109129190445</td><td>tyler</td><td>high</td><td>100</td><td>engineer</td><td>-122.0839,37.3861</td></tr></table> |
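
As a sanity check on which rows the filter expression admits, here is a plain-Python approximation of the same three conditions applied to the sample records. It is only a sketch: Redis evaluates the filter inside the index, and `startswith` stands in loosely for the `enginee*` prefix match.

```python
# Non-vector fields copied from the sample data table above
records = [
    {"user": "john", "age": 18, "job": "engineer", "credit_score": "high"},
    {"user": "derrick", "age": 14, "job": "doctor", "credit_score": "low"},
    {"user": "nancy", "age": 94, "job": "doctor", "credit_score": "high"},
    {"user": "tyler", "age": 100, "job": "engineer", "credit_score": "high"},
    {"user": "tim", "age": 12, "job": "dermatologist", "credit_score": "high"},
    {"user": "taimur", "age": 15, "job": "CEO", "credit_score": "low"},
    {"user": "joe", "age": 35, "job": "dentist", "credit_score": "medium"},
]


def matches(record):
    """Approximate the tag, text-prefix, and numeric conditions of the filter."""
    return (
        record["credit_score"] == "high"
        and record["job"].lower().startswith("enginee")
        and record["age"] > 17
    )


matching_users = [r["user"] for r in records if matches(r)]
print(matching_users)
```

Only the records that pass the filter are then ranked by vector distance, which is why the result table contains two rows.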
| 183 | + |
| 184 | +```python |
| 185 | +# clean up |
| 186 | +hindex.delete() |
| 187 | +``` |
| 188 | + |
| 189 | +### Working with JSON |
| 190 | + |
| 191 | +Redis also supports native **JSON** objects. These can be multi-level (nested) objects, with full [JSONPath]({{< relref "/develop/data-types/json/" >}}path/) support for retrieving and updating sub-elements: |
| 192 | + |
| 193 | +```python |
| 194 | +{ |
| 195 | + "name": "bike", |
| 196 | + "metadata": { |
| 197 | + "model": "Deimos", |
| 198 | + "brand": "Ergonom", |
| 199 | + "type": "Enduro bikes", |
| 200 | + "price": 4972, |
| 201 | + } |
| 202 | +} |
| 203 | +``` |
| 204 | + |
| 205 | +JSON is best suited for use cases with the following characteristics: |
| 206 | + |
| 207 | +- Ease of use and data model flexibility are top concerns. |
| 208 | +- Application data is already native JSON. |
| 209 | +- Replacing another document storage/database solution. |
| 210 | + |
| 211 | +#### Full JSONPath support |
| 212 | + |
| 213 | +Because Redis supports full JSONPath, each field in a JSON index schema is indexed and selected by its path. Give each field a `name` and a `path` that points to where the data is located within the object. |
| 214 | + |
| 215 | +{{< note >}} |
| 216 | +If a field in a JSON schema does not provide a `path`, RedisVL assumes `$.{name}` by default. |
| 217 | +{{< /note >}} |
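
To make the default concrete, the sketch below describes fields for a nested document like the bike example above and resolves each field's effective path. The field definitions assume that a `path` key sits alongside `name` and `type` in each field entry, and `resolved_path` is a hypothetical helper that mirrors the documented default rather than RedisVL's own code.

```python
# Hypothetical field definitions for a nested document such as
# {"name": "bike", "metadata": {"brand": "Ergonom", "price": 4972}}
fields = [
    {"name": "name", "type": "tag"},  # no path given: falls back to the default
    {"name": "brand", "type": "tag", "path": "$.metadata.brand"},
    {"name": "price", "type": "numeric", "path": "$.metadata.price"},
]


def resolved_path(field):
    """Return the explicit path if present, otherwise the default $.{name}."""
    return field.get("path", f"$.{field['name']}")


paths = {field["name"]: resolved_path(field) for field in fields}
print(paths)
```

Top-level fields, like those in the user schema that follows, can therefore omit `path` entirely.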
| 218 | + |
| 219 | +```python |
| 220 | +# define the json index schema |
| 221 | +json_schema = { |
| 222 | + "index": { |
| 223 | + "name": "user-json", |
| 224 | + "prefix": "user-json-docs", |
| 225 | + "storage_type": "json", # JSON storage type |
| 226 | + }, |
| 227 | + "fields": [ |
| 228 | + {"name": "user", "type": "tag"}, |
| 229 | + {"name": "credit_score", "type": "tag"}, |
| 230 | + {"name": "job", "type": "text"}, |
| 231 | + {"name": "age", "type": "numeric"}, |
| 232 | + {"name": "office_location", "type": "geo"}, |
| 233 | + { |
| 234 | + "name": "user_embedding", |
| 235 | + "type": "vector", |
| 236 | + "attrs": { |
| 237 | + "dims": 3, |
| 238 | + "distance_metric": "cosine", |
| 239 | + "algorithm": "flat", |
| 240 | + "datatype": "float32" |
| 241 | + } |
| 242 | + } |
| 243 | + ], |
| 244 | +} |
| 245 | +``` |
| 246 | + |
| 247 | +```python |
| 248 | +# construct a search index from the JSON schema |
| 249 | +jindex = SearchIndex.from_dict(json_schema) |
| 250 | + |
| 251 | +# connect to a local redis instance |
| 252 | +jindex.connect("redis://localhost:6379") |
| 253 | + |
| 254 | +# create the index (no data yet) |
| 255 | +jindex.create(overwrite=True) |
| 256 | +``` |
| 257 | + |
| 258 | +```bash |
| 259 | +# list the indices in the database (the hash index was deleted above, so only the JSON index remains) |
| 260 | +$ rvl index listall |
| 261 | + |
| 262 | + 20:23:08 [RedisVL] INFO Indices: |
| 263 | + 20:23:08 [RedisVL] INFO 1. user-json |
| 264 | +``` |
| 265 | +#### Vectors as float arrays |
| 266 | + |
| 267 | +Vector data in JSON must be stored as a plain array (for example, a Python list) of floats. Modify your sample data to account for this: |
| 268 | + |
| 269 | +```python |
| 270 | +import numpy as np |
| 271 | + |
| 272 | +json_data = [d.copy() for d in data]  # copy each record so the original `data` is not mutated |
| 273 | + |
| 274 | +for d in json_data: |
| 275 | + d['user_embedding'] = buffer_to_array(d['user_embedding'], dtype=np.float32) |
| 276 | +``` |
| 277 | + |
| 278 | +```python |
| 279 | +# inspect a single JSON record |
| 280 | +json_data[0] |
| 281 | +``` |
| 282 | + |
| 283 | + {'user': 'john', |
| 284 | + 'age': 18, |
| 285 | + 'job': 'engineer', |
| 286 | + 'credit_score': 'high', |
| 287 | + 'office_location': '-122.4194,37.7749', |
| 288 | + 'user_embedding': [0.10000000149011612, 0.10000000149011612, 0.5]} |
| 289 | + |
| 290 | + |
| 291 | +```python |
| 292 | +keys = jindex.load(json_data) |
| 293 | +``` |
| 294 | + |
| 295 | +```python |
| 296 | +# we can now run the exact same query as above |
| 297 | +result_print(jindex.query(v)) |
| 298 | +``` |
| 299 | + |
| 300 | +<table><tr><th>vector_distance</th><th>user</th><th>credit_score</th><th>age</th><th>job</th><th>office_location</th></tr><tr><td>0</td><td>john</td><td>high</td><td>18</td><td>engineer</td><td>-122.4194,37.7749</td></tr><tr><td>0.109129190445</td><td>tyler</td><td>high</td><td>100</td><td>engineer</td><td>-122.0839,37.3861</td></tr></table> |
| 301 | + |
| 302 | +## Cleanup |
| 303 | + |
| 304 | +```python |
| 305 | +jindex.delete() |
| 306 | +``` |