Commit 43727f9

paulgirard authored and roll committed
Improve foreign key checks performance (#254)

* An alternative method to test relations
  Depends on a commit in datapackage-py
  Proof of concept to be discussed
* refacto, passing FK index opti into tableschema
* adding documentation on new method and params
1 parent a92c70a commit 43727f9

2 files changed: +84 -36 lines changed

README.md

Lines changed: 17 additions & 3 deletions
@@ -206,21 +206,22 @@ Constructor to instantiate `Table` class. If `references` argument is provided,

 - `(Schema)` - returns schema class instance

-#### `table.iter(keyed=False, extended=False, cast=True, relations=False)`
+#### `table.iter(keyed=False, extended=False, cast=True, relations=False, foreign_keys_values=False)`

 Iterates through the table data and emits rows cast based on table schema. Data casting can be disabled.

 - `keyed (bool)` - iterate keyed rows
 - `extended (bool)` - iterate extended rows
 - `cast (bool)` - disable data casting if false
-- `relations (dict)` - dictionary of foreign key references in a form of `{resource1: [{field1: value1, field2: value2}, ...], ...}`. If provided, foreign key fields will be checked and resolved to their references
+- `relations (dict)` - dictionary of foreign key references in a form of `{resource1: [{field1: value1, field2: value2}, ...], ...}`. If provided, foreign key fields will be checked and resolved to one of their references (/!\ one-to-many foreign keys are not completely resolved).
+- `foreign_keys_values (dict)` - three-level dictionary of foreign key references, optimized to speed up the validation process, in a form of `{resource1: { (foreign_key_field1, foreign_key_field2) : { (value1, value2) : {one_keyedrow}, ... }}}`. If not provided but `relations` is, it will be created before the validation process by the *index_foreign_keys_values* method
 - `(exceptions.TableSchemaException)` - raises any error that occurs during this process
 - `(any[]/any{})` - yields rows:
   - `[value1, value2]` - base
   - `{header1: value1, header2: value2}` - keyed
   - `[rowNumber, [header1, header2], [value1, value2]]` - extended

-#### `table.read(keyed=False, extended=False, cast=True, relations=False, limit=None)`
+#### `table.read(keyed=False, extended=False, cast=True, relations=False, limit=None, foreign_keys_values=False)`

 Read the whole table and returns as array of rows. Count of rows could be limited.

@@ -229,6 +230,7 @@ Read the whole table and returns as array of rows. Count of rows could be limite
 - `cast (bool)` - flag to disable data casting if false
 - `relations (dict)` - dict of foreign key references in a form of `{resource1: [{field1: value1, field2: value2}, ...], ...}`. If provided, foreign key fields will be checked and resolved to their references
 - `limit (int)` - integer limit of rows to return
+- `foreign_keys_values (dict)` - three-level dictionary of foreign key references, optimized to speed up the validation process, in a form of `{resource1: { (foreign_key_field1, foreign_key_field2) : { (value1, value2) : {one_keyedrow}, ... }}}`
 - `(exceptions.TableSchemaException)` - raises any error that occurs during this process
 - `(list[])` - returns array of rows (see `table.iter`)

@@ -252,6 +254,18 @@ Save data source to file locally in CSV format with `,` (comma) delimiter
 - `(exceptions.TableSchemaException)` - raises an error if there is saving problem
 - `(True/Storage)` - returns true or storage instance

+#### `table.index_foreign_keys_values(relations)`
+
+Creates a three-level dictionary of foreign key references, optimized to speed up the validation process, in a form of `{resource1: { (foreign_key_field1, foreign_key_field2) : { (value1, value2) : {one_keyedrow}, ... }}}`.
+For each foreign key of the schema it will iterate through the corresponding `relations['resource']` to create an index (i.e. a dict) of existing values for the foreign fields and store one keyed row for each value combination.
+The optimization relies on indexing the possible values for one foreign key in a hashmap to later speed up resolution.
+This method is public to allow creating the index once and applying it to multiple tables sharing the same schema (typically [grouped resources in datapackage](https://github.com/frictionlessdata/datapackage-py#group)).
+Note 1: the second key of the output is a tuple of the foreign fields, a proxy identifier of the foreign key
+Note 2: the same relation resource can be indexed multiple times, as a schema can contain more than one foreign key pointing to the same resource
+
+- `relations (dict)` - dict of foreign key references in a form of `{resource1: [{field1: value1, field2: value2}, ...], ...}`. It must contain all resources referenced by the foreign keys in the schema definition.
+- `({resource1: { (foreign_key_field1, foreign_key_field2) : { (value1, value2) : {one_keyedrow}, ... }}})` - returns a three-level dictionary of foreign key references optimized to speed up validation process
+
 ### Schema

 A model of a schema with helpful methods for working with the schema and supported data. Schema instances can be initialized with a schema source as a url to a JSON file or a JSON object. The schema is initially validated (see [validate](#validate) below). By default validation errors will be stored in `schema.errors` but in a strict mode it will be instantly raised.
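
To make the three-level shape documented above concrete, here is a minimal pure-Python sketch of the indexing step. It is an illustration under stated assumptions, not the library's code: `build_fk_index` is a hypothetical helper, and the `foreign_keys` descriptors and `relations` rows are invented sample data. It shows the two documented notes (the second-level key is the tuple of referenced fields, and one resource can be indexed once per foreign key) and the "first matching row wins" behaviour behind the one-to-many warning.

```python
from collections import defaultdict

# Hypothetical foreign key descriptors, in the Table Schema "foreignKeys" form.
foreign_keys = [
    {"fields": ["author_id"],
     "reference": {"resource": "people", "fields": ["id"]}},
    {"fields": ["first", "last"],
     "reference": {"resource": "people", "fields": ["firstname", "surname"]}},
]

# Hypothetical `relations` argument: resource name -> list of keyed rows.
relations = {
    "people": [
        {"id": 1, "firstname": "Alex", "surname": "Martin"},
        {"id": 2, "firstname": "John", "surname": "Dockins"},
        {"id": 2, "firstname": "Johnny", "surname": "Dockins"},  # duplicate id
    ]
}


def build_fk_index(foreign_keys, relations):
    """Illustrative stand-in for the indexing step (not the library method)."""
    index = defaultdict(dict)
    for fk in foreign_keys:
        resource = fk["reference"]["resource"]
        ref_fields = tuple(fk["reference"]["fields"])
        values = index[resource].setdefault(ref_fields, {})
        for row in relations[resource]:
            key = tuple(row[f] for f in ref_fields)
            # Keep only the first row seen for a given value combination.
            values.setdefault(key, row)
    return index


index = build_fk_index(foreign_keys, relations)

# Level 1: resource name; level 2: tuple of referenced fields.
print(sorted(index["people"]))
# Level 3: tuple of values -> one keyed row (the first one with id == 2).
print(index["people"][("id",)][(2,)]["firstname"])
```

Because the result only depends on the schema's foreign keys and on `relations`, the same index can be computed once and handed to every table that shares that schema, which is the use case the README describes for grouped resources.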

tableschema/table.py

Lines changed: 67 additions & 33 deletions
@@ -11,6 +11,7 @@
 from .storage import Storage
 from .schema import Schema
 from . import exceptions
+from collections import defaultdict


 # Module API
@@ -65,7 +66,8 @@ def schema(self):
         """
         return self.__schema

-    def iter(self, keyed=False, extended=False, cast=True, relations=False):
+    def iter(self, keyed=False, extended=False, cast=True, relations=False,
+             foreign_keys_values=False):
         """https://github.com/frictionlessdata/tableschema-py#schema
         """

@@ -74,6 +76,11 @@ def iter(self, keyed=False, extended=False, cast=True, relations=False):
             unique_fields_cache = {}
             if self.schema:
                 unique_fields_cache = _create_unique_fields_cache(self.schema)
+        # Prepare relation checks
+        if relations and not foreign_keys_values:
+            # we have to test relations but the index has not been precomputed
+            # prepare the index to boost validation process
+            foreign_keys_values = self.index_foreign_keys_values(relations)

         # Open/iterate stream
         self.__stream.open()
@@ -110,11 +117,18 @@ def iter(self, keyed=False, extended=False, cast=True, relations=False):
                 if self.schema:
                     row_with_relations = dict(zip(headers, copy(row)))
                     for foreign_key in self.schema.foreign_keys:
-                        refValue = _resolve_relations(row, headers, relations, foreign_key)
+                        refValue = _resolve_relations(row, headers, foreign_keys_values,
+                                                      foreign_key)
                         if refValue is None:
                             self.__stream.close()
-                            message = 'Foreign key "%s" violation in row "%s"'
-                            message = message % (foreign_key['fields'], row_number)
+                            keyed_row = OrderedDict(zip(headers, row))
+                            # local values of the FK
+                            local_values = tuple(keyed_row[f] for f in foreign_key['fields'])
+                            message = 'Foreign key "%s" violation in row "%s": %s not found in %s'
+                            message = message % (foreign_key['fields'],
+                                                 row_number,
+                                                 local_values,
+                                                 foreign_key['reference']['resource'])
                             raise exceptions.RelationError(message)
                         elif type(refValue) is dict:
                             for field in foreign_key['fields']:
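
As a quick illustration of what the extended error message in this hunk produces, the snippet below applies the same format string to invented values. The `foreign_key` descriptor, `row_number`, and `local_values` are hypothetical stand-ins for what `iter` has at hand when a lookup fails; only the format string and the argument order come from the diff.

```python
# Hypothetical values; only the message template is taken from the commit.
foreign_key = {"fields": ["author_id"],
               "reference": {"resource": "people", "fields": ["id"]}}
row_number = 3
local_values = (42,)

message = 'Foreign key "%s" violation in row "%s": %s not found in %s'
message = message % (foreign_key['fields'],
                     row_number,
                     local_values,
                     foreign_key['reference']['resource'])
print(message)
# Foreign key "['author_id']" violation in row "3": (42,) not found in people
```

Compared with the previous message, the offending values and the referenced resource are now part of the text, which makes it possible to locate the broken reference without re-reading the source row.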
@@ -124,6 +138,11 @@ def iter(self, keyed=False, extended=False, cast=True, relations=False):
                                 else:
                                     # already one ref, merging
                                     row_with_relations[field].update(refValue)
+                        else:
+                            # case when all original value of the FK are empty
+                            # refValue == row, there is nothing to do
+                            # an empty dict might be a better returned value for this case ?
+                            pass

                     # mutate row now that we are done, in the right order
                     row = [row_with_relations[f] for f in headers]
@@ -139,11 +158,13 @@ def iter(self, keyed=False, extended=False, cast=True, relations=False):
         # Close stream
         self.__stream.close()

-    def read(self, keyed=False, extended=False, cast=True, relations=False, limit=None):
+    def read(self, keyed=False, extended=False, cast=True, relations=False, limit=None,
+             foreign_keys_values=False):
         """https://github.com/frictionlessdata/tableschema-py#schema
         """
         result = []
-        rows = self.iter(keyed=keyed, extended=extended, cast=cast, relations=relations)
+        rows = self.iter(keyed=keyed, extended=extended, cast=cast, relations=relations,
+                         foreign_keys_values=foreign_keys_values)
         for count, row in enumerate(rows, start=1):
             result.append(row)
             if count == limit:
@@ -194,6 +215,32 @@ def save(self, target, storage=None, **options):
             storage.write(target, self.iter(cast=False))
             return storage

+    def index_foreign_keys_values(self, relations):
+        # we don't need to load the complete reference table to test relations
+        # we can lower payload AND optimize testing foreign keys
+        # by preparing the right index based on the foreign key definition
+        # foreign_keys are sets of tuples of all possible values in the foreign table
+        # foreign keys =
+        # [reference] [foreign_keys tuple] = { (foreign_keys_values, ) : one_keyedrow, ... }
+        foreign_keys = defaultdict(dict)
+        if self.schema:
+            for fk in self.schema.foreign_keys:
+                # load relation data
+                relation = fk['reference']['resource']
+
+                # create a set of foreign keys
+                # to optimize we prepare index of existing values
+                # this index should use reference + foreign_keys as key
+                # because many foreign keys may use the same reference
+                foreign_keys[relation][tuple(fk['reference']['fields'])] = {}
+                for row in relations[relation]:
+                    key = tuple([row[foreign_field] for foreign_field in fk['reference']['fields']])
+                    # here we should choose to pick the first or nth row which match
+                    # previous implementation picked the first, so be it
+                    if key not in foreign_keys[relation][tuple(fk['reference']['fields'])]:
+                        foreign_keys[relation][tuple(fk['reference']['fields'])][key] = row
+        return foreign_keys
+
     # Private

     def __apply_processors(self, iterator, cast=True):
@@ -237,34 +284,21 @@ def _create_unique_fields_cache(schema):
     return cache


-def _resolve_relations(row, headers, relations, foreign_key):
+def _resolve_relations(row, headers, foreign_keys_values, foreign_key):

     # Prepare helpers - needed data structures
     keyed_row = OrderedDict(zip(headers, row))
-    fields = list(zip(foreign_key['fields'], foreign_key['reference']['fields']))
-    reference = relations.get(foreign_key['reference']['resource'])
-    if not reference:
-        # should an exception being raised here ?
-        return None
-
-    # Collect values - valid if all None
-    values = {}
-    empty_row = True
-    for field, ref_field in fields:
-        if field and ref_field:
-            values[ref_field] = keyed_row[field]
-            if keyed_row[field] is not None:
-                empty_row = False
-
-    # Resolve values - valid if match found
-    if not empty_row:
-        for refValues in reference:
-            if set(values.items()).issubset(set(refValues.items())):
-                # return the correct reference values
-                return refValues
-
-    if empty_row:
-        # return the original row if empty
-        return row
+    # local values of the FK
+    local_values = tuple(keyed_row[f] for f in foreign_key['fields'])
+    if len([l for l in local_values if l]) > 0:
+        # test existence into the foreign
+        relation = foreign_key['reference']['resource']
+        keys = tuple(foreign_key['reference']['fields'])
+        foreign_values = foreign_keys_values[relation][keys]
+        if local_values in foreign_values:
+            return foreign_values[local_values]
+        else:
+            return None
     else:
-        return None
+        # empty values for all keys, return original values
+        return row
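
The change to `_resolve_relations` replaces a scan over every referenced row with a single dictionary lookup. The sketch below contrasts the two strategies on invented data. All names (`resolve_by_scan`, `resolve_by_index`, `has_values_truthy`, `has_values_not_none`) and the sample rows are hypothetical, and the functions are simplified restatements of the removed and added logic rather than the library code. The last part points out one behavioural difference that seems worth checking: the new emptiness test relies on truthiness, while the removed code compared against `None`.

```python
def resolve_by_scan(keyed_row, foreign_key, relations):
    """Removed approach (sketch): linear scan over every referenced row."""
    pairs = zip(foreign_key["fields"], foreign_key["reference"]["fields"])
    wanted = {ref: keyed_row[local] for local, ref in pairs}
    for ref_row in relations[foreign_key["reference"]["resource"]]:
        if set(wanted.items()).issubset(set(ref_row.items())):
            return ref_row
    return None


def resolve_by_index(keyed_row, foreign_key, index):
    """Added approach (sketch): one dict lookup keyed by the local values."""
    local_values = tuple(keyed_row[f] for f in foreign_key["fields"])
    resource = foreign_key["reference"]["resource"]
    ref_fields = tuple(foreign_key["reference"]["fields"])
    return index[resource][ref_fields].get(local_values)


# Invented sample data and a hand-built index in the three-level form.
relations = {"people": [{"id": 0, "name": "Ada"}, {"id": 1, "name": "Bob"}]}
index = {"people": {("id",): {(r["id"],): r for r in relations["people"]}}}
foreign_key = {"fields": ["author_id"],
               "reference": {"resource": "people", "fields": ["id"]}}

found = {"title": "Notes", "author_id": 1}
missing = {"title": "Draft", "author_id": 7}
same_hit = resolve_by_scan(found, foreign_key, relations) == \
    resolve_by_index(found, foreign_key, index)
same_miss = resolve_by_scan(missing, foreign_key, relations) is None and \
    resolve_by_index(missing, foreign_key, index) is None
print(same_hit, same_miss)


def has_values_truthy(local_values):
    # Emptiness test as written in the added code: falsy values count as empty.
    return len([v for v in local_values if v]) > 0


def has_values_not_none(local_values):
    # Emptiness test as written in the removed code: only None counts as empty.
    return any(v is not None for v in local_values)


print(has_values_truthy((0,)), has_values_not_none((0,)))
```

With the scan, each row costs time proportional to the size of the referenced resource. With the index, the cost per row is one hash lookup after a one-time pass to build the index. Under the truthiness test, a key whose only value is `0` or an empty string is treated as empty and the row is returned unchecked. Whether that is intended is not stated in the commit.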
