
Commit 2d1923a

Better element IDs - deterministic and document-unique hashes (#2673)
Part two of #2842. Main changes compared to part one:

* Hash computation includes the element's sequence number on the page, the page number, the document filename, and the element's text.
* There are more tests for the deterministic behavior of IDs returned by partitioning functions and for their uniqueness (guaranteed at the document level, and highly probable across multiple documents).

This PR addresses issue #2461.
1 parent abb0174 commit 2d1923a
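
The change described above can be illustrated with a minimal sketch that uses only the standard library. The function names, the field order, and the `|` separator are assumptions made for illustration; they are not taken from the library's implementation. What the diff does show is that IDs are 32 hex characters, and the removed expected value `e3b0c44298fc1c149afbf4c8996fb924` matches the first 32 characters of the SHA-256 digest of an empty string.

```python
import hashlib


def text_only_id(text: str) -> str:
    """Old scheme: the ID depends on the element text alone."""
    return hashlib.sha256(text.encode()).hexdigest()[:32]


def composite_id(text: str, sequence_number: int, page_number, filename) -> str:
    """Hypothetical new scheme: mix position and document name into the hash.

    The field order and the "|" separator are assumptions made for this sketch.
    """
    data = "|".join([str(filename), str(page_number), str(sequence_number), text])
    return hashlib.sha256(data.encode()).hexdigest()[:32]


# Two identical paragraphs on the same page of the same (made-up) file.
old_ids = [text_only_id("Element"), text_only_id("Element")]
new_ids = [
    composite_id("Element", 0, 1, "report.pdf"),
    composite_id("Element", 1, 1, "report.pdf"),
]

print(old_ids[0] == old_ids[1])  # True: text-only hashes collide
print(new_ids[0] == new_ids[1])  # False: the sequence number separates them
```

Because every input to the hash is derived from the document itself, re-running the computation on the same file yields the same IDs, while duplicate text in different positions no longer shares an ID.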

File tree

172 files changed

+3755
-3485
lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,7 @@
11
## 0.13.4-dev1
22

33
### Enhancements
4+
* **Unique and deterministic hash IDs for elements** Element IDs produced by any partitioning function are now deterministic and unique at the document level by default. Before, hashes were based only on text; however, they now also take into account the element's sequence number on a page, the page's number in the document, and the document's file name.
45

56
### Features
67

docs/source/introduction/overview.rst

Lines changed: 3 additions & 5 deletions
@@ -141,10 +141,8 @@ a list of elements from JSON, as seen in the snippet below
141141
Unique Element IDs
142142
******************
143143

144-
By default, the element ID is a SHA-256 hash of the element text. This is to ensure that
145-
the ID is deterministic. One downside is that the ID is not guaranteed to be unique.
146-
Different elements with the same text will have the same ID, and there could also
147-
be hash collisions. To use UUIDs in the output instead, you can pass
144+
By default, the element ID is a SHA-256 hash of the element's text, its position on the page, the number of the page it is on, and the name of the document file. This ensures that the ID is deterministic and unique at the document level.
145+
To obtain globally unique IDs in the output (UUIDs), you can pass
148146
``unique_element_ids=True`` into any of the partition functions. This can be helpful
149147
if you'd like to use the IDs as a primary key in a database, for example.
150148

@@ -161,7 +159,7 @@ Element ID Design Principles
161159
#. A partitioning function can assign only one of two available ID types to a returned element: a hash or a UUID.
162160
#. All elements that are returned come with an ID, which is never None.
163161
#. No matter which type of ID is used, it will always be in string format.
164-
#. Partitioning a document returns elements with hashes as their default IDs.
162+
#. Partitioning a document returns elements with hashes as their default IDs, ensuring they are both deterministic and unique within a document.
165163

166164
For creating elements independently of partitioning functions, refer to the `Element` class documentation in the source code (`unstructured/documents/elements.py`).
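
The design principles above (two ID types, always a string, never None, hash by default) can be sketched as follows. `make_element_id` is a hypothetical helper written for this example, and the hash input layout is an assumption; only the `unique_element_ids` flag name comes from the documentation text.

```python
import hashlib
import uuid


def make_element_id(text: str, sequence_number: int, unique_element_ids: bool = False) -> str:
    """Return a UUID string or a truncated SHA-256 hash; always a str, never None."""
    if unique_element_ids:
        # Globally unique, but different on every run.
        return str(uuid.uuid4())
    # Assumed input layout, for illustration only.
    data = f"{sequence_number}:{text}"
    return hashlib.sha256(data.encode()).hexdigest()[:32]


hash_id = make_element_id("Example heading.", 0)
uuid_id = make_element_id("Example heading.", 0, unique_element_ids=True)

print(len(hash_id))  # 32
print(len(uuid_id), uuid_id.count("-"))  # 36 4
```

The trade-off is the usual one: the hash is reproducible across runs, while the UUID is unique across documents but cannot be recomputed from the content.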
167165

18 KB
Binary file not shown.
35.8 KB
Binary file not shown.
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
1+
<!DOCTYPE html>
2+
<html>
3+
4+
<head>
5+
<title>Simple Nested HTML</title>
6+
</strong>
7+
8+
<body>
9+
<h1>Example heading.</h1>
10+
<div>
11+
<span>This is a span.</span>
12+
<span>This is another span.</span>
13+
</div>
14+
<br>
15+
<h1>Example heading.</h1>
16+
<div>
17+
<span>This is a span.</span>
18+
<span>This is another span.</span>
19+
</div>
20+
21+
</body>
22+
23+
</html>
11.2 KB
Binary file not shown.

example-docs/spring-weather.html.json

Lines changed: 1 addition & 1 deletion
@@ -223,4 +223,4 @@
223223
"page_number": 1
224224
}
225225
}
226-
]
226+
]

test_unstructured/documents/test_elements.py

Lines changed: 96 additions & 18 deletions
@@ -4,6 +4,7 @@
44

55
from __future__ import annotations
66

7+
import copy
78
import json
89
import pathlib
910
from functools import partial
@@ -28,9 +29,29 @@
2829
RegexMetadata,
2930
Text,
3031
Title,
32+
assign_and_map_hash_ids,
3133
)
3234

3335

36+
@pytest.mark.parametrize("element", [Element(), Text(text=""), CheckBox()])
37+
def test_Element_autoassigns_a_UUID_then_becomes_an_idempotent_and_deterministic_hash(
38+
element: Element,
39+
):
40+
# -- element self-assigns itself a UUID --
41+
assert isinstance(element.id, str)
42+
assert len(element.id) == 36
43+
assert element.id.count("-") == 4
44+
45+
expected_hash = "5336294a19f32ff03ef80066fbc3e0f7"
46+
# -- calling `.id_to_hash()` changes the element's id-type to hash --
47+
assert element.id_to_hash(0) == expected_hash
48+
assert element.id == expected_hash
49+
50+
# -- `.id_to_hash()` is idempotent --
51+
assert element.id_to_hash(0) == expected_hash
52+
assert element.id == expected_hash
53+
54+
3455
def test_Text_is_JSON_serializable():
3556
# -- This should run without an error --
3657
json.dumps(Text(text="hello there!", element_id=None).to_dict())
@@ -45,25 +66,11 @@ def test_Text_is_JSON_serializable():
4566
CheckBox(),
4667
],
4768
)
48-
def test_Element_autoassigns_a_UUID_then_becomes_an_idempotent_and_deterministic_hash(
49-
element: Element,
50-
):
51-
assert element._element_id is None, "Element should not have an ID yet"
52-
53-
# -- element self-assigns itself a UUID only when the ID is requested --
69+
def test_Element_self_assigns_itself_a_UUID_id(element: Element):
5470
assert isinstance(element.id, str)
5571
assert len(element.id) == 36
5672
assert element.id.count("-") == 4
5773

58-
expected_hash = "e3b0c44298fc1c149afbf4c8996fb924"
59-
# -- calling `.id_to_hash()` changes the element's id-type to hash --
60-
assert element.id_to_hash() == expected_hash
61-
assert element.id == expected_hash
62-
63-
# -- `.id_to_hash()` is idempotent --
64-
assert element.id_to_hash() == expected_hash
65-
assert element.id == expected_hash
66-
6774

6875
def test_text_element_apply_cleaners():
6976
text_element = Text(text="[1] A Textbook on Crocodile Habitats")
@@ -408,9 +415,10 @@ def and_it_serializes_an_orig_elements_sub_object_to_base64_when_it_is_present(s
408415
assert meta.to_dict() == {
409416
"category_depth": 1,
410417
"orig_elements": (
411-
"eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWt"
412-
"JVm5WDoqNUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQp"
413-
"ZRxYZcCA/1spDtP98dU6DTEw3sa5fWOqs10vH0cLQn0="
418+
"eJyFzcsKwjAQheFXKVm7MGkzbXwDocu6EpFcTqTQG3UEtfTdbZa"
419+
"6cTnDd/jPi0CHHgNf2yAOmXCljjqXoErKoIw3hqJRXlPuyphrEr"
420+
"tM9GAbLNvNL+t2M56ctvU4o0+AXxPSo2m5g9jIb6VwBE0VBSujp"
421+
"1LJ6EiRLpwiSBf3fyvZcbo/vlqnwVvGbZzbN0KT7Hr5AG/eQyM="
414422
),
415423
"page_number": 2,
416424
}
@@ -666,3 +674,73 @@ def it_can_find_the_consolidation_strategy_for_each_of_its_known_fields(self):
666674
f"ElementMetadata field `.{field_name}` does not have a consolidation strategy."
667675
f" Add one in `ConsolidationStrategy.field_consolidation_strategies()."
668676
)
677+
678+
679+
def test_hash_ids_are_unique_for_duplicate_elements():
680+
# GIVEN
681+
parent = Text(text="Parent", metadata=ElementMetadata(page_number=1))
682+
elements = [
683+
parent,
684+
Text(text="Element", metadata=ElementMetadata(page_number=1, parent_id=parent.id)),
685+
Text(text="Element", metadata=ElementMetadata(page_number=1, parent_id=parent.id)),
686+
]
687+
688+
# WHEN
689+
updated_elements = assign_and_map_hash_ids(copy.deepcopy(elements))
690+
ids = [element.id for element in updated_elements]
691+
692+
# THEN
693+
assert len(ids) == len(set(ids)), "Recalculated IDs must be unique."
694+
assert elements[1].metadata.parent_id == elements[2].metadata.parent_id
695+
696+
for idx, updated_element in enumerate(updated_elements):
697+
assert updated_element.id != elements[idx].id, "IDs haven't changed after recalculation"
698+
if updated_element.metadata.parent_id is not None:
699+
assert updated_element.metadata.parent_id in ids, "Parent ID not in the list of IDs"
700+
assert (
701+
updated_element.metadata.parent_id != elements[idx].metadata.parent_id
702+
), "Parent ID hasn't changed after recalculation"
703+
704+
705+
def test_hash_ids_are_deterministic():
706+
parent = Text(text="Parent", metadata=ElementMetadata(page_number=1))
707+
elements = [
708+
parent,
709+
Text(text="Element", metadata=ElementMetadata(page_number=1, parent_id=parent.id)),
710+
Text(text="Element", metadata=ElementMetadata(page_number=1, parent_id=parent.id)),
711+
]
712+
713+
updated_elements = assign_and_map_hash_ids(elements)
714+
ids = [element.id for element in updated_elements]
715+
parent_ids = [element.metadata.parent_id for element in updated_elements]
716+
717+
assert ids == [
718+
"ea9eb7e80383c190f8cafce1ad666624",
719+
"4112a8d24886276e18e759d06956021b",
720+
"eba84bbe7f03e8b91a1527323040ee3d",
721+
]
722+
assert parent_ids == [
723+
None,
724+
"ea9eb7e80383c190f8cafce1ad666624",
725+
"ea9eb7e80383c190f8cafce1ad666624",
726+
]
727+
728+
729+
@pytest.mark.parametrize(
730+
("text", "sequence_number", "filename", "page_number", "expected_hash"),
731+
[
732+
# -- pdf files support page numbers --
733+
("foo", 1, "foo.pdf", 1, "4bb264eb23ceb44cd8fcc5af44f8dc71"),
734+
("foo", 2, "foo.pdf", 1, "75fc1de48cf724ec00aa8d1c5a0d3758"),
735+
# -- txt files don't have a page number --
736+
("some text", 0, "some.txt", None, "1a2627b5760c06b1440102f11a1edb0f"),
737+
("some text", 1, "some.txt", None, "e3fd10d867c4a1c0264dde40e3d7e45a"),
738+
],
739+
)
740+
def test_id_to_hash_calculates(text, sequence_number, filename, page_number, expected_hash):
741+
element = Text(
742+
text=text,
743+
metadata=ElementMetadata(filename=filename, page_number=page_number),
744+
)
745+
assert element.id_to_hash(sequence_number) == expected_hash, "Returned ID does not match"
746+
assert element.id == expected_hash, "ID should be set"
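
The tests above exercise `assign_and_map_hash_ids`, whose body is not shown in this part of the diff. The sketch below is one plausible way to get the tested behavior: recompute every ID, then rewrite parent references through an old-to-new mapping. The `Node` class, the `remap_hash_ids` name, the document-wide counter, and the hash input layout are all assumptions made for illustration.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an element: text, an ID, and an optional parent ID."""

    text: str
    id: str
    parent_id: Optional[str] = None


def remap_hash_ids(nodes: List[Node], filename: str = "doc.txt") -> List[Node]:
    """Replace every ID with a position-aware hash and rewrite parent references."""
    old_to_new = {}
    for sequence_number, node in enumerate(nodes):
        data = f"{filename}|{sequence_number}|{node.text}"  # assumed layout
        new_id = hashlib.sha256(data.encode()).hexdigest()[:32]
        old_to_new[node.id] = new_id
        node.id = new_id
    # Second pass: point children at their parent's new ID.
    for node in nodes:
        if node.parent_id is not None:
            node.parent_id = old_to_new.get(node.parent_id, node.parent_id)
    return nodes


parent = Node("Parent", str(uuid.uuid4()))
nodes = [
    parent,
    Node("Element", str(uuid.uuid4()), parent_id=parent.id),
    Node("Element", str(uuid.uuid4()), parent_id=parent.id),
]
remapped = remap_hash_ids(nodes)
ids = [node.id for node in remapped]
```

Doing the remapping in a second pass keeps parent links intact even when a child appears before its parent in the list.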

test_unstructured/documents/test_email_elements.py

Lines changed: 25 additions & 5 deletions
@@ -4,21 +4,41 @@
44

55
from unstructured.cleaners.core import clean_prefix
66
from unstructured.cleaners.translate import translate_text
7-
from unstructured.documents.email_elements import EmailElement, Name
7+
from unstructured.documents.email_elements import EmailElement, Name, Subject
8+
9+
10+
@pytest.mark.parametrize(
11+
"element", [EmailElement(text=""), Name(text="", name=""), Subject(text="")]
12+
)
13+
def test_EmailElement_autoassigns_a_UUID_then_becomes_an_idempotent_and_deterministic_hash(
14+
element: EmailElement,
15+
):
16+
# -- element self-assigns itself a UUID --
17+
assert isinstance(element.id, str)
18+
assert len(element.id) == 36
19+
assert element.id.count("-") == 4
20+
21+
expected_hash = "5336294a19f32ff03ef80066fbc3e0f7"
22+
# -- calling `.id_to_hash()` changes the element's id-type to hash --
23+
assert element.id_to_hash(0) == expected_hash
24+
assert element.id == expected_hash
25+
26+
# -- `.id_to_hash()` is idempotent --
27+
assert element.id_to_hash(0) == expected_hash
828

929

1030
def test_Name_should_assign_a_deterministic_and_an_idempotent_hash():
1131
element = Name(name="Example", text="hello there!")
12-
expected_hash = "c69509590d81db2f37f9d75480c8efed"
32+
expected_hash = "7d191bcecf80c122578c497de5f0dae7"
1333

1434
assert element._element_id is None, "Element should not have an ID yet"
1535

1636
# -- calculating hash for the first time --
17-
assert element.id_to_hash() == expected_hash
37+
assert element.id_to_hash(0) == expected_hash
1838
assert element.id == expected_hash
1939

2040
# -- `.id_to_hash()` is idempotent --
21-
assert element.id_to_hash() == expected_hash
41+
assert element.id_to_hash(0) == expected_hash
2242
assert element.id == expected_hash
2343

2444

@@ -30,7 +50,7 @@ def test_Name_should_assign_a_deterministic_and_an_idempotent_hash():
3050
Name(name="Example", text="hello there!", element_id=None),
3151
],
3252
)
33-
def test_EmailElement_should_assign_a_UUID_only_once_and_only_at_the_first_id_request(
53+
def test_EmailElement_assigns_a_UUID_only_once_and_only_at_the_first_id_request(
3454
element: EmailElement,
3555
):
3656
assert element._element_id is None, "Element should not have an ID yet"
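
The email-element tests above rely on two behaviors: the ID is assigned lazily on first access, and `.id_to_hash()` is idempotent. A minimal sketch of that pattern follows; `LazyIdElement` is a hypothetical class written for this example, and its hash input layout is an assumption, not the library's.

```python
import hashlib
import uuid
from typing import Optional


class LazyIdElement:
    """Hypothetical element whose ID is created only when first requested."""

    def __init__(self, text: str, element_id: Optional[str] = None):
        self.text = text
        self._element_id = element_id

    @property
    def id(self) -> str:
        # Assign a UUID once, on first access, and reuse it afterwards.
        if self._element_id is None:
            self._element_id = str(uuid.uuid4())
        return self._element_id

    def id_to_hash(self, sequence_number: int) -> str:
        # Same inputs always produce the same hash, so repeated calls are idempotent.
        data = f"{sequence_number}:{self.text}"  # assumed layout
        self._element_id = hashlib.sha256(data.encode()).hexdigest()[:32]
        return self._element_id


element = LazyIdElement("hello there!")
print(element._element_id)  # None: nothing assigned yet
first = element.id
print(first == element.id)  # True: the UUID is assigned once and then reused
```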

test_unstructured/partition/docx/test_doc.py

Lines changed: 14 additions & 0 deletions
@@ -19,6 +19,20 @@
1919
from unstructured.partition.docx import partition_docx
2020

2121

22+
def test_partition_doc_for_deterministic_and_unique_ids():
23+
ids = [element.id for element in partition_doc("example-docs/duplicate-paragraphs.doc")]
24+
25+
assert ids == [
26+
"ade273c622c48d67a7be7b3816d5b4d8",
27+
"7d0b32fdf169f9578723486cb4bc1235",
28+
"1feb6e8e9c1662cfaef75907aeeb0900",
29+
"aa2a8ac10143b12f0fe2087837ea11d2",
30+
"da31ba7ed3919067d2c6572dc1617271",
31+
"1914359c179a160df921b769acf8c353",
32+
"f9d0d379fc791bae487b7a45f65caa50",
33+
]
34+
35+
2236
@pytest.fixture()
2337
def mock_document():
2438
document = docx.Document()
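
The new `partition_doc` test pins exact ID values. A looser check of the same two properties, determinism and uniqueness, can be sketched without any fixture file. Here `fake_partition` is a hypothetical stand-in for a real partitioning function and returns IDs only; its hash input layout is an assumption.

```python
import hashlib
from typing import Callable, List


def fake_partition(paragraphs: List[str], filename: str) -> List[str]:
    """Hypothetical stand-in for a partitioning function; returns element IDs only."""
    return [
        hashlib.sha256(f"{filename}|{i}|{text}".encode()).hexdigest()[:32]
        for i, text in enumerate(paragraphs)
    ]


def check_ids(partition: Callable[[], List[str]]) -> None:
    """Run the partitioner twice and verify both properties."""
    first_run = partition()
    second_run = partition()
    assert first_run == second_run, "IDs must be identical across runs"
    assert len(first_run) == len(set(first_run)), "IDs must be unique within a document"


paragraphs = ["Intro", "Same text", "Same text", "Outro"]
check_ids(lambda: fake_partition(paragraphs, "duplicate-paragraphs.doc"))
```

Pinning literal hashes, as the test in the diff does, additionally guards against accidental changes to the hashing inputs, at the cost of updating the expected values whenever the scheme changes.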
