Commit 919d8b3
TIND, PyPDF: Make metadata consistent and coherent
Due to the way Langchain uses LanceDB, the Arrow schema cannot change
between documents. By default, the PyPDF document loader from Langchain
ingests all metadata from the PDF (from both properties and XMP) and
saves them as keys in the ``metadata`` dictionary. Unfortunately, the
PDFs we are processing do not use metadata keys uniformly. Some have
Company set to ``UC Berkeley``, some don't; some have author information,
some don't.
What's more, the way we parse TIND records into further metadata meant
that for a given key X, some records had ``null`` values, some had a
single ``str``, and some had a ``list`` of ``str``.
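The mismatch described above can be illustrated with a minimal sketch in plain Python. It does not use LanceDB or Arrow; ``inferred_schema`` is a hypothetical helper that stands in for per-document schema inference, and the sample keys and values are made up for illustration.

```python
def inferred_schema(metadata):
    """Map each key to the name of its value's Python type, sorted by key."""
    return tuple(
        sorted((key, type(value).__name__) for key, value in metadata.items())
    )


# Hypothetical metadata for three documents; keys and values are illustrative.
docs = [
    {"source": "a.pdf", "Company": "UC Berkeley", "contributor": None},
    {"source": "b.pdf", "author": "Jane Doe", "contributor": "Doe, Jane"},
    {"source": "c.pdf", "contributor": ["Doe, Jane", "Roe, Richard"]},
]

# Each document yields a different shape, so no single schema fits all three.
schemas = {inferred_schema(doc) for doc in docs}
print(len(schemas))
```

Here every document produces a different set of keys or value types, which is the situation a fixed schema cannot accommodate.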
This commit ensures that only the PDF metadata Langchain uses to process
documents is included, and that all TIND properties are lists of strings.
Empty values are now a list containing a single empty string.
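A rough sketch of the two rules above, not the code in this commit: ``filter_pdf_metadata``, ``normalize_tind_value``, and the ``PDF_KEYS_TO_KEEP`` allowlist are hypothetical names, and the keys actually retained may differ.

```python
from typing import Any

# Hypothetical allowlist; the keys the real loader keeps may differ.
PDF_KEYS_TO_KEEP = ("source", "page")


def filter_pdf_metadata(metadata: dict[str, Any]) -> dict[str, Any]:
    """Drop every PDF metadata key that is not on the allowlist."""
    return {key: metadata[key] for key in PDF_KEYS_TO_KEEP if key in metadata}


def normalize_tind_value(value: Any) -> list[str]:
    """Coerce a TIND value to a list of strings; empty values become ['']."""
    if value is None:
        return [""]
    if isinstance(value, str):
        return [value]
    items = [str(item) for item in value]
    return items or [""]


print(filter_pdf_metadata({"source": "a.pdf", "page": 0, "Company": "UC Berkeley"}))
print(normalize_tind_value(None))
print(normalize_tind_value("Doe, Jane"))
print(normalize_tind_value(["Doe, Jane", "Roe, Richard"]))
```

With every TIND property coerced to the same type, the type of each column no longer depends on which record happens to be ingested first.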
The alternative would be to define our own LanceDB connection, define our
own schema ahead of time, and hope it never needs to change. It would
also make Langchain integration more difficult.
Additionally, fix up unit tests to handle metadata being lists now.
Closes: AP-4621
parent d565e2d commit 919d8b3
File tree
4 files changed, +20 -21 lines changed
- tests/tind
- willa
- lcvendor
- tind