Commit d0ee242

How to transform Unstructured JSON to a custom JSON schema (#264)
1 parent 1e78b4f commit d0ee242

2 files changed: +231 -1 lines changed
Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
---
title: Transform a JSON file into a different schema
---

## Task

You want to convert a JSON file that Unstructured produces into a separate JSON file that uses
a different JSON schema than the one that Unstructured uses.

## Approach

Use a Python package such as [json-converter](https://pypi.org/project/json-converter/) in your Python code project
to transform your source JSON file into a target JSON file that conforms to your own schema.

<Info>The `json-converter` package is not owned or supported by Unstructured. For questions and
requests, see the [Issues](https://github.com/ebi-ait/json-converter/issues) tab of the
`json-converter` repository in GitHub.</Info>

## Code

<Steps>
<Step title="Install dependencies">
In your local Python code project, install the [json-converter](https://pypi.org/project/json-converter/)
package.

```bash
pip install json-converter
```
</Step>
<Step title="Identify the JSON file to transform">
1. Find the local source JSON file that you want to transform.
2. Note the JSON field names and structures that you want to transform. For example, the JSON file might
   look like the following (the ellipses indicate content omitted for brevity):

    ```json
    [
      {
        "type": "...",
        "element_id": "...",
        "text": "...",
        "metadata": {
          "filetype": "...",
          "languages": [
            "eng"
          ],
          "page_number": 1,
          "filename": "..."
        }
      },
      {
        "type": "...",
        "element_id": "...",
        "text": "...",
        "metadata": {
          "filetype": "...",
          "languages": [
            "eng"
          ],
          "page_number": 1,
          "filename": "..."
        }
      },
      {
        "...": "..."
      }
    ]
    ```
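
To note the field names and structures without reading the file by hand, a short script can list every field path that appears in it. The following is a minimal sketch under stated assumptions: `collect_field_paths` is a hypothetical helper, not part of Unstructured or `json-converter`, and the sample values are made up for illustration.

```python
import json


def collect_field_paths(value, prefix=""):
    """Return the dotted paths of every leaf field found in a JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths |= collect_field_paths(child, child_prefix)
    elif isinstance(value, list):
        # Items in a list share the path of the list itself.
        for child in value:
            paths |= collect_field_paths(child, prefix)
    elif prefix:
        paths.add(prefix)
    return paths


# Hypothetical sample; in practice, load your own source file with json.load().
sample_elements = json.loads("""
[
  {
    "type": "Title",
    "element_id": "abc123",
    "text": "Hello",
    "metadata": {
      "filetype": "application/pdf",
      "languages": ["eng"],
      "page_number": 1,
      "filename": "sample.pdf"
    }
  }
]
""")

field_paths = sorted(collect_field_paths(sample_elements))
print(field_paths)
```

The dotted paths that this sketch prints, such as `metadata.page_number`, use the same dotted notation that the field mappings file in the next step uses to refer to nested source fields.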
68+
</Step>
69+
<Step title="Create the JSON field mappings file">
70+
1. Decide what you want the JSON schema in the transformed file to look like. For example, the
71+
transformed JSON file might look like the following (the ellipses indicate content omitted for brevity):
72+
73+
```json
74+
[
75+
{
76+
"content_type": "...",
77+
"content_id": "...",
78+
"content": "...",
79+
"content_properties": {
80+
"page": 1
81+
}
82+
},
83+
{
84+
"content_type": "...",
85+
"content_id": "...",
86+
"content": "...",
87+
"content_properties": {
88+
"page": 1
89+
}
90+
},
91+
{
92+
"...": "..."
93+
}
94+
]
95+
```
96+
97+
2. Create the JSON field mappings file, for example:
98+
99+
```json
100+
{
101+
"content_type": ["type"],
102+
"content_id": ["element_id"],
103+
"content": ["text"]
104+
"content_properties.page": ["metadata.page_number"]
105+
}
106+
```
107+
108+
This file declares the following mappings:
109+
110+
- The `type` field is renamed to `content_type`.
111+
- The `element_id` field is renamed to `content_id`.
112+
- The `text` field is renamed to `content`.
113+
- The `page_number` field nested inside `metadata` is renamed to `page` and is nested inside `content_properties`.
114+
- All of the other fields (`filetype`, `languages`, and `filename`) are dropped.
115+
116+
For more information about the format of this JSON field mappings file, see the
117+
[Project Description](https://pypi.org/project/json-converter) in the `json-converter` page on PyPI or the
118+
[README](https://github.com/ebi-ait/json-converter) in the `json-converter` repository in GitHub.
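
To make the effect of these mappings concrete, the following sketch applies them to one element by hand. It is not how `json-converter` is implemented and does not call it: `get_path`, `set_path`, and `apply_mappings` are hypothetical helpers that handle only the simple `"target.path": ["source.path"]` form shown above, and the sample values are made up. Skipping missing source fields is a choice of this sketch, not a statement about the package.

```python
def get_path(data, dotted_path):
    """Follow a dotted path such as 'metadata.page_number' into nested dicts."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def set_path(data, dotted_path, value):
    """Store a value at a dotted path, creating nested dicts as needed."""
    keys = dotted_path.split(".")
    current = data
    for key in keys[:-1]:
        current = current.setdefault(key, {})
    current[keys[-1]] = value


def apply_mappings(element, mappings):
    """Build a new dict that contains only the mapped fields."""
    result = {}
    for target_path, spec in mappings.items():
        value = get_path(element, spec[0])  # spec[0] is the source path
        if value is not None:
            set_path(result, target_path, value)
    return result


mappings = {
    "content_type": ["type"],
    "content_id": ["element_id"],
    "content": ["text"],
    "content_properties.page": ["metadata.page_number"],
}

# Hypothetical source element.
element = {
    "type": "Title",
    "element_id": "abc123",
    "text": "Hello",
    "metadata": {"filetype": "application/pdf", "page_number": 1},
}

converted = apply_mappings(element, mappings)
print(converted)
```

In this sketch, `filetype` does not appear in `converted` because no mapping names it, which mirrors the "dropped" behavior described in the list above.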
</Step>
<Step title="Add and run the transform code">
1. Set the following local environment variables:

    - Set `LOCAL_FILE_INPUT_PATH` to the local path to the source JSON file.
    - Set `LOCAL_FILE_OUTPUT_PATH` to the local path to the target JSON file.
    - Set `LOCAL_FIELD_MAPPINGS_PATH` to the local path to the JSON field mappings file.

2. Add the following Python code file to your project:

    ```python
    import json
    import os

    from json_converter.json_mapper import JsonMapper

    # Converts one JSON file with one schema to another
    # JSON file with a different schema.
    # Provide the path to the input (source) and output
    # (target) JSON files and the path to the JSON schema
    # field mappings file.
    def convert_json_with_schemas(
        input_file_path: str,
        output_file_path: str,
        field_mappings_path: str
    ) -> None:

        output_data = []

        # Load the field mappings. Stop on failure, because
        # nothing can be converted without them.
        try:
            with open(field_mappings_path, 'r') as f:
                element_mappings = json.load(f)
        except FileNotFoundError:
            print(f"Error: JSON schema field mappings file '{field_mappings_path}' not found.")
            return
        except IOError as e:
            print(f"I/O error occurred: {e}")
            return
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return

        # Load the source elements and convert them one at a time.
        # Stop on failure so that no partial output file is written.
        try:
            with open(input_file_path, 'r') as f:
                input_data = json.load(f)

            for item in input_data:
                converted_data = JsonMapper(item).map(element_mappings)
                output_data.append(converted_data)
        except FileNotFoundError:
            print(f"Error: Input JSON file '{input_file_path}' not found.")
            return
        except IOError as e:
            print(f"I/O error occurred: {e}")
            return
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return

        # Write the converted elements to the target file.
        try:
            with open(output_file_path, 'w') as f:
                json.dump(output_data, f, indent=2)

            print(f"Transformation complete. Output written to '{output_file_path}'.")
        except IOError as e:
            print(f"I/O error occurred: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")

    if __name__ == "__main__":
        convert_json_with_schemas(
            input_file_path=os.getenv("LOCAL_FILE_INPUT_PATH"),
            output_file_path=os.getenv("LOCAL_FILE_OUTPUT_PATH"),
            field_mappings_path=os.getenv("LOCAL_FIELD_MAPPINGS_PATH")
        )
    ```

3. Run the Python code file.
4. Check the path specified by `LOCAL_FILE_OUTPUT_PATH` for the transformed JSON file.
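
Because `os.getenv` returns `None` for a variable that is not set, it can help to confirm that all three variables exist before you run the code file. The following is a minimal sketch of such a check: `find_missing_variables` is a hypothetical helper, and `example_environ` is a stand-in dictionary so that the example does not depend on your actual environment.

```python
import os

REQUIRED_VARIABLES = (
    "LOCAL_FILE_INPUT_PATH",
    "LOCAL_FILE_OUTPUT_PATH",
    "LOCAL_FIELD_MAPPINGS_PATH",
)


def find_missing_variables(names, environ=None):
    """Return the names that are unset or set to an empty string."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Stand-in environment for illustration: one variable set, one empty, one absent.
example_environ = {
    "LOCAL_FILE_INPUT_PATH": "input.json",
    "LOCAL_FILE_OUTPUT_PATH": "",
}

missing = find_missing_variables(REQUIRED_VARIABLES, example_environ)
print(missing)
```

Calling `find_missing_variables(REQUIRED_VARIABLES)` without the second argument checks the real environment instead.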
</Step>
</Steps>

## Troubleshooting

### Error when trying to import Mapping from collections

**Issue**: When you run your Python code file, the following error message appears: "ImportError: cannot import name 'Mapping' from 'collections'".

**Cause**: In Python 3.10 and later, `Mapping` can no longer be imported from `collections`; it is available only from `collections.abc`.
The `json-converter` package still uses the outdated import.

**Solution**: Update the `json-converter` package's source code to use a different import, as follows:

1. In your Python project, find the `json-converter` package's source location by running the `pip show` command:

    ```bash
    pip show json-converter
    ```

    Note the path in the **Location** field.

2. Use your code editor to open the path to the `json-converter` package's source code.
3. In the source code, open the file named `json_mapper.py`.
4. Change the following line of code...

    ```python
    from collections import Mapping
    ```

    ...to the following line of code, by adding `.abc`:

    ```python
    from collections.abc import Mapping
    ```

5. Save this source code file. Reinstalling or upgrading the package overwrites this change, so you might need to repeat it.
6. Run your Python code file again.
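
If you would rather not edit the installed package, one possible alternative is to restore the old name at the top of your own Python code file, before `json_converter` is imported. This is a sketch rather than a fix that the package documents. It assumes that `Mapping` is the only outdated name that the package imports from `collections`, which this page does not confirm.

```python
import collections
import collections.abc

# Restore the alias only when it is missing (Python 3.10 and later).
if not hasattr(collections, "Mapping"):
    collections.Mapping = collections.abc.Mapping

# Import json_converter only after the alias is in place, for example:
# from json_converter.json_mapper import JsonMapper
```

This works because `from collections import Mapping` looks up `Mapping` as an attribute of the already loaded `collections` module, so the package's import finds the alias that your code added.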

mint.json

Lines changed: 2 additions & 1 deletion
@@ -378,7 +378,8 @@
   "api-reference/how-to/change-element-coordinate-system",
   "api-reference/how-to/powerpoint",
   "api-reference/how-to/use-langchain-ollama",
-  "api-reference/how-to/use-langchain-llama-3"
+  "api-reference/how-to/use-langchain-llama-3",
+  "api-reference/how-to/transform-schemas"
 ]
 },
 {
