Commit 3ee819e

How to generate a JSON schema for an Unstructured JSON file (#265)
1 parent d0ee242 commit 3ee819e

2 files changed, +113 -1 lines changed

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
---
title: Generate a JSON schema for a file
---

## Task

You want to generate a schema for a JSON file that Unstructured produces, so that you can validate, test,
and document related JSON files across your systems.

## Approach

Use a Python package such as [genson](https://pypi.org/project/genson/) to generate schemas for your
JSON files.

<Info>The `genson` package is not owned or supported by Unstructured. For questions and
requests, see the [Issues](https://github.com/wolverdude/genson/issues) tab of the
`genson` repository in GitHub.</Info>

## Generate a schema from the terminal

<Steps>
<Step title="Install genson">
Use [pip](https://pip.pypa.io/en/stable/installation/) to install the [genson](https://pypi.org/project/genson/) package.

```bash
pip install genson
```
</Step>
<Step title="Install jq">
By default, `genson` generates the JSON schema as a single string without any line breaks or indented whitespace.

To pretty-print the schema that `genson` produces, install the [jq](https://jqlang.github.io/jq/) utility.

<Info>The `jq` utility is not owned or supported by Unstructured. For questions and
requests, see the [Issues](https://github.com/jqlang/jq/issues) tab of the
`jq` repository in GitHub.</Info>
</Step>
<Step title="Generate the schema">
1. Run the `genson` command, specifying the path to the input (source) JSON file, and the path to
the output (target) JSON schema file to be generated. Use `jq` to pretty-print the schema's content
into the file to be generated.

```bash
genson "/path/to/input/file.json" | jq '.' > "/path/to/output/schema.json"
```

2. You can find the generated JSON schema file in the output path that you specified.
</Step>
</Steps>
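
If `jq` is not available, the pretty-printing part of the terminal steps can be approximated with Python's standard `json` module. The following is a minimal sketch and not part of the original instructions; the function name `pretty_print_schema` and the sample schema string are hypothetical.

```python
import json


def pretty_print_schema(compact_schema: str, indent: int = 2) -> str:
    """Re-serialize a single-line JSON schema string with line breaks and indentation."""
    # Parse the compact text, then dump it again with the requested indentation.
    return json.dumps(json.loads(compact_schema), indent=indent)


if __name__ == "__main__":
    # Hypothetical single-line schema, similar in shape to what a schema generator emits.
    compact = '{"type": "object", "properties": {"text": {"type": "string"}}, "required": ["text"]}'
    print(pretty_print_schema(compact))
```

Because the function only parses and re-serializes the text, the schema's content is unchanged; only the whitespace differs.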

## Generate a schema from Python code

<Steps>
<Step title="Install dependencies">
In your Python project, install the [genson](https://pypi.org/project/genson/) package.

```bash
pip install genson
```
</Step>
<Step title="Add and run the schema generation code">
1. Set the following local environment variables:

- Set `LOCAL_FILE_INPUT_PATH` to the local path to the input (source) JSON file.
- Set `LOCAL_FILE_OUTPUT_PATH` to the local path to the output (target) JSON schema file to be generated.

2. Add the following Python code file to your project:

```python
import json
import os

from genson import SchemaBuilder


def json_schema_from_file(
    input_file_path: str,
    output_schema_path: str
) -> None:
    try:
        # Load the JSON file that the schema will be inferred from.
        with open(input_file_path, "r") as file:
            json_data = json.load(file)

        # Infer a schema from the loaded data.
        builder = SchemaBuilder()
        builder.add_object(json_data)

        schema = builder.to_schema()

        # Write the schema with indentation so that it is readable.
        try:
            with open(output_schema_path, "w") as schema_file:
                json.dump(schema, schema_file, indent=2)
        except IOError as e:
            raise IOError(f"Error writing to output file: {e}") from e

        print(f"JSON schema successfully generated and saved to '{output_schema_path}'.")
    except FileNotFoundError:
        print(f"Error: Input file '{input_file_path}' not found.")
    except IOError as e:
        print(f"I/O error occurred: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


if __name__ == "__main__":
    json_schema_from_file(
        input_file_path=os.getenv("LOCAL_FILE_INPUT_PATH"),
        output_schema_path=os.getenv("LOCAL_FILE_OUTPUT_PATH")
    )
```

3. Run the Python code file.
4. Check the path specified by `LOCAL_FILE_OUTPUT_PATH` for the generated JSON schema file.
</Step>
</Steps>
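
Once a schema file exists, one simple use is a quick sanity check that related JSON files carry the same keys. The sketch below is an illustration only and rests on several assumptions: the schema describes a top-level array of objects with a `required` list under `items`, the function name `missing_required_keys` and the sample data are hypothetical, and only key presence is checked. For full validation of types and nested structures, a dedicated JSON Schema validator is the better tool.

```python
from typing import Any


def missing_required_keys(
    schema: dict[str, Any],
    elements: list[dict[str, Any]],
) -> list[tuple[int, str]]:
    """Return (index, key) pairs for each required key that an element lacks."""
    # Assumption: the schema is for an array whose items list their required keys.
    required = schema.get("items", {}).get("required", [])
    return [
        (index, key)
        for index, element in enumerate(elements)
        for key in required
        if key not in element
    ]


if __name__ == "__main__":
    # Hypothetical schema and data, for illustration only.
    schema = {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {"type": {"type": "string"}, "text": {"type": "string"}},
            "required": ["text", "type"],
        },
    }
    elements = [
        {"type": "Title", "text": "Hello"},
        {"type": "NarrativeText"},
    ]
    print(missing_required_keys(schema, elements))
```

An empty result means every element has all of the keys that the schema marks as required; it says nothing about value types.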

mint.json

Lines changed: 2 additions & 1 deletion
@@ -379,7 +379,8 @@
 "api-reference/how-to/powerpoint",
 "api-reference/how-to/use-langchain-ollama",
 "api-reference/how-to/use-langchain-llama-3",
-"api-reference/how-to/transform-schemas"
+"api-reference/how-to/transform-schemas",
+"api-reference/how-to/generate-schema"
 ]
 },
 {
