-            "content": f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. We use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.
+
+        system_content = f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. We use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.
 
 Generating this HTML manually is not feasible, so you need to generate the JSON schema using the HTML content. The HTML copied from the crawled website is provided below, which we believe contains the repetitive pattern.
 
@@ -1335,31 +1309,27 @@ def generate_schema(
 
 # What are the instructions and details for this schema generation?
 {prompt_template}"""
-        }
-
-        user_message = {
-            "role": "user",
-            "content": f"""
+
+        user_content = f"""
 HTML to analyze:
 ```html
 {html}
 ```
 """
-        }
 
         if query:
-            user_message["content"] += f"\n\n## Query or explanation of target/goal data item:\n{query}"
+            user_content += f"\n\n## Query or explanation of target/goal data item:\n{query}"
         if target_json_example:
-            user_message["content"] += f"\n\n## Example of target JSON object:\n```json\n{target_json_example}\n```"
+            user_content += f"\n\n## Example of target JSON object:\n```json\n{target_json_example}\n```"
 
         if query and not target_json_example:
-            user_message["content"] += """IMPORTANT: To remind you, in this process, we are not providing a rigid example of the adjacent objects we seek. We rely on your understanding of the explanation provided in the above section. Make sure to grasp what we are looking for and, based on that, create the best schema.."""
+            user_content += """IMPORTANT: To remind you, in this process, we are not providing a rigid example of the adjacent objects we seek. We rely on your understanding of the explanation provided in the above section. Make sure to grasp what we are looking for and, based on that, create the best schema.."""
         elif not query and target_json_example:
-            user_message["content"] += """IMPORTANT: Please remember that in this process, we provided a proper example of a target JSON object. Make sure to adhere to the structure and create a schema that exactly fits this example. If you find that some elements on the page do not match completely, vote for the majority."""
+            user_content += """IMPORTANT: Please remember that in this process, we provided a proper example of a target JSON object. Make sure to adhere to the structure and create a schema that exactly fits this example. If you find that some elements on the page do not match completely, vote for the majority."""
         elif not query and not target_json_example:
-            user_message["content"] += """IMPORTANT: Since we neither have a query nor an example, it is crucial to rely solely on the HTML content provided. Leverage your expertise to determine the schema based on the repetitive patterns observed in the content."""
-
-        user_message["content"] += """IMPORTANT:
+            user_content += """IMPORTANT: Since we neither have a query nor an example, it is crucial to rely solely on the HTML content provided. Leverage your expertise to determine the schema based on the repetitive patterns observed in the content."""
+
+        user_content += """IMPORTANT:
 0/ Ensure your schema remains reliable by avoiding selectors that appear to generate dynamically and are not dependable. You want a reliable schema, as it consistently returns the same data even after many page reloads.
 1/ DO NOT USE use base64 kind of classes, they are temporary and not reliable.
 2/ Every selector must refer to only one unique element. You should ensure your selector points to a single element and is unique to the place that contains the information. You have to use available techniques based on CSS or XPATH requested schema to make sure your selector is unique and also not fragile, meaning if we reload the page now or in the future, the selector should remain reliable.
@@ -1368,20 +1338,98 @@ def generate_schema(
 Analyze the HTML and generate a JSON schema that follows the specified format. Only output valid JSON schema, nothing else.
 """
 
+        return "\n\n".join([system_content, user_content])
+
+    @staticmethod
+    def generate_schema(
+        html: str,
+        schema_type: str = "CSS",
+        query: str = None,
+        target_json_example: str = None,
+        llm_config: 'LLMConfig' = create_llm_config(),
+        provider: str = None,
+        api_token: str = None,
+        **kwargs
+    ) -> dict:
+        """
+        Generate extraction schema from HTML content and optional query (sync version).
+
+        Args:
+            html (str): The HTML content to analyze
+            query (str, optional): Natural language description of what data to extract
+            provider (str): Legacy Parameter. LLM provider to use
+            api_token (str): Legacy Parameter. API token for LLM provider
+            llm_config (LLMConfig): LLM configuration object
+            **kwargs: Additional args passed to LLM processor
+
+        Returns:
+            dict: Generated schema following the JsonElementExtractionStrategy format
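
Reviewer note: the prompt assembly this diff moves to (plain strings joined at the end instead of `role`/`content` message dicts) can be sketched as a standalone function. This is a minimal sketch under stated assumptions, not the library's code: the name `build_prompt`, its signature, and the shortened reminder strings are hypothetical, and the closing general rules block ("0/", "1/", "2/", ...) is omitted. Only the section headings, the branch structure, and the final `"\n\n".join(...)` are taken from the diff.

```python
from typing import Optional

# Built programmatically so the example contains no literal code fence.
FENCE = "`" * 3


def build_prompt(
    system_content: str,
    html: str,
    query: Optional[str] = None,
    target_json_example: Optional[str] = None,
) -> str:
    """Hypothetical helper: assemble one prompt string from system and user text."""
    user_content = f"\nHTML to analyze:\n{FENCE}html\n{html}\n{FENCE}\n"

    # Optional hints are appended as their own sections.
    if query:
        user_content += f"\n\n## Query or explanation of target/goal data item:\n{query}"
    if target_json_example:
        user_content += (
            f"\n\n## Example of target JSON object:\n{FENCE}json\n{target_json_example}\n{FENCE}"
        )

    # At most one reminder is appended; nothing is added when both hints exist.
    # The reminder texts below are shortened placeholders, not the real prompts.
    if query and not target_json_example:
        user_content += "\n\nIMPORTANT: no example was given; rely on the query."
    elif not query and target_json_example:
        user_content += "\n\nIMPORTANT: match the structure of the example exactly."
    elif not query and not target_json_example:
        user_content += "\n\nIMPORTANT: rely solely on the HTML content."

    # As in the diff, the two parts are joined with a blank line between them.
    return "\n\n".join([system_content, user_content])


if __name__ == "__main__":
    print(build_prompt("SYSTEM PROMPT", "<ul><li>a</li></ul>", query="list item text"))
```

Returning a single string rather than a list of message dicts leaves the choice of message format to the caller, which appears to be the motivation for the change; how `generate_schema` consumes the string is not shown in this hunk.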