Replies: 1 comment
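To preprocess complex nested HTML with Beautiful Soup so that entities and relations are marked, and then have an extractor recognize and process those marks in a .txt file of technical legal content, you can follow three steps: (1) parse the HTML and mark the entities and relations, (2) extract the marked text into nodes with a custom HTML node parser, and (3) save the processed content to a .txt file.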
Here is an example code snippet to achieve this:

from typing import List

from bs4 import BeautifulSoup
from llama_index.core.node_parser.file.html import HTMLNodeParser
from llama_index.core.schema import BaseNode, MetadataMode, TextNode

# Step 1: Parse the HTML and mark entities and relations
html_content = """
<html>
<body>
<p>John Doe is a lawyer at XYZ Law Firm.</p>
<p>He specializes in corporate law.</p>
</body>
</html>
"""
soup = BeautifulSoup(html_content, "html.parser")

# Example of marking entities and relations.
# Note: assigning to p.string replaces any nested child tags with plain text.
for p in soup.find_all("p"):
    if "John Doe" in p.text:
        p.string = p.text.replace("John Doe", "[XXX SPECIAL ENTITY 1: John Doe]")
    if "XYZ Law Firm" in p.text:
        p.string = p.text.replace("XYZ Law Firm", "[XXX SPECIAL ENTITY 2: XYZ Law Firm]")


# Step 2: Extract the marked entities and relations
class CustomHTMLNodeParser(HTMLNodeParser):
    def get_nodes_from_node(self, node: BaseNode) -> List[TextNode]:
        text = node.get_content(metadata_mode=MetadataMode.NONE)
        soup = BeautifulSoup(text, "html.parser")
        html_nodes = []
        last_tag = None
        current_section = ""
        tags = soup.find_all(self.tags)
        for tag in tags:
            tag_text = self._extract_text_from_tag(tag)
            if tag.name == last_tag or last_tag is None:
                # Same tag type as before: keep accumulating into one section
                last_tag = tag.name
                current_section += f"{tag_text.strip()}\n"
            else:
                # Tag type changed: emit the finished section and start a new one
                html_nodes.append(
                    self._build_node_from_split(
                        current_section.strip(), node, {"tag": last_tag}
                    )
                )
                last_tag = tag.name
                current_section = f"{tag_text}\n"
        if current_section:
            html_nodes.append(
                self._build_node_from_split(
                    current_section.strip(), node, {"tag": last_tag}
                )
            )
        return html_nodes


# Create a node with the modified HTML content.
# BaseNode is abstract and cannot be instantiated, so use TextNode.
base_node = TextNode(text=str(soup))

# Use the custom parser to extract nodes
parser = CustomHTMLNodeParser.from_defaults()
nodes = parser.get_nodes_from_node(base_node)

# Step 3: Save the processed content to a .txt file
with open("processed_legal_content.txt", "w", encoding="utf-8") as file:
    for node in nodes:
        file.write(node.get_content() + "\n")

This code will parse the HTML content, mark the entities "John Doe" and "XYZ Law Firm", extract the marked entities using the …
Hi,
I am a bit of a novice, so apologies if this has an easy answer. I have been testing property graphs to understand how they work and how to use them. I have been using the schema extractor and the simple extractor to get an LLM (I've been using GPT-4o mini, FWIW) to identify and extract entities. This hasn't worked very well so far. I think everyone has some trouble extracting good entities and relations, but it's probably more acute for me given the technical legal content I'm dealing with.
Given the struggles I'm having, I started to think it might be better to pre-process my content with Beautiful Soup to identify the entities. My data originates as a lot of very complex nested HTML, from which I can use BS4 to mark various entities and relations. So what I wanted to do was pre-process my documents with BS4 and spit out a .txt file that includes extra embedded content prepended to entities, with markers like [XXX HEADING LEVEL 1] or [XXX SPECIAL ENTITY 2].
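A minimal sketch of that pre-processing step, for illustration only. It uses the standard library's `html.parser` rather than BS4 so that it runs without extra dependencies; the same logic carries over to BS4. The class `HeadingMarker`, the helper `mark_html`, the tag-to-level mapping, and the sample HTML are all hypothetical, and only headings are marked here.

```python
from html.parser import HTMLParser

# Assumption: h1-h3 map to heading levels 1-3; extend as needed.
HEADING_TAGS = {"h1": 1, "h2": 2, "h3": 3}


class HeadingMarker(HTMLParser):
    """Collect text lines, wrapping heading text in [XXX HEADING LEVEL n: ...]."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished output lines
        self._buffer = []    # text fragments of the element being read
        self._level = None   # level of the heading currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()    # emit any pending plain text first
            self._level = HEADING_TAGS[tag]

    def handle_endtag(self, tag):
        # Headings and paragraphs each end one output line.
        if tag in HEADING_TAGS or tag == "p":
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def _flush(self):
        # Join fragments (including text from nested tags) and normalise whitespace.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if text:
            if self._level is not None:
                text = f"[XXX HEADING LEVEL {self._level}: {text}]"
            self.lines.append(text)
        self._level = None


def mark_html(html: str) -> str:
    parser = HeadingMarker()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text
    return "\n".join(parser.lines)


sample = (
    "<h1>Part 1 <span>Definitions</span></h1>"
    "<p>In this Act, <b>court</b> means the High Court.</p>"
)
marked = mark_html(sample)
print(marked)
```

The output of `mark_html` can be written straight to a .txt file, one line per heading or paragraph.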
With this content in the .txt file, I then wanted an extractor to go through it and know that when it comes across the text [XXX HEADING LEVEL 1: XXX], that is a specific type of entity, and to use the content after the ":" as the actual name of the entity. This removes the subjective judgement of the LLM. Don't get me wrong, its judgement is good. But for complex law I might be able to do a better job using custom BS4 logic to parse the complex nested HTML and identify the entities myself.
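To make the marker convention concrete, here is a rough sketch of reading such markers back out with a regular expression and no LLM involved. It assumes every marker has the exact shape `[XXX <TYPE>: <name>]`; the names `MARKER_RE`, `MarkedEntity`, and `extract_marked_entities` are made up for this example and are not part of any library.

```python
import re
from typing import List, NamedTuple

# Assumed marker shape: "[XXX <TYPE>: <name>]". The type is everything before
# the first colon; the name is everything after it up to the closing bracket.
MARKER_RE = re.compile(r"\[XXX (?P<label>[^:\]]+): (?P<name>[^\]]+)\]")


class MarkedEntity(NamedTuple):
    label: str  # entity type, e.g. "SPECIAL ENTITY 2"
    name: str   # entity name, e.g. "XYZ Law Firm"


def extract_marked_entities(text: str) -> List[MarkedEntity]:
    """Return every marked entity in the text, in order of appearance."""
    return [
        MarkedEntity(m.group("label").strip(), m.group("name").strip())
        for m in MARKER_RE.finditer(text)
    ]


text = (
    "[XXX HEADING LEVEL 1: Part 1 Definitions]\n"
    "[XXX SPECIAL ENTITY 1: John Doe] is a lawyer at "
    "[XXX SPECIAL ENTITY 2: XYZ Law Firm]."
)
for entity in extract_marked_entities(text):
    print(entity.label, "->", entity.name)
```

Because the type and name come directly from the marker, the result is fully deterministic. A name that itself contains a colon still parses, since only the first colon separates type from name.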
The difficulty I have is that I can't see an extractor that would do this, and/or I don't understand how to make the existing extractors operate like this. Is this a simple ask? Is there an easy way to do it?
I can see there is an example by Logan that manually creates entities and relations and upserts them. That kind of process is just too manual for what I need.
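One way the manual step might be automated is to derive entity and relation records from the marked text in a loop, so that nothing is typed by hand. The sketch below is plain Python and does not call any graph store API. The function name `build_graph_records`, the `MENTIONS` relation label, and the rule "link each entity to the most recent heading" are assumptions for illustration, not something the thread or the library prescribes.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed marker shape: "[XXX <TYPE>: <name>]".
MARKER_RE = re.compile(r"\[XXX (?P<label>[^:\]]+): (?P<name>[^\]]+)\]")

Relation = Tuple[str, str, str]  # (source name, relation label, target name)


def build_graph_records(marked_text: str) -> Tuple[Dict[str, str], List[Relation]]:
    """Scan marked text and return (entities, relations) without using an LLM."""
    entities: Dict[str, str] = {}   # name -> type, first occurrence wins
    relations: List[Relation] = []
    current_heading: Optional[str] = None
    for line in marked_text.splitlines():
        for m in MARKER_RE.finditer(line):
            label = m.group("label").strip()
            name = m.group("name").strip()
            entities.setdefault(name, label)
            if label.startswith("HEADING"):
                # Headings set the scope for the entities that follow.
                current_heading = name
            elif current_heading is not None:
                # Hypothetical rule: a heading "MENTIONS" each entity under it.
                rel = (current_heading, "MENTIONS", name)
                if rel not in relations:
                    relations.append(rel)
    return entities, relations


marked_text = (
    "[XXX HEADING LEVEL 1: Part 1 Definitions]\n"
    "[XXX SPECIAL ENTITY 1: John Doe] is a lawyer at "
    "[XXX SPECIAL ENTITY 2: XYZ Law Firm].\n"
    "[XXX SPECIAL ENTITY 1: John Doe] specializes in corporate law."
)
entities, relations = build_graph_records(marked_text)
print(entities)
print(relations)
```

The resulting records could then be converted into whatever node and relation objects the property graph store expects and upserted in a single pass over all files.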
Any thoughts?