property graphs - pre-processing with Beautiful Soup #15109

Jez77 · 2024-08-02T10:06:13Z

Hi,

I am a bit of a novice, so apologies if this has an easy answer. I have been testing property graphs to understand how they work and how to use them. I have been using the schema extractor and the simple extractor to get an LLM (I've been using GPT-4o mini, FWIW) to identify and extract entities. This hasn't worked very well so far. I think everyone has some trouble extracting good entities and relations, but it's probably more acute for me given the technical legal content I'm dealing with.

Given these struggles, I started to think it might be better to pre-process my content with Beautiful Soup to identify the entities. My data originates as a lot of very complex nested HTML, in which I can use BS4 to mark various entities and relations. So what I wanted to do was pre-process my documents with BS4 and spit out a .txt file with extra embedded content prepended to entities, such as [XXX HEADING LEVEL 1] or [XXX SPECIAL ENTITY 2].

With this content in the .txt file, I then wanted an extractor to go through it and know that when it comes across the text [XXX HEADING LEVEL 1: XXX], that is a specific type of entity, and to use the content after the ":" as the actual name of the entity. This removes the subjective judgement of the LLM. Don't get me wrong, its judgement is good. But for complex law I might be able to do a better job using custom BS4 logic to parse the complex nested HTML and manually identify entities.
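To make the marker convention concrete, here is a minimal sketch of how such markers could be recognised deterministically with a regular expression, without any LLM involved. The exact grammar ("[XXX TYPE: name]"), the `MarkedEntity` type and the `find_marked_entities` helper are assumptions for illustration only; they are not part of LlamaIndex.

```python
import re
from typing import List, NamedTuple

# Assumed marker grammar: "[XXX <TYPE>: <name>]",
# e.g. "[XXX HEADING LEVEL 1: Definitions]".
MARKER_RE = re.compile(r"\[XXX (?P<type>[A-Z0-9 ]+?): (?P<name>[^\]]+)\]")


class MarkedEntity(NamedTuple):
    entity_type: str  # e.g. "HEADING LEVEL 1"
    name: str         # the text after the colon
    start: int        # character offset of the marker in the text


def find_marked_entities(text: str) -> List[MarkedEntity]:
    """Return every marker found in `text`, in document order."""
    return [
        MarkedEntity(m.group("type").strip(), m.group("name").strip(), m.start())
        for m in MARKER_RE.finditer(text)
    ]


sample = (
    "[XXX HEADING LEVEL 1: Definitions]\n"
    "In this Act, [XXX SPECIAL ENTITY 2: the Regulator] means the body "
    "established under section 4.\n"
)
entities = find_marked_entities(sample)
for entity in entities:
    print(entity.entity_type, "->", entity.name)
```

Because the type and name come straight from the marker, the result depends only on the BS4 pre-processing and not on a model's judgement.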

The difficulty I have is that I can't see an extractor that would do this, and/or I don't understand how to make the existing extractors operate like this. Is this a simple ask? Is there an easy way to do this?

I can see there is an example by Logan that manually creates entities and relations and upserts them. That type of process is just too manual for what I need.

Any thoughts?

@dosu · 2024-08-02T10:12:12Z

dosubot[bot]
bot Aug 2, 2024

To pre-process complex nested HTML with Beautiful Soup, mark entities and relations in it, and write the marked-up text to a .txt file for later extraction, you can follow these steps:

1. Parse the HTML and mark entities and relations:
   Use Beautiful Soup to parse the HTML content and identify the entities and relations. You can then mark these entities and relations with custom tags or annotations.
2. Extract the marked entities and relations:
   Use the HTMLNodeParser class to extract the marked entities and relations from the HTML content.
3. Save the processed content to a .txt file:
   Write the extracted and processed content to a .txt file.

Here is an example code snippet to achieve this:

from typing import List

from bs4 import BeautifulSoup
from llama_index.core.node_parser.file.html import HTMLNodeParser
from llama_index.core.schema import BaseNode, MetadataMode, TextNode

# Step 1: Parse the HTML and mark entities and relations
html_content = """
<html>
<body>
<p>John Doe is a lawyer at XYZ Law Firm.</p>
<p>He specializes in corporate law.</p>
</body>
</html>
"""

soup = BeautifulSoup(html_content, "html.parser")

# Example of marking entities and relations
for p in soup.find_all('p'):
    if "John Doe" in p.text:
        p.string = p.text.replace("John Doe", "[XXX SPECIAL ENTITY 1: John Doe]")
    if "XYZ Law Firm" in p.text:
        p.string = p.text.replace("XYZ Law Firm", "[XXX SPECIAL ENTITY 2: XYZ Law Firm]")

# Step 2: Extract the marked entities and relations
class CustomHTMLNodeParser(HTMLNodeParser):
    def get_nodes_from_node(self, node: BaseNode) -> List[TextNode]:
        text = node.get_content(metadata_mode=MetadataMode.NONE)
        soup = BeautifulSoup(text, "html.parser")
        html_nodes = []
        last_tag = None
        current_section = ""

        tags = soup.find_all(self.tags)
        for tag in tags:
            tag_text = self._extract_text_from_tag(tag)
            if tag.name == last_tag or last_tag is None:
                last_tag = tag.name
                current_section += f"{tag_text.strip()}\n"
            else:
                html_nodes.append(
                    self._build_node_from_split(
                        current_section.strip(), node, {"tag": last_tag}
                    )
                )
                last_tag = tag.name
                current_section = f"{tag_text}\n"

        if current_section:
            html_nodes.append(
                self._build_node_from_split(
                    current_section.strip(), node, {"tag": last_tag}
                )
            )

        return html_nodes

# Wrap the modified HTML in a TextNode (BaseNode is abstract and cannot be instantiated directly)
base_node = TextNode(text=str(soup))

# Use the custom parser to extract nodes
parser = CustomHTMLNodeParser.from_defaults()
nodes = parser.get_nodes_from_node(base_node)

# Step 3: Save the processed content to a .txt file
with open("processed_legal_content.txt", "w", encoding="utf-8") as file:
    for node in nodes:
        file.write(node.get_content() + "\n")

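This code parses the HTML content, marks the entities "John Doe" and "XYZ Law Firm", splits the marked-up HTML into text nodes with the HTMLNodeParser subclass, and saves the resulting text to a .txt file ^[1]^[2]. Note that it only embeds the markers in the text; it does not by itself turn them into graph entities or relations.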
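As a possible next step, the markers in the saved text could be turned into entity and relation records with plain rules rather than an LLM. The sketch below is only an illustration under stated assumptions: the marker grammar, the `Entity` and `Triple` records, and the relation labels `HAS_SUBSECTION` and `MENTIONS` are all hypothetical, and mapping the records onto a property graph store's own node and relation types is not shown.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed marker grammar: "[XXX <TYPE>: <name>]".
MARKER_RE = re.compile(r"\[XXX (?P<type>[A-Z0-9 ]+?): (?P<name>[^\]]+)\]")
HEADING_RE = re.compile(r"HEADING LEVEL (\d+)")


@dataclass(frozen=True)
class Entity:
    label: str  # marker type, e.g. "HEADING LEVEL 1"
    name: str   # text after the colon


@dataclass(frozen=True)
class Triple:
    source: str
    relation: str  # hypothetical labels: HAS_SUBSECTION, MENTIONS
    target: str


def extract_graph(text: str) -> Tuple[List[Entity], List[Triple]]:
    """Derive entities and relations from markers, in document order."""
    entities: List[Entity] = []
    triples: List[Triple] = []
    open_headings: Dict[int, str] = {}  # heading level -> current heading name

    for match in MARKER_RE.finditer(text):
        label = match.group("type").strip()
        name = match.group("name").strip()
        entities.append(Entity(label, name))

        heading = HEADING_RE.fullmatch(label)
        if heading:
            level = int(heading.group(1))
            # A new heading closes open headings at the same or a deeper level.
            for stale in [lvl for lvl in open_headings if lvl >= level]:
                del open_headings[stale]
            # The deepest remaining heading, if any, is the parent.
            if open_headings:
                parent = open_headings[max(open_headings)]
                triples.append(Triple(parent, "HAS_SUBSECTION", name))
            open_headings[level] = name
        elif open_headings:
            # Non-heading entities are attached to the innermost open heading.
            current = open_headings[max(open_headings)]
            triples.append(Triple(current, "MENTIONS", name))

    return entities, triples


sample = (
    "[XXX HEADING LEVEL 1: Part 1]\n"
    "[XXX HEADING LEVEL 2: Definitions]\n"
    "[XXX SPECIAL ENTITY 2: the Regulator] means the body established under section 4.\n"
    "[XXX HEADING LEVEL 1: Part 2]\n"
)
entities, triples = extract_graph(sample)
for triple in triples:
    print(triple.source, triple.relation, triple.target)
```

The heading-nesting rule is one simple choice; whatever structure the BS4 logic already knows about the source HTML could be encoded in the markers and read back the same way.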
