How to extract content from documents #562

Cyp9715 · 2025-09-09T07:35:51Z

Hello. I am currently extracting document content in a pipeline, processing it myself, and then passing it on. I use the basic code below for the extraction.

However, as you know, this code is fragile, especially when a file's own content contains literal <source ...> or </source> syntax.

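import re
from typing import Generator, Iterator, List, Union

def pipe(self, user_message: str, model_id: str, messages: List[dict], body: dict) -> Union[str, Generator, Iterator]:
    documents = []  # extracted file contents, in message order
    for msg in messages:
        content = msg.get('content', '')
        # Capture the name attribute and the body of each <source ...>...</source> block.
        source_matches = re.findall(r'<source[^>]*name="([^"]*)"[^>]*>(.*?)</source>', content, re.DOTALL)

        for filename, source_content in source_matches:
            documents.append(source_content)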
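
To make the risk concrete, here is a minimal sketch using the same regex as above. The helper name extract_sources and the sample message are hypothetical and exist only for illustration. The sample assumes an uploaded file whose text itself contains a <source ...>...</source> pair.

```python
import re
from typing import List, Tuple

# Same pattern as in pipe() above.
SOURCE_PATTERN = re.compile(
    r'<source[^>]*name="([^"]*)"[^>]*>(.*?)</source>', re.DOTALL
)

def extract_sources(content: str) -> List[Tuple[str, str]]:
    """Return (name, body) pairs for every <source ...>...</source> match."""
    return SOURCE_PATTERN.findall(content)

# Hypothetical message: the file's own text contains a nested <source> element.
message = (
    '<source id="1" name="notes.xml">'
    'before <source name="inner">nested</source> after'
    '</source>'
)

matches = extract_sources(message)
for name, body in matches:
    print(repr(name), repr(body))
```

Because (.*?) is non-greedy, the match ends at the first </source> it meets, which here belongs to the file's content. The extracted body is cut off after "nested" and the trailing " after" is silently dropped.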

Pipelines are excellent, but debugging them is complex, so I would appreciate advice on how to handle this cleanly.

I know I could filter out such cases myself, but please let me know if there is a more fundamental solution.

Thank you.

Replies: 0 comments