MultiSource Recall hang due to regex catastrophic backtracking #116

@sinflrobot

Like some others, I was seeing the local complete datagen hang during the multi_source_recall pipeline. I suspect this is the underlying cause of #109 and possibly #107. When a generated response doesn't exactly match the expected format, the regex appears to hang due to catastrophic backtracking.

Here is an example of the fault: https://regex101.com/r/mDBOUg/1
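For anyone unfamiliar with the failure mode, here is a minimal sketch of catastrophic backtracking in Python's re module. The pattern is a textbook illustration with nested quantifiers, not the actual regex from multi_source_helpers, and the names NESTED, FLAT and time_failed_match are made up for this example. The point is only that the time to fail a match grows roughly exponentially with input length for the nested pattern, while the flat one stays fast.

```python
import re
import time

# Illustrative pattern only (NOT the project's regex): the nested quantifiers
# let the engine split a run of "a" characters in exponentially many ways.
NESTED = re.compile(r"(a+)+$")
# Accepts the same strings without nesting, so a failed match stays cheap.
FLAT = re.compile(r"a+$")


def time_failed_match(pattern, n):
    """Return seconds spent trying to match n 'a' characters followed by 'b'."""
    text = "a" * n + "b"  # the trailing "b" guarantees the match fails
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: each extra character roughly doubles the nested time.
    for n in (14, 16, 18, 20):
        nested = time_failed_match(NESTED, n)
        flat = time_failed_match(FLAT, n)
        print(f"n={n} nested={nested:.4f}s flat={flat:.6f}s", flush=True)
```

With real responses of a few thousand characters, the same effect looks like a hang rather than a slowdown.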

Since I am not great with regex, I wrote a workaround that breaks the response up incrementally. So far it has prevented the hangs:

Hacked version of: multi_source_helpers.extract_qa_tuples

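import re

def extract_qa_tuples(text):
    # Split on each "**QUESTION:**" marker first, so no single regex has to span a whole Q/A block.
    question_blocks = [s for s in re.split(r"\*\*QUESTION:\s*\*\*\s*", text, flags=re.IGNORECASE) if s]

    response = []
    for qb in question_blocks:
        # Pieces are [question, answer] or [question, thoughts, ..., answer].
        parts = [s for s in re.split(r"\*\*ANSWER:\s*", qb, flags=re.IGNORECASE) if s]
        if len(parts) > 2:
            # Keep the block only if the second piece looks like a thought process.
            if re.search("Thought Process", parts[1], flags=re.IGNORECASE):
                response.append({"question": parts[0].strip(), "answer": parts[-1].strip(), "thoughts": parts[1].strip()})
        elif len(parts) == 2:
            response.append({"question": parts[0].strip(), "answer": parts[-1].strip(), "thoughts": ""})
    print(response, flush=True)  # debug output
    return response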

I also suspect this bug is behind some of the earlier problems that were attributed to threading: log output frequently appears out of order, or is missing entirely when the hang occurs, because the debug print statements do not use flush=True.
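As a rough sketch of what I mean, a small helper like the one below would make the debug output reliable. The name debug and the thread-name prefix are my own assumptions, not anything that exists in the repo. With flush=True each line is pushed out of Python's buffer as soon as it is printed, so the last message before a hang is not lost in the buffer.

```python
import sys
import threading


def debug(msg, stream=sys.stdout):
    """Print a debug line immediately and return the formatted text."""
    # Hypothetical helper: prefixing the thread name makes interleaved output easier to read.
    line = f"[{threading.current_thread().name}] {msg}"
    # flush=True writes the line out right away instead of leaving it buffered.
    print(line, file=stream, flush=True)
    return line


if __name__ == "__main__":
    debug("about to run extract_qa_tuples")
```

Printing one of these right before the regex call should show exactly which response triggers the hang.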
