MultiSource Recall hang due to regex catastrophic backtracking #116

Like some others, I was seeing the local complete datagen hang during the multi_source_recall pipeline. I suspect this is the underlying cause of #109 and possibly #107. When a generated response doesn't perfectly match the expected format, the regex appears to hang due to catastrophic backtracking.

Here is an example of the fault: [https://regex101.com/r/mDBOUg/1](https://regex101.com/r/mDBOUg/1) 
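For anyone who wants to see the effect locally without the full pipeline, here is a minimal sketch. It does not use the pipeline's actual pattern (that one is in the regex101 link above); it uses the textbook pathological pattern `(a+)+$` as a stand-in, and `time_failed_match` is a hypothetical helper name. The point is only that a match which must fail can take roughly twice as long for each extra input character.

```python
import re
import time

# Textbook pathological pattern used as a stand-in.
# Assumption: this is NOT the regex used in multi_source_recall.
PATHOLOGICAL = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Time a match that must fail: n 'a' characters followed by one 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATHOLOGICAL.match(text)  # always None, but only after heavy backtracking
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Keep n small: the runtime grows roughly exponentially with n.
    for n in (14, 16, 18, 20):
        result, elapsed = time_failed_match(n)
        print(f"n={n:2d} matched={result is not None} seconds={elapsed:.4f}", flush=True)
```

With a long enough near-miss input, the same kind of growth would look like a hang rather than a slow match.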

I'm not great with regex, so I made a workaround that splits the response up incrementally instead of matching it in one pass. So far it has prevented the hangs:

Hacked version of `multi_source_helpers.extract_qa_tuples`:
```
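import re


def extract_qa_tuples(text):
    # Split on each "**QUESTION:**" marker instead of matching the whole response at once.
    question_blocks = [s for s in re.split(r"\*\*QUESTION:\s*\*\*\s*", text, flags=re.IGNORECASE) if s]

    response = []
    for qb in question_blocks:
        # Split the block on "**ANSWER:"; the first piece is the question text.
        ab = [s for s in re.split(r"\*\*ANSWER:\s*", qb, flags=re.IGNORECASE) if s]
        if len(ab) > 2:
            # Extra ANSWER marker: keep the block only if the middle piece is the thought process.
            if re.search("Thought Process", ab[1], flags=re.IGNORECASE):
                response.append({"question": ab[0].strip(), "answer": ab[-1].strip(), "thoughts": ab[1].strip()})
        elif len(ab) == 2:
            response.append({"question": ab[0].strip(), "answer": ab[-1].strip(), "thoughts": ""})
    print(response, flush=True)  # flush so debug output is not lost or reordered
    return response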
```
I'm also guessing this caused some of the earlier problems that were chalked up to threading: because the debug print statements lack `flush=True`, log output frequently appears out of order, or is missing entirely when the hang occurs.
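As a rough sketch of what I mean by flushing, something like the following would write each debug line out as soon as it is printed. `debug_print` and `log_step` are hypothetical names for illustration, not functions from the repo:

```python
import functools
import sys

# Hypothetical helper: a print that flushes on every call, so buffered
# output is not stranded if the process hangs right afterwards.
debug_print = functools.partial(print, flush=True)


def log_step(step, message, stream=None):
    """Write one debug line and flush it immediately."""
    if stream is None:
        stream = sys.stdout
    debug_print(f"[{step}] {message}", file=stream)


if __name__ == "__main__":
    log_step("multi_source_recall", "extracting QA tuples")
```

Swapping the existing debug `print` calls for something like this should make it easier to tell a real threading problem from a hang in the regex.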
