Description
Like some others, I was having issues with the local complete datagen hanging during the multi_source_recall pipeline. I suspect this is the underlying issue reported in #109 and possibly #107. When a generated response doesn't perfectly match the expected format, the regex appears to hang due to catastrophic backtracking.
Here is an example of the fault: https://regex101.com/r/mDBOUg/1
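For anyone unfamiliar with catastrophic backtracking, here is a minimal sketch of the effect. It uses the classic textbook pattern `(a+)+$`, not the pipeline's actual regex, and the names `NESTED` and `time_failed_match` are made up for the illustration. The input is built so the match must fail, which forces the engine to try every way of splitting the run of `a`s between the nested quantifiers.

```python
import re
import time

# Classic textbook pattern with nested quantifiers.
# This is an illustration only, NOT the regex used by the pipeline.
NESTED = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return the seconds spent on a match attempt that is built to fail."""
    # The trailing "b" means "$" can never match, so the engine backtracks
    # through every possible grouping of the n leading "a" characters.
    text = "a" * n + "b"
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


if __name__ == "__main__":
    # Each extra character roughly doubles the work, so keep n small.
    for n in (10, 14, 18, 20):
        print(f"n={n:2d}  {time_failed_match(n):.4f}s", flush=True)
```

The time should grow very quickly as `n` increases, which is the same kind of behaviour that would make a long, slightly malformed response look like a hang.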
I'm not great with regex, so I made a workaround that breaks the response up incrementally. So far it has prevented the hangs:
Hacked version of multi_source_helpers.extract_qa_tuples:
import re

def extract_qa_tuples(text):
    # Split on the question marker first so each chunk holds one question block.
    questionblocks = [s for s in re.split(r"\*\*QUESTION:\s*\*\*\s*", text, flags=re.IGNORECASE) if s]
    response = []
    for qb in questionblocks:
        # Split each block on the answer marker: ab[0] is the question, ab[-1] the answer.
        ab = [s for s in re.split(r"\*\*ANSWER:\s*", qb, flags=re.IGNORECASE) if s]
        if len(ab) > 2:
            # With more than two pieces, ab[1] is kept as the thoughts only if it says so.
            if re.search("Thought Process", ab[1], flags=re.IGNORECASE):
                response.append({"question": ab[0].strip(), "answer": ab[-1].strip(), "thoughts": ab[1].strip()})
        elif len(ab) == 2:
            response.append({"question": ab[0].strip(), "answer": ab[-1].strip(), "thoughts": ""})
    print(response)
    return response
I also suspect this caused some of the earlier problems that were chalked up to threading: because the debug print statements lack flush=True, log output frequently appears out of order, or is missing entirely when the hang occurs.
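As a rough sketch of the flush point, a small wrapper like the one below would push each debug line out immediately. `debug_print` is a hypothetical helper I made up for this example, not something that exists in the repo.

```python
def debug_print(*args, **kwargs):
    """Hypothetical helper: print and flush straight away.

    Without flush=True, output can sit in a buffer and be lost or
    appear late if the process hangs before the buffer is written.
    """
    kwargs.setdefault("flush", True)  # callers can still override this
    print(*args, **kwargs)


if __name__ == "__main__":
    debug_print("starting extraction")
```

Swapping the existing debug prints for something like this should make the last line printed before a hang much easier to trust.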