
1541 Move/copy all relevant parts of CNGT_scripts into the Signbank codebase #1701

Open

vanlummelhuizen wants to merge 7 commits into master from
1541-movecopy-all-relevant-parts-of-cngt_scripts-into-the-signbank-codebase

Conversation

@vanlummelhuizen
Collaborator

Relevant CNGT_scripts code is moved to this repo.

@susanodd
Collaborator

susanodd commented Mar 5, 2026

I found this same message while doing the requirements update #1700 (running the tests). I have some test eaf files locally that it uses to run the corpus tests.

```
..CONFIGURE CORPUS
/home/susan/signbank/src/NGT-signbank/signbank/signCounter.py:90: FutureWarning: This search incorrectly ignores the root element, and will be fixed in a future version.  If you rely on the current behaviour, change it to './/TIME_SLOT'
  for time_slot in xml.findall("//TIME_SLOT"):
/home/susan/signbank/src/NGT-signbank/signbank/signCounter.py:110: FutureWarning: This search incorrectly ignores the root element, and will be fixed in a future version.  If you rely on the current behaviour, change it to './/TIER'
  for tier in xml.findall("//TIER"):
```

I'm using a newer version of the requirements than the ones in that issue. (I don't know what the answer is. It runs the tests okay.)

The tests all pass. The video tests probably aren't testing anything relevant for this.

A lot of the video methods got rewritten because Jetske ran into problems converting some files: the frame rate ended up changing the length of the resulting video, so the conversion would fail. She did lots of things with the eaf files and the annotated sentence videos. There aren't any tests for those.

Maybe you have some idea what is needed for extra tests? (Also regarding that weird character string that ends up in the filename.)

The code works for this branch!! With the exception of those FutureWarning messages. (That can be deferred to that other issue. I didn't want to keep changing my venv.)

Collaborator

@susanodd susanodd left a comment

Approve

Copilot AI left a comment

Pull request overview

This PR vendors previously external CNGT_scripts functionality into the Signbank codebase, updating internal imports and removing the git-based dependency so frequency/video processing can run without that external repo.

Changes:

  • Add internalized script modules for video resizing, middle-frame still extraction, and EAF sign counting.
  • Replace CNGT_scripts imports in the video and frequency pipelines with in-repo equivalents (including a new EAF creation-time helper).
  • Remove the CNGT_scripts dependency from requirements.txt.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 16 comments.

Summary per file:

| File | Description |
| --- | --- |
| signbank/video/resizeVideos.py | Adds an in-repo VideoResizer used by the video app. |
| signbank/video/models.py | Switches VideoResizer import from CNGT_scripts to the in-repo module. |
| signbank/video/extractMiddleFrame.py | Adds an in-repo MiddleFrameExtracter used for generating still images. |
| signbank/tools.py | Switches MiddleFrameExtracter import and introduces get_eaf_creation_time. |
| signbank/signCounter.py | Adds an in-repo SignCounter used by the corpus frequency pipeline. |
| signbank/frequency.py | Updates corpus update logic to use the in-repo SignCounter + get_eaf_creation_time. |
| requirements.txt | Removes the git-based CNGT_scripts dependency. |


Comment on lines +200 to +240
```python
for signer_id in (1, 2):
    unit = []  # Overlapping glosses are put in a unit.
    last_end_on = ''  # The hand (L or R) of the last seen gloss
    last_end = None  # The end timeSlot of the last seen gloss

    right_tier_id = tier_id_hand['R']
    left_tier_id = tier_id_hand['L']
    if right_tier_id in list_of_glosses and left_tier_id in list_of_glosses:
        right_hand_data = list_of_glosses[right_tier_id]
        left_hand_data = list_of_glosses[left_tier_id]

        if "annotations" in right_hand_data and "annotations" in left_hand_data:
            right_hand_annotations = right_hand_data['annotations']
            left_hand_annotations = left_hand_data['annotations']
            while len(right_hand_annotations) > 0 or len(left_hand_annotations) > 0:
                if len(right_hand_annotations) > 0 and len(left_hand_annotations) > 0:
                    if right_hand_annotations[0]['begin'] <= left_hand_annotations[0]['begin']:
                        last_end_on = 'R'
                    else:
                        last_end_on = 'L'
                elif len(right_hand_data['annotations']) > 0:
                    last_end_on = 'R'
                else:
                    last_end_on = 'L'

                current_hand_data = list_of_glosses[tier_id_hand[last_end_on]]
                current_hand_begin = current_hand_data['annotations'][0]['begin']
                if last_end is not None and current_hand_begin > (last_end - self.minimum_overlap):
                    # Begin new unit
                    list_of_gloss_units.append(unit)
                    unit = []

                unit.append(current_hand_data['annotations'][0])

                current_hand_end = current_hand_data['annotations'][0]['end']
                if last_end is None or current_hand_end > last_end:
                    last_end = current_hand_end

                current_hand_data['annotations'].pop(0)

            list_of_gloss_units.append(unit)
```
Copilot AI Mar 6, 2026

to_units_two_handed iterates for signer_id in (1, 2): but signer_id is never used. This causes the same units to be appended twice, which will double-count tokens for two-handed tiers. Remove the unused loop and build list_of_gloss_units once.

Suggested change (drop the unused `for signer_id in (1, 2):` loop and dedent its body; the removed lines repeat the snippet above):

```python
unit = []  # Overlapping glosses are put in a unit.
last_end_on = ''  # The hand (L or R) of the last seen gloss
last_end = None  # The end timeSlot of the last seen gloss

right_tier_id = tier_id_hand['R']
left_tier_id = tier_id_hand['L']
if right_tier_id in list_of_glosses and left_tier_id in list_of_glosses:
    right_hand_data = list_of_glosses[right_tier_id]
    left_hand_data = list_of_glosses[left_tier_id]

    if "annotations" in right_hand_data and "annotations" in left_hand_data:
        right_hand_annotations = right_hand_data['annotations']
        left_hand_annotations = left_hand_data['annotations']
        while len(right_hand_annotations) > 0 or len(left_hand_annotations) > 0:
            if len(right_hand_annotations) > 0 and len(left_hand_annotations) > 0:
                if right_hand_annotations[0]['begin'] <= left_hand_annotations[0]['begin']:
                    last_end_on = 'R'
                else:
                    last_end_on = 'L'
            elif len(right_hand_data['annotations']) > 0:
                last_end_on = 'R'
            else:
                last_end_on = 'L'

            current_hand_data = list_of_glosses[tier_id_hand[last_end_on]]
            current_hand_begin = current_hand_data['annotations'][0]['begin']
            if last_end is not None and current_hand_begin > (last_end - self.minimum_overlap):
                # Begin new unit
                list_of_gloss_units.append(unit)
                unit = []

            unit.append(current_hand_data['annotations'][0])

            current_hand_end = current_hand_data['annotations'][0]['end']
            if last_end is None or current_hand_end > last_end:
                last_end = current_hand_end

            current_hand_data['annotations'].pop(0)

        list_of_gloss_units.append(unit)
```

Comment on lines +258 to +262
```python
re.sub(r'\n', '', gloss)
re.sub(r'\t', '', gloss)
re.sub(r'\s\s+', ' ', gloss)
re.sub(r'^\s+', '', gloss)
re.sub(r'\s+$', '', gloss)
```
Copilot AI Mar 6, 2026

The gloss “cleanup” calls to re.sub(...) don’t assign back to gloss, so they have no effect (since strings are immutable). This means whitespace/newline normalization isn’t applied and counts may diverge from the intended behavior. Assign the return value of each substitution back to gloss (or chain them).

Suggested change:

```python
gloss = re.sub(r'\n', '', gloss)
gloss = re.sub(r'\t', '', gloss)
gloss = re.sub(r'\s\s+', ' ', gloss)
gloss = re.sub(r'^\s+', '', gloss)
gloss = re.sub(r'\s+$', '', gloss)
```

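As an aside, the five substitutions can be folded into a single helper. A minimal sketch (the `clean_gloss` name is illustrative, not from the PR), preserving the original semantics of dropping newlines and tabs outright before collapsing spaces:

```python
import re

def clean_gloss(gloss):
    # Same effect as the five chained substitutions: drop newlines and
    # tabs outright, collapse runs of whitespace, then trim the ends.
    gloss = re.sub(r'[\n\t]', '', gloss)
    gloss = re.sub(r'\s\s+', ' ', gloss)
    return gloss.strip()

print(clean_gloss("  PT:1\t\nX   Y  "))  # -> PT:1X Y
```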
Comment on lines +64 to +68
self.generate_result()
except KeyError as ke:
sys.stderr.write("KeyError in file %s: '%s'\n" % (f, ke.args[0]))
# except:
# sys.stderr.write("Unexpected error: %s %s\n" % (str(sys.exc_info()[0]), str(sys.exc_info()[1])))
Copilot AI Mar 6, 2026

generate_result() is called inside the per-file loop, recomputing self.sign_counts after every processed file. This is unnecessary work if the final aggregated result is all that’s needed (and in frequency.py the code reads freqsPerPerson directly). Consider moving generate_result() out of the loop (run once after all files) or making it opt-in.

Suggested change (call `generate_result()` once, after the loop):

```python
    except KeyError as ke:
        sys.stderr.write("KeyError in file %s: '%s'\n" % (f, ke.args[0]))
    # except:
    #     sys.stderr.write("Unexpected error: %s %s\n" % (str(sys.exc_info()[0]), str(sys.exc_info()[1])))
self.generate_result()
```

```python
    output_file]
print(" ".join(cmd))
if not dry_run:
    Popen(cmd)
```
Copilot AI Mar 6, 2026

The ffmpeg process is started with Popen(cmd) but never waited on. When resizing multiple files this can spawn many concurrent ffmpeg processes and return before output files are complete. Use p = Popen(cmd); p.wait() (or subprocess.run) and handle non-zero exit codes.

Suggested change:

```python
process = Popen(cmd)
return_code = process.wait()
if return_code != 0:
    raise RuntimeError("ffmpeg command failed with exit code {}".format(return_code))
```

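The `subprocess.run` alternative mentioned above could look roughly like this (a sketch; `run_resize_command` is an illustrative name, and a Python one-liner stands in for the real ffmpeg invocation):

```python
import subprocess
import sys

def run_resize_command(cmd):
    # Blocks until the process exits; check=True raises
    # CalledProcessError on a non-zero exit code.
    subprocess.run(cmd, check=True)

# Harmless stand-in for the ffmpeg command
run_resize_command([sys.executable, "-c", "print('resized')"])
```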
```python
self.delete_frames = delete_frames

if output_dir:
    self.output_dir = output_dir.rstrip(os.sep)
```
Copilot AI Mar 6, 2026

self.output_dir is only set when output_dir is truthy. When output_dir is None or an empty string (e.g., running the script without -o), self.output_dir is undefined and this will raise AttributeError at the os.path.isdir(self.output_dir) check. Set a default (e.g., require -o and raise a clear error, or default to a temp directory/current directory) before using self.output_dir.

Suggested change:

```python
if output_dir:
    self.output_dir = output_dir.rstrip(os.sep)
else:
    # Default to the current working directory if no output_dir is provided
    self.output_dir = os.getcwd()
```


```python
try:
    for something in self.metadata[person].keys():
        if something != 'self.region_metadata_id':
```
Copilot AI Mar 6, 2026

In the metadata aggregation loop, if something != 'self.region_metadata_id': compares against the literal string 'self.region_metadata_id' rather than the instance field self.region_metadata_id. As written, it will never exclude the region key and will produce an extra frequencyPer... bucket for the region metadata. Compare against self.region_metadata_id instead.

Suggested change:

```python
if something != self.region_metadata_id:
```

Comment on lines 1038 to 1043
```diff
 eaf_path = get_path_of_eaf_file(dataset_eaf_folder, uploaded_paths, document_id)

-document_creation_dates_of_eaf_files = [ get_creation_time(eaf_path) ]
+document_creation_dates_of_eaf_files = [get_eaf_creation_time(eaf_path)]

 dictionary_documentIds_to_documentObjs = dictionary_documentIdentifiers_to_documentObjects(corpus,
                                                                                            document_identifiers_of_eaf_files,
```
Copilot AI Mar 6, 2026

document_creation_dates_of_eaf_files is constructed as a list, but dictionary_documentIdentifiers_to_documentObjects expects a dict keyed by document identifier (it indexes by document_id). This will raise a TypeError if the document does not already exist. Build a {document_id: creation_time} dict here for consistency with other call sites.

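One way the suggested `{document_id: creation_time}` shape could be built (a sketch; `creation_times_by_document` is a hypothetical helper, and the lambda stands in for `get_eaf_creation_time`):

```python
def creation_times_by_document(document_ids, eaf_paths, get_creation_time):
    # Key creation times by document identifier, the shape that
    # dictionary_documentIdentifiers_to_documentObjects indexes into.
    return {doc_id: get_creation_time(path)
            for doc_id, path in zip(document_ids, eaf_paths)}

times = creation_times_by_document(
    ["CNGT0001"], ["/tmp/CNGT0001.eaf"], lambda path: "2016-08-01")
print(times)  # -> {'CNGT0001': '2016-08-01'}
```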
Comment on lines +59 to +60
```python
probe_cmd = self.ffmpeg_cmd[0:2] + "probe"
cmd = [probe_cmd, "-of", "json", "-show_streams", video_file]
```
Copilot AI Mar 6, 2026

probe_cmd = self.ffmpeg_cmd[0:2] + "probe" only works when ffmpeg_cmd is exactly ffmpeg/avconv. If settings ever provide an absolute path (e.g. /usr/local/bin/ffmpeg) or a different wrapper command, this produces an invalid probe command. Consider accepting an explicit ffprobe_cmd, or derive it robustly (e.g. via Path(ffmpeg_cmd).with_name('ffprobe') when a path is used, falling back to ffprobe).

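The robust derivation the comment suggests could be sketched like this (`derive_probe_cmd` is a hypothetical name):

```python
from pathlib import Path

def derive_probe_cmd(ffmpeg_cmd):
    # If ffmpeg_cmd is a path, swap just the binary name so the probe
    # tool is taken from the same directory; otherwise use bare ffprobe.
    path = Path(ffmpeg_cmd)
    if len(path.parts) > 1:
        return str(path.with_name("ffprobe"))
    return "ffprobe"

print(derive_probe_cmd("/usr/local/bin/ffmpeg"))  # -> /usr/local/bin/ffprobe (on POSIX)
print(derive_probe_cmd("ffmpeg"))                 # -> ffprobe
```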
Comment on lines +82 to +83
```python
scale_formula = "scale=(trunc((iw/(ih/%f))/2+0.5))*2:%f" % (self.resize_scale, self.resize_scale)
path, ext = os.path.splitext(video_file)
```
Copilot AI Mar 6, 2026

scale_formula uses self.resize_scale directly; with the current default resize_scale=-1 this generates an invalid ffmpeg scale expression. It would be safer to validate resize_scale (must be a positive int) and raise a clear error early if it’s not set.

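The early validation could be sketched like this (`build_scale_formula` is a hypothetical helper; the format string is the one from the diff):

```python
def build_scale_formula(resize_scale):
    # Fail fast on the unset default (-1) instead of emitting an
    # invalid ffmpeg scale expression.
    if not isinstance(resize_scale, int) or resize_scale <= 0:
        raise ValueError("resize_scale must be a positive integer, got %r" % resize_scale)
    return "scale=(trunc((iw/(ih/%f))/2+0.5))*2:%f" % (resize_scale, resize_scale)

print(build_scale_formula(180))
```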
Comment on lines +89 to +111
```python
def extract_time_slots(self, xml):
    for time_slot in xml.findall("//TIME_SLOT"):
        time_slot_id = time_slot.attrib['TIME_SLOT_ID']
        self.time_slots[time_slot_id] = time_slot.attrib['TIME_VALUE']

def get_tier_id(self, tier):
    return tier.attrib['TIER_ID']

def get_participant(self, tier):
    if 'PARTICIPANT' in tier.attrib:
        return tier.attrib['PARTICIPANT']
    return ""

def get_linguistic_type(self, tier):
    if 'LINGUISTIC_TYPE_REF' in tier.attrib:
        return tier.attrib['LINGUISTIC_TYPE_REF']
    return ""
# End helper functions

def group_tiers_per_participant(self, xml):
    grouped_tiers = defaultdict(list)
    for tier in xml.findall("//TIER"):
        if self.get_linguistic_type(tier).lower() == self.gloss_tier_type \
```
Copilot AI Mar 6, 2026

Using xml.findall("//TIME_SLOT") / xml.findall("//TIER") is not valid ElementPath syntax for lxml.etree.ElementTree.findall (it expects relative paths like .//TIME_SLOT). This can result in zero matches and an empty time_slots/tier list, breaking counting. Prefer .findall('.//TIME_SLOT') and .findall('.//TIER') (or use xml.xpath('//TIME_SLOT')).

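The same relative-path requirement holds for the stdlib `xml.etree.ElementTree`; a runnable illustration with a fabricated, minimal EAF-like snippet:

```python
import xml.etree.ElementTree as ET

# Fabricated minimal stand-in for an EAF document
doc = ET.fromstring(
    "<ANNOTATION_DOCUMENT>"
    "<TIME_ORDER>"
    "<TIME_SLOT TIME_SLOT_ID='ts1' TIME_VALUE='0'/>"
    "<TIME_SLOT TIME_SLOT_ID='ts2' TIME_VALUE='40'/>"
    "</TIME_ORDER>"
    "<TIER TIER_ID='GlossL S1'/>"
    "</ANNOTATION_DOCUMENT>"
)

# './/' searches all descendants of the context node
print(len(doc.findall(".//TIME_SLOT")))  # -> 2
print(len(doc.findall(".//TIER")))       # -> 1
```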
Collaborator

@Woseseltops Woseseltops left a comment

Yes I think it's better to have all code together. I had a quick high level look at the code and commented on a few things I noticed.

Also, not sure if it's important enough to change, but I believe Python files typically use underscores instead of camel case, so extract_middle_frame.py.

Comment on lines +63 to +64
```python
new_dir = self.output_dir + os.sep + \
          os.path.basename(video_file) + "-frames" + os.sep + dir_name
```
Collaborator

An f-string would work well here.
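For illustration, the concatenation above as an f-string (`frames_dir_path` is a hypothetical wrapper; `os.path.join` would arguably be cleaner still):

```python
import os

def frames_dir_path(output_dir, video_file, dir_name):
    # f-string equivalent of the '+'-concatenation in the diff
    return f"{output_dir}{os.sep}{os.path.basename(video_file)}-frames{os.sep}{dir_name}"

print(frames_dir_path("/tmp/out", "/videos/CNGT0001.mpg", "stills"))
```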

Comment on lines +99 to +119
```python
frames = sorted(os.listdir(frames_dir))
if len(frames) != 0:
    middle_frame_index = int(floor(len(frames)/2))
    middle_frame = frames[middle_frame_index]
    video_base_name = os.path.basename(video_file)
    video_name = os.path.splitext(video_base_name)[0]
    video_still = dirs[1] + os.sep + video_name + '.png'
    if not dry_run:
        copyfile(frames_dir + os.sep + middle_frame, video_still)

    # Create the 320x180 version
    cmd = [
        "convert",
        video_still,
        "-resize", "x180",
        dirs[1] + os.sep + video_name + '_320x180.png'
    ]
    print(" ".join(cmd), file=sys.stderr)
    if not dry_run:
        p = Popen(cmd)
        p.wait()
```
Collaborator

Perhaps a guard clause here, so

```python
if len(frames) == 0:
    return
```

Comment on lines +16 to +17
```python
__author__ = "Micha Hulsbosch"
__date__ = "August 2016"
```
Collaborator

Only now realize I'm reviewing code from almost a decade ago :)

Comment on lines +60 to +62
```python
if len(self.all_files) > 0:
    for f in self.all_files:
        try:
```
Collaborator

```python
if len(self.all_files) == 0:
    return
```

Comment on lines +67 to +68
```python
# except:
#     sys.stderr.write("Unexpected error: %s %s\n" % (str(sys.exc_info()[0]), str(sys.exc_info()[1])))
```
Collaborator

I think this can be removed :)

Comment on lines +345 to +346
```python
if csv_file:
    # Flatten result dict
```
Collaborator

```python
if not csv_file:
    print(json.dumps(result, sort_keys=True, indent=4))
```

@vanlummelhuizen
Collaborator Author

@Woseseltops applied your suggestions and more. How do you like it now?

Collaborator

@Woseseltops Woseseltops left a comment

<3


Development

Successfully merging this pull request may close these issues.

Move/copy all relevant parts of CNGT_scripts into the Signbank codebase

4 participants