Move/copy all relevant parts of CNGT scripts into the Signbank codebase #1701
Conversation
I found this same message in the requirements update #1700 (running the tests). I have some test EAF files locally that the corpus tests use. I'm using a newer version of the requirements than in that issue (I don't know what the right answer is; the tests run okay). The tests all pass. The video tests probably aren't testing anything relevant for this. A lot of the video methods were rewritten because Jetske ran into problems converting some files: the frame rate ended up changing the length of the resulting video, so the conversion would fail. She did a lot of work with the EAF files and the annotated sentence videos, and there aren't any tests for those. Maybe you have some idea what extra tests are needed? (Also regarding that weird character string that ends up in the filename.) The code works for this branch, with the exception of the Future warnings. (That can be deferred to the other issue; I didn't want to keep changing my venv.)
Pull request overview
This PR vendors previously external CNGT_scripts functionality into the Signbank codebase, updating internal imports and removing the git-based dependency so frequency/video processing can run without that external repo.
Changes:
- Add internalized script modules for video resizing, middle-frame still extraction, and EAF sign counting.
- Replace `CNGT_scripts` imports in the video and frequency pipelines with in-repo equivalents (including a new EAF creation-time helper).
- Remove the `CNGT_scripts` dependency from `requirements.txt`.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| `signbank/video/resizeVideos.py` | Adds an in-repo VideoResizer used by the video app. |
| `signbank/video/models.py` | Switches the VideoResizer import from CNGT_scripts to the in-repo module. |
| `signbank/video/extractMiddleFrame.py` | Adds an in-repo MiddleFrameExtracter used for generating still images. |
| `signbank/tools.py` | Switches the MiddleFrameExtracter import and introduces get_eaf_creation_time. |
| `signbank/signCounter.py` | Adds an in-repo SignCounter used by the corpus frequency pipeline. |
| `signbank/frequency.py` | Updates corpus update logic to use the in-repo SignCounter + get_eaf_creation_time. |
| `requirements.txt` | Removes the git-based CNGT_scripts dependency. |
```python
for signer_id in (1, 2):
    unit = []  # Overlapping glosses are put in a unit.
    last_end_on = ''  # The hand (L or R) of the last seen gloss
    last_end = None  # The end timeSlot of the last seen gloss

    right_tier_id = tier_id_hand['R']
    left_tier_id = tier_id_hand['L']
    if right_tier_id in list_of_glosses and left_tier_id in list_of_glosses:
        right_hand_data = list_of_glosses[right_tier_id]
        left_hand_data = list_of_glosses[left_tier_id]

        if "annotations" in right_hand_data and "annotations" in left_hand_data:
            right_hand_annotations = right_hand_data['annotations']
            left_hand_annotations = left_hand_data['annotations']
            while len(right_hand_annotations) > 0 or len(left_hand_annotations) > 0:
                if len(right_hand_annotations) > 0 and len(left_hand_annotations) > 0:
                    if right_hand_annotations[0]['begin'] <= left_hand_annotations[0]['begin']:
                        last_end_on = 'R'
                    else:
                        last_end_on = 'L'
                elif len(right_hand_data['annotations']) > 0:
                    last_end_on = 'R'
                else:
                    last_end_on = 'L'

                current_hand_data = list_of_glosses[tier_id_hand[last_end_on]]
                current_hand_begin = current_hand_data['annotations'][0]['begin']
                if last_end is not None and current_hand_begin > (last_end - self.minimum_overlap):
                    # Begin new unit
                    list_of_gloss_units.append(unit)
                    unit = []

                unit.append(current_hand_data['annotations'][0])

                current_hand_end = current_hand_data['annotations'][0]['end']
                if last_end is None or current_hand_end > last_end:
                    last_end = current_hand_end

                current_hand_data['annotations'].pop(0)

            list_of_gloss_units.append(unit)
```
`to_units_two_handed` iterates `for signer_id in (1, 2):` but `signer_id` is never used. This causes the same units to be appended twice, which will double-count tokens for two-handed tiers. Remove the unused loop and build `list_of_gloss_units` once.
Suggested change (the outer `for signer_id in (1, 2):` loop removed and the body dedented):

```python
unit = []  # Overlapping glosses are put in a unit.
last_end_on = ''  # The hand (L or R) of the last seen gloss
last_end = None  # The end timeSlot of the last seen gloss
right_tier_id = tier_id_hand['R']
left_tier_id = tier_id_hand['L']
if right_tier_id in list_of_glosses and left_tier_id in list_of_glosses:
    right_hand_data = list_of_glosses[right_tier_id]
    left_hand_data = list_of_glosses[left_tier_id]
    if "annotations" in right_hand_data and "annotations" in left_hand_data:
        right_hand_annotations = right_hand_data['annotations']
        left_hand_annotations = left_hand_data['annotations']
        while len(right_hand_annotations) > 0 or len(left_hand_annotations) > 0:
            if len(right_hand_annotations) > 0 and len(left_hand_annotations) > 0:
                if right_hand_annotations[0]['begin'] <= left_hand_annotations[0]['begin']:
                    last_end_on = 'R'
                else:
                    last_end_on = 'L'
            elif len(right_hand_data['annotations']) > 0:
                last_end_on = 'R'
            else:
                last_end_on = 'L'

            current_hand_data = list_of_glosses[tier_id_hand[last_end_on]]
            current_hand_begin = current_hand_data['annotations'][0]['begin']
            if last_end is not None and current_hand_begin > (last_end - self.minimum_overlap):
                # Begin new unit
                list_of_gloss_units.append(unit)
                unit = []

            unit.append(current_hand_data['annotations'][0])

            current_hand_end = current_hand_data['annotations'][0]['end']
            if last_end is None or current_hand_end > last_end:
                last_end = current_hand_end

            current_hand_data['annotations'].pop(0)

        list_of_gloss_units.append(unit)
```
```python
re.sub(r'\n', '', gloss)
re.sub(r'\t', '', gloss)
re.sub(r'\s\s+', ' ', gloss)
re.sub(r'^\s+', '', gloss)
re.sub(r'\s+$', '', gloss)
```
The gloss “cleanup” calls to re.sub(...) don’t assign back to gloss, so they have no effect (since strings are immutable). This means whitespace/newline normalization isn’t applied and counts may diverge from the intended behavior. Assign the return value of each substitution back to gloss (or chain them).
Suggested change:

```python
gloss = re.sub(r'\n', '', gloss)
gloss = re.sub(r'\t', '', gloss)
gloss = re.sub(r'\s\s+', ' ', gloss)
gloss = re.sub(r'^\s+', '', gloss)
gloss = re.sub(r'\s+$', '', gloss)
```
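For illustration, the five substitutions could also be collapsed into two steps. This is a sketch, not part of the suggestion; note that the original deletes newlines and tabs outright (joining the adjacent characters), so a plain `re.sub(r'\s+', ' ', ...)` would not be an exact equivalent:

```python
import re

def normalize_gloss(gloss):
    # Delete newlines/tabs outright, as the original sequence does
    gloss = re.sub(r'[\n\t]', '', gloss)
    # Collapse runs of remaining whitespace and trim the ends
    return re.sub(r'\s\s+', ' ', gloss).strip()
```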
signbank/signCounter.py
Outdated
```python
        self.generate_result()
    except KeyError as ke:
        sys.stderr.write("KeyError in file %s: '%s'\n" % (f, ke.args[0]))
    # except:
    #     sys.stderr.write("Unexpected error: %s %s\n" % (str(sys.exc_info()[0]), str(sys.exc_info()[1])))
```
generate_result() is called inside the per-file loop, recomputing self.sign_counts after every processed file. This is unnecessary work if the final aggregated result is all that’s needed (and in frequency.py the code reads freqsPerPerson directly). Consider moving generate_result() out of the loop (run once after all files) or making it opt-in.
Suggested change:

```python
    except KeyError as ke:
        sys.stderr.write("KeyError in file %s: '%s'\n" % (f, ke.args[0]))
    # except:
    #     sys.stderr.write("Unexpected error: %s %s\n" % (str(sys.exc_info()[0]), str(sys.exc_info()[1])))
self.generate_result()
```
```python
       output_file]
print(" ".join(cmd))
if not dry_run:
    Popen(cmd)
```
The ffmpeg process is started with Popen(cmd) but never waited on. When resizing multiple files this can spawn many concurrent ffmpeg processes and return before output files are complete. Use p = Popen(cmd); p.wait() (or subprocess.run) and handle non-zero exit codes.
Suggested change:

```python
process = Popen(cmd)
return_code = process.wait()
if return_code != 0:
    raise RuntimeError("ffmpeg command failed with exit code {}".format(return_code))
```
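Equivalently, `subprocess.run` does the waiting and exit-code checking in one call. A sketch (here `cmd` stands in for the ffmpeg argument list built above):

```python
import subprocess

def run_ffmpeg(cmd):
    # run() blocks until the process finishes; check=True raises
    # CalledProcessError on a non-zero exit code.
    subprocess.run(cmd, check=True)
```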
```python
self.delete_frames = delete_frames

if output_dir:
    self.output_dir = output_dir.rstrip(os.sep)
```
self.output_dir is only set when output_dir is truthy. When output_dir is None or an empty string (e.g., running the script without -o), self.output_dir is undefined and this will raise AttributeError at the os.path.isdir(self.output_dir) check. Set a default (e.g., require -o and raise a clear error, or default to a temp directory/current directory) before using self.output_dir.
Suggested change:

```python
if output_dir:
    self.output_dir = output_dir.rstrip(os.sep)
else:
    # Default to the current working directory if no output_dir is provided
    self.output_dir = os.getcwd()
```
```python
try:
    for something in self.metadata[person].keys():
        if something != 'self.region_metadata_id':
```
In the metadata aggregation loop, if something != 'self.region_metadata_id': compares against the literal string 'self.region_metadata_id' rather than the instance field self.region_metadata_id. As written, it will never exclude the region key and will produce an extra frequencyPer... bucket for the region metadata. Compare against self.region_metadata_id instead.
Suggested change:

```python
if something != self.region_metadata_id:
```
```diff
 eaf_path = get_path_of_eaf_file(dataset_eaf_folder, uploaded_paths, document_id)

-document_creation_dates_of_eaf_files = [ get_creation_time(eaf_path) ]
+document_creation_dates_of_eaf_files = [get_eaf_creation_time(eaf_path)]

 dictionary_documentIds_to_documentObjs = dictionary_documentIdentifiers_to_documentObjects(corpus,
                                                                                            document_identifiers_of_eaf_files,
```
document_creation_dates_of_eaf_files is constructed as a list, but dictionary_documentIdentifiers_to_documentObjects expects a dict keyed by document identifier (it indexes by document_id). This will raise a TypeError if the document does not already exist. Build a {document_id: creation_time} dict here for consistency with other call sites.
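A minimal sketch of the shape the callee expects, using stand-ins for the real helper and variables (the stub function and values are illustrative, not from the codebase):

```python
# Stand-in for the real get_eaf_creation_time helper (assumption, for illustration)
def get_eaf_creation_time(path):
    return "2016-08-01T12:00:00"

document_id = "DOC_A"          # illustrative identifier
eaf_path = "/tmp/DOC_A.eaf"    # illustrative path

# Key the creation time by document identifier so the callee can index it
# as a dict, mirroring the other call sites.
document_creation_dates_of_eaf_files = {document_id: get_eaf_creation_time(eaf_path)}
```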
```python
probe_cmd = self.ffmpeg_cmd[0:2] + "probe"
cmd = [probe_cmd, "-of", "json", "-show_streams", video_file]
```
probe_cmd = self.ffmpeg_cmd[0:2] + "probe" only works when ffmpeg_cmd is exactly ffmpeg/avconv. If settings ever provide an absolute path (e.g. /usr/local/bin/ffmpeg) or a different wrapper command, this produces an invalid probe command. Consider accepting an explicit ffprobe_cmd, or derive it robustly (e.g. via Path(ffmpeg_cmd).with_name('ffprobe') when a path is used, falling back to ffprobe).
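One way to derive the probe command more robustly, as a sketch (the function name and the fallback policy are illustrative, not part of the suggestion):

```python
from pathlib import Path

def derive_probe_cmd(ffmpeg_cmd):
    # Handle both bare command names and absolute paths like /usr/local/bin/ffmpeg:
    # keep the directory, swap the basename for the matching probe tool,
    # and fall back to plain "ffprobe" for unknown wrappers.
    p = Path(ffmpeg_cmd)
    if p.name in ("ffmpeg", "avconv"):
        probe_name = "ffprobe" if p.name == "ffmpeg" else "avprobe"
        return str(p.with_name(probe_name)) if p.parent != Path(".") else probe_name
    return "ffprobe"
```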
```python
scale_formula = "scale=(trunc((iw/(ih/%f))/2+0.5))*2:%f" % (self.resize_scale, self.resize_scale)
path, ext = os.path.splitext(video_file)
```
scale_formula uses self.resize_scale directly; with the current default resize_scale=-1 this generates an invalid ffmpeg scale expression. It would be safer to validate resize_scale (must be a positive int) and raise a clear error early if it’s not set.
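A sketch of such a validation, assuming the default sentinel is `-1` as described above (the function name is illustrative):

```python
def validate_resize_scale(resize_scale):
    # Fail early with a clear message instead of handing ffmpeg an invalid
    # scale expression such as "...:-1.000000" built from the -1 sentinel.
    if not isinstance(resize_scale, int) or resize_scale <= 0:
        raise ValueError("resize_scale must be a positive integer, got %r" % (resize_scale,))
    return resize_scale
```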
```python
def extract_time_slots(self, xml):
    for time_slot in xml.findall("//TIME_SLOT"):
        time_slot_id = time_slot.attrib['TIME_SLOT_ID']
        self.time_slots[time_slot_id] = time_slot.attrib['TIME_VALUE']

def get_tier_id(self, tier):
    return tier.attrib['TIER_ID']

def get_participant(self, tier):
    if 'PARTICIPANT' in tier.attrib:
        return tier.attrib['PARTICIPANT']
    return ""

def get_linguistic_type(self, tier):
    if 'LINGUISTIC_TYPE_REF' in tier.attrib:
        return tier.attrib['LINGUISTIC_TYPE_REF']
    return ""
# End helper functions

def group_tiers_per_participant(self, xml):
    grouped_tiers = defaultdict(list)
    for tier in xml.findall("//TIER"):
        if self.get_linguistic_type(tier).lower() == self.gloss_tier_type \
```
Using xml.findall("//TIME_SLOT") / xml.findall("//TIER") is not valid ElementPath syntax for lxml.etree.ElementTree.findall (it expects relative paths like .//TIME_SLOT). This can result in zero matches and an empty time_slots/tier list, breaking counting. Prefer .findall('.//TIME_SLOT') and .findall('.//TIER') (or use xml.xpath('//TIME_SLOT')).
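A quick check of the relative-path form against a toy EAF-like document, using the standard-library ElementTree (the same `.//` syntax works in lxml, where the leading-`//` form has triggered FutureWarnings in some versions — possibly the "Future messages" mentioned earlier in this thread):

```python
import xml.etree.ElementTree as ET

# Minimal EAF-like document for illustration only
eaf = ET.fromstring(
    "<ANNOTATION_DOCUMENT>"
    "<TIME_ORDER><TIME_SLOT TIME_SLOT_ID='ts1' TIME_VALUE='0'/></TIME_ORDER>"
    "<TIER TIER_ID='GlossL S1'/><TIER TIER_ID='GlossR S1'/>"
    "</ANNOTATION_DOCUMENT>"
)

# './/TAG' searches all descendants of the current element
time_slots = eaf.findall(".//TIME_SLOT")
tiers = eaf.findall(".//TIER")
```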
Woseseltops
left a comment
Yes, I think it's better to have all the code together. I had a quick high-level look at the code and commented on a few things I noticed.
Also, not sure if it's important enough to change, but I believe Python files typically use underscores instead of camel case, so extract_middle_frame.py.
```python
new_dir = self.output_dir + os.sep + \
          os.path.basename(video_file) + "-frames" + os.sep + dir_name
```
An f-string would work well here.
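For illustration, a possible f-string version (the stand-in values below are hypothetical; `os.path.join` would arguably be cleaner still, since it handles the separators itself):

```python
import os

# Stand-ins for the instance attributes and arguments (illustrative only)
output_dir = os.path.join(os.sep, "tmp", "out")
video_file = os.path.join(os.sep, "videos", "clip.mp4")
dir_name = "stills"

# f-string equivalent of the concatenation in the snippet above
new_dir = f"{output_dir}{os.sep}{os.path.basename(video_file)}-frames{os.sep}{dir_name}"

# Alternative: let os.path.join place the separators
new_dir_joined = os.path.join(output_dir, os.path.basename(video_file) + "-frames", dir_name)
```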
signbank/video/extractMiddleFrame.py
Outdated
```python
frames = sorted(os.listdir(frames_dir))
if len(frames) != 0:
    middle_frame_index = int(floor(len(frames)/2))
    middle_frame = frames[middle_frame_index]
    video_base_name = os.path.basename(video_file)
    video_name = os.path.splitext(video_base_name)[0]
    video_still = dirs[1] + os.sep + video_name + '.png'
    if not dry_run:
        copyfile(frames_dir + os.sep + middle_frame, video_still)

    # Create the 320x180 version
    cmd = [
        "convert",
        video_still,
        "-resize", "x180",
        dirs[1] + os.sep + video_name + '_320x180.png'
    ]
    print(" ".join(cmd), file=sys.stderr)
    if not dry_run:
        p = Popen(cmd)
        p.wait()
```
Perhaps a guard clause here, so:

```python
if len(frames) == 0:
    return
```
signbank/video/resizeVideos.py
Outdated
```python
__author__ = "Micha Hulsbosch"
__date__ = "August 2016"
```
Only now realize I'm reviewing code from almost a decade ago :)
signbank/signCounter.py
Outdated
```python
if len(self.all_files) > 0:
    for f in self.all_files:
        try:
```
```python
if len(self.all_files) == 0:
    return
```
signbank/signCounter.py
Outdated
```python
# except:
#     sys.stderr.write("Unexpected error: %s %s\n" % (str(sys.exc_info()[0]), str(sys.exc_info()[1])))
```
I think this can be removed :)
signbank/signCounter.py
Outdated
```python
if csv_file:
    # Flatten result dict
```
```python
if not csv_file:
    print(json.dumps(result, sort_keys=True, indent=4))
```
@Woseseltops Applied your suggestions and more. How do you like it now?
Relevant CNGT_scripts code is moved to this repo.