Move/copy all relevant parts of CNGT scripts into the Signbank codebase #1701
Conversation
I found this same message in the requirements update #1700 (running the tests). I have some test EAF files locally that the corpus tests use. I'm using a newer version of the requirements than in that issue (I don't know what the right answer is; the tests run okay). The tests all pass. The video tests probably aren't testing anything relevant for this. A lot of the video methods were rewritten because Jetske ran into problems converting some files: the frame rate ended up changing the length of the resulting video, so the conversion would fail. She did a lot of work with the EAF files and the annotated sentence videos, and there aren't any tests for those. Maybe you have some idea what extra tests are needed? (Also regarding that weird character string that ends up in the filename.) The code works for this branch, with the exception of the Future warnings. (That can be deferred to the other issue; I didn't want to keep changing my venv.)
Pull request overview
This PR vendors previously external CNGT_scripts functionality into the Signbank codebase, updating internal imports and removing the git-based dependency so frequency/video processing can run without that external repo.
Changes:
- Add internalized script modules for video resizing, middle-frame still extraction, and EAF sign counting.
- Replace `CNGT_scripts` imports in the video and frequency pipelines with in-repo equivalents (including a new EAF creation-time helper).
- Remove the `CNGT_scripts` dependency from `requirements.txt`.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| `signbank/video/resizeVideos.py` | Adds an in-repo VideoResizer used by the video app. |
| `signbank/video/models.py` | Switches the VideoResizer import from CNGT_scripts to the in-repo module. |
| `signbank/video/extractMiddleFrame.py` | Adds an in-repo MiddleFrameExtracter used for generating still images. |
| `signbank/tools.py` | Switches the MiddleFrameExtracter import and introduces get_eaf_creation_time. |
| `signbank/signCounter.py` | Adds an in-repo SignCounter used by the corpus frequency pipeline. |
| `signbank/frequency.py` | Updates corpus update logic to use the in-repo SignCounter + get_eaf_creation_time. |
| `requirements.txt` | Removes the git-based CNGT_scripts dependency. |
```python
for signer_id in (1, 2):
    unit = []  # Overlapping glosses are put in a unit.
    last_end_on = ''  # The hand (L or R) of the last seen gloss
    last_end = None  # The end timeSlot of the last seen gloss

    right_tier_id = tier_id_hand['R']
    left_tier_id = tier_id_hand['L']
    if right_tier_id in list_of_glosses and left_tier_id in list_of_glosses:
        right_hand_data = list_of_glosses[right_tier_id]
        left_hand_data = list_of_glosses[left_tier_id]

        if "annotations" in right_hand_data and "annotations" in left_hand_data:
            right_hand_annotations = right_hand_data['annotations']
            left_hand_annotations = left_hand_data['annotations']
            while len(right_hand_annotations) > 0 or len(left_hand_annotations) > 0:
                if len(right_hand_annotations) > 0 and len(left_hand_annotations) > 0:
                    if right_hand_annotations[0]['begin'] <= left_hand_annotations[0]['begin']:
                        last_end_on = 'R'
                    else:
                        last_end_on = 'L'
                elif len(right_hand_data['annotations']) > 0:
                    last_end_on = 'R'
                else:
                    last_end_on = 'L'

                current_hand_data = list_of_glosses[tier_id_hand[last_end_on]]
                current_hand_begin = current_hand_data['annotations'][0]['begin']
                if last_end is not None and current_hand_begin > (last_end - self.minimum_overlap):
                    # Begin new unit
                    list_of_gloss_units.append(unit)
                    unit = []

                unit.append(current_hand_data['annotations'][0])

                current_hand_end = current_hand_data['annotations'][0]['end']
                if last_end is None or current_hand_end > last_end:
                    last_end = current_hand_end

                current_hand_data['annotations'].pop(0)

            list_of_gloss_units.append(unit)
```
`to_units_two_handed` iterates `for signer_id in (1, 2):` but `signer_id` is never used. This causes the same units to be appended twice, which will double-count tokens for two-handed tiers. Remove the unused loop and build `list_of_gloss_units` once.
Suggested change (the outer `for signer_id in (1, 2):` loop removed and the body dedented):

```python
unit = []  # Overlapping glosses are put in a unit.
last_end_on = ''  # The hand (L or R) of the last seen gloss
last_end = None  # The end timeSlot of the last seen gloss
right_tier_id = tier_id_hand['R']
left_tier_id = tier_id_hand['L']
if right_tier_id in list_of_glosses and left_tier_id in list_of_glosses:
    right_hand_data = list_of_glosses[right_tier_id]
    left_hand_data = list_of_glosses[left_tier_id]
    if "annotations" in right_hand_data and "annotations" in left_hand_data:
        right_hand_annotations = right_hand_data['annotations']
        left_hand_annotations = left_hand_data['annotations']
        while len(right_hand_annotations) > 0 or len(left_hand_annotations) > 0:
            if len(right_hand_annotations) > 0 and len(left_hand_annotations) > 0:
                if right_hand_annotations[0]['begin'] <= left_hand_annotations[0]['begin']:
                    last_end_on = 'R'
                else:
                    last_end_on = 'L'
            elif len(right_hand_data['annotations']) > 0:
                last_end_on = 'R'
            else:
                last_end_on = 'L'

            current_hand_data = list_of_glosses[tier_id_hand[last_end_on]]
            current_hand_begin = current_hand_data['annotations'][0]['begin']
            if last_end is not None and current_hand_begin > (last_end - self.minimum_overlap):
                # Begin new unit
                list_of_gloss_units.append(unit)
                unit = []

            unit.append(current_hand_data['annotations'][0])

            current_hand_end = current_hand_data['annotations'][0]['end']
            if last_end is None or current_hand_end > last_end:
                last_end = current_hand_end

            current_hand_data['annotations'].pop(0)

        list_of_gloss_units.append(unit)
```
```python
re.sub(r'\n', '', gloss)
re.sub(r'\t', '', gloss)
re.sub(r'\s\s+', ' ', gloss)
re.sub(r'^\s+', '', gloss)
re.sub(r'\s+$', '', gloss)
```
The gloss “cleanup” calls to re.sub(...) don’t assign back to gloss, so they have no effect (since strings are immutable). This means whitespace/newline normalization isn’t applied and counts may diverge from the intended behavior. Assign the return value of each substitution back to gloss (or chain them).
Suggested change:

```python
gloss = re.sub(r'\n', '', gloss)
gloss = re.sub(r'\t', '', gloss)
gloss = re.sub(r'\s\s+', ' ', gloss)
gloss = re.sub(r'^\s+', '', gloss)
gloss = re.sub(r'\s+$', '', gloss)
```
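For illustration, the five substitutions could also be collapsed into two steps. This is a sketch, not part of the suggestion; note that the original deletes newlines and tabs outright (joining the adjacent characters), so a plain `re.sub(r'\s+', ' ', ...)` would not be an exact equivalent:

```python
import re

def normalize_gloss(gloss):
    # Delete newlines/tabs outright, as the original sequence does
    gloss = re.sub(r'[\n\t]', '', gloss)
    # Collapse runs of remaining whitespace and trim the ends
    return re.sub(r'\s\s+', ' ', gloss).strip()
```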
signbank/signCounter.py
Outdated
```python
        self.generate_result()
    except KeyError as ke:
        sys.stderr.write("KeyError in file %s: '%s'\n" % (f, ke.args[0]))
    # except:
    #     sys.stderr.write("Unexpected error: %s %s\n" % (str(sys.exc_info()[0]), str(sys.exc_info()[1])))
```
generate_result() is called inside the per-file loop, recomputing self.sign_counts after every processed file. This is unnecessary work if the final aggregated result is all that’s needed (and in frequency.py the code reads freqsPerPerson directly). Consider moving generate_result() out of the loop (run once after all files) or making it opt-in.
Suggested change:

```python
    except KeyError as ke:
        sys.stderr.write("KeyError in file %s: '%s'\n" % (f, ke.args[0]))
    # except:
    #     sys.stderr.write("Unexpected error: %s %s\n" % (str(sys.exc_info()[0]), str(sys.exc_info()[1])))
self.generate_result()
```
```python
       output_file]
print(" ".join(cmd))
if not dry_run:
    Popen(cmd)
```
The ffmpeg process is started with Popen(cmd) but never waited on. When resizing multiple files this can spawn many concurrent ffmpeg processes and return before output files are complete. Use p = Popen(cmd); p.wait() (or subprocess.run) and handle non-zero exit codes.
Suggested change:

```python
process = Popen(cmd)
return_code = process.wait()
if return_code != 0:
    raise RuntimeError("ffmpeg command failed with exit code {}".format(return_code))
```
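Equivalently, `subprocess.run` does the waiting and exit-code checking in one call. A sketch (here `cmd` stands in for the ffmpeg argument list built above):

```python
import subprocess

def run_ffmpeg(cmd):
    # run() blocks until the process finishes; check=True raises
    # CalledProcessError on a non-zero exit code.
    subprocess.run(cmd, check=True)
```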
```python
self.delete_frames = delete_frames

if output_dir:
    self.output_dir = output_dir.rstrip(os.sep)
```
self.output_dir is only set when output_dir is truthy. When output_dir is None or an empty string (e.g., running the script without -o), self.output_dir is undefined and this will raise AttributeError at the os.path.isdir(self.output_dir) check. Set a default (e.g., require -o and raise a clear error, or default to a temp directory/current directory) before using self.output_dir.
Suggested change:

```python
if output_dir:
    self.output_dir = output_dir.rstrip(os.sep)
else:
    # Default to the current working directory if no output_dir is provided
    self.output_dir = os.getcwd()
```
```python
try:
    for something in self.metadata[person].keys():
        if something != 'self.region_metadata_id':
```
In the metadata aggregation loop, if something != 'self.region_metadata_id': compares against the literal string 'self.region_metadata_id' rather than the instance field self.region_metadata_id. As written, it will never exclude the region key and will produce an extra frequencyPer... bucket for the region metadata. Compare against self.region_metadata_id instead.
Suggested change:

```python
if something != self.region_metadata_id:
```
```diff
 eaf_path = get_path_of_eaf_file(dataset_eaf_folder, uploaded_paths, document_id)

-document_creation_dates_of_eaf_files = [ get_creation_time(eaf_path) ]
+document_creation_dates_of_eaf_files = [get_eaf_creation_time(eaf_path)]

 dictionary_documentIds_to_documentObjs = dictionary_documentIdentifiers_to_documentObjects(corpus,
                                                                                            document_identifiers_of_eaf_files,
```
document_creation_dates_of_eaf_files is constructed as a list, but dictionary_documentIdentifiers_to_documentObjects expects a dict keyed by document identifier (it indexes by document_id). This will raise a TypeError if the document does not already exist. Build a {document_id: creation_time} dict here for consistency with other call sites.
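A minimal sketch of the shape the callee expects, using stand-ins for the real helper and variables (the stub function and values are illustrative, not from the codebase):

```python
# Stand-in for the real get_eaf_creation_time helper (assumption, for illustration)
def get_eaf_creation_time(path):
    return "2016-08-01T12:00:00"

document_id = "DOC_A"          # illustrative identifier
eaf_path = "/tmp/DOC_A.eaf"    # illustrative path

# Key the creation time by document identifier so the callee can index it
# as a dict, mirroring the other call sites.
document_creation_dates_of_eaf_files = {document_id: get_eaf_creation_time(eaf_path)}
```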
```python
probe_cmd = self.ffmpeg_cmd[0:2] + "probe"
cmd = [probe_cmd, "-of", "json", "-show_streams", video_file]
```
probe_cmd = self.ffmpeg_cmd[0:2] + "probe" only works when ffmpeg_cmd is exactly ffmpeg/avconv. If settings ever provide an absolute path (e.g. /usr/local/bin/ffmpeg) or a different wrapper command, this produces an invalid probe command. Consider accepting an explicit ffprobe_cmd, or derive it robustly (e.g. via Path(ffmpeg_cmd).with_name('ffprobe') when a path is used, falling back to ffprobe).
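One way to derive the probe command more robustly, as a sketch (the function name and the fallback policy are illustrative, not part of the suggestion):

```python
from pathlib import Path

def derive_probe_cmd(ffmpeg_cmd):
    # Handle both bare command names and absolute paths like /usr/local/bin/ffmpeg:
    # keep the directory, swap the basename for the matching probe tool,
    # and fall back to plain "ffprobe" for unknown wrappers.
    p = Path(ffmpeg_cmd)
    if p.name in ("ffmpeg", "avconv"):
        probe_name = "ffprobe" if p.name == "ffmpeg" else "avprobe"
        return str(p.with_name(probe_name)) if p.parent != Path(".") else probe_name
    return "ffprobe"
```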
```python
scale_formula = "scale=(trunc((iw/(ih/%f))/2+0.5))*2:%f" % (self.resize_scale, self.resize_scale)
path, ext = os.path.splitext(video_file)
```
scale_formula uses self.resize_scale directly; with the current default resize_scale=-1 this generates an invalid ffmpeg scale expression. It would be safer to validate resize_scale (must be a positive int) and raise a clear error early if it’s not set.
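A sketch of such a validation, assuming the default sentinel is `-1` as described above (the function name is illustrative):

```python
def validate_resize_scale(resize_scale):
    # Fail early with a clear message instead of handing ffmpeg an invalid
    # scale expression such as "...:-1.000000" built from the -1 sentinel.
    if not isinstance(resize_scale, int) or resize_scale <= 0:
        raise ValueError("resize_scale must be a positive integer, got %r" % (resize_scale,))
    return resize_scale
```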
```python
def extract_time_slots(self, xml):
    for time_slot in xml.findall("//TIME_SLOT"):
        time_slot_id = time_slot.attrib['TIME_SLOT_ID']
        self.time_slots[time_slot_id] = time_slot.attrib['TIME_VALUE']

def get_tier_id(self, tier):
    return tier.attrib['TIER_ID']

def get_participant(self, tier):
    if 'PARTICIPANT' in tier.attrib:
        return tier.attrib['PARTICIPANT']
    return ""

def get_linguistic_type(self, tier):
    if 'LINGUISTIC_TYPE_REF' in tier.attrib:
        return tier.attrib['LINGUISTIC_TYPE_REF']
    return ""
# End helper functions

def group_tiers_per_participant(self, xml):
    grouped_tiers = defaultdict(list)
    for tier in xml.findall("//TIER"):
        if self.get_linguistic_type(tier).lower() == self.gloss_tier_type \
```
Using xml.findall("//TIME_SLOT") / xml.findall("//TIER") is not valid ElementPath syntax for lxml.etree.ElementTree.findall (it expects relative paths like .//TIME_SLOT). This can result in zero matches and an empty time_slots/tier list, breaking counting. Prefer .findall('.//TIME_SLOT') and .findall('.//TIER') (or use xml.xpath('//TIME_SLOT')).
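A quick check of the relative-path form against a toy EAF-like document, using the standard-library ElementTree (the same `.//` syntax works in lxml, where the leading-`//` form has triggered FutureWarnings in some versions — possibly the "Future messages" mentioned earlier in this thread):

```python
import xml.etree.ElementTree as ET

# Minimal EAF-like document for illustration only
eaf = ET.fromstring(
    "<ANNOTATION_DOCUMENT>"
    "<TIME_ORDER><TIME_SLOT TIME_SLOT_ID='ts1' TIME_VALUE='0'/></TIME_ORDER>"
    "<TIER TIER_ID='GlossL S1'/><TIER TIER_ID='GlossR S1'/>"
    "</ANNOTATION_DOCUMENT>"
)

# './/TAG' searches all descendants of the current element
time_slots = eaf.findall(".//TIME_SLOT")
tiers = eaf.findall(".//TIER")
```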
Woseseltops
left a comment
Yes, I think it's better to have all the code together. I had a quick high-level look at the code and commented on a few things I noticed.
Also, not sure if it's important enough to change, but I believe Python files typically use underscores instead of camel case, so extract_middle_frame.py.
```python
new_dir = self.output_dir + os.sep + \
          os.path.basename(video_file) + "-frames" + os.sep + dir_name
```
An f-string would work well here.
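For illustration, a possible f-string version (the stand-in values below are hypothetical; `os.path.join` would arguably be cleaner still, since it handles the separators itself):

```python
import os

# Stand-ins for the instance attributes and arguments (illustrative only)
output_dir = os.path.join(os.sep, "tmp", "out")
video_file = os.path.join(os.sep, "videos", "clip.mp4")
dir_name = "stills"

# f-string equivalent of the concatenation in the snippet above
new_dir = f"{output_dir}{os.sep}{os.path.basename(video_file)}-frames{os.sep}{dir_name}"

# Alternative: let os.path.join place the separators
new_dir_joined = os.path.join(output_dir, os.path.basename(video_file) + "-frames", dir_name)
```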
signbank/video/extractMiddleFrame.py
Outdated
```python
frames = sorted(os.listdir(frames_dir))
if len(frames) != 0:
    middle_frame_index = int(floor(len(frames)/2))
    middle_frame = frames[middle_frame_index]
    video_base_name = os.path.basename(video_file)
    video_name = os.path.splitext(video_base_name)[0]
    video_still = dirs[1] + os.sep + video_name + '.png'
    if not dry_run:
        copyfile(frames_dir + os.sep + middle_frame, video_still)

    # Create the 320x180 version
    cmd = [
        "convert",
        video_still,
        "-resize", "x180",
        dirs[1] + os.sep + video_name + '_320x180.png'
    ]
    print(" ".join(cmd), file=sys.stderr)
    if not dry_run:
        p = Popen(cmd)
        p.wait()
```
Perhaps a guard clause here, so:

```python
if len(frames) == 0:
    return
```
signbank/video/resizeVideos.py
Outdated
```python
__author__ = "Micha Hulsbosch"
__date__ = "August 2016"
```
Only now realize I'm reviewing code from almost a decade ago :)
signbank/signCounter.py
Outdated
```python
if len(self.all_files) > 0:
    for f in self.all_files:
        try:
```
```python
if len(self.all_files) == 0:
    return
```
signbank/signCounter.py
Outdated
```python
# except:
#     sys.stderr.write("Unexpected error: %s %s\n" % (str(sys.exc_info()[0]), str(sys.exc_info()[1])))
```
I think this can be removed :)
signbank/signCounter.py
Outdated
```python
if csv_file:
    # Flatten result dict
```
```python
if not csv_file:
    print(json.dumps(result, sort_keys=True, indent=4))
```
@Woseseltops Applied your suggestions and more. How do you like it now?
Relevant CNGT_scripts code is moved to this repo.