[FIX] Issue#1665 Enhanced Matroska Language Tag Handling #1671

RemZapCypher · 2025-03-01T15:06:40Z

In raising this pull request, I confirm the following:

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
I have considered, and confirmed that this submission will be valuable to others.
I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
I give this submission freely, and claim no ownership to its content.
I have mentioned this change in the changelog.

My familiarity with the project is as follows:

I have never used CCExtractor.
I have used CCExtractor just a couple of times.
I absolutely love CCExtractor, but have not contributed previously.
I am an active contributor to CCExtractor.

Description

This pull request improves the handling of language tags in the Matroska parser. It fixes the problem reported in issue #1665, where IETF BCP47 language tags (e.g., "en-US") were not processed correctly, which could lead to segmentation faults and inaccurate subtitle extraction.

The Initial Problem: Modern MKV Files and IETF Language Tags

Modern Matroska (MKV) files are increasingly using IETF BCP47 language tags to identify subtitle tracks. These tags offer more precision than the traditional 3-letter ISO 639-2 codes, allowing for specification of regional variations, scripts, and other linguistic details (e.g., en-GB for British English, es-MX for Mexican Spanish).

The existing parser was designed primarily for the older 3-letter codes and did not handle IETF tags properly. As a result, it failed to identify and use them correctly, leading to issues such as:

Incorrect Language Identification: Subtitle tracks with IETF tags might not be recognized or might be misidentified.
Filename Generation Errors: Output filenames might not accurately reflect the language of the subtitle track.
Matching Failures: Users might not be able to select specific language tracks using command-line options if those tracks were identified using IETF tags.
Segmentation Faults: In certain scenarios, the lack of proper handling could lead to segmentation faults due to accessing uninitialized memory.
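
The last failure mode is the usual consequence of a pointer field that is read or freed before it has ever been assigned. The following is a minimal sketch of the defensive pattern, not the project's actual code: `track_owner_sketch`, `dup_or_null`, `track_new`, and `track_free` are hypothetical names introduced for illustration, and the real `matroska_sub_track` structure has more fields.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for a subtitle track record. */
struct track_owner_sketch {
    char *lang;      /* 3-letter ISO 639-2 code, owned by the track */
    char *lang_ietf; /* BCP47 tag, owned by the track; NULL when absent */
};

/* Copy a string onto the heap, passing NULL through unchanged. */
static char *dup_or_null(const char *s)
{
    if (s == NULL)
        return NULL;
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* calloc zeroes the struct, so every pointer starts out as NULL
 * even if a later assignment is skipped. */
static struct track_owner_sketch *track_new(const char *lang, const char *lang_ietf)
{
    struct track_owner_sketch *t = calloc(1, sizeof(*t));
    if (t == NULL)
        return NULL;
    t->lang = dup_or_null(lang);
    t->lang_ietf = dup_or_null(lang_ietf);
    return t;
}

/* free(NULL) is a no-op, so a track without an IETF tag is released safely. */
static void track_free(struct track_owner_sketch *t)
{
    if (t == NULL)
        return;
    free(t->lang);
    free(t->lang_ietf);
    free(t);
}
```

With this shape, a track created from a file that carries only a 3-letter code simply has `lang_ietf == NULL`, and any reader of the field can test for that instead of dereferencing an indeterminate pointer.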

Summary of Changes

Corrected IETF Language Tag Storage: Added sub_track->lang_ietf = lang_ietf; during subtitle track creation to ensure IETF language tags are properly stored in the matroska_sub_track structure.
Intelligent Filename Generation: Modified generate_filename_from_track() to prioritize IETF language tags when available, creating more descriptive and accurate filenames.
Improved Language Matching: Enhanced matroska_save_all() to first attempt matching against IETF language tags before falling back to 3-letter ISO 639 codes, improving language selection accuracy.
Robust Memory Management: Ensured proper allocation, assignment, and freeing of the lang_ietf field to prevent memory leaks and segmentation faults.
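
The "IETF first, 3-letter fallback" rule used for filename generation and language matching can be sketched as follows. This is an illustration under stated assumptions rather than the code merged in this pull request: `sub_track_sketch`, `preferred_lang`, and `track_matches` are hypothetical names, and the exact comparison used by `matroska_save_all()` may differ (for example, in case sensitivity).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for matroska_sub_track. */
struct sub_track_sketch {
    const char *lang;      /* 3-letter ISO 639-2 code, e.g. "eng" */
    const char *lang_ietf; /* BCP47 tag, e.g. "en-US"; NULL when absent */
};

/* Pick the language string to embed in an output filename:
 * the IETF tag when present and non-empty, otherwise the 3-letter code. */
static const char *preferred_lang(const struct sub_track_sketch *t)
{
    if (t->lang_ietf != NULL && t->lang_ietf[0] != '\0')
        return t->lang_ietf;
    return t->lang;
}

/* Return 1 when the requested language selects this track.
 * The IETF tag is checked first, then the 3-letter code. */
static int track_matches(const struct sub_track_sketch *t, const char *requested)
{
    if (requested == NULL)
        return 0;
    if (t->lang_ietf != NULL && strcmp(t->lang_ietf, requested) == 0)
        return 1;
    return t->lang != NULL && strcmp(t->lang, requested) == 0;
}
```

In this sketch a track tagged both "eng" and "en-US" is selected by either string, while its filename carries the more specific "en-US"; a track with only "spa" behaves exactly as it did before IETF tags were considered.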

This enhancement is crucial for:

Modern Standards Compliance: Supporting IETF BCP47 language tags, the modern standard for language identification.
Improved Accuracy: Enabling more precise language identification, including regional variants and dialects.
Increased Compatibility: Ensuring correct processing of Matroska files that utilize extended language tags.

How Has This Been Tested?

Tested with various Matroska files containing both 3-letter language codes and IETF language tags.
Verified correct subtitle extraction and filename generation for different language variants.
Confirmed no memory leaks or segmentation faults occur during parsing.

Thank you,
Tank0nf.

…#1671) * fix unknown element for IETF tag * added documentation changes * added formatting for clang-format

RemZapCypher added 2 commits March 1, 2025 20:11

fix unknown element for IETF tag

dbc6a6d

added documentation changes

e74e00a

RemZapCypher changed the title ~~[FIX] Enhanced Matroska Language Tag Handling #1665~~ [FIX] #1665 Enhanced Matroska Language Tag Handling Mar 1, 2025

RemZapCypher changed the title ~~[FIX] #1665 Enhanced Matroska Language Tag Handling~~ [FIX] Issue#1665 Enhanced Matroska Language Tag Handling Mar 1, 2025

added formatting for clang-format

54e07c4

cfsmp3 merged commit b62027a into CCExtractor:master Mar 23, 2025
17 checks passed