@RemZapCypher
Contributor

In raising this pull request, I confirm the following:

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.
  • I have mentioned this change in the changelog.

My familiarity with the project is as follows:

  • I have never used CCExtractor.
  • I have used CCExtractor just a couple of times.
  • I absolutely love CCExtractor, but have not contributed previously.
  • I am an active contributor to CCExtractor.

Description

This PR improves the handling of language tags in the Matroska parser. It fixes an issue where IETF BCP47 language tags (e.g., "en-US") were not processed correctly, which could lead to segmentation faults and inaccurate subtitle extraction, as reported in issue #1665.

The Initial Problem: Modern MKV Files and IETF Language Tags

Modern Matroska (MKV) files are increasingly using IETF BCP47 language tags to identify subtitle tracks. These tags offer more precision than the traditional 3-letter ISO 639-2 codes, allowing for specification of regional variations, scripts, and other linguistic details (e.g., en-GB for British English, es-MX for Mexican Spanish).
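To illustrate how these tags are structured, here is a minimal sketch that pulls the primary language subtag (the part before the first hyphen) out of a BCP47 tag. The helper name `bcp47_primary_subtag` is hypothetical and is not part of CCExtractor or of this PR; it only shows that a tag such as en-GB carries a language part plus further subtags.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of CCExtractor): copies the primary language
 * subtag of a BCP47 tag, i.e. everything before the first '-', into `out`.
 * Returns the subtag length, or 0 if the input is NULL or empty, or if
 * `out` is too small to hold the subtag plus the terminating NUL. */
size_t bcp47_primary_subtag(const char *tag, char *out, size_t out_size)
{
	if (tag == NULL || out == NULL || out_size == 0)
		return 0;

	/* Length of the run of characters up to the first '-' (or end of string). */
	size_t len = strcspn(tag, "-");
	if (len == 0 || len >= out_size)
		return 0;

	memcpy(out, tag, len);
	out[len] = '\0';
	return len;
}
```

For example, calling it with "en-GB" or "es-MX" yields "en" and "es" respectively, while a tag without a hyphen is returned unchanged.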

The existing parser was designed primarily for the older 3-letter codes and did not fully handle IETF tags. As a result, it failed to identify and use these tags correctly, leading to issues such as:

  • Incorrect Language Identification: Subtitle tracks with IETF tags might not be recognized or might be misidentified.
  • Filename Generation Errors: Output filenames might not accurately reflect the language of the subtitle track.
  • Matching Failures: Users might not be able to select specific language tracks using command-line options if those tracks were identified using IETF tags.
  • Segmentation Faults: In certain scenarios, the lack of proper handling could lead to segmentation faults due to accessing uninitialized memory.

Summary of Changes

  • Corrected IETF Language Tag Storage: Added sub_track->lang_ietf = lang_ietf; during subtitle track creation to ensure IETF language tags are properly stored in the matroska_sub_track structure.
  • Intelligent Filename Generation: Modified generate_filename_from_track() to prioritize IETF language tags when available, creating more descriptive and accurate filenames.
  • Improved Language Matching: Enhanced matroska_save_all() to first attempt matching against IETF language tags before falling back to 3-letter ISO 639 codes, improving language selection accuracy.
  • Robust Memory Management: Ensured proper allocation, assignment, and freeing of the lang_ietf field to prevent memory leaks and segmentation faults.
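The changes above can be sketched roughly as follows. This is a simplified illustration, not the code merged in this PR: the struct `sub_track_sketch` and the helpers `dup_cstr`, `preferred_lang`, `track_matches_lang` and `free_track_langs` are hypothetical stand-ins, and exact string comparison is an assumption about how matching works.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the parser's subtitle track structure; only the
 * two language fields discussed above are modelled (assumption). */
struct sub_track_sketch
{
	char *lang;      /* 3-letter ISO 639-2 code, e.g. "eng" */
	char *lang_ietf; /* IETF BCP47 tag, e.g. "en-US"; NULL when absent */
};

/* Hypothetical helper: heap-allocated copy of a C string, NULL on failure. */
char *dup_cstr(const char *s)
{
	if (s == NULL)
		return NULL;
	size_t n = strlen(s) + 1;
	char *copy = malloc(n);
	if (copy != NULL)
		memcpy(copy, s, n);
	return copy;
}

/* Prefer the IETF tag when one is stored, else fall back to the 3-letter code. */
const char *preferred_lang(const struct sub_track_sketch *track)
{
	if (track->lang_ietf != NULL && track->lang_ietf[0] != '\0')
		return track->lang_ietf;
	return track->lang;
}

/* Try the IETF tag first, then the 3-letter code. Returns 1 on a match. */
int track_matches_lang(const struct sub_track_sketch *track, const char *wanted)
{
	if (wanted == NULL)
		return 0;
	if (track->lang_ietf != NULL && strcmp(track->lang_ietf, wanted) == 0)
		return 1;
	if (track->lang != NULL && strcmp(track->lang, wanted) == 0)
		return 1;
	return 0;
}

/* Free both language fields and reset them so a stale pointer is never reused. */
void free_track_langs(struct sub_track_sketch *track)
{
	free(track->lang);      /* free(NULL) is a no-op */
	free(track->lang_ietf);
	track->lang = NULL;
	track->lang_ietf = NULL;
}
```

Initializing `lang_ietf` to NULL when no tag is present, and checking it before use, is what keeps the fallback path from reading uninitialized memory in this sketch.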

This enhancement is crucial for:

  • Modern Standards Compliance: Supporting IETF BCP47 language tags, the modern standard for language identification.
  • Improved Accuracy: Enabling more precise language identification, including regional variants and dialects.
  • Increased Compatibility: Ensuring correct processing of Matroska files that utilize extended language tags.

How Has This Been Tested?

  • Tested with various Matroska files containing both 3-letter language codes and IETF language tags.
  • Verified correct subtitle extraction and filename generation for different language variants.
  • Confirmed no memory leaks or segmentation faults occur during parsing.

Thank you,
Tank0nf.

@RemZapCypher RemZapCypher changed the title [FIX] Enhanced Matroska Language Tag Handling #1665 [FIX] #1665 Enhanced Matroska Language Tag Handling Mar 1, 2025
@RemZapCypher RemZapCypher changed the title [FIX] #1665 Enhanced Matroska Language Tag Handling [FIX] Issue#1665 Enhanced Matroska Language Tag Handling Mar 1, 2025
@cfsmp3 cfsmp3 merged commit b62027a into CCExtractor:master Mar 23, 2025
17 checks passed
vatsalkeshav pushed a commit to vatsalkeshav/ccextractor-z that referenced this pull request Mar 29, 2025
…#1671)

* fix unknown element for IETF tag

* added documentation changes

* added formatting for clang-format
vatsalkeshav pushed a commit to vatsalkeshav/ccextractor-z that referenced this pull request Apr 12, 2025
…#1671)

* fix unknown element for IETF tag

* added documentation changes

* added formatting for clang-format