@hrideshmg
Contributor

hrideshmg commented Mar 15, 2025

In raising this pull request, I confirm the following (please check boxes):

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.
  • I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

  • I have never used CCExtractor.
  • I have used CCExtractor just a couple of times.
  • I absolutely love CCExtractor, but have not contributed previously.
  • I am an active contributor to CCExtractor.

Closes #985

DVB subtitle extraction is currently broken on the latest master build. I've verified this by testing it on the following few files:

  • 09-ITV_Red_Heat.ts
  • 2016-12-15-BBC4.ts
  • CHANNEL_4_2016-06-21.ts
  • chan7_BBC NEWS.ts

I've found two root causes:

  1. The OCR'd text in ocr_bitmap() is freed before being returned. Simply removing this free causes memory leaks (as pointed out in Memory leak on OCR code #1511).
  2. The second issue lies within the quantize_map() function.
    • With the first fix applied, passing --quant 0 (or 2) enables proper extraction of DVB subtitles.
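To make the first issue concrete, here is a minimal sketch of the ownership rule involved. It is not the actual ocr_bitmap() code: engine_get_text() and ocr_text_sketch() are hypothetical names, and the empty-string branch is my assumption about what a leak-free version has to do.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the OCR engine call: returns a heap-allocated
 * copy of `recognized` that the caller must free. */
static char *engine_get_text(const char *recognized)
{
    size_t len = strlen(recognized) + 1;
    char *copy = malloc(len);
    if (copy)
        memcpy(copy, recognized, len);
    return copy;
}

/* Returns heap text owned by the caller, or NULL when nothing was recognized. */
char *ocr_text_sketch(const char *recognized)
{
    assert(recognized != NULL);

    char *text = engine_get_text(recognized);
    if (!text)
        return NULL;

    if (text[0] == '\0')
    {
        /* Empty result: release it before the early return, otherwise
         * the allocation is leaked. */
        free(text);
        return NULL;
    }

    /* No free(text) here: freeing before returning would hand the caller
     * a dangling pointer. Ownership passes to the caller instead. */
    return text;
}
```

The caller frees the returned pointer once the subtitle text has been written out, and a NULL return simply means there is nothing to write.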

I've spent the past two days trying to understand this function and have narrowed it down to the erode() function introduced in PR 1510. I believe this is better explained visually, so here are the subtitle bitmaps before and after the erode() call for two different video files:

Before

[Images: subtitle bitmaps before erode(), one of them from CHANNEL_4_2016-06-21]

After

[Images: subtitle bitmaps after erode(), one of them from CHANNEL_4_2016-06-21]

Fixes

  1. The memory leaks were caused by empty strings that were never freed because an if condition returned early. I've handled this case and tested it on the files mentioned in Memory leak on OCR code #1511.
  2. After analyzing the erode() function, I noticed that the text was being eroded based on transparency rather than on the text background color. This approach only works for bitmaps whose quantized text color is transparent:
if (alpha[bitmap[row * w + col]] || alpha[bitmap[(row + 1) * w + col]] ||
    alpha[bitmap[row * w + (col + 1)]] || alpha[bitmap[(row + 1) * w + (col + 1)]])

I've modified erode() and dilate() so that they now use the text and text background colors rather than the alpha values.
I get these colors from the loop that populates the mcit variable. This approach has been fairly successful in my limited testing. However, it relies on the assumption that the background and text colors will always be the second and third most frequently occurring colors, respectively.
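As a rough illustration of color-based erosion and dilation on an indexed bitmap, here is a minimal sketch. It is not the code from this PR: the function names, the 2x2 neighbourhood handling at the edges, and the choice to leave non-text pixels untouched during erosion are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Counts how many in-bounds pixels of the 2x2 block anchored at (row, col)
 * equal `color`; `total` receives the number of in-bounds pixels. */
static void block_count(const uint8_t *src, int w, int h, int row, int col,
                        uint8_t color, int *matches, int *total)
{
    *matches = 0;
    *total = 0;
    for (int dr = 0; dr <= 1; dr++)
        for (int dc = 0; dc <= 1; dc++)
        {
            int r = row + dr, c = col + dc;
            if (r >= h || c >= w)
                continue; /* pixels outside the bitmap are ignored */
            (*total)++;
            if (src[r * w + c] == color)
                (*matches)++;
        }
}

/* Shared pass; reads from a copy so earlier writes do not affect later pixels. */
static int morph(uint8_t *bitmap, int w, int h, uint8_t text,
                 uint8_t background, int do_erode)
{
    assert(bitmap != NULL && w > 0 && h > 0);
    size_t n = (size_t)w * (size_t)h;
    uint8_t *src = malloc(n);
    if (!src)
        return -1;
    memcpy(src, bitmap, n);

    for (int row = 0; row < h; row++)
        for (int col = 0; col < w; col++)
        {
            int matches, total;
            block_count(src, w, h, row, col, text, &matches, &total);
            if (do_erode)
            {
                /* A text pixel survives only if its whole block is text. */
                if (src[row * w + col] == text && matches < total)
                    bitmap[row * w + col] = background;
            }
            else if (matches > 0)
            {
                /* Any text in the block turns this pixel into text. */
                bitmap[row * w + col] = text;
            }
        }

    free(src);
    return 0;
}

/* Both return 0 on success and -1 if the scratch buffer cannot be allocated. */
int erode_sketch(uint8_t *bitmap, int w, int h, uint8_t text, uint8_t background)
{
    return morph(bitmap, w, h, text, background, 1);
}

int dilate_sketch(uint8_t *bitmap, int w, int h, uint8_t text)
{
    return morph(bitmap, w, h, text, 0, 0);
}
```

The point of the sketch is that both passes compare palette indices against the text color, so they behave the same whether or not that color happens to be transparent.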

channel5-2018-02-12.ts is one exception, though: in it, the text color is the fourth most frequently occurring color (black, the background color, is repeated twice for some reason). As a result, erosion succeeds but dilation fails. The result is still better than the raw quantized output, but it might be worthwhile to disable quantization by default.
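The frequency-rank assumption and the way a duplicated palette entry breaks it can be sketched as follows. This is not what the PR does: rank_colors_sketch(), the 32-bit palette layout, and the optional merging of identical palette entries are all hypothetical, and the merge is only one possible way to make the ranking more robust.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PALETTE_MAX 256

/* Writes up to `max_ranked` palette indices into `ranked`, ordered by
 * descending pixel count (ties keep the lower index first).
 * When `merge_duplicates` is non-zero, indices whose palette entries hold
 * the same color value are counted together under the lowest such index.
 * Returns the number of indices written. */
int rank_colors_sketch(const uint8_t *bitmap, size_t n_pixels,
                       const uint32_t *palette, int n_colors,
                       int merge_duplicates, uint8_t *ranked, int max_ranked)
{
    assert(bitmap != NULL && palette != NULL && ranked != NULL);
    assert(n_colors > 0 && n_colors <= PALETTE_MAX && max_ranked > 0);

    size_t hist[PALETTE_MAX] = {0};
    for (size_t i = 0; i < n_pixels; i++)
    {
        assert(bitmap[i] < n_colors);
        hist[bitmap[i]]++;
    }

    if (merge_duplicates)
        for (int i = 1; i < n_colors; i++)
            for (int j = 0; j < i; j++)
                if (hist[i] && hist[j] && palette[i] == palette[j])
                {
                    hist[j] += hist[i]; /* fold the duplicate into the first entry */
                    hist[i] = 0;
                    break;
                }

    int found = 0;
    while (found < max_ranked)
    {
        int best = -1;
        for (int i = 0; i < n_colors; i++)
            if (hist[i] > 0 && (best < 0 || hist[i] > hist[best]))
                best = i;
        if (best < 0)
            break; /* no colors left */
        ranked[found++] = (uint8_t)best;
        hist[best] = 0; /* exclude from later rounds */
    }
    return found;
}
```

With a palette in which black appears under two indices, the unmerged ranking puts the text color fourth, which matches the behaviour described for channel5-2018-02-12.ts. With merging enabled, the text color moves back to third place.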

@hrideshmg changed the title from "fix DVB OCR" to "[FIX] DVB OCR: Memory Leak & Quantization Issues" on Mar 15, 2025
@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 9e2a594...:

Report Name    Tests Passed
Broken         0/13
CEA-708        0/14
DVB            0/7
DVD            0/3
DVR-MS         0/2
General        0/27
Hauppage       0/3
MP4            0/3
NoCC           0/10
Options        0/86
Teletext       0/21
WTV            0/13
XDS            0/34

All tests passing on the master branch were passed completely.

NOTE: The following tests have been failing on the master branch as well as the PR:


Check the result page for more info.

@cfsmp3
Contributor

cfsmp3 commented Mar 16, 2025

This seems reasonable.

Hopefully we'll have a working test platform soon to verify :-( @canihavesomecoffee

@hrideshmg
Contributor Author

> Hopefully we'll have a working test platform soon to verify :-( @canihavesomecoffee

That is what I'm currently working on :)

https://ccextractor.zulipchat.com/#narrow/stream/478694-general/topic/Are.20the.20sample.20platform.20tests.20broken.3F

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 9e2a594...:

Report Name    Tests Passed
Broken         13/13
CEA-708        14/14
DVB            7/7
DVD            3/3
DVR-MS         2/2
General        27/27
Hauppage       3/3
MP4            3/3
NoCC           10/10
Options        80/86
Teletext       21/21
WTV            13/13
XDS            34/34

All tests passing on the master branch were passed completely.

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:


Check the result page for more info.

@cfsmp3 cfsmp3 merged commit 9685ad6 into CCExtractor:master Mar 22, 2025
27 of 31 checks passed
vatsalkeshav pushed a commit to vatsalkeshav/ccextractor-z that referenced this pull request Mar 29, 2025
* fix: do not free ocr text before return

* fix(OCR): erode and dilate function
vatsalkeshav pushed a commit to vatsalkeshav/ccextractor-z that referenced this pull request Apr 12, 2025
* fix: do not free ocr text before return

* fix(OCR): erode and dilate function

Development

Successfully merging this pull request may close these issues.

[PROPOSAL] Unable to extract subtitles from DVB video samples when build using cmake by passing -DWITH_FFMPEG=ON flag

3 participants