Lots of transcription fixes#2

Draft
Akshat752 wants to merge 14 commits into jvarn:main from Akshat752:main

Conversation

@Akshat752

Fixed various transcription issues, including but not limited to:

  • Emanata lines read as characters.
  • ™ symbol transcribed as asterisks.
  • Misplaced words.

YOURE WHILE I GIVE -> YOU'RE GOING TO DO EMAIL WHILE I GIVE
Changed ** to the Unicode trademark character (™) and fixed closing quotation marks not being transcribed properly
Removes extra characters such as "\ I/", "\0/", "\1/", "\11", "\ 1/", "\|/", "\N1/", "NI/", "|I/", "N/", "1l/", "1//", "I/", which are caused by the three vertical lines that indicate a character is yelling.
Fixed more transcription errors and removed tildes that were added between lines during OCR.
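
The artifact removal described in the commit above could be automated along these lines. This is a minimal sketch, not the method actually used in the PR: `EMANATA_ARTIFACTS` and `strip_emanata` are hypothetical names, the list is copied from the commit message and is probably not exhaustive, and the sketch assumes an artifact always appears as a stand-alone token surrounded by whitespace.

```python
import re

# Artifact strings copied from the commit message; likely not exhaustive.
EMANATA_ARTIFACTS = [
    "\\ I/", "\\0/", "\\1/", "\\11", "\\ 1/", "\\|/", "\\N1/",
    "NI/", "|I/", "N/", "1l/", "1//", "I/",
]

# Longest first, so "\N1/" is tried before shorter alternatives such as "N/".
_alternatives = "|".join(
    re.escape(a) for a in sorted(EMANATA_ARTIFACTS, key=len, reverse=True)
)
# Assumption: an artifact only counts when it stands alone between
# whitespace or line edges, so ordinary text like "I/O" is left untouched.
_ARTIFACT_RE = re.compile(rf"(?<!\S)(?:{_alternatives})(?!\S)")


def strip_emanata(line: str) -> str:
    """Remove stand-alone emanata artifacts and collapse leftover spaces."""
    cleaned = _ARTIFACT_RE.sub("", line)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Requiring whitespace on both sides is a deliberately conservative choice: it misses artifacts fused to a neighbouring word, but it avoids deleting characters from legitimate dialogue, so anything it skips still needs a manual check against the strip.
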
@Akshat752
Author

One interesting thing I noticed while working on bf7e29c was that the extra symbols reliably signal the emotions of characters in the strip. For example, even if a sentence ended with a question mark rather than an exclamation mark, seeing something like \11 told me the character was angry. I'm not sure whether this is relevant to your semantic analysis, but I thought it was worth mentioning.
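
If that signal turns out to be useful, one option is to record it before the symbols are deleted. The sketch below is only an illustration of the idea: `YELL_MARKERS` is a hypothetical subset of the artifact strings from the commit message, `tag_yelling` is a made-up helper, and whether a marker really means "angry" in every strip is an assumption that would need checking.

```python
import re

# Hypothetical subset of the artifact strings listed in the commit message;
# a real pass would use the full list.
YELL_MARKERS = ("\\11", "\\|/", "\\1/", "\\0/")
_YELL_RE = re.compile("|".join(re.escape(m) for m in YELL_MARKERS))


def tag_yelling(line: str) -> dict:
    """Return the cleaned text plus a flag noting whether a marker was present."""
    yelling = _YELL_RE.search(line) is not None
    # Replace markers with a space, then normalise whitespace.
    text = " ".join(_YELL_RE.sub(" ", line).split())
    return {"text": text, "yelling": yelling}
```

For example, `tag_yelling("\\11 WHERE IS MY REPORT?")` returns `{"text": "WHERE IS MY REPORT?", "yelling": True}`, so the transcript stays clean while the emotional cue is kept as metadata.
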

Additionally, I am currently fixing cases where O's have been transcribed as 0's. However, this would make unique_words_raw.txt incorrect: for example, it currently includes both no-o-o and no-0-0. I assume this won't matter since you have a Python script that generates unique_words_raw.txt, but I want to make sure it's fine if I fix this issue.
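
A rough sketch of how the 0-for-O cleanup and the word-list regeneration might fit together is shown below. It is not the repository's actual script, which is not shown in this thread; `fix_zero_for_o`, `fix_line`, and `unique_words` are hypothetical helpers, and the rule used (only touch tokens that mix letters with zeros and contain no other digits) is a heuristic assumption whose results would still need checking against the strips.

```python
import re

_TOKEN_RE = re.compile(r"\S+")


def fix_zero_for_o(token: str) -> str:
    """Heuristic: swap 0 for O only in tokens that mix letters with zeros
    and contain no other digits, so "cat5" and "100" are left alone."""
    has_letter = any(ch.isalpha() for ch in token)
    other_digits = any(ch.isdigit() and ch != "0" for ch in token)
    if "0" not in token or not has_letter or other_digits:
        return token
    # Match the case of the surrounding letters.
    return token.replace("0", "o" if token.islower() else "O")


def fix_line(line: str) -> str:
    """Apply the heuristic to every whitespace-delimited token in a line."""
    return _TOKEN_RE.sub(lambda m: fix_zero_for_o(m.group()), line)


def unique_words(lines) -> list:
    """Sorted, lower-cased unique tokens with surrounding punctuation removed."""
    words = set()
    for line in lines:
        for token in line.split():
            word = token.strip(".,!?\"'()").lower()
            if word:
                words.add(word)
    return sorted(words)
```

With this rule, `fix_zero_for_o("no-0-0")` gives `"no-o-o"`, while `"cat5"` and `"100"` pass through unchanged, so regenerating the list after the fix would merge the duplicate entries rather than introduce new ones.
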

Mostly resulting from 0's turned into O's
Akshat752 marked this pull request as draft March 13, 2026 22:31
@Akshat752
Author

I am currently going through all of the entries in unique_words_raw.txt and manually checking the comic strip whenever an entry looks like a misspelling. This will take some time. Once I think I have fixed everything, I will rerun the Python script to regenerate unique_words_raw.txt and push the changes.

cat's
cat-5
cat5
cat6
Owner

cat5

cat
cat-5
cat5
cat6
Owner

cat5

jvarn self-assigned this Mar 16, 2026
@Akshat752
Author

Akshat752 commented Mar 16, 2026

@jvarn DILBERT SCOTT ADAMS dilbert.com shows up in a lot of the 9 panel strips. I was thinking about just removing them but realized you can use it as a search query for these comics. Should I just delete it or move it to the top/bottom of the transcript? Or alternatively add something like "(9 panel)" (or maybe that should be implemented as a tag) in the transcript so people can search for them?
Screenshot 2026-03-16 at 10 31 52 AM

@Akshat752
Author

In cases like these should "Dilbert Greets His Blind Date" be listed as the title of the comic, if it's untitled? Or should it stay in the transcript?
Screenshot 2026-03-17 at 3 52 58 PM

@jvarn
Owner

jvarn commented Mar 18, 2026

> In cases like these should "Dilbert Greets His Blind Date" be listed as the title of the comic, if it's untitled? Or should it stay in the transcript?

That's a good idea

@jvarn
Owner

jvarn commented Mar 18, 2026

> @jvarn DILBERT SCOTT ADAMS dilbert.com shows up in a lot of the 9 panel strips. I was thinking about just removing them but realized you can use it as a search query for these comics. Should I just delete it or move it to the top/bottom of the transcript? Or alternatively add something like "(9 panel)" (or maybe that should be implemented as a tag) in the transcript so people can search for them?

@Akshat752 I'd just remove them

Issues with missing/extra line breaks and dialog out of order. Also fixed a bunch of spelling mistakes.
Remove "DILBERT SCOTT ADAMS dilbert.com" and other strip credits from the transcripts.
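
For reference, removing that credit string can be sketched as below. This is an assumption-laden illustration rather than the change made in the commit: `remove_credit` is a hypothetical helper, only the one credit string quoted in this thread is handled (the "other strip credits" are not listed here), and the pattern assumes the words stay on a single line with variable spacing.

```python
import re

# The credit string quoted in the thread; OCR spacing between words may vary.
_CREDIT_RE = re.compile(r"DILBERT\s+SCOTT\s+ADAMS\s+dilbert\.com", re.IGNORECASE)


def remove_credit(transcript: str) -> str:
    """Delete the credit string and drop lines that contained nothing else."""
    kept = []
    for line in transcript.splitlines():
        cleaned = " ".join(_CREDIT_RE.sub(" ", line).split())
        # Skip lines that held only the credit; keep genuinely blank lines.
        if cleaned or not line.strip():
            kept.append(cleaned)
    return "\n".join(kept)
```

Note that the helper also normalises runs of spaces on every line it touches, which may or may not be wanted for the transcript format.
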
@Akshat752
Author

Now that the strips can no longer be viewed, it is difficult to find the answers for the mystery artist series. Can I include the answers at the bottom of the transcript?

2 participants