Mojibake Brute-force recovery tool written in Python 3

A quick and dirty brute-force tool for recovering Mojibake

What is Mojibake?

Mojibake is text that was encoded with one encoding scheme but decoded and displayed to the user with a different one. The result is garbled text that the user cannot read.

For example: 文字化け (translation: mojibake) encoded in EUC_JP and then read as ISO-8859-1 becomes: Ê¸»ú²½¤±
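The example above can be reproduced in a few lines. This is only a minimal sketch of how the garbling happens; the variable names are made up for illustration.

```python
# Encode the original text with its real encoding (EUC_JP)...
original = "文字化け"
raw = original.encode("euc_jp")

# ...then decode the same bytes with the wrong encoding (ISO-8859-1).
garbled = raw.decode("iso-8859-1")

print(raw.hex(" ").upper())  # the bytes shared by both readings
print(garbled)               # Ê¸»ú²½¤±
```

The bytes never change. Only the interpretation of them does, which is why the damage can be undone.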

For more information, please see the Wikipedia article on mojibake.

How does it work?

To untangle mojibake, we first need to know two things:

1. The mojibake's encoding
2. The original text's encoding

Once we know those two things, we can start untangling. First, we must get the original bytes behind the mojibake. We do this by encoding the mojibake again with the mojibake's encoding scheme. Python conveniently has an encode method built into every string instance: string_instance.encode(encoding), or equivalently bytes(string_instance, encoding). Both return a bytes object.

For example: we know that the garbled text Ê¸»ú²½¤± was read as ISO-8859-1, so we can obtain the original bytes by encoding the garbled text with ISO-8859-1, which gives us this:

>>> "Ê¸»ú²½¤±".encode("ISO-8859-1")
b'\xca\xb8\xbb\xfa\xb2\xbd\xa4\xb1'

In human-"readable" hex, that is: 0xCA 0xB8 0xBB 0xFA 0xB2 0xBD 0xA4 0xB1

Now we need to know the original text's encoding. If we know it, we can just decode the bytes with the original text's encoding scheme. Again, Python conveniently provides a decode method on every bytes instance (a bytes object is simply how Python holds that hex data): use bytes_instance.decode(encoding) or str(bytes_instance, encoding).

In the previous example, we know the original encoding is EUC_JP, so we can just decode the bytes with EUC_JP and call it a day.

>>> b'\xca\xb8\xbb\xfa\xb2\xbd\xa4\xb1'.decode("euc_jp")
'文字化け'
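The two steps above can be folded into one small helper. This is a minimal sketch, not the tool's actual code; the function name `recover` and its parameter names are hypothetical.

```python
def recover(garbled, mojibake_encoding, original_encoding):
    """Undo mojibake when both encodings are already known (hypothetical helper)."""
    # Step 1: re-encode the garbled text to get the original bytes back.
    raw = garbled.encode(mojibake_encoding)
    # Step 2: decode those bytes with the encoding the text was written in.
    return raw.decode(original_encoding)


print(recover("Ê¸»ú²½¤±", "iso-8859-1", "euc_jp"))  # 文字化け
```

If either guess is wrong, one of the two calls usually raises a UnicodeError, or the result is just different garbage.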

Okay? Then how can you know the text's encoding in real life scenarios?

Well... chances are... you don't. That's why this tool was developed. It re-encodes the mojibake with every encoding scheme available (the ones listed in the Standard Encodings table of the Python 3.9 documentation) and decodes each result with the encoding scheme that you provide. You must guess the original text's encoding scheme yourself. You also need to know what language the text is in, or at least be able to recognize it. Otherwise you won't be able to tell the difference between mojibake and the actual text, just like Korean encoded in SHIFT_JIS (?) when you know neither Korean nor Japanese.
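The brute-force idea can be sketched roughly as below. This is not the code in bruteforce_encoding.py. It is an illustration that assumes the codec list can be taken from the standard library's `encodings.aliases` table, which may not match the list the tool uses. The function name is hypothetical.

```python
from encodings.aliases import aliases


def brute_force_mojibake_encoding(garbled, original_encoding):
    """Try every known codec as the mojibake's encoding (hypothetical helper)."""
    results = []
    # Assumption: the alias table's values are a good-enough list of codecs.
    for codec in sorted(set(aliases.values())):
        try:
            raw = garbled.encode(codec)
            text = raw.decode(original_encoding)
        except (UnicodeError, LookupError):
            # UnicodeError: the bytes or characters do not fit this codec.
            # LookupError: the codec is missing here or is not a text encoding.
            continue
        results.append((codec, text))
    return results


for codec, text in brute_force_mojibake_encoding("Ê¸»ú²½¤±", "euc_jp"):
    print(f"{codec}: {text!r}")
```

Several codecs can produce the same readable result, because many of them map the same bytes to the same characters. A human still has to pick the line that makes sense.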

This program's job is to brute-force the mojibake's encoding. The opposite can be done too: you know the mojibake's encoding scheme, and the program brute-forces the original text's encoding. I'm too lazy to implement that, though, so feel free to create a pull request.
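For the curious, the opposite direction might look something like this. It is a sketch only and is not part of the tool. The function name is hypothetical, and the codec list is again assumed to come from `encodings.aliases`.

```python
from encodings.aliases import aliases


def brute_force_original_encoding(garbled, mojibake_encoding):
    """Known mojibake encoding, unknown original encoding (hypothetical helper)."""
    # The re-encoding step happens once, since the mojibake's encoding is known.
    raw = garbled.encode(mojibake_encoding)
    results = []
    for codec in sorted(set(aliases.values())):
        try:
            text = raw.decode(codec)
        except (UnicodeError, LookupError):
            # Skip codecs that reject these bytes or are not text encodings.
            continue
        results.append((codec, text))
    return results


for codec, text in brute_force_original_encoding("Ê¸»ú²½¤±", "iso-8859-1"):
    print(f"{codec}: {text!r}")
```

Expect far more noise in this direction: single-byte codecs accept almost any byte sequence, so most of them will "succeed" with nonsense.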

Why don't you brute-force both the original and mojibake's encoding scheme?

Python supports 97 encoding schemes, which brings us to a problem: recovering mojibake requires the user to check whether each produced result makes sense. Sifting through 97*97 = 9,409 results is just... unbelievable. Sure, Python will throw exceptions on roughly 70% of the recovery attempts, but that still leaves a gigantic number of results.
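To get a feel for the size of that search space, the pairs can simply be counted. This is a rough sketch with a hypothetical function name. It assumes the codec list from `encodings.aliases`, so the totals will not necessarily match the 97 codecs mentioned above, and the share of failures depends on the input text.

```python
from encodings.aliases import aliases


def count_pairs(garbled):
    """Count attempted and successful (mojibake, original) encoding pairs."""
    codecs = sorted(set(aliases.values()))
    attempted = succeeded = 0
    for mojibake_codec in codecs:
        try:
            raw = garbled.encode(mojibake_codec)
        except (UnicodeError, LookupError):
            # Every pair that starts with this codec fails immediately.
            attempted += len(codecs)
            continue
        for original_codec in codecs:
            attempted += 1
            try:
                raw.decode(original_codec)
            except (UnicodeError, LookupError):
                continue
            succeeded += 1
    return attempted, succeeded


attempted, succeeded = count_pairs("Ê¸»ú²½¤±")
print(f"{succeeded} of {attempted} pairs produced some text")
```

Every pair that survives is a line someone would have to read and judge by eye.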

How do I use it?

py bruteforce_encoding.py [encoding] [message]

Parameters:

encoding: the target encoding scheme, in other words the message's original encoding scheme (you must guess it)

message: the message to brute-force (Tip: wrap it in quotation marks. Some mojibake contains space characters, which would cause the program to parse the parameters incorrectly)

The arrow in the output (-->) means 'reencoded as'

Commandline is $h!t and I don't support it

Neither do I. If you hate it, feel free to learn Python and Tkinter and create a pull request. ~~Or rewrite the whole damn thing~~

FAQ Section

Why?

Because I play Japanese games and my system language is set to Chinese, which creates mojibake very often while playing ~~Japanese visual novels~~ Japanese games.

Your code looks awful

Thanks, please send your critiques to thebuster000 at gmail dot com, or create an issue. I'm still learning Python and C#, so recommendations and critiques would be awesome.

You don't even know what you are talking about! You don't know how to explain!

I only know how to use those functions, not how they work under the hood, and explaining everything would be a big hassle. The explanation above is more or less an oversimplified version of the Wikipedia article, limited by my English level. I apologize for my poor English.

Miscellaneous

Visual novel? Karenai Sekai to Owaru Hana

Anime? Non Non Biyori

Hotel? Trivago


qtnull/PyMojibakeBruteforce
