Chartect.IO can recognize the following charsets:
- UTF-8
- UTF-16 (BE and LE)
- UTF-32 (BE and LE)
- windows-1252 (mostly equivalent to ISO-8859-1)
- windows-1251 and ISO-8859-5 (Cyrillic)
- windows-1253 and ISO-8859-7 (Greek)
- windows-1255 (logical Hebrew; includes ISO-8859-8-I and most of x-mac-hebrew)
- ISO-8859-8 (visual Hebrew)
- Big-5
- gb18030 (superset of gb2312)
- HZ-GB-2312
- Shift-JIS
- EUC-KR, EUC-JP, EUC-TW
- ISO-2022-JP, ISO-2022-KR, ISO-2022-CN
- KOI8-R
- x-mac-cyrillic
- IBM855 and IBM866
- X-ISO-10646-UCS-4-3412 and X-ISO-10646-UCS-4-2413 (unusual BOM)
- ASCII
Supported platforms: Portable .NET Framework 4.0+, Windows Phone 8+, Windows 8+, Universal Windows Apps, and .NET Core (dotnet, dnx).
Import the library:
    using Chartect.IO;
You can feed a stream to a StreamDetector:
    using System;
    using System.IO;
    using Chartect.IO;

    public class Program
    {
        public static void Main(string[] args)
        {
            var filename = args[0];
            var detector = new StreamDetector();
            using (FileStream stream = File.OpenRead(filename))
            {
                // Feed the whole stream, then signal that no more data follows.
                detector.Read(stream);
                detector.DataEnd();
                if (detector.Charset != null)
                {
                    Console.WriteLine("Charset: {0}, confidence: {1}", detector.Charset, detector.Confidence);
                }
                else
                {
                    Console.WriteLine("Detection failed.");
                }
            }
        }
    }
Alternatively, use StringDetector. It assumes that the input is a single string, so you don't have to call DataEnd:
    var detector = new StringDetector();
    var input = "ðÏÓÌÅ ÏËÏÎÞÁÔÅÌØÎÏÇÏ ÒÁÚÏÒÅÎÉÑ ÏÔÃÁ ÓÅÍÅÊÓÔ×Á";
    detector.Read(input);
    if (detector.Charset != null)
    {
        Console.WriteLine("Charset: {0}, confidence: {1}", detector.Charset, detector.Confidence);
    }
    else
    {
        Console.WriteLine("Detection failed.");
    }
You can also use ArrayDetector to take in an array of bytes.
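Once a charset name has been detected, a common next step is to decode the raw bytes with it. The following is a minimal sketch that uses only the standard System.Text.Encoding API. It assumes that the detected charset is available as a string and that the runtime recognizes the name, which is not guaranteed for every charset listed above. DecodeWith is a hypothetical helper written for illustration; it is not part of Chartect.IO.

```csharp
using System;
using System.Text;

public static class CharsetDecoding
{
    // Hypothetical helper (not part of Chartect.IO): decodes raw bytes using
    // a charset name such as the one reported by a detector.
    // Returns null when the runtime does not know the charset name.
    public static string DecodeWith(string charsetName, byte[] bytes)
    {
        if (charsetName == null)
        {
            throw new ArgumentNullException(nameof(charsetName));
        }
        if (bytes == null)
        {
            throw new ArgumentNullException(nameof(bytes));
        }

        Encoding encoding;
        try
        {
            encoding = Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException)
        {
            // GetEncoding throws ArgumentException for unknown or unsupported names.
            return null;
        }

        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // The bytes of "Hello" in UTF-8, standing in for file contents.
        byte[] bytes = { 0x48, 0x65, 0x6C, 0x6C, 0x6F };
        Console.WriteLine(DecodeWith("utf-8", bytes)); // prints "Hello"
    }
}
```

On .NET Core, legacy code pages such as windows-1251 are typically not available by default; registering CodePagesEncodingProvider.Instance through Encoding.RegisterProvider (from the System.Text.Encoding.CodePages package) usually makes them resolvable.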
Chartect.IO is a fork of the UDE C# port of Mozilla Universal Charset Detector by Rudi Pettazzi from https://code.google.com/p/ude/.
This work was based on the original source code from Mozilla available at:
http://lxr.mozilla.org/mozilla/source/intl/chardet/src/
The article "A composite approach to language/encoding detection" describes the algorithms of Universal Charset Detector and is available at:
http://www-archive.mozilla.org/projects/intl/chardet.html
Some data structures used in this port have been adapted from the Java port "juniversalchardet", available at:
http://code.google.com/p/juniversalchardet/
There is also a Python port, "chardet", available at:
http://chardet.feedparser.org/
This library is subject to the Mozilla Public License Version 1.1 (the "License"). An initial check of this work is available under the LGPL but subsequent versions use MPL as a sole alternative as allowed under the original terms.