C# Text Encoding and Transcoding in a few steps

While developing a tool that lets the user enter a message in any language and print it, I discovered that I didn’t know enough about character encoding.

A character encoding maps each member of a character set to a unique representation: the set can be the 26 letters of the English alphabet or even the signals of Morse code. A byte can hold 256 distinct values, and early computers used ASCII, a 7-bit code, to map the first 128 of them. After a while the limit of 128 became too restrictive: computers spread all over the world, and the need to support different languages led to the remaining values (128 to 255) being assigned to the characters of other languages. Loads of encoding schemes (also called character maps or code pages) have been released over the years. However, one byte is not enough to cover every script at once (Chinese alone needs thousands of characters), so the Unicode standard was created. Unicode originally assumed 2 bytes per character, i.e. 65,536 combinations, but it now defines more than a million code points (up to U+10FFFF), which are stored using encodings such as UTF-8, UTF-16 and UTF-32. The .NET Framework stores strings as UTF-16 (this is what the UnicodeEncoding class implements): most characters take one 2-byte code unit, and those outside the first 65,536 take two (a surrogate pair). Here’s an encoding/decoding example using Cyrillic text.
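Before getting to that example, a minimal sketch may help make the byte sizes concrete. It only counts how many bytes a few sample strings need in UTF-8, UTF-16 and UTF-32; the class name ByteCountDemo, the helper CountBytes and the sample strings are my own choices for illustration, and the source file is assumed to be saved as UTF-8.

```csharp
using System;
using System.Text;

public static class ByteCountDemo
{
    // Number of bytes the given encoding needs to store the text.
    public static int CountBytes(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        string[] samples =
        {
            "abc",          // plain ASCII letters
            "Мне",          // Cyrillic, inside the first 65,536 code points
            "\U0001F600"    // an emoji outside that range (one surrogate pair)
        };

        foreach (string text in samples)
        {
            // string.Length counts UTF-16 code units, not user-visible characters.
            Console.WriteLine(
                "UTF-16 code units: {0}, UTF-8 bytes: {1}, UTF-16 bytes: {2}, UTF-32 bytes: {3}",
                text.Length,
                CountBytes(text, Encoding.UTF8),
                CountBytes(text, Encoding.Unicode),
                CountBytes(text, Encoding.UTF32));
        }
    }
}
```

The emoji is a single character on screen but has a Length of 2, which is the practical consequence of strings being UTF-16.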

[TestMethod]
public void Encoding_Test()
{
   string cyrillicText = "Мне очень понравилась ваша фотография и письмо";

   System.Text.ASCIIEncoding encodingASCII = new System.Text.ASCIIEncoding();
   System.Text.UTF8Encoding encodingUTF8 = new System.Text.UTF8Encoding();
   // UnicodeEncoding is UTF-16 (little-endian by default)
   System.Text.UnicodeEncoding encodingUNICODE = new System.Text.UnicodeEncoding();

   // ASCII cannot represent Cyrillic letters: each one is replaced with '?'
   byte[] textBytesASCII = encodingASCII.GetBytes(cyrillicText);
   byte[] textBytesUTF8 = encodingUTF8.GetBytes(cyrillicText);
   byte[] textBytesUnicode = encodingUNICODE.GetBytes(cyrillicText);

   // decode each byte array with the same encoding that produced it
   Console.WriteLine("{0}: {1}", encodingASCII.ToString(), encodingASCII.GetString(textBytesASCII));
   Console.WriteLine("{0}: {1}", encodingUTF8.ToString(), encodingUTF8.GetString(textBytesUTF8));
   Console.WriteLine("{0}: {1}", encodingUNICODE.ToString(), encodingUNICODE.GetString(textBytesUnicode));
}
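The ASCII line of that test silently loses the Cyrillic letters. As a rough sketch of how such loss could be detected, the code below first compares a round trip against the original text and then asks for an ASCII encoding that throws instead of substituting '?'. The names LossyEncodingDemo, SurvivesRoundTrip and CanEncodeAsStrictAscii are hypothetical helpers, not framework APIs; Encoding.GetEncoding with explicit fallbacks and EncoderFallbackException are part of System.Text.

```csharp
using System;
using System.Text;

public static class LossyEncodingDemo
{
    // True if the text is unchanged after encoding and decoding with the given encoding.
    public static bool SurvivesRoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes) == text;
    }

    // Uses an ASCII encoding configured to throw on unmappable characters
    // instead of replacing them with '?'.
    public static bool CanEncodeAsStrictAscii(string text)
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strictAscii.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string cyrillicText = "Мне очень понравилась ваша фотография и письмо";

        Console.WriteLine("ASCII round trip ok:  {0}", SurvivesRoundTrip(cyrillicText, Encoding.ASCII));
        Console.WriteLine("UTF-8 round trip ok:  {0}", SurvivesRoundTrip(cyrillicText, Encoding.UTF8));
        Console.WriteLine("UTF-16 round trip ok: {0}", SurvivesRoundTrip(cyrillicText, Encoding.Unicode));
        Console.WriteLine("Fits in strict ASCII: {0}", CanEncodeAsStrictAscii(cyrillicText));
    }
}
```

Whether to fail loudly or to accept the '?' substitution depends on the tool; for user-entered messages, failing early is usually easier to debug.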


The framework also exposes a Convert method to switch from one encoding to another; this operation is usually called transcoding:

[TestMethod]
public void Transcoding_Test()
{
   string sampleText = "Unicode character u0066";
   System.Text.ASCIIEncoding encodingASCII = new System.Text.ASCIIEncoding();
   System.Text.UnicodeEncoding encodingUNICODE = new System.Text.UnicodeEncoding();
   byte[] sampleTextEncoded = encodingUNICODE.GetBytes(sampleText);
   //print out the string decoded with the same (UTF-16) encoding
   Console.WriteLine("{0}: {1}", encodingUNICODE.ToString(), encodingUNICODE.GetString(sampleTextEncoded));
   //this is the output we get if we decode the UTF-16 bytes as ASCII without converting:
   //every second byte is 0, so a NUL character appears after each letter
   Console.WriteLine("Not converted - {0}: {1}", encodingASCII.ToString(), encodingASCII.GetString(sampleTextEncoded));
   //convert the bytes from UTF-16 to ASCII
   sampleTextEncoded = System.Text.Encoding.Convert(encodingUNICODE, encodingASCII, sampleTextEncoded);
   Console.WriteLine("Converted - {0}: {1}", encodingASCII.ToString(), encodingASCII.GetString(sampleTextEncoded));
}
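Converting to ASCII only works for text that ASCII can represent. For text in other languages, a lossless target such as UTF-8 is the more likely choice, so here is a small sketch of the same Convert call going from UTF-16 to UTF-8. The class TranscodingDemo and the helper Utf16ToUtf8 are names I made up for this example; Encoding.Convert, Encoding.Unicode and Encoding.UTF8 are the framework members.

```csharp
using System;
using System.Text;

public static class TranscodingDemo
{
    // Re-encodes UTF-16 (little-endian) bytes as UTF-8 in a single call.
    public static byte[] Utf16ToUtf8(byte[] utf16Bytes)
    {
        return Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
    }

    public static void Main()
    {
        string text = "Мне очень";
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(text);
        byte[] utf8Bytes = Utf16ToUtf8(utf16Bytes);

        // The byte counts differ, but the decoded text is the same.
        Console.WriteLine("UTF-16: {0} bytes, UTF-8: {1} bytes", utf16Bytes.Length, utf8Bytes.Length);
        Console.WriteLine("Same text after decoding: {0}", Encoding.UTF8.GetString(utf8Bytes) == text);
    }
}
```

The bytes must always be decoded with the encoding they were last converted to; decoding the UTF-8 result with Encoding.Unicode would produce garbage in the same way the "Not converted" line does.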


For more info: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx



