Archive for the ‘Data formats’ Category

Character encodings by example

Tuesday, March 20th, 2007

In this text I assume you have a basic understanding of character sets. See the suggested readings at the end if you need to brush up on that area. You can use the code charts at http://www.unicode.org/charts/ to see the code points of the characters we will use. I use Microsoft Calculator to convert between hex, decimal and binary numbers.

In this example we will write a Java program that writes bytes to text files. The files will consist of characters encoded in ISO-8859-1 (Latin-1) and in UTF-8, an encoding of the Unicode character set. We will use Microsoft Notepad to view these files.

The Java program

The program below takes a file name and binary strings as arguments. Each binary string represents one byte in the text file, so keep each string to at most 8 digits to get the expected results. If you prefer, you can modify the main method to use decimal numbers instead.

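import java.io.FileOutputStream;
import java.io.IOException;

public class Bits {

    // Usage: java Bits <file name> <binary string>...
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the file even if an argument fails to parse
        try (FileOutputStream out = new FileOutputStream(args[0])) {
            for (int i = 1; i < args.length; i++) {
                // Parse the argument as a base-2 number
                int bits = Integer.parseInt(args[i], 2);
                // Print the byte in binary, decimal and hex
                System.out.print("Byte " + i + ": " + args[i] + " ");
                System.out.print(bits + " ");
                System.out.println(Integer.toHexString(bits));
                // write(int) stores only the low 8 bits of the value
                out.write(bits);
            }
        }
    }
}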
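As a rough sketch of the decimal variant mentioned above, the version below takes the radix as an extra argument, so the same program accepts binary (2), decimal (10) or hex (16) values. The class name BytesFromRadix, the helper parseByte and the argument order are my own assumptions, not part of the original program. It also rejects values that do not fit in one byte instead of silently truncating them.

```java
import java.io.FileOutputStream;
import java.io.IOException;

class BytesFromRadix {

    // Parses one argument in the given radix and rejects values outside 0..255.
    static int parseByte(String text, int radix) {
        int value = Integer.parseInt(text, radix);
        if (value < 0 || value > 255) {
            throw new IllegalArgumentException(text + " does not fit in one byte");
        }
        return value;
    }

    // Assumed usage: java BytesFromRadix <file name> <radix> <value>...
    public static void main(String[] args) throws IOException {
        int radix = Integer.parseInt(args[1]);
        try (FileOutputStream out = new FileOutputStream(args[0])) {
            for (int i = 2; i < args.length; i++) {
                out.write(parseByte(args[i], radix));
            }
        }
    }
}
```

Once compiled, java BytesFromRadix example.txt 10 65 should write the same single byte as java Bits example.txt 1000001.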

Example 1

Find the character A in the Basic Latin code chart. Notice that it has the hexadecimal code point 0041, which is 1000001 in binary. Let’s write a byte with this value to the text file example1.txt:

java Bits example1.txt 1000001

If you open the file with Microsoft Notepad you will see that it contains the letter A, as expected. If you choose Save As from the menu, you can see that Notepad suggests the ANSI encoding. ANSI is an extended ASCII encoding (on Western European Windows systems it is the Windows-1252 code page), as is Latin-1. ANSI and Latin-1 differ in places, but we will only use characters that are encoded the same way in both. So just think of ANSI as Latin-1 throughout the text.
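If you do not have a calculator at hand, the hex-to-binary conversion used in this example can also be done in Java. This is only a small sketch; the class name CodePointToBits and the helper toBinary are names I made up for the illustration.

```java
class CodePointToBits {

    // Returns the binary digits of a code point, without leading zeros.
    static String toBinary(int codePoint) {
        return Integer.toBinaryString(codePoint);
    }

    public static void main(String[] args) {
        // Hex code point of the letter A, as read from the code chart
        int codePoint = Integer.parseInt("0041", 16);
        System.out.println(codePoint);           // prints 65
        System.out.println(toBinary(codePoint)); // prints 1000001
    }
}
```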

Example 2

Choose Save As from the Notepad menu. Save example1.txt as example2.txt with the UTF-8 encoding. Take a look at the file properties of the newly created file. Notice that it has a size of 4 bytes, but if you open it again it still contains the lonely letter A. The 3 new bytes are located at the beginning of the file and are the Byte Order Mark (BOM) for the UTF-8 encoding. The BOM is the only way for Notepad to know that this is a text document encoded in UTF-8 and not Latin-1. This is because the character A is encoded the same way in both encodings. Let’s create example2.txt manually:

java Bits example2.txt 11101111 10111011 10111111 1000001
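To check from code whether a file starts with these three bytes (EF BB BF in hex), something like the sketch below should do. The class name BomCheck and the helper hasUtf8Bom are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class BomCheck {

    // True when the data starts with the three-byte UTF-8 BOM EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    // Assumed usage: java BomCheck <file name>
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + ": " + data.length + " bytes, BOM: " + hasUtf8Bom(data));
    }
}
```

The bytes are masked with 0xFF because Java bytes are signed, so 11101111 would otherwise compare as a negative number.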

Example 3

Let’s try to encode a character that is encoded differently in the two encodings. Below we make three text files with the Norwegian letter Å: the first encoded in Latin-1, the second in UTF-8 with the BOM, and the last one in UTF-8 without the BOM. The Latin-1 chart tells us that Å has the hexadecimal code point 00C5. If we convert it using Microsoft Calculator we get the binary string 11000101 or the decimal number 197.

java Bits example3-1.txt 11000101
java Bits example3-2.txt 11101111 10111011 10111111 11000011 10000101
java Bits example3-3.txt 11000011 10000101

Notice that Notepad recognizes the UTF-8 encoded Å in example3-3.txt even though we left out the BOM. Also notice that we need two bytes to encode Å in UTF-8. Some characters in UTF-8 are even encoded in four bytes. The combination of the two bytes 11000011 and 10000101 is decoded into the code point 197 when read: the first byte has the form 110xxxxx and the second 10yyyyyy, and the payload bits 00011 and 000101 joined together give 11000101. The code point 197 references the letter Å in both the Unicode and the Latin-1 character set.
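The byte values in this example can be double-checked in Java. The sketch below lets Java encode Å in both encodings and then decodes the two UTF-8 bytes by hand. The class name AringBytes and its helpers are my own names, and decodeTwoBytes only handles the two-byte form of UTF-8, not the general case.

```java
import java.nio.charset.StandardCharsets;

class AringBytes {

    // Decodes a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into its code point.
    static int decodeTwoBytes(int first, int second) {
        return ((first & 0x1F) << 6) | (second & 0x3F);
    }

    // Formats bytes as 8-digit binary strings separated by spaces.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            String bits = Integer.toBinaryString(b & 0xFF);
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aring = "\u00C5"; // the letter Å, written as an escape
        System.out.println(toBits(aring.getBytes(StandardCharsets.ISO_8859_1))); // 11000101
        System.out.println(toBits(aring.getBytes(StandardCharsets.UTF_8)));      // 11000011 10000101
        System.out.println(decodeTwoBytes(0b11000011, 0b10000101));              // 197
    }
}
```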

Now I hope you have a better understanding of how characters are encoded. If you want to know more you could take a look at the suggested readings. I personally recommend the XML in a Nutshell book. It’s almost everything you need.

Suggested readings

The UTF–8 character encoding, XML and PHP5

Tuesday, July 11th, 2006

In this article you will get to know a little about how character sets, XML and PHP5 work together. We will also look a bit deeper into the most common character sets and encodings.

UTF-8

First we need to go through some background so that you can tell these two terms apart:

  • character set
  • character encoding

Data is stored on disk as bits. To get a character out of a bit sequence the computer:

  1. first reads a number of bits,
  2. decodes the bits to get a code point and
  3. maps the code point into a character

For instance:

  1. the fetched bit sequence 1100001
  2. could be decoded into code point 97
  3. which maps to the character a using the Unicode character set.
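The three steps above can be sketched in Java as follows. This is a simplified illustration: the class name DecodeSteps and its helpers are hypothetical, and toCodePoint assumes a single-byte encoding in which the bits are the code point itself, which holds for this example but not for multi-byte UTF-8 sequences.

```java
class DecodeSteps {

    // Steps 1 and 2: read the bits and decode them into a code point.
    // Assumes the bits are simply the code point written in base 2.
    static int toCodePoint(String bits) {
        return Integer.parseInt(bits, 2);
    }

    // Step 3: map the code point to a character using the Unicode character set.
    static String toCharacter(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int codePoint = toCodePoint("1100001");
        System.out.println(codePoint);              // prints 97
        System.out.println(toCharacter(codePoint)); // prints a
    }
}
```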

It’s also common to express code points in hex, so for instance the Arabic character ف has the code point 641 in hex, which is 1601 in decimal.

Now it should be obvious that character encodings map bits to code points while character sets map code points to characters. Unicode has a lot of different encodings, but ISO-8859-1 (Latin-1) has only one. UTF-8 is one of the Unicode encodings. While Unicode and Latin-1 map the code points 0 through 255 to the same characters, the UTF-8 encoding differs from the Latin-1 encoding from code point 128 and up.
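To make the difference from code point 128 and up concrete, the sketch below counts how many bytes a few code points take in Latin-1 and in two of the Unicode encodings. The class name EncodingLengths and the helper encodedLength are names I chose for the illustration. UTF-16BE is included only as a second Unicode encoding to compare with.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingLengths {

    // Number of bytes needed to encode one code point in the given charset.
    static int encodedLength(int codePoint, Charset charset) {
        return new String(Character.toChars(codePoint)).getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Code points on both sides of the 127/128 boundary
        int[] codePoints = {0x61, 0x7F, 0x80, 0xFF};
        for (int cp : codePoints) {
            System.out.printf("%d  Latin-1: %d  UTF-8: %d  UTF-16BE: %d%n", cp,
                encodedLength(cp, StandardCharsets.ISO_8859_1),
                encodedLength(cp, StandardCharsets.UTF_8),
                encodedLength(cp, StandardCharsets.UTF_16BE));
        }
    }
}
```

Latin-1 uses one byte for every code point it covers, while UTF-8 switches from one byte to two at code point 128.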

XML and PHP5

Use the following XML declaration:

<?xml version="1.0" encoding="ISO-8859-1"?>

to tell your software that your XML file is encoded in Latin-1. The default is UTF-8, so if you omit the encoding part you are really saying that your file is encoded in UTF-8.
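The effect of the declaration can be tried out with the XML parser that ships with Java. The sketch below builds a tiny document whose text content is the single Latin-1 byte for Å and parses it. The class name XmlEncodingDemo, the helper parse and the element name letter are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class XmlEncodingDemo {

    // Builds a document whose only text is the byte 0xC5 and returns the parsed text.
    static String parse(String declaration) throws Exception {
        ByteArrayOutputStream xml = new ByteArrayOutputStream();
        xml.write(declaration.getBytes(StandardCharsets.US_ASCII));
        xml.write("<letter>".getBytes(StandardCharsets.US_ASCII));
        xml.write(0xC5); // Å in Latin-1; not a complete UTF-8 sequence on its own
        xml.write("</letter>".getBytes(StandardCharsets.US_ASCII));
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.toByteArray()))
            .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = parse("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>");
        System.out.println((int) text.charAt(0)); // prints 197
    }
}
```

Without the encoding part the parser should treat the same bytes as UTF-8 and reject the document, since the byte 11000101 on its own is not valid UTF-8.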

Opera, Firefox and Internet Explorer know how to display Unicode characters represented in the UTF-8 encoding. The problem is that web servers like Apache have a default Latin-1 content header. Browsers then think they are reading Latin-1 and display characters that are encoded differently in the two encodings, like å, incorrectly. The following PHP code changes the header so that browsers know that it’s UTF-8 they are receiving:

header('Content-Type: text/html; charset=utf-8');
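To see what goes wrong without that header, the Java sketch below encodes å as UTF-8 and then decodes the bytes as Latin-1, which is roughly what a browser does when it is told the wrong charset. The class name WrongCharset and the helper misread are hypothetical names for this illustration.

```java
import java.nio.charset.StandardCharsets;

class WrongCharset {

    // Encodes the text as UTF-8 and decodes the bytes as if they were Latin-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misread("\u00E5"); // the letter å, written as an escape
        System.out.println(garbled.length()); // prints 2
        for (char c : garbled.toCharArray()) {
            System.out.println((int) c); // prints 195, then 165
        }
    }
}
```

The single letter å turns into two characters, Ã (195) and ¥ (165), because each of its two UTF-8 bytes is read as a separate Latin-1 character.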

Suggested readings:

The power of archetype principles in familiar context

Friday, November 12th, 2004

In this paper we will build a simple but powerful system containing some of the principles of archetypes. We will first look at how we can specify archetypes. Then we will make some instances and build the part of the system that displays them properly. Later we will implement a user interface where we can add, alter and remove instances. At the end we will try to use this system in both a useful and familiar context. The full article can be downloaded below: