Posts Tagged ‘Java’

Character encodings by example

Tuesday, March 20th, 2007

In this text I assume you have a basic understanding of character sets. If you need to brush up on that area, take a look at the suggested readings at the end. You can use the code charts at http://www.unicode.org/charts/ to look up the code points of the characters we will use. I use Microsoft Calculator to convert between hexadecimal, decimal and binary numbers.

In this example we will write a Java program that writes bytes to text files. The files will contain characters encoded in ISO-8859-1 (Latin-1) and in UTF-8, an encoding of the Unicode character set. We will use Microsoft Notepad to view these files.

The Java program

The program below takes a file name and one or more binary strings as arguments. Each binary string represents one byte in the text file, so keep each string to at most 8 digits to get the expected results. If you want, you can modify the main method to accept decimal numbers instead.

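import java.io.FileOutputStream;
import java.io.IOException;

public class Bits {

    // Usage: java Bits <file name> <binary string>...
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the file even if an argument fails to parse
        try (FileOutputStream out = new FileOutputStream(args[0])) {
            for (int i = 1; i < args.length; i++) {
                // Parse the argument as a base-2 number, e.g. "1000001" gives 65
                int bits = Integer.parseInt(args[i], 2);
                System.out.print("Byte " + i + ": " + args[i] + " ");
                System.out.print(bits + " ");
                System.out.println(Integer.toHexString(bits));
                // write(int) stores only the low-order 8 bits of the value
                out.write(bits);
            }
        }
    }
}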
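
If you would rather pass decimal numbers, the parsing step could look something like the sketch below. The class ByteArgs and the helper parseByte are names I made up for illustration; they are not part of the program above. The helper also rejects values that do not fit in one byte, which is the reason for the 8-digit limit on the binary strings.

```java
class ByteArgs {

    // Hypothetical helper: parses one argument in the given radix
    // (2 for binary strings, 10 for decimal numbers) and rejects
    // values that do not fit in a single byte.
    static int parseByte(String text, int radix) {
        int value = Integer.parseInt(text, radix);
        if (value < 0 || value > 255) {
            throw new IllegalArgumentException(text + " does not fit in one byte");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseByte("1000001", 2)); // binary form of the letter A
        System.out.println(parseByte("65", 10));     // decimal form of the letter A
    }
}
```

Both calls in main give the same value, so in Bits you would only need to change the radix passed to Integer.parseInt from 2 to 10.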

Example 1

Find the character A in the Basic Latin code chart. Notice that it has the hexadecimal code point 0041, which is 1000001 in binary. Let’s write a byte with this value to the text file example1.txt:

java Bits example1.txt 1000001
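
If you do not have a calculator at hand, the conversion from hexadecimal to binary can also be done in Java. This is only a small sketch; the class name CodePointForms and the helper hexToBinary are my own inventions for this example.

```java
class CodePointForms {

    // Converts a code point written in hexadecimal (as in the code charts)
    // to the binary string that Bits expects as an argument.
    static String hexToBinary(String hex) {
        return Integer.toBinaryString(Integer.parseInt(hex, 16));
    }

    public static void main(String[] args) {
        System.out.println(hexToBinary("0041"));           // prints 1000001
        System.out.println(Integer.toBinaryString('A'));   // same value, taken from the char itself
    }
}
```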

If you open the file with Microsoft Notepad, you will see that it contains the letter A, as expected. If you pick Save As from the menu, you can see that Notepad suggests the ANSI encoding. ANSI is an extended ASCII encoding (on Western European Windows installations it is the Windows-1252 code page), as is Latin-1. ANSI and Latin-1 have differences, but we will only use characters that are encoded the same way in both. So just think of ANSI as Latin-1 throughout the text.

Example 2

Click the Save As option in the Notepad menu. Save example1.txt as example2.txt with the UTF-8 encoding. Take a look at the file properties of the newly created file. Notice that it has a size of 4 bytes, but if you open it again it still contains only the lonely letter A. The 3 new bytes are located at the beginning of the file and form the Byte Order Mark (BOM) for the UTF-8 encoding. The BOM is the only way for Notepad to know that this text document is encoded in UTF-8 and not Latin-1, because the character A is encoded the same way in both encodings. Let’s create example2.txt manually:

java Bits example2.txt 11101111 10111011 10111111 1000001
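
The three bytes 11101111 10111011 10111111 are EF BB BF in hexadecimal, which is the UTF-8 encoding of the code point FEFF. A program that wants to look for the BOM the way an editor might could do roughly the following. This is a sketch under my own assumptions: the class BomCheck and the method hasUtf8Bom are hypothetical names, and I make no claim that Notepad works this way internally.

```java
class BomCheck {

    // True if the data starts with the three UTF-8 BOM bytes EF BB BF.
    // Java bytes are signed, so each one is masked with 0xFF before comparing.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        byte[] example1 = { 0x41 };                                        // contents of example1.txt
        byte[] example2 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41 }; // contents of example2.txt
        System.out.println(hasUtf8Bom(example1)); // prints false
        System.out.println(hasUtf8Bom(example2)); // prints true
    }
}
```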

Example 3

Let’s try a character that is encoded differently in the two encodings. Below we make three text files containing the Norwegian letter Å: the first encoded in Latin-1, the second in UTF-8 with a BOM, and the third in UTF-8 without the BOM. The Latin-1 chart tells us that Å has the hexadecimal code point 00C5. If we convert it using Microsoft Calculator we get the binary string 11000101 or the decimal number 197.

java Bits example3-1.txt 11000101
java Bits example3-2.txt 11101111 10111011 10111111 11000011 10000101
java Bits example3-3.txt 11000011 10000101
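
You can let Java produce the same byte values instead of working them out by hand. The sketch below encodes Å in both encodings and prints the bytes as binary strings. The class EncodeAring and the helper toBinary are names I picked for this example, and it assumes Java 7 or newer for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class EncodeAring {

    // Formats bytes as space-separated binary strings, each padded to 8 digits.
    static String toBinary(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aring = "\u00C5"; // the letter Å, written as an escape to avoid source-encoding problems
        System.out.println(toBinary(aring.getBytes(StandardCharsets.ISO_8859_1))); // prints 11000101
        System.out.println(toBinary(aring.getBytes(StandardCharsets.UTF_8)));      // prints 11000011 10000101
    }
}
```

Note that getBytes does not add a BOM, so the UTF-8 output corresponds to example3-3.txt.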

Notice that Notepad recognizes the UTF-8 encoded Å in example3-3.txt even though we left out the BOM. Also notice that we need two bytes to encode Å in UTF-8. Some characters even need four bytes in UTF-8. When read, the two bytes 11000011 and 10000101 are decoded into the code point 197: the first byte has the form 110xxxxx and the second 10yyyyyy, and the payload bits 00011 and 000101 join to form 11000101. The code point 197 refers to the letter Å in both the Unicode and the Latin-1 character sets.
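
The decoding of the two bytes can be written out as a few bit operations. The sketch below handles only the two-byte form of UTF-8 (110xxxxx 10yyyyyy) and does not reject overlong sequences, so treat it as an illustration and not a full decoder. The class DecodeTwoBytes and its decode method are hypothetical names.

```java
class DecodeTwoBytes {

    // Decodes a two-byte UTF-8 sequence of the form 110xxxxx 10yyyyyy
    // into the code point xxxxxyyyyyy.
    static int decode(int first, int second) {
        if ((first & 0xE0) != 0xC0 || (second & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        // Keep the 5 payload bits of the first byte and the 6 of the second.
        return ((first & 0x1F) << 6) | (second & 0x3F);
    }

    public static void main(String[] args) {
        int first = Integer.parseInt("11000011", 2);
        int second = Integer.parseInt("10000101", 2);
        System.out.println(decode(first, second)); // prints 197
    }
}
```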

I hope you now have a better understanding of how characters are encoded. If you want to know more, take a look at the suggested readings. I personally recommend the book XML in a Nutshell; it covers almost everything you need.

Suggested readings

SMS-Gateway

Sunday, January 14th, 2007

The computer engineering programme, which I completed at Høgskolen i Tromsø (Tromsø University College) in 2003, concludes with, among other things, a final-year project. The project ran over the last two semesters: we planned before Christmas and implemented during the spring and summer. The planning phase counted for 1 credit (vekttall), and we received 5 credits for the work after Christmas.

Final-year projects are mainly carried out as group work. Three other students and I teamed up. Our goal was to find an instructive and interesting project. We found it in an assignment from Telenor Forskning og Utvikling (Telenor Research and Development). The project was titled SMS-Gateway. Our group was named JavaSMS.

This is the original assignment that Telenor Forskning og Utvikling presented to the college:

Telenor FOU would like to have the following assignment carried out:

SMS-Gateway. We would like a Java package for sending and receiving SMS messages. We want a set of Java interfaces and classes for machine processing of SMS messages.

The students must:

  • Familiarize themselves with how to communicate with a serial port in Java.
  • Learn how to communicate with a mobile phone via the serial port and AT commands.
  • Encode and decode PDUs (the GSM-internal SMS format) according to the GSM 07.05/GSM 07.07 standards.
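
To give an idea of what encoding a PDU involves, here is a small sketch of one step: packing 7-bit characters into 8-bit octets. It is not the JavaSMS code, and the names SeptetPacker, pack and toHex are made up for this example. As far as I know the packing itself is defined in GSM 03.38 and not in the two standards listed above. The sketch assumes the text only contains characters whose GSM default alphabet code equals the ASCII code, such as the unaccented lowercase letters used here.

```java
class SeptetPacker {

    // Packs 7-bit values into octets, filling each octet from the
    // least significant bit upwards.
    // Assumption: every char already equals its GSM 7-bit code.
    static byte[] pack(String text) {
        int n = text.length();
        byte[] out = new byte[(n * 7 + 7) / 8];
        for (int i = 0; i < n; i++) {
            int septet = text.charAt(i) & 0x7F;
            int bitPos = i * 7;
            int byteIndex = bitPos / 8;
            int shift = bitPos % 8;
            out[byteIndex] |= (byte) (septet << shift);
            if (shift > 1) {
                // The septet does not fit; the remaining high bits go in the next octet.
                out[byteIndex + 1] |= (byte) (septet >> (8 - shift));
            }
        }
        return out;
    }

    // Formats bytes as uppercase hexadecimal, the way PDUs are usually written.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(pack("hello"))); // prints E8329BFD06
    }
}
```

Five characters take up 35 bits, so they fit in five octets instead of the five full bytes plus padding you might expect.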

Deliverables:

  • Initial model (delivery Dec. 2002).
  • Java code and compiled classes.
  • Example application that uses the classes.
  • JavaDoc documentation.
  • Ordinary project report.

Particular experiences that I would like to highlight: