Encoding – Getting Those Strange Characters to Behave

The problems surrounding encoding are slowly starting to get the attention of web developers around the world. Previously content was mainly written in English, so we only needed support for the characters used in the English language. As pages in other languages started to appear, they needed a way to display their special characters. The solutions they found rarely added support for special characters from other languages. This approach worked for a while, but now we’re seeing content written in one language with comments in every language imaginable, so a method to display all of these different characters at the same time is needed.

Solving this problem in a standardised way took some time, so various software vendors went with their own solutions to begin with.

Laying the Foundations

Before delving deeper into the solutions created, it is important to define the terminology that I will use (as it differs between various sources). Some basic knowledge about how encoding works will also be supplied to create the basis for the rest of this blog entry.

A character set, or more specifically, a coded character set is a set of character symbols, each of which has a unique numerical ID, which is called the character’s code point. The computer representation of the code points is decided by the character encoding.

An example of a character set is the 128-character ASCII character set, which mainly consists of the letters, numbers and basic punctuation used in the English language. Another common character set is the ISO-8859-1, or Latin 1, character set, which extended the ASCII set to also contain extra characters used in various European languages (e.g. the accented characters used in French). The most comprehensive character set in use today is the Universal Character Set (UCS) with over 1.1 million code points; as such it contains all the characters our current languages need.

Every HTML document uses the UCS – or more accurately the ISO 10646 character set, which is a less involved standard describing the same set of characters. Older browsers and less powerful devices might not support the complete character set, but that doesn’t change the fact that an HTML document can contain any character found in the UCS.

However, between each document the character encoding might change. These varying character encodings are the root of all the problems web developers experience with regard to erroneous character display. The problems are often caused by a missing definition of the encoding used on the page (which forces the browser to guess), or by the usage of different encodings for various parts of the page.

The Problem

As the web evolved, the standard usage of the ISO-8859-1 encoding started to show some problems. People wanted to use characters the encoding didn’t support.

The first solution to this was to use character entities, or numeric character references. Instead of typing the literal character you would type something like &mdash; (—) or &#1488; (א). The problem with this approach appeared when submitting data through a form. If the user typed an א, the browser would convert it to the numerical reference (&#1488;) before submitting the data, since the encoding didn’t have proper support for the character. But then the application had no way of differentiating between a user who typed the א character and one who typed the entity code.
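
To illustrate the ambiguity, here is a minimal sketch of what a PHP script would receive in both cases (the form field name is hypothetical):

    // Form page served as ISO-8859-1; that encoding cannot represent
    // the aleph, so the browser substitutes the numeric reference on
    // submit. Both submissions below arrive byte-for-byte identical.
    $comment = $_POST['comment'];
    // User typed the literal character א  -> $comment === '&#1488;'
    // User typed the text "&#1488;"       -> $comment === '&#1488;'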

The Solution

The solution found for this problem was to start using the UTF-8 character encoding. UTF-8 is backwards compatible with the old ASCII encoding and has support for all kinds of characters, so it never needs to use entities to display information. The application has full control over the kind of information it receives, and users have access to all the characters they need.

UTF-8 manages to support this wide range of characters by using more than one byte of storage for those characters that require it. Most characters used on an English page will still only use one byte (as ISO-8859-1 does), but characters that don’t fit into that small amount expand to use up to four bytes in total for a single character. There are other UTF encodings available as well. They have a base usage of two (UTF-16) or four bytes (UTF-32), but are not compatible with the ASCII standard, so they should not be used unless strictly necessary.
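
To see the variable width in action, here is a small sketch using strlen, which counts bytes rather than characters (assuming the script itself is saved as UTF-8):

    echo strlen("a");    // 1 byte  (plain ASCII)
    echo strlen("ñ");    // 2 bytes
    echo strlen("€");    // 3 bytes
    echo strlen("𝄞");    // 4 bytes (outside the Basic Multilingual Plane)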

In Code We Trust

Creating a UTF-8 application in PHP isn’t without problems though. Converting an old non-UTF application is even more error-prone, as PHP has no good and error-free tool for conversion between the various encodings, as I cover at the end of this entry.

The first step in making a web application use UTF-8 is to tell the browser which encoding to expect and thereby how it should encode data submitted through forms. This is done by either using a <meta> tag in the page’s <head>-section:
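
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">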

Or by defining the Content-Type in the header:
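
    <?php
    // Must be called before any other output is sent to the browser.
    header('Content-Type: text/html; charset=UTF-8');
    ?>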

The header, when sent, will always override the <meta> tag, so you should always send an appropriate header. If you forget it, Apache might send a default header which can differ from your desired one. Doing both is still advisable though, as the <meta> tag will be used when the content is viewed offline (where no header is available). When the browser knows that content is being sent as UTF-8 it will automatically send form data back using the same encoding, so that part doesn’t require any extra work.

Once that is done, the biggest problem when creating a UTF-8 application remains. Most of the string-related functions in PHP (e.g. strlen) don’t work any more. Those functions assume that each character in a string is encoded using ISO-8859 and takes exactly one byte of storage. When characters in a UTF-8 string use more than one byte, the calculations return the wrong result. An example is strlen("Iñtërnâtiônàlizætiøn"), which returns 27. From manually counting we see that the result is off by 7. The difference comes from the 7 characters that need two bytes to be stored using UTF-8.

If you’re aware of this problem to begin with, you can code around the use of these functions. It is possible to do so, but it requires extra work and attention from the programmer. Luckily PHP supplies us with a more elegant solution – the mbstring extension.

The mbstring extension supplies PHP with multi-byte versions of the standard string functions, e.g. mb_strlen, which is a multi-byte-aware version of the standard strlen function. So prefixing the previously erroneous string functions with mb_ should solve most of the problems encountered.
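
Revisiting the string from above, a minimal sketch (mb_strlen takes the encoding as an optional second argument):

    $s = "Iñtërnâtiônàlizætiøn";
    echo strlen($s);              // 27 – counts bytes
    echo mb_strlen($s, 'UTF-8');  // 20 – counts characters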

A more elegant solution exists though: you can override the standard function names to invoke the multi-byte versions. This is done by setting the mbstring.func_overload value to 7 in php.ini. mbstring.func_overload can be set to other values as well if you only want to overload a subset of the string functions.
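
In php.ini this might look like the following (the value is a bitmask: 1 overloads the mail function, 2 the string functions and 4 the regular expression functions, so 7 overloads them all):

    ; Overload the single-byte string functions with their mbstring versions
    mbstring.func_overload = 7
    ; Make mbstring assume UTF-8 when no encoding is given explicitly
    mbstring.internal_encoding = UTF-8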

The encoding must also be set in the text editor and in any databases used. Most modern code editors have this option, so as long as one is aware of the issue it should be easy to fix. Setting the encoding on the database level is usually no problem either, but the approach varies between the different systems.
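
As a sketch, using MySQL as an example (the connection credentials are placeholders, and other systems have their own equivalents), the connection can be told to exchange data as UTF-8 right after connecting:

    // Ask MySQL to send and receive data on this connection as UTF-8.
    $db = mysql_connect('localhost', 'user', 'password');
    mysql_query("SET NAMES 'utf8'", $db);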

Legacy Code

All is well and good when we’re creating a new application. We can start storing and handling text as UTF-8 from the very beginning. But what if we have an old application that now needs to support foreign characters?

Unfortunately there is no easy solution for this. If all your previous content is in English it should be safe to simply add the appropriate headers, since the basic characters are the same in all common encodings.

Conversion of characters not commonly used in English is trickier. PHP’s iconv library can be used for most conversions, but there is no guarantee that the converted content is completely correct. On forums and mailing lists you can also find a lot of hand-crafted functions for conversion between various encodings – which might, or might not, work. In the end the only foolproof way to convert content is to rewrite the required sections by hand, or at least double-check the changes done by an automated process like iconv.
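
A minimal conversion sketch with iconv (the //TRANSLIT suffix asks iconv to approximate characters that have no direct mapping instead of failing outright):

    $latin1 = "Fa\xE7ade";  // "Façade" as stored in ISO-8859-1
    $utf8 = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', $latin1);
    echo $utf8;  // "Façade", now encoded as UTF-8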
