ColdFusion and character sets

(January 12, 2004)

In a typical ColdFusion web application, users enter text into forms, the text is stored in a database, and the text is later queried from the database and displayed. As you work on these kinds of applications, sooner or later you're likely to encounter non-standard characters. Probably the most common example is an accent in someone's name, such as José or Jürgen, but another example is "smart quotes" inserted by Word.

If you were hand-coding an HTML document containing these characters, you would use the special escape codes provided, such as &eacute; for é and &uuml; for ü. But in an application where users enter or edit text in a form field, you have to accept that users will be entering such characters directly or copying them from a word processor. (Also, even if a user enters the right codes, in many browsers these will be converted to the actual character if the user returns to edit their data in the form.) And unless your application is very simple, you'll start seeing question marks or boxes in place of those special characters on the web.

This page is intended to provide basic help for developers experiencing problems with non-standard characters in English-language data. I don't have any experience with non-English-language web applications (such as a site that is entirely in Korean, for example), so I don't know if these suggestions would apply.

If you've never heard of Unicode or aren't comfortable with the idea of character sets, here is a good introduction to what every programmer should know about Unicode.

Step 1: Don't panic

The first thing to do when you start to see incorrectly displayed characters in your application is not to panic. The problem is not your fault and not your users' fault, and it can be fixed.

The problem with non-standard characters basically reflects the fact that we are still in transition from a world in which software was designed to handle small character sets in specific languages, to a world connected by the Internet in which every computer has to (theoretically) be able to display any character in any language. When funny question marks and boxes start showing up in your application, they are the result of inconsistent, conflicting character set settings. One part of the process is using one character set, and another part is using a different one; some of the characters (like standard English letters) match up so they display, but others don't match up and can't be displayed.

So, making your application work with non-standard characters is basically a process of making sure that every part of your site uses the same character set. That includes:

  • The database where the text is stored,
  • Any ColdFusion code that manipulates the text,
  • Any process in which the text is transmitted or re-written, and
  • The formatting of the final HTML document that the user sees.
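
The kind of mismatch described above can be reproduced in a few lines. This is only a minimal sketch, and it assumes ColdFusion MX 7 or later, where the CharsetDecode and CharsetEncode functions are available (they are not covered elsewhere on this page). The variable names are made up for illustration. One step writes the text out as bytes in one character set, and the next step reads those bytes back using a different one:

```cfml
<!--- Build "José" with Chr() so the example does not depend on this file's own encoding. --->
<cfset original = "Jos" & Chr(233)>

<!--- One part of the process stores the text as UTF-8 bytes... --->
<cfset utf8Bytes = CharsetDecode(original, "utf-8")>

<!--- ...and another part reads those bytes as ISO-8859-1, so the accented letter comes back wrong. --->
<cfset misread = CharsetEncode(utf8Bytes, "iso-8859-1")>

<!--- Reading the bytes with the character set they were written in restores the text. --->
<cfset roundTrip = CharsetEncode(utf8Bytes, "utf-8")>

<cfoutput>
  Original: #original#<br>
  Misread: #misread#<br>
  Round trip: #roundTrip#
</cfoutput>
```

The plain letters J, o, and s survive either way because both character sets encode them identically; only the accented letter is damaged.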

Step 2: Pick a standard

To make everything consistent, you need to either pick one character set to use across the board, or identify multiple character sets that you know you are going to have to convert between. Your choice of standard will probably be determined by your database:

  1. If you have an existing database that supports the Unicode character set (such as SQL Server), you'll want to use Unicode throughout your application.
  2. If your database currently uses some other standard, you have some choices:
    1. convert your data to Unicode (within the same database or to a different database platform) and then use Unicode throughout your application;
    2. convert your data to and from Unicode when interacting with the database, and use Unicode the rest of the time;
    3. or, leave your data in another character set and use that character set throughout.

If your choice is (2)(b), then you'll need some custom tags or other code that can convert between whatever character set your database uses and Unicode. For example, the CF_CharsetConvert extension converts from the ISO-8859-1 character set to the UTF-8 (Unicode) character set and back. Whenever you read data from the database to display on a page or in a form, or save data to the database, some conversion will need to take place.

The remaining choices are the same in the sense that you are just trying to apply a single standard. However, if your choice is (1) or (2)(a), your life is made somewhat simpler by the fact that ColdFusion MX defaults to the UTF-8 Unicode character set pretty consistently; there may only be a few things you need to check on to make sure everything is consistent. If your choice is (2)(c) then you will need to add more code and do more testing to overcome ColdFusion's choice of UTF-8 everywhere that it applies.

Step 3: Apply your standard

Macromedia provides a helpful list of tags and functions that control character encoding. This list gives you an idea of where in your ColdFusion code character encoding may be an issue -- the places where you should make sure that your character set of choice is being used. In most of these situations, ColdFusion defaults to the UTF-8 character set; therefore, if you choose a different character set, you should look for each of these tags and functions in your code and make sure you specify your character set of choice in each one.

The rest of this section is an overview of the most common areas in your application to look at.

General processing

Anytime ColdFusion looks at your code, it has to choose a character set to interpret it. It also needs to know what character set to use to display the value of variables and output queries, and to interpret included files.

By default, ColdFusion MX applies the UTF-8 character set to everything. If UTF-8 is your choice, you don't need to change anything. If you've chosen a different character set, use the CFPROCESSINGDIRECTIVE tag at the start of every CFML file to specify a standard:

<cfprocessingdirective pageEncoding="iso-8859-1">

If you include one file within another (explicitly using CFINCLUDE, or automatically, as happens with Application.cfm), every file that is processed needs to include this tag. If it's missing anywhere, that particular part of your application goes back to being UTF-8.

Content type

Your web application probably generates HTML pages for your users to view. Although these have a .cfm extension, they contain HTML, HEAD, and BODY tags like any other HTML page. HTML allows you to use the META tag to tell your users' web browser what character set to use to interpret the HTML page:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

By default, ColdFusion MX forces any .cfm page to be seen as using the UTF-8 character set, regardless of what character set you specify in the META tag. However, even if UTF-8 is your choice, you should still place the above meta tag at the start of the HEAD area of your HTML. This is because the page may not always be obtained by the browser directly from ColdFusion. If the page is cached (using CFCACHE or by a search engine such as Google) it will be just another HTML page, and if no character set is specified the browser will be left to guess which one to use. The same thing will happen if a user saves the page to disk for later viewing.

If you don't want to use UTF-8, you should put in a META tag specifying your character set of choice. However, you also need to override the default content type specified by ColdFusion. Use the CFCONTENT tag as in this example:

<cfcontent type="text/html; charset=iso-8859-1">

The CFCONTENT tag must appear at the top of your code before any content you want the user to see -- it should either be at the beginning of your Application.cfm (after the CFAPPLICATION) or at the top of each template. Any content before the CFCONTENT will be discarded by ColdFusion. Unlike CFPROCESSINGDIRECTIVE, you shouldn't place CFCONTENT in every included file.

Finally, if you already use CFCONTENT to output data from your application in some form other than HTML (such as JavaScript or XML), it's a good idea to specify the character set of your output:

<cfcontent type="application/x-javascript; charset=utf-8">

Forms and URLs

Your application probably contains forms to edit data, and it probably passes data between pages in URLs. By default, ColdFusion MX uses the UTF-8 character set for these operations. If UTF-8 is your choice, you don't need to change anything. If you've chosen a different character set, specify it at the start of your application (next to the CFCONTENT tag from the prior section):

<cfscript>
  setEncoding("form","iso-8859-1");
  setEncoding("url","iso-8859-1");
</cfscript>
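
Putting the settings from this and the earlier sections together, an Application.cfm for an application standardized on a character set other than UTF-8 might look something like the sketch below. The application name MyApp is a placeholder, and ISO-8859-1 is just the example character set used elsewhere on this page; substitute your own.

```cfml
<!--- Application.cfm: a sketch for an application standardized on ISO-8859-1 --->

<!--- Tell ColdFusion how to interpret this file (needed in every CFML file). --->
<cfprocessingdirective pageEncoding="iso-8859-1">

<!--- "MyApp" is a hypothetical application name. --->
<cfapplication name="MyApp">

<!--- Override the default UTF-8 content type sent to the browser. --->
<cfcontent type="text/html; charset=iso-8859-1">

<!--- Interpret incoming form fields and URL parameters in the same character set. --->
<cfscript>
  setEncoding("form", "iso-8859-1");
  setEncoding("url", "iso-8859-1");
</cfscript>
```

Each individual template still needs its own CFPROCESSINGDIRECTIVE, but the CFCONTENT and setEncoding calls only need to appear here.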

File operations

If your application writes web pages to files or stores data in text files, you'll want to keep these consistent as well. In my experience, you should always specify a character set explicitly, even if your choice is UTF-8 (the default). Use the charset attribute of the CFFILE tag, such as:

<cffile action="write" file="myfile.html" output="#MyContent#" charset="utf-8">
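
The same attribute applies when reading a file back in. The following is a rough sketch of a write followed by a read using the same character set; the file name, the sample content, and the variable names filePath and ReadBack are placeholders chosen for illustration.

```cfml
<!--- ExpandPath turns the relative name into an absolute path for CFFILE. --->
<cfset filePath = ExpandPath("myfile.html")>

<!--- Sample content containing an accented character, built with Chr() to avoid depending on this file's encoding. --->
<cfset MyContent = "<p>Jos" & Chr(233) & "</p>">

<!--- Write and read with the same explicit character set. --->
<cffile action="write" file="#filePath#" output="#MyContent#" charset="utf-8">
<cffile action="read" file="#filePath#" variable="ReadBack" charset="utf-8">
```

If the charset given on the read does not match the one used for the write, the accented character is where the damage will show up.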

Sending email

If your application sends mail using CFMAIL, the same rules apply. If you send text email, you should specify a character set for the text using the charset attribute of the CFMAIL tag, and if you send HTML email you should also use a META tag to specify the character set for the email client/web browser to use to read the email.
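
As a rough illustration of both cases, assuming UTF-8 as the chosen character set (the addresses and message text are placeholders):

```cfml
<!--- Plain-text email: the charset attribute sets the character set of the message text. --->
<cfmail to="user@example.com" from="webmaster@example.com"
        subject="Your account" charset="utf-8">
Hello,

Your account has been updated.
</cfmail>

<!--- HTML email: specify the charset attribute and repeat the character set in a META tag. --->
<cfmail to="user@example.com" from="webmaster@example.com"
        subject="Your account" type="html" charset="utf-8">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Your account has been updated.</p>
</body>
</html>
</cfmail>
```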

Extra info: SQL Server

Those of you using a recent version of SQL Server as your database platform have easy access to Unicode data types. Simply select the nchar, nvarchar, and ntext data types in place of char, varchar, and text. Just bear in mind that data in Unicode fields takes up double the storage space of regular text. So don't use the Unicode data types for fields that you know won't contain special characters (such as letter codes or file names that you control).
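
For example, a table definition along these lines mixes the two kinds of fields. This is only a sketch: the datasource name MyDSN and the table and column names are made up, and the statement is run through CFQUERY simply to keep the example in CFML.

```cfml
<!--- "MyDSN" is a hypothetical datasource pointing at a SQL Server database. --->
<cfquery datasource="MyDSN">
  CREATE TABLE Contacts (
    ContactID   int IDENTITY PRIMARY KEY,
    <!--- Names may contain accented or other special characters, so use a Unicode type. --->
    FullName    nvarchar(100) NOT NULL,
    <!--- A letter code you control never needs Unicode, so the regular type saves space. --->
    CountryCode char(2) NOT NULL
  )
</cfquery>
```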