Internationalizing ARM
The ARM API is designed to enable applications to use native code pages (or code sets) and languages, and for measurement agents to be able to support many different languages. Users of agents should contact the providers to see if the agent supports the needed code pages and languages.
The ARM API supports any code page as long as no characters are encoded with binary zero bytes (octets). This is because most strings are passed as NULL terminated strings, and the NULL terminator character is a binary zero byte. If a binary zero byte is encountered before the end of the string, the agent would interpret the zero byte as the NULL terminator and truncate the string. Most code pages meet this requirement.
Some code pages do contain binary zero bytes, but there are often alternate ways to encode the same characters. A well-known example is the Unicode standard. In its native format using 16-bit characters (UCS-2), there are binary zero bytes. However, the UTF-8 encoding of the same Unicode characters does not contain binary zero bytes, and this format is entirely compatible with the ARM API.
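The difference between the two encodings can be seen in a minimal sketch. The helper name has_zero_byte and the two sample byte arrays are illustrative assumptions, not part of the ARM API; the arrays hold the characters "A" and "é" in big-endian 16-bit form and in UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the ARM API): returns 1 if any of the
 * first len bytes of buf is a binary zero byte, otherwise 0. */
static int has_zero_byte(const unsigned char *buf, size_t len)
{
    return memchr(buf, 0, len) != NULL;
}

/* "A" (U+0041) and "é" (U+00E9) as big-endian 16-bit characters.
 * Every character in this range carries a zero high-order byte. */
static const unsigned char ucs2_text[] = { 0x00, 0x41, 0x00, 0xE9 };

/* The same two characters in UTF-8: 0x41, then 0xC3 0xA9 for U+00E9.
 * No zero byte appears, so a NULL terminator can safely follow. */
static const unsigned char utf8_text[] = { 0x41, 0xC3, 0xA9 };
```

An agent that treated ucs2_text as a NULL terminated string would see an empty string, because the very first byte is zero. The UTF-8 form has no such byte and would be passed through intact.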
Agents that support native languages will often use the following technique. When the application links to the agent it links to a part of the agent that executes in the same process space as the application. Typically this small part of the agent communicates with the main part of the agent across an inter-process communications (IPC) channel. The small part of the agent that executes in the same process as the application can issue an operating system call to find out what code page and language the process is using. It can then pass this information to the main part of the agent, and the main part of the agent can convert from the native code page as necessary, or the small part of the agent can make the conversion itself.
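As a rough illustration of the query step on a POSIX system, the in-process part of an agent could read the current locale name and code set as sketched below. The function name describe_process_locale and the "locale;codeset" output format are assumptions made for this example; a real agent may gather and forward this information differently, and Windows would need a different call.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of the ARM API): writes "locale;codeset"
 * for the calling process into out. Returns 0 on success, or -1 if the
 * information is unavailable or does not fit in out_size bytes. */
static int describe_process_locale(char *out, size_t out_size)
{
    /* Passing NULL queries the current setting without changing it. */
    const char *locale = setlocale(LC_CTYPE, NULL);

    /* CODESET names the character encoding of the current locale. */
    const char *codeset = nl_langinfo(CODESET);

    if (locale == NULL || codeset == NULL) {
        return -1;
    }

    int n = snprintf(out, out_size, "%s;%s", locale, codeset);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

The resulting string is the kind of information the small part of the agent could send across the IPC channel so that the main part knows which conversion to apply. The code set name returned for a given locale varies between platforms, so it should be treated as an opaque label unless the agent maintains its own mapping.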
On some operating systems there is more than one way to define what language and code page the application is using. In order to avoid any ambiguity, the following conventions apply for the specified operating systems.
- Windows 95, Windows 98, Windows NT (Win32): the locale information is set using SetThreadLocale(). Note that this is NOT setlocale(), which is used on Unix and is also supported on Windows. SetThreadLocale() is used because it is the most granular choice.
- Unix: the locale information is set using setlocale().
- TBD: OS/400 and S/390.
There are the following three restrictions on the use of native languages.
- The strings can contain no binary zero bytes except for the NULL terminator character (as was mentioned above).
- All the strings should be encoded using the same code page and language information as the process that executes the arm_init call. This also implies that the code page and language information should not change after the arm_init call.
- This technique does not apply to any string data passed within the optional buffers on arm_start, arm_update, and arm_stop. This is because these strings are not NULL terminated (note that it does apply to the metric descriptions passed within the optional buffer on arm_getid). Further, these strings are often about things that are external to the program, such as a part number or an error code, so the requirement to use the same code page and language information as the process is unacceptable. Application developers are strongly advised to restrict these strings to the first 128 characters of the standard Latin code pages for ASCII, or the first 256 characters for EBCDIC (depending on the platform).
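For the ASCII case, the recommendation in the last restriction could be checked before string data is copied into an optional buffer. The sketch below is only an illustration: is_seven_bit_ascii is a hypothetical helper, not an ARM API function, and it takes an explicit length because these strings are not NULL terminated.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the ARM API): returns 1 if every one
 * of the len bytes in data lies in the 7-bit ASCII range 0x00..0x7F,
 * otherwise 0. An empty range is considered valid. */
static int is_seven_bit_ascii(const char *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* Cast first: plain char may be signed on some platforms. */
        if ((unsigned char)data[i] > 0x7F) {
            return 0;
        }
    }
    return 1;
}
```

An application might call this on a part number or error code and fall back to a substitute value when the check fails. On EBCDIC platforms the full 256-value range is allowed by the recommendation, so no equivalent check would be needed there.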