String Formatting in C and C++

Contents

Format Specifiers: Format string specifier issues

Format Specifiers - %h and %l: Single and Wide Prefixes

Format Specifiers - %s: Small %s Specifier Issues

Format Specifiers - %S: Large %s Specifier Issues

Multibyte Notes: Notes on problems with multibyte characters

Locale and Formatting: How Locale influences Formatting

String Formatting

Format Specifiers - Strings and Characters

For purposes of this section, we will refer to both single byte and multibyte strings as char strings, since they are treated essentially the same and what's most important in this context is that they both are of type char*.

This section addresses input and output formatting issues stemming from the use of wchar_t wide characters. These arise when wide parameters are passed to single and multibyte functions such as printf, and also in the wide versions of the functions themselves, such as wprintf.

In addition to the familiar %s and %c specifiers, there are also capital-letter versions of these specifiers (%S and %C), as well as the single and wide prefix characters h and l (as in %hs and %ls). Note that Windows and ANSI treat most of these specifiers differently, so special care is needed for code bases intended to be compiled for both platforms.

Single and Wide Specifiers (%h and %l)

To specify that a parameter is to be treated as a single byte parameter, regardless of whether a single or wide function call is being used, use the format specifiers %hs and %hc for strings and characters respectively.

To specify that a parameter is to be treated as a wide parameter, regardless of whether a single or wide function call is being used, use the format specifiers %ls and %lc for strings and characters respectively.

Windows and ANSI behave the same with regard to these prefixed specifiers.
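As a minimal sketch of the explicit wide prefix, the two helpers below (hypothetical names, not taken from any library) always mark wchar_t* arguments with %ls, so the calls do not depend on how a platform interprets a bare %s. The sketch deliberately uses only %ls, because the h prefix on %s is, as far as we know, a vendor extension rather than part of ISO C. It also assumes the text is plain ASCII, so the conversion succeeds in the default "C" locale.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: formats a narrow label and a wide value into a
   narrow buffer. In a narrow function a bare %s means char* under both
   conventions; %ls always means wchar_t*. */
int format_label_value(char *out, size_t out_size,
                       const char *label, const wchar_t *value)
{
    assert(out != NULL && label != NULL && value != NULL);
    return snprintf(out, out_size, "%s=%ls", label, value);
}

/* The same idea in a wide function: %ls is still wchar_t*.
   Note that out_count is a count of wide characters, not bytes. */
int format_wide_value(wchar_t *out, size_t out_count, const wchar_t *value)
{
    assert(out != NULL && value != NULL);
    return swprintf(out, out_count, L"[%ls]", value);
}
```

Calling format_label_value(buf, sizeof buf, "name", L"abc") should leave "name=abc" in buf. Non-ASCII wide text would additionally need a suitable LC_CTYPE locale for the wide-to-multibyte conversion.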

Unqualified String Specifiers (small %s)

These tables show how Windows and ANSI treat parameters based on the %s specifier and the style of function call (single, generic, or wide):

Windows function	Specifier	Parameter needs to be
printf/sprintf (single/MBCS)	%s	char*
_tprintf/_stprintf (generic)	%s	TCHAR*
wprintf/swprintf (wide)	%s	wchar_t*

ANSI function	Specifier	Parameter needs to be
printf/sprintf (single/MBCS)	%s	char*
wprintf/swprintf (wide)	%s	char*

Note that ANSI in essence always treats %s the same way as %hs; in other words, the parameter is always assumed to be a single byte string.

Windows, on the other hand, treats %s differently based on the type of function call, and is not ANSI-standard because of this. For single byte function calls, %s acts like the single byte %hs specifier, but for wide function calls, %s acts like the wide %ls specifier.

For Windows Generic calls, the parameter is expected to be of type TCHAR*, so if the code is compiled with the _UNICODE flag off it is assumed to be a single byte string, and with the _UNICODE flag on it is assumed to be a wide string. (The requirement for %s to mean either single byte or wide depending on Generic compile flags is probably the reason why Microsoft took a non-ANSI-standard approach to these specifiers.)
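To make the ANSI row of the table concrete, here is a small sketch with a hypothetical helper, widen_with_bare_s. It assumes a library that follows the ANSI/ISO convention (glibc, for example). Under the Windows convention described above, the same bare %s in swprintf would expect a wchar_t*, so this exact call would be wrong there.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper illustrating the ANSI behavior only.
   ASSUMPTION: the C library follows ISO C, where a bare %s in a wide
   function consumes a char* and converts it to wide characters.
   Do not rely on this with the Microsoft runtime, where %s in a wide
   function means wchar_t*. */
int widen_with_bare_s(wchar_t *out, size_t out_count, const char *narrow)
{
    assert(out != NULL && narrow != NULL);
    return swprintf(out, out_count, L"%s", narrow);
}
```

Code that must compile for both platforms can avoid the question by marking wide arguments explicitly with %ls instead of relying on a bare %s in wide calls.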

Unqualified String Specifiers (large %S)

These tables show how Windows and ANSI treat parameters based on the %S specifier and the style of function call (single, generic, or wide):

Windows function	Specifier	Parameter needs to be
printf/sprintf (single/MBCS)	%S	wchar_t*
_tprintf/_stprintf (generic)	%S	(don't use)
wprintf/swprintf (wide)	%S	char*

ANSI function	Specifier	Parameter needs to be
printf/sprintf (single/MBCS)	%S	wchar_t*
wprintf/swprintf (wide)	%S	wchar_t*

Both ANSI and Windows treat %S basically as the opposite of %s in terms of single byte or wide, which ironically means that Windows and ANSI again handle these specifiers differently.

Note that ANSI in essence always treats %S the same way as %ls; in other words, the parameter is always assumed to be a wide string.

Windows, on the other hand, treats %S differently based on the type of function call. For single byte function calls, %S acts like the wide %ls specifier, but for wide function calls, %S acts like the single byte %hs specifier.

This specifier should not be used for Windows Generic calls. Since %S is the "opposite" of %s, the parameter would need to be wide if the _UNICODE flag is off, and single byte if the _UNICODE flag is on. The TCHAR generic type does not work this way, and there is no "anti-TCHAR" kind of datatype.

Multibyte Notes

With regard to multibyte encodings such as UTF-8 and Shift-JIS, the single byte functions work correctly, with only a few minor notes.

One is that for input functions such as scanf, the single character formats like %c and %C are not multibyte compatible. This is because a multibyte character cannot fit in the single char that is required as the parameter type.
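The single character limitation can be seen in a short sketch. The helper name first_byte_via_scanf is hypothetical, and the input is assumed to be UTF-8, where "é" is encoded as the two bytes 0xC3 0xA9.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one %c from text and returns the byte that
   was stored (as an unsigned value), or -1 if nothing could be read. */
int first_byte_via_scanf(const char *text)
{
    char c = 0;
    assert(text != NULL);
    if (sscanf(text, "%c", &c) != 1)
        return -1;
    return (unsigned char)c;
}
```

For the UTF-8 input "\xC3\xA9" the helper returns 0xC3, which is only the lead byte of the character. The second byte is left unread in the input.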

Another is that for functions that take count parameters, like snprintf, the count parameter is always given in terms of bytes, never the number of multibyte characters.
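The byte-based count can be illustrated with the sketch below. Both helper names are hypothetical, and the input is assumed to be valid UTF-8. The string "caf\xC3\xA9" ("café") has four characters but occupies five bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Counts UTF-8 characters by skipping continuation bytes (10xxxxxx).
   ASSUMPTION: the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}

/* Copies src into out with snprintf. out_size is a size in bytes, and
   the return value is the number of bytes the full result needs,
   not counting the terminating null. */
int copy_bytes(char *out, size_t out_size, const char *src)
{
    assert(out != NULL && src != NULL);
    return snprintf(out, out_size, "%s", src);
}
```

With a five-byte buffer, copy_bytes keeps only four bytes plus the terminator, so the copy ends in the lone lead byte 0xC3. Truncation by byte count can therefore split a multibyte character, which callers may need to detect and trim.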

Locale Influences on Formatting

All of the printf family of functions use the LC_NUMERIC category setting of the current locale when formatting numbers. This category should therefore be set properly with the setlocale function before calling these functions.

An example of how this is used is seen with the floating-point decimal point separator. The United States uses a period ('.') for the separator, while many European countries such as France use a comma (','). (For example, 123.45 vs. 123,45)
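A minimal sketch of setting LC_NUMERIC before formatting follows. The helper name format_in_locale is hypothetical. Locale names other than "C" (for example "fr_FR.UTF-8") vary by platform and may not be installed, so the helper reports failure instead of assuming they exist.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: selects the named LC_NUMERIC locale, then formats
   value with two decimals. Returns 0 on success, or -1 if the locale is
   not available on this system.
   Note: setlocale changes the locale for the whole process, and the
   new setting stays in effect after this function returns. */
int format_in_locale(char *out, size_t out_size,
                     const char *locale_name, double value)
{
    assert(out != NULL && locale_name != NULL);
    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return -1;
    snprintf(out, out_size, "%.2f", value);
    return 0;
}
```

With the "C" locale the result for 123.45 is "123.45". With a French locale, where one is installed, the same call would be expected to produce a comma as the separator. The separator currently in effect can be inspected through localeconv()->decimal_point.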

For output, the issue is primarily just ensuring that the locale is set properly.

For input, however, such as with scanf, note that this setting may influence how numeric values are parsed. It can therefore be quite important to consider the source of the numeric string being parsed.

Consider the case where a number string is in a canonical, locale-independent form: for example, a value that is stored in a database or comes from a protocol whose string format always uses a period as the decimal separator. If your application is operating in the French locale, you will have to temporarily switch to a locale such as United States English in order to parse the number, and then restore the French locale afterwards.

On the other hand, if the string is retrieved from a source such as a dialog box, where the value is likely to be in the local format, you will probably want to leave the locale as it is currently set for the application.
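The save, switch, and restore sequence might be sketched as follows. Two substitutions are made here and should be read as assumptions: the sketch uses the portable "C" locale in place of a United States English locale (both use a period as the separator), and it parses with strtod, which also honors LC_NUMERIC, so that trailing garbage can be rejected. The helper name parse_canonical_double is hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parses a period-separated number regardless of
   the application's LC_NUMERIC setting. Returns 1 on success, 0 on
   failure. */
int parse_canonical_double(const char *text, double *out)
{
    const char *current;
    char *saved;
    char *end = NULL;
    double value;

    assert(text != NULL && out != NULL);

    /* Copy the current locale name: the string returned by setlocale
       may be overwritten by a later setlocale call. */
    current = setlocale(LC_NUMERIC, NULL);
    if (current == NULL)
        return 0;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return 0;
    strcpy(saved, current);

    setlocale(LC_NUMERIC, "C");      /* period is the decimal separator */
    value = strtod(text, &end);
    setlocale(LC_NUMERIC, saved);    /* restore the application's locale */
    free(saved);

    /* Reject empty input and anything left over after the number. */
    if (end == text || *end != '\0')
        return 0;
    *out = value;
    return 1;
}
```

Because setlocale affects the whole process, switching locales like this is not safe while other threads are formatting or parsing numbers. Applications with that concern may prefer a parsing routine that does not depend on the global locale.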