Windows Console Applications and Character Encoding

Results

This describes my observations, and the best of my understanding, of what happens to text in Windows console applications. It builds on a much earlier (and incomplete) analysis I did under Windows XP; the present analysis relates to the current version of Windows 11. This work was done with the help of a utility program (the Encoding Explorer) that let me quickly try different combinations of modes and environmental settings and observe the results.

The discussion looks only at output from a native application to the console and intentionally ignores the parallel discussion about input. There are a couple of reasons why I chose to ignore the input side (not least avoiding the doubling of work), but focusing on one side of the equation at least greatly simplifies the discussion and makes it less confusing. I'm operating with the assumption that any observations and statements that can be made about the output side of I/O should apply (mutatis mutandis) to input as well.

Sample Data

I used the following input data for my investigations. The wide character code points were specifically chosen so that their interpretation agrees with the corresponding narrow character interpretation under Code Page 437.

| ordinal | display | narrow | wide | Unicode name |
|---|---|---|---|---|
| 0 | A | 41 | 0041 | LATIN CAPITAL LETTER A |
| 1 | ╬ | CE | 256C | BOX DRAWINGS DOUBLE VERTICAL AND HORIZONTAL |
| 2 | ú | A3 | 00FA | LATIN SMALL LETTER U WITH ACUTE |
| 3 | δ | EB | 03B4 | GREEK SMALL LETTER DELTA |
| 4 | î | 8C | 00EE | LATIN SMALL LETTER I WITH CIRCUMFLEX |
| 5 | | 0A | 000A | LINE FEED (LF) |

Note that these characters are intentionally not uniformly representable in other character sets, such as ASCII, ISO-8859-1 or -15, or Code Page 1252.
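
As a minimal sketch of how the narrow and wide columns relate, the following maps each sample byte to its wide-character value under Code Page 437. The function names are hypothetical, and the lookup covers only the six sample bytes; a complete Code Page 437 table would need all 256 entries.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Map one Code Page 437 byte to its UTF-16 code unit.
// Only the six sample bytes from the table are covered (an assumption made
// for brevity); any other byte yields U+FFFD REPLACEMENT CHARACTER.
char16_t cp437_sample_to_utf16(std::uint8_t byte) {
    switch (byte) {
        case 0x41: return 0x0041;  // LATIN CAPITAL LETTER A
        case 0xCE: return 0x256C;  // BOX DRAWINGS DOUBLE VERTICAL AND HORIZONTAL
        case 0xA3: return 0x00FA;  // LATIN SMALL LETTER U WITH ACUTE
        case 0xEB: return 0x03B4;  // GREEK SMALL LETTER DELTA
        case 0x8C: return 0x00EE;  // LATIN SMALL LETTER I WITH CIRCUMFLEX
        case 0x0A: return 0x000A;  // LINE FEED (LF)
        default:   return 0xFFFD;  // not part of the sample
    }
}

// Widen a whole narrow string, one byte per UTF-16 code unit.
std::vector<char16_t> widen_sample(const std::vector<std::uint8_t>& narrow) {
    std::vector<char16_t> wide;
    wide.reserve(narrow.size());
    for (std::uint8_t b : narrow) {
        wide.push_back(cp437_sample_to_utf16(b));
    }
    assert(wide.size() == narrow.size());  // one code unit per input byte
    return wide;
}
```

Calling widen_sample() on the bytes 41 CE A3 EB 8C 0A yields the code units in the wide column of the table.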

Methods

These represent the different APIs and variations that can be used in a native console application to output text. Where an API offers both, I further distinguish between formatted and unformatted output.

Windows API

There is no fundamental difference between WriteConsoleA() and WriteFile(); in particular, WriteConsole() does not do any kind of special text-mode processing such as CR/LF conversion. The differences are mostly utilitarian: the Console API as a whole is aware that the output device is a console and provides corresponding control over it (such as color settings), and it comes in two versions, WriteConsoleA() and WriteConsoleW(), which may be of use when writing code that needs to deal with both narrow and wide characters. WriteConsole() will not work if standard output has been redirected to a file, in which case the handle returned by GetStdHandle() is not a console handle but a file handle. WriteConsoleA() and WriteFile() both appear to reproduce their input byte for byte, and the console itself displays these bytes as characters according to the Console Output Code Page.

WriteConsoleW() assumes, as usual, that the wide-character input is UTF-16, which is interpreted by the console as UTF-8 regardless of the Console Output Code Page setting.
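
The redirection caveat can be handled roughly as follows. This is only a sketch: the function name write_stdout_bytes is hypothetical, using GetConsoleMode() to detect a real console is one common idiom rather than anything the analysis above prescribes, and the non-Windows branch exists only so the snippet compiles elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Write raw bytes to standard output, choosing the API by handle type.
// Returns true if every byte was written.
bool write_stdout_bytes(const char* data, std::size_t size) {
    assert(data != nullptr);
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    if (out == INVALID_HANDLE_VALUE || out == nullptr) {
        return false;
    }
    DWORD mode = 0;
    DWORD written = 0;
    if (GetConsoleMode(out, &mode)) {
        // A real console: WriteConsoleA() is usable here.
        return WriteConsoleA(out, data, static_cast<DWORD>(size), &written, nullptr) != 0
            && written == size;
    }
    // Redirected to a file or pipe: WriteConsoleA() would fail, use WriteFile().
    return WriteFile(out, data, static_cast<DWORD>(size), &written, nullptr) != 0
        && written == size;
#else
    // Portability fallback only; not part of the Windows discussion.
    return std::fwrite(data, 1, size, stdout) == size;
#endif
}
```

Either branch passes the bytes through unchanged, so the displayed result still depends on the Console Output Code Page.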

‘POSIX style’

In this case we use _open() and _write(). The different modes are selected by the oflag argument to _open() or the flag argument to _setmode() for an already-open file. These functions, while they are provided by Microsoft in the C runtime library, are not actually part of Standard C but derive originally from Unix/POSIX. Specifying nonsensical Unicode + binary mode (as _O_BINARY | _O_U8TEXT) does not trigger an error and has the same effect as specifying just binary mode.

When writing directly to a file, a BOM is generated for wide, unicode, and wide unicode modes (but obviously not for text or binary). When redirecting to a file, Command Prompt also generates a BOM for those three modes but PowerShell does not.

NOTE The Windows header file fcntl.h claims that _O_WTEXT should produce a BOM and _O_U8TEXT and _O_U16TEXT should not, but I found all three of those modes to behave identically in that regard. In fact, overall I could not find any difference whatsoever in behavior between _O_WTEXT and _O_U16TEXT.
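
One way to check which mode produced a BOM is to inspect the first bytes of the resulting file. The helper below is a sketch with hypothetical names (Bom, detect_bom, detect_file_bom); it recognizes only the UTF-8 and UTF-16 little-endian marks and does not try to distinguish UTF-32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

enum class Bom { None, Utf8, Utf16LE };

// Classify the byte order mark, if any, at the start of a buffer.
Bom detect_bom(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
        return Bom::Utf8;     // EF BB BF
    }
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        return Bom::Utf16LE;  // FF FE
    }
    return Bom::None;
}

// Read at most the first three bytes of a file and classify them.
// A file that cannot be opened is reported as having no BOM.
Bom detect_file_bom(const char* path) {
    assert(path != nullptr);
    std::FILE* f = std::fopen(path, "rb");  // binary mode: no translation on read
    if (f == nullptr) {
        return Bom::None;
    }
    std::vector<std::uint8_t> head(3);
    const std::size_t n = std::fread(head.data(), 1, head.size(), f);
    std::fclose(f);
    head.resize(n);
    return detect_bom(head);
}
```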

Standard C

Standard C I/O is done with FILE stream handles represented by stdout or created by fopen(). C file streams assume a 'narrow' or 'wide' orientation that is determined implicitly by whether fprintf() or fwprintf() is first called on them or, theoretically, by calling fwide() explicitly. Note though that fwide() is documented as being unimplemented. fwrite() is used for unformatted output.

The different modes are selected by the mode argument to [fopen()](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen) and work exactly the same as the corresponding POSIX modes.

Again, specifying nonsensical Unicode + binary mode (as "wb,ccs=utf-8") does not trigger an error and has the same effect as specifying just binary mode. All four text modes perform CR/LF conversion, and all three wide-character modes insert a prefix BOM.
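
For reference, the fopen() mode strings used in this analysis can be collected in one place. This is a sketch: the Mode enum and both function names are hypothetical, the strings are the ones listed in the summary table, and the ccs= forms are a Microsoft extension whose behavior with other C libraries is not covered here.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// The five modes discussed in this analysis (names are illustrative).
enum class Mode { Binary, Text, Wide, Unicode, WideUnicode };

// Return the fopen() mode string for writing in the given mode.
std::string fopen_write_mode(Mode mode) {
    switch (mode) {
        case Mode::Binary:      return "wb";
        case Mode::Text:        return "w";
        case Mode::Wide:        return "w,ccs=unicode";
        case Mode::Unicode:     return "w,ccs=utf-8";
        case Mode::WideUnicode: return "w,ccs=utf-16le";
    }
    assert(false && "unhandled Mode value");
    return "w";
}

// Open a file for writing in the given mode (Microsoft C runtime assumed
// for the ccs= variants). Returns nullptr on failure, as fopen() does.
std::FILE* open_for_write(const char* path, Mode mode) {
    return std::fopen(path, fopen_write_mode(mode).c_str());
}
```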

As documented, it's possible to change the mode of an already-open stream by extracting the POSIX file descriptor and applying _setmode():

_setmode(fileno(stdout), _O_U8TEXT)

All this clearly suggests that the Standard C APIs are implemented on top of the POSIX style functions.

NOTE The BOM that is normally generated when writing to a file in one of the wide-character modes is not generated when the standard output streams are set to one of those modes with _setmode().

NOTE Unicode mode is not supported through narrow-character fprintf even with the "%S" wide-character [format specifier](https://learn.microsoft.com/en-us/cpp/c-runtime-library/format-specification-syntax-printf-and-wprintf-functions?view=msvc-170#type-field-characters) (fwprintf has to be used).

Standard C++

Standard C++ uses I/O Streams based on std::basic_ostream. Formatted I/O is done through the usual << operators; unformatted I/O is done through .write().

An important point to make about C++ file stream I/O is that even though a wide-character file stream does exist (i.e. std::wfstream) that does accept wide-character text, the underlying C FILE of a C++ file stream (at least in Windows) is always narrow; and there is no standard or specific non-standard API to change this. This implies that any data sent through a wide-character C++ file stream is converted from the wide character set (always UTF-16) to some narrow character set. This is counter to the behavior of the C wide-character streams, which dynamically assume either a narrow or wide orientation (as described above).

The narrow character set that is converted to is specified by means of a C++ locale; this can be set either globally through std::locale::global() or on an individual stream through .imbue(). Setting the global C++ locale sets the C locale as well, but the converse is not generally true: i.e., setlocale() does not affect the C++ locale except for standard output and presumably standard error and input as well. This might be due to the C++ and C streams having been synchronized through something like std::ios_base::sync_with_stdio(), but I haven't verified this.

It is important to understand that this is an actual conversion, not just a reinterpretation of the byte stream; it happens whether the output is sent to the console or to a file. If a character cannot be represented in the given narrow character set, it is generally approximated. For example, our wide character sample string is converted as follows with the given locales:

| locale | displayed | converted bytes |
|---|---|---|
| C (default) | A | 41 |
| .437/.OCP | A╬úδî | 41 ce a3 eb 8c 0d 0a |
| .1252/.ACP | A+údî | 41 2b fa 64 ee 0d 0a |
| .65001/.utf8 | A╬úδî | 41 e2 95 ac c3 ba ce b4 c3 ae 0d 0a |

Only the Code Page 437 and UTF-8 locales are thus able to represent this UTF-16 sample string correctly; the default C locale actually completely gives up after the first nonrepresentable character.
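
The UTF-8 row can be reproduced by hand with a small encoder. This is a sketch of the standard UTF-16 to UTF-8 transformation, not of what the C++ runtime does internally; the function name is hypothetical, and the CR that text mode inserts before LF is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encode UTF-16 code units as UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD.
std::vector<std::uint8_t> utf16_to_utf8(const std::u16string& in) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low_next = i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
        if (high && low_next) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            out.push_back(static_cast<std::uint8_t>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Applied to the wide sample 0041 256C 00FA 03B4 00EE 000A, this yields 41 e2 95 ac c3 ba ce b4 c3 ae 0a, which matches the .65001 row apart from the 0d added by CR/LF translation.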

A std::locale object can be constructed with a std::codecvt facet, giving more precise and explicit control over the conversion. In particular, using a ‘non-converting’ character set conversion that converts UTF-16 into itself (see codecvt_utf16) in theory seems to allow the construction of a truly wide-character stream which maintains UTF-16; however, internally the stream is still narrow. This can be seen in any of the text modes by observing the effect of CR/LF translation, which interprets the supposed UTF-16 code point 000A as narrow-character 0A 00 and converts it to 0D 0A 00. This obviously messes up the attempted output of UTF-16, which is why the codecvt_utf16 approach can only be made to work in binary mode. This in turn requires the programmer to perform the BOM marking and CR/LF translation, at which point one starts to question the purpose of using wfstream at all. Note also that the codecvt classes and the entire <codecvt> header are deprecated as of C++17.
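
The corruption is easy to reproduce outside the runtime. The sketch below applies a byte-wise LF to CR LF translation, which is what a narrow text-mode stream is assumed to do here, to a UTF-16LE byte sequence; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Byte-wise text-mode translation: insert 0D before every 0A byte,
// with no knowledge of multi-byte code units.
std::vector<std::uint8_t> narrow_text_mode(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    out.reserve(in.size());
    for (std::uint8_t b : in) {
        if (b == 0x0A) {
            out.push_back(0x0D);  // CR inserted as a single byte
        }
        out.push_back(b);
    }
    assert(out.size() >= in.size());  // translation only ever adds bytes
    return out;
}
```

Feeding it 41 00 0A 00 (UTF-16LE for "A" followed by LF) produces 41 00 0D 0A 00: an odd number of bytes that is no longer valid UTF-16. The same happens to any code unit that merely contains a 0A byte, such as U+010A, whose bytes 0A 01 become 0D 0A 01.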

A different ‘back door’ to avoid this and obtain an almost true wide-character C++ file stream is to open the C FILE with the ccs=utf-16le mode and use the Windows STL extension that allows constructing a std::wfstream from it. So while the C++ stream itself has a narrow-character 'associated character sequence', the underlying C and POSIX streams are wide. Therefore, BOM insertion and CR/LF translation happen correctly and as expected, and no locale is needed.

See the Locales section below for a reference on valid locale strings.

Modes

I've found that there are broadly four (nominally five) fundamental modes in which character data can be interpreted by the various API methods. Some of these modes have direct representations in the corresponding API methods; others need explicit work by the programmer to achieve.

Binary Mode

This is the simplest mode in which data is not interpreted as text but as binary data (either 'narrow' bytes or 'wide' 16-bit words) and has no character set interpretation. In this mode we expect no conversions to be applied to the data. All the other modes cause data to be interpreted as characters (i.e., text).

Text Mode (Narrow-Character)

The canonical text mode sees data as (narrow) byte-size characters. In this, as in all the text modes, LF is translated to CR LF. The Standard C locale determines the character set interpretation of the narrow characters.

Wide-Character Text Mode

In this case the input is interpreted as wide-character UTF-16 and presented on output as UTF-16 as well. The locale and Console Output Code Page are ignored.

‘Unicode’ Mode

Several output methods in Windows support something that is vaguely and confusingly referred to as ‘Unicode mode’. This is like wide-character text mode in that the input is interpreted as UTF-16; internally, however, it is converted to UTF-8. Again, the locale and Console Output Code Page are ignored.

‘Wide Unicode’ Mode

Some of the Windows output methods distinguish between narrow and wide-character Unicode modes; in practice, I haven't found any difference between regular and ‘Unicode’ wide-character modes. In the C++ methods, regular wide-character mode refers to the standard wide-character streams, and ‘Unicode wide-character mode’ to a nonstandard method of achieving UTF-16 output.

This works with Standard C unformatted and POSIX-style output; with formatted Standard C I/O, however, only fwprintf() can be used, otherwise a run-time assertion is triggered. This makes some sense given that the input is interpreted as wide characters (though it doesn't explain why fprintf("%S") still doesn't work), and is exactly specified in the Microsoft documentation for _setmode(), further confirming the equivalence between Standard C and POSIX-style ‘Unicode mode’. Finally, applying _setmode() to put standard output in Unicode mode seems to show the same behavior and displays "칁ઌ", which visually corresponds with

Now convinced that _setmode() allows us to perform Unicode mode I/O to the console, and given the observation that the Console Output Code Page appears to be overridden by Unicode mode, I was wondering whether it was the presence of the Unicode BOM itself that is recognized by the console: but that is not the case. Just outputting the same UTF-8 string of bytes in binary mode does not trigger the console to interpret UTF-8; and it also does not trigger the effect of ignoring the Output Code Page.

NOTE I actually don't understand why the BOM occurs in Unicode Mode when writing directly to a file, but not when redirected to a file in the console. The former seems to suggest the BOM is added in the POSIX layer and removed by the console; but I'm not able to confirm this removal by explicitly writing a UTF-8 BOM to the console.

‘Wide-Character Unicode Mode’

The output method APIs do nominally support something that might be called 'wide-character Unicode mode' (corresponding to _O_U16TEXT in the POSIX API); however, in all of my testing I could not find any difference between it and 'regular' wide-character text mode.

Summary

This table summarizes the available methods and modes.

| Method | Binary | Text | Wide | Unicode | Wide Unicode |
|---|---|---|---|---|---|
| Windows API | WriteFile() | WriteConsoleA() | WriteConsoleW() | n/a | n/a |
| ‘POSIX’ style | _open(_O_BINARY) | _open(_O_TEXT) | _open(_O_WTEXT) | _open(_O_U8TEXT) | _open(_O_U16TEXT) |
| C unformatted | fopen("wb"); fwrite() | fopen("w"); fwrite() | fopen("w,ccs=unicode"); fwrite() | fopen("w,ccs=utf-8"); fwrite() | fopen("w,ccs=utf-16le"); fwrite() |
| C formatted | fopen("wb"); fprintf() | fopen("w"); fprintf() | fopen("w,ccs=unicode"); fwprintf() | fopen("w,ccs=utf-8"); fwprintf() | fopen("w,ccs=utf-16le"); fwprintf() |
| C++ unformatted | ostream(ios::binary); .write() | ostream(); .write() | wostream(); .write() | fopen("w,ccs=utf-8"); wostream(FILE); .write() | fopen("w,ccs=utf-16le"); wostream(FILE); codecvt_utf16; .write() |
| C++ formatted | ostream(ios::binary); << | ostream(); << | wostream(); << | fopen("w,ccs=utf-8"); wostream(FILE); << | fopen("w,ccs=utf-16le"); wostream(FILE); codecvt_utf16; << |

Console

How the console displays characters is influenced by three properties.

Console Font

This used to be an issue: without a ‘good’ font being configured in the Command Prompt, you would not get the correct characters displayed, no matter what. In Windows 11 (at least) this does not appear to be an issue anymore, as the default font seems to have adequate character set coverage. On systems before Windows 11, check to make sure something like Lucida Console is configured.

Console Code Page

There are two different code pages that appear like they should be associated with console output: the Console Code Page and the Console Output Code Page. Command Prompt, PowerShell, and chcp handle these two differently, as summarized in the table:

| | Console Code Page | Console Output Code Page |
|---|---|---|
| Command Prompt | chcp/console | chcp/console |
| PowerShell | chcp/console | 437/process |

Here are some specific Code Pages I looked at; see Code Page Identifiers for a complete list:

Locale

The Standard C locale setting (through setlocale()) interplays with the Console Output Code Page in that it defines the character set of the output that is sent to the console. Setting the Standard C locale in general has no effect on direct or console-redirected file output (although see above for C++ I/O) but affects the presentation within the console itself with the POSIX-style method in ‘text’ mode. This affects the POSIX-style, C (formatted or unformatted), and C++ (formatted or unformatted) methods but not the Windows API. It has no effect in any other mode, presumably because input in those modes either already has a well-defined character set interpretation (wide, unicode, and wide unicode) or has no character interpretation at all (binary).

Valid locale names even include the ability to specify code pages explicitly, so this all appears very similar to the effect of the Console Output Code Page; however, it is a different mechanism as can be seen by the fact that they interact. Note that on my system, .OCP (OEM Code Page) invokes Code Page 437 and .ACP (ANSI Code Page) invokes Code Page 1252.

Examples

These show the display of the sample bytes under different combinations of locale and Console Output Code Page:

| Locale | Console Output Code Page | Output |
|---|---|---|
| .437/.OCP/not set | 437/not set | A╬úδî |
| .437/.OCP/not set | 1252 | A+údî |
| .1252/.ACP | 1252 | AÎ£ëŒ |
| .1252/.ACP | 437/not set | AI£ëO |

When the locale and Console Output Code Page are aligned, the displayed output is as expected and correct for that given Code Page. Otherwise, the characters are interpreted as existing in the locale character set but approximated with characters from the Console Output Code Page. For example, with locale .437 and Console Output Code Page 1252, ╬ is approximated by + and δ by d.

Note that setting the Console Output Code Page to 65001 (for UTF-8) allows the characters to be correctly displayed according to the specified locale in every case. This again serves to reinforce that the Console Output Code Page does not dictate the character set encoding of the data, but the available character repertoire.

Console File Redirection

When the output of a command is redirected in the console, note first of all that the console APIs such as WriteConsole() no longer work; WriteFile() must be used. However, the character set encoding of the file depends on a number of factors.

Command Prompt

The behavior supposedly depends on whether Command Prompt was invoked as CMD /U or CMD /A, but I haven't been able to find any difference: the input is passed through to the output without any character set interpretation or any CR/LF translation.

PowerShell

The ‘redirection operator’ > is a shortcut for | Out-File. The documentation claims that its default encoding is UTF-8-no-BOM, but my observation is that UTF-16-with-BOM is the default. Either way, supposedly this can be overridden with the -Encoding argument. Also, CR/LF translation is applied and CR/LF is appended after the end of the last line if none is present. It appears that the input byte stream is always interpreted as Code Page 437; neither the Console Code Page nor the Console Output Code Page settings have any effect on this.

To illustrate the effect of PowerShell file redirection: our sample byte stream becomes

FF FE 41 00 6C 25 FA 00 B4 03 EE 00 0D 00 0A 00 0D 00 0A 00 00 00 0D 00 0A 00

corresponding to the UTF-16 code points

FEFF 0041 256C 00FA 03B4 00EE 000D 000A 000D 000A 0000 000D 000A

which are exactly the interpretations under Code Page 437. For example, CE translates to 256C (BOX DRAWINGS DOUBLE VERTICAL AND HORIZONTAL).
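
The byte layout of such a file can be sketched as follows: a UTF-16 little-endian BOM followed by each code unit, low byte first. The function name is hypothetical, and the sketch covers only the serialization step, not PowerShell's line-ending handling.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Serialize UTF-16 code units as little-endian bytes, prefixed with the
// UTF-16LE byte order mark FF FE.
std::vector<std::uint8_t> utf16le_with_bom(const std::u16string& text) {
    std::vector<std::uint8_t> out = {0xFF, 0xFE};
    out.reserve(2 + 2 * text.size());
    for (char16_t unit : text) {
        out.push_back(static_cast<std::uint8_t>(unit & 0xFF));  // low byte first
        out.push_back(static_cast<std::uint8_t>(unit >> 8));    // then high byte
    }
    assert(out.size() == 2 + 2 * text.size());
    return out;
}
```

For the code points 0041 256C 00FA 03B4 00EE this gives FF FE 41 00 6C 25 FA 00 B4 03 EE 00, the start of the byte sequence shown above.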

Unicode Mode and File Redirection

Because of the multiple places where character set translations are done, you can construct some really ludicrous situations. For example, if you open a file in POSIX Unicode Mode, as described above, internally your input is interpreted as UTF-16 CE41 EBA3 0A8C and translated into UTF-8 ec b9 81 ee ae a3 e0 aa 8c. Redirecting this in PowerShell then interprets the UTF-8 bytes as Code Page 437 and translates them into UTF-16. Adding a BOM and CR/LF results in the nonsensical monstrosity:

FF FE 1E 22 63 25 FC 00 B5 03 AB 00 FA 00 B1 03 AC 00 EE 00 0D 00 0A 00

which opens as "칁ઌ". This is controlled by the PowerShell $OutputEncoding variable: It seems they affect the Console Output Code Page that is seen by the application; however, setting it through SetConsoleOutputCP() has no effect. So at a minimum, this could be used to check what encoding PowerShell is expecting.
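
The first step of this chain, where the narrow sample bytes are taken to be UTF-16, amounts to pairing bytes up in little-endian order. A minimal sketch, with a hypothetical function name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reinterpret a narrow byte sequence as UTF-16 little-endian code units.
// The input length must be even; each pair is (low byte, high byte).
std::u16string bytes_as_utf16le(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 2 == 0);
    std::u16string units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        units.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return units;
}
```

The six sample bytes 41 CE A3 EB 8C 0A become the three code units CE41 EBA3 0A8C, which is why the narrow sample shows up as unrelated characters once Unicode mode is involved.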

Resources

All pages under this domain © Copyright 1999-2023 by: Ben Hekster