I'm writing a cross-platform application in C++. All strings are UTF-8-encoded internally. Consider the following simplified code:
#include <string>
#include <iostream>

int main() {
    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    std::cout << test;
    return 0;
}
On Unix systems, std::cout expects 8-bit strings to be UTF-8-encoded, so this code works fine. On Windows, however, std::cout expects 8-bit strings to be in Latin-1 or a similar non-Unicode encoding (depending on the codepage). This leads to the following output:
Greek: ╬▒╬▓╬│╬┤; German: ├£bergr├Â├ƒentr├ñger
What can I do to make std::cout interpret 8-bit strings as UTF-8 on Windows?
This is what I tried:
#include <string>
#include <iostream>
#include <io.h>
#include <fcntl.h>

int main() {
    _setmode(_fileno(stdout), _O_U8TEXT);
    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    std::cout << test;
    return 0;
}
I was hoping that _setmode would do the trick. However, this results in the following assertion error in the line that calls operator<<:
Microsoft Visual C++ Runtime Library
Debug Assertion Failed!
Program: d:\visual studio 2015\Projects\utf8test\Debug\utf8test.exe
File: minkernel\crts\ucrt\src\appcrt\stdio\fputc.cpp
Line: 47
Expression: ( (_Stream.is_string_backed()) || (fn = _fileno(_Stream.public_stream()), ((_textmode_safe(fn) == __crt_lowio_text_mode::ansi) && !_tm_unicode_safe(fn))))
For information on how your program can cause an assertion failure, see the Visual C++ documentation on asserts.
The problem is not std::cout but the Windows console. Using C stdio, you will get the ü with fputs( "\xc3\xbc", stdout ); after setting the UTF-8 codepage (either using SetConsoleOutputCP or chcp) and setting a font with Unicode support in cmd's settings (Consolas should cover over 2000 characters, and there are registry hacks to add more capable fonts to cmd).
If you output one byte after the other with putc('\xc3'); putc('\xbc'); you will get double tofu instead, because the console interprets each byte separately as an illegal character. This is probably what the C++ streams do.
See UTF-8 output on Windows console for a lengthy discussion.
For my own project, I finally implemented a std::stringbuf that performs the conversion to Windows-1252. If you really need full Unicode output, however, this will not really help you.
An alternative approach is to overwrite cout's streambuf with one that uses fputs for the actual output:
#include <iostream>
#include <sstream>
#include <cstdio>
#include <Windows.h>

class MBuf : public std::stringbuf {
public:
    int sync() override {
        // Write the whole buffered string in one call so that
        // multi-byte UTF-8 sequences stay together.
        fputs(str().c_str(), stdout);
        str("");
        return 0;
    }
};

int main() {
    SetConsoleOutputCP(CP_UTF8);
    setvbuf(stdout, nullptr, _IONBF, 0); // unbuffered stdout
    MBuf buf;
    std::cout.rdbuf(&buf);
    std::cout << u8"Greek: αβγδ\n" << std::flush;
}
I turned off output buffering here to prevent it from interfering with unfinished UTF-8 byte sequences.
This seems to be part of the problem indeed. If I use SetConsoleOutputCP(CP_UTF8); as suggested by Miles, switch to a non-raster font as suggested by Paul, and use fputs instead of std::cout, it works! -- Now I need to find out whether there's a way to get std::cout to behave correctly.

I don't think there is a way. And fputs is not guaranteed to work either; see my double putc example. You could try to change cout's streambuf (see rdbuf()) to one that understands UTF-8 (keeping the bytes of a character together) and uses fputs.

I found that this behavior can be fixed by enabling buffering; see my answer. Thanks for pointing me in the right direction!
Regarding your edit: I'm afraid it's not working for me. I'm getting "Greek: ╬▒╬▓╬│╬┤".
Are you testing from within Visual Studio? I noticed that this only works when the program is started directly from a cmd instance.