I'm running a script on WSL Debian that fetches Windows files from a locally mounted shared drive. The issue is that the file names are wrongly encoded, even though #encoding returns #<Encoding:UTF-8>. Example:
"J\u00E9r\u00E9my".encoding # #<Encoding:UTF-8>
\u00E9 is the Unicode character for é, so I assume that the encoding is Unicode.
I've tried several encoding combinations from related questions (Convert a unicode string to characters in Ruby?, How to convert a string to UTF8 in Ruby), but none of them fit my needs.
I've also tried different "magic comments" (# encoding: <ENCODING>), without a satisfying result.
What's your methodology to identify and fix encoding issues?
"J\u00E9r\u00E9my".each_codepoint.to_a
# [74, 233, 114, 233, 109, 121]
and Encoding.default_external
Encoding.default_external
# #<Encoding:US-ASCII>
Which surprises me, as I have the magic comment # encoding: utf-8 at the top of my file.
Edit 2: explicitly setting the default_internal and default_external encodings to Encoding::UTF_8 fixes the problem:
# encoding: utf-8
Encoding.default_internal = Encoding::UTF_8
Encoding.default_external = Encoding::UTF_8
Though I'd like to go further and actually understand why this is required.
"J\u00E9r\u00E9my".encoding #=> #<Encoding:UTF-8>
"J\u00E9r\u00E9my".each_codepoint.to_a #=> [74, 233, 114, 233, 109, 121]
The strings are perfectly fine. They contain the correct bytes and have the correct encoding.
They are printed this way because your external encoding is set to (or recognised as) US-ASCII:
Encoding.default_external #=> #<Encoding:US-ASCII>
Ruby assumes that your terminal can only render ASCII characters and therefore prints non-ASCII characters using escape sequences (when using p / String#inspect).
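To make the effect visible in isolation, here is a minimal sketch that inspects the same string under two different external encodings. The variable names are only illustrative, and it assumes (as far as I know) that String#inspect consults default_internal first when it is set, which is why the sketch clears it for the duration of the demo.

```ruby
# encoding: utf-8
name = "J\u00E9r\u00E9my"

# Remember the current defaults so they can be restored afterwards.
saved_external = Encoding.default_external
saved_internal = Encoding.default_internal

# Assumption: inspect prefers default_internal when it is set, so clear it here.
Encoding.default_internal = nil

# With a US-ASCII external encoding, non-ASCII characters are escaped.
Encoding.default_external = Encoding::US_ASCII
escaped = name.inspect

# With a UTF-8 external encoding, the same string is shown as-is.
Encoding.default_external = Encoding::UTF_8
readable = name.inspect

# Put the original defaults back.
Encoding.default_external = saved_external
Encoding.default_internal = saved_internal

puts escaped
puts readable
```

The string itself never changes between the two calls; only its inspected representation does. Note that puts writes the string's bytes directly, so the escaping only shows up with p or inspect.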
The external encoding is usually determined automatically based on your locale:
$ LANG=C ruby -e 'p Encoding.default_external'
#<Encoding:US-ASCII>
$ LANG=en_US.UTF-8 ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>
Setting your terminal's or system's encoding / locale to UTF-8 should fix the problem.
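When diagnosing this kind of issue, it can help to dump everything Ruby derives from the environment in one place. The following is a rough sketch rather than a complete checklist; the report hash and its keys are made up for illustration. It also shows why the magic comment does not help here: to my understanding it only sets the source encoding (__ENCODING__), not Encoding.default_external.

```ruby
# encoding: utf-8

# Collect the locale-related environment variables and the encodings Ruby
# derived from them. The hash keys are arbitrary labels for this sketch.
report = {
  "LANG"       => ENV["LANG"],
  "LC_ALL"     => ENV["LC_ALL"],
  "LC_CTYPE"   => ENV["LC_CTYPE"],
  "source"     => __ENCODING__,               # set by the magic comment
  "locale"     => Encoding.find("locale"),    # derived from the locale
  "external"   => Encoding.default_external,  # used by p / inspect and IO
  "internal"   => Encoding.default_internal,  # nil unless set explicitly
  "filesystem" => Encoding.find("filesystem") # used for file names
}

report.each { |label, value| puts "#{label}: #{value.inspect}" }
```

If "locale" and "external" report US-ASCII while "source" reports UTF-8, the locale seen by the Ruby process is the likely culprit, not the script.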
For future visitors: note that String#codepoints is shorthand for str.each_codepoint.to_a. The result will be the same either way.
Indeed, it came from my terminal settings. Although WSL's terminal says it's using UTF-8, running the script from another terminal prints the accented characters properly. I'll investigate the WSL settings; thanks for pointing me in the right direction!