This article is reproduced from serverfault.com.

Effectively UTF-8 encode a string

Posted on 2020-12-01 14:24:41

I'm running a script on WSL Debian which fetches Windows files from a locally mounted share drive. The issue is that the file names are wrongly encoded, even though #encoding returns #<Encoding:UTF-8>. Example:

"J\u00E9r\u00E9my".encoding  # #<Encoding:UTF-8>

\u00E9 is the Unicode code point for é, so I assume that the encoding is Unicode.

I've tried several encoding combinations from related questions (Convert a unicode string to characters in Ruby?, How to convert a string to UTF8 in Ruby), but none of them fit my needs. I've also tried different "magic comments" (# encoding: <ENCODING>), without a satisfying result.

What's your methodology to identify and fix encoding issues?


Edit1: Stefan asked for codepoints:
"J\u00E9r\u00E9my".each_codepoint.to_a
# [74, 233, 114, 233, 109, 121]

and for Encoding.default_external:

Encoding.default_external
# #<Encoding:US-ASCII>

This surprises me, since I have the magic comment # encoding: utf-8 at the top of my file.


Edit2: explicitly setting the default_internal and default_external encodings to Encoding::UTF_8 fixes the problem:

# encoding: utf-8

Encoding.default_internal = Encoding::UTF_8
Encoding.default_external = Encoding::UTF_8

Still, I'd like to go further and actually understand why this is required.

Questioner: Sumak
Stefan 2020-12-01 23:09:31
"J\u00E9r\u00E9my".encoding
#=> #<Encoding:UTF-8>
"J\u00E9r\u00E9my".each_codepoint.to_a
#=> [74, 233, 114, 233, 109, 121]

The strings are perfectly fine. They contain the correct bytes and have the correct encoding.

They are printed this way because your external encoding is set to (or recognised as) US-ASCII:

Encoding.default_external
#=> #<Encoding:US-ASCII>

Ruby assumes that your terminal can only render ASCII characters and therefore prints non-ASCII characters as escape sequences when you use p / String#inspect.
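To see this for yourself, here is a minimal sketch that inspects the same string under two different external encodings. The helper name inspect_with_external is made up for illustration (it is not part of Ruby), and it assumes default_internal is unset, which is Ruby's default.

```ruby
# Hypothetical helper (not a Ruby built-in): returns str.inspect as it would
# look with the given default external encoding, then restores the settings.
def inspect_with_external(str, external)
  saved_external = Encoding.default_external
  saved_internal = Encoding.default_internal
  saved_verbose  = $VERBOSE
  $VERBOSE = nil                        # silence "setting Encoding.default_*" warnings
  Encoding.default_internal = nil       # assumption: no internal encoding is set
  Encoding.default_external = external
  str.inspect
ensure
  Encoding.default_external = saved_external
  Encoding.default_internal = saved_internal
  $VERBOSE = saved_verbose
end

name = "J\u00E9r\u00E9my"               # same bytes and same encoding in both calls

puts inspect_with_external(name, Encoding::US_ASCII)  # escaped: "J\u00E9r\u00E9my"
puts inspect_with_external(name, Encoding::UTF_8)     # readable: the é is shown as-is
```

The string itself never changes here; only its inspect representation does. Note that puts name writes the raw UTF-8 bytes either way, so it should display correctly on a UTF-8 terminal even when p does not.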

The external encoding is usually determined automatically based on your locale:

$ LANG=C            ruby -e 'p Encoding.default_external'
#<Encoding:US-ASCII>

$ LANG=en_US.UTF-8  ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>

Setting your terminal's or system's encoding / locale to UTF-8 should fix the problem.
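As for why the magic comment didn't help: it only sets the source encoding, i.e. the encoding of string literals in that file. It has no effect on Encoding.default_external. A rough way to check what Ruby picked up, before and after changing your locale, is to print the encodings side by side. The helper name encoding_report is made up for this example, and the comments describe the behaviour on Unix-like systems such as WSL Debian.

```ruby
# Hypothetical helper (not a Ruby built-in): collects the encodings involved.
def encoding_report
  {
    source:   __ENCODING__,               # set by the magic comment; applies to literals in this file
    locale:   Encoding.find("locale"),    # derived from LC_ALL / LC_CTYPE / LANG
    external: Encoding.default_external,  # follows the locale unless overridden (e.g. ruby -E)
    internal: Encoding.default_internal   # nil unless set explicitly
  }
end

encoding_report.each do |name, encoding|
  puts "#{name}: #{encoding.inspect}"
end
```

If locale and external both report US-ASCII, the locale is most likely unset or C in the shell that starts Ruby, which matches the LANG=C case above.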