This article is reproduced from serverfault.com.

encoding ruby utf-8

Effectively UTF-8 encode a string

Posted on 2020-12-01 14:24:41

I'm running a script on WSL Debian which fetches Windows files from a locally mounted share drive. The issue is that the file names appear wrongly encoded, even though #encoding returns #<Encoding:UTF-8>. Example:

"J\u00E9r\u00E9my".encoding  # #<Encoding:UTF-8>

\u00E9 is the Unicode escape sequence for é, so I assume that the encoding is Unicode.

I've tried several encoding combinations from related questions (Convert a unicode string to characters in Ruby?, How to convert a string to UTF8 in Ruby), but none of them fit my needs. I've also tried different "magic comments" (# encoding: <ENCODING>), without a satisfying result.

What's your methodology to identify and fix encoding issues?

Edit1: Stefan asked for codepoints:

"J\u00E9r\u00E9my".each_codepoint.to_a
# [74, 233, 114, 233, 109, 121]

and Encoding.default_external

Encoding.default_external
# #<Encoding:US-ASCII>

This surprises me, as I have the magic comment # encoding: utf-8 at the top of my file.

Edit2: explicitly setting the default_internal and default_external encodings to Encoding::UTF_8 fixes the problem:

# encoding: utf-8

Encoding.default_internal = Encoding::UTF_8
Encoding.default_external = Encoding::UTF_8

Still, I'd like to go further and actually understand why this is required.

Questioner: Sumak

Stefan 2020-12-01 23:09:31

"J\u00E9r\u00E9my".encoding
#=> #<Encoding:UTF-8>
"J\u00E9r\u00E9my".each_codepoint.to_a
#=> [74, 233, 114, 233, 109, 121]

The strings are perfectly fine. They contain the correct bytes and have the correct encoding.
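One way to check this yourself is to look at the raw bytes and ask Ruby whether they are valid for the declared encoding. The following is a minimal sketch; byte_summary is a hypothetical helper name, not part of Ruby.

```ruby
# Hypothetical helper: summarises how a string is stored in memory.
def byte_summary(str)
  {
    encoding: str.encoding,        # the encoding Ruby has tagged the string with
    valid: str.valid_encoding?,    # true if the bytes are well-formed for it
    chars: str.length,             # number of characters
    bytes: str.bytes               # raw byte values
  }
end

summary = byte_summary("J\u00E9r\u00E9my")

# é (U+00E9) is stored as the two UTF-8 bytes 0xC3 0xA9 (195, 169),
# so the 6-character string occupies 8 bytes.
p summary[:bytes]
p summary[:valid]
```

If valid is true and the byte values match the expected UTF-8 sequence, the data itself is intact and only its display is affected.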

They are printed this way because your external encoding is set to (or recognised as) US-ASCII:

Encoding.default_external
#=> #<Encoding:US-ASCII>

Ruby assumes that your terminal can only render ASCII characters and therefore, when using p / String#inspect, prints non-ASCII characters as escape sequences.
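The effect can be reproduced in a single process by switching the default external encoding around a call to inspect. This is only a sketch: inspect_under is a hypothetical helper, and it temporarily clears Encoding.default_internal because inspect prefers that setting when it is present.

```ruby
# Hypothetical helper: returns str.inspect as it would look under the given
# default external encoding, then restores the previous settings.
def inspect_under(str, external)
  saved_external = Encoding.default_external
  saved_internal = Encoding.default_internal
  Encoding.default_internal = nil       # inspect prefers default_internal when set
  Encoding.default_external = external
  str.inspect
ensure
  Encoding.default_external = saved_external
  Encoding.default_internal = saved_internal
end

name = "J\u00E9r\u00E9my"

puts inspect_under(name, Encoding::US_ASCII)  # "J\u00E9r\u00E9my"
puts inspect_under(name, Encoding::UTF_8)     # "Jérémy"
```

The string is the same object in both calls; only the representation chosen by inspect changes.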

The external encoding is usually determined automatically based on your locale:

$ LANG=C            ruby -e 'p Encoding.default_external'
#<Encoding:US-ASCII>

$ LANG=en_US.UTF-8  ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>

Setting your terminal's or system's encoding / locale to UTF-8 should fix the problem.
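To see which setting is off, it can help to print the relevant values side by side. The sketch below uses a hypothetical helper, encoding_report. Note that the # encoding: magic comment only sets the source encoding of that file (reported by __ENCODING__); it does not change Encoding.default_external, which is why the comment alone has no effect here.

```ruby
# Hypothetical helper: collects the settings that influence how Ruby
# reads and prints text, so they can be compared side by side.
def encoding_report
  {
    locale_env: ENV.values_at("LC_ALL", "LC_CTYPE", "LANG"), # locale variables, nil if unset
    locale_charmap: Encoding.locale_charmap,                 # charmap name derived from the locale
    default_external: Encoding.default_external,             # used for I/O and by inspect
    default_internal: Encoding.default_internal,             # nil unless set explicitly
    source_encoding: __ENCODING__                            # set by the magic comment
  }
end

encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

If the locale variables are all unset or set to C / POSIX, default_external will typically fall back to US-ASCII even when the terminal itself can display UTF-8.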

Todd A. Jacobs 2020-12-01 14:59:37

For future visitors: note that String#codepoints is shorthand for str.each_codepoint.to_a. The result will be the same either way.

Sumak 2020-12-01 15:06:21

Indeed, it came from my terminal settings. Although WSL's terminal says it's using UTF-8, running the script from another terminal prints the accented characters properly. I'll investigate the WSL settings. Thanks for pointing me in the right direction!
