Reproduced from serverfault.com.

Unicode characters cannot be decoded

Posted on 2020-11-29 11:48:49

I use browserless.js (headless Chrome) to fetch the HTML of a website, and then use a regular expression to find certain image URLs.

One example is the following:

https://vignette.wikia.nocookie.net/moviepedia/images/8/88/Adrien_Brody.jpg/revision/latest/top-crop/width/360/height/450?cb\u003d20141113231800\u0026path-prefix\u003dde

The URL contains Unicode escape sequences such as \u003d, which should be decoded (in this case to =). I want to include these images in a site, and without decoding, some of them cannot be displayed (like the one above: paste the URL into a browser and you get broken-image.webp).

I have tried several things, but none of them works:

  • JSON.parse(JSON.stringify(...))
  • String.prototype.normalize()
  • decodeURIComponent

Curiously, the regular expression for "\u003d" (i.e. "\\u003d" in JS) does not match the string above, but "u003d" does.
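One possible explanation, assuming the pattern was passed as a string (for example to new RegExp or String.prototype.match): the string "\\u003d" becomes the pattern source \u003d, which the regex engine itself reads as an escape for the = character. The sketch below uses a shortened sample value and illustrative variable names.

```javascript
// Assumption: the scraped text contains a literal backslash plus "u003d".
const text = "cb\\u003d20141113231800";

// Pattern source is \u003d, which the regex engine interprets as "=".
const fromString = new RegExp("\\u003d");

// In a regex literal, \\ matches one literal backslash, followed by "u003d".
const literal = /\\u003d/;

console.log(fromString.test(text));                // false
console.log(fromString.test("cb=20141113231800")); // true
console.log(literal.test(text));                   // true
console.log(/u003d/.test(text));                   // true
```

When building the pattern from a string, four backslashes ("\\\\u003d") would be needed to match the literal backslash.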

This is all very weird. My current guess is that browserless is responsible for some formatting behind the scenes: when I log the URL to the console and copy and paste it somewhere else, every method mentioned above works for decoding.

I hope that someone can help me with this.

Questioner: Martin Brandenburg
community wiki 2021-03-11 15:59:40

Just to mark this one as answered. Thomas replied:

JSON.parse(`"${url}"`)
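A slightly fuller sketch of that reply, assuming the scraped URL contains literal backslash-u sequences. Wrapping the text in double quotes turns it into a JSON string literal, so JSON.parse resolves the escapes. The function decodeUnicodeEscapes is a hypothetical alternative, not part of the original answer, for input that might also contain double quotes or other characters that would make JSON.parse throw.

```javascript
// The URL from the question, with literal backslashes (assumed form of the scraped text).
const url =
  "https://vignette.wikia.nocookie.net/moviepedia/images/8/88/Adrien_Brody.jpg" +
  "/revision/latest/top-crop/width/360/height/450" +
  "?cb\\u003d20141113231800\\u0026path-prefix\\u003dde";

// Approach from the answer: parse the text as a JSON string literal.
const decoded = JSON.parse(`"${url}"`);

// Hypothetical helper: replace only \uXXXX sequences, leaving everything else as is.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

console.log(decoded);
console.log(decodeUnicodeEscapes(url) === decoded); // true
```

Here decoded ends in ?cb=20141113231800&path-prefix=de. The JSON.parse approach is the shortest, while the helper avoids a SyntaxError when the input is not a valid JSON string body.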