Note: This article is reproduced from stackoverflow.com.
Tags: ruby, string, character

Ruby: Testing a ruby string for a substring fails (substring is not recognized)

Published on 2020-03-29 21:03:36

Using Ruby, I am trying to weed out spam messages manually. Why does the test below return false when it should return true? The tested string is the original one, so you can copy and paste it into your Ruby console to verify this example:

irb(main):053:0> "Веautiful women fоr sеx in yоur town АU: https://links.wtf/qLFs".include? "sex"
=> false

Hint: If you retype the word "sex" inside the string yourself, the test returns true as expected. So the two "sex" strings are somehow not the same, but at what level? How can I test this correctly?

EDIT:

I have narrowed it all down to this (copy/paste it to test it!):

irb(main):073:0> "е" == "e"
=> false
Questioner: TomDogg
Viewed: 122
Patrick Taylor 2020-01-31 20:40

JavaScript's charCodeAt method tells me that the two characters have different Unicode code points, and Ruby's .ord method tells me the same thing. You could check against those code points literally in Ruby, but I'd recommend finding a way to normalize the data instead of adding endless conditionals for unusual characters. According to a Unicode lookup table I found online, the first character is U+0435 (decimal 1077), CYRILLIC SMALL LETTER IE (е), while the second is the ordinary Latin "e" (decimal 101).
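
To illustrate what "normalizing" could look like here, below is a minimal sketch, not a complete solution. It assumes a small hand-written table of Cyrillic letters that look like Latin ones; the names HOMOGLYPHS, HOMOGLYPH_PATTERN and fold_homoglyphs are made up for this example, and the table covers only a handful of characters. Note that Ruby's built-in String#unicode_normalize does not help with this case, because the standard normalization forms do not turn Cyrillic letters into Latin ones.

```ruby
# Hypothetical, deliberately incomplete table of Cyrillic look-alikes.
# Escapes are used so the two alphabets cannot be confused in the source.
HOMOGLYPHS = {
  "\u0410" => "A", # CYRILLIC CAPITAL LETTER A
  "\u0412" => "B", # CYRILLIC CAPITAL LETTER VE
  "\u0415" => "E", # CYRILLIC CAPITAL LETTER IE
  "\u0430" => "a", # CYRILLIC SMALL LETTER A
  "\u0435" => "e", # CYRILLIC SMALL LETTER IE
  "\u043E" => "o", # CYRILLIC SMALL LETTER O
  "\u0440" => "p", # CYRILLIC SMALL LETTER ER
  "\u0441" => "c", # CYRILLIC SMALL LETTER ES
  "\u0445" => "x"  # CYRILLIC SMALL LETTER HA
}.freeze

# One regex that matches any character listed in the table.
HOMOGLYPH_PATTERN = Regexp.union(HOMOGLYPHS.keys).freeze

# Replace every mapped look-alike with its Latin counterpart.
def fold_homoglyphs(text)
  text.gsub(HOMOGLYPH_PATTERN, HOMOGLYPHS)
end

spam = "s\u0435x in y\u043Eur town"
puts spam.include?("sex")                  # false
puts fold_homoglyphs(spam).include?("sex") # true
```

Folding before matching keeps the keyword list in plain ASCII, at the cost of maintaining the table yourself.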

Alternatively, here's one approach that simply flags every Cyrillic character. I used a full range of code points so you can extend the exclusions as needed.

#!/usr/bin/env ruby

# Cyrillic block, U+0400..U+04FF (decimal 1024..1279).
CYRILLIC_UNICODE_DECIMALS = (1024..1279).freeze

# Print every Cyrillic character found in the command-line arguments.
ARGV.each do |arg|
  arg.each_char do |char|
    p char if CYRILLIC_UNICODE_DECIMALS.cover?(char.ord)
  end
end
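
As a variation on the numeric range, Ruby's regex engine also understands Unicode script properties such as \p{Cyrillic} and \p{Latin}. The sketch below uses them to flag words that mix both scripts, which is typical of this kind of spam and avoids rejecting legitimate all-Cyrillic text. The helper names mixed_script? and suspicious_words are invented for this example.

```ruby
# Hypothetical helper: true if one word contains both Latin and Cyrillic letters.
def mixed_script?(word)
  word.match?(/\p{Latin}/) && word.match?(/\p{Cyrillic}/)
end

# Hypothetical helper: all whitespace-separated words that mix the two scripts.
def suspicious_words(text)
  text.split.select { |word| mixed_script?(word) }
end

# \u0435 is CYRILLIC SMALL LETTER IE, \u043E is CYRILLIC SMALL LETTER O.
text = "B\u0435autiful women f\u043Er s\u0435x"
p suspicious_words(text)
```

With this sample string, three of the four words are flagged; "women" is pure Latin and passes.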

For reference, these are the .ord and .charCodeAt methods I used against your example. I started with JavaScript because it's a simple test in the browser console.

2.6.3 :005 > 'е'.ord
 => 1077
2.6.3 :006 > 'e'.ord
 => 101
'"е" == "e"'.charCodeAt(1)
1077
'"e" == "e"'.charCodeAt(1)
101
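
To run the same check over a whole string rather than one character at a time, something like the following sketch could be used. The helper name non_ascii_chars is made up for this example; it only reports characters outside ASCII, together with their position and code point.

```ruby
# Hypothetical helper: list [index, character, code point] for every non-ASCII character.
def non_ascii_chars(text)
  text.each_char.with_index
      .reject { |char, _index| char.ascii_only? }
      .map { |char, index| [index, char, format("U+%04X", char.ord)] }
end

# \u0435 is CYRILLIC SMALL LETTER IE, the look-alike from the question.
non_ascii_chars("s\u0435x in town").each do |index, char, code_point|
  puts "#{index}: #{char} #{code_point}"
end
```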