Ruby: Testing a Ruby string for a substring fails (substring is not recognized)

Patrick Taylor 2020-01-31 20:40

JavaScript's charCodeAt method tells me that the two characters have different Unicode values, and Ruby's .ord method tells me the same thing. You could check against those code points literally in Ruby, but I'd recommend finding a way to normalize the data instead of adding endless conditionals for unusual characters. According to a Unicode lookup table I found online, the odd one out is U+0435 (decimal 1077), CYRILLIC SMALL LETTER IE (е).
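
As a rough sketch of what "normalizing" could look like: Unicode normalization forms such as NFKC do not map Cyrillic letters onto Latin ones, so this assumes a small hand-written lookalike table instead. The constant names, the helper fold_lookalikes, and the choice of seven letters are all illustrative assumptions, not a complete homoglyph list.

```ruby
# Hypothetical lookalike table: Cyrillic small letters a, ie, o, er, es, u, ha.
# Written as escapes so the two alphabets cannot be confused in this file.
CYRILLIC_LOOKALIKES = "\u0430\u0435\u043E\u0440\u0441\u0443\u0445".freeze
# The Latin letters they resemble, in the same order.
LATIN_EQUIVALENTS = "aeopcyx".freeze

# Returns a copy of text with each listed Cyrillic lookalike replaced
# by the Latin letter at the same position in LATIN_EQUIVALENTS.
def fold_lookalikes(text)
  text.tr(CYRILLIC_LOOKALIKES, LATIN_EQUIVALENTS)
end

spoofed = "fr\u0435\u0435 offer" # both "e"s are U+0435, not U+0065
p spoofed.include?("free")                  # false
p fold_lookalikes(spoofed).include?("free") # true
```

Folding before the substring test keeps the comparison itself simple; the table is the only place that needs to grow when a new lookalike shows up.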

Alternatively, here's one approach where you could just ban all Cyrillic characters. I listed the full range of excluded code points so you can add further exclusions as needed.

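#!/usr/bin/env ruby

# Code points of the Cyrillic block, U+0400..U+04FF (decimal 1024..1279).
# Kept as an array so further code points can be added to the exclusion list.
CYRILLIC_UNICODE_DECIMALS = (1024..1279).to_a.freeze

ARGV.each do |arg|
  # Print every character of the argument that is on the exclusion list.
  arg.each_char do |char|
    p char if CYRILLIC_UNICODE_DECIMALS.include?(char.ord)
  end
end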

For reference, these are the .ord and .charCodeAt methods I used against your example. I started with JavaScript because it's a simple test in the browser console.

2.6.3 :005 > 'е'.ord
 => 1077
2.6.3 :006 > 'e'.ord
 => 101

'"е" == "e"'.charCodeAt(1)
1077
'"e" == "e"'.charCodeAt(1)
101
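
To run the same kind of check over a whole string rather than one character at a time, something like the following could work. It is only a sketch: the helper name non_ascii_chars is made up for illustration, and it assumes that anything outside ASCII is worth a look.

```ruby
# Lists every non-ASCII character in text as [index, character, "U+XXXX"].
def non_ascii_chars(text)
  text.each_char.with_index
      .reject { |char, _index| char.ascii_only? }
      .map { |char, index| [index, char, format("U+%04X", char.ord)] }
end

# The second character here is U+0435, written as an escape to make that explicit.
p non_ascii_chars("t\u0435st")
```

The reported code point can then be looked up in a Unicode table, as was done by hand above.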

TomDogg 2020-02-02 03:30:03

The easiest approach for this issue is to scan the string/text in question with the "unicode-scripts" gem (github.com/janlelis/unicode-scripts). Normal text should then return an array containing at most the two elements ["Common", "Latin"]. If it contains any other element, as in ["Common", "Cyrillic", "Latin"], there's a high chance that the string/text is "obfuscated" spam.
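
If adding a gem is not an option, a similar idea can be sketched with the Unicode script properties built into Ruby's regular expressions (\p{Latin}, \p{Cyrillic}, and so on). This is not the gem's API: the constant SCRIPTS_TO_CHECK and the helpers scripts_in and mixed_script? are hypothetical names, and only the scripts listed in the hash are detected.

```ruby
# Script properties to probe for; extend this hypothetical list as needed.
SCRIPTS_TO_CHECK = {
  "Latin"    => /\p{Latin}/,
  "Cyrillic" => /\p{Cyrillic}/,
  "Greek"    => /\p{Greek}/
}.freeze

# Returns the names of the listed scripts that occur at least once in text.
def scripts_in(text)
  SCRIPTS_TO_CHECK.select { |_name, pattern| text.match?(pattern) }.keys
end

# Flags text that mixes Latin with any other listed script.
def mixed_script?(text)
  found = scripts_in(text)
  found.include?("Latin") && found.size > 1
end

p scripts_in("t\u0435st")    # ["Latin", "Cyrillic"]
p mixed_script?("t\u0435st") # true
p mixed_script?("test")      # false
```

Unlike the gem's output, this sketch does not report "Common" characters (digits, punctuation, spaces); it only answers whether Latin is mixed with one of the other listed scripts.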
