Using Ruby, I am trying to weed out spam messages manually. Why does the test below return false when it should return true? The tested string is the original one, so you can copy/paste the whole thing into your Ruby console to verify this example:
irb(main):053:0> "Веautiful women fоr sеx in yоur town АU: https://links.wtf/qLFs".include? "sex"
=> false
Hint: If you replace the word "sex" in the string by typing it in yourself, the test returns true as expected. So, somehow, the two "sex" strings are not the same, but on what level? How do I test that correctly?
EDIT:
I have narrowed it all down to this (copy/paste it to test it!):
irb(main):073:0> "е" == "e"
=> false
JavaScript's charCodeAt method tells me that the two characters have different Unicode values, and Ruby's .ord method tells me the same thing. You could check against those Unicode values literally in Ruby, but I'd recommend finding a way to normalize the data instead of adding endless conditionals for unusual characters. According to a Unicode lookup table I found online, the first character is U+0435 (decimal 1077), CYRILLIC SMALL LETTER IE (е), not the Latin "e" (U+0065, decimal 101).
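One way to read "normalize the data" is to fold known lookalike letters back to Latin before matching. The following is only a sketch under that assumption: the HOMOGLYPHS table and the fold_homoglyphs helper are hypothetical names, and the table is deliberately incomplete. Note that Ruby's built-in String#unicode_normalize does not help here, because Cyrillic and Latin letters are distinct characters rather than different encodings of the same one.

```ruby
# Hypothetical lookup table: a few Cyrillic letters that look like Latin ones.
# It is deliberately incomplete; extend it as new lookalikes show up.
HOMOGLYPHS = {
  "\u0410" => "A", # CYRILLIC CAPITAL LETTER A
  "\u0412" => "B", # CYRILLIC CAPITAL LETTER VE
  "\u0430" => "a", # CYRILLIC SMALL LETTER A
  "\u0435" => "e", # CYRILLIC SMALL LETTER IE
  "\u043E" => "o", # CYRILLIC SMALL LETTER O
  "\u0440" => "p", # CYRILLIC SMALL LETTER ER
  "\u0441" => "c", # CYRILLIC SMALL LETTER ES
  "\u0445" => "x"  # CYRILLIC SMALL LETTER HA
}.freeze

# Replace every mapped Cyrillic letter with its Latin lookalike;
# unmapped characters pass through unchanged.
def fold_homoglyphs(text)
  text.gsub(/\p{Cyrillic}/) { |char| HOMOGLYPHS.fetch(char, char) }
end

spam = "f\u043Er s\u0435x in y\u043Eur town"
puts spam.include?("sex")                  # false
puts fold_homoglyphs(spam).include?("sex") # true
```

Folding makes keyword tests work again, but it also rewrites genuine Cyrillic text, so it probably only suits input that is expected to be Latin-script.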
Alternatively, here's one approach where you could just ban all Cyrillic characters. I used a full range of excluded characters so you could add exclusions as needed.
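#!/usr/bin/env ruby
# Code points of the Unicode Cyrillic block, U+0400..U+04FF (decimal 1024..1279).
CYRILLIC_UNICODE_DECIMALS = (1024..1279).to_a.freeze
# Print every Cyrillic character found in the command-line arguments.
ARGV.each do |arg|
  arg.each_char do |char|
    p char if CYRILLIC_UNICODE_DECIMALS.include?(char.ord)
  end
end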
For reference, these are the .ord and .charCodeAt calls I used against your example. I started with JavaScript because it's a simple test in the browser console.
2.6.3 :005 > 'е'.ord
=> 1077
2.6.3 :006 > 'e'.ord
=> 101
'"е" == "e"'.charCodeAt(1)
1077
'"e" == "e"'.charCodeAt(1)
101
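The same .ord check can be run over a whole message to see where the impostor characters sit. This is a minimal sketch; non_ascii_report is a hypothetical helper name, and it assumes that anything outside ASCII is worth a look, which will also flag legitimate accented letters.

```ruby
# Hypothetical helper: describe every non-ASCII character in a string by its
# index and code point (hex and decimal), so lookalike letters become visible.
def non_ascii_report(text)
  suspects = text.each_char.with_index.reject { |char, _index| char.ascii_only? }
  suspects.map { |char, index| format("index %d: U+%04X (%d)", index, char.ord, char.ord) }
end

puts non_ascii_report("s\u0435x")
# index 1: U+0435 (1077)
```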
The easiest approach for this issue is to scan the string/text in question with the "unicode-scripts" gem at github.com/janlelis/unicode-scripts. Normal (Latin-script) text should then return an array containing at most the following two elements: ["Common", "Latin"]. If it contains any other elements, such as in ["Common", "Cyrillic", "Latin"], there's a high chance of the string/text being "obfuscated" spam.
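If installing a gem is not an option, the same idea can be roughly approximated with the script properties that Ruby's regular expressions understand. The sketch below is not the gem's API: SCRIPT_PATTERNS, scripts_in, and suspicious? are hypothetical names, and only the four listed scripts are detected, so characters from any other script go unnoticed.

```ruby
# Hypothetical stand-in for a script lookup, limited to a handful of scripts
# that Ruby's regex engine knows by name. This is not the gem's API.
SCRIPT_PATTERNS = {
  "Common"   => /\p{Common}/,   # spaces, digits, punctuation
  "Cyrillic" => /\p{Cyrillic}/,
  "Greek"    => /\p{Greek}/,
  "Latin"    => /\p{Latin}/
}.freeze

# Names of the listed scripts that occur at least once in the text.
def scripts_in(text)
  SCRIPT_PATTERNS.select { |_name, pattern| text.match?(pattern) }.keys
end

# True when the text uses any listed script outside the allowed set.
def suspicious?(text, allowed: %w[Common Latin])
  (scripts_in(text) - allowed).any?
end

p scripts_in("s\u0435x in town")  # ["Common", "Cyrillic", "Latin"]
p suspicious?("s\u0435x in town") # true
p suspicious?("plain text")       # false
```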