Note: This article is reproduced from serverfault.com.

Uniquely encode any ASCII string into a string that uses a subset of ASCII

Posted on 2020-12-07 23:47:21

For this question, please assume Python, though the language doesn't really matter.

Imagine you have an arbitrary ASCII string, for example:

jrioj4oi3m_=\.,ei9#

Sparing the extensive details, I need to pass this string as a "label" on to another program, but that program doesn't support "labels" containing "special characters" or even numbers. So I'm trying to encode an ASCII string into a string that uses an arbitrary subset of ASCII.

One very naive solution would be to convert the original string into binary, then convert 0s into "a" and 1s into "b". This works to solve my problem, but I would like to learn a better solution here, to become a better programmer.
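For reference, the naive approach described above could be sketched roughly as follows. The function names `naive_encode` and `naive_decode` are made up for illustration, and the sketch assumes the input is pure ASCII. Every input character becomes eight output characters, which is the length any better solution should beat.

```python
def naive_encode(text: str) -> str:
    """Write each ASCII character as 8 bits, then map 0 -> 'a' and 1 -> 'b'."""
    bits = ''.join(format(byte, '08b') for byte in text.encode('ascii'))
    return bits.translate(str.maketrans('01', 'ab'))


def naive_decode(label: str) -> str:
    """Reverse naive_encode: map 'a'/'b' back to bits, 8 bits per character."""
    bits = label.translate(str.maketrans('ab', '01'))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode('ascii')
```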

First of all, what exactly is this problem called?

This is not exactly a hashing problem, because IIRC hashing generally involves encoding into a string that is shorter than the original, and involves collisions.

I need no collisions, and I don't really care how long the encoded string is, as long as it's shorter than the naive case. (Ideally it would be the shortest length possible given the subset)

In fact, it would be ideal to specify exactly what the allowed character set is, then use a generalized encoding algorithm to do the encoding.

Decoding would be nice to know also.

Questioner: cat pants
ekhumoro 2020-12-08 08:59:03

A simple solution would be to first convert to a hex encoding:

  • jrioj4oi3m_=.,ei9# => 6a72696f6a346f69336d5f3d2e2c65693923

and then translate the digits 0-9 into non-hex letters:

  • 6a72696f6a346f69336d5f3d2e2c65693923 => waxswzwfwatuwfwzttwdvftdsescwvwztzst

So the output string is always exactly twice the length of the input string and only ever contains lowercase letters (a-f from the hex digits, plus q-z standing in for 0-9).

This can be easily achieved in Python like this:

>>> enc = str.maketrans('0123456789', 'qrstuvwxyz')
>>> dec = str.maketrans('qrstuvwxyz', '0123456789')
>>> s = 'jrioj4oi3m_=.,ei9#'
>>> x = s.encode('ascii').hex().translate(enc)
>>> x
'waxswzwfwatuwfwzttwdvftdsescwvwztzst'
>>> bytes.fromhex(x.translate(dec)).decode('ascii')
'jrioj4oi3m_=.,ei9#'
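The hex approach fixes the output to 16 distinct symbols. If the allowed character set should be configurable, one possible generalization (a sketch, not part of the session above) is to treat the input bytes as one large integer and rewrite it in base `len(alphabet)`. The names `encode_label` and `decode_label` and the `0x01` sentinel byte are assumptions made for this example; the sentinel is there so that leading zero bytes and the empty string survive the round trip.

```python
import string


def _check(alphabet: str) -> None:
    # The mapping is only reversible if every symbol is distinct.
    if len(alphabet) < 2 or len(set(alphabet)) != len(alphabet):
        raise ValueError('alphabet needs at least 2 distinct characters')


def encode_label(text: str, alphabet: str = string.ascii_lowercase) -> str:
    """Encode an ASCII string using only characters from `alphabet`."""
    _check(alphabet)
    base = len(alphabet)
    # Sentinel byte 0x01 keeps leading zero bytes from being lost.
    n = int.from_bytes(b'\x01' + text.encode('ascii'), 'big')
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
    return ''.join(reversed(digits))


def decode_label(label: str, alphabet: str = string.ascii_lowercase) -> str:
    """Reverse encode_label; `alphabet` must match the one used to encode."""
    _check(alphabet)
    base = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = 0
    for ch in label:
        n = n * base + index[ch]
    data = n.to_bytes((n.bit_length() + 7) // 8, 'big')
    return data[1:].decode('ascii')  # drop the sentinel byte
```

With the default 26-letter alphabet, the example string should come out a few characters shorter than the 36-character hex result, and a larger alphabet (for instance upper and lower case letters together) shortens it further. The cost is big-integer arithmetic on the whole string, which is fine for short labels but grows quadratically with input length.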