How to get all distinct words of a specified minimum length from multiple columns in a MySQL table?

发布于 2020-03-27 10:26:41

In a MySQL 5.6 database I have table tablename which has (including others) three TEXT columns: col_a, col_b, col_c.

I want to extract all unique words (with words being separated by spaces) from these three columns that are at least 5 characters long. By "word" I mean any string of non-space characters, eg "foo-123" would be a word, as would "099423". The columns are all utf8 format InnoDB columns.

Is there a single query to do this?

EDIT: As requested, here's an example: (in the real data col_a, col_b and col_c are TEXT fields and could have a large number of words.)

select id, col_a, col_b, col_c from tablename;

id  | col_a              | col_b          | col_c
----|--------------------|----------------|----------------------
1   | apple orange plum  | red green blue | bill dave sue
2   | orange plum banana | yellow red     | frank james
3   | kiwi fruit apple   | green pink     | bill sarah-jane frank

expected_result: ["apple", "orange", "banana", "fruit", 
                  "green", "yellow", "frank", "james", "sarah-jane"]

I don't care about the order of results. thanks!

EDIT: in my example above, everything is in lowercase, as that's how I happen to store everything in my real-life table that this question relates to. But, for the sake of argument, if it did contain some capitalisation I would prefer the query to ignore capitalisation (this is the setting of my DB config as it happens).

EDIT2: in case it helps, all of the text columns have a FULLTEXT index on them.

EDIT3: here is the SQL to create the sample data:

DROP TABLE IF EXISTS `tablename`;
CREATE TABLE `tablename` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `col_a` text,
  `col_b` text,
  `col_c` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;
LOCK TABLES `tablename` WRITE;
INSERT INTO `tablename` VALUES (1,'apple orange plum','red green blue','bill dave sue'),(2,'orange plum banana','yellow red','frank james'),(3,'kiwi fruit apple','green pink','bill sarah-jane frank');
UNLOCK TABLES;

Rick James 2019-07-08 21:30

Shell script might be efficient...

SELECT CONCAT_WS(' ', col_a, col_b, col_c) INTO OUTFILE 'x' ... to get the columns into a file
tr ' ' "\n" <x -- split into one word per line
awk 'length($1) >= 5' -- minimum size of 5 characters per word
sort -u -- to dedup

There are no stopwords, but sed or awk could deal with that.

 mysql -e "SELECT ... INTO OUTFILE 'x' ..." ...
 tr ' ' "\n" <x  |  awk 'length($1) >= 5'  |  sort -u

Max Williams 2019-07-04 17:27:52

Thanks - this is a promising approach, but uniq doesn't seem to be uniqifying the lines. This is what I'm doing: dbname='my_database'; fname='/tmp/tablename_strings.txt'; sudo chown max:max $fname; rm $fname; mysql -u root $dbname -e "SELECT CONCAT_WS(' ', col_a, col_b, col_c) FROM tablename INTO OUTFILE '$fname'"; tr ' ' '\n' <"$fname" | awk 'length($1) >= 5' | uniq and I get this (separated by newlines): apple orange green orange banana yellow frank james fruit apple green sarah-jane frank

Rick James 2019-07-04 22:58:19

@MaxWilliams - Oops, I forgot the sort. (And there may be a de-dup option on sort, which would avoid the need for uniq.)

Bill Karwin 2019-07-07 21:28:01

You could use sort -u to combine the sort and uniq steps.

Max Williams 2019-07-08 21:21:55

I've given this this tick (and bounty) as it's the fastest of the many suggestions. Thanks!

Rick James 2019-07-08 21:31:35

Sometimes it helps to "think out of the box". (In this case the box is MySQL.)