In a MySQL 5.6 database I have table tablename
which has (including others) three TEXT
columns: col_a, col_b, col_c
.
I want to extract all unique words (with words being separated by spaces) from these three columns that are at least 5 characters long. By "word" I mean any string of non-space characters, eg "foo-123" would be a word, as would "099423". The columns are all utf8 format InnoDB columns.
Is there a single query to do this?
EDIT: As requested, here's an example: (in the real data col_a, col_b and col_c are TEXT fields and could have a large number of words.)
select id, col_a, col_b, col_c from tablename;
id | col_a | col_b | col_c
----|--------------------|----------------|----------------------
1 | apple orange plum | red green blue | bill dave sue
2 | orange plum banana | yellow red | frank james
3 | kiwi fruit apple | green pink | bill sarah-jane frank
expected_result: ["apple", "orange", "banana", "fruit",
"green", "yellow", "frank", "james", "sarah-jane"]
I don't care about the order of results. thanks!
EDIT: in my example above, everything is in lowercase, as that's how I happen to store everything in my real-life table that this question relates to. But, for the sake of argument, if it did contain some capitalisation I would prefer the query to ignore capitalisation (this is the setting of my DB config as it happens).
EDIT2: in case it helps, all of the text columns have a FULLTEXT index on them.
EDIT3: here is the SQL to create the sample data:
DROP TABLE IF EXISTS `tablename`;
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`col_a` text,
`col_b` text,
`col_c` text,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;
LOCK TABLES `tablename` WRITE;
INSERT INTO `tablename` VALUES (1,'apple orange plum','red green blue','bill dave sue'),(2,'orange plum banana','yellow red','frank james'),(3,'kiwi fruit apple','green pink','bill sarah-jane frank');
UNLOCK TABLES;
Shell script might be efficient...
SELECT CONCAT_WS(' ', col_a, col_b, col_c) INTO OUTFILE 'x' ...
to get the columns into a filetr ' ' "\n" <x
-- split into one word per lineawk 'length($1) >= 5'
-- minimum size of 5 characters per wordsort -u
-- to dedupThere are no stopwords, but sed or awk could deal with that.
mysql -e "SELECT ... INTO OUTFILE 'x' ..." ...
tr ' ' "\n" <x | awk 'length($1) >= 5' | sort -u
Thanks - this is a promising approach, but
uniq
doesn't seem to be uniqifying the lines. This is what I'm doing:dbname='my_database'; fname='/tmp/tablename_strings.txt'; sudo chown max:max $fname; rm $fname; mysql -u root $dbname -e "SELECT CONCAT_WS(' ', col_a, col_b, col_c) FROM tablename INTO OUTFILE '$fname'"; tr ' ' '\n' <"$fname" | awk 'length($1) >= 5' | uniq
and I get this (separated by newlines):apple orange green orange banana yellow frank james fruit apple green sarah-jane frank
@MaxWilliams - Oops, I forgot the
sort
. (And there may be a de-dup option onsort
, which would avoid the need foruniq
.)You could use
sort -u
to combine the sort and uniq steps.I've given this this tick (and bounty) as it's the fastest of the many suggestions. Thanks!
Sometimes it helps to "think out of the box". (In this case the box is MySQL.)