Note: This article is reproduced from serverfault.com.

Find multiple occurrences of the same string

Posted on 2020-11-29 21:13:35

Across a range of files, I want to find every line in which the same word occurs at least 4 times. The word can be any word.

So input:

a a a b b e e e
o o o o p p p y y y
w r r r u u i i o o r
x x o o i i p p z z y y

Output:

o o o o p p p y y y
w r r r u u i i o o r

So far, I have only split the text into separate sentences, one per line, so that they are ready to be processed.

cat * |
    tr '\n' ' '|
    sed 's/[.!?;"]/ & /g' |
    sed 's/[.!?]/&\n/g'|
    grep -E -w '\b([[:alnum:]]*)\{4*\}\b'

But my grep doesn't match anything. How do I get grep to print only the sentences that contain a word occurring at least 4 times?

Questioner: Hooiberg12
Answer by Wiktor Stribiżew, 2020-11-30 06:17:22

With GNU grep, you can use a PCRE regex like

grep -P '\b(\w+)\b(.*\b\1\b){3}'


Tested on Ubuntu 18.04.4 LTS.
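To reproduce the check, here is a minimal sketch that runs the command against the sample input from the question. The file name sample.txt is hypothetical, and the sketch assumes a GNU grep built with PCRE support (the -P option).

```shell
# sample.txt is a hypothetical file name holding the question's sample input
printf '%s\n' \
    'a a a b b e e e' \
    'o o o o p p p y y y' \
    'w r r r u u i i o o r' \
    'x x o o i i p p z z y y' > sample.txt

# Assumes GNU grep with PCRE support (-P)
grep -P '\b(\w+)\b(.*\b\1\b){3}' sample.txt
```

With the input above, this should print only the two lines in which a word occurs four times (o in the first, r in the second):

o o o o p p p y y y
w r r r u u i i o o r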

Details

  • \b(\w+)\b - a whole word, captured in Group 1 (\b is a word boundary and \w matches letters, digits or underscores)
  • (.*\b\1\b){3} - three repetitions ({3}) of any text followed by the same value as in Group 1 (\1 is an inline backreference to the Group 1 value), matched as a whole word (again thanks to the \b word boundaries). Together with the first occurrence, that makes at least four occurrences of the word on the line.
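If grep -P is not available, the same idea (some word occurring at least four times on a line) can be sketched with awk by counting words per line. This is an alternative technique, not part of the answer above: the function name repeated_words is hypothetical, and unlike \w+ with \b it treats any whitespace-separated token as a word, so punctuation attached to a word counts as part of it.

```shell
# Hypothetical helper: print each line in which some whitespace-separated
# token occurs at least 4 times. Reads the given files, or stdin if none.
repeated_words() {
    awk '{
        split("", count)                 # clear the per-line counters
        for (i = 1; i <= NF; i++) {
            if (++count[$i] >= 4) {      # fourth occurrence of this token
                print                    # print the line once
                break                    # and move on to the next line
            }
        }
    }' "$@"
}
```

It could be called as repeated_words file1 file2, or at the end of a pipeline that has already put one sentence per line.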