Across a set of files, I want to find every line in which the same word occurs at least 4 times. This can be any word.
So input:
a a a b b e e e o o o o p p p y y y w r r r u u i i o o r x x o o i i p p z z y y
Output:
o o o o p p p y y y w r r r u u i i o o r
What I have tried so far is to split the text into separate sentences, one per line, ready to be processed.
cat * |
tr '\n' ' '|
sed 's/[.!?;"]/ & /g' |
sed 's/[.!?]/&\n/g'|
grep -E -w '\b([[:alnum:]]*)\{4*\}\b'
But my grep doesn't match anything. How do I get grep to print only the sentences that contain a word occurring at least 4 times?
With GNU grep, you can use a PCRE regex like
grep -P '\b(\w+)\b(.*\b\1\b){3}'
See the regex demo.
Tested on Ubuntu 18.04.4 LTS.
Details
\b(\w+)\b - a whole word (captured in Group 1) (\b is a word boundary and \w matches letters, digits or underscores)
(.*\b\1\b){3} - three occurrences ({3}) of any text followed by the same value as in Group 1 (as \1 is an inline backreference to the Group 1 value) as a whole word (again, \b word boundaries are used).
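The pattern described above can be tried on a few lines of input. This is only a minimal sketch: the sample text and the variable names sample and matches are made up for illustration, and it assumes a GNU grep built with PCRE support (-P).

```shell
# Hypothetical sample input, one "sentence" per line.
sample='a b a c a d a
one two three four
x y x y x y
the cat and the dog and the bird and the fish'

# Keep only the lines in which some whole word occurs at least 4 times:
# one captured word, then 3 more whole-word repetitions of it.
matches=$(printf '%s\n' "$sample" | grep -P '\b(\w+)\b(.*\b\1\b){3}')

printf '%s\n' "$matches"
```

Only the first line (a occurs 4 times) and the last line (the occurs 4 times) should be printed. The third line is skipped because x and y each occur only 3 times.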
You can simplify by putting the word bounds in the group: grep -E '(\b\w\b)(.*\1){3}'. Unless there's an edge case I haven't thought of.
@wjandrea The capturing group only keeps the value, not the pattern. So \1 is unaware of whether the string it holds was captured as a whole word or not. We need all the word boundaries I used in the pattern.
Ah, I see, if you use input like o do to moe, mine matches it (false positive), yours doesn't.
I am also getting false positives with your solution @WiktorStribiżew. It most likely has to do with the first word boundary, I would say. en vrouwen gelijk voor de wet en maken we geen is one of my results, but the word en does not appear as a separate word 4 times. It does if you count the en inside other words.
Ah yes this seems to work. Thank you for the solution to my problem.
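The false positive from the o do to moe example in the comments can be reproduced with a short check. This is a sketch assuming GNU grep (for \b and \w under -E, and for -P). The variable names line, simplified and full are made up for illustration.

```shell
line='o do to moe'

# Pattern from the comment: word boundaries apply only to the captured word,
# so the backreference \1 also matches "o" inside "do", "to" and "moe".
if printf '%s\n' "$line" | grep -qE '(\b\w\b)(.*\1){3}'; then
    simplified=match
else
    simplified=no-match
fi

# Pattern from the answer: every repetition must itself be a whole word.
if printf '%s\n' "$line" | grep -qP '\b(\w+)\b(.*\b\1\b){3}'; then
    full=match
else
    full=no-match
fi

echo "simplified: $simplified, full: $full"
```

The simplified pattern reports a match on this line, while the pattern with boundaries around \1 does not, which is the difference described in the comments.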