def count_occurrences(string):
    count = 0
    for text in GENERIC_TEXT_STORE:
        count += text.count(string)
    return count
GENERIC_TEXT_STORE is a list of strings. For example:
GENERIC_TEXT_STORE = ['this is good', 'this is a test', "that's not a test"]
Given a string such as 'this', I want to find how many times it occurs in GENERIC_TEXT_STORE. If my GENERIC_TEXT_STORE is huge, this is very slow. What are the ways to make this search and count much faster? For instance, if I split the big GENERIC_TEXT_STORE list into multiple smaller lists, would that be faster?
If the multiprocessing module is useful here, how can I use it for this purpose?
You can use re.
In [2]: GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that\'s not a test']
In [3]: def count_occurrences(string):
...: count = 0
...: for text in GENERIC_TEXT_STORE:
...: count += text.count(string)
...: return count
In [6]: import re
In [7]: def count(_str):
...: return len(re.findall(_str,''.join(GENERIC_TEXT_STORE)))
...:
In [28]: def count1(_str):
...: return ' '.join(GENERIC_TEXT_STORE).count(_str)
...:
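One caveat worth noting about these two variants (my addition, not covered by the benchmarks below): joining the store with '' can create spurious matches that span the boundary between two adjacent strings, and re.findall interprets its first argument as a regular expression, so a needle containing special characters should be passed through re.escape. A small sketch:

```python
import re

# Joining with '' can merge adjacent strings into a false match:
store = ['count this', 'string too']
assert ''.join(store).count('thisstring') == 1   # spurious cross-boundary hit
assert ' '.join(store).count('thisstring') == 0  # the separator prevents it

# re.findall treats the needle as a regex, so escape literal text:
needle = 'a.b'
assert len(re.findall(needle, 'axb')) == 1            # '.' matches any character
assert len(re.findall(re.escape(needle), 'axb')) == 0  # escaped, literal only
```

For a plain word like 'this' searched in space-separated text, both caveats are harmless, which is why the timings below still compare like for like.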
Now use timeit to analyse the execution time.
When the size of GENERIC_TEXT_STORE is 3:
In [9]: timeit count('this')
1.27 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: timeit count_occurrences('this')
697 ns ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [33]: timeit count1('this')
385 ns ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
When the size of GENERIC_TEXT_STORE is 15000:
In [17]: timeit count('this')
1.07 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [18]: timeit count_occurrences('this')
3.35 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [37]: timeit count1('this')
275 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
When the size of GENERIC_TEXT_STORE is 150000:
In [20]: timeit count('this')
5.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit count_occurrences('this')
33 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [40]: timeit count1('this')
3.98 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When the size of GENERIC_TEXT_STORE is over a million (1500000):
In [23]: timeit count('this')
50.3 ms ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [24]: timeit count_occurrences('this')
283 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [43]: timeit count1('this')
40.7 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In terms of execution time: count1 < count < count_occurrences. When the size of GENERIC_TEXT_STORE is large, count and count1 are roughly 4 to 5 times faster than count_occurrences.
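As for the multiprocessing part of the question: one possible approach (a sketch, not something benchmarked above) is to split the store into chunks and count each chunk in a worker process with multiprocessing.Pool, exactly the "split into smaller lists" idea from the question. Note that for small stores the process startup and pickling overhead will dominate, so this only pays off for very large inputs.

```python
from multiprocessing import Pool

# Hypothetical large store for illustration.
GENERIC_TEXT_STORE = ['this is good', 'this is a test', "that's not a test"] * 1000

def count_chunk(args):
    """Count occurrences of needle in one chunk of the store."""
    chunk, needle = args
    return ' '.join(chunk).count(needle)

def parallel_count(needle, store, workers=4):
    # Split the store into roughly equal chunks, one per worker.
    size = max(1, len(store) // workers)
    chunks = [store[i:i + size] for i in range(0, len(store), size)]
    with Pool(workers) as pool:
        return sum(pool.map(count_chunk, [(c, needle) for c in chunks]))

if __name__ == '__main__':
    print(parallel_count('this', GENERIC_TEXT_STORE))
```

The result matches the serial count because each worker joins only its own chunk; the per-chunk ' '.join keeps the same semantics as count1.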