Warm tip: This article is reproduced from stackoverflow.com, please click
list python search

How to make this search and count much faster?

发布于 2020-04-11 22:14:05
def count_occurrences(string):
    count = 0
    for text in GENERIC_TEXT_STORE:
        count += text.count(string)
    return count

GENERIC_TEXT_STORE is a list of string. For example:

GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that's not a test']

Given a string 'text', I want to find how many times the text, i.e. 'this', occurs in the GENERIC_TEXT_STORE. If my GENERIC_TEXT_STORE is huge, this is very slow. What are the ways to make this search and count much faster? For instance, if I split the big GENERIC_TEXT_STORE list into multiple smaller lists, would that be faster?

If the multiprocessing module is useful here, how to make it possible for this purpose?

Questioner
ling
Viewed
41
Ch3steR 2020-02-02 17:39

You can use re.

In [2]: GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that\'s not a test']

In [3]: def count_occurrences(string):
   ...:     count = 0
   ...:     for text in GENERIC_TEXT_STORE:
   ...:         count += text.count(string)
   ...:     return count

In [6]: import re

In [7]: def count(_str):
   ...:     return len(re.findall(_str,''.join(GENERIC_TEXT_STORE)))
   ...:
In [28]: def count1(_str):
    ...:     return ' '.join(GENERIC_TEXT_STORE).count(_str)
    ...:

Now using timeit to analyse the execution time.

when size of GENERIC_TEXT_STORE is 3.

In [9]: timeit count('this')
1.27 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: timeit count_occurrences('this')
697 ns ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [33]: timeit count1('this')
385 ns ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

when size of GENERIC_TEXT_STORE is 15000.

In [17]: timeit count('this')
1.07 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [18]: timeit count_occurrences('this')
3.35 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [37]: timeit count1('this')
275 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

when size of GENERIC_TEXT_STORE is 150000

In [20]: timeit count('this')
5.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: timeit count_occurrences('this')
33 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [40]: timeit count1('this')
3.98 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

when size of GENERIC_TEXT_STORE is over a million (1500000)

In [23]: timeit count('this')
50.3 ms ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: timeit count_occurrences('this')
283 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [43]: timeit count1('this')
40.7 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

count1 < count < count_occurrences

When the size of GENERIC_TEXT_STORE is large count and count1 are almost 4 to 5 times faster than count_occurrences.