def count_occurrences(string):
count = 0
for text in GENERIC_TEXT_STORE:
count += text.count(string)
return count
GENERIC_TEXT_STORE是字符串列表。例如:
GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that's not a test']
给定一个字符串“ text”,我想查找GENERIC_TEXT_STORE中该文本(即“ this”)出现了多少次。如果我的GENERIC_TEXT_STORE很大,那就太慢了。有什么方法可以使搜索和计数更快?例如,如果我将大的GENERIC_TEXT_STORE列表拆分为多个较小的列表,那会更快吗?
如果多处理模块在这里有用,那么如何使它成为可能?
您可以使用re
。
In [2]: GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that\'s not a test']
In [3]: def count_occurrences(string):
...: count = 0
...: for text in GENERIC_TEXT_STORE:
...: count += text.count(string)
...: return count
In [6]: import re
In [7]: def count(_str):
...: return len(re.findall(_str,''.join(GENERIC_TEXT_STORE)))
...:
In [28]: def count1(_str):
...: return ' '.join(GENERIC_TEXT_STORE).count(_str)
...:
现在使用timeit
来分析执行时间。
当的大小GENERIC_TEXT_STORE
为时3
。
In [9]: timeit count('this')
1.27 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: timeit count_occurrences('this')
697 ns ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [33]: timeit count1('this')
385 ns ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
当的大小GENERIC_TEXT_STORE
为时 15000
。
In [17]: timeit count('this')
1.07 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [18]: timeit count_occurrences('this')
3.35 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [37]: timeit count1('this')
275 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
当的大小GENERIC_TEXT_STORE
为150000
In [20]: timeit count('this')
5.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit count_occurrences('this')
33 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [40]: timeit count1('this')
3.98 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
当大小GENERIC_TEXT_STORE
超过一百万(1500000
)
In [23]: timeit count('this')
50.3 ms ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [24]: timeit count_occurrences('this')
283 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [43]: timeit count1('this')
40.7 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
count1 <count <count_occurrences
当的大小GENERIC_TEXT_STORE
大时count
,count1
速度比快4至5倍count_occurrences
。