8.8. Consider Searching Sorted Sequences with bisect¶
It’s common to find yourself with a large amount of data in memory as a sorted list that you then want to search. For example, you may have loaded an English language dictionary to use for spell checking, or perhaps a list of dated financial transactions to audit for correctness.
Regardless of the data your specific program needs to process, searching for a specific value in a list takes linear time proportional to the list’s length when you call the index method:
>>> data = list(range(10**5))
>>> index = data.index(91234)
>>> assert index == 91234
If you’re not sure whether the exact value you’re searching for is in the list, then you may want to search for the closest index that is equal to or exceeds your goal value. The simplest way to do this is to linearly scan the list and compare each item to your goal value:
Click here to view code image
>>> def find_closest(sequence, goal):
>>> for index, value in enumerate(sequence):
>>> if goal < value:
>>> return index
>>> raise ValueError(f'{goal} is out of bounds')
>>>
>>> index = find_closest(data, 91234.56)
>>> assert index == 91235
Python’s built-in bisect module provides better ways to accomplish these types of searches through ordered lists. You can use the bisect_left function to do an efficient binary search through any sequence of sorted items. The index it returns will either be where the item is already present in the list or where you’d want to insert the item in the list to keep it in sorted order:
Click here to view code image
>>> from bisect import bisect_left
>>>
>>> index = bisect_left(data, 91234) # Exact match
>>> assert index == 91234
>>>
>>> index = bisect_left(data, 91234.56) # Closest match
>>> assert index == 91235
The complexity of the binary search algorithm used by the bisect module is logarithmic. This means searching in a list of length 1 million takes roughly the same amount of time with bisect as linearly searching a list of length 20 using the list.index method (math.log2(10**6) == 19.93…). It’s way faster!
I can verify this speed improvement for the example from above by using the timeit built-in module to run a micro-benchmark:
Click here to view code image
>>> import random
>>> import timeit
>>>
>>> size = 10**5
>>> iterations = 1000
>>>
>>> data = list(range(size))
>>> to_lookup = [random.randint(0, size)
>>> for _ in range(iterations)]
>>>
>>> def run_linear(data, to_lookup):
>>> for index in to_lookup:
>>> data.index(index)
>>>
>>> def run_bisect(data, to_lookup):
>>> for index in to_lookup:
>>> bisect_left(data, index)
>>>
>>> baseline = timeit.timeit(
>>> stmt='run_linear(data, to_lookup)',
>>> globals=globals(),
>>> number=10)
>>> print(f'Linear search takes {baseline:.6f}s')
>>>
>>> comparison = timeit.timeit(
>>> stmt='run_bisect(data, to_lookup)',
>>> globals=globals(),
>>> number=10)
>>> print(f'Bisect search takes {comparison:.6f}s')
>>>
>>> slowdown = 1 + ((baseline - comparison) / comparison)
>>> print(f'{slowdown:.1f}x time')
Linear search takes 4.725888s
Bisect search takes 0.004665s
1013.0x time
The best part about bisect is that it’s not limited to the list type; you can use it with any Python object that acts like a sequence (see Item 43: “Inherit from collections.abc for Custom Container Types” for how to do that). The module also provides additional features for more advanced situations (see help(bisect)).
8.8.1. Things to Remember¶
✦ Searching sorted data contained in a list takes linear time using the index method or a for loop with simple comparisons.
✦ The bisect built-in module’s bisect_left function takes logarithmic time to search for values in sorted lists, which can be orders of magnitude faster than other approaches.