8.9. Know How to Use heapq for Priority Queues¶
One of the limitations of Python’s other queue implementations (see Item 71: “Prefer deque for Producer–Consumer Queues” and Item 55: “Use Queue to Coordinate Work Between Threads”) is that they are first-in, first-out (FIFO) queues: Their contents are sorted by the order in which they were received. Often, you need a program to process items in order of relative importance instead. To accomplish this, a priority queue is the right tool for the job.
For example, say that I’m writing a program to manage books borrowed from a library. There are people constantly borrowing new books. There are people returning their borrowed books on time. And there are people who need to be reminded to return their overdue books. Here, I define a class to represent a book that’s been borrowed:
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date
I need a system that will send reminder messages when each book passes its due date. Unfortunately, I can’t use a FIFO queue for this because the amount of time each book is allowed to be borrowed varies based on its recency, popularity, and other factors. For example, a book that is borrowed today may be due back later than a book that’s borrowed tomorrow. Here, I achieve this behavior by using a standard list and sorting it by due_date each time a new Book is added:
def add_book(queue, book):
    queue.append(book)
    queue.sort(key=lambda x: x.due_date, reverse=True)

queue = []
add_book(queue, Book('Don Quixote', '2019-06-07'))
add_book(queue, Book('Frankenstein', '2019-06-05'))
add_book(queue, Book('Les Misérables', '2019-06-08'))
add_book(queue, Book('War and Peace', '2019-06-03'))
If I can assume that the queue of borrowed books is always in sorted order, then all I need to do to check for overdue books is to inspect the final element in the list. Here, I define a function to return the next overdue book, if any, and remove it from the queue:
class NoOverdueBooks(Exception):
    pass

def next_overdue_book(queue, now):
    if queue:
        book = queue[-1]
        if book.due_date < now:
            queue.pop()
            return book

    raise NoOverdueBooks
I can call this function repeatedly to get overdue books to remind people about in the order of most overdue to least overdue:
now = '2019-06-10'

found = next_overdue_book(queue, now)
print(found.title)

found = next_overdue_book(queue, now)
print(found.title)
War and Peace
Frankenstein
If a book is returned before the due date, I can remove the scheduled reminder message by removing the Book from the list:
def return_book(queue, book):
    queue.remove(book)

queue = []
book = Book('Treasure Island', '2019-06-04')

add_book(queue, book)
print('Before return:', [x.title for x in queue])

return_book(queue, book)
print('After return: ', [x.title for x in queue])
Before return: ['Treasure Island']
After return: []
And I can confirm that when all books are returned, the next_overdue_book function raises the right exception (see Item 20: “Prefer Raising Exceptions to Returning None”):
try:
    next_overdue_book(queue, now)
except NoOverdueBooks:
    pass          # Expected
else:
    assert False  # Doesn't happen
However, the computational complexity of this solution isn’t ideal. Although checking for and removing an overdue book has a constant cost, every time I add a book, I pay the cost of sorting the whole list again. If I have len(queue) books to add, and the cost of sorting them is roughly len(queue) * math.log(len(queue)), the time it takes to add books will grow superlinearly (len(queue) * len(queue) * math.log(len(queue))).
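To get a rough sense of that growth before measuring it, here is a back-of-the-envelope sketch (my own illustration, not part of the benchmark below) that compares the estimated total work at several queue sizes against a baseline of 500 books:

import math

def approx_list_cost(n):
    # Sorting the whole list after each of n inserts costs roughly
    # n * (n * log2(n)) comparisons, ignoring constant factors.
    return n * n * math.log2(n)

baseline = approx_list_cost(500)
for n in (1_000, 1_500, 2_000):
    ratio = approx_list_cost(n) / baseline
    print(f'{n:>5} books needs ~{ratio:.1f}x the work of 500 books')

The ratios grow much faster than the book counts themselves, which is exactly the superlinear behavior the benchmark below demonstrates experimentally.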
Here, I define a micro-benchmark to measure this performance behavior experimentally by using the timeit built-in module (see Item 71: “Prefer deque for Producer–Consumer Queues” for the implementation of print_results and print_delta):
import random
import timeit

def print_results(count, tests):
    ...

def print_delta(before, after):
    ...

def list_overdue_benchmark(count):
    def prepare():
        to_add = list(range(count))
        random.shuffle(to_add)
        return [], to_add

    def run(queue, to_add):
        for i in to_add:
            queue.append(i)
            queue.sort(reverse=True)

        while queue:
            queue.pop()

    tests = timeit.repeat(
        setup='queue, to_add = prepare()',
        stmt='run(queue, to_add)',
        globals=locals(),
        repeat=100,
        number=1)

    return print_results(count, tests)
I can verify that the runtime of adding and removing books from the queue scales superlinearly as the number of books being borrowed increases:
baseline = list_overdue_benchmark(500)
for count in (1_000, 1_500, 2_000):
    comparison = list_overdue_benchmark(count)
    print_delta(baseline, comparison)
When a book is returned before the due date, I need to do a linear scan in order to find the book in the queue and remove it. Removing a book causes all subsequent items in the list to be shifted back an index, which has a high cost that also scales superlinearly. Here, I define another micro-benchmark to test the performance of returning a book using this function:
def list_return_benchmark(count):
    def prepare():
        queue = list(range(count))
        random.shuffle(queue)

        to_return = list(range(count))
        random.shuffle(to_return)

        return queue, to_return

    def run(queue, to_return):
        for i in to_return:
            queue.remove(i)

    tests = timeit.repeat(
        setup='queue, to_return = prepare()',
        stmt='run(queue, to_return)',
        globals=locals(),
        repeat=100,
        number=1)

    return print_results(count, tests)
And again, I can verify that indeed the performance degrades superlinearly as the number of books increases:
baseline = list_return_benchmark(500)
for count in (1_000, 1_500, 2_000):
    comparison = list_return_benchmark(count)
    print_delta(baseline, comparison)
Using the methods of list may work for a tiny library, but it certainly won’t scale to the size of the Great Library of Alexandria, as I want it to!
Fortunately, Python has the built-in heapq module that solves this problem by implementing priority queues efficiently. A heap is a data structure that maintains a list of items where adding a new item or removing the smallest item has logarithmic complexity (i.e., even better than linear scaling). In this library example, smallest means the book with the earliest due date. The best part about this module is that you don’t have to understand how heaps are implemented in order to use its functions correctly.
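Before applying heapq to the Book class, here is a minimal sketch of the two core operations using plain integers (my own illustration, not part of the library example): heappush maintains the heap invariant as items are added, and heappop always removes the smallest item:

from heapq import heappush, heappop

heap = []
for value in (5, 3, 7, 1):
    heappush(heap, value)  # Each push costs O(log n)

print(heap[0])  # The smallest item is always at index 0: prints 1
print([heappop(heap) for _ in range(len(heap))])  # Prints [1, 3, 5, 7]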
Here, I reimplement the add_book function using the heapq module. The queue is still a plain list. The heappush function replaces the list.append call from before. And I no longer have to call list.sort on the queue:
from heapq import heappush

def add_book(queue, book):
    heappush(queue, book)
If I try to use this with the Book class as previously defined, I get this somewhat cryptic error:
queue = []
add_book(queue, Book('Little Women', '2019-06-05'))
add_book(queue, Book('The Time Machine', '2019-05-30'))

Traceback ...
TypeError: '<' not supported between instances of 'Book' and 'Book'
The heapq module requires items in the priority queue to be comparable and have a natural sort order (see Item 14: “Sort by Complex Criteria Using the key Parameter” for details). You can quickly give the Book class this behavior by using the total_ordering class decorator from the functools built-in module (see Item 51: “Prefer Class Decorators Over Metaclasses for Composable Class Extensions” for background) and implementing the __lt__ special method (see Item 43: “Inherit from collections.abc for Custom Container Types” for background). Here, I redefine the class with a less-than method that simply compares the due_date fields between two Book instances:
import functools

@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date

    def __lt__(self, other):
        return self.due_date < other.due_date
Now, I can add books to the priority queue by using the heapq.heappush function without issues:
queue = []
add_book(queue, Book('Pride and Prejudice', '2019-06-01'))
add_book(queue, Book('The Time Machine', '2019-05-30'))
add_book(queue, Book('Crime and Punishment', '2019-06-06'))
add_book(queue, Book('Wuthering Heights', '2019-06-12'))
Alternatively, I can create a list with all of the books in any order and then use the sort method of list to produce the heap:
queue = [
    Book('Pride and Prejudice', '2019-06-01'),
    Book('The Time Machine', '2019-05-30'),
    Book('Crime and Punishment', '2019-06-06'),
    Book('Wuthering Heights', '2019-06-12'),
]
queue.sort()
Or I can use the heapq.heapify function to create a heap in linear time (as opposed to the sort method’s len(queue) * log(len(queue)) complexity):
from heapq import heapify

queue = [
    Book('Pride and Prejudice', '2019-06-01'),
    Book('The Time Machine', '2019-05-30'),
    Book('Crime and Punishment', '2019-06-06'),
    Book('Wuthering Heights', '2019-06-12'),
]
heapify(queue)
To check for overdue books, I inspect the first element in the list instead of the last, and then I use the heapq.heappop function instead of the list.pop function:
from heapq import heappop

def next_overdue_book(queue, now):
    if queue:
        book = queue[0]    # Most overdue first
        if book.due_date < now:
            heappop(queue)  # Remove the overdue book
            return book

    raise NoOverdueBooks
Now, I can find and remove overdue books in order until there are none left for the current time:
now = '2019-06-02'

book = next_overdue_book(queue, now)
print(book.title)

book = next_overdue_book(queue, now)
print(book.title)

try:
    next_overdue_book(queue, now)
except NoOverdueBooks:
    pass          # Expected
else:
    assert False  # Doesn't happen
The Time Machine
Pride and Prejudice
I can write another micro-benchmark to test the performance of this implementation that uses the heapq module:
def heap_overdue_benchmark(count):
    def prepare():
        to_add = list(range(count))
        random.shuffle(to_add)
        return [], to_add

    def run(queue, to_add):
        for i in to_add:
            heappush(queue, i)
        while queue:
            heappop(queue)

    tests = timeit.repeat(
        setup='queue, to_add = prepare()',
        stmt='run(queue, to_add)',
        globals=locals(),
        repeat=100,
        number=1)

    return print_results(count, tests)
This benchmark experimentally verifies that the heap-based priority queue implementation scales much better (roughly len(queue) * math.log(len(queue))), without superlinearly degrading performance:
baseline = heap_overdue_benchmark(500)
for count in (1_000, 1_500, 2_000):
    comparison = heap_overdue_benchmark(count)
    print_delta(baseline, comparison)
With the heapq implementation, one question remains: How should I handle returns that are on time? The solution is to never remove a book from the priority queue until its due date. At that time, it will be the first item in the list, and I can simply ignore the book if it’s already been returned. Here, I implement this behavior by adding a new field to track the book’s return status:
@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date
        self.returned = False  # New field

    ...
Then, I change the next_overdue_book function to repeatedly ignore any book that’s already been returned:
def next_overdue_book(queue, now):
    while queue:
        book = queue[0]
        if book.returned:
            heappop(queue)
            continue

        if book.due_date < now:
            heappop(queue)
            return book

        break

    raise NoOverdueBooks
This approach makes the return_book function extremely fast because it makes no modifications to the priority queue:
def return_book(queue, book):
    book.returned = True
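Here is a short usage sketch showing the two functions working together (my own example with two made-up loans, assuming the redefined Book class and next_overdue_book function above): the returned book is silently skipped, while the unreturned overdue book is still reported:

queue = []
on_loan = Book('Moby-Dick', '2019-06-01')
early = Book('The Odyssey', '2019-05-30')
add_book(queue, on_loan)
add_book(queue, early)

return_book(queue, early)  # Only flips the flag; the queue is untouched

found = next_overdue_book(queue, '2019-06-10')
print(found.title)  # Prints 'Moby-Dick'; 'The Odyssey' was skipped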
The downside of this solution for returns is that the priority queue may grow to the maximum size it would have needed if all books from the library were checked out and went overdue. Although the queue operations will be fast thanks to heapq, this storage overhead may take significant memory (see Item 81: “Use tracemalloc to Understand Memory Usage and Leaks” for how to debug such usage).
That said, if you’re trying to build a robust system, you need to plan for the worst-case scenario; thus, you should expect that it’s possible for every library book to go overdue for some reason (e.g., a natural disaster closes the road to the library). This memory cost is a design consideration that you should have already planned for and mitigated through additional constraints (e.g., imposing a maximum number of simultaneously lent books).
Beyond the priority queue primitives that I’ve used in this example, the heapq module provides additional functionality for advanced use cases (see help(heapq)). The module is a great choice when its functionality matches the problem you’re facing (see the queue.PriorityQueue class for another thread-safe option).
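For instance, queue.PriorityQueue provides the same smallest-first behavior behind a thread-safe interface; a minimal sketch (my own, not part of the library example above) might store (due_date, title) tuples so that the earliest due date is retrieved first:

from queue import PriorityQueue

pq = PriorityQueue()
pq.put(('2019-06-05', 'Frankenstein'))
pq.put(('2019-06-03', 'War and Peace'))

print(pq.get())  # ('2019-06-03', 'War and Peace'): earliest due date first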
8.10. Things to Remember¶
✦ Priority queues allow you to process items in order of importance instead of in first-in, first-out order.
✦ If you try to use list operations to implement a priority queue, your program’s performance will degrade superlinearly as the queue grows.
✦ The heapq built-in module provides all of the functions you need to implement a priority queue that scales efficiently.
✦ To use heapq, the items being prioritized must have a natural sort order, which requires special methods like __lt__ to be defined for classes.