7.13. Consider concurrent.futures for True Parallelism

At some point in writing Python programs, you may hit the performance wall. Even after optimizing your code (see Item 70: “Profile Before Optimizing”), your program’s execution may still be too slow for your needs. On modern computers that have an increasing number of CPU cores, it’s reasonable to assume that one solution would be parallelism. What if you could split your code’s computation into independent pieces of work that run simultaneously across multiple CPU cores?

Unfortunately, Python’s global interpreter lock (GIL) prevents true parallelism in threads (see Item 53: “Use Threads for Blocking I/O, Avoid for Parallelism”), so that option is out. Another common suggestion is to rewrite your most performance-critical code as an extension module, using the C language. C gets you closer to the bare metal and can run faster than Python, eliminating the need for parallelism in some cases. C extensions can also start native threads independent of the Python interpreter that run in parallel and utilize multiple CPU cores with no concern for the GIL. Python’s API for C extensions is well documented and a good choice for an escape hatch. It’s also worth checking out tools like SWIG (https://github.com/swig/swig) and CLIF (https://github.com/google/clif) to aid in extension development.

But rewriting your code in C has a high cost. Code that is short and understandable in Python can become verbose and complicated in C. Such a port requires extensive testing to ensure that the functionality is equivalent to the original Python code and that no bugs have been introduced. Sometimes it’s worth it, which explains the large ecosystem of C-extension modules in the Python community that speed up things like text parsing, image compositing, and matrix math. There are even open source tools such as Cython (https://cython.org) and Numba (https://numba.pydata.org) that can ease the transition to C.
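If you want a sense of what that transition looks like, here is a minimal sketch of the Numba approach (it assumes the third-party numba package is installed; gcd_fast is an illustrative name). The @njit decorator compiles a plain Python function to native machine code the first time it's called, with no C source involved:

from numba import njit

@njit  # Compiled to native code on first call; no C source required
def gcd_fast(a, b):
    low = min(a, b)
    for i in range(low, 0, -1):
        if a % i == 0 and b % i == 0:
            return i
    return 1  # Unreachable for positive inputs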

The problem is that moving one piece of your program to C isn’t sufficient most of the time. Optimized Python programs usually don’t have one major source of slowness; rather, there are often many significant contributors. To get the benefits of C’s bare metal and threads, you’d need to port large parts of your program, drastically increasing testing needs and risk. There must be a better way to preserve your investment in Python to solve difficult computational problems.

The multiprocessing built-in module, which is easily accessed via the concurrent.futures built-in module, may be exactly what you need (see Item 59: “Consider ThreadPoolExecutor When Threads Are Necessary for Concurrency” for a related example). It enables Python to utilize multiple CPU cores in parallel by running additional interpreters as child processes. These child processes are separate from the main interpreter, so their global interpreter locks are also separate. Each child can fully utilize one CPU core. Each child has a link to the main process through which it receives instructions to do computation and returns results.

For example, say that I want to do something computationally intensive with Python and utilize multiple CPU cores. I’ll use an implementation of finding the greatest common divisor of two numbers as a proxy for a more computationally intense algorithm (like simulating fluid dynamics with the Navier–Stokes equation):


# my_module.py
def gcd(pair):
    a, b = pair
    low = min(a, b)
    for i in range(low, 0, -1):
        if a % i == 0 and b % i == 0:
            return i
    assert False, 'Not reachable'

Running this function in serial takes a linearly increasing amount of time because there is no parallelism:


# run_serial.py
import my_module
import time

NUMBERS = [
    (1963309, 2265973),
    (2030677, 3814172),
    (1551645, 2229620),
    (2039045, 2020802),
    (1823712, 1924928),
    (2293129, 1020491),
    (1281238, 2273782),
    (3823812, 4237281),
    (3812741, 4729139),
    (1292391, 2123811),
]

def main():
    start = time.time()
    results = list(map(my_module.gcd, NUMBERS))
    end = time.time()
    delta = end - start
    print(f'Took {delta:.3f} seconds')

if __name__ == '__main__':
    main()

>>>
Took 1.173 seconds

Running this code on multiple Python threads will yield no speed improvement because the GIL prevents Python from using multiple CPU cores in parallel. Here, I do the same computation as above but using the concurrent.futures module with its ThreadPoolExecutor class and two worker threads (to match the number of CPU cores on my computer):


# run_threads.py
import my_module
from concurrent.futures import ThreadPoolExecutor
import time

NUMBERS = [
    ...
]

def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=2)
    results = list(pool.map(my_module.gcd, NUMBERS))
    end = time.time()
    delta = end - start
    print(f'Took {delta:.3f} seconds')

if __name__ == '__main__':
    main()

>>>
Took 1.436 seconds

It’s even slower this time because of the overhead of starting and communicating with the pool of threads.

Now for the surprising part: Changing a single line of code causes something magical to happen. If I replace the ThreadPoolExecutor with the ProcessPoolExecutor from the concurrent.futures module, everything speeds up:


# run_parallel.py
import my_module
from concurrent.futures import ProcessPoolExecutor
import time

NUMBERS = [
    ...
]

def main():
    start = time.time()
    pool = ProcessPoolExecutor(max_workers=2)  # The one change
    results = list(pool.map(my_module.gcd, NUMBERS))
    end = time.time()
    delta = end - start
    print(f'Took {delta:.3f} seconds')

if __name__ == '__main__':
    main()

>>>
Took 0.683 seconds

Running on my dual-core machine, this is significantly faster! How is this possible? Here’s what the ProcessPoolExecutor class actually does (via the low-level constructs provided by the multiprocessing module):

  1. It takes each item from the NUMBERS input data passed to map.

  2. It serializes the item into binary data by using the pickle module (see Item 68: “Make pickle Reliable with copyreg”).

  3. It copies the serialized data from the main interpreter process to a child interpreter process over a local socket.

  4. It deserializes the data back into Python objects, using pickle in the child process.

  5. It imports the Python module containing the gcd function.

  6. It runs the function on the input data in parallel with other child processes.

  7. It serializes the result back into binary data.

  8. It copies that binary data back through the socket.

  9. It deserializes the binary data back into Python objects in the parent process.

  10. It merges the results from multiple children into a single list to return.
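To make the cost of steps 2–4 and 7–9 concrete, here is a minimal sketch that performs the same pickle round trip by hand in a single process (simulate_round_trip is an illustrative helper, not a multiprocessing internal):

import pickle

import my_module

def simulate_round_trip(pair):
    request = pickle.dumps(pair)       # Step 2: serialize the input item
    received = pickle.loads(request)   # Step 4: child deserializes it
    result = my_module.gcd(received)   # Step 6: child runs the function
    response = pickle.dumps(result)    # Step 7: serialize the result
    return pickle.loads(response)      # Step 9: parent deserializes it

assert simulate_round_trip((1963309, 2265973)) == my_module.gcd((1963309, 2265973))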

Although it looks simple to the programmer, the multiprocessing module and ProcessPoolExecutor class do a huge amount of work to make parallelism possible. In most other languages, the only touch point you need to coordinate two threads is a single lock or atomic operation (see Item 54: “Use Lock to Prevent Data Races in Threads” for an example). The overhead of using multiprocessing via ProcessPoolExecutor is high because of all of the serialization and deserialization that must happen between the parent and child processes.

This scheme is well suited to certain types of isolated, high-leverage tasks. By isolated, I mean functions that don’t need to share state with other parts of the program. By high-leverage tasks, I mean situations in which only a small amount of data must be transferred between the parent and child processes to enable a large amount of computation. The greatest common divisor algorithm is one example of this, but many other mathematical algorithms work similarly.
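For contrast, here is a hypothetical low-leverage task: each call ships a large list across the process boundary to do only a trivial sum, so serialization is likely to dominate and ProcessPoolExecutor may well run slower than the serial version (total and payloads are illustrative names):

from concurrent.futures import ProcessPoolExecutor

def total(big_list):
    return sum(big_list)  # Cheap work relative to the data transferred

if __name__ == '__main__':
    payloads = [list(range(1_000_000)) for _ in range(8)]
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(total, payloads))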

If your computation doesn’t have these characteristics, then the overhead of ProcessPoolExecutor may prevent it from speeding up your program through parallelization. When that happens, multiprocessing provides more advanced facilities for shared memory, cross-process locks, queues, and proxies. But all of these features are very complex. It’s hard enough to reason about such tools in the memory space of a single process shared between Python threads. Extending that complexity to other processes and involving sockets makes this much more difficult to understand.
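As one small taste of those lower-level facilities, here is a minimal sketch of a counter kept in shared memory and guarded by a cross-process lock (increment is an illustrative name); even this simple case requires care around process start-up and synchronization:

from multiprocessing import Process, Value

def increment(counter, times):
    for _ in range(times):
        with counter.get_lock():  # Cross-process lock guarding the value
            counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)       # A C int living in shared memory
    workers = [Process(target=increment, args=(counter, 1000))
               for _ in range(4)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    assert counter.value == 4000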

I suggest that you initially avoid all parts of the multiprocessing built-in module. You can start by using the ThreadPoolExecutor class to run isolated, high-leverage functions in threads. Later you can move to the ProcessPoolExecutor to get a speedup. Finally, when you’ve completely exhausted the other options, you can consider using the multiprocessing module directly.

7.13.1. Things to Remember

✦ Moving CPU bottlenecks to C-extension modules can be an effective way to improve performance while maximizing your investment in Python code. However, doing so has a high cost and may introduce bugs.

✦ The multiprocessing module provides powerful tools that can parallelize certain types of Python computation with minimal effort.

✦ The power of multiprocessing is best accessed through the concurrent.futures built-in module and its simple ProcessPoolExecutor class.

✦ Avoid the advanced (and complicated) parts of the multiprocessing module until you’ve exhausted all other options.