7.1. Use subprocess to Manage Child Processes
Python has battle-hardened libraries for running and managing child processes. This makes it a great language for gluing together other tools, such as command-line utilities. When existing shell scripts get complicated, as they often do over time, graduating them to a rewrite in Python for the sake of readability and maintainability is a natural choice.
Child processes started by Python are able to run in parallel, enabling you to use Python to consume all of the CPU cores of a machine and maximize the throughput of programs. Although Python itself may be CPU bound (see Item 53: “Use Threads for Blocking I/O, Avoid for Parallelism”), it’s easy to use Python to drive and coordinate CPU-intensive workloads.
Python has many ways to run subprocesses (e.g., os.popen, os.exec*), but the best choice for managing child processes is to use the subprocess built-in module. Running a child process with subprocess is simple. Here, I use the module’s run convenience function to start a process, read its output, and verify that it terminated cleanly:
import subprocess

result = subprocess.run(
    ['echo', 'Hello from the child!'],
    capture_output=True,
    encoding='utf-8')

result.check_returncode()  # No exception means clean exit
print(result.stdout)
Hello from the child!
7.1.1. Note
The examples in this item assume that your system has the echo, sleep, and openssl commands available. On Windows, this may not be the case. Please refer to the full example code for this item to see specific directions on how to run these snippets on Windows.
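On systems without these commands, one portable workaround is to let the Python interpreter itself play the role of the child process; sys.executable points at the running interpreter, so the echo example above can be sketched in a cross-platform way (this substitution is my suggestion, not part of the original examples):

```python
import subprocess
import sys

# The current Python interpreter acts as a portable stand-in for `echo`
result = subprocess.run(
    [sys.executable, '-c', 'print("Hello from the child!")'],
    capture_output=True,
    encoding='utf-8')

result.check_returncode()  # No exception means clean exit
print(result.stdout)
```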
Child processes run independently from their parent process, the Python interpreter. If I create a subprocess using the Popen class instead of the run function, I can poll child process status periodically while Python does other work:
proc = subprocess.Popen(['sleep', '1'])
while proc.poll() is None:
    print('Working...')
    # Some time-consuming work here
    ...

print('Exit status', proc.poll())
Working...
Working...
Working...
Working...
Exit status 0
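If there's no other work to do while the child runs, the wait method blocks until the process exits and returns its status code directly, avoiding the polling loop. A minimal sketch (using a short Python-based sleep via sys.executable, an assumption of mine, so it also runs on Windows):

```python
import subprocess
import sys

# A Python one-liner stands in for `sleep` for portability
proc = subprocess.Popen(
    [sys.executable, '-c', 'import time; time.sleep(0.2)'])
returncode = proc.wait()  # Blocks until the child exits
print('Exit status', returncode)
```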
Decoupling the child process from the parent frees up the parent process to run many child processes in parallel. Here, I do this by starting all the child processes together with Popen upfront:
import time

start = time.time()
sleep_procs = []
for _ in range(10):
    proc = subprocess.Popen(['sleep', '1'])
    sleep_procs.append(proc)
Later, I wait for them to finish their I/O and terminate with the communicate method:
for proc in sleep_procs:
    proc.communicate()

end = time.time()
delta = end - start
print(f'Finished in {delta:.3} seconds')
Finished in 1.02 seconds
If these processes ran in sequence, the total delay would be 10 seconds or more rather than the ~1 second that I measured.
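To see the difference, here's a quick sketch that runs similar children one at a time with the run function, which blocks until each child exits; the total is roughly the sum of the sleeps. I've substituted short Python-based sleeps via sys.executable (my assumption, for portability and speed):

```python
import subprocess
import sys
import time

start = time.time()
for _ in range(10):
    # run blocks until each child exits, so the sleeps execute back to back
    subprocess.run(
        [sys.executable, '-c', 'import time; time.sleep(0.1)'])
end = time.time()
delta = end - start
print(f'Sequential run finished in {delta:.3} seconds')
```

With ten 0.1-second sleeps this takes at least a full second, while the Popen version above overlaps the waits.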
You can also pipe data from a Python program into a subprocess and retrieve its output. This allows you to utilize many other programs to do work in parallel. For example, say that I want to use the openssl command-line tool to encrypt some data. Starting the child process with command-line arguments and I/O pipes is easy:
import os

def run_encrypt(data):
    env = os.environ.copy()
    env['password'] = 'zf7ShyBhZOraQDdE/FiZpm/m/8f9X+M1'
    proc = subprocess.Popen(
        ['openssl', 'enc', '-des3', '-pass', 'env:password'],
        env=env,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE)
    proc.stdin.write(data)
    proc.stdin.flush()  # Ensure that the child gets input
    return proc
Here, I pipe random bytes into the encryption function, but in practice this input pipe would be fed data from user input, a file handle, a network socket, and so on:
procs = []
for _ in range(3):
    data = os.urandom(10)
    proc = run_encrypt(data)
    procs.append(proc)
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
The child processes run in parallel and consume their input. Here, I wait for them to finish and then retrieve their final output. The output is random encrypted bytes as expected:
for proc in procs:
    out, _ = proc.communicate()
    print(out[-10:])

b'}\x85gG| \xd0!U~'
b'\xb9\xd4\x02\xbc\xce\xf5\xb38\x94\xdb'
b'q\x87\xee\x9a\xeb\xfaC\xa4g('
It’s also possible to create chains of parallel processes, just like UNIX pipelines, connecting the output of one child process to the input of another, and so on. Here’s a function that starts the openssl command-line tool as a subprocess to generate a Whirlpool hash of the input stream:
def run_hash(input_stdin):
    return subprocess.Popen(
        ['openssl', 'dgst', '-whirlpool', '-binary'],
        stdin=input_stdin,
        stdout=subprocess.PIPE)
Now, I can kick off one set of processes to encrypt some data and another set of processes to subsequently hash their encrypted output. Note that I have to be careful with how the stdout instance of the upstream process is retained by the Python interpreter process that’s starting this pipeline of child processes:
encrypt_procs = []
hash_procs = []
for _ in range(3):
    data = os.urandom(100)

    encrypt_proc = run_encrypt(data)
    encrypt_procs.append(encrypt_proc)

    hash_proc = run_hash(encrypt_proc.stdout)
    hash_procs.append(hash_proc)

    # Ensure that the child consumes the input stream and
    # the communicate() method doesn't inadvertently steal
    # input from the child. Also lets SIGPIPE propagate to
    # the upstream process if the downstream process dies.
    encrypt_proc.stdout.close()
    encrypt_proc.stdout = None
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
The I/O between the child processes happens automatically once they are started. All I need to do is wait for them to finish and print the final output:
for proc in encrypt_procs:
    proc.communicate()
    assert proc.returncode == 0

for proc in hash_procs:
    out, _ = proc.communicate()
    print(out[-10:])
    assert proc.returncode == 0

b'b\xee\xd5\x8d\x9a\xe1\xb9\xa6\xb8\xc1'
b'#[-\x03\xa0:4sa@'
b'L\xe8\xd8\xad2S\xdb\xf2\xe6k'
If I’m worried about the child processes never finishing or somehow blocking on input or output pipes, I can pass the timeout parameter to the communicate method. This causes an exception to be raised if the child process hasn’t finished within the time period, giving me a chance to terminate the misbehaving subprocess:
proc = subprocess.Popen(['sleep', '10'])
try:
    proc.communicate(timeout=0.1)
except subprocess.TimeoutExpired:
    proc.terminate()
    proc.wait()

print('Exit status', proc.poll())
Exit status -15
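The run convenience function accepts the same timeout parameter; per the subprocess documentation, when run times out it kills the child and waits for it before re-raising TimeoutExpired, so there's no process to clean up afterward. A sketch (with a Python-based sleep substituted for portability, and a timed_out flag of my own for illustration):

```python
import subprocess
import sys

timed_out = False
try:
    # A Python one-liner stands in for `sleep 10` for portability
    subprocess.run(
        [sys.executable, '-c', 'import time; time.sleep(10)'],
        timeout=0.1)
except subprocess.TimeoutExpired:
    # Unlike communicate, run has already killed and waited
    # for the child before this handler executes
    timed_out = True
    print('Child timed out and was killed')
```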
7.1.2. Things to Remember
✦ Use the subprocess module to run child processes and manage their input and output streams.
✦ Child processes run in parallel with the Python interpreter, enabling you to maximize your usage of CPU cores.
✦ Use the run convenience function for simple usage, and the Popen class for advanced usage like UNIX-style pipelines.
✦ Use the timeout parameter of the communicate method to avoid deadlocks and hanging child processes.