8.6. Profile Before Optimizing¶
The dynamic nature of Python causes surprising behaviors in its runtime performance. Operations you might assume would be slow are actually very fast (e.g., string manipulation, generators). Language features you might assume would be fast are actually very slow (e.g., attribute accesses, function calls). The true source of slowdowns in a Python program can be obscure.
The best approach is to ignore your intuition and directly measure the performance of a program before you try to optimize it. Python provides a built-in profiler for determining which parts of a program are responsible for its execution time. This means you can focus your optimization efforts on the biggest sources of trouble and ignore parts of the program that don’t impact speed (i.e., follow Amdahl’s law).
For example, say that I want to determine why an algorithm in a program is slow. Here, I define a function that sorts a list of data using an insertion sort:
>>> def insertion_sort(data):
...     result = []
...     for value in data:
...         insert_value(result, value)
...     return result
The core mechanism of the insertion sort is the function that finds the insertion point for each piece of data. Here, I define an extremely inefficient version of the insert_value function that does a linear scan over the input array:
>>> def insert_value(array, value):
...     for i, existing in enumerate(array):
...         if existing > value:
...             array.insert(i, value)
...             return
...     array.append(value)
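Before profiling, it's worth a quick sanity check that the two functions actually sort correctly. This is a minimal, self-contained sketch of the same linear-scan implementation shown above:

```python
def insert_value(array, value):
    # Linear scan: insert before the first element greater than value.
    for i, existing in enumerate(array):
        if existing > value:
            array.insert(i, value)
            return
    array.append(value)

def insertion_sort(data):
    result = []
    for value in data:
        insert_value(result, value)
    return result

print(insertion_sort([5, 3, 8, 1]))  # [1, 3, 5, 8]
```

Each call to insert_value scans the already-sorted prefix, which is what makes this version quadratic overall and a good candidate for profiling.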
To profile insertion_sort and insert_value, I create a data set of random numbers and define a test function to pass to the profiler:
>>> from random import randint
>>>
>>> max_size = 10**4
>>> data = [randint(0, max_size) for _ in range(max_size)]
>>> test = lambda: insertion_sort(data)
Python provides two built-in profilers: one that is pure Python (profile) and another that is a C-extension module (cProfile). The cProfile built-in module is better because of its minimal impact on the performance of your program while it’s being profiled. The pure-Python alternative imposes a high overhead that skews the results.
8.6.1. Note¶
When profiling a Python program, be sure that what you’re measuring is the code itself and not external systems. Beware of functions that access the network or resources on disk. These may appear to have a large impact on your program’s execution time because of the slowness of the underlying systems. If your program uses a cache to mask the latency of slow resources like these, you should ensure that it’s properly warmed up before you start profiling.
Here, I instantiate a Profile object from the cProfile module and run the test function through it using the runcall method:
>>> from cProfile import Profile
>>>
>>> profiler = Profile()
>>> profiler.runcall(test)
When the test function has finished running, I can extract statistics about its performance by using the pstats built-in module and its Stats class. Various methods on a Stats object adjust how to select and sort the profiling information to show only the things I care about:
>>> from pstats import Stats
>>>
>>> stats = Stats(profiler)
>>> stats.strip_dirs()
>>> stats.sort_stats('cumulative')
>>> stats.print_stats()
         6 function calls in 0.000 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 3575045409.py:5(<lambda>)
        1    0.000    0.000    0.000    0.000 3743226344.py:1(insertion_sort)
        1    0.000    0.000    0.000    0.000 2797103157.py:3(insert_value)
        1    0.000    0.000    0.000    0.000 {built-in method _bisect.bisect_left}
        1    0.000    0.000    0.000    0.000 {method 'insert' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

<pstats.Stats at 0x7fc22670cf10>
The output is a table of information organized by function. The data sample is taken only from the time the profiler was active, during the runcall method above.
Here’s a quick guide to what the profiler statistics columns mean:
ncalls: The number of calls to the function during the profiling period.
tottime: The number of seconds spent executing the function, excluding time spent executing other functions it calls.
tottime percall: The average number of seconds spent in the function each time it is called, excluding time spent executing other functions it calls. This is tottime divided by ncalls.
cumtime: The cumulative number of seconds spent executing the function, including time spent in all other functions it calls.
cumtime percall: The average number of seconds spent in the function each time it is called, including time spent in all other functions it calls. This is cumtime divided by ncalls.
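Since Python 3.7, the pstats module also provides a SortKey enumeration, which avoids typos in the sort-key strings. A sketch of the same workflow using it (slow is a hypothetical function to profile):

```python
from cProfile import Profile
from pstats import SortKey, Stats

def slow():
    # Hypothetical workload to generate some profiling data.
    return [x ** 2 for x in range(50_000)]

profiler = Profile()
profiler.runcall(slow)

stats = Stats(profiler)
stats.strip_dirs()
stats.sort_stats(SortKey.CUMULATIVE)  # equivalent to sort_stats('cumulative')
stats.print_stats(5)                  # restrict output to the top 5 rows
```

Passing a count to print_stats is handy when profiling large programs, where the full table can run to hundreds of rows.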
Looking at the profiler statistics table above, I can see that the biggest use of CPU in my test is the cumulative time spent in the insert_value function. Here, I redefine that function to use the bisect built-in module (see Item 72: “Consider Searching Sorted Sequences with bisect”):
>>> from bisect import bisect_left
>>>
>>> def insert_value(array, value):
...     i = bisect_left(array, value)
...     array.insert(i, value)
I can run the profiler again and generate a new table of profiler statistics. The new function is much faster, with a cumulative time nearly 100 times smaller than that of the previous insert_value function.
Sometimes when you’re profiling an entire program, you might find that a common utility function is responsible for the majority of execution time. The default output from the profiler makes such a situation difficult to understand because it doesn’t show that the utility function is called by many different parts of your program.
For example, here the my_utility function is called repeatedly by two different functions in the program:
>>> def my_utility(a, b):
...     c = 1
...     for i in range(100):
...         c += a * b
>>>
>>> def first_func():
...     for _ in range(1000):
...         my_utility(4, 5)
>>>
>>> def second_func():
...     for _ in range(10):
...         my_utility(1, 3)
>>>
>>> def my_program():
...     for _ in range(20):
...         first_func()
...         second_func()
Profiling this code with the default print_stats output generates statistics that are confusing.
The my_utility function is clearly the source of most execution time, but it’s not immediately obvious why that function is called so much. If you search through the program’s code, you’ll find multiple call sites for my_utility and still be confused.
To deal with this, the Python profiler provides the print_callers method to show which callers contributed to the profiling information of each function:
>>> stats.print_callers()
   Ordered by: cumulative time

Function                                          was called by...
                                                      ncalls  tottime  cumtime
3575045409.py:5(<lambda>)                         <-
3743226344.py:1(insertion_sort)                   <-       1    0.000    0.000  3575045409.py:5(<lambda>)
2797103157.py:3(insert_value)                     <-       1    0.000    0.000  3743226344.py:1(insertion_sort)
{built-in method _bisect.bisect_left}             <-       1    0.000    0.000  2797103157.py:3(insert_value)
{method 'insert' of 'list' objects}               <-       1    0.000    0.000  2797103157.py:3(insert_value)
{method 'disable' of '_lsprof.Profiler' objects}  <-

<pstats.Stats at 0x7fc22670cf10>
This profiler statistics table shows the functions called on the left and, on the right, which callers were responsible for those calls. Applied to the my_utility example, this view makes it clear that the vast majority of my_utility calls come from first_func.
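The Stats class also provides a complementary print_callees method, which shows what each function calls rather than who calls it. A self-contained sketch, reusing the my_utility program from above:

```python
from cProfile import Profile
from pstats import Stats

def my_utility(a, b):
    c = 1
    for i in range(100):
        c += a * b

def first_func():
    for _ in range(1000):
        my_utility(4, 5)

def second_func():
    for _ in range(10):
        my_utility(1, 3)

def my_program():
    for _ in range(20):
        first_func()
        second_func()

profiler = Profile()
profiler.runcall(my_program)

stats = Stats(profiler)
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_callers()  # who calls each function
stats.print_callees()  # the reverse view: what each function calls
```

Together, these two views let you trace hot spots up to the call sites responsible for them and down into the work they delegate.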
8.6.2. Things to Remember¶
✦ It’s important to profile Python programs before optimizing because the sources of slowdowns are often obscure.
✦ Use the cProfile module instead of the profile module because it provides more accurate profiling information.
✦ The Profile object’s runcall method provides everything you need to profile a tree of function calls in isolation.
✦ The Stats object lets you select and print the subset of profiling information you need to see to understand your program’s performance.