8.4. Make pickle Reliable with copyreg
The pickle built-in module can serialize Python objects into a stream of bytes and deserialize bytes back into objects. Pickled byte streams shouldn’t be used to communicate between untrusted parties. The purpose of pickle is to let you pass Python objects between programs that you control over binary channels.
8.4.1. Note
The pickle module’s serialization format is unsafe by design. The serialized data contains what is essentially a program that describes how to reconstruct the original Python object. This means a malicious pickle payload could be used to compromise any part of a Python program that attempts to deserialize it.
In contrast, the json module is safe by design. Serialized JSON data contains a simple description of an object hierarchy. Deserializing JSON data does not expose a Python program to additional risk. Formats like JSON should be used for communication between programs or people who don’t trust each other.
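The difference can be seen in a minimal sketch. The Payload class below is hypothetical and only asks for a harmless eval call, but any importable callable could stand in its place:

```python
import json
import pickle


class Payload:
    """Hypothetical class, used only to show what unpickling can do."""

    def __reduce__(self):
        # Tell pickle to rebuild this object by calling eval('6 * 7').
        # An attacker could name any importable callable here instead.
        return eval, ('6 * 7',)


data = pickle.dumps(Payload())
# Loading runs the call recorded in the byte stream; no Payload comes back.
result = pickle.loads(data)
print(result)

# JSON only describes data, so loading it yields plain containers and values.
parsed = json.loads('{"level": 1, "lives": 3}')
print(parsed)
```

Loading the pickle returns whatever the recorded call produced, while loading the JSON text can only ever produce dictionaries, lists, strings, numbers, booleans, and None.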
For example, say that I want to use a Python object to represent the state of a player’s progress in a game. The game state includes the level the player is on and the number of lives they have remaining:
>>> class GameState:
>>>     def __init__(self):
>>>         self.level = 0
>>>         self.lives = 4
The program modifies this object as the game runs:
>>> state = GameState()
>>> state.level += 1 # Player beat a level
>>> state.lives -= 1 # Player had to try again
>>>
>>> print(state.__dict__)
{'level': 1, 'lives': 3}
When the user quits playing, the program can save the state of the game to a file so it can be resumed at a later time. The pickle module makes it easy to do this. Here, I use the dump function to write the GameState object to a file:
>>> import pickle
>>>
>>> state_path = 'game_state.bin'
>>> with open(state_path, 'wb') as f:
>>>     pickle.dump(state, f)
Later, I can call the load function with the file and get back the GameState object as if it had never been serialized:
>>> with open(state_path, 'rb') as f:
>>>     state_after = pickle.load(f)
>>>
>>> print(state_after.__dict__)
{'level': 1, 'lives': 3}
The problem with this approach is what happens as the game’s features expand over time. Imagine that I want the player to earn points toward a high score. To track the player’s points, I’d add a new field to the GameState class:
>>> class GameState:
>>>     def __init__(self):
>>>         self.level = 0
>>>         self.lives = 4
>>>         self.points = 0 # New field
Serializing the new version of the GameState class using pickle will work exactly as before. Here, I simulate the round-trip through a file by serializing to a bytes object with dumps and back to an object with loads:
>>> state = GameState()
>>> serialized = pickle.dumps(state)
>>> state_after = pickle.loads(serialized)
>>> print(state_after.__dict__)
{'level': 0, 'lives': 4, 'points': 0}
But what happens to older saved GameState objects that the user may want to resume? Here, I unpickle an old game file by using a program with the new definition of the GameState class:
>>> with open(state_path, 'rb') as f:
>>>     state_after = pickle.load(f)
>>>
>>> print(state_after.__dict__)
{'level': 1, 'lives': 3}
The points attribute is missing! This is especially confusing because the returned object is an instance of the new GameState class:
>>> assert isinstance(state_after, GameState)
This behavior is a byproduct of the way the pickle module works. Its primary use case is making object serialization easy. As soon as your use of pickle moves beyond trivial usage, the module’s functionality starts to break down in surprising ways.
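The missing attribute follows from the fact that default unpickling restores the saved attribute dictionary without running __init__. A minimal sketch, using a hypothetical Tracked class that counts constructor calls:

```python
import pickle


class Tracked:
    """Hypothetical class that counts how often __init__ runs."""

    init_calls = 0

    def __init__(self):
        Tracked.init_calls += 1
        self.level = 0


original = Tracked()
restored = pickle.loads(pickle.dumps(original))

# Only the explicit Tracked() call above ran __init__; loads() did not.
print(Tracked.init_calls)
print(restored.__dict__)
```

Because the constructor is skipped, any attribute that a newer __init__ would set is simply absent from objects restored out of older data.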
Fixing these problems is straightforward using the copyreg built-in module. The copyreg module lets you register the functions responsible for serializing and deserializing Python objects, allowing you to control the behavior of pickle and make it more reliable.
Default Attribute Values
In the simplest case, you can use a constructor with default arguments (see Item 23: “Provide Optional Behavior with Keyword Arguments” for background) to ensure that GameState objects will always have all attributes after unpickling. Here, I redefine the constructor this way:
>>> class GameState:
>>>     def __init__(self, level=0, lives=4, points=0):
>>>         self.level = level
>>>         self.lives = lives
>>>         self.points = points
To use this constructor for pickling, I define a helper function that takes a GameState object and turns it into a tuple of parameters for the copyreg module. The returned tuple contains the function to use for unpickling and the parameters to pass to the unpickling function:
>>> def pickle_game_state(game_state):
>>>     kwargs = game_state.__dict__
>>>     return unpickle_game_state, (kwargs,)
Now, I need to define the unpickle_game_state helper. This function takes serialized data and parameters from pickle_game_state and returns the corresponding GameState object. It’s a tiny wrapper around the constructor:
>>> def unpickle_game_state(kwargs):
>>>     return GameState(**kwargs)
Now, I register these functions with the copyreg built-in module:
>>> import copyreg
>>>
>>> copyreg.pickle(GameState, pickle_game_state)
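To make the registration less opaque, here is a small sketch that inspects it. It repeats the definitions so it can run on its own, and it relies on copyreg.dispatch_table, the module-level mapping in which copyreg records registered reducers:

```python
import copyreg


class GameState:
    def __init__(self, level=0, lives=4, points=0):
        self.level = level
        self.lives = lives
        self.points = points


def unpickle_game_state(kwargs):
    return GameState(**kwargs)


def pickle_game_state(game_state):
    return unpickle_game_state, (game_state.__dict__,)


copyreg.pickle(GameState, pickle_game_state)

# copyreg keeps its registrations in a module-level dispatch table.
registered = copyreg.dispatch_table[GameState]
print(registered is pickle_game_state)

# Calling the reducer by hand shows the (callable, args) pair pickle stores.
func, args = pickle_game_state(GameState(points=10))
print(func.__name__, args)

# Unpickling amounts to calling that callable with those arguments.
rebuilt = func(*args)
print(rebuilt.__dict__)
```

Seen this way, the reducer is just a recipe: pickle writes down which function to call and with what arguments, and replays that call at load time.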
After registration, serializing and deserializing works as before:
>>> state = GameState()
>>> state.points += 1000
>>> serialized = pickle.dumps(state)
>>> state_after = pickle.loads(serialized)
>>> print(state_after.__dict__)
{'level': 0, 'lives': 4, 'points': 1000}
With this registration done, now I’ll change the definition of GameState again to give the player a count of magic spells to use. This change is similar to when I added the points field to GameState:
>>> class GameState:
>>>     def __init__(self, level=0, lives=4, points=0, magic=5):
>>>         self.level = level
>>>         self.lives = lives
>>>         self.points = points
>>>         self.magic = magic # New field
But unlike before, deserializing an old GameState object will result in valid game data instead of missing attributes. This works because unpickle_game_state calls the GameState constructor directly instead of using the pickle module’s default behavior of saving and restoring only the attributes that belong to an object. The GameState constructor’s keyword arguments have default values that will be used for any parameters that are missing. This causes old game state files to receive the default value for the new magic field when they are deserialized:
>>> print('Before:', state.__dict__)
>>> state_after = pickle.loads(serialized)
>>> print('After: ', state_after.__dict__)
Before: {'level': 0, 'lives': 4, 'points': 1000}
After: {'level': 0, 'lives': 4, 'points': 1000, 'magic': 5}
8.4.2. Versioning Classes
Sometimes you need to make backward-incompatible changes to your Python objects by removing fields. Doing so prevents the default argument approach above from working.
For example, say I realize that a limited number of lives is a bad idea, and I want to remove the concept of lives from the game. Here, I redefine the GameState class to no longer have a lives field:
>>> class GameState:
>>>     def __init__(self, level=0, points=0, magic=5):
>>>         self.level = level
>>>         self.points = points
>>>         self.magic = magic
The problem is that this breaks deserialization of old game data. All fields from the old data, even ones removed from the class, will be passed to the GameState constructor by the unpickle_game_state function:
>>> pickle.loads(serialized)
Traceback ...
TypeError: __init__() got an unexpected keyword argument 'lives'
I can fix this by adding a version parameter to the functions supplied to copyreg. New serialized data will have a version of 2 specified when pickling a new GameState object:
>>> def pickle_game_state(game_state):
>>>     kwargs = dict(game_state.__dict__) # Copy so the live object gains no 'version' attribute
>>>     kwargs['version'] = 2
>>>     return unpickle_game_state, (kwargs,)
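One detail matters when building kwargs from __dict__: that attribute is the object's live attribute dictionary, so writing a key into it changes the object being pickled. A minimal sketch with two hypothetical helper functions that compares writing in place with writing to a copy:

```python
class GameState:
    def __init__(self, level=0, points=0, magic=5):
        self.level = level
        self.points = points
        self.magic = magic


def tag_in_place(game_state):
    kwargs = game_state.__dict__        # Same dict object the instance uses
    kwargs['version'] = 2
    return kwargs


def tag_on_copy(game_state):
    kwargs = dict(game_state.__dict__)  # Shallow copy of the attributes
    kwargs['version'] = 2
    return kwargs


first = GameState()
tag_in_place(first)
print(hasattr(first, 'version'))   # The instance picked up a stray attribute

second = GameState()
tag_on_copy(second)
print(hasattr(second, 'version'))  # The instance is unchanged
```

Copying keeps the version marker inside the serialized data only, so pickling an object never alters it as a side effect.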
Old versions of the data will not have a version argument present, which means I can manipulate the arguments passed to the GameState constructor accordingly:
>>> def unpickle_game_state(kwargs):
>>>     version = kwargs.pop('version', 1)
>>>     if version == 1:
>>>         del kwargs['lives']
>>>     return GameState(**kwargs)
Now, deserializing an old object works properly:
>>> copyreg.pickle(GameState, pickle_game_state)
>>> print('Before:', state.__dict__)
>>> state_after = pickle.loads(serialized)
>>> print('After: ', state_after.__dict__)
Before: {'level': 0, 'lives': 4, 'points': 1000}
After: {'level': 0, 'points': 1000, 'magic': 5}
I can continue using this approach to handle changes between future versions of the same class. Any logic I need to adapt an old version of the class to a new version of the class can go in the unpickle_game_state function.
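As more versions accumulate, one way to keep that logic manageable is a table of single-step upgrades that are applied in order. This is only a sketch: the UPGRADES registry, CURRENT_VERSION constant, and upgrade_v1_to_v2 helper are names I made up for illustration and are not part of copyreg:

```python
import copyreg
import pickle


class GameState:
    def __init__(self, level=0, points=0, magic=5):
        self.level = level
        self.points = points
        self.magic = magic


def upgrade_v1_to_v2(kwargs):
    kwargs.pop('lives', None)  # Version 2 removed the lives field
    return kwargs


# Hypothetical registry: maps a stored version to the step that lifts it by one.
UPGRADES = {1: upgrade_v1_to_v2}
CURRENT_VERSION = 2


def unpickle_game_state(kwargs):
    kwargs = dict(kwargs)               # Work on a copy of the stored data
    version = kwargs.pop('version', 1)  # Unversioned data counts as version 1
    while version < CURRENT_VERSION:
        kwargs = UPGRADES[version](kwargs)
        version += 1
    return GameState(**kwargs)


def pickle_game_state(game_state):
    kwargs = dict(game_state.__dict__)  # Copy so the live object is untouched
    kwargs['version'] = CURRENT_VERSION
    return unpickle_game_state, (kwargs,)


copyreg.pickle(GameState, pickle_game_state)

# A version 1 payload, built by hand to stand in for an old save file.
old_state = unpickle_game_state({'level': 0, 'lives': 4, 'points': 1000})
print(old_state.__dict__)

new_state = pickle.loads(pickle.dumps(GameState(level=3)))
print(new_state.__dict__)
```

Adding a future version would then mean bumping CURRENT_VERSION and registering one more upgrade step, while data from every earlier version keeps loading through the same chain.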
Stable Import Paths
One other issue you may encounter with pickle is breakage from renaming a class. Often over the life cycle of a program, you’ll refactor your code by renaming classes and moving them to other modules. Unfortunately, doing so breaks the pickle module unless you’re careful.
Here, I rename the GameState class to BetterGameState and remove the old class from the program entirely:
>>> class BetterGameState:
>>>     def __init__(self, level=0, points=0, magic=5):
>>>         self.level = level
>>>         self.points = points
>>>         self.magic = magic
Attempting to deserialize an old GameState object now fails because the class can’t be found:
>>> pickle.loads(serialized)
Traceback ...
AttributeError: Can't get attribute 'GameState' on <module '__main__' from 'my_code.py'>
The cause of this exception is that the import path of the serialized object’s class is encoded in the pickled data:
>>> print(serialized)
b'\x80\x04\x95A\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\tGameState\x94\x93\x94)\x81\x94}\x94(\x8c\x05level\x94K\x00\x8c\x06points\x94K\x00\x8c\x05magic\x94K\x05ub.'
The solution is to use copyreg again. I can specify a stable identifier for the function to use for unpickling an object. This allows me to transition pickled data to different classes with different names when it’s deserialized. It gives me a level of indirection:
>>> copyreg.pickle(BetterGameState, pickle_game_state)
After I use copyreg, you can see that the import path to unpickle_game_state is encoded in the serialized data instead of BetterGameState:
>>> state = BetterGameState()
>>> serialized = pickle.dumps(state)
>>> print(serialized)
b'\x80\x04\x95W\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x13unpickle_game_state\x94\x93\x94}\x94(\x8c\x05level\x94K\x00\x8c\x06points\x94K\x00\x8c\x05magic\x94K\x05\x8c\x07version\x94K\x02u\x85\x94R\x94.'
The only gotcha is that I can’t change the path of the module in which the unpickle_game_state function is present. Once I serialize data with a function, it must remain available on that import path for deserialization in the future.
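The indirection can be checked with a short, self-contained sketch. FinalGameState is a hypothetical later rename that I introduce only for this illustration. What matters is that unpickle_game_state keeps its name and module while the class behind it changes:

```python
import copyreg
import pickle


class BetterGameState:
    def __init__(self, level=0, points=0, magic=5):
        self.level = level
        self.points = points
        self.magic = magic


def unpickle_game_state(kwargs):
    kwargs = dict(kwargs)
    kwargs.pop('version', None)
    return BetterGameState(**kwargs)


def pickle_game_state(game_state):
    kwargs = dict(game_state.__dict__)
    kwargs['version'] = 2
    return unpickle_game_state, (kwargs,)


copyreg.pickle(BetterGameState, pickle_game_state)
serialized = pickle.dumps(BetterGameState(points=50))

# The byte stream names the unpickling function, not the class.
print(b'unpickle_game_state' in serialized)
print(b'BetterGameState' in serialized)


class FinalGameState:  # Hypothetical later rename of the class
    def __init__(self, level=0, points=0, magic=5):
        self.level = level
        self.points = points
        self.magic = magic


def unpickle_game_state(kwargs):  # Same name and module, new target class
    kwargs = dict(kwargs)
    kwargs.pop('version', None)
    return FinalGameState(**kwargs)


# Data written before the rename still loads, now as the new class.
restored = pickle.loads(serialized)
print(type(restored).__name__, restored.__dict__)
```

Since pickle resolves the function by its import path at load time, swapping the class it constructs is enough to migrate old data, as long as that path itself never moves.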
8.4.3. Things to Remember
✦ The pickle built-in module is useful only for serializing and deserializing objects between trusted programs.
✦ Deserializing previously pickled objects may break if the classes involved have changed over time (e.g., attributes have been added or removed).
✦ Use the copyreg built-in module with pickle to ensure backward compatibility for serialized objects.