>>> from env_helper import info; info()
Page last updated: 2024-04-04 19:40:25
Runtime environment:
Linux distribution: Debian GNU/Linux 12 (bookworm)
OS kernel: Linux-6.1.0-18-amd64-x86_64-with-glibc2.36
Python version: 3.11.2
1.3. Know the Differences Between bytes and str
In Python, there are two types that represent sequences of character data: bytes and str. Instances of bytes contain raw, unsigned 8-bit values (often displayed in the ASCII encoding):
>>> a = b'h\x65llo'
>>> print(list(a))
>>> print(a)
[104, 101, 108, 108, 111]
b'hello'
Instances of str contain Unicode code points that represent textual characters from human languages:
>>> a = 'a\u0300 propos'
>>> print(list(a))
>>> print(a)
['a', '̀', ' ', 'p', 'r', 'o', 'p', 'o', 's']
à propos
Importantly, str instances do not have an associated binary encoding, and bytes instances do not have an associated text encoding. To convert Unicode data to binary data, you must call the encode method of str. To convert binary data to Unicode data, you must call the decode method of bytes. You can explicitly specify the encoding you want to use for these methods or accept their default, which is UTF-8 in Python 3. File handles opened in text mode are different: their default encoding depends on the platform (see more on that below).
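As a minimal sketch of the conversions described above (the variable names are illustrative and not from the original text), the following round-trips a string through encode and decode and shows that the same text produces different bytes under different encodings:

```python
text = 'a\u0300 propos'                 # str: a sequence of Unicode code points

utf8_data = text.encode('utf-8')        # str -> bytes
utf16_data = text.encode('utf-16-le')   # same text, different binary form

# The number of code points and the number of bytes generally differ.
print(len(text), len(utf8_data), len(utf16_data))

# bytes -> str: decode with the same encoding that produced the bytes.
round_trip = utf8_data.decode('utf-8')
assert round_trip == text
```

Decoding with an encoding other than the one used to encode may raise UnicodeDecodeError or silently produce different text, which is why the encoding has to travel with the bytes.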
When you’re writing Python programs, it’s important to do encoding and decoding of Unicode data at the furthest boundary of your interfaces; this approach is often called the Unicode sandwich. The core of your program should use the str type containing Unicode data and should not assume anything about character encodings. This approach allows you to be very accepting of alternative text encodings (such as Latin-1, Shift JIS, and Big5) while being strict about your output text encoding (ideally, UTF-8).
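One possible way to picture the Unicode sandwich is sketched below. The function name shout and its input_encoding parameter are hypothetical, chosen only for illustration: bytes are decoded as soon as they enter, the core logic sees only str, and the result is encoded as UTF-8 on the way out.

```python
def shout(raw_input, input_encoding='latin-1'):
    """Hypothetical helper: decode at the edge, work on str, encode at the edge."""
    text = raw_input.decode(input_encoding)   # input boundary: bytes -> str
    result = text.upper()                     # core logic only touches str
    return result.encode('utf-8')             # output boundary: str -> bytes

# Accept Latin-1 input, but always emit UTF-8 output.
latin1_bytes = 'café'.encode('latin-1')
output = shout(latin1_bytes)
print(output.decode('utf-8'))
```

Only the two boundary lines know about encodings, so supporting another input encoding would not require touching the middle of the function.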
The split between character types leads to two common situations in Python code:
You want to operate on raw 8-bit sequences that contain UTF-8-encoded strings (or some other encoding).
You want to operate on Unicode strings that have no specific encoding.
You’ll often need two helper functions to convert between these cases and to ensure that the type of input values matches your code’s expectations.
The first function takes a bytes or str instance and always returns a str:
>>> def to_str(bytes_or_str):
>>>     if isinstance(bytes_or_str, bytes):
>>>         value = bytes_or_str.decode('utf-8')
>>>     else:
>>>         value = bytes_or_str
>>>     return value  # Instance of str
>>> print(repr(to_str(b'foo')))
>>> print(repr(to_str('bar')))
'foo'
'bar'
The second function takes a bytes or str instance and always returns a bytes:
>>> def to_bytes(bytes_or_str):
>>>     if isinstance(bytes_or_str, str):
>>>         value = bytes_or_str.encode('utf-8')
>>>     else:
>>>         value = bytes_or_str
>>>     return value  # Instance of bytes
>>>
>>> print(repr(to_bytes(b'foo')))
>>> print(repr(to_bytes('bar')))
b'foo'
b'bar'
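Both helpers above hard-code UTF-8. As a sketch of one possible variation (the encoding parameter is an assumption added here, not part of the original helpers), the encoding can be made an argument so that callers who know their input uses another encoding can say so:

```python
def to_str(bytes_or_str, encoding='utf-8'):
    """Return a str; bytes input is decoded with the given encoding."""
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode(encoding)
    return bytes_or_str  # Already a str

def to_bytes(bytes_or_str, encoding='utf-8'):
    """Return bytes; str input is encoded with the given encoding."""
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode(encoding)
    return bytes_or_str  # Already bytes

latin1_name = to_str(b'caf\xe9', encoding='latin-1')  # Latin-1 input
utf8_name = to_bytes(latin1_name)                      # UTF-8 output
print(repr(latin1_name), repr(utf8_name))
```

Passing a value that is already of the target type returns it unchanged, so calling either helper twice is harmless.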
There are two big gotchas when dealing with raw 8-bit values and Unicode strings in Python.
The first issue is that bytes and str seem to work the same way, but their instances are not compatible with each other, so you must be deliberate about the types of character sequences that you’re passing around.
By using the + operator, you can add bytes to bytes and str to str, respectively:
>>> print(b'one' + b'two')
>>> print('one' + 'two')
b'onetwo'
onetwo
But you can’t add str instances to bytes instances:
b'one' + 'two'
>>>
Traceback ...
TypeError: can't concat str to bytes
Nor can you add bytes instances to str instances:
'one' + b'two'
>>>
Traceback ...
TypeError: can only concatenate str (not "bytes") to str
By using binary operators, you can compare bytes to bytes and str to str, respectively:
>>> assert b'red' > b'blue'
>>> assert 'red' > 'blue'
But you can’t compare a str instance to a bytes instance:
assert 'red' > b'blue'
>>>
Traceback ...
TypeError: '>' not supported between instances of 'str' and 'bytes'
Nor can you compare a bytes instance to a str instance:
assert b'blue' < 'red'
>>>
Traceback ...
TypeError: '<' not supported between instances of 'bytes' and 'str'
Comparing bytes and str instances for equality will always evaluate to False, even when they contain exactly the same characters (in this case, ASCII-encoded “foo”):
>>> print(b'foo' == 'foo')
False
The % operator works with format strings for each type, respectively:
>>> print(b'red %s' % b'blue')
>>> print('red %s' % 'blue')
b'red blue'
red blue
But you can’t pass a str instance to a bytes format string because Python doesn’t know what binary text encoding to use:
print(b'red %s' % 'blue')
>>>
Traceback ...
TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'
You can pass a bytes instance to a str format string using the % operator, but it doesn’t do what you’d expect:
>>> print('red %s' % b'blue')
red b'blue'
This code actually invokes the __repr__ method (see Item 75: “Use repr Strings for Debugging Output”) on the bytes instance and substitutes that in place of the %s, which is why b'blue' remains escaped in the output.
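A minimal sketch of how to sidestep these mixing problems, assuming the bytes are known to be UTF-8 (or ASCII) text: convert everything to one type before combining, comparing, or formatting. The variable names are illustrative.

```python
color = b'blue'

# Mixing types: the bytes value is rendered through its repr.
mixed = 'red %s' % color

# Decode first, then format as str.
as_text = 'red %s' % color.decode('utf-8')

# Or stay entirely in bytes.
as_binary = b'red %s' % color

print(mixed)
print(as_text)
print(as_binary)

# Equality is meaningful only once both sides have the same type.
assert b'foo' != 'foo'
assert b'foo'.decode('utf-8') == 'foo'
```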
The second issue is that operations involving file handles (returned by the open built-in function) default to requiring Unicode strings instead of raw bytes. This can cause surprising failures, especially for programmers accustomed to Python 2. For example, say that I want to write some binary data to a file. This seemingly simple code breaks:
with open('data.bin', 'w') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')
>>>
Traceback ...
TypeError: write() argument must be str, not bytes
The cause of the exception is that the file was opened in write text mode ('w') instead of write binary mode ('wb'). When a file is in text mode, write operations expect str instances containing Unicode data instead of bytes instances containing binary data. Here, I fix this by changing the open mode to 'wb':
>>> with open('data.bin', 'wb') as f:
>>>     f.write(b'\xf1\xf2\xf3\xf4\xf5')
A similar problem also exists for reading data from files. For example, here I try to read the binary file that was written above:
with open('data.bin', 'r') as f:
    data = f.read()
>>>
Traceback ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte
This fails because the file was opened in read text mode ('r') instead of read binary mode ('rb'). When a handle is in text mode, it uses the system’s default text encoding to interpret binary data using the str.encode (for writing) and bytes.decode (for reading) methods. On most systems, the default encoding is UTF-8, which can’t accept the binary data b'\xf1\xf2\xf3\xf4\xf5', thus causing the error above. Here, I solve this problem by changing the open mode to 'rb':
>>> with open('data.bin', 'rb') as f:
>>>     data = f.read()
>>>
>>> assert data == b'\xf1\xf2\xf3\xf4\xf5'
Alternatively, I can explicitly specify the encoding parameter to the open function to make sure that I’m not surprised by any platform-specific behavior. For example, here I assume that the binary data in the file was actually meant to be a string encoded as 'cp1252' (a legacy Windows encoding):
>>> with open('data.bin', 'r', encoding='cp1252') as f:
>>>     data = f.read()
>>>
>>> assert data == 'ñòóôõ'
The exception is gone, and the string interpretation of the file’s contents is very different from what was returned when reading raw bytes. The lesson here is that you should check the default encoding on your system (using python3 -c 'import locale; print(locale.getpreferredencoding())') to understand how it differs from your expectations. When in doubt, you should explicitly pass the encoding parameter to open.
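The same check can be made from inside a program. The sketch below prints the platform's preferred encoding, then writes and reads a scratch file with an explicit encoding so the result does not depend on that default. The temporary directory and the file name data.txt are hypothetical choices for illustration.

```python
import locale
import os
import tempfile

# Platform-dependent value; do not rely on it being UTF-8.
print(locale.getpreferredencoding())

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.txt')  # hypothetical scratch file

    # Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('ñòóôõ')

    # Binary mode returns the raw bytes exactly as stored.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode with the same explicit encoding recovers the str.
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()

assert text == 'ñòóôõ'
assert raw == 'ñòóôõ'.encode('utf-8')
```

Because both open calls in text mode name the encoding, the assertions hold regardless of what locale.getpreferredencoding() reports on the machine running the code.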
1.3.1. Things to Remember
✦ bytes contains sequences of 8-bit values, and str contains sequences of Unicode code points.
✦ Use helper functions to ensure that the inputs you operate on are the type of character sequence that you expect (8-bit values, UTF-8-encoded strings, Unicode code points, etc.).
✦ bytes and str instances can’t be meaningfully mixed with operators: >, +, and % (with a bytes format string) raise TypeError, and == always evaluates to False.
✦ If you want to read or write binary data to/from a file, always open the file using a binary mode (like 'rb' or 'wb').
✦ If you want to read or write Unicode data to/from a file, be careful about your system’s default text encoding. Explicitly pass the encoding parameter to open if you want to avoid surprises.