2.1. Read and Write an Excel File

BASE PYTHON WITH XLRD AND XLWT MODULES

sudo apt install -y python3-xlrd python3-xlwt

To read and write an Excel file with base Python and the xlrd and xlwt modules, type the following code into a text editor and save the file as 2excel_parsing_and_write.py :

>>> #!/usr/bin/env python3
>>> import sys
>>> from xlrd import open_workbook
>>> from xlwt import Workbook
>>> input_file = 'xls_foo.xls'
>>> output_file = 'xx_xls.xls'
>>> output_workbook = Workbook()
>>> output_worksheet =output_workbook.add_sheet('jan_2013_output')
>>> with open_workbook(input_file) as workbook:
>>>
>>>     worksheet =workbook.sheet_by_name('january_2013')
>>>
>>>     for row_index in range(worksheet.nrows):
>>>
>>>         for column_index in range(worksheet.ncols):
>>>
>>>             output_worksheet.write(row_index,column_index, worksheet.cell_value(row_index,column_index))
>>>             output_workbook.save(output_file)
>>>

Line 3 imports xlrd’s open_workbook function and line 4 imports xlwt’s Workbook object.

Line 7 instantiates an xlwt Workbook object so we can write the results to an output Excel workbook. Line 8 uses xlwt’s add_sheet function to add a worksheet named jan_2013_output inside the output workbook.

Line 9 uses xlrd’s open_workbook function to open the input workbook into a workbook object. Line 10 uses the workbook object’s sheet_by_name function to access the worksheet titled january_2013 .

Lines 11 and 12 create for loops over the row and column index values, using the range function and the worksheet object’s nrows and ncols attributes, so we can iterate through each of the rows and columns in the worksheet.

Line 13 uses xlwt’s write function and row and column indexes to write every cell value to the worksheet in the output file.

Finally, line 15 saves and closes the output workbook.

To run the script, type the following on the command line and hit Enter:

python 2excel_parsing_and_write.py sales_2013.xlsx output_files2output.xls

You can then open the output file, 2output.xls , to review the results.

You may have noticed that the dates in the Purchase Date column, column E, appear to be numbers instead of dates. Excel stores dates and times as floating-point numbers representing the number of days since 1900-Jan-0, plus a fractional portion of a 24-hour day. For example, the number 1 represents 1900-Jan-1, as one day has passed since 1900-Jan-0. Therefore, the numbers in this column represent dates, but they are not formatted as dates.

The xlrd package provides additional functions for formatting date-like values. The next example augments the previous example by demonstrating how to format date-like values so date-like values printed to the screen and written to the output file appear as they do in the input file.

2.1.1. Format dates

This example builds on the previous example by showing how to use xlrd to maintain date formats as they appear in input Excel files. For example, if a date in an Excel worksheet is 1/19/2000, then we usually want to write 1/19/2000 or another related date format to the output file. However, as the previous example showed, with ourcurrent code, we will end up with the number 36544.0 in the output file, as that is the number of days between 1/0/1900 and 1/19/2000.

To apply formatting to our date column, type the following code into a text editor and save the file as : 3excel_parsing_and_write_keep_dates.py

>>> #!/usr/bin/env python3
>>> import sys
>>> from datetime import date
>>> from xlrd import open_workbook, xldate_as_tuple
>>> from xlwt import Workbook
>>> input_file = 'xls_foo.xls'
>>> output_file = 'xx_xls2.xls'
>>> output_workbook = Workbook()
>>> output_worksheet =output_workbook.add_sheet('jan_2013_output')
>>> with open_workbook(input_file) as workbook:
>>>
>>>     worksheet =workbook.sheet_by_name('january_2013')
>>>
>>>     for row_index in range(worksheet.nrows):
>>>
>>>             row_list_output = []
>>>
>>>             for col_index in range(worksheet.ncols):
>>>
>>>                 if worksheet.cell_type(row_index,col_index) == 3:
>>>
>>>                     date_cell =xldate_as_tuple(worksheet.cell_value(row_index,col_index),workbook.datemode)
>>>
>>>                     date_cell =date(*date_cell[0:3]).strftime('%m/%d/%Y')
>>>
>>>                     row_list_output.append(date_cell)
>>>                     output_worksheet.write(row_index, col_index,date_cell)
>>>                 else:
>>>                     non_date_cell =worksheet.cell_value(row_index,col_index)
>>>
>>>                     row_list_output.append(non_date_cell)
>>>
>>>                 output_worksheet.write(row_index, col_index,non_date_cell)
>>> output_workbook.save(output_file)

Line 3 imports the date function from the datetime module so we can cast values as dates and format the dates.

Line 4 imports two functions from the xlrd module. We used the first function to open an Excel workbook in the previous example, so I’ll focus on the second function. The xldate_as_tuple function enables us to convert Excel numbers that are presumed to represent dates, times, or date-times into tuples. Once we convert the numbers into tuples, we can extract specific date elements (e.g., year, month, and day) and format the elements into different date formats (e.g., 1/1/2010 or January 1, 2010).

Line 15 creates an if-else statement to test whether the cell type is the number three. If you review thexlrd module’s documentation, you’ll see that cell type three means the cell contains a date. Therefore, the if-else statement tests whether each cell it sees contains a date. If it does, then the code in the if block operates on the cell; if it doesn’t, then the code in the else block operates on the cell. Because the dates are in the last column, the if block handles the last column.

Line 18 uses the worksheet object’s cell_value function and row and column indexing to access the value in the cell. Alternatively, you could use the cell().value function; both versions give you the same results. This cell value is then the first argument in the xldate_as_tuple function, which converts the floating-point number into a tuple that represents the date.

The workbook.datemode argument is required so that the function can determine whether the date is 1900- based or 1904-based and therefore convert the number to the correct tuple (some versions of Excel for Mac calculate dates from January 1, 1904; for more information on this, read the Microsoft reference guide). The result of the xldate_as_tuple function is assigned to a tuple variable called date_cell. This line is so long that it’s split over two lines in the text, with a backslash as the last character of thefirst line (the backslash is required so Python interprets the two lines as one line). However, in your script, all of the code can appear on one line without the backslash.

Line 18 uses tuple indexing to access the first three elements in the date_cell tuple (i.e., the year, month, and day elements) and pass them as arguments to the date function, which converts the values into a date object. Next, the strftime function converts the date object into a string with the specified date format. The format, ‘%m/%d/%Y’, specifies that a date like March 15, 2014 should appear as 03/15/2014. The formatted date string is reassigned to the variable called date_cell. Line 20 uses the list’s append function to append the value in date_cell into the output list called row_list_output.

To get a feel for the operations taking place in lines 16 and 18, after running the script as is, add a print statement (i.e., print(date_cell)) between the two date_cell = … lines. Resave and rerun the script to see the result of the xldate_as_tuple function printed to the screen. Next, remove that print statement and move it beneath the second date_cell = … line. Resave and rerun the script to see the result of the date.strftime functions printed to the screen. These print statements help you see how the functionsin these two lines convert the number representing a date in Excel into a tuple and then into a text string formatted as a date.

The else block operates on all of the non-date cells. Line 23 uses the worksheet object’s cell_value function and row and column indexing to access the value in the cell and assigns it to a variable called non_date_cell. Line 25 uses the list’s append function to append the value in non_date_cell into row_list_output. Together, these two lines extract the values in the first four columns of each row as is and append them into row_list_output.

After all of the columns in the row have been processed and added to row_list_output, line 26 writes the values in row_list_output to the output file.

To run the script, type the following on the command line and hit Enter:

python 3excel_parsing_and_write_keep_dates.py sales_2013.xlsxoutput_files3output.xls

You can then open the output file,3output.xls , to review the results. ## Pandas Pandas has a set of commands for reading and writing Excel files as well. Here is a code example that will use pandas for Excel file parsing—save it as (this script reads an input Excel file, prints the contents to the screen, and writes the contents to an output Excel file):

注意,Pandas 需要较新版本的支持。不然会出错:

ImportError: Pandas requires version '2.0.1' or newer of 'xlrd' (version '1.2.0' currently installed).

所以要使用使用 pip 安装相关模块:

pip install xlrd, xlwt, pandas, openpyxl
>>> #!/usr/bin/env python3
>>> import pandas as pd
>>> import sys
>>> input_file = 'xls_foo.xls'
>>> output_file = 'xx_xls3.xlsx'

使用较新版本的 Pandas 进行测试,无法保存成 xls 格式的文件,会出错:

ValueError: No engine for filetype: 'xls'

需要将保存的文件后缀设置为 xlsx ,但同时需要安装 openpyxl 模块。

>>> data_frame = pd.read_excel(input_file,sheet_name='january_2013')
>>> writer = pd.ExcelWriter(output_file)
---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

Cell In [6], line 1
----> 1 data_frame = pd.read_excel(input_file,sheet_name='january_2013')
      2 writer = pd.ExcelWriter(output_file)


File /usr/lib/python3/dist-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)


File /usr/lib/python3/dist-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)


File /usr/lib/python3/dist-packages/pandas/io/excel/_base.py:482, in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, decimal, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
    480 if not isinstance(io, ExcelFile):
    481     should_close = True
--> 482     io = ExcelFile(io, storage_options=storage_options, engine=engine)
    483 elif engine and engine != io.engine:
    484     raise ValueError(
    485         "Engine should not be specified when passing "
    486         "an ExcelFile - ExcelFile already has the engine set"
    487     )


File /usr/lib/python3/dist-packages/pandas/io/excel/_base.py:1695, in ExcelFile.__init__(self, path_or_buffer, engine, storage_options)
   1692 self.engine = engine
   1693 self.storage_options = storage_options
-> 1695 self._reader = self._engines[engine](self._io, storage_options=storage_options)


File /usr/lib/python3/dist-packages/pandas/io/excel/_xlrd.py:34, in XlrdReader.__init__(self, filepath_or_buffer, storage_options)
     24 """
     25 Reader using xlrd engine.
     26
   (...)
     31 {storage_options}
     32 """
     33 err_msg = "Install xlrd >= 1.0.0 for Excel support"
---> 34 import_optional_dependency("xlrd", extra=err_msg)
     35 super().__init__(filepath_or_buffer, storage_options=storage_options)


File /usr/lib/python3/dist-packages/pandas/compat/_optional.py:176, in import_optional_dependency(name, extra, errors, min_version)
    174             return None
    175         elif errors == "raise":
--> 176             raise ImportError(msg)
    178 return module


ImportError: Pandas requires version '2.0.1' or newer of 'xlrd' (version '1.2.0' currently installed).
>>> data_frame.to_excel(writer,sheet_name='jan_13_output', index=False)
>>> writer.close()
>>>
>>> # writer.save()

To run the script, type the following on the command line and hit Enter:

python pandas_parsing_and_write_keep_dates.py sales_2013.xlsxoutput_filespandas_output.xls

You can then open the output file,pandas_output.xls , to review the results.

Now that you understand how to process a worksheet in an Excel workbook and retain date formatting, let’s turn to the issue of filtering for specific rows in aworksheet. We’ll discuss how to filter rows by evaluating whether values in the row (a) meet specific conditions, (b) are in a set of interest, or (c) match specific regular expression patterns.