1.2. Introspecting an Excel Workbook

Now that we have an Excel workbook that contains three worksheets, let’s learn how to process an Excel workbook in Python. As a reminder, we are using the xlrd and xlwt packages in this lesson, so make sure you have already downloaded and installed these add-in packages.

As you are probably already aware, Excel files are different from CSV files in at least two important respects. First, unlike a CSV file, an Excel file is not a plain-text file, so you cannot open it and view the data in a text editor. You can see this by right- clicking on the Excel workbook you just created and opening it in a text editor like Notepad orTextWrangler. Instead of legible data, you will see a mess of special characters.

Second, unlike a CSV file, an Excel workbook is designed to contain multiple worksheets. Because a single Excel workbook can contain multiple worksheets, we need to learn how to introspect (i.e., look inside and examine) all of the worksheets in a workbook without having to manually open the workbook. By introspecting a workbook, we can examine the number of worksheets and the types and amount of data on each worksheet before we actually process the data in the workbook.

Introspecting Excel files is useful to make sure that they contain the data you expect, and to do a quick check for consistency and completeness. That is, understanding the number of input files and the number of rows and columns in each file will give you some idea about the size of the processing job as well as the potential consistency of the file layouts.

Once you understand how to introspect the worksheets in a workbook, we will move on to parsing a single worksheet, iterating over multiple worksheets, and then iterating over multiple workbooks.

To determine the number of worksheets in the workbook, the names of the worksheets, and the number of rows and columns in each of the worksheets, type the followingcode into a text editor and save the file as : 1excel_introspect_workbook.py

>>> # import sys
>>> from xlrd import open_workbook
>>> # input_file = sys.argv[1]
>>> workbook = open_workbook('test.xlsx')
>>> print('Number of worksheets:', workbook.nsheets)
>>> for worksheet in workbook.sheets():
>>>     print("Worksheet name:", worksheet.name,"\tRows:",worksheet.nrows, "\tColumns:",worksheet.ncols)
Number of worksheets: 2
Worksheet name: DRR2020工作任务         Rows: 36        Columns: 7
Worksheet name: 分类  Rows: 46        Columns: 16

Figure 1-4, Figure 1-5, and Figure 1-6 show what the script looks like in Anaconda Spyder, Notepad++ (Windows), and TextWrangler (macOS), respectively.

image1 Figure 1-4. The 1excel_introspect_workbook.py Python script in Anaconda Spyder

image2 Figure 1-5. The 1excel_introspect_workbook.py Python script in Notepad++ (Windows)

image3 Figure 1-6. The 1excel_introspect_workbook.py Python script in TextWrangler (macOS)

Line 3 imports the xlrd module’s open_workbook function so we can use it to read and parse an Excel file.

Line 7 uses the open_workbook function to open the Excel input file into an object I’ve named workbook. The workbook object contains all of the available information about the workbook, so we can use it to retrieve individual worksheets from the workbook.

Line 8 prints the number of worksheets in the workbook.

Line 9 is a for loop that iterates over all of the worksheets in the workbook. The workbook object’s sheets method identifies all of the worksheets in the workbook.

Line 10 prints the name of each worksheet and the number of rows and columns in each worksheet to the screen. The print statement uses the worksheet object’s name attribute to identify the name of each worksheet. Similarly, it uses the nrows and ncols attributes to identify the number of rows and columns, respectively, in each worksheet.

If you created the file in the Spyder IDE, then to run the script:

  1. Click on the Run drop-down menu in the upper-left corner of the IDE.

  2. Select “Configure”

  3. After the Run Settings window opens, select the “Command line options” check box and enter“sales_2013.xlsx” (see Figure 1-7).

  4. Make sure the “Working directory” is where you saved the script and Excel file.

  5. Click Run.

image4 Figure 1-7. Specifying command line options in Anaconda Spyder

When you click the Run button (either the Run button in the Run Settings window or the green Run button in the upper-left corner of the IDE) you’ll see the output displayed in the Python console in the lower righthand pane of the IDE. Figure 1-7 displays the Run drop-down menu, the key settings in the Run Settings window, and the output inside red boxes.

Alternatively, you can run the script in a Command Prompt or Terminal window. To do so, use one of the following commands, depending on your operating system.

On Windows:

python 1excel_introspect_workbook.py sales_2013.xlsx

On macOS:

chmod +x 1excel_introspect_workbook.py ./1excel_introspect_workbook.py sales_2013.xlsx

You should see the output shown in Figure 1-8 (for Windows) or Figure 1-9 (for macOS) printed to the screen.

image1 Figure 1-8. Output of Python script in a Command Prompt window (Windows)

image2Figure 1-9. Output of Python script in a Terminal window (macOS)

The first line of output shows that the Excel input sales_2013.xlsx file, , contains three worksheets. The next three lines show that the three worksheets are january_2013 february_2013 march_2013 named , , and . They also show that each of the worksheets contains seven rows, including the header row, and five columns.

Now that we know how to use Python to introspect an Excel workbook, let’s learn how to parse a single worksheet in different ways. We’ll then extend that knowledge to iterate over multiple worksheets and then to iterate over multiple workbooks.