3.2. Select Specific Columns Across All Worksheets¶
Sometimes an Excel workbook contains multiple worksheets and each of the worksheets contains more columns than you need. In these cases, you can use Python to read all of the worksheets, filter out the columns you do not need, and retain the columns that you do need.
As we learned earlier, there are at least two ways to select a subset of columns from a worksheet—by index value and by column heading. The following example demonstrates how to select specific columns from all of the worksheets in a workbook using the column headings.
BASE PYTHON
To select the Customer Name and Sale Amount columns across all of the worksheets with base Python, type the following code into a text editor and save the file as 10excel_column_by_name_all_worksheets.py:
>>>
>>> #!/usr/bin/env python3
>>> import sys
>>> from datetime import date
>>> from xlrd import open_workbook, xldate_as_tuple
>>> from xlwt import Workbook
>>> input_file = sys.argv[1]
>>> output_file = sys.argv[2]
>>> output_workbook = Workbook()
>>> output_worksheet =output_workbook.add_sheet('selected_columns_all_worksheets')
>>> my_columns = ['Customer Name', 'Sale Amount']
>>> first_worksheet = True
>>> with open_workbook(input_file) as workbook:
>>>
>>> data = [my_columns]
>>>
>>> index_of_cols_to_keep = []
>>>
>>> for worksheet in workbook.sheets():
>>>
>>> if first_worksheet:
>>>
>>> header = worksheet.row_values(0)
>>>
>>> for column_index in range(len(header)):
>>>
>>> if header[column_index] in my_columns:
>>>
>>> index_of_cols_to_keep.append(column_index)
>>>
>>> first_worksheet = False
>>>
>>> for row_index in range(1,worksheet.nrows):
>>>
>>> row_list = []
>>>
>>> for column_index in index_of_cols_to_keep:
>>>
>>> cell_value =worksheet.cell_value(row_index, column_index)
>>>
>>> cell_type =worksheet.cell_type(row_index, column_index)
>>>
>>> if cell_type == 3:
>>>
>>> date_cell =xldate_as_tuple(cell_value,workbook.datemode)
>>>
>>> date_cell =date(*date_cell[0:3]).strftime('%m/%d/%Y')
>>>
>>> row_list.append(date_cell)
>>>
>>> else:
>>>
>>> row_list.append(cell_value)
>>>
>>> data.append(row_list)
>>>
>>> for list_index, output_list in enumerate(data):
>>>
>>> for element_index, element in enumerate(output_list):
>>>
>>> output_worksheet.write(list_index,element_index, element)
>>> output_workbook.save(output_file)
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-8-ce141537685e> in <module>
10 my_columns = ['Customer Name', 'Sale Amount']
11 first_worksheet = True
---> 12 with open_workbook(input_file) as workbook:
13
14 data = [my_columns]
~/.local/lib/python3.9/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
109 else:
110 filename = os.path.expanduser(filename)
--> 111 with open(filename, "rb") as f:
112 peek = f.read(peeksz)
113 if peek == b"PK\x03\x04": # a ZIP file
FileNotFoundError: [Errno 2] No such file or directory: '-f'
Line 10 creates a list variable named my_columns that contains the names of the two columns we want to retain.
Line 13 places my_columns as the first list of values in data, as they are the column headings of the columns we intend to write to the output file. Line 14 creates an empty list named index_of_cols_to_keep that will contain the index values of the Customer Name and Sale Amount columns.
Line 16 tests if we’re processing the first worksheet. If so, then we identify the index values of the Customer Name and Sale Amount columns and append them into index_of_cols_to_keep. Then we set first_worksheet equal to False. The code continues and processes the remaining data rows, using line 24 toonly process the values in the Customer Name and Sale Amount columns.
For all of the subsequent worksheets, first_worksheet is False, so the script moves ahead to line 22 to process the data rows in each worksheet. For these worksheets, we only process the columns with the index values listed in index_of_cols_to_keep. If the value in one of these columns is a date, we format it as a date. After assembling a row of values we want to write to the output file, we append the list of values into data in line 36.
To run the script, type the following on the command line and hit Enter:
python 10excel_column_by_name_all_worksheets.py sales_2013.xlsxoutput_files10output.xls
You can then open the output file, 10output.xls , to review the results.
PANDAS
Once again, we’ll read all of the worksheets into a dictionary with the pandas read_excel function. Then we’ll select specific columns from each worksheet with the loc function, create a list of filtered DataFrames,and concatenate the DataFrames together into a final DataFrame.
In this example, we want to select the Customer Name and Sale Amount columns across all of the worksheets. To select these columns with pandas, type the following code into a text editor and save the file as pandas_column_by_name_all_worksheets.py:
>>> #!/usr/bin/env python3
>>> import pandas as pd
>>> import sys
>>> input_file = sys.argv[1]
>>> output_file = sys.argv[2]
>>> data_frame = pd.read_excel(input_file, shee_name=None,index_col=None)
>>> column_output = []
>>> for worksheet_name, data in data_frame.items():
>>> column_output.append(data.loc[:, ['Customer Name','Sale Amount']])
>>> selected_columns = pd.concat(column_output, axis=0,ignore_index=True)
>>> writer = pd.ExcelWriter(output_file)
>>> selected_columns.to_excel(writer,sheet_name='selected_columns_all_worksheets',index=False)
>>> writer.save()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-6a07f3c26604> in <module>
4 input_file = sys.argv[1]
5 output_file = sys.argv[2]
----> 6 data_frame = pd.read_excel(input_file, shee_name=None,index_col=None)
7 column_output = []
8 for worksheet_name, data in data_frame.items():
/usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
294 )
295 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296 return func(*args, **kwargs)
297
298 return wrapper
TypeError: read_excel() got an unexpected keyword argument 'shee_name'
To run the script, type the following on the command line and hit Enter:
python pandas_column_by_name_all_worksheets.py sales_2013.xlsxoutput_filespandas_output.xls
You can then open the output file, review the results.