5.3. Sum and Average Values per Workbook andWorksheet¶
BASE PYTHON
To calculate worksheet- and workbook-level statistics for multiple workbooks with base Python, type the following code into a text editor and save the file as 14excel_sum_average_multiple_workbooks.py:
>>> #!/usr/bin/env python3
>>> import glob
>>> import os
>>> import sys
>>> from datetime import date
>>> from xlrd import open_workbook, xldate_as_tuple
>>> from xlwt import Workbook
>>> input_folder = sys.argv[1]
>>> output_file = sys.argv[2]
>>> output_workbook = Workbook()
>>> output_worksheet =output_workbook.add_sheet('sums_and_averages')
>>> all_data = []
>>> sales_column_index = 3
>>> header = ['workbook', 'worksheet','worksheet_total', 'worksheet_average','workbook_total','workbook_average']
>>> all_data.append(header)
>>> for input_file in glob.glob(os.path.join(input_folder, '*.xls*')):
>>>
>>> with open_workbook(input_file) as workbook:
>>>
>>> list_of_totals = []
>>>
>>> list_of_numbers = []
>>>
>>> workbook_output = []
>>>
>>> for worksheet in workbook.sheets():
>>> total_sales = 0
>>>
>>> number_of_sales = 0
>>>
>>> worksheet_list = []
>>>
>>> worksheet_list.append(os.path.basename(input_file))
>>>
>>> worksheet_list.append(worksheet.name)
>>>
>>> for row_index in range(1,worksheet.nrows):
>>>
>>> try:
>>>
>>> total_sales +=float(str(worksheet.cell_value(row_index,sales_column_index)).strip('$').replace(',',''))
>>>
>>> number_of_sales += 1.
>>>
>>> except:
>>>
>>> total_sales += 0.
>>>
>>> number_of_sales += 0.
>>>
>>> average_sales = '%.2f' % (total_sales/ number_of_sales)
>>>
>>> worksheet_list.append(total_sales)
>>>
>>> worksheet_list.append(float(average_sales))
>>>
>>> list_of_totals.append(total_sales)
>>>
>>> list_of_numbers.append(float(number_of_sales))
>>>
>>> workbook_output.append(worksheet_list)
>>>
>>> workbook_total = sum(list_of_totals)
>>>
>>> workbook_average =sum(list_of_totals)/sum(list_of_numbers)
>>>
>>> for list_element in workbook_output:
>>>
>>> list_element.append(workbook_total)
>>>
>>> list_element.append(workbook_average)
>>>
>>> all_data.extend(workbook_output)
>>>
>>> for list_index, output_list in enumerate(all_data):
>>>
>>> for element_index, element in enumerate(output_list):
>>>
>>> output_worksheet.write(list_index,element_index, element)
>>> output_workbook.save(output_file)
Line 12 creates an empty list named all_data to hold all of the rows we want to write to the output file. Line 13 creates a variable named sales_column_index to hold the index value of the Sale Amount column.
Line 14 creates the list of column headings for the output file and line 16 appends this list of values into all_data.
In lines 19, 20, and 21 we create three lists. The list_of_totals will contain the total sale amounts for all of the worksheets in a workbook. Similarly, list_of_numbers will contain the number of sale amounts used to calculate the total sale amounts for all of the worksheets in a workbook. The third list, workbook_output, will contain all of the lists of output that we’ll write to the output file.
In line 25, we create a list, worksheet_list, to hold all of the information about the worksheet that we want to retain. In lines 26 and 27, we append the name of the workbook and the name of the worksheet into worksheet_list. Similarly, in lines 38 and 39, we append the total and average sale amounts intoworksheet_list. In line 42, we append worksheet_list into workbook_output to store the information at the workbook level.
In lines 40 and 41 we append the total and number of sale amounts for the worksheet into list_of_totals and list_of_numbers, respectively, so we can store these values across all of the worksheets. In lines 43 and 44 we use the lists to calculate the total and average sale amount for the workbook.
In lines 45 to 47, we iterate through the lists in workbook_output (there are three lists for each workbook, as each workbook has three worksheets) and append the workbook-level total and average sale amounts into each of the lists.
Once we have all of the information we want to retain for the workbook (i.e., three lists, one for each worksheet), we extend the lists into all_data. We use extend instead of append so that each of the lists in workbook_output becomes a separate element in all_data. This way, after processing all three workbooks, all_data is a list of nine elements, where each element is a list. If instead we were to use append, there would only be three elements in all_data and each one would be a list of lists. To run the script, type the following on the command line and hit Enter:
python 14excel_sum_average_multiple_workbooks.py "C:UsersClintonDesktop"output_files14output.xls
You can then open the output file,14output.xls , to review the results.
PANDAS
Pandas makes it relatively straightforward to iterate through multiple workbooks and calculate statistics for the workbooks at both the worksheet and workbook levels. In this script, we calculate statistics for each of the worksheets in a workbook and concatenate the results into a DataFrame. Then we calculate workbook-level statistics, convert them into a DataFrame, merge the two DataFrames together with a left join on the name of the workbook, and add the resulting DataFrame to a list. Once all of the workbook-level DataFrames are in the list, we concatenate them together into a single DataFrame and write it to the output file.
To calculate worksheet and workbook-level statistics for multiple workbooks with pandas, type the following code into a text editor and save the file as pandas_sum_average_multiple_workbooks.py#!/usr/bin/env python3:
>>> import pandas as pd
>>> import glob
>>> import os
>>> import sys
>>> input_path = sys.argv[1]
>>> output_file = sys.argv[2]
>>> all_workbooks =glob.glob(os.path.join(input_path,'*.xls*'))
>>> data_frames = []
>>> for workbook in all_workbooks:
>>> all_worksheets = pd.read_excel(workbook,sheetname=None, index_col=None)
>>> workbook_total_sales = []
>>> workbook_number_of_sales = []
>>> worksheet_data_frames = []
>>> worksheets_data_frame = None
>>> workbook_data_frame = None
>>> for worksheet_name, data in all_worksheets.items():
>>> total_sales =pd.DataFrame([float(str(value).strip('$').replace(',',''))
>>> for value in data.loc[:, 'SaleAmount']]).sum()
>>> number_of_sales = len(data.loc[:, 'SaleAmount'])
>>> average_sales = pd.DataFrame(total_sales /number_of_sales)
>>> workbook_total_sales.append(total_sales)
>>> workbook_number_of_sales.append(number_of_sales)
>>> data = {'workbook':os.path.basename(workbook),'worksheet': worksheet_name,'worksheet_total': total_sales,'worksheet_average': average_sales}
>>> worksheet_data_frames.append(pd.DataFrame(data, columns=['workbook', 'worksheet', 'worksheet_total', 'worksheet_average']))
>>> worksheets_data_frame = pd.concat(worksheet_data_frames, axis=0, ignore_index=True)
>>> workbook_total =pd.DataFrame(workbook_total_sales).sum()
>>> workbook_total_number_of_sales = pd.DataFrame(workbook_number_of_sales).sum()
>>> workbook_average = pd.DataFrame(workbook_total / workbook_total_number_of_sales)
>>> workbook_stats = {'workbook':os.path.basename(workbook),'workbook_total':workbook_total,'workbook_average':workbook_average}
>>> workbook_stats = pd.DataFrame(workbook_stats,columns=['workbook', 'workbook_total','workbook_average'])
>>> workbook_data_frame =pd.merge(worksheets_data_frame, workbook_stats, on='workbook', how='left')
>>> data_frames.append(workbook_data_frame)
>>> all_data_concatenated = pd.concat(data_frames, axis=0,ignore_index=True)
>>> writer = pd.ExcelWriter(output_file)
>>> all_data_concatenated.to_excel(writer,sheet_name='sums_and_averages', index=False)
>>> writer.save()
---------------------------------------------------------------------------
OptionError Traceback (most recent call last)
/usr/lib/python3/dist-packages/pandas/io/excel/_base.py in __new__(cls, path, engine, **kwargs)
655 try:
--> 656 engine = config.get_option(f"io.excel.{ext}.writer")
657 if engine == "auto":
/usr/lib/python3/dist-packages/pandas/_config/config.py in __call__(self, *args, **kwds)
232 def __call__(self, *args, **kwds):
--> 233 return self.__func__(*args, **kwds)
234
/usr/lib/python3/dist-packages/pandas/_config/config.py in _get_option(pat, silent)
104 def _get_option(pat: str, silent: bool = False):
--> 105 key = _get_single_key(pat, silent)
106
/usr/lib/python3/dist-packages/pandas/_config/config.py in _get_single_key(pat, silent)
90 _warn_if_deprecated(pat)
---> 91 raise OptionError(f"No such keys(s): {repr(pat)}")
92 if len(keys) > 1:
OptionError: "No such keys(s): 'io.excel.json.writer'"
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-9-f8c401cc32ff> in <module>
32 data_frames.append(workbook_data_frame)
33 all_data_concatenated = pd.concat(data_frames, axis=0,ignore_index=True)
---> 34 writer = pd.ExcelWriter(output_file)
35 all_data_concatenated.to_excel(writer,sheet_name='sums_and_averages', index=False)
36 writer.save()
/usr/lib/python3/dist-packages/pandas/io/excel/_base.py in __new__(cls, path, engine, **kwargs)
658 engine = _get_default_writer(ext)
659 except KeyError as err:
--> 660 raise ValueError(f"No engine for filetype: '{ext}'") from err
661 cls = get_writer(engine)
662
ValueError: No engine for filetype: 'json'
To run the script, type the following on the command line and hit Enter:
python pandas_sum_average_multiple_workbooks.py "C:UsersClintonDesktop"output_filespandas_output.xls
You can then open the output file,pandas_output.xls , to review the results.
We’ve covered a lot of ground in this lesson. We’ve discussed how to read and parse an Excel workbook, navigate rows in an Excel worksheet, navigate columns in an Excel worksheet, process multiple Excel worksheets, process multiple Excel workbooks, and calculate statistics for multiple Excel worksheets and workbooks. If you’ve followed along with the examples in this lesson, you have written 14 new Python scripts!
The best part about all of the work you have put into working through the examples in this lesson is that you are now well equipped to navigate and process Excel files, one of the most common file types in business. Moreover, because many business divisions store data in Excel workbooks, you now have a set of tools you can use to process the data in these workbooks regardless of the number of workbooks, the size of the workbooks, or the number of worksheets in each workbook. Now you can take advantage of your computer’s data processingcapabilities to automate and scale your analysis of data in Excel workbooks.