5.2. Concatenate Data from Multiple Workbooks¶
BASE PYTHON
To concatenate data from all of the worksheets in multiple workbooks vertically into one output file with base Python, type the following code into a text editor and save the file as 13excel_ concat_data_from_multiple_workbooks.py :
>>> #!/usr/bin/env python3
>>> import glob
>>> import os
>>> import sys
>>> from datetime import date
>>> from xlrd import open_workbook, xldate_as_tuple
>>> from xlwt import Workbook
>>> input_folder = sys.argv[1]
>>> output_file = sys.argv[2]
>>> output_workbook = Workbook()
>>> output_worksheet =output_workbook.add_sheet('all_data_all_workbooks')
>>> data = []
>>> first_worksheet = True
>>> for input_file in glob.glob(os.path.join(input_folder, '*.xls*')):
>>>
>>> print (os.path.basename(input_file))
>>>
>>> with open_workbook(input_file) as workbook:
>>>
>>> for worksheet in workbook.sheets():
>>>
>>> if first_worksheet:
>>>
>>> header_row =worksheet.row_values(0)
>>>
>>> data.append(header_row)
>>>
>>> first_worksheet = False
>>>
>>> for row_index in range(1,worksheet.nrows):
>>>
>>> row_list = []
>>>
>>> for column_index in range(worksheet.ncols):
>>>
>>> cell_value =worksheet.cell_value(row_index,column_index)
>>>
>>> cell_type =worksheet.cell_type(row_index, column_index)
>>>
>>> if cell_type == 3:
>>>
>>> date_cell =xldate_as_tuple(cell_value,workbook.datemode)
>>>
>>> date_cell =date(*date_cell[0:3]).strftime('%m/%d/%Y')
>>>
>>> row_list.append(date_cell)
>>>
>>> else:
>>>
>>> row_list.append(cell_value)
>>>
>>> data.append(row_list)
>>> for list_index, output_list in enumerate(data):
>>>
>>> for element_index, element in enumerate(output_list):
>>>
>>> output_worksheet.write(list_index,element_index, element)
>>> output_workbook.save(output_file)
Line 13 creates a Boolean (i.e., True/False) variable named first_worksheet that we use to distinguish between the first worksheet and all of the subsequent worksheets we process. For the first worksheet we process, line 18 is True so we append the header row into data and then set first_worksheet equal to False.
For the remaining data rows in the first worksheet and all of the subsequent worksheets, we skip the header row and start processing the data rows. We know that we start at the second row because the range function in line 22 starts at one instead of zero.
To run the script, type the following on the command line and hit Enter:
python 13excel_ concat_data_from_multiple_workbooks.py "C:UsersClintonDesktop"output_files13output.xls
You can then open the output file, 13output.xls , to review the results.
PANDAS
Pandas provides the concat function for concatenating DataFrames. If you want to stack the DataFrames vertically on top of one another, then use axis=0. If you want to join them horizontally side by side, then use axis=1. Alternatively, if you need to join the DataFrames together based on a key column, the pandas merge function provides these SQL join–like operations.
To concatenate data from all of the worksheets in multiple workbooks vertically into one output file with pandas, type the following code into a text editor and save the file as pandas_concat_data_from_multiple_workbooks.py:
>>> #!/usr/bin/env python3
>>> import pandas as pd
>>> import glob
>>> import os
>>> import sys
>>> input_path = sys.argv[1]
>>> output_file = sys.argv[2]
>>> all_workbooks =glob.glob(os.path.join(input_path,'*.xls*'))
>>> data_frames = []
>>> for workbook in all_workbooks:
>>> all_worksheets = pd.read_excel(workbook,sheetname=None, index_col=None)
>>> for worksheet_name, data in all_worksheets.items():
>>> data_frames.append(data)
>>> all_data_concatenated = pd.concat(data_frames, axis=0,ignore_index=True)
>>> writer = pd.ExcelWriter(output_file)
>>> all_data_concatenated.to_excel(writer,sheet_name='all_data_all_workbooks',index=False)
>>> writer.save()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-66ddec2f63fd> in <module>
12 for worksheet_name, data in all_worksheets.items():
13 data_frames.append(data)
---> 14 all_data_concatenated = pd.concat(data_frames, axis=0,ignore_index=True)
15 writer = pd.ExcelWriter(output_file)
16 all_data_concatenated.to_excel(writer,sheet_name='all_data_all_workbooks',index=False)
/usr/lib/python3/dist-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
272 ValueError: Indexes have overlapping values: ['a']
273 """
--> 274 op = _Concatenator(
275 objs,
276 axis=axis,
/usr/lib/python3/dist-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
329
330 if len(objs) == 0:
--> 331 raise ValueError("No objects to concatenate")
332
333 if keys is None:
ValueError: No objects to concatenate
To run the script, type the following on the command line and hit Enter:
python pandas_concat_data_from_multiple_workbooks.py "C:UsersClintonDesktop"output_filespandas_output.xls
You can then open the output file, review the results. pandas_output.xls , to