Many of us find our self in a situation were we have to evaluate a chunk of data in the form of xlsx or csv. These works are mostly repetitive with same set of actions of operation or action performed on them everyday with same data sheets with different data.
This begs the question on our mind, what if there is a tool to do this or a way to automate these action. When I had this question and searching internet for a solution, I found out most people used Python to write a script which will do their required action on those data sheets avoiding these daunting and repetitive work by a single click.
Isn't it amazing? In this post I will be sharing what I have learnt, that will help you write your own scripts. We will start with few basic and will work our way up. Lets get started,
Installing Packages
To work with spreadsheet we have to install few package which supports them. Below is the list of popular packages used we can and will be few of them in our post.
openpyxl - The recommended package for reading and writing Excel 2010 files (ie: .xlsx)
xlsxwriter - An alternative package for writing data, formatting information and, in particular, charts in the Excel 2010 format (ie: .xlsx)
xlrd - This package is for reading data and formatting information from older Excel files (ie: .xls)
xlwt - This package is for writing data and formatting information to older Excel files (ie: .xls)
xlutils - This package collects utilities that require both xlrd and xlwt, including the ability to copy and modify or filter existing excel files.
PyXLL - PyXLL is a commercial product that enables writing Excel add-ins in Python with no VBA. Python functions can be exposed as worksheet functions (UDFs), macros, menus and ribbon tool bars.
Let us start with Openpyxl,
To install, open up a command prompt and type in the bellow command,
pip install openpyxl
On my system I have already installed openpyxl and using below command you can list out its version
import openpyxl
print(openpyxl.__version__)
Now lets see how to open an excel sheet, Openpyxl provide a method named load_workbook which is used to load an spreadsheet into an object. Spreadsheet consists of sheets so it is required to select a sheet when a workbook is loaded, so to do that there is an another method named get_sheet_by_name.
After doing that, the object will have access to all the data of selected worksheet's sheet. We will be able to perform our operations on this sheet using only this object.
wb = openpyxl.load_workbook('filename')
sheet = wb.get_sheet_by_name('sheetname')
Apart for this you can even get a list of sheets in an excel using a method named get_sheet_names. This method return the sheet name in the form of a list so it can re used effectively.
sheet_names = wb.get_sheet_names()
There is a method named title, which takes no parameter will return a string containing the name of the current sheet
sheet.title
When working with real time data you might face a problem, you might have applied a formula on a row or column based on its row or column count. If you define these value it becomes static and your script might not run properly on other datasheet. Openpyxl has this really useful method named max_row and max_column which will return the integer value.
print("Max row : ", sheet.max_row)
print("Max column : ", sheet.max_column)
Similarly we have a method named value which is used to retrieve value in a cell, there are lot of ways we can retrieve these data we will show them all below so choose which is best for you.
c = sheet['A1'].value
print(c)
By below way we can more ergonomically reuse them in loops for accessing a array of data
a3 = sheet.cell(row=1, column=1).value
print(a3)
for i in range(1, 10):
print(sheet.cell(row=i, column=19).value)
So far we have seen ways on how to read data from the spreadsheet. Let see how to write data, as similarly from getting data we same simple assign a value or a formula to a spreadsheet using value method to define.
sheet.cell(row=row_number, column=column_number).value = 'value'
now using loops,
print("Data before Change\n")
for i in range(1, 10):
print(sheet.cell(row=i, column=19).value)
print("Data before Change\n")
for i in range(1, 10):
sheet.cell(row=i, column=19).value = '02-12-2020'
print(sheet.cell(row=i, column=19).value)
Lets now change the properties of the cell and its value. Some times we have certain cell contains text or value which is large and gets wrapped around but when spreadsheet loads it will open each cell in default height and width. Since this is the case we have to enlarge these cell our self to see its value.
Openpyxl even has methods to set these height and width using row_dimensions and column_dimensions. Note that for row we have to pass in the row number and for column we have to pass column alphabet.
sheet.row_dimensions[1].height = 70
sheet.column_dimensions['A'].width = 20
We can also change apply styles in the cell using Font package so make sure to import them(use alias for convenience) and then use method Font.
from openpyxl.styles import Font
sheet['A1'].font = Font(sz=14, bold=True, italic=True)
It is important to note that none of the data you write or change you make in sheet will take effect unless you save it. To do that we have to use save method.
wb.save('filename')
Lets now create a sheet, this time we will mention on which index they will be located, usually when creating new sheet if index is not mentioned it will be placed at the end if specified it will be placed on the specified place when created. To do that just pass in a argument named index with the placement index value(integer).
wb.create_sheet(title = 'temp', index = 2)
We can also apply a value, passing it as formula so when sheet is opened it will apply the value and display data.