I don’t know whether I should be considered a data steward or more of a data janitor for my department, but what I do know is that my work involves a lot of data wrangling in Excel. Some of it is routine, like generating monthly reports, while the rest is ad-hoc analysis to support policy making, which is especially important as we are building a new hospital and setting up new systems.
There is of course a tremendous amount of time to be saved if you actually use some programming skillz to automate the data analysis. One tricky part though is that the input and output must be Excel files (non-technical people will want to view the output), and those are not exactly the most pleasant file type to work with from a programming perspective.
Luckily there are a lot of options to parse and write Excel files in Python – my language of choice for quick-and-dirty scripting – and goodness gracious I have tried almost all of them. So here are the methods I have used and their pros and cons.
Convert to CSV: death to proprietary file formats!
Probably the most straightforward way to import the data into Python constructs is to just convert the Excel file to CSV, either manually using the “Save As” function in Excel or with tools like csvkit. Then you can use csvkit again for some basic manipulation, like merging workbooks and performing joins.
This is not strictly Python-only, because csvkit is a command line program and you can read the resulting CSV with whatever libraries are available, including the built-in csv module in Python.
This method is simple and probably the fastest to write for basic tasks like merging sheets. The csvkit docs give a nice overview of how to do those tasks. However, you need another method to write Excel workbooks (or write CSVs and then convert them to Excel manually). Also, in case you haven’t noticed yet, CSVs do not support rich formatting.
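For the “merging sheets” case, for example, once each sheet has been saved as a CSV, the built-in csv module is enough. A minimal sketch (the filenames are hypothetical):

```python
import csv

def merge_csvs(paths, out_path):
    """Stack several CSV files that share the same header row."""
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)        # skip duplicate header rows
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)

# merge_csvs(["jan.csv", "feb.csv"], "merged.csv")
```

csvkit’s csvstack command does essentially the same thing from the command line.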
Datanitro: Excel macros for Python literates
Datanitro is a great tool (it was free for personal use some time ago, but not anymore) that basically enables you to write macros in Python that can modify the workbook in place while the Excel application is open. It is an alternative to the built-in Visual Basic for Applications (VBA) macros, which are, to be frank, rather unpleasant to write in.
I like the simplicity and also the ability to retain and apply formatting, which can save quite a lot of tedious time reformatting output workbooks. This makes Datanitro a very good choice for reports that need to be “pretty”, and in my case I am also able to use it together with the charting capabilities of Excel. I created a “template” Excel file; whenever I want to generate the report, I copy the data into the template, run the Datanitro script, and then use Save As to write the workbook to a new file.
The biggest turnoff is that Datanitro scripts, while valid Python programs, cannot be run without Datanitro itself installed. Installation requires administrative rights and also money (after the free trial expires), neither of which a lot of my colleagues have. It also means that those scripts are locked in to the Datanitro platform.
Another issue is that it simply cannot access more than one workbook at the same time, so tasks that use more than one workbook at once are impossible to automate this way. EDIT: Datanitro does support opening multiple workbooks! Ben the co-founder showed it to me. Thanks!
Parsing: let’s do some real programming
This is of course by far the most flexible automation method you can come up with: writing a script that reads, processes, and writes Excel files yourself. There are Python modules that handle the common tasks, so you do not need to write your own routines for parsing Excel workbooks.
openpyxl is the one that I used the most. The APIs are convenient (with several high-level functions like accessing ranges), concise, and well documented. Besides accessing cells, cell ranges, and sheets, as well as writing Excel files from scratch, openpyxl can also perform basic formatting tasks like setting bold/italic font, text colour, and row/column sizes. This is really helpful if you want to create Excel files from scratch that are readily human-digestible.
My only gripe about openpyxl now is probably just performance. It takes almost a full minute to parse one workbook (3 sheets x 1000 rows x 10 columns) on an i5 laptop. Last time I checked, it does not seem to be compatible with pypy yet.
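To give a rough illustration of the kind of openpyxl code involved, here is a sketch (the filename and contents are made up) that writes a small formatted sheet and reads a value back:

```python
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font

wb = Workbook()
ws = wb.active
ws.title = "Report"
ws.append(["Department", "Patients"])      # header row
ws.append(["Cardiology", 120])
ws["A1"].font = Font(bold=True)            # bold the headers
ws["B1"].font = Font(bold=True)
ws.column_dimensions["A"].width = 20       # widen the first column
wb.save("report.xlsx")

# Reading it back:
wb2 = load_workbook("report.xlsx")
print(wb2["Report"]["B2"].value)           # 120
```

A few lines of styling like this are often all it takes to make the output readily human-digestible.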
pandas is the module to use for any serious number-crunching in Python; users of Matlab and R will feel right at home with this tool. It has read and write functions for several common data formats, including Excel workbooks. It imports an Excel workbook into pandas DataFrame objects, which you can then analyse and manipulate using any of the provided methods. Pandas has extensive utility functions for merging, joining, grouping, filtering, and performing calculations, as well as statistical routines like standard descriptive statistics (mean/median/mode/standard error) and moving averages. Together with matplotlib for plotting/charting and IPython for interactive programming, it enables you to perform advanced analysis pretty quickly.
I have begun to use this method more often the more I learn about pandas. The learning curve is definitely steeper than just using openpyxl (DataFrames can have their own quirks), but the reward is significant. Besides, when you parse with openpyxl you usually end up building a data structure in memory anyway; rather than doing that, you might as well just use pandas DataFrames.
Performance-wise it is comparable to openpyxl, as pandas uses it under the hood. The stock to_excel() method does not allow any formatting customizations, though. Also, I find that loading from CSV is much faster than loading from Excel, so you may want to do a one-time conversion to CSV (via to_csv(), or with the csvkit tool elaborated in a previous section) if you load a workbook often.
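To give a flavour of the DataFrame workflow, here is a small sketch; the admissions data is made up, and in practice it would come from read_excel() or read_csv():

```python
import pandas as pd

# Made-up admissions data; in practice this would come from
# pd.read_excel("admissions.xlsx") or pd.read_csv("admissions.csv").
df = pd.DataFrame({
    "ward": ["A", "A", "B", "B", "B"],
    "stay_days": [3, 5, 2, 4, 6],
})

# Descriptive statistics per ward, one line each.
summary = df.groupby("ward")["stay_days"].agg(["mean", "median", "count"])

# Hand the result back to colleagues as a workbook:
# summary.to_excel("summary.xlsx")
print(summary)
```

Grouping, aggregating, and writing the result back out is a few lines each, which is exactly the kind of monthly-report chore this post is about.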
xlwt / xlrd
The modules xlrd and xlwt read and write the Excel 2003 (.xls) format. I do not have much experience with these modules, since openpyxl is usually all I need, but in case you still need to parse .xls files, you know where to look.
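For completeness, a minimal round-trip sketch with the two modules (the filename and contents are made up; note that xlrd returns numeric cells as floats):

```python
import xlwt
import xlrd

# Write a tiny .xls file with xlwt.
wb = xlwt.Workbook()
ws = wb.add_sheet("Sheet1")
ws.write(0, 0, "Department")
ws.write(0, 1, "Patients")
ws.write(1, 0, "Cardiology")
ws.write(1, 1, 120)
wb.save("legacy.xls")

# Read it back with xlrd; numeric cells come back as floats.
book = xlrd.open_workbook("legacy.xls")
sheet = book.sheet_by_index(0)
print(sheet.cell_value(1, 1))
```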
Using SQL in Excel
A lot of common data cleaning and analysis tasks become easier if you can run SQL over an Excel database, but doing this without jumping through hoops is not trivial. A few ways of doing this that I have yet to explore fully include:
- Using utilities like textql: I have not been able to find one that is easily downloadable. textql requires compilation using the Go development environment, which for various reasons I have not set up.
- pandasql enables you to run SQL queries in Pandas DataFrames.
- Using pandas’ read_excel() to get a DataFrame, and then to_sql() to put it inside an SQLite database (which can be in-memory too). Then you can query it with plain SQL.
- Importing the Excel files into Microsoft Access. I find the UI rather clunky (e.g. you cannot import multiple Excel files in one go), but MS Access is a real database. Not to mention that it is rather expensive.
In general though, vanilla Pandas DataFrames can usually suit your needs and are probably faster for multi-megabyte workbooks.
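The to_sql() route above can be sketched as follows, using Python’s built-in sqlite3 with an in-memory database (the table and data are made up; normally the DataFrame would come from read_excel()):

```python
import sqlite3
import pandas as pd

# Made-up admissions table; normally loaded with pd.read_excel(...).
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "ward": ["A", "A", "B", "B"],
    "stay_days": [3, 5, 2, 4],
})

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
df.to_sql("admissions", conn, index=False)

query = "SELECT ward, AVG(stay_days) AS avg_stay FROM admissions GROUP BY ward"
result = pd.read_sql_query(query, conn)
print(result)
```

Since the database is in-memory, nothing needs to be cleaned up afterwards; the result comes back as a DataFrame, so it can go straight to to_excel() if colleagues need it.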
Conclusion, or TL;DR
Invest time in studying pandas.