Excel data analysis with Python

a

I don’t know whether I can be considered a data steward or maybe more like a data janitor for my department, but what I know is that my work consists of a lot of data wrangling in Excel. Some are routine like generating monthly reports, while the rest ad-hoc analytics and policy making, especially important as we are building a new hospital and setting up new systems.

There is of course tremendous amount of time saving if you actually use some programming skillz to automate the data analysis. One tricky part though is that the input and output must be in Excel files (non-technical people will want to view the output), and those are not exactly the most pleasant file type to work with from programming perspective.

Luckily there are a lot of options to parse and write Excel files in Python – my language of choice for quick-and-dirty scripting – and goodness gracious I have tried almost all of them. So here are the methods I have used and their pros and cons.

Convert to CSV: death to proprietary file formats!

Probably the most straightforward way to import the data to Python constructs is to just convert the Excel file to CSV, either by manually using the “Save As” function in Excel or using tools like csvkit. Then you can use csvkit again for some basic manipulation like merging workbooks and performing joins.

This is not strictly Python only, because csvkit is a command line program and you can read the resulting CSV with whatever libraries available including the built in csv module in Python.

This method is simple and probably the fastest to write for basic tasks like merging sheets. The csvkit docs gives a nice overview of how to do those tasks. However, you need to use other method to write Excel workbooks (or write CSVs and then convert to Excel manually). Also, in case you haven’t noticed yet, CSVs do not support rich formatting.

Datanitro: Excel macros for Python literates

Datanitro is a great tool (was free for personal use some time ago but not anymore) that basically enables you to write macros in Python that are able to modify the workbook in place while the Excel application is open. It is an alternative to the built-in Visual Basic for Applications (VBA) macros, which are to be frank rather unpleasant to write in.

I like the simplicity and also the ability to retain and apply formatting, which can also save quite some tedious time reformatting output workbooks. This makes Datanitro a very good choice for reports that needs to be “pretty”, and in my case I am also able to use it together with the charting capabilities of Excel. I created a “template” Excel file; whenever I want to generate the report, I copy the data into the template, run the Datanitro script, and the Save As the workbook to a new file.

The biggest turnoff is that Datanitro scripts, while valid Python programs, cannot be run without the Datanitro itself installed. Installation requires administrative rights and also money (after the free trial expires), which a lot of my colleagues do not have. It also means that those scripts are vendor-locked-in to the Datanitro platform.

Another issue is that it simply cannot access more than one workbook at the same time, thus tasks that uses more than one workbook at the same time are impossible to automate this way. EDIT: Datanitro does support opening multiple workbooks! Ben the co-founder showed it to me. Thanks :)

Parsing: let’s do some real programming

This is of course by far the most flexible automation method that you can come up with, and it is basically writing a script that reads, processes, and writes Excel files. There are modules in Python that provide common tasks so that you do not need to write routines for parsing the Excel workbooks.

openpyxl

openpyxl is the one that I used the most. The APIs are convenient (with several high-level functions like accessing ranges), concise, and well documented. Besides accessing cells, cell ranges, and sheets, as well as writing Excel files from scratch, openpyxl can also perform basic formatting tasks like setting bold/italic font, text colour, and row/column sizes. This is really helpful if you want to create Excel files from scratch that are readily human-digestible.

My only gripe about openpyxl now is probably just performance. It takes almost a full minute to parse one workbook with (3 sheets x 1000 rows x 10 columns) in an i5 laptop. Last time I checked it doesn’t seem to be compatible yet with pypy.

pandas

pandas is the module to use for any serious number-crunching in Python; users of Matlab and R will feel right at home using this tool. It has read and write functions to several common data format, including Excel workbooks. It imports the Excel workbook into a Pandas DataFrame objects, with which you can analyse and manipulate using any of the provided methods. Pandas has extensive utility functions for merging, joining, grouping, filtering, performing calculations, as well as statistical concepts like standard descriptive analysis (mean/median/mode/standard error) and moving average. Together with matplotlib for plotting/charting capabilities and IPython for interactive programming, it enables you to perform advanced analysis pretty quickly.

I have begun to use this method more often the more I learn about Pandas. The learning curve is definitely higher than just using openpyxl – DataFrames can have their own quirks – but the reward is significant. Anyway when you parse using openpyxl, usually you are building a data structure in memory; rather than doing that, might as well just use Pandas DataFrames.

Performance wise it is comparable to openpyxl as pandas uses it under the hood. The stock to_excel() method does not allow any formatting customizations though. Also, I find that loading from CSV is much faster than loading from Excel, so you may want to do a one-time conversion to CSV (via from_excel() then to_csv(), or use csvkit tool elaborated in a previous section) if you load a workbook often.

xlwt / xlrd

The modules xlrd and xlwt reads and writes to Excel 2003 (.xls) format. I do not have much experience using these modules, since openpyxl is usually what I need, but in case you still need to parse .xls files, you know where to look.

Using SQL in Excel

A lot of common data cleaning and analysis tasks become easier if you are able to use SQL over an Excel database, but doing this without jumping through the hoops is not trivial. A few ways doing this I have yet to explore fully include:

  • Using utilities like textql: I’m not able to find one that is easily downloadable. textql requires compilation using the Go development environment which for various reason I have not been doing.
  • pandasql enables you to run SQL queries in Pandas DataFrames.
  • Using Pandas from_excel() to get the DataFrame, and then to_sql() to put it inside an SQLite database (can be in-memory too). Then use read_sql_table() and read_sql_query() for queries.
  • Import the Excel files to Microsoft Access. I find the UI rather clunky (e.g. unable to import multiple Excel files at one go) but MS Access is a real database. Not to mention that it is rather expensive.

In general though, vanilla Pandas DataFrames can usually suit your needs and are probably faster for multi-megabyte workbooks.

Conclusion, or TL;DR

Invest time in studying pandas.

9 responses

  1. (VBA) macros, which are to be frank rather unpleasant to write in

    That’s just your opinion. It’s been around for 20 years.
    With the exception of objects like array, it works well and is like a 100% custom-made suit for Excel.

    So other “work-arounds” are pleasant and viable ?
    * Buy datanitro
    * try to get it installed & working on user systems.

    • It is my opinion that VBA is less pleasant to write in as compared to Python. :)

      A lot of people get started on this kind of automation using VBA. Rather than unpleasant, it is probably more applicable in general to say that VBA is a dead-end platform that is not usable outside Office. Instead of committing a large amount of time to learn VBA I would think learning a more general-purpose language like Python would be a better approach.

      • Absolutely. Maintaining VBA code, which usually ends up tied to the spreadsheet that it lives in, is the worst part. Writing (or at least maintaining / modifying) nice, standalone Python is something that just about any engineer with even a remote bit of programming knowledge can do.

Leave a Reply