Visualize before you analyze data


Once, on an ordinary day at work in an ordinary meeting, colleague A claimed that, based on his monthly reports, our night shift staff (who start work at 8pm) had been seeing an increased workload of inpatient supplies compared to previous months. Inpatient supplies work like this: staff see the list of items requested by the ward nurses in the EMR (Electronic Medical Record) system and click “supply” on the items, labels print out, and staff arrange to physically transfer the labeled items to the ward. The EMR interfaces with another system (let’s call it System P here), which handles all charging and inventory functions.

His monthly reports were based on System P. For each report, he filtered the supply logs for all supplies performed between 8pm and 12 midnight, and compared the result with the same data from previous months.
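For readers who prefer code to spreadsheet filters, the report's logic can be sketched roughly as below. This is only an illustration in Python: the timestamps are made up, and `night_shift_counts` is a hypothetical helper, not anything taken from System P.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps as they might appear in a supply log.
# These values are invented purely for illustration.
logged_at = [
    datetime(2014, 6, 3, 20, 15),
    datetime(2014, 6, 3, 23, 40),
    datetime(2014, 6, 4, 9, 5),
    datetime(2014, 7, 1, 21, 30),
    datetime(2014, 7, 2, 22, 10),
    datetime(2014, 7, 2, 23, 59),
    datetime(2014, 7, 3, 12, 0),
]


def night_shift_counts(timestamps):
    """Count entries logged from 20:00 up to (not including) midnight, per month."""
    counts = Counter()
    for ts in timestamps:
        if ts.hour >= 20:  # hours 20..23, i.e. 8pm until just before midnight
            counts[(ts.year, ts.month)] += 1
    return dict(counts)


monthly = night_shift_counts(logged_at)
for (year, month), n in sorted(monthly.items()):
    print(f"{year}-{month:02d}: {n} supplies logged between 8pm and midnight")
```

A filter like this is only as meaningful as the timestamp it filters on, which is exactly where this story goes next.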

Now, it is common knowledge among the management that interface messages from EMR to System P are not delivered immediately. In most cases, a middleware buffers the load and releases it in off-hours, so that System P does not get overloaded during peak daily operations. However, colleague A insisted that his data was correct, and claimed that this particular workload was not buffered by the middleware. We also saw (by eyeballing the logs) that the times of the supplies were scattered throughout the day, which seemed to support his understanding.

We soon learned that attempting to summarize thousands of rows of data by eyeballing was not a very good idea.

We tried to visualize the distribution of the supplies workload throughout the day, and we plotted this:

Numbers are naturally not real and for illustration purposes only.

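The original chart was made in Excel, but the same idea (count log entries per hour of the day, then look at the shape) can be sketched in a few lines of Python. The data below is invented, and `hourly_counts` and `text_histogram` are hypothetical helper names; the bars are drawn as plain text so that nothing beyond the standard library is needed.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Return a list of 24 counts, one per hour of the day (0..23)."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def text_histogram(counts, width=40):
    """Render hourly counts as horizontal bars scaled to the largest count."""
    peak = max(counts) or 1  # avoid dividing by zero when every count is 0
    lines = []
    for hour, n in enumerate(counts):
        bar = "#" * round(n / peak * width)
        lines.append(f"{hour:02d}:00 {bar} {n}")
    return "\n".join(lines)


# Made-up sample: two big batches plus a thin scatter during the day.
sample = (
    [datetime(2014, 8, 1, 0, m) for m in range(30)]       # batch in the midnight hour
    + [datetime(2014, 8, 1, 12, m) for m in range(25)]    # batch in the noon hour
    + [datetime(2014, 8, 1, h, 0) for h in range(8, 19) for _ in range(3)]  # daytime scatter
)

counts = hourly_counts(sample)
print(text_histogram(counts))
```

Even in this crude form, two bars tower over everything else, which is hard to notice when scrolling through the raw rows.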

Aren’t the two prominent peaks at 12 noon and midnight very suspicious? It turned out that the interface messages were indeed buffered by the middleware and released twice a day, at 12 noon and at midnight! This meant that the data used for the analysis was invalid: the timestamp recorded the time the interface message was received by System P, not the time the work was actually performed in EMR.

The “noise” between 8am and 7pm was apparently the middleware flushing messages ad hoc in response to events such as patients getting discharged. This was what led us to believe that the messages were not buffered by the middleware at all when we were eyeballing the timestamps.

In the end, we got the actual workload logs from EMR, managed to figure out some causes of the perceived increase in workload, and reacted accordingly (by pre-empting some supplies and performing them during working hours instead, when more staff are available).

What just happened?

Had we not visualized the workload data then, we might have come to the wrong conclusion and implemented suboptimal policies, for example attributing the increase to a higher patient load and blindly throwing in more manpower to solve the issue.

When I saw the plot, I realized that we had almost fallen into the fallacy alluded to by Francis Anscombe: analyzing data before even attempting to visualize it.

Anscombe's Quartet: all four datasets are identical when examined using simple summary statistics, but vary considerably when graphed. Image taken from the Wikipedia article.
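To see this for yourself without the picture, here is a small Python sketch that computes the usual summary statistics for the four datasets, using the values as published in Anscombe's quartet. The `pearson` and `summarize` helpers are my own illustrative names, written with only the standard library.

```python
from math import sqrt
from statistics import mean, variance

# The four (x, y) datasets of Anscombe's quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand from deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Return (mean x, variance x, mean y, variance y, correlation), rounded to 2 d.p."""
    return (
        round(mean(xs), 2),
        round(variance(xs), 2),  # sample variance
        round(mean(ys), 2),
        round(variance(ys), 2),
        round(pearson(xs, ys), 2),
    )


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, stats)
```

Rounded to two decimal places, the four rows come out almost indistinguishable (a mean of 9 for x, about 7.5 for y, and a correlation of about 0.82 in every case), yet the four scatter plots look nothing alike.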


Note that I did not use any fancy algorithms or calculations; I just created a simple graph using common Excel functionality. Data analysis does not need to be complicated. People sometimes forget about common sense when faced with vast amounts of data, so do look back and check what you may have overlooked before drawing conclusions.
