Tuesday, October 2, 2012

Data Alignment with Pandas - Scrubbing Data

So I had written, or, rather, attempted to write, a long post about how I successfully installed Pandas, battled the dragons of library version numbers, slayed the inconsistencies in my setup, and emerged victorious with the Princess of Matplotlib firmly saddled to my pony, but in the end it turns out I have no idea how I got things working. But they are working and it's best to let sleeping dragons lie. (Actually, this is not quite true. I'm basically just using the system python3.2, installing things via apt and pip. Perhaps later I'll go back and set up an appropriate virtualenv with pythonbrew and blog about that.) From here on out I'll just assume you have things working too.

A common task when starting an analysis is to get all of our data lined up and into a format where we can actually do something useful with it. For this project we have a number of data sources that we want to scrub and then align.

File Format

The dimm and mass data include six datetime columns and one value column.
% head 2009.dimm.dat
2009 09 22 07 36 36 0.41370476
2009 09 22 07 38 02 0.44429658
...
% head 2009.mass.dat
2009 09 22 07 35 41 0.26
2009 09 22 07 37 06 0.23
...

The weather data is as follows:
year month day hour minute(HST) wind_speed(kts) wind_direction(deg) temperature(C) relative_humidity(%) pressure(mb-if present)
% head cfht-wx.2009.dat
2008 12 31 23 59 9 274 -4.79 95 612.8
2009 01 01 00 04 10 275 -4.83 94 612.8
...
Our goal is to put all of these items into one DataFrame with a datetime index and columns for dimm, mass, and the various weather pieces. To do all this we will need to do a little pre-processing of the files, which exist for years 2009 to 2012, a little pulling of that data into pandas, and then a little manipulation in order to align everything properly. So, let's see how that's done. Note that all of our data is freely available from the following:

Pre-Processing

It is a whole lot easier to concatenate all the datetime fields into one and then just pull those in to pandas. While we could do that as we pull in the data it is probably faster to just do it on the command line using our good buddy awk:

% for file in `ls *{dimm,mass}*`; do
for> cp $file $file.bak
for> awk '{print $1$2$3$4$5$6,$7}' $file.bak > $file
for> done
% cat *dimm.dat > all_dimm.dat
% cat all_dimm.dat | sort | uniq -w 14 > all_dimm_uniq.dat
% cat *mass.dat > all_mass.dat
% cat all_mass.dat | sort | uniq -w 14 > all_mass_uniq.dat
% head -n2 all_dimm_uniq.dat
20090922073636 0.41370476
20090922073802 0.44429658
% tail -n2 all_dimm_uniq.dat
20120925052836 0.29
20120925190417 nan

There are probably sexier ways to accomplish the above without the unwieldy print statement in the awk, but it took me all of seven seconds to do, so I call it good. Here you can see we process all the dimm and mass files, eliminating the spaces between the datetime fields, and then we concat all the files into an all_dimm.dat and all_mass.dat respectively. Because we have identified duplicate time entries in the files, we go ahead and sort and uniq the .dat files, eliminating the dupes based on the first 14 characters (the datetime stamp). Wash and repeat with the weather data using an even more obtuse print statement. Note that the weather data doesn't include seconds, so we just add them in with our print statement. In fact, I should show that just because of its hideousness:
awk "{print \$1\"00\",\$2,\$3,\$4,\$5,\$6}" all_weather.dat.bak > all_weather.dat
In the end we should have three files, all_dimm.dat, all_mass.dat, and all_weather.dat, all of which have a unique datetime value for the first column, followed by a space, and then any remaining values (which is just dimm for dimm, mass for mass, and all those other things listed above for weather). So let's start pulling that into pandas.
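If you'd rather stay out of the shell, the same preprocessing can be sketched in pure Python. This is my own alternative to the awk/sort/uniq pipeline above, not what the post actually ran; the function names and sample lines are made up for illustration:

```python
# Pure-Python sketch of the preprocessing: join the six datetime columns
# into one 14-digit stamp, then drop duplicate timestamps.

def merge_datetime_fields(lines):
    """Turn 'YYYY MM DD HH MM SS value' into ('YYYYMMDDHHMMSS', value) pairs."""
    out = []
    for line in lines:
        parts = line.split()
        stamp = ''.join(parts[:6])  # 14-character datetime stamp
        out.append((stamp, parts[6]))
    return out

def dedupe_by_stamp(rows):
    """Sort rows and keep only the first entry for each unique stamp."""
    seen = set()
    uniq = []
    for stamp, value in sorted(rows):
        if stamp not in seen:
            seen.add(stamp)
            uniq.append((stamp, value))
    return uniq

sample = [
    "2009 09 22 07 36 36 0.41370476",
    "2009 09 22 07 36 36 0.41370476",  # duplicate time entry
    "2009 09 22 07 38 02 0.44429658",
]
rows = dedupe_by_stamp(merge_datetime_fields(sample))
# rows == [('20090922073636', '0.41370476'), ('20090922073802', '0.44429658')]
```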

Pull Into Pandas

For the record, I am using ipython3 notebook --pylab=inline for my work. To get it into fancy blog format, see Fernando Perez's helpful post. I had to alter this to work with python3.2 and will be sending my pull request to him shortly (it's kind of hackish for just my system right now. Hey, I'm a busy guy).

In [50]:
cd /home/wtgee/Projects/seeingstats/
/home/wtgee/Projects/seeingstats
In [51]:
import pandas as pd
In [52]:
dimm = pd.read_table('data/all_dimm.dat', sep=' ', header=None, index_col=0, names=['date','dimm'], parse_dates=True)
In [53]:
mass = pd.read_table('data/all_mass.dat', sep=' ', header=None, index_col=0, names=['date','mass'], parse_dates=True)
In [54]:
weather = pd.read_table('data/all_weather.dat', sep=' ', header=None, index_col=0, names=['date','wind_speed','wind_dir','temp','rel_humidity','pressure'], parse_dates=True)
So what we have done is use pd.read_table to pull in each of the respective files. Nothing tricky here: we specify the file, a space separator, no header, and the first column as our index, which parse_dates=True turns into a DatetimeIndex. We also go ahead and name our columns at the same time.
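If parse_dates=True ever has trouble with the 14-digit stamps, an explicit parser pins the format down. The lambda below and its use via date_parser are my own suggestion, not something the calls above needed:

```python
from datetime import datetime

# Hypothetical explicit parser for the 14-digit YYYYMMDDHHMMSS stamps.
parse_stamp = lambda s: datetime.strptime(s, '%Y%m%d%H%M%S')

# e.g. pd.read_table('data/all_dimm.dat', sep=' ', header=None, index_col=0,
#                    names=['date','dimm'], date_parser=parse_stamp)
print(parse_stamp('20090922073636'))  # 2009-09-22 07:36:36
```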
At this point we have three DataFrames that we would like to combine into one. However, before we can do that we need to do a little more scrubbing.
In [55]:
dimm.index
Out [55]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-09-22 07:36:36, ..., 2012-09-25 19:04:17]
Length: 233060, Freq: None, Timezone: None
In [56]:
dimm.head(3)
Out [56]:
dimm
date
2009-09-22 07:36:36 0.413705
2009-09-22 07:38:02 0.444297
2009-09-22 07:40:53 0.421856
In [57]:
mass.head(3)
Out [57]:
mass
date
2009-09-22 07:35:41 0.26
2009-09-22 07:37:06 0.23
2009-09-22 07:38:33 0.22
In [58]:
weather.head(3)
Out [58]:
wind_speed wind_dir temp rel_humidity pressure
date
2008-12-31 23:59:00 9 274 -4.79 95 612.8
2009-01-01 00:04:00 10 275 -4.83 94 612.8
2009-01-01 00:09:00 10 272 -4.83 93 612.8
As you can see above, the DatetimeIndex for each set currently has no frequency. Furthermore, our data is not aligned in terms of timestamps (for instance, our first dimm reading is at 07:36:36 while our first mass reading is at 07:35:41). Since our weather data is in 5-minute intervals, and this seems like a pretty sane default to start with, we will need to do some massaging of the dimm and mass data to get them there as well. The general technique is to convert the DataFrames to a frequency of seconds, forward-filling (or 'pad'ding) the data as we do so, and then resample all of that data into 5-minute intervals using the mean values.
Here's where we pull out our beefy computer since we are going to be creating data values for every second for a three year period. Luckily, our work is easy while the computer's work is hard:
In [59]:
dimm = dimm.asfreq('S', how='pad').resample('5min', how='mean')
In [60]:
mass = mass.asfreq('S', how='pad').resample('5min', how='mean')
In [61]:
weather = weather.asfreq('S', how='pad').resample('5min', how='mean')
In [62]:
dimm.head(3)
Out [62]:
dimm
2009-09-22 07:40:00 0.429001
2009-09-22 07:45:00 0.421856
2009-09-22 07:50:00 0.636297
In [63]:
mass.head(3)
Out [63]:
mass
2009-09-22 07:40:00 0.236667
2009-09-22 07:45:00 0.230000
2009-09-22 07:50:00 0.320000
In [64]:
weather.head(3)
Out [64]:
wind_speed wind_dir temp rel_humidity pressure
2009-01-01 00:00:00 9 274 -4.79 95 612.8
2009-01-01 00:05:00 10 275 -4.83 94 612.8
2009-01-01 00:10:00 10 272 -4.83 93 612.8
One last simple task for now is to remove all the weather dates from 2009 that we do not have dimm and mass data for:
In [66]:
weather = weather['2009-09-22':'2012-09-22']
In [67]:
dimm = dimm['2009-09-22':'2012-09-22']
In [68]:
mass = mass['2009-09-22':'2012-09-22']
So at this point we have data for a specified range of dates. Note, however, that our individual times are going to be off. For the weather data we have a reading every five minutes for 24 hours each day, while for the dimm and mass we have five-minute intervals but only during observing times, that is, sunset to sunrise. Because of the nature of the instruments, however, the dimm and mass data don't line up precisely on what they consider sunset and sunrise. So next time we will look at how to get values for each of the DataFrames that actually line up.
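As a preview of where this is heading: once the frames share the same 5-minute grid, pandas can align them on the index with join. A minimal sketch on made-up values (the toy numbers and the choice of an inner join are mine, not the post's):

```python
import pandas as pd

# Two toy frames on 5-minute grids that only partially overlap; values invented.
idx = pd.date_range('2009-09-22 07:40', periods=3, freq='5min')
dimm = pd.DataFrame({'dimm': [0.43, 0.42, 0.64]}, index=idx)
mass = pd.DataFrame({'mass': [0.24, 0.23, 0.32]}, index=idx.shift(1))

# An inner join keeps only the timestamps present in both frames.
combined = dimm.join(mass, how='inner')
print(combined)
```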