A common task when starting an analysis is to get all of our data lined up in a format where we can actually do something useful with it. For this project we have a number of data sources that we want to scrub and then align.
File Format
The dimm and mass data include six datetime columns and one value column.
% head 2009.dimm.dat
2009 09 22 07 36 36 0.41370476
2009 09 22 07 38 02 0.44429658
...
% head 2009.mass.dat
2009 09 22 07 35 41 0.26
2009 09 22 07 37 06 0.23
...
The weather data is as follows:
year month day hour minute(HST) wind_speed(kts) wind_direction(deg) temperature(C) relative_humidity(%) pressure(mb, if present)
% head cfht-wx.2009.dat
2008 12 31 23 59 9 274 -4.79 95 612.8
2009 01 01 00 04 10 275 -4.83 94 612.8
...
Our goal is to put all of these items into one DataFrame with a datetime index and columns for dimm, mass, and the various weather pieces. To do all this we will need to do a little pre-processing of the files, which exist for years 2009 to 2012, a little pulling of that data into pandas, and then a little manipulation in order to align everything properly. So, let's see how that's done. Note that all of our data is freely available from the following:
- http://wxws.ifa.hawaii.edu/current/seeing/data/ (dimm and mass)
- http://mkwc.ifa.hawaii.edu/archive/wx/cfht/ (weather data)
Pre-Processing
It is a whole lot easier to concatenate all the datetime fields into one and then just pull those into pandas. While we could do that as we pull in the data, it is probably faster to just do it on the command line using our good buddy awk:
% for file in `ls *{dimm,mass}*`; do
for> cp $file $file.bak;
for> awk '{print $1$2$3$4$5$6,$7}' $file.bak > $file;
for> done
% cat *dimm.dat > all_dimm.dat
% cat all_dimm.dat | sort | uniq -w 14 > all_dimm_uniq.dat
% cat *mass.dat > all_mass.dat
% cat all_mass.dat | sort | uniq -w 14 > all_mass_uniq.dat
% head -n2 all_dimm_uniq.dat
20090922073636 0.41370476
20090922073802 0.44429658
% tail -n2 all_dimm_uniq.dat
20120925052836 0.29
20120925190417 nan
There are probably sexier ways to accomplish the above without the unwieldy print statement in the awk, but it took me all of seven seconds to do, so I call it good. Here you can see we process all the dimm and mass files, eliminating the spaces between the datetime fields, and then we concatenate the files into all_dimm.dat and all_mass.dat, respectively. Because we have identified duplicate time entries in the files, we sort and uniq the .dat files, eliminating the dupes based on the first 14 characters (the datetime stamp). Wash and repeat with the weather data using an even more obtuse print statement. Note that the weather data doesn't include seconds, so we just add them in with the print statement. In fact, I should show that just because of its hideousness:
For the record, I am using
awk "{print \$1\"00\",\$2,\$3,\$4,\$5,\$6}" all_weather.dat.bak > all_weather.dat
In the end we should have three files, all_dimm.dat, all_mass.dat, and all_weather.dat, all of which have a unique datetime value for the first column, followed by a space, and then any remaining values (which is just dimm for dimm, mass for mass, and all those other things listed above for weather). So let's start pulling that into pandas.
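For those who would rather stay in Python, the same pre-processing (concatenate the datetime fields, then dedupe on the timestamp) could be sketched roughly like this. The file names match the ones above, but treat this as a sketch rather than the pipeline I actually ran:

```python
import glob

def merge_datetime_fields(in_paths, out_path, n_dt_fields=6):
    """Join the first n_dt_fields of each line into one timestamp, keep the
    remaining fields, and drop duplicate timestamps (first one wins),
    mirroring the awk + sort + uniq -w 14 pipeline above."""
    rows = {}
    for path in sorted(in_paths):
        with open(path) as f:
            for line in f:
                fields = line.split()
                stamp = ''.join(fields[:n_dt_fields])
                rows.setdefault(stamp, fields[n_dt_fields:])  # first wins
    with open(out_path, 'w') as out:
        for stamp in sorted(rows):
            out.write(' '.join([stamp] + rows[stamp]) + '\n')

# dimm/mass have six datetime fields; the weather files would need
# n_dt_fields=5 plus the extra "00" tacked on for seconds.
paths = glob.glob('*dimm*.dat')
if paths:
    merge_datetime_fields(paths, 'all_dimm_uniq.dat')
```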
Pull Into Pandas
I am using

% ipython3 notebook --pylab=inline

for my work. To get it into fancy blog format, see Fernando Perez's helpful post. I had to alter this to work with Python 3.2 and will be sending my pull request to him shortly (it's kind of hackish for just my system right now. Hey, I'm a busy guy).
In [50]:
cd /home/wtgee/Projects/seeingstats/
In [51]:
import pandas as pd
In [52]:
dimm = pd.read_table('data/all_dimm.dat', sep=' ', header=None, index_col=0, names=['date','dimm'], parse_dates=True)
In [53]:
mass = pd.read_table('data/all_mass.dat', sep=' ', header=None, index_col=0, names=['date','mass'], parse_dates=True)
In [54]:
weather = pd.read_table('data/all_weather.dat', sep=' ', header=None, index_col=0, names=['date','wind_speed','wind_dir','temp','rel_humidity','pressure'], parse_dates=True)
So what we have done is use pd.read_table to pull in each of the respective files. Nothing tricky here: we specify the file, a space separator, and no header; set the first column as the index; and pass parse_dates=True so that we get a DatetimeIndex. We also name our columns at the same time.
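As an aside, if parse_dates=True ever guesses wrong on the 14-digit stamps, you can parse them explicitly with a format string instead. A small sketch, with two inline lines standing in for data/all_dimm.dat:

```python
import io
import pandas as pd

# Two lines standing in for data/all_dimm.dat.
sample = "20090922073636 0.41370476\n20090922073802 0.44429658\n"

# Read the stamp as a string so its layout survives, then parse it with
# an explicit format rather than letting pandas guess.
dimm = pd.read_table(io.StringIO(sample), sep=' ', header=None,
                     names=['date', 'dimm'], dtype={'date': str})
dimm.index = pd.to_datetime(dimm.pop('date'), format='%Y%m%d%H%M%S')
```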
At this point we have three DataFrames that we would like to combine into one. However, before we can do that we need to do a little more scrubbing.
In [55]:
dimm.index
Out [55]:
In [56]:
dimm.head(3)
Out [56]:
In [57]:
mass.head(3)
Out [57]:
In [58]:
weather.head(3)
Out [58]:
As you can see above, the DatetimeIndex for each set currently has no frequency. Furthermore, our data is not aligned in terms of timestamps (for instance, our first dimm reading is at 07:36:36 while our first mass reading is at 07:35:41). Since our weather data is in five-minute intervals, and this seems like a pretty sane default to start with, we will need to do some massaging of the dimm and mass data to get them there as well. The general technique is to convert the DataFrames to a frequency of seconds, forward-filling (or 'pad'ding) the data as we do so, and then resample all of that into five-minute intervals using the mean values.
Here's where we pull out our beefy computer since we are going to be creating data values for every second for a three year period. Luckily, our work is easy while the computer's work is hard:
In [59]:
dimm = dimm.asfreq('S', method='pad').resample('5min', how='mean')
In [60]:
mass = mass.asfreq('S', method='pad').resample('5min', how='mean')
In [61]:
weather = weather.asfreq('S', method='pad').resample('5min', how='mean')
In [62]:
dimm.head(3)
Out [62]:
In [63]:
mass.head(3)
Out [63]:
In [64]:
weather.head(3)
Out [64]:
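One caveat: the how= keyword in the cells above is from the pandas API of the day. If you are following along on a newer pandas, the equivalent chain should look something like this (a sketch, with a toy frame standing in for dimm):

```python
import pandas as pd

# Toy stand-in for the dimm frame: irregular timestamps, one value column.
dimm = pd.DataFrame(
    {'dimm': [0.41, 0.44, 0.50]},
    index=pd.to_datetime(['2009-09-22 07:36:36',
                          '2009-09-22 07:38:02',
                          '2009-09-22 07:43:10']))

# Forward-fill to one row per second, then average into 5-minute bins.
dimm_5min = dimm.asfreq('s', method='pad').resample('5min').mean()
```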
One last simple task for now is to trim all three DataFrames down to the date range for which we have dimm and mass data:
In [66]:
weather = weather['2009-09-22':'2012-09-22']
In [67]:
dimm = dimm['2009-09-22':'2012-09-22']
In [68]:
mass = mass['2009-09-22':'2012-09-22']
So at this point we have data for a specified range of dates. Note, however, that our individual times are going to be off. For the weather data we have a reading every five minutes, 24 hours a day, while for the dimm and mass we have five-minute intervals but only during observing times, that is, sunset to sunrise. Because of the nature of the instruments, however, the dimm and mass data don't line up precisely on what they consider sunset and sunrise. So next time we will look at how to get values for each of the DataFrames that actually line up.
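As a quick preview of where this is headed: once all three frames share the same 5-minute DatetimeIndex, combining them into the single DataFrame we set out to build is just a column-wise concat. A toy sketch with made-up values:

```python
import pandas as pd

idx = pd.date_range('2009-09-22 07:35', periods=3, freq='5min')
dimm = pd.DataFrame({'dimm': [0.42, 0.44, 0.43]}, index=idx)
mass = pd.DataFrame({'mass': [0.26, 0.23, 0.25]}, index=idx)
weather = pd.DataFrame({'wind_speed': [9, 10, 8]}, index=idx)

# axis=1 aligns on the shared DatetimeIndex; any missing rows become NaN.
combined = pd.concat([dimm, mass, weather], axis=1)
```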