## Monday, July 8, 2013

### Modeling Residential Electricity Usage with R – Part 2

I can’t believe it has been nearly 6 months since I last posted.  Given the sustained heat it seemed like a good idea to finish off this subject.

As hinted at in my last post, temperature is the missing variable to make sense of Residential electrical usage.  Fortunately there are some reasonable freely available historical weather databases, most notably the one provided by NOAA.  As covered in the last posting, we can download this data for a particular location, in this case Chicago O’Hare airport, KORD.

Next we have to find a suitable model for the usage pattern we observed in the last posting.  Usage is lowest at around 70F, and rises slightly as temperatures fall but rises significantly as temperatures rise.

Operationally we can imagine that as temperatures rise beyond what is comfortable, more and more cooling devices turn on to combat the heat.  However, eventually all the air conditioning equipment is on at full capacity and incremental demand drops off.  Based on this, a logistic function [http://en.wikipedia.org/wiki/Logistic_function] seems appropriate.

Furthermore, depending on the time of day, the high temperature and the low temperature may both be influential predictors.

Continuing from the code presented in the last posting, we can fit a logistic function in temperature to the usage data for each hour of the day, taking into account whether it is a weekday, the time of the year and the temperature function:

allResults=NULL   # create the variable for results for each hour
startList=list(intercept=1.5,weekday=-.02,Time=0,S1=.04,C1=-.03,S2=-.02,
C2=-.04,S3=.001,C3=.01,maxT=87,maxS=4,maxScalar=1,
maxSlope=0,minSlope=0)
# initial variable value

for (hourInd in 1:24){  # iterate through hours; create a model for each hour

test2<-nls(ComEd[,(hourInd+1)] ~ intercept
+ timcols%*%c(weekday,Time,S1,C1,S2,C2,S3,C3)
+ maxSlope*KORDtemps[dateInd,"MAX"]
+ minSlope*KORDtemps[dateInd,"MIN"]
+ maxScalar*plogis(KORDtemps[dateInd,"MAX"],
location = maxT, scale = maxS),
start=startList,
control=nls.control(maxiter = 500, tol = 1e-05,
minFactor = 1/10000,printEval = FALSE, warnOnly = TRUE))
# using nonlinear least squares to find the best result

allResults=rbind(allResults,coef(test2))  # combine all results

startList=list(intercept=coef(test2)[1],weekday=coef(test2)[2],
Time=coef(test2)[3],S1=coef(test2)[4],
C1=coef(test2)[5],S2=coef(test2)[6],C2=coef(test2)[7],
S3=coef(test2)[8],C3=coef(test2)[9],maxT=coef(test2)[10],
maxS=coef(test2)[11],maxScalar=coef(test2)[12],
maxSlope=coef(test2)[13],minSlope=coef(test2)[14])
# update starting value for next iteration to solution from previous hour
# although not always necessary this reduces computational time
}

Then we can examine the regression results to see the average response function in usage based on Max temperature:
maxResponse = mean(allResults[,'maxSlope']) * KORDtemps[dateInd,"MAX"]
+ mean(allResults[,'maxScalar'])*plogis(KORDtemps[dateInd,"MAX"],
location = mean(allResults[,'maxT']),
scale = mean(allResults[,'maxS']), log = FALSE)

plot(KORDtemps[dateInd,"MAX"],maxResponse)

In terms of fit we can revisit the chart from the previous blog posting by plotting the observed data for a particular hour, in this case midday, overlaid with the fitted data for the same inputs.

We can see that the fitted data (in black) overlay the actual data (in red) well but there are still differences for very hot days.

So to summarize we can model residential power usage pretty accurately to first order based on knowledge of the high temperature for the day as well as the time of the year, the day of the week and the hour of the day.  As expected, as temperatures rise, usage grows dramatically over a high temperature of 80F.

Stay cool!

## Wednesday, January 30, 2013

### Modeling Residential Electricity Usage with R

Wow, I can’t believe it has been 11 months since my last blog posting!  The next series of postings will be related to the retail energy field.  Residential power usage is satisfying to model as it can be forecast fairly accurately with the right inputs.  Partly as a consequence of deregulation there is now more data more available than before.  As in prior postings I will use reproducible R code each step of the way.

For this posting I will be using data from Commonwealth Edison [ComEd] in Chicago, IL.  ComEd makes available historical usage data for different rate classes.  In this example I use rate class C23.

library(xlsx)

# load historical electric usage data from ComEd website

# edit row and column names
dimnames(ComEd)[2][[1]]<-{c('Date',1:24)}
dimnames(ComEd)[1][[1]]<-substr(ComEd[,1],5,15)
ComEd[,1]<-as.Date(substr(ComEd[,1],5,15),'%m/%d/%Y')

Next we hypothesize some explanatory variables.  Presumably electricity usage is influenced by day of the week, time of the year and the passage of time.  Therefore we construct a set of explanatory variables and use them to predict usage for a particular hour of the day [16] as follows:

# construct time related explanatory variables
times<-as.numeric(ComEd[,1]-ComEd[1,1])/365.25
weekdayInd<-!(weekdays(ComEd[,1])=="Saturday"|weekdays(ComEd[,1])=="Sunday")
timcols<-cbind(weekday=weekdayInd,
Time=times,S1=sin(2*pi*times),C1=cos(2*pi*times),S2=sin(4*pi*times),C2=cos(4*pi*times),S3=sin(6*pi*times),C3=cos(6*pi*times))

lm1<-lm(ComEd[,16]~timcols)
summary(lm1)

While all the input variables are highly predictive, something seems to be missing in overall predictive accuracy:
Call:
lm(formula = ComEd[, 16] ~ timcols)

Residuals:
Min       1Q   Median       3Q      Max
-1.11876 -0.16639 -0.01833  0.12886  1.64360

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     1.30253    0.02577  50.545  < 2e-16 ***
timcolsweekday -0.08813    0.02226  -3.959 7.93e-05 ***
timcolsTime     0.06901    0.00959   7.196 1.03e-12 ***
timcolsS1       0.37507    0.01450  25.863  < 2e-16 ***
timcolsC1      -0.17110    0.01424 -12.016  < 2e-16 ***
timcolsS2      -0.23303    0.01425 -16.351  < 2e-16 ***
timcolsC2      -0.37622    0.01439 -26.136  < 2e-16 ***
timcolsS3      -0.05689    0.01434  -3.967 7.65e-05 ***
timcolsC3       0.13164    0.01424   9.246  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3679 on 1331 degrees of freedom
Multiple R-squared: 0.6122,    Adjusted R-squared: 0.6099
F-statistic: 262.7 on 8 and 1331 DF,  p-value: < 2.2e-16

A plot of the residuals clearly shows something else is going on:
plot(lm1\$residuals)

As you may have already concluded there is seasonal heteroscedasticity in the data, beyond what we have fitted with the seasonal variables.

Let’s load some temperature data for the Chicago area and see how it relates to this data.
First we load the weather data from the NOAA database.  It is stored in annual fixed width files which we stitch together:

# Load weather data from NOAA ftp site
KORDtemps<-NULL
for (yearNum in 2009:2013)
{
ftpString<-paste('ftp://ftp.ncdc.noaa.gov/pub/data/gsod/',yearNum,'/725300-94846-',yearNum,'.op.gz',sep='')
fileString<-paste('KORD',yearNum,'.gz',sep='')
skip=1,
col.names=c('STN','WBAN','YEARMODA','TEMP','TEMPCOUNT','DEWP','DEWPCOUNT','SLP','SLPCOUNT','STP','STPCOUNT','VISIB','VISIBCOUNT','WDSP','WDSPCOUNT','MAXSPD','GUST','MAX','MAXFLAG','MIN','MINFLAG','PRCP','PRCPFLAG','SNDP','FRSHTT'))
KORDtemps<-rbind(KORDtemps,temp2)
}

# Change missing data to NAs
KORDtemps[KORDtemps==9999.9]<-NA
KORDtemps<-cbind(date=as.Date(as.character(KORDtemps[,3]),'%Y%m%d'),KORDtemps)

# create cross reference date index
dateInd<-sapply(ComEd[,1], function(x) which(KORDtemps[,1]==x))

Next we plot the residuals from our regression vs. the prevailing max temperature for the day:
plot(KORDtemps[dateInd,"MAX"],lm1\$residuals)

As expected there is something going on relating temperature to residuals.

In my next blog posting I will discuss approaches to fitting this weather data to the demand model...

## Monday, March 5, 2012

### Futures price prediction using the order book data

It has been a couple of months since my last post; busy with lots of projects.

I had some fun playing around with data from Interactive Brokers API.  It turns out that it is relatively easily to get hold of the raw market data relating to both trades and order book changes for CME/NYMEX commodity futures.  For the purposes of this analysis I focused on the order book itself and what if anything one can imply forwards in time based on it.

All of the analysis was done in R.  I have been experimenting with R for a year or so and have found it to have many advantages for statistical analyses.  I put all the code at the bottom of the posting for those of you interested in doing something similar.

The raw data needs some manipulation to make it useful.  Each record is a single “order book event” as I think of it.  It could be a new order at a new price or a change to the number of buys or sells in the existing order book.  Only the first 10 prices are supplied.  As a result there are over 500,000 records for a single day’s 7 hours of trading during the pit hours (when most of the trade volume occurs).

For example, reconstructing a time series of the order book itself we see that as of
2012-02-17 13:52:55.302678
time the order book consisted of these bids:
 Price 2.689 2.69 2.691 2.692 2.694 2.695 2.685 2.686 2.687 2.688 quantity 18 37 44 34 10 1 16 8 6 16

And these offers:
 Price 2.701 2.702 2.703 2.704 2.705 2.696 2.697 2.698 2.699 2.7 quantity 168 11 65 8 10 6 79 21 11 12

So a rudimentary analysis of the data might focus on the imbalance between the sum of buys and the sum of sells.  In the above example there are a total of 190 bids showing and 391 offers showing.  Surely a difference of this magnitude would be meaningful?

For this sort of time series analysis we turn to the Vector Autoregression [VAR] models (as described in previous postings).  We have two time series, each potentially dependent on itself and the other.

The first step is to discretize the data into 1 second interval data since the raw data is event driven and not summarized.  The coding steps for this are outlined in R below.  Also we focus initially on the price implied by the average of the best bid and offer.  This is because trades may not occur contemporaneously with order book changes.  Therefore the best price indicator we have at any point in time is based on the best bid and offer.

The sample data we are using is from the 2nd hour of trading from 10am to 11am on Friday February 17th 2012.

`Estimation results for equation P_chg: `
`====================================== `
`P_chg = P_chg.l1 + imbalance.l1 + P_chg.l2 + imbalance.l2 + P_chg.l3 + imbalance.l3 + P_chg.l4 + imbalance.l4 + P_chg.l5 + imbalance.l5 + const `
` `
`               Estimate Std. Error t value Pr(>|t|)    `
`P_chg.l1     -2.592e-04  1.675e-02  -0.015   0.9877    `
`imbalance.l1  7.496e-07  2.681e-07   2.796   0.0052 ** `
`P_chg.l2      7.043e-02  1.676e-02   4.202 2.71e-05 ***`
`imbalance.l2  1.594e-07  3.571e-07   0.447   0.6553    `
`P_chg.l3      2.088e-02  1.679e-02   1.244   0.2138    `
`imbalance.l3 -6.813e-07  3.572e-07  -1.907   0.0566 .  `
`P_chg.l4     -6.725e-03  1.675e-02  -0.401   0.6881    `
`imbalance.l4  3.668e-07  3.536e-07   1.037   0.2997    `
`P_chg.l5     -6.367e-03  1.666e-02  -0.382   0.7023    `
`imbalance.l5 -5.575e-07  2.625e-07  -2.124   0.0337 *  `
`const         5.517e-06  9.708e-06   0.568   0.5698    `
`---`
`Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 `
` `
` `
`Residual standard error: 0.0005032 on 3584 degrees of freedom`
`Multiple R-Squared: 0.01049,   Adjusted R-squared: 0.007733 `
`F-statistic: 3.801 on 10 and 3584 DF,  p-value: 4.028e-05 `

Looking at the results we see that the model is significant.  This means that we have a model that can explain the change in price at time t based on information available at time t-1.  Specifically it indicates that the price change at time t can be forecast with the price changes and order imbalance data at times t-1, t-2 and t-3.

Unfortunately when we look at the fitted values, we see that the model isn’t giving significant enough forecasts of price.  Fitted values range from -6.543e-04 to 6.409e-04.  I.e.: +/- 0.6 of a tick [see also chart 1].  Given that trading costs alone are approximately 1 tick we haven’t created a forecast that we can actually make any money from.  However it is interesting to get the intuitive validation that prices have a momentum component on a short term basis, and that changes in the order book (increasing number of buys vs. sells) is also a positive predictor for price.
More explanatory variables is one solution to trying to get more explanatory power.
The data is distributed randomly in time; maybe the frequency of data or intensity of the order book event process is relevant [see chart 2].  Since intensity is highly skewed by adding 1 and taking the log of the data we get a more useful distribution for statistical fitting purposes.

Also, (borrowing from Hasbrouck’s Empirical Market Microstructure), it is also worth considering the serial covariance of prices.  Deviations of this from normal can be a signal of private information entering the market.  Chart 3 shows the rolling serial covariance of price changes.  Significantly positive periods may be predictive of market moving activity.

Rerunning the same analysis with these two additional variables we get the results:

`stimation results for equation P_chg: `
`====================================== `
`P_chg = P_chg.l1 + order_chg.l1 + log_intensity.l1 + serial_cov.l1 + P_chg.l2 + order_chg.l2 + log_intensity.l2 + serial_cov.l2 + P_chg.l3 + order_chg.l3 + log_intensity.l3 + serial_cov.l3 + P_chg.l4 + order_chg.l4 + log_intensity.l4 + serial_cov.l4 + const `
` `
`                   Estimate Std. Error t value Pr(>|t|)    `
`P_chg.l1          3.031e-04  1.678e-02   0.018  0.98559    `
`order_chg.l1      7.210e-07  2.685e-07   2.686  0.00727 ** `
`log_intensity.l1  2.574e-06  1.396e-05   0.184  0.85376    `
`serial_cov.l1     4.551e-01  1.147e+00   0.397  0.69145    `
`P_chg.l2          7.220e-02  1.678e-02   4.304 1.72e-05 ***`
`order_chg.l2      1.500e-07  3.576e-07   0.419  0.67503    `
`log_intensity.l2 -1.737e-05  1.446e-05  -1.202  0.22961    `
`serial_cov.l2    -5.224e-01  1.650e+00  -0.317  0.75153    `
`P_chg.l3          2.208e-02  1.678e-02   1.316  0.18837    `
`order_chg.l3     -6.861e-07  3.541e-07  -1.938  0.05274 .  `
`log_intensity.l3  8.787e-06  1.445e-05   0.608  0.54320    `
`serial_cov.l3     2.059e+00  1.649e+00   1.248  0.21204    `
`P_chg.l4         -9.888e-03  1.669e-02  -0.593  0.55352    `
`order_chg.l4     -1.387e-07  2.630e-07  -0.528  0.59786    `
`log_intensity.l4  1.606e-06  1.395e-05   0.115  0.90837    `
`serial_cov.l4    -1.964e+00  1.146e+00  -1.714  0.08663 .  `
`const             1.152e-05  2.271e-05   0.507  0.61204    `
`---`
`Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 `
` `
` `
`Residual standard error: 0.0005036 on 3579 degrees of freedom`
`Multiple R-Squared: 0.01048,   Adjusted R-squared: 0.006059 `
`F-statistic:  2.37 on 16 and 3579 DF,  p-value: 0.001621 `

Unfortunately the range of fitted values is still about 0.6 of a tick: -6.601e-04 to 6.419e-04 and the p value is not as compelling as before.  We have added 2 variables and the model is no better a predictor than it was before.

So in summary, price movements can be predicted using a vector autoregression model, looking at lagged price change, order imbalance, message intensity and serial correlation.  However the strength of the price signal is small relative to transaction costs.

Many thanks to Bernhard Pfaff for his R package ‘vars’ as well as the companion book Analysis of Integrated and Cointegrated Time Series with R, and Joel Hasbrouck for his book Empirical Market Microstructure.

I can be reached at lloyd.spencer@kahutrading.com
___________________________________________________
For those interested in the R code see below.  I am aware there are some loops and other coding blunders but it does work:

library(IBrokers)
tws <- twsConnect(clientId = 1)
twsPrompt<-twsFuture(symbol="NG","NYMEX","20120227")
reqMktData(tws, twsList, CALLBACK = NULL, file = "data.dat")

colType<-c("Date","POSIXct","integer","integer","integer","integer","integer","integer","numeric","integer","character")

options(digits.secs=6)  # option for formatting dates with millisecond timestamps
cleanDates<-strptime(paste(depthData[,1],depthData[,2]),"%Y%m%d %H:%M:%OS")
cleanData<-array(0,c(length(cleanDates),8))
for (col in 3:10) cleanData[,(col-2)]<-as.numeric(as.vector(depthData[,col]))

keep<-which(!is.na(cleanDates))
cleanData<-cleanData[keep,]
cleanDates<-cleanDates[keep]

keep<-which(!is.na(cleanData[,5]))
cleanData<-cleanData[keep,]
cleanDates<-cleanDates[keep]

bidPrices<-rep(0,10)
bidSizes<-rep(0,10)
bidAge<-rep(0,10)
offerPrices<-rep(0,10)
offerSizes<-rep(0,10)
offerAge<-rep(0,10)

dataRows<-length(cleanData[,1])
bidTotal<-rep(0,dataRows)
offerTotal<-rep(0,dataRows)
midPrice<-rep(0,dataRows)
fwdIndex<-rep(0,dataRows)

# loop through the data, recreating the working order book at each instant in time.
for (rowNum in 1:dataRows) {
if ((cleanData[rowNum,6]==1)&(cleanData[rowNum,7]>0)) {
P<-cleanData[rowNum,7]
Q<-cleanData[rowNum,8]
if (P %in% bidPrices) {
bidSizes[which(bidPrices==P)]<-Q
bidAge[which(bidPrices==P)]<-1
bidAge[which(bidPrices!=P)]<-bidAge[which(bidPrices!=P)]+1
} else {
if (P>max(bidPrices)) {
delete<-which.min(bidPrices)
bidPrices[delete]<-P
bidSizes[delete]<-Q

} else if (P<min(bidPrices)) {
delete<-which.max(bidPrices)
bidPrices[delete]<-P
bidSizes[delete]<-Q
}
}
} else if ((cleanData[rowNum,6]==0)&(cleanData[rowNum,7]>0)) {
P<-cleanData[rowNum,7]
Q<-cleanData[rowNum,8]
if (P %in% offerPrices) {
offerSizes[which(offerPrices==P)]<-Q
offerAge[which(offerPrices==P)]<-1
offerAge[which(offerPrices!=P)]<-offerAge[which(offerPrices!=P)]+1
} else {
if (P>max(offerPrices)) {
delete<-which.min(offerPrices)
offerPrices[delete]<-P
offerSizes[delete]<-Q

} else if (P<min(offerPrices)) {
delete<-which.max(offerPrices)
offerPrices[delete]<-P
offerSizes[delete]<-Q
}
}
}
bidTotal[rowNum]<-sum(bidSizes)
offerTotal[rowNum]<-sum(offerSizes)
if ((min(offerPrices)>0)&(max(bidPrices)>0)) midPrice[rowNum]<-(min(offerPrices)+max(bidPrices))/2
}

# create second indices for dates with correct start of day
ints<-seq(cleanDates[36],cleanDates[length(cleanDates)],"sec")
secondIndex<-findInterval(as.numeric(ints),as.numeric(cleanDates))
intensity<-diff(secondIndex)+1
imbalance<-bidTotal[secondIndex]-offerTotal[secondIndex]
priceChange<-diff(midPrice[secondIndex])

priceDiffs<-diff(midPrice[secondIndex])
results=NULL
library(vars)

# first look at order imbalance

alldat<-as.matrix(cbind(priceDiffs,imbalance[2:(length(imbalance))]))[921:4520,]
colnames(alldat)<-c("P_chg","imbalance")
varest<-VAR(alldat,p=1,type="const", lag.max=20, ic="AIC")
summary(varest)
summary(varest\$varresult\$P_chg\$fitted.values)

serialCov<-NULL
for (ii in (921-60):(4520-60)) serialCov<-rbind(serialCov,cov(priceDiffs[(ii-1):(ii+59)],priceDiffs[(ii):(ii+60)]))
plot(serialCov)
length(serialCov)

alldat<-(cbind(priceDiffs,imbalance[2:(length(imbalance))],log(intensity, base = 10)[2:(length(imbalance))])[921:4520,])
alldat<-cbind(alldat[,1:3],serialCov*1000)  # 1000 times to avoid singularity
colnames(alldat)<-c("P_chg","order_chg","log_intensity","serial_cov")

varest<-VAR(alldat,type="const", lag.max=10, ic="AIC")
summary(varest)
summary(varest\$varresult\$P_chg\$fitted.values)

## Thursday, December 1, 2011

### NG Spreads returns, a reliable earner.

I was reminded recently it has been a while since my last posting.  So here is a quick analysis of a strategy that works pretty well.

Much has been written on the changes to the NG market structure over the past few years; the influence of financial investors and the impact of new technologies.  One thing that hasn’t changed is the reliable earning power of spread trading.

The complexity in modeling returns from futures trading, particularly spread trading means that the best way to characterize the profitability of these types of strategies is in terms of \$/contract.  This can then be directly compared to the \$/contract of margin requirements for a reasonable perspective on the investment return.  For this analysis, I used recent data from the CME website for initial margin by strategy type for the NG contract:

But first I present the \$/contract return side of the equation for some standard spread strategies.  These analyses are all based on rolling a X-Y position in NG futures, with the contract roll being on the penultimate day available.  In the usual convention X is the nearby contract and Y is the further out contract.  The ratio is 1:1 in contracts.  Other roll alternatives may make more sense but this is conveniently tradeable given the TAS contracts available.  Other hedge ratios may be more statistically defensible; 1:1 has the advantage of being available as a directly tradable spread (also via TAS), and the analytics is trivial.

Looking at the 5 year historical performance of N-(N+1) spreads we see the following results:

 spread Daily avg Daily stdev Annualizedavg/stdev 1-2 13.27 232.51 0.90 2-3 -4.89 172.79 -0.45 3-4 -10.81 128.42 -1.33 4-5 -3.90 126.33 -0.49 5-6 -1.30 97.10 -0.21 6-7 -5.48 108.67 -0.80 7-8 -9.22 123.99 -1.18 8-9 -17.55 179.98 -1.54 9-10 -2.78 129.50 -0.34 10-11 0.96 123.55 0.12 11-12 -7.32 755.13 -0.15 1-3 18.16 193.36 1.48 1-2 -(3-4) 24.08 232.20 1.64

This table shows the average daily return and standard deviation in \$/contract.  So for example, holding a long prompt contract and short 2nd month contract since January 1, 2007 made an average of \$13.27 per day with a standard deviation of \$232.51 per day.  However on an annualized basis this translates into \$3,317 and \$3,676 respectively or a profit to risk ratio of about 0.9.  Not great, but a nice diversified addition to an existing strategy.
The highest individual 1 month spread contract return is in the 3-4 contract.  Specifically being short this contract, which yields a profit to risk ratio for the year of 1.33.

By extension, these strategies can be combined relatively easily:
Long 1-2 plus short 2-3 (which is equivalent to long 1-3) has a 1.48 profit to risk ratio.
Long 1-2 plus short 3-4 has a 1.64 profit to risk ratio.

What is nice about these strategies is that the capital (margin) commitment is not immense.  An individual spread requires less than \$1000 per contract of initial margin.  Drawdowns have been limited in the past 5 years as the cumulative profit chart shows:

Overall, this is a nice uncorrelated strategy to add to a portfolio.