It has been a couple of months since my last post; I have been busy with lots of projects.

I had some fun playing around with data from the Interactive Brokers API. It turns out that it is relatively easy to get hold of the raw market data relating to both trades and order book changes for CME/NYMEX commodity futures. For the purposes of this analysis I focused on the order book itself, and what, if anything, it implies about prices going forward in time.

All of the analysis was done in R. I have been experimenting with R for a year or so and have found it to have many advantages for statistical analyses. I put all the code at the bottom of the posting for those of you interested in doing something similar.

The raw data needs some manipulation to make it useful. Each record is what I think of as a single "order book event": either a new order at a new price, or a change to the number of buys or sells at an existing price in the book. Only the first 10 prices on each side are supplied. Even so, there are over 500,000 records for a single day's 7 hours of trading during the pit hours (when most of the trade volume occurs).

For example, reconstructing a time series of the order book itself, we see that as of 2012-02-17 13:52:55.302678 the order book consisted of these bids:

| Price | 2.685 | 2.686 | 2.687 | 2.688 | 2.689 | 2.690 | 2.691 | 2.692 | 2.694 | 2.695 |
|---|---|---|---|---|---|---|---|---|---|---|
| Quantity | 16 | 8 | 6 | 16 | 18 | 37 | 44 | 34 | 10 | 1 |

And these offers:

| Price | 2.696 | 2.697 | 2.698 | 2.699 | 2.700 | 2.701 | 2.702 | 2.703 | 2.704 | 2.705 |
|---|---|---|---|---|---|---|---|---|---|---|
| Quantity | 6 | 79 | 21 | 11 | 12 | 168 | 11 | 65 | 8 | 10 |

So a rudimentary analysis of the data might focus on the imbalance between the sum of bids and the sum of offers. In the above example there are a total of 190 bids showing and 391 offers showing. Surely a difference of this magnitude would be meaningful?
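In R, the imbalance for this snapshot is just the difference of the two sums (quantities copied from the tables above; the variable names are illustrative):

```r
# quantities showing at the ten bid and ten offer levels in the snapshot
bidSizes   <- c(16, 8, 6, 16, 18, 37, 44, 34, 10, 1)
offerSizes <- c(6, 79, 21, 11, 12, 168, 11, 65, 8, 10)

imbalance <- sum(bidSizes) - sum(offerSizes)  # 190 - 391 = -201
```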

For this sort of time series analysis we turn to Vector Autoregression (VAR) models, as described in previous posts. We have two time series, each potentially dependent on itself and the other.
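Written out for our two series (with generic coefficients, since the fitted values come later), a VAR with one lag looks like:

```
P_chg_t     = a11 * P_chg_{t-1} + a12 * imbalance_{t-1} + c1 + e1_t
imbalance_t = a21 * P_chg_{t-1} + a22 * imbalance_{t-1} + c2 + e2_t
```

Each variable is regressed on lags of itself and of the other variable; extra lags simply add more terms of the same form.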

The first step is to discretize the data into 1-second intervals, since the raw data is event-driven rather than summarized. The coding steps for this are outlined in the R code below. We also focus initially on the price implied by the average of the best bid and offer. This is because trades may not occur contemporaneously with order book changes, so the best price indicator we have at any point in time is based on the best bid and offer.
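A minimal sketch of that discretization step, using findInterval() to take the last order book state at or before each whole second (toy numbers and illustrative variable names):

```r
# event timestamps in seconds since the start, and the mid-price after each event
eventTimes <- c(0.2, 0.7, 1.3, 1.4, 2.8)
midPrice   <- c(2.7000, 2.7005, 2.7010, 2.7005, 2.7020)

# 1-second grid; findInterval() returns the index of the last event <= each tick,
# i.e. the prevailing book state at that whole second
grid <- 1:3
idx  <- findInterval(grid, eventTimes)   # 2 4 5
secondlyMid <- midPrice[idx]
priceChange <- diff(secondlyMid)         # 0.0000 0.0015
```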

The sample data we are using is from the 2nd hour of trading, 10am to 11am, on Friday February 17th 2012.

```
Estimation results for equation P_chg:
======================================
P_chg = P_chg.l1 + imbalance.l1 + P_chg.l2 + imbalance.l2 + P_chg.l3 + imbalance.l3 + P_chg.l4 + imbalance.l4 + P_chg.l5 + imbalance.l5 + const

               Estimate Std. Error t value Pr(>|t|)
P_chg.l1     -2.592e-04  1.675e-02  -0.015   0.9877
imbalance.l1  7.496e-07  2.681e-07   2.796   0.0052 **
P_chg.l2      7.043e-02  1.676e-02   4.202 2.71e-05 ***
imbalance.l2  1.594e-07  3.571e-07   0.447   0.6553
P_chg.l3      2.088e-02  1.679e-02   1.244   0.2138
imbalance.l3 -6.813e-07  3.572e-07  -1.907   0.0566 .
P_chg.l4     -6.725e-03  1.675e-02  -0.401   0.6881
imbalance.l4  3.668e-07  3.536e-07   1.037   0.2997
P_chg.l5     -6.367e-03  1.666e-02  -0.382   0.7023
imbalance.l5 -5.575e-07  2.625e-07  -2.124   0.0337 *
const         5.517e-06  9.708e-06   0.568   0.5698
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0005032 on 3584 degrees of freedom
Multiple R-Squared: 0.01049, Adjusted R-squared: 0.007733
F-statistic: 3.801 on 10 and 3584 DF, p-value: 4.028e-05
```

Looking at the results we see that the model is significant: we have a model that can explain the change in price at time t based on information available at time t-1. Specifically, the order imbalance at t-1, the price change at t-2 and (marginally) the imbalance at t-3 all carry predictive information about the price change at time t.

Unfortunately, when we look at the fitted values, we see that the model isn't producing price forecasts large enough to act on. Fitted values range from -6.543e-04 to 6.409e-04, i.e. roughly +/- 0.6 of a tick [see also chart 1]. Given that trading costs alone are approximately 1 tick, we haven't created a forecast we can actually make money from. It is still interesting to get the intuitive validation that prices have a short-term momentum component, and that changes in the order book (an increasing number of bids relative to offers) are also a positive predictor for price.

Adding more explanatory variables is one way to try to gain more explanatory power.

The data is distributed randomly in time; maybe the frequency of the data, or the *intensity* of the order book event process, is relevant [see chart 2]. Since intensity is highly skewed, adding 1 and taking the log gives a distribution that is more useful for statistical fitting purposes.
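That transform is just the log of (1 + events per second); a small sketch with made-up counts:

```r
# number of order book events landing in each 1-second bucket (made-up data)
eventsPerSecond <- c(0, 1, 3, 250, 7, 0, 42)

# the +1 keeps seconds with zero events finite; the log compresses the heavy right tail
logIntensity <- log(1 + eventsPerSecond, base = 10)
range(logIntensity)  # 0 up to log10(251)
```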

Also (borrowing from Hasbrouck's *Empirical Market Microstructure*), it is worth considering the serial covariance of price changes. Deviations of this from its normal level can be a signal of private information entering the market. Chart 3 shows the rolling serial covariance of price changes; significantly positive periods may be predictive of market-moving activity.
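A rolling serial covariance can be computed by sliding a window along the 1-second price changes and taking cov() of the window against itself shifted by one second (the window length and names here are illustrative, not exactly those in the code below):

```r
# lag-1 covariance of x within each rolling window
rollingSerialCov <- function(x, window = 60) {
  n <- length(x) - window
  sapply(seq_len(n), function(i)
    cov(x[i:(i + window - 1)], x[(i + 1):(i + window)]))
}

set.seed(1)
priceChanges <- rnorm(200, sd = 0.001)  # stand-in for real 1-second price changes
sc <- rollingSerialCov(priceChanges)
length(sc)  # 140 rolling values
```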

Rerunning the same analysis with these two additional variables we get the results:

```
Estimation results for equation P_chg:
======================================
P_chg = P_chg.l1 + order_chg.l1 + log_intensity.l1 + serial_cov.l1 + P_chg.l2 + order_chg.l2 + log_intensity.l2 + serial_cov.l2 + P_chg.l3 + order_chg.l3 + log_intensity.l3 + serial_cov.l3 + P_chg.l4 + order_chg.l4 + log_intensity.l4 + serial_cov.l4 + const

                   Estimate Std. Error t value Pr(>|t|)
P_chg.l1          3.031e-04  1.678e-02   0.018  0.98559
order_chg.l1      7.210e-07  2.685e-07   2.686  0.00727 **
log_intensity.l1  2.574e-06  1.396e-05   0.184  0.85376
serial_cov.l1     4.551e-01  1.147e+00   0.397  0.69145
P_chg.l2          7.220e-02  1.678e-02   4.304 1.72e-05 ***
order_chg.l2      1.500e-07  3.576e-07   0.419  0.67503
log_intensity.l2 -1.737e-05  1.446e-05  -1.202  0.22961
serial_cov.l2    -5.224e-01  1.650e+00  -0.317  0.75153
P_chg.l3          2.208e-02  1.678e-02   1.316  0.18837
order_chg.l3     -6.861e-07  3.541e-07  -1.938  0.05274 .
log_intensity.l3  8.787e-06  1.445e-05   0.608  0.54320
serial_cov.l3     2.059e+00  1.649e+00   1.248  0.21204
P_chg.l4         -9.888e-03  1.669e-02  -0.593  0.55352
order_chg.l4     -1.387e-07  2.630e-07  -0.528  0.59786
log_intensity.l4  1.606e-06  1.395e-05   0.115  0.90837
serial_cov.l4    -1.964e+00  1.146e+00  -1.714  0.08663 .
const             1.152e-05  2.271e-05   0.507  0.61204
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0005036 on 3579 degrees of freedom
Multiple R-Squared: 0.01048, Adjusted R-squared: 0.006059
F-statistic: 2.37 on 16 and 3579 DF, p-value: 0.001621
```

Unfortunately the range of fitted values is still about 0.6 of a tick (-6.601e-04 to 6.419e-04), and the p-value is less compelling than before. We have added two variables and the model is no better a predictor than it was.

So, in summary: price movements can be predicted using a vector autoregression model on lagged price changes, order imbalance, message intensity and serial covariance. However, the strength of the price signal is small relative to transaction costs.

Many thanks to Bernhard Pfaff for his R package ‘vars’ as well as the companion book *Analysis of Integrated and Cointegrated Time Series with R*, and Joel Hasbrouck for his book *Empirical Market Microstructure*.

I can be reached at lloyd.spencer@kahutrading.com

___________________________________________________

For those interested, the R code is below. I am aware there are some loops and other coding blunders, but it does work:

Download data from IB:

```r
library(IBrokers)

tws <- twsConnect(clientId = 1)
twsPrompt <- twsFuture(symbol = "NG", "NYMEX", "20120227")

# stream the order book (market depth) events straight to a file
reqMktDepth(tws, twsPrompt, CALLBACK = NULL, file = "data.dat")
```

Load and analyze data:

```r
setwd('C:/Users/Lloyd/Documents/trading/analysis/2012/intraday/')

colType <- c("Date", "POSIXct", "integer", "integer", "integer", "integer",
             "integer", "integer", "numeric", "integer", "character")
depthData <- read.table("depthCL.dat", header = FALSE, row.names = NULL,
                        sep = " ", fill = TRUE) # ,colClasses=colType

options(digits.secs = 6) # format dates with microsecond timestamps
cleanDates <- strptime(paste(depthData[, 1], depthData[, 2]), "%Y%m%d %H:%M:%OS")

cleanData <- array(0, c(length(cleanDates), 8))
for (col in 3:10) cleanData[, (col - 2)] <- as.numeric(as.vector(depthData[, col]))

# remove bad data
keep <- which(!is.na(cleanDates))
cleanData <- cleanData[keep, ]
cleanDates <- cleanDates[keep]
keep <- which(!is.na(cleanData[, 5]))
cleanData <- cleanData[keep, ]
cleanDates <- cleanDates[keep]

bidPrices <- rep(0, 10)
bidSizes <- rep(0, 10)
bidAge <- rep(0, 10)
offerPrices <- rep(0, 10)
offerSizes <- rep(0, 10)
offerAge <- rep(0, 10)

dataRows <- length(cleanData[, 1])
bidTotal <- rep(0, dataRows)
offerTotal <- rep(0, dataRows)
midPrice <- rep(0, dataRows)
fwdIndex <- rep(0, dataRows)

# loop through the data, recreating the working order book at each instant in time
for (rowNum in 1:dataRows) {
  if ((cleanData[rowNum, 6] == 1) & (cleanData[rowNum, 7] > 0)) {        # bid side
    P <- cleanData[rowNum, 7]
    Q <- cleanData[rowNum, 8]
    if (P %in% bidPrices) {
      bidSizes[which(bidPrices == P)] <- Q
      bidAge[which(bidPrices == P)] <- 1
      bidAge[which(bidPrices != P)] <- bidAge[which(bidPrices != P)] + 1
    } else {
      if (P > max(bidPrices)) {          # new best bid: replace the lowest level
        delete <- which.min(bidPrices)
        bidPrices[delete] <- P
        bidSizes[delete] <- Q
      } else if (P < min(bidPrices)) {   # new lowest bid: replace the highest level
        delete <- which.max(bidPrices)
        bidPrices[delete] <- P
        bidSizes[delete] <- Q
      }
    }
  } else if ((cleanData[rowNum, 6] == 0) & (cleanData[rowNum, 7] > 0)) { # offer side
    P <- cleanData[rowNum, 7]
    Q <- cleanData[rowNum, 8]
    if (P %in% offerPrices) {
      offerSizes[which(offerPrices == P)] <- Q
      offerAge[which(offerPrices == P)] <- 1
      offerAge[which(offerPrices != P)] <- offerAge[which(offerPrices != P)] + 1
    } else {
      if (P > max(offerPrices)) {
        delete <- which.min(offerPrices)
        offerPrices[delete] <- P
        offerSizes[delete] <- Q
      } else if (P < min(offerPrices)) {
        delete <- which.max(offerPrices)
        offerPrices[delete] <- P
        offerSizes[delete] <- Q
      }
    }
  }
  bidTotal[rowNum] <- sum(bidSizes)
  offerTotal[rowNum] <- sum(offerSizes)
  if ((min(offerPrices) > 0) & (max(bidPrices) > 0))
    midPrice[rowNum] <- (min(offerPrices) + max(bidPrices)) / 2
}

# create second-by-second indices with the correct start of day
ints <- seq(cleanDates[36], cleanDates[length(cleanDates)], "sec")
secondIndex <- findInterval(as.numeric(ints), as.numeric(cleanDates))

intensity <- diff(secondIndex) + 1
imbalance <- bidTotal[secondIndex] - offerTotal[secondIndex]
priceChange <- diff(midPrice[secondIndex])
priceDiffs <- diff(midPrice[secondIndex])

results <- NULL
library(vars)

# first look at order imbalance
alldat <- as.matrix(cbind(priceDiffs, imbalance[2:length(imbalance)]))[921:4520, ]
colnames(alldat) <- c("P_chg", "imbalance")
varest <- VAR(alldat, p = 1, type = "const", lag.max = 20, ic = "AIC")
summary(varest)
summary(varest$varresult$P_chg$fitted.values)

# rolling serial covariance of price changes
serialCov <- NULL
for (ii in (921 - 60):(4520 - 60))
  serialCov <- rbind(serialCov, cov(priceDiffs[(ii - 1):(ii + 59)],
                                    priceDiffs[ii:(ii + 60)]))
plot(serialCov)
length(serialCov)

# second model: add log intensity and serial covariance
alldat <- cbind(priceDiffs, imbalance[2:length(imbalance)],
                log(intensity, base = 10)[2:length(imbalance)])[921:4520, ]
alldat <- cbind(alldat[, 1:3], serialCov * 1000) # times 1000 to avoid singularity
colnames(alldat) <- c("P_chg", "order_chg", "log_intensity", "serial_cov")
varest <- VAR(alldat, type = "const", lag.max = 10, ic = "AIC")
summary(varest)
summary(varest$varresult$P_chg$fitted.values)
```