Adventures in the Land of py-stat: Final Post

Now that GSOC is over, I'd like to provide a wrap up what I accomplished this summer.

In general, I have 4 main features that are tested and have results that match the Matlab/R counterparts. They are:

Inference on descriptive statistics
Inference on Linear Regression
Inference and Estimation of the accelerated failure time model
Inference and estimation of regression with instrumental variables

Although that these feature compute correctly, only the descriptive branch is ready to be merged into master as I am still generating documentation and working with the developers to determine exactly how to incorporate the new features.

Here is a bit more of the specifics:

Descriptive

This allows us to compute confidence intervals of descriptive statistics via EL
For example:

import statsmodels.api as sm
import numpy as np

data = np.random.standard_normal(20)
EL = sm.emplike.DescStat(data)
EL.ci_mean()
EL.ci_var()
EL.ci_kurt()
EL.ci_skew()

The above will return the empirical likelihood confidence regions for mean, variance, skewness and kurtosis of 'data.'

Regression

data = sm.datasets.stackloss.load()
model = sm.OLS(data.endog, sm.add_constant(data.exog, prepend=1)
fitted = model.fit()
fitted.ci_params((!))

This returns the confidence interval via EL for the first regression parameter. This is useful because it accounts for heteroskedasticity. Furthermore, we can add an option stochastic_exog=0 to change the regression assumption that the X matrix is stochastic to fixed. The resulting confidence interval will be smaller.

AFT model

data = sm.datasets.heart.laod() # heart not yet in master, I added it.
aftmod = sm.emplike.emplikeAFT(data.endog, data.exog, data.censors)
aftres = aftmod.fit()
aftres.params

This will give the parameters of an accelerated failure time model. I also implemented hypothesis tests and confidence intervals but the confidence intervals are very computationally intensive as it would involve a quadruply nested optimization, with on of the optimizations being a modified EM algorithm. This model was by far the most difficult model of my entire GSOC and took over 3 weeks to implement.

IVmodel

The basic framework for this is

model = sm.emplike.IVReg(endog, exog, instruments)
fittedIV = model.fit()
fitted.params

This will return the params of the model by solving for an optimal probability distribution such that Z(y-XB)=0 where Z is a matrix of instruments, y is the endogenous variable, X is a matrix of covariates and B are the model's parameters.

Like I said, only the descriptive branch is up to snuff but I plan on continuing work with EL in statsmodels and bringing the other branches up to snuff as well as working with the maintainers to incorporate the new features optimally.

As I mentioned in my last post, if any users have questions on how to conduct EL in statsmodels or would like a specific new feature implemented, please email me or ping the statsmodels mailing list and I will do my best to accommodate.

Special thanks again to everybody that helped this summer. Skipper and Josef were more helpful than I could have anticipated.

To see some of the code, check out pull-request #326 which contains a lot of the meat of the project, including the setup for optimizing an empirical likelihood ratio. Pull requests 427 and 406 are also EL code but are continually beaing edited to be ready to merge.

Adventures in the Land of py-stat

Wednesday, August 22, 2012

Final Post

1 comment: