Adventures in the Land of py-stat

Final Post

2012-08-22T15:10:00.001-07:00

Now that GSOC is over, I'd like to provide a wrap up what I accomplished this summer.

In general, I have 4 main features that are tested and have results that match the Matlab/R counterparts. They are:

Inference on descriptive statistics
Inference on Linear Regression
Inference and Estimation of the accelerated failure time model
Inference and estimation of regression with instrumental variables

Although that these feature compute correctly, only the descriptive branch is ready to be merged into master as I am still generating documentation and working with the developers to determine exactly how to incorporate the new features.

Here is a bit more of the specifics:

Descriptive

This allows us to compute confidence intervals of descriptive statistics via EL
For example:

import statsmodels.api as sm
import numpy as np

data = np.random.standard_normal(20)
EL = sm.emplike.DescStat(data)
EL.ci_mean()
EL.ci_var()
EL.ci_kurt()
EL.ci_skew()

The above will return the empirical likelihood confidence regions for mean, variance, skewness and kurtosis of 'data.'

Regression

data = sm.datasets.stackloss.load()
model = sm.OLS(data.endog, sm.add_constant(data.exog, prepend=1)
fitted = model.fit()
fitted.ci_params((!))

This returns the confidence interval via EL for the first regression parameter. This is useful because it accounts for heteroskedasticity. Furthermore, we can add an option stochastic_exog=0 to change the regression assumption that the X matrix is stochastic to fixed. The resulting confidence interval will be smaller.

AFT model

data = sm.datasets.heart.laod() # heart not yet in master, I added it.
aftmod = sm.emplike.emplikeAFT(data.endog, data.exog, data.censors)
aftres = aftmod.fit()
aftres.params

This will give the parameters of an accelerated failure time model. I also implemented hypothesis tests and confidence intervals but the confidence intervals are very computationally intensive as it would involve a quadruply nested optimization, with on of the optimizations being a modified EM algorithm. This model was by far the most difficult model of my entire GSOC and took over 3 weeks to implement.

IVmodel

The basic framework for this is

model = sm.emplike.IVReg(endog, exog, instruments)
fittedIV = model.fit()
fitted.params

This will return the params of the model by solving for an optimal probability distribution such that Z(y-XB)=0 where Z is a matrix of instruments, y is the endogenous variable, X is a matrix of covariates and B are the model's parameters.

Like I said, only the descriptive branch is up to snuff but I plan on continuing work with EL in statsmodels and bringing the other branches up to snuff as well as working with the maintainers to incorporate the new features optimally.

As I mentioned in my last post, if any users have questions on how to conduct EL in statsmodels or would like a specific new feature implemented, please email me or ping the statsmodels mailing list and I will do my best to accommodate.

Special thanks again to everybody that helped this summer. Skipper and Josef were more helpful than I could have anticipated.

To see some of the code, check out pull-request #326 which contains a lot of the meat of the project, including the setup for optimizing an empirical likelihood ratio. Pull requests 427 and 406 are also EL code but are continually beaing edited to be ready to merge.

Winding Down

2012-08-19T18:48:00.002-07:00

GSOC comes to a close tonight at 12:00 midnight and I could not have asked for a more rewarding summer job.

As far as my proposal, I have implemented EL descriptive statistics, linear regression, censored regression and instrumental variable regression. Although they are not all ready to be merged into the statsmodels main, the computation is there and all that is left is the documentation and cleaning, something that I will be continually working on.

I plan on staying active in the statsmodels community and hope to continue contributing empirical likelihood models as well as other models in the indefinite future.

Also, if anybody would like to see other specific empirical likelihood models implemented in statsmodels, please post to the mailinglist or email me directly at justin.grana@gmail.com.

A special thanks to Skipper and Josef as well as Ralf for their help this summer. Your advice was invaluable.

Slight change of plans

2012-08-08T10:09:00.003-07:00

Just to make it public, I have decided not to implement model selection criteria but instead to implement instrumental variable regression. More to come soon.

Estimating an AFT model

2012-08-07T01:11:00.001-07:00

Estimation of the parametric Accelerated Failure Time (AFT) model is no walk in the park. Much of the research points to two main methods. The first is Koul et al (1981) in which he uses a synthetic data approach; and Stute (1993, 1996) in which the author uses WLS regression. These are often preferred to other estimators due to their computational simplicity.

HOWEVER, in all cases the method of Koul performs worse than that of Stute. This is because an AFT model requires an estimate CDF for survival probabilities. In Koul's theoretical paper, the authors do not use the Kaplan-Meier estimator because they note ,"in our proof we need to take the log), and using a KM estimator can potentially lead to taking the log of 0. However, if we replace Koul's estimate of the survival probabilities with the Kaplan Meier estimate, the bias in the estimates essentially goes away. Furthermore, by assuming that max(dependent_variable) is uncensored, we can avoid the divide by 0 dilemma. To add to the ambiguity, other papers that cite Koul claim that Koul uses the Kaplan-Meier estimate when he actually doesn't.

My issue is if I cannot find a theoretical justification for using the KM estimate in Koul's mehtod, I cannot use it to code. That said, is there any theoretical adjustment of adapting Koul's synthetic data method and replacing the survival probabilities with the KM estimate? I have looked but haven't been able to find any.

Midterm Review

2012-07-11T23:01:00.004-07:00

To start things off, GSOC has been a 10/10 for me. I enjoy the work I am doing and the downsides are very minimal.

Status
Per my proposal, my GSOC project is right on target. Some features I have so far are empirical likelihood hypothesis tests and confidence intervals for:

the mean
the variance
the skewness
the kurtosis
correlation
multivariate means
regression parameters
regression parameters in a model forced through the origin.
* I also added some neat plots that I didn't expect to do.

Regression through the origin includes empirical likelihood estimates since this is the only case (so far) where the empirical likelihood estimates differ from the traditional estimates. Tomorrow (Thursday) I will be adding ANOVA. I will also be adding an option to choose if the regressors are fixed or stochastic. The only difference is that with fixed regressors, moment conditions on the independent variables are added.

I have not worked with Bartlett corrections yet because I cannot find a test case. However, I am not ruling it out.

Narrative
In general, the problems I faced with my GSOC are not the ones I expected and the ones I expected were actually easier than I thought. For example, I thought sifting through the theory to understand the correct computations would be difficult. However, this was rather easy and it was easy to pick out the necessary building blocks from the textbooks.

Without question, the problem I didn't foresee that has given me the most trouble so far is working with GIT. However, after creating a test repository of my own and playing around for a couple hours, (completely destroying the test "project" at times) I finally got the hang of it. If I have any recommendation to future GSOCers for statsmodels, I would recommend that they do this. Like I said above, a couple hours of tinkering could save days of frustration.

What I can use help on
Although I am catching on, I still feel there are many style elements that pass me by. Maybe my documentation is not up to snuff. Maybe the unit tests aren't exactly clear. Anyway, I have been following the "no news is good news" motto, but I have a feeling that there can't be all that much good news. There is always room for improvement.

Statsmodels Team
100% helpful and timely in their responses. Thanks to Skipper, Josef, Ralph and couple others that have been actively providing feedback and answers to my questions.

If I were to offer any suggestions in terms of feedback, it would be to make the recommendations as concrete as possible. This is mostly because I am still getting the hang of the lingo. I noticed it has become a bit easier to respond to comments as the summer has progressed.

Second Half Changes
I mentioned this to Skipper but I would like to make a couple changes in my plan. I still want to start Monday with censored regression. Then, instead of moving into model selection, I would like to model EL IV regression. FInally, with my leftover time, I would like to create a general EL framework where the user enters an array of estimating equations and a parameter vector and a hypothesis and the program returns the results of the hypothesis test. This is essentially the setup (but will not be close to the same code) as Bruce Hansen's MATLAB/Gauss package. Then if I have time, I will do El model selection criteria. Any comments on this idea would be greatly appreciated.

Some Code
I sent two files to the statsmodels mailing list that walk through what I have implemented so far. Please pardon spelling, grammar and typos as I wanted to get these files out as soon as possible. Both can be ran in their entirety but I suggest running line by line. The code is full of in-line comments to guide the user through. To run the code, make sure to install statsmodels from my emplike_reg branch. I am assuming matplotlib is installed. If you would like these files, feel free to email me at justin.grana@gmail.com.

I'd be happy to answer any questions on or off list and would be interested to hear of any suggested improvements .

Other notes
I am using emacs as my editor. Another benefit of GSOC is learning how to navigate this nifty and powerful programming. Getting a taste of lisp isn't a bad side effect.

Almost all of my test cases are against MATLAB. With censored regression I will most likely be using R.

Git holes

2012-07-08T19:37:00.003-07:00

(title thanks to Skipper).

This week's post is a bit different but still relevant to GSOC. Coming into GSOC I had no idea what GIt, Github or even what version control meant. Now, I can still still say I don't know what they are, but I can say I have learned how (not) to use them.

I think the key to learning Git is creating a repository and trying out as many git commands as possible and seeing what they do first hand. This was especially helpful for me because I was unfamiliar with the lingo and often I found the documentation very difficult to grasp for someone new to git.

From my minimal experience, anybody trying to learn git should create their own repo and make multiple branches, push commits, pull from remote repositories, cherry-pick, add and delete files and be familiar with the commands that make use of a remote repository and the ones that remain local. 2 Hours of tinkering for me provided more insight than days of reading documentation.

Interesting problem

2012-07-02T11:26:00.002-07:00

I've noticed an interesting problem with empirical likelihood that although Owen recognizes it is one of its shortfalls, I am facing it first hand. One of the advantages of EL is that there are no distributional assumptions placed on the data or the parameters. Therefore, EL is well suited for problems that have nonstandard distributions. However, theoretically, EL inference can only be conducted if the hypothesized values lie in the convex hull of the data. Without going into too much detail, the more skewed the data is, the more unlikely it is that a hypothesized value of the parameter would lie in the convex hull of the data. When the hypothesized parameter value is not in the convex hull, optimization becomes infeasible. This leads to an interesting conclusion: the strength of EL is also one of its major downfalls.

On another note, next week is midterm evaluations and I think I will be looking to alter my second half plans. I would like to include EL IV regression. More to come on that...

api for plots

2012-06-20T19:10:00.001-07:00

In regard to the code for the plots, the function is:

mv_mean_contour(self, mu1_l, mu1_u, mu2_l, mu2_u, step1, step2,
levs=[.2, .1, .05, .01, .001], plot_dta=False)

The user enters the mean ranges of both variables and the interval in which to conduct a joint hypothesis test of the means. Before, these had to be entered by the user because there were problems with hypothesis tests with very large likelihood ratios. however, I think I fixed this problem and can now make all of the input parameters optional. It will take some fiddling first.

Optimization

2012-06-10T12:31:00.000-07:00

I knew that coming into this project, being familiar with numerical optimization methods would be important. As the "beginning" period has ended and the "stretch" has begun, I am happy to say that I already learned the theory and implementation of Newton's method for optimization. With the exception of simulated annealing, this is the only optimization method I feel comfortable enough with the theory to be able to code it myself and know when it is appropriate and when it isn't. Constant learning is without question the top benefit of participating in GSOC.

On another note, I had some fun with empirical likelihood confidence regions. Below is a plot of the confidence regions for the mean daily percent return for the Google and Microsoft stock from June 10, 2011 to June 10, 2012.

and here is the same thing but for only the last 30 trading days and the actual returns plotted.

Week 1 wrap up

2012-05-28T07:47:00.002-07:00

To my surprise, week 1 went smoother than I had imagined. The key contributor to that is the requirement to submit a patch for my application to GSOC. Submitting the patch forced me to learn some of the statsmodels conventions as well as some new tricks in Python. Without having that foundation, my progress would have been much slower.

In week one I laid the foundation for EL by creating an class to initialize El statistics, a class of helper functions that are either optimized or searched for a root and the class of EL functions the end user would see. Although the only end-user implementation has been hypothesis tests and confidence intervals for the mean, the class of helper functions will be able to be reused later and make further progress even faster.

For this week, I hope to implement hypothesis tests and confidence intervals for variance and covariance as well as hypothesis tests for the multivariate case. Although not in my original plan, if there is time I would like to add some 2d confidence region plots.

On another note, I have officially made this my official GSOC theme song

Still Reading

2012-05-12T11:51:00.002-07:00

As I am eager to get started, I am still reading and researching. One paper that I found helpful was by Bera and Bilas "The MM, ME, ML, EL, EF and GMM approaches to estimation: a synthesis." This paper stirred some ideas on how to make the EL project as general as possible so some of the implementations can be used in other models.

Tick Tock

2012-05-06T13:39:00.002-07:00

I am waiting for the semester to end so that I can focus my full time on my GSOC project. Last week was finals week for me so I didn't get to do much thinking about empirical likelihood. However, two Py-stat related events did happen.

First, I had a meeting with a professor to discuss how I can continue contributing to statsmodels after GSOC. To say the least, the options are limitless.

Secondly, I found out I will actually be taking an Empirical Likelihood class next semester. I wanted to make sure it wouldn't duplicate what I will be doing for GSOC so I email the professor. She responded, "most materials are not covered by Owen's book, because they are from my own past few years' research." This is very exciting since this summer I will be able to tie down the basics of e.l. and next semester I will bring my knowledge (and of course statsmodels) to the frontier.

Accepted!

2012-04-26T12:50:00.003-07:00

I found out on Monday that my project, "Empirical Likelihood in Statsmodels" was accepted for GSoC. It was great news for me and I am eager to get started.

Submitting the patch as part of the application proved to be very helpful. Spending a couple days just reading the current code really helped me determine exactly how I will be organizing my project. Also, going through the code drilled in how classes work in Python and I found out what class inheritance was and how to use it.

This week I will be in contact with my mentor(s) to discuss some more of the structure of my project. I also hope to use the next couple days to get more familiar with git-hub. I want to make sure that I have all the essentials tightened up before the official coding period begins.

It begins...kind of

2012-04-04T21:46:00.000-07:00

As the application period comes to an end, I realized I put as much thought into my application as a hefty class assignment. In fact, I would say that just putting the application together helped me to better understand the material that I hope to code for statsmodels this summer and taught me as much if not more than a typical class project usually does.

Briefly, I proposed to implement empirical likelihood estimation in python. Empirical likelihood is a non-parametric method of estimation that gives observed data a very loud voice. I am particularly drawn to this and nonparametric statistics in general for 2 (main) reasons.

First, it frees the researcher from many of the distributional assumptions that are typically found in standard econometrics and statistics textbooks. Although there are countless reasons to remove these assumptions, one is to predict movements in stock prices. Many classical models rely on assumptions or normality (or Brownian motion), when forecasting stock prices or pricing options. However, it has been shown that stock prices follow a distribution with heavy tails. Ignoring these fat tails (higher probabilities of large movements) can be problematic for researchers and practitioners alike.

The second reason I am attracted to empirical likelihood and nonparametric statistics is more pragmatical (or lazy, however you look at it). Statisticians spend plenty of resources deriving complicated analytical solutions that only apply "in the limit" and can often be very misleading when used in finite samples among people who are unsure of their worth. While these are sometimes helpful for practitioners or policy maker that only wants to "throw" a couple variables into a statistical software, it seems as though the derivation of these analytical solutions is somewhat of a mis-allocation of resources (brain power of some very smart people) since specific questions can be answered much more easily through other computational methods. Undoubtedly though, we will never stop developing these nice, pretty analytical solutions. It is in out nature.

"One reason comes from our wish, as theoreticians, to explore the source of the a priori practical principles that lie in our reason. " -Immanuel Kant, Groundwork of the Metaphysics of Morals.