Statistics Models

Umileated
12-09-2011, 07:02 PM
I recently started playing around with the CHODR model crafted by Robin Lock over at SLU. Here's his page on the topic:
http://it.stlawu.edu/~chodr/

I also have been tinkering with a variant that isolates either home or away data for a given team depending on the matchup.

Who out there has worked with it before? What other models do you guys like? There has to be some other dork out there.

Patman
12-09-2011, 07:11 PM
CHODR is a Poisson regression... he doesn't consider any results that go into OT, and I believe he discounts ENGs. I've done home advantage for every team, but I don't really know that it provides anything substantive. I've also used an offset for time-of-match... I think at some point, however, I had to adjust it more accurately to the "first one who scores wins" paradigm... I forget exactly.

To me, the Poisson model is more or less the best of the standard forms at explaining hockey. Univariate distributions rarely have any useful bivariate (correlative) forms... overdispersed Poisson (negative binomial) models are usually a waste of time. I believe that goal differential IS better at explaining strength in hockey than wins and losses... granted, score gives you more information... but I believe that GF-GA is a bit more predictive... but you can make pros and cons... pro: Boston v. Vancouver SCF... con: TB making the CF.

The inherent mathematical problem of the ranking problem is the fixing contrast... sum to zero, whatever one calls it... that every team is compared in a 1 and -1 fashion. This means that it's hard to use more exotic methods. I've been coming around to Rutter's application of the logit (KRACH) model... the idea that you have a hierarchy imposed that denotes some measure of closeness as a means of controlling against large estimates. Anything beyond the generalized linear model class is hard to deal with... hierarchy is a ***** to deal with when having to face issues of 1 and -1 contrasts. So, I have a few models I'd love to employ (I think Bayes is ideal for ranking sports teams in some ways... anything from the "objective" class)... I just don't believe the mathematics involved is favorable (in many plausible models).

Dr. Joyce

edit: I only partially apologize if I just went over everybody's heads... I've been thinking about this for a while and I'm not really game for playing towards the layman right now... gotta go out and drink and stuff
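
To make the Poisson idea above concrete, here is a minimal Python sketch of a multiplicative offense/defense scoring model. It is not the actual CHODR code: the function name `fit_poisson`, the made-up game list, and the alternating update scheme are all assumptions for illustration, and it ignores home ice, OT, and ENGs entirely.

```python
# Hypothetical sketch: goals scored by team s against team o ~ Poisson(offense[s] * defense[o]).
# Each game is (home_team, away_team, home_goals, away_goals); the results below are invented.

def fit_poisson(games, n_iter=500):
    """Maximum-likelihood fit by alternating closed-form updates of the two factor sets."""
    teams = sorted({team for game in games for team in game[:2]})
    offense = dict.fromkeys(teams, 1.0)
    defense = dict.fromkeys(teams, 1.0)  # larger value = allows more goals
    # One row per (scorer, opponent, goals): every game contributes two rows.
    rows = []
    for home, away, home_goals, away_goals in games:
        rows.append((home, away, home_goals))
        rows.append((away, home, away_goals))
    for _ in range(n_iter):
        # offense_t = goals scored by t / sum of the defense factors t faced
        for t in teams:
            scored = sum(g for s, o, g in rows if s == t)
            faced = sum(defense[o] for s, o, g in rows if s == t)
            offense[t] = scored / faced
        # defense_t = goals allowed by t / sum of the offense factors t faced
        for t in teams:
            allowed = sum(g for s, o, g in rows if o == t)
            faced = sum(offense[s] for s, o, g in rows if o == t)
            defense[t] = allowed / faced
    # Only the products offense*defense are identified, so pin the mean defense at 1.
    scale = sum(defense.values()) / len(teams)
    offense = {t: v * scale for t, v in offense.items()}
    defense = {t: v / scale for t, v in defense.items()}
    return offense, defense

games = [
    ("A", "B", 3, 2), ("B", "A", 1, 4),
    ("A", "C", 5, 1), ("C", "A", 2, 2),
    ("B", "C", 3, 3), ("C", "B", 1, 2),
]
offense, defense = fit_poisson(games)
for team in sorted(offense):
    print(team, round(offense[team], 3), round(defense[team], 3))
```

Under these assumptions a matchup's expected goals would be offense[scorer] * defense[opponent]; a home-ice term would enter as one more multiplier.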

Umileated
12-09-2011, 07:58 PM
Don't apologize for the technobabble - this was the sort of response I was hoping to receive.

As far as the ENG, I believe I read somewhere in the site (or perhaps emails) that he'd like to discount them, but data on when they occur is "fuzzy."

I'm planning on tinkering with the KRACH in the near future. I had set out originally to re-engineer KRACH, but got distracted along the way for a few reasons.
For one, CHODR seemed far easier to program - I've set it up in a simple Excel file. Secondly, I also like the appeal of measuring goals for/against rather than win records. CHODR seemed more immediately useful in looking at a specific upcoming matchup (read: just how much is UNH going to disappoint me tonight?). I constructed the home/away scenario model because the "home ice advantage" scalar that is used in traditional CHODR seemed a bit too generalized for me. The Home/Away model does away with this scalar, but also takes more time to converge on a useful prediction.

Unfortunately, I was far too entertained by linear algebra when statistics came around so I'm making up for it now with this project. I also haven't taken advantage of any formal Matlab training - UNH finally created a course in the program in my last semester, and the work I did do with it never really summed up to a cohesive understanding for me (entirely my fault). This was one of the motivators for me in playing around here.

What are your thoughts on some sort of goals for/against variant on the KRACH or at least a ranking model that considers such data? To me, the zero-sum nature of win/loss data makes any ranking model based upon it just a degree or two of interpretation removed from the win/loss record itself - particularly with leagues that place a strong emphasis on intra-conference schedules.

goblue78
12-10-2011, 09:44 AM
Hi... been away for awhile but just got back and saw your post. I agree with Patman that some sort of Poisson regression is probably the best way to go for a number of reasons. I also agree that handling OT is one of the trickiest problems under a Poisson model of goals. I experimented with this a lot last year but haven't had the time this year to follow through. Search back for some of my posts last year and PM me if you want more info about what I did. As another useful reference, take a look at hockeyanalytics.com and in particular his pdf called "Poisson Toolbox."

Edit: Relevant thread here (http://board.uscho.com/showthread.php?95529-A-new-ranking-systen-for-college-hockey)

Patman
12-10-2011, 09:58 AM
I think anything that tries to combine the two is well-intentioned but will inevitably fail. I suppose it's a chicken and egg thing... did you win because you scored a lot and gave up less, or did you score a lot and give up less because you won?

There are some more complicated things out there (I SWEAR I saw a Journal of Quantitative Analysis in Sports article that tried some sort of semi-parametric score model for college football... extremely ambitious, IMO). Some of the BCS stuff, their non "win/loss" stuff (as in, not their current BCS method), employs what is called a "game-point function"... f(GF,GA)-->p in [0,1]. The idea is that the score imparts some information on strength of the win which implies what the underlying chance of winning was... I think it's an ill-posed concept but it gets away from the extreme 0-or-1 dynamic... so it does get you somewhere further than before... there's always a loss of information when projecting down to two figures... just as there is when projecting down to two counting numbers (after all, a better analysis would consider the players' actions and try to measure talent and fatigue... I've heard of some statistical models for tanks... but those are insanely implausible... all statistical models are a philosophic approximation of the underlying truth).

KRACH is something that's very simple to implement... I think I've done it in C at some point... it's nearly trivial in R (R is slower but is a whiz with multi-dimensional objects... still slower but easy to write)... if you intend to pursue statistics and/or data analytics then R (it's free) is something you ought to start playing around with casually. If I had my druthers (and I don't... and won't... and we don't have the money) I'd hire a very strong R programmer (with a math or stat masters) tomorrow.

I've been more interested in applications... rankings are "great"... but I still dream of something more like Baseball Prospectus playoff predictions. It'd take an ambitious amount of computation work (everybody has their own playoffs, tie-breakers, in-season tournaments, those rules, etc., etc., etc.) that, really, would only be doable by a college student with a good handle on computation and more time than sense (which in hindsight is better spent drinking and chasing girls). The information is fun because it can show you some simple things... but an ordinal ranking is almost the same kind of navel gazing but with more rigor behind it. Learning what the information implies is much more work but much more insightful. Quickest things I've learned... 3-2 scores are the most likely in hockey (right now, as long as GPG is generally below 3)... somebody ran a score prediction contest... every answer was either 3-2 or 4-1 despite a heavy bonus given for predicting shutouts.... I was in 2nd place only a month or two into the season before it was shut down. I also suspect you can pull your goalie MUCH sooner (say like 5 minutes or more) and still have an advantage... but nobody's going to test that, this is only based on the Poisson model assumption, and it doesn't account for the defending (and leading) team getting better at 6 v 5.
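
The score-likelihood point is easy to play with under the same Poisson assumption. Here is a rough Python sketch that treats each side's goals as an independent Poisson count and tabulates the probability of every final score; the two scoring rates are made-up numbers, and the names `poisson_pmf` and `score_grid` are just for illustration.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals arrive as a Poisson count with this mean."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def score_grid(rate_a, rate_b, max_goals=20):
    """Probability of each (goals_a, goals_b) score, assuming the two counts are independent."""
    return {
        (a, b): poisson_pmf(a, rate_a) * poisson_pmf(b, rate_b)
        for a in range(max_goals + 1)
        for b in range(max_goals + 1)
    }

# Hypothetical expected goals for the two teams in regulation.
rate_a, rate_b = 3.2, 2.4
grid = score_grid(rate_a, rate_b)

most_likely = max(grid, key=grid.get)
p_a_ahead = sum(p for (a, b), p in grid.items() if a > b)
p_level = sum(p for (a, b), p in grid.items() if a == b)
p_b_ahead = sum(p for (a, b), p in grid.items() if a < b)

print("most likely score:", most_likely)
print("A ahead / level / B ahead:", round(p_a_ahead, 3), round(p_level, 3), round(p_b_ahead, 3))
```

The "level" mass is where the OT headache discussed earlier lives: a plain Poisson model has nothing to say about how those games get decided.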

Umileated
12-10-2011, 11:31 AM
goblue78, I'm looking over your paper that started the older thread. The introduction sounds a lot like what I wanted to try and get at. I'll have comments in a day or two; some of them might even wax academic.

mookie1995
12-10-2011, 01:08 PM
lies, damn lies, and statistics.

Umileated
12-13-2011, 10:12 AM
I'm not sure I agree with your 2nd sentence.

When Bill James was still writing Baseball Abstract, he did an entire chapter one year on the significance of how you won, and, in particular, by what margin, that postulated that it was very indicative of what kind of team you are/have. Good teams win by large margins. Bad ones don't. That was the gist of it all. This same article completely pooh-poohed 1 run wins as pretty much meaningless (despite how much you always hear about them in MLB), over the course of baseball history, and he took a look at all of it to formulate that opinion.

It looks like the writer here came to some of the same conclusions that James did, for college hockey, although I readily admit we are talking two entirely different sports here. The parallels to what he said in Baseball Abstract are interesting, though.

Pulled this from goblue's relevant thread below. Anyone have an idea on the year of the article?

goblue78
12-14-2011, 10:07 PM
I suspect 82 or 83, which are the really ancient Abstracts that I no longer have...

Ralph Baer
12-15-2011, 06:08 AM
Patman wrote: "I also suspect you can pull your goalie MUCH sooner (say like 5 minutes or more) and still have an advantage... but nobody's going to test that,"

You haven't been watching RPI, have you? :)

LynahFan
12-15-2011, 08:34 AM
Patman wrote: "KRACH is something that's very simple to implement... i think i've done it in C at some point... its nearly trivial in R (R is slower but is a wiz with multi-dimensional objects... still slower but easy to write)..."

My MATLAB version is only about 30 lines of code for the actual computation, with maybe another 50 or so for reading in the game results, formatting the output for the screen, etc. Completely trivial.

Numbers
12-15-2011, 10:18 AM
Hi everybody. No college degree for this guy who likes numbers (that's a different story). But, I understand the KRACH model, and like it lots (it's interesting to implement for NHL, too). How do you adjust it for home/road? Because it would seem like you have to guess what should be the 'benefit' to being the home team. Or else, you compute a total average over all of hockey for that, and then use that. But, I don't think the advantage for the home team is the same in every barn?
Oh, and as far as code, I don't understand all that, but I have an NHL Excel file set up and am simply using
K(i) = v(i) / SUM(over j){ n(i,j) / (K(i) + K(j)) }, where v(i) is team i's wins and n(i,j) is the number of games between i and j, and then iterating manually.

Well, actually, I am using OpenOffice Calc rather than excel.

Thanks,
NUMBERS

Numbers
12-15-2011, 10:29 AM
Patman,
What exactly is Rutter's application of the KRACH model?
Thanks.
Numbers

Numbers
12-15-2011, 10:35 AM
And, generally, I have another question that seems to belong here.

If college hockey wishes to choose its NCAA field by game results only, and KRACH (can we please find a better name? And, how would a statistician really refer to this method?) does so as well as any, how do we deal with the following problem?

Currently, the top of the list is filled with CCHA teams. I mean filled. I don't really have a problem with that, but I have a feeling that the math works out that way because the number of non-conf games is small, so a couple of handfuls of good results elevate the entire league.

Again, I don't really have a problem with that - if you want to use results, then use results. What I wish was that there were a way to smear the benefit a little. Does anyone understand what I mean?

Maybe in short it would be like this: KRACH makes the non-conference results of all the teams in one's league to be very important, because of the high number of insulated games within conferences. How can we tone that down a little?

Thanks,
Numbers

Patman
12-15-2011, 10:38 AM
The quick (and disappointing) answer is that you can't. That is to say, if you did, it wouldn't look like the form for KRACH.

KRACH works nicely because it's somewhat more of a fundamental form and it can be re-stated through simple sums. Sadly, most of statistics does not build up this way... this is where we get into the generalized linear model. We relate the parameter onto the real line, relate a linear function through that transformation, and then calculate the maximum likelihood... but that's no longer as neat as p_i/(p_i+p_j).

KRACH can be re-stated in regression form when we let p=exp(c)/(1+exp(c)) where c=beta_i-beta_j. A suitable re-establishment of the beta terms plus a nice constraint gives us the result. Run that maximum likelihood calculation (which we can do easily by Newton-Raphson, as the log-likelihood is concave in beta) and then the relationship between that and the KRACH results is K_i=a*exp(beta_i).

If we wanted to adjust for home and away, what we'd do is use c=beta_i-beta_j+home*I... I is 1 if the winning team is the home team and -1 if they are the away team (and usually zero for venues considered to be neutral). Sadly, we cannot re-establish this cleanly as multiples of this or that... (I say this, but I'm not 100% confident... it'd have to be something like p_i/(p_i+h*p_j)). On the other hand, as a direct comparison of teams we can still use a*exp(beta_i).
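
A small Python sketch of the correspondence described above, using made-up ratings rather than fitted ones. The dictionary `krach`, the value of `home`, and the function names are hypothetical; the only claim checked is the algebra, namely that exp(c)/(1+exp(c)) with c=beta_i-beta_j+home*I equals K_i/(K_i+h*K_j) when beta_i=log(K_i/a) and h=exp(-home*I).

```python
import math

def expit(c):
    """Logistic function: maps a real-valued contrast c to a probability."""
    return math.exp(c) / (1.0 + math.exp(c))

# Hypothetical KRACH-style ratings and home-ice coefficient (not fitted to any data).
krach = {"A": 400.0, "B": 100.0, "C": 25.0}
a = 1.0        # arbitrary scale constant in K_i = a * exp(beta_i)
home = 0.3     # home-ice effect on the logit scale

beta = {team: math.log(k / a) for team, k in krach.items()}

def win_prob_logit(i, j, site=0):
    """P(i beats j); site is +1 if i is at home, -1 if i is away, 0 at a neutral venue."""
    return expit(beta[i] - beta[j] + home * site)

def win_prob_ratio(i, j, site=0):
    """Same probability written in ratings form, with h = exp(-home * site)."""
    h = math.exp(-home * site)
    return krach[i] / (krach[i] + h * krach[j])

for site in (1, 0, -1):
    print(site, round(win_prob_logit("A", "B", site), 4), round(win_prob_ratio("A", "B", site), 4))
```

So the home-adjusted model can still be written as a ratio of ratings, but the multiplier h changes with who is at home, which is why the tidy one-number-per-team sums no longer fall out.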

Patman
12-15-2011, 10:48 AM
IIRC, Rutter uses a "hierarchical model"... the general concept is that say you are flipping coins... you know all coins are milled somewhat differently but behave similarly to each other... so while they may take on different values we have some reason to believe that certain rates are more frequent than others. We then can flip several coins and use the knowledge from the other coins to provide sharper estimates of each of the coins by borrowing from a common structure (which we will often give a parametric form... see beta distribution).

In this case, the concept is the same, but then it's a question of linkage and calculation. I believe Rutter uses the linearized form as explained in the previous post but then entwines a hierarchy where he assumes each beta_i has a normal distribution with mean zero and variance sigma (but unknown). The idea is similar, you assume a common super-distribution (hyper-distribution, hierarchy, hyper-parameters, so on) and so you can use this to better couch the estimates... it serves in some cases as a deflationary influence and as such is useful against extreme values. Rutter, as a Bayesian, also employs what is termed a "prior" on the unknown sigma... often this can be referred to as "prior belief", "subjective probability", etc. though there are forms that try to be objective. Even the hierarchy imposed here can be seen as a Bayesian application. The general idea is you start with a vague initial state and you use your learned knowledge (data) to refine that belief. Anyhoo... there are a lot of other parts that one can argue about or disagree with... but the important notion here is that he ties all the teams through a common distribution under the notion that hockey teams should exhibit some amount of spread and that spread can be calculated. He's also making a rough notional call to the "Stein Estimation" problem. It's an interesting phenomenon in science (things that are true in 2-D in science are often not true in 3-D and beyond... I had a professor who said he had a professor who speculated that this is why our universe works)... so while one may argue that hockey teams don't necessarily come from a common pool there is some utility in doing so because it mitigates the overall degree of error.

As such it's a bit technical of an approach but certainly something that I personally see as a reasonable model. Hopefully I've made somewhat of an understandable pitch without being too technical.
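
The coin analogy can be sketched in a few lines of Python. This is only the coin version with a fixed Beta prior, not Rutter's model (which, per the description above, puts a normal distribution on the beta_i with an unknown variance that gets its own prior). The flip counts and the prior parameters `prior_a` and `prior_b` are made up.

```python
# Hypothetical coin-flip data: (heads, flips) for four coins.
coins = [(9, 10), (5, 10), (2, 10), (6, 10)]

# Assumed common Beta(prior_a, prior_b) distribution for the coins' heads rates.
# In a full hierarchical model these would be estimated or given their own prior.
prior_a, prior_b = 10.0, 10.0
prior_mean = prior_a / (prior_a + prior_b)

raw = [heads / flips for heads, flips in coins]
# Posterior mean under a Beta prior with binomial data: (a + heads) / (a + b + flips).
shrunk = [(prior_a + heads) / (prior_a + prior_b + flips) for heads, flips in coins]

for (heads, flips), r, s in zip(coins, raw, shrunk):
    print(f"{heads}/{flips}: raw {r:.3f} -> shrunk {s:.3f}")
```

Each estimate gets pulled toward the common mean, and the extreme 9-for-10 coin gets pulled the furthest in absolute terms, which is the deflationary influence on large estimates mentioned above.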

Umileated
12-15-2011, 12:37 PM
LynahFan wrote: "My MATLAB version is only about 30 lines of code for the actual computation, with maybe another 50 or so for reading in the game results, formatting the output for the screen, etc. Completely trivial."

Feeling generous?

LynahFan
12-15-2011, 01:23 PM
Always, but it's on a different computer on another continent. I can post it when I have access again.

FlagDUDE08
12-15-2011, 01:29 PM
SSH is your friend. Assuming that different computer is on, connected to the network, and you either know the IP or have a DynDNS setup on it...

Patman
12-15-2011, 09:23 PM
the hardest part, as a statistician, is creating the records and reading them in from file... once they're in array or matrix form it's quite simple...


# SCRATCH CODE BASED ON R... MIGHT WORK WITH S+
# assume... n.teams is the number of teams, n.iter the number of iterations
# assume... win.vctr is the vector of the count number of victories V_i
# assume... game.mtx is the matrix of games... N_ij (zero on the diagonal)
rate.cur <- rep(100, n.teams)   # current rating, initialized at 100 for every team
rate.new <- numeric(n.teams)    # new rating, filled in on each pass

for (k in 1:n.iter) {
  for (i in 1:n.teams) {
    # K_i = V_i / sum_j N_ij / (K_i + K_j)
    rate.new[i] <- win.vctr[i] / sum(game.mtx[i, ] / (rate.cur[i] + rate.cur))
  }
  rate.cur <- rate.new
}
# note... converges very fast... 100 iterations or fewer... convergence rate term not usually needed
# could probably re-skin with a tapply function and make even smaller