Brewer Fanatic

Regression Analyses and Expectancy Tables in SPSS - Help


I was wondering if any of the people who frequent this board could point me towards a good tutorial on how to do these types of statistical analyses in SPSS. I am specifying SPSS because it is the only statistical program that I really have any experience with. Any help that you guys could give me would be appreciated. Thanks.

Here's tangotiger describing regression analysis for baseball far better than I ever could:

 

Quote:
I want to talk a little bit about a misunderstood or perhaps overlooked concept in statistics as it relates to baseball (or perhaps baseball as it relates to statistics) and why it is important. One of the hard-core stat guys who frequent or lurk on this site may have to help me out with some of the nuts and bolts but I think I have a pretty good handle on the gist of the matter.

 

Remember that regression towards the mean depends upon two things and two things only. One, the sample size, and two, the spread (variance) of true talent in the population. As each one approaches infinity, regression approaches zero, regardless of the magnitude of the other one, and as each one approaches zero, regression approaches 100%, also regardless of the magnitude of the other one. And of course, as both get larger together, regression gets smaller, and vice versa.
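A minimal sketch of how those two inputs trade off, not taken from the quoted post. It assumes a simple model in which the observed rate is true talent plus binomial sampling noise, so the regression fraction is noise variance divided by total variance. The function name `regression_fraction` and the 0.020 talent spread are hypothetical, chosen only for illustration.

```python
def regression_fraction(n, talent_sd, p=0.260):
    """Share of an observed deviation from the mean to regress away.

    Assumed model (not from the post): observed rate = true talent + binomial noise.
    n         -- sample size (e.g. at-bats), must be > 0
    talent_sd -- standard deviation of true talent in the population
    p         -- league-average rate, used for the binomial noise variance
    """
    noise_var = p * (1 - p) / n      # sampling variance of a rate over n trials
    talent_var = talent_sd ** 2      # spread of true talent
    return noise_var / (talent_var + noise_var)


# Hypothetical talent spread of 0.020; only the direction of the effect matters here.
for at_bats in (10, 300, 2000, 100000):
    print(at_bats, round(regression_fraction(at_bats, 0.020), 3))

# With no spread of talent at all, regression is 100% whatever the sample size.
print(regression_fraction(3000, 0.0))
```

Under this assumed model, a bigger sample or a wider talent spread pushes the fraction toward zero, and a tiny sample or a tiny talent spread pushes it toward one, which is the behaviour described above.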

 

Whether you fully comprehend or are able to digest that, as you read this, watch TV, eat a sandwich, and yell at the kids all at the same time, is not that important. Know, however, that this is a very important concept in baseball (actually all sports) analysis. Without a basic knowledge of this, you are going to be wrong in just about everything you think you know about team and player talent, projecting the likelihood of future events, etc.

 

But that is not exactly what I want to talk about.

 

Let's say that you have an unknown (you know nothing about him other than his BA) player who hits .300 in 2000 AB's. What is your estimate of his true BA? Again, that depends on two things: the number of AB's or sample size - in this case, a pretty good amount, and two, the spread of BA talent in the population. Without trying to figure that out (which is doable of course) we can simply (more or less, as in sports we always have selective sampling and survival bias issues that reduce random or representative samples to non-random and non-rep ones) look at "time period" to "time period" BA correlations among a large group of players and then extrapolate that number to players with 2000 AB's. As it turns out, among batters with around 300 some odd AB's, you get a correlation of about .36, so that for a batter with 2000 AB's, you get a correlation of around .77 or a regression of around .23 (remember that regression is 1-r given a roughly normal distribution). So that .300 batter regresses to .291 assuming that league average is .260. IOW, that is our estimate of this player's true BA talent and that is what we would predict him to hit at any time in the future, assuming everything else, including his true talent, remains the same.
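The arithmetic in that paragraph can be sketched in a few lines of Python. The regression step uses only the numbers given above (.300 observed, .260 league average, r of .77). The extrapolation step is an assumption: the post does not say how .36 was scaled up to 2000 AB's, so `extrapolate_r` uses the common r = n / (n + k) form, and both function names are hypothetical.

```python
def extrapolate_r(r0, n0, n):
    """Scale a correlation seen at sample size n0 to sample size n.

    Assumes r = n / (n + k), where k is the sample size at which r would be 0.5.
    This form is an assumption, not something stated in the quoted post.
    """
    k = n0 * (1 - r0) / r0
    return n / (n + k)


def estimate_true_rate(observed, league_mean, r):
    """Keep a fraction r of the observed deviation; regress the rest (1 - r) to the mean."""
    return league_mean + r * (observed - league_mean)


# Taking "300 some odd" as exactly 300 AB gives about .79 rather than the quoted .77;
# a somewhat larger starting sample would land closer to .77.
r_2000 = extrapolate_r(0.36, 300, 2000)
print(round(r_2000, 2))

# The regression step with the post's own numbers.
estimate = estimate_true_rate(0.300, 0.260, 0.77)
print(round(estimate, 3))  # 0.291
```

The second print reproduces the .291 estimate from the text. The first is only a plausibility check on the extrapolation, and it is sensitive to how many AB's "300 some odd" really means.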

 

O.K. fair enough. But what if we have a bunch of players, say an entire team, who hit a collective .300 in a collective 2000 AB's? Let's call that 10 or so full-time players on a typical team around 2 months into the season. Your first instinct might be to think that the regression would be the same, and that we would project those same players to hit .291 for the rest of the season - after all, we still have a bunch of players who are part of the league where the average BA is .260 and we still have 2000 AB's. Well, that first instinct would be wrong - dead wrong. Why? It is because the spread of talent in the population, which is one of the two criteria that determine the amount of regression, is not nearly the same in the first example (the one batter) as it is with the second (a collection of 10 batters).

 

In order to get a handle on a difficult concept, it is often quite useful to imagine an extreme situation, but one in which the parameters are essentially the same. I do this all the time when trying to sort out the answer to a particular problem or question. Let's say that our collection of players is comprised of the whole league (starters at least), around 300 full-time position players. And let's say that they hit a collective .300 in their first 10 AB's (3 days or so into the season). That is a total of 3000 AB's, a pretty large sample of performance. I think we know intuitively that we would not expect them to hit .290-something for the remainder of the season, which is what our single player model would predict. Why is that? Again, it is because our second parameter for determining the amount of regression is much different than in the single player model. In fact, we know that the spread of talent of ALL PLAYERS within the population of ALL baseball players (technically in this example, it is 300 starters among many more players) is by definition zero. So our regression is exactly 100% and we expect our 300 players to hit .260 for the remainder of the season.

 

Why is understanding this concept important? Because it allows us to put in perspective sample performances by teams and other groups of players, such that we don't get too excited when we see especially good or bad performances even over what seems like substantial periods of time (AB's or IP's) or large sample sizes.

 

For example, when we see a team bat .290 in 1000 AB's, the equivalent of a little more than a month into the season, does that mean that this team is likely a great hitting team (assuming we know little else about them)? Well, even though we are talking about 1000 AB's, it is likely that the regression toward the mean on that .290 is pretty substantial. For an individual player, it might be only 37% or so. For a team, it is probably closer to 70 or 80% (I don't really know off the top of my head), maybe more.

 

Ditto for bullpens. I should say especially for bullpens! When you see a team's collective ERA in 250 innings at the all-star break and it is 5.50 or 3.00, do NOT think of that like you would a starting pitcher after one long season. It is not even close. It is likely that those same two bullpens will not be that far apart in the second half of the season (maybe ¾ of a run), especially given the large regression even for a single pitcher model.

 

In fact, I'll close this little ditty with some real-life examples of bullpen regression. I looked at all teams' non-starter innings from 1989 to 2000. I took their first half ERA's and divided them up, collectively, into 6 groups from best to worst. I then looked at each group's collective second-half ERA. Keep in mind that truly bad bullpens (true talent-wise) tend to improve (a little), true-talent wise (not including regression toward the mean), and that really good bullpens, true-talent wise, tend to get a little worse as pitchers get injured and replaced by, on average, worse ones. In any case, I am making the assumption that most (80-90% maybe) of the difference we see from first half to second half is regression to the mean. Notice the huge number of innings in each half season.

 

# Team seasons   IP 1st half   ERA 1st half   IP 2nd half   ERA 2nd half
46               9741          3.10           11005         3.89
54               11331         3.65           12584         4.14
50               10618         4.09           11492         4.13
50               10479         4.47           10748         4.30
55               11830         4.88           12836         4.51
53               11351         5.72           11600         4.59

 

Amazingly, there is less than a ¾ run second-half difference between the teams with the worst and best first-half ERA's, despite a 2.62 run difference in that first half! That is a 74% regression to the mean in over 200 IP per team. So the next time you hear an announcer, player, manager, or other "expert" tell us how great or terrible a bullpen (or starting staff or team, etc.) is (and what to expect in the future) after a half season, let alone a few weeks, month, or, if you read the papers, watch TV, or listen to the radio at all, A WEEK OR TWO, take their "wisdom" with a large grain of salt!
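For anyone who wants to check those figures, here is a small Python sketch that recomputes them from the table. The numbers are copied from the table as posted; the variable names are my own. Working from the rounded ERA's, the regression comes out at roughly 73%, close to the 74% quoted.

```python
# Rows copied from the table: (team seasons, IP 1st half, ERA 1st half,
# IP 2nd half, ERA 2nd half), ordered from best to worst first-half group.
groups = [
    (46, 9741, 3.10, 11005, 3.89),
    (54, 11331, 3.65, 12584, 4.14),
    (50, 10618, 4.09, 11492, 4.13),
    (50, 10479, 4.47, 10748, 4.30),
    (55, 11830, 4.88, 12836, 4.51),
    (53, 11351, 5.72, 11600, 4.59),
]

best, worst = groups[0], groups[-1]

# Gap between the worst and best groups in each half of the season.
first_half_gap = worst[2] - best[2]
second_half_gap = worst[4] - best[4]

# Fraction of the first-half gap that disappears in the second half.
regression = 1 - second_half_gap / first_half_gap

# First-half innings per team season, to check the "over 200 IP per team" remark.
min_ip_per_team = min(ip1 / seasons for seasons, ip1, _, _, _ in groups)

print(round(first_half_gap, 2), round(second_half_gap, 2))
print(round(regression, 2))
print(round(min_ip_per_team, 1))
```

This compares only the two extreme groups, which is how the paragraph above frames it. It does not separate true-talent change from regression to the mean.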


 

LINK


  • 1 month later...

That's not necessarily regression analysis though. That's regression towards the mean.

 

If I am correct, the OP wants to know about the percentage of variance explained in an outcome variable by one or more predictor variables. This is fairly easy to do in SPSS. If anyone still wants to know how to do this, let me know.
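To make "percentage of variance explained" concrete, here is a minimal sketch in Python rather than SPSS. It fits a one-predictor least-squares line and reports R squared. It reuses the six group-level ERA pairs from the quoted table purely as sample data; the function name `simple_regression` is hypothetical. Because these are group averages, the R squared will be much higher than you would expect from team-level data.

```python
def simple_regression(xs, ys):
    """Ordinary least squares with one predictor.

    Returns (slope, intercept, r_squared), where r_squared is the share of
    variance in ys explained by xs.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# Sample data: first-half ERA (predictor) and second-half ERA (outcome)
# for the six bullpen groups in the quoted table.
first_half = [3.10, 3.65, 4.09, 4.47, 4.88, 5.72]
second_half = [3.89, 4.14, 4.13, 4.30, 4.51, 4.59]

slope, intercept, r_squared = simple_regression(first_half, second_half)
print(round(slope, 2), round(intercept, 2), round(r_squared, 2))
```

The slope is the part that connects the two ideas in this thread: a slope well below 1 means second-half ERA moves only a fraction of a run for each run of first-half difference, which is regression towards the mean showing up inside a regression analysis.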


Archived

This topic is now archived and is closed to further replies.
