Massey Ratings FAQ

First, to avoid confusion, be aware that I publish two different sets of rankings:

the "Massey Ratings", which utilize actual game scores and margins in a diminishing returns fashion
the BCS compliant version, which does not use the actual score

I will summarize the latter, since that is the one relevant to college football fans.

Massey's BCS ratings are the equilibrium point for a probability model applied to the binary (win or loss) outcome of each game. All teams begin the season rated the same. After each week, the entire season is re-analyzed so that the new ratings best explain the observed results. Actual game scores are not used, but homefield advantage is factored in, and there is a slight de-weighting of early season games. Schedule strength is implicit in the model, and plays a large role in determining a team's rating. Results of games between well-matched opponents naturally carry more weight in the teams' ratings. The final rating is essentially a function of the team's wins and losses relative to the schedule faced.
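To make the idea of a win/loss probability model concrete, here is a minimal sketch in Python. It is not the actual Massey BCS model, whose details are not given here. It assumes a logistic (Bradley-Terry style) win probability, a fixed home_edge, and a weak prior that pulls ratings toward zero so that unbeaten teams stay finite. The function name, parameter values, and sample results are all hypothetical, and the de-weighting of early season games is left out.

```python
import math

def fit_win_loss_ratings(games, home_edge=0.3, prior_strength=0.5,
                         learning_rate=0.1, sweeps=5000):
    """Fit ratings from (home, away, home_won) results only.

    Assumed model (not Massey's): P(home wins) is a logistic function
    of rating[home] - rating[away] + home_edge.
    """
    teams = sorted({t for home, away, _ in games for t in (home, away)})
    rating = {t: 0.0 for t in teams}  # all teams begin rated the same
    for _ in range(sweeps):
        # Gradient of the penalized log-likelihood. The prior term pulls
        # every rating toward zero.
        grad = {t: -prior_strength * rating[t] for t in teams}
        for home, away, home_won in games:
            diff = rating[home] - rating[away] + home_edge
            p_home = 1.0 / (1.0 + math.exp(-diff))
            surprise = (1.0 if home_won else 0.0) - p_home
            grad[home] += surprise
            grad[away] -= surprise
        for t in teams:
            rating[t] += learning_rate * grad[t]
    return rating

# Hypothetical results: (home team, away team, did the home team win?)
games = [("A", "B", True), ("B", "C", True), ("C", "A", False),
         ("A", "D", True), ("D", "B", False), ("C", "D", True)]
ratings = fit_win_loss_ratings(games)
```

Every pass re-fits the whole season, so one new result can shift the ratings of teams that did not play that week. An upset moves the ratings more than an expected result, because the surprise term is larger.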

How are your BCS ratings different than the others?

All of the BCS computer rankings are based on wins and losses relative to schedule faced. Most of the differences can be attributed to the particular mathematical model used to generate the ratings. There is no tidy term in a one line formula I can point to as the difference between mine and the others. Here are some small data-related differences:

does the rating utilize homefield?
does the rating start every team at zero, or with a preseason value?
does the rating weight more recent games more?
does the rating include all teams, or just FBS?

Overall, the BCS computer rankings probably correlate more than a random selection of six human poll ballots would.

Give a quick bio / resume

Somebody has given me a Wikipedia page. I can't vouch that it is 100% accurate, but it's a good place to start.

Kenneth Massey is a professor of mathematics at Carson-Newman University in Jefferson City, TN, where he lives with his wife, Alina, and children: Page and Charles. He received his B.S. from Bluefield College and a master's degree (ABD) in mathematics from Virginia Tech. His research involved Krylov subspaces in the field of numerical linear algebra.

Massey is a partner with Limfinity consulting, and produces the Massey Ratings, which provide objective team evaluation for professional, college, and high school sports. His college football ratings were a component of the Bowl Championship Series from 1999-2013. Massey has also worked with USA Today High School Sports since 2008.

How did you get involved with the BCS?

I started working on college football ratings as an honors project in mathematics while at Bluefield College in 1995. Continuing this interest as a hobby, I developed a web page and helped pioneer the organization of college football rankings via my composite. The BCS, which started in 1997, realized the need to expand its sample of computer ratings from three to seven. My web site became a central resource point as the BCS officials searched for quality, respected, and well-established computer ratings. I received a phone call from SEC commissioner Roy Kramer in the spring of 1999 to discuss the prospect of adding my ratings to the BCS formula. Mine were chosen because of their demonstrated accuracy and conformance to the consensus, and my personal expertise in the field.

How do you evaluate a computer rating to know which one is the best?

By nature, a rating system tries to explain past results according to some model of the probabilistic and dynamic aspects of competitive sports. Therefore a rating system may by definition be the "best" according to the objective function it tries to optimize, e.g. to minimize the MSE between actual and model scores (in hindsight).

Depending on the objective, data used, and weightings, some systems may be designed to "predict" future outcomes, instead of merely "retro-dicting" past outcomes. This is a valid approach if you accept the maxim that past results are an indicator of future performance.

On the college football and basketball composite pages, I compute two metrics: correlation to consensus and ranking violation percentage. These are not meant to assess the quality of a rating system, but only to provide a crude reference point for comparing and contrasting different systems.
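As a rough illustration of the second metric, the sketch below counts a "violation" whenever the winner of a game is ranked below the loser. That definition is an assumption for this example, as are the function name and the sample data; the composite pages may compute the figure differently.

```python
def ranking_violation_pct(ranks, games):
    """Percent of games won by the team with the worse (larger) rank.

    ranks: dict mapping team -> rank, where 1 is best
    games: list of (winner, loser) pairs
    """
    violations = sum(1 for winner, loser in games
                     if ranks[winner] > ranks[loser])
    return 100.0 * violations / len(games)

# Hypothetical ranking and results.
ranks = {"A": 1, "B": 2, "C": 3}
games = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
pct = ranking_violation_pct(ranks, games)  # only C over A is a violation
```

Here one game out of four contradicts the ranking, so the result is 25.0.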

Todd Beck's prediction tracker documents predictions made a priori. It has proven difficult for any rating system to consistently be superior to the Vegas lines.

Explain how your system uses margin of victory (MOV).

The BCS compliant version does not use MOV at all. There is no distinction between a 21-20 nailbiter, and a 63-0 blowout.

The main version does consider scoring margin, but its effect is diminished as the game becomes a blowout. The score of each game is translated into a number between 0 and 1. For example, 30-29 might give 0.5270, while 45-21 gives 0.9433 and 56-3 gives around 0.9998.

The maximum is capped at 1, so the curve flattens out for blowout scores. In addition, I do a Bayesian correction to reward each winner, regardless of the game's score.

The net effect is that there is no incentive to run up the score. However, a "comfortable" margin (say 10 points) is preferred to a narrow margin (say 3 points).
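The diminishing-returns idea can be sketched with any S-shaped curve. The example below uses a normal CDF of the scoring margin with a 15-point scale. Both choices are assumptions made for illustration: this is not the actual transformation behind the numbers quoted above, it should not be expected to reproduce them, and it omits the Bayesian correction.

```python
import math

def game_value(winner_pts, loser_pts, scale=15.0):
    """Map a final score to a value between 0.5 and 1 for the winner.

    Hypothetical stand-in: normal CDF of the margin, so each extra
    point is worth less as the margin grows.
    """
    margin = winner_pts - loser_pts
    return 0.5 * (1.0 + math.erf(margin / (scale * math.sqrt(2.0))))

close_game = game_value(30, 29)   # a little above 0.5
comfortable = game_value(45, 21)  # most of the way to 1
blowout = game_value(56, 3)       # barely higher than "comfortable"
```

Under this curve the step from a 1-point win to a 24-point win is worth far more than the step from 24 points to 53, which is the flattening described above.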

In summary, winning games against quality competition overshadows blowout scores against inferior opponents. Each week, the results from the entire season are re-evaluated based on the latest results. Consistent winners are rewarded, and a blowout score has only marginal effect on a team's rating.

Explain how your system uses strength of schedule (SOS).

Results and SOS are the yin and yang of computer ratings. Simply put, a team's rating measures their performance relative to the schedule they faced. In a true computer rating, rating and SOS are inter-dependent, and are calculated in conjunction with each other. This relationship is implied by the solution to a large system of simultaneous equations, which represents an equilibrium of some mathematical model. Ratings are a function of SOS, and vice-versa.

Many fans are familiar with crude RPI type systems that may account for SOS by the records of opponents and opponents' opponents. A sophisticated computer rating system does not utilize such artificial and ad hoc factors. Instead, because SOS is an implicit part of the model, it accounts for opponent strength to an effectively infinite number of levels. It makes no sense to say that the SOS component is a certain percentage of the total rating. To gain a glimpse of how SOS can be implicit, consider this simple equation:

rating = performance + SOS

Here performance could be related to win-loss record, or some other objective measure of success. Each team has an equation of that form, which may be re-arranged to get:

SOS = rating - performance

Rating and schedule are functions of each other. All possible connections between teams are accounted for as the cream rises to the top and an equilibrium is reached. As a corollary, there is no need for explicit reference to conference affiliation. Conference strength is accounted for automatically as the model surveys the full topology of the team/game network.
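To see how the per-team equations form one simultaneous system, the sketch below solves rating = performance + SOS for all teams at once. It assumes performance is a team's average scoring margin and SOS is the average rating of its opponents; those definitions, the function name, and the sample scores are illustrative choices, not the actual Massey model. It also assumes the schedule connects every team to every other through some chain of games.

```python
def solve_ratings(games):
    """Solve rating = performance + SOS for every team simultaneously.

    games: list of (team, opponent, margin), margin from team's viewpoint.
    Assumes performance = average margin, SOS = average opponent rating.
    """
    teams = sorted({t for a, b, _ in games for t in (a, b)})
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)
    rows = [[0.0] * (n + 1) for _ in range(n)]  # augmented matrix
    for a, b, margin in games:
        i, j = idx[a], idx[b]
        # Each team's equation, multiplied by its number of games:
        # games_i * r_i - (sum of opponents' ratings) = total margin_i
        rows[i][i] += 1.0
        rows[j][j] += 1.0
        rows[i][j] -= 1.0
        rows[j][i] -= 1.0
        rows[i][n] += margin
        rows[j][n] -= margin
    # The equations fix only rating differences, so swap one of them
    # for the convention that the ratings sum to zero.
    rows[-1] = [1.0] * n + [0.0]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(rows[k][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [x - factor * y
                           for x, y in zip(rows[r], rows[col])]
    return {t: rows[idx[t]][n] / rows[idx[t]][idx[t]] for t in teams}

# Hypothetical results: A beat B by 7, B beat C by 3, and so on.
games = [("A", "B", 7), ("B", "C", 3), ("A", "C", 10),
         ("C", "D", 14), ("A", "D", 24)]
ratings = solve_ratings(games)
```

For these scores the solution is A = 10.25, B = 3.25, C = 0.25, D = -13.75. B and D never played each other, yet the gap between them is determined through their common opponents.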

Aggressive scheduling is not penalized, but instead raises the potential rating that a team may reach. However, scheduling alone doesn't earn a high rating - there must be success against it.

Faith without works is dead. -- James 2:20

For a single game, it is better to defeat a poor team than lose to a good team. However, that team's ranking may fall because another team had a more impressive win. Depending on the gap in SOS, an 11-1 team with a tough SOS may be rated higher than a 12-0 team that faced an easier schedule. Of course, there are symmetric forces at work toward the lower end of the rating spectrum.

Since the actual rating model involves non-linear equations, the notion of "average" SOS may be misleading. Games between equally matched teams are more influential to a team's overall rating. For example, the #1 team's risk/reward is greater for playing #2 and #80 than for playing #30 and #31.

Does conference affiliation affect the ratings?

I don't do any prior weighting of conferences, and therefore conference affiliation plays no direct role in the ratings. Schedule differences are implicit in the model. Conferences that perform well in inter-conference matchups will naturally be rated higher. Since these games provide a reference point for the entire conference, the rising tide lifts all boats. For this reason, non-conference games are in some sense more important than conference games.

I want to develop my own rating system. Where do I start?

If you want to research existing rating systems, visit the theory page. In particular, I recommend David Wilson's directory.

To get the feel for how computer ratings work, you may want to try this iterative procedure:

set each team's rating to zero
calculate each team's SOS to be the average rating of their opponents
calculate each team's rating to be their average net margin of victory plus their SOS
go back to step 2 and repeat until the ratings converge
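The four steps above can be sketched in Python as follows. The function name, the convergence tolerance, and the sample scores are assumptions made for this example.

```python
def iterate_ratings(games, tol=1e-9, max_rounds=10_000):
    """games: list of (team, opponent, margin), margin from team's viewpoint."""
    opponents, margins = {}, {}
    for a, b, margin in games:
        opponents.setdefault(a, []).append(b)
        opponents.setdefault(b, []).append(a)
        margins.setdefault(a, []).append(margin)
        margins.setdefault(b, []).append(-margin)
    avg_margin = {t: sum(m) / len(m) for t, m in margins.items()}

    # Step 1: set each team's rating to zero.
    rating = {t: 0.0 for t in opponents}
    for _ in range(max_rounds):
        # Step 2: SOS is the average rating of a team's opponents.
        sos = {t: sum(rating[o] for o in opps) / len(opps)
               for t, opps in opponents.items()}
        # Step 3: rating is average net margin of victory plus SOS.
        new_rating = {t: avg_margin[t] + sos[t] for t in opponents}
        change = max(abs(new_rating[t] - rating[t]) for t in opponents)
        rating = new_rating
        # Step 4: repeat until the ratings stop moving.
        if change < tol:
            break
    return rating, sos

# Hypothetical results: A beat B by 7, B beat C by 3, and so on.
games = [("A", "B", 7), ("B", "C", 3), ("A", "C", 10),
         ("C", "D", 14), ("A", "D", 24)]
ratings, sos = iterate_ratings(games)
```

Only the differences between ratings are meaningful here; for these scores A ends up 7 points above B and 24 above D. The procedure is not guaranteed to settle for every schedule. With just two teams that have played only each other, for instance, the ratings flip back and forth forever, so it works best on a well-connected league.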

The data page provides data for many leagues.

How does the betting line get set?

Oddsmakers use computer rating models as a tool when setting initial lines for games. However, they also incorporate data related to injuries, officials, matchups, motivation, travel, days off, weather, etc. Actual wagers then determine how the line moves to balance the bets so that the bookmaker is hedged.

What software do you use?

I use only open-source software, including the common LAMP stack (Linux, Apache, MySQL, PHP/Python). The main rating software is written in C++. Data is stored in a MySQL database, and rendered on the web with PHP and JavaScript. If you would like to write your own rating software, Octave is a good high-level mathematical language.

How do you deal with forfeits?

Forfeited results are not factored into the computer ratings. If an on-the-field outcome was later forfeited, the original score is used in the calculations, but the result is stricken from the win-loss records.

How do you deal with exhibition games?

Occasionally, especially in college basketball, one team counts a game in official records, but their opponent doesn't. This is a conundrum since we can't really judge the sincerity of strategy and effort in such contests. However, for record keeping practicality, I have decided that a game does not get marked exhibition unless both participating teams deem it so.

Are you satisfied with the BCS formula?

Over the years, the BCS has gotten criticized for fine-tuning its formula. Recent changes have simplified the system for the better and removed extraneous redundancies. The current setup is a good balance of the traditional human polls, which the fan base is most comfortable with, and the objective computer component. Over the years, the two methods have tended to converge as the computers have revealed to the human voters the dangers of regional bias and misunderstanding of schedule strength. There will always be controversy when the formula must split hairs between #2 and #3, but the system is stable and beginning to be accepted by the media and fans.

What would you change?

I think the biggest thing that is hurting college football is the lack of quality inter-conference games. Due to the premium placed by the media on win-loss records, most athletic directors are trying to assure themselves of 3-4 easy home wins each year in their out-of-conference schedule. Great matchups like Texas vs Ohio St in 2005 are few and far between. I would like to see something like an ACC-SEC challenge, whereby teams are matched up for 12 games over one weekend. This would get fans and media excited, and also provide a more solid basis for comparing teams from different conferences.

An increase in good matchups would require a shift in philosophy by the human polls. Currently they start with a preseason notion of who will be good, and adjust each week according to a predictable pattern. They are reluctant to go back and re-evaluate earlier results in light of new information, and thus prior biases are likely to compound as the season goes on. One result is that an 8-3 team that played a brutal schedule is often penalized, while an 11-0 team with a weak schedule is rewarded. Sometimes this strategy backfires (e.g. Auburn 2004), but in general padding one's record with wins means a higher poll ranking. Delaying the release of human polls until mid-season would help minimize bias and develop more respect for teams that play challenging early season non-conference schedules.

Do you favor a college football playoff?

Every year we hear the arguments that college football should have a playoff system. The true champion should be decided "on the field", not by biased sportswriters or computer geeks. To hear such talk, it is easy to believe that a playoff would produce an undisputed national champion.

I argue that the BCS is in fact a 2-team playoff. Regardless of the playoff field's size, there will always be the (n+1)st team that claims to deserve a chance to prove itself. The issue is therefore to develop a system that picks the most deserving teams to participate. The BCS's job is to place the two best teams in the national championship game. Perhaps paradoxically, the best team has a better chance of winning the college football championship than it does in college basketball or the NFL. The fewer rounds that must be navigated, the less chance that the best team will be upset. A playoff provides a relatively small sample of game results. The best system should make the regular season most important.

The BCS system is the best college football has ever had to determine an undisputed champion. It is really a two-team playoff. So the correct question is whether I favor expanding the number of teams in the playoff. College football is unique among all sports in that every regular season game is vital. Also, the bowl system provides a great reward to players and fans, allowing many schools to finish the year on a high note. A playoff should not ruin this. I would favor a 4-team playoff (the so-called "plus one" system), or possibly 8 teams if it was done right, but no more than that.

What is the purpose of this web site?

I maintain this web site primarily as a hobby which creatively combines my interests in math, sports, and computer programming. I enjoy sharing my work with the visitors to my site. The Massey Ratings also serve an official purpose, most notably as a component of college football's BCS.

How long have you done ratings?

I began my foray into computer ratings with college football in 1995. Since then, I have produced ratings for many pro, college, and high school sports. I continually work to improve the content and quality of my ratings and web site. The Massey Ratings have been part of the BCS since 1999.

What is the purpose of the computer ratings?

In any competitive league, there should be an objective and robust method to measure the performance of each team/individual. Win-loss records may be misleading if teams play disparate schedules, and polls suffer from human limitations and subjectivity.

After devising a mathematical model for the sport, an algorithm is implemented, and the resulting computer ratings objectively quantify the strength of each team based on the defining criteria.

How does your rating system work?

There is really no simple answer to this question (although I was once asked to provide one during a live interview on ESPN). Basically, the ratings are the solution to a large system of equations, which comes from a statistical model and actual game data. For more details, see the Massey Rating Description.

How much time does it take?

I have written fairly robust software to automate the calculations and web page generation. Daily updates require little intervention on my part. The bulk of my time is spent maintaining data files and writing computer code.

How big is your operation?

I am the sole proprietor of the Massey Ratings. Hats I wear include: researcher, developer, programmer, database manager, webmaster, and marketer. Of course, I don't work in a vacuum, and this effort would be impossible without internet resources and generous folks that have contributed over the years. See the credits.

Where do you get your data?

Most scores are collected electronically from a variety of public domain sources. Many individuals have graciously shared the results of their own data collection efforts. When convenient, basic consistency checks are run on multiple independent sources to verify the data's accuracy. I have written software that parses web pages and other sources, extracts the pertinent data, and merges it with my own database. Corrections and hard-to-find scores are entered manually.

How often do you update your pages?

Automated daily updates are scheduled for many of the mainstream sports. Each week (usually Monday), I do a full update of all the leagues.

Is the Massey computer system the best?

This depends on what goals you feel a rating system should meet. Should the rating system be predictive, or should it only measure and reward past performance (such as to determine who deserves the college football national championship)? What data is available? How is the model defined? Basically, any rating system can claim to be the best with respect to what it sets out to accomplish.

That said, I believe that the Massey Ratings satisfy all of the desirable properties of a rating system. The sophistication of the model and algorithm is beyond any other method I'm aware of. Every feature of my system is based on sound statistical assumptions regarding the nature of sports and games. There are rarely any skewed or highly abnormal results, and the Massey Ratings are highly correlated with the consensus.

My rating model has undergone several revisions. These changes are necessary to improve the quality of my ratings in light of new ideas, gained experience, and access to more historical data with which to refine the method.

What's the difference between "rating", "ranking", and "poll"?

A "ranking" is simply an ordinal number (such as 1st, 2nd, 3rd,...) that indicates a team's placement in a strictly non-quantitative sense. In contrast, a team's "rating" is generally a continuous scale measurement and must be interpreted on a scale by comparing it with other teams' ratings. For example, I can rank three teams as follows: (1) Team A, (2) Team B, (3) Team C. This tells me that according to my ranking criteria, A is better than B, and B is better than C. However, it does not tell me how much better. If ratings are assigned as (A = 9.7, B = 9.5, C = 1.2), then it is easy to see that in fact A and B are quite competitive while C is significantly inferior.

A poll is fundamentally different from a rating. Polls typically result from the tabulation of votes. For example, each ballot in college football's AP poll is one writer's opinion of who should be #1, #2, #3, etc. So a poll is really a composite of many opinions or preferences. In contrast, a computer ranking is obtained from a single "measurement" of how good each team is based on the defining criteria.

Team A beat Team B, so why do your ratings still have B ahead of A?

This situation is usually called an "upset." It is generally impossible to order the teams so as to eliminate all inconsistencies in actual game outcomes. Teams are not evaluated on the basis of one game, in which there is potential for high deviation from typical performance levels. Instead, a team's rating is based on its body of work, in some sense an "average" level of performance over the entire season.
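A three-team cycle shows why some inconsistency is unavoidable. The sketch below, with hypothetical teams and results, tries every possible ordering and counts the games in which the winner ends up below the loser.

```python
from itertools import permutations

def count_violations(order, games):
    """Count games in which the winner sits below the loser in `order`."""
    position = {team: i for i, team in enumerate(order)}
    return sum(1 for winner, loser in games
               if position[winner] > position[loser])

# A beat B, B beat C, and C beat A: a cycle.
games = [("A", "B"), ("B", "C"), ("C", "A")]
best = min(count_violations(order, games) for order in permutations("ABC"))
```

No ordering does better than one violation (best is 1), so at least one of these results has to appear as an upset in any ranking.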

Your ratings stink! Why isn't my team ranked higher?

The implementation of a computer rating algorithm is completely objective. So if the computer gives your team a bad (or good) rating, it shouldn't be taken personally. You have the right to disagree with the computer, but more than likely this is evidence of your own subjectivity. I do not meddle with the algorithm to "fix" the ratings. The model defines certain criteria that determine a team's rating, and the results are published on this web site without any human intervention.

What about predictions?

Ratings are designed to reflect past performance, namely: winning games, winning against good competition, and winning convincingly. As a consequence, the ratings have some ability to predict the outcome of future games.

For many sports, I post predictions of upcoming games and monitor their success. In most cases, I would trust a computer's prediction over a human's. However, while this is often the most popular and entertaining application of computer ratings, it is not my primary purpose.

Predictions are obtained by extrapolating the analysis of past performance to estimate future performance. Usually, the past is a reasonable indicator of what to expect in the future. However, sporting events are to a great extent random, so upsets will occur. Furthermore, computer ratings are ignorant of many important factors such as injury, weather, motivation, and other intangibles. With this in mind, it is not wise to hold unrealistic expectations of the predictions.

The purpose of the Massey rankings is to order teams based on achievement. This objective may occasionally yield some surprising results: for instance, having good teams from weak conferences ranked higher than one might expect. This is not to say that such a team is "better" than all teams below it. It is simply being rewarded for its success at winning the games it has played.

It is incorrect to assume that the Massey model is predicting that the higher ranked team should defeat a lower ranked team. Model predictions can be derived from latent variables, and may not agree with the rankings.

Why do you post three different rating systems?

While the algorithms that produce computer ratings are objective, the choice of the model itself is not. Multiple systems provide the opportunity to compare alternative interpretations of the same data. Although there is general agreement, computer ratings are also quite diverse. The Massey Ratings are my creation, while the Markov and Sauceda models were developed with help from friends of mine.

Are your ratings biased toward Virginia Tech?

I went to graduate school at Virginia Tech, and I am a proud Hokie fan. However, my ratings are completely objective with regard to every team, including Virginia Tech. To the computer, "Virginia Tech" is just a name, and the ratings would be the same if every team were anonymous.

The integrity of my ratings has been verified by the BCS. Historical results show that the Hokies' Massey ranking conforms to those produced by other polls and computer ratings.

How do you generate preseason ratings?

The BCS compliant ratings do not use preseason information, so everyone starts at zero. A team's rating may look funny or fluctuate wildly until there is enough evidence to get a more precise measurement of the team's strength. As games are played, the computer gradually 'learns' and the cream rises to the top.

For the main version, preseason ratings compensate for the lack of data early in a given season. They give the computer a realistic starting point from which to evaluate teams that have played zero or few games. This limits dramatic changes that could be caused by isolated results not buffered by the context of other games.

The effect of the preseason ratings gradually diminishes each week. When every team has played a sufficient number of games to be accurately evaluated based on this year alone, the preseason bias will be phased out.

Preseason ratings are based on an extrapolation of recent years' results, tuned to fit historical trends and regression to the mean. A team's future performance is expected to be consistent with the strength of the program, but sometimes there may be temporary spikes.

Other potentially significant indicators (e.g. returning starters, coaching changes, and recruiting) are ignored. Therefore, preseason ratings should not be taken too seriously.

How can X be ranked so high after losing their last game so badly?

The model assesses each team's body of work, and is not prone to over-reacting to one result. Recognizing the variation in outcomes, some games can be classified as upsets, and blowout margins are not given undue influence. The entire season is constantly re-evaluated, and the quality of each win, or the severity of each loss will be adjusted in light of the opponents' other results.

The BCS version doesn't consider margin of victory, so there's no sense even mentioning how many points X lost by. Schedule strength becomes a dominant criterion. When two highly ranked teams play, the loser is not penalized much if they have already beaten some quality teams.

Why do you have a BCS-compliant version of the ratings?

At the request of the BCS, I developed the BCS ratings that use only the bare minimum of data: winner and location. There are several good reasons for such a system.

It completely eliminates any incentive for the unsportsmanlike conduct of running up the score.
It purely rewards wins and penalizes losses regardless of the score. "You play to win the game."
Occam's Razor: we intuitively should prefer a simpler model. There's a certain elegance to using only the win/loss outcome without regard for how it was obtained.
When many games have been played, a good model should converge to the win-loss model anyway.
It poses an interesting challenge, given the sparsity and variability of the data.

What advantage does a computer ranking have over human polls?

Computer ratings have two main advantages. First, they can deal with an enormous amount of data (hundreds of teams and thousands of games). Second, they can analyze objectively: every team is treated the same.

This latter property is often a two-edged sword because it can cause disagreement with public opinion, which is stoked by the media. The public demands an objective system that plays no favorites and doesn't encourage a team to run up the score. However, they don't always agree with the inevitable consequences of such a system.

True, insufficient data can produce abnormal and flawed results. However, computers have no ego, so a good model will correct itself, and provide remarkable insights long before a human will become aware of (and admit) his mistake. In general, it doesn't take long for computer ratings to overtake a human poll in terms of accuracy and fairness.
