
What Do You Mean by Average?

Describing something as average is typically innocuous, but sometimes it can be deceiving. Look up the definition of average and you'll see "a typical amount," or "common, ordinary." When it comes to numbers and data, it is also tied to what is called the arithmetic mean (sometimes just mean for short) -- which is adding up a bunch of values and dividing by the number of values you have. The key questions here: When are these ideas in alignment, and when do they diverge?

To get to the bottom of this, at least from a statistical point of view, let's first talk about resistant statistics -- which are summarizations of data that are not highly influenced by individual values. Let's also have a quick reminder of the median, the value that splits the data set into an upper and a lower half when the data is ordered. The mean of a data set is not resistant to extreme values, while the median is, and we'll look into why this is the case and what it has to do with fantasy football and sports in general. Take a look at the table below for what you'd see in the box score for Lamar Miller in Week 8 and Melvin Gordon in Week 6, both from this past season.
Player          Week   Rushing Yards   Carries   Yards per Carry (YPC)
Melvin Gordon   6      132             18        7.33
Lamar Miller    8      133             18        7.39
That table makes it seem like each player had pretty much exactly the same game, and in terms of fantasy points from rushing, that is true. We see that Miller averaged 7.39 yards per carry and Gordon averaged 7.33. Now go back to how we described average at the beginning: as the typical amount. Does each of these YPC values represent the typical amount for that player's game? Spoiler alert: Nope. Take a look at the distribution of each player's yards over his 18 carries. Each player's mean and median are also on the plot.

What seems to be the difference between Miller's and Gordon's yardage? As noted, each player has essentially the same mean, but Gordon's median is much higher. We also see Miller had one long run of 58 yards. That one run is why his YPC is so high compared to his median, and why it matches Gordon's; it shows that the mean (YPC) is not resistant to values that are extreme compared to the rest of the data. Interpreting the median is straightforward for each player: Half the time Gordon carried the ball for at least 7 yards, while half of the time Miller carried for under 3.
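To see how much that single run moves the mean, here is a minimal Python sketch that uses only the numbers quoted above (133 yards on 18 carries, one of them a 58-yard run). The variable names are my own, and since the individual carries aren't listed here, it works from the totals alone.

```python
# Box score totals quoted in the post for Lamar Miller, Week 8.
total_yards = 133
carries = 18
long_run = 58  # the single longest carry mentioned above

# Yards per carry is just the arithmetic mean: total / count.
ypc = total_yards / carries

# Recompute the mean with the one long run left out.
ypc_without_long_run = (total_yards - long_run) / (carries - 1)

print(f"YPC with the 58-yard run:    {ypc:.2f}")                   # 7.39
print(f"YPC without the 58-yard run: {ypc_without_long_run:.2f}")  # 4.41
```

One carry out of 18 accounts for roughly three yards of the per-carry average. Dropping that same carry could move the median by no more than the gap between the two middle carries.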

So why does this matter? You may say, "Hey, I'll take that 58-yard gain. So what's the big deal?" Sure, I would take a 50+ yard rush as well. However, context is important. In this case, which value would you say is more representative of the typical amount in Lamar Miller's day rushing: 7.39 yards (the mean, YPC) or 2.5 (the median)? Talking about their rushing production through that lens makes things look a lot different than box scores and YPC do. Let's say each running back is in the same 3rd and 3 situation. We just concluded that half of the time Miller rushed, he was held under 3 yards. What about the percentage of Gordon's carries under 3 yards that game? That was 16.7%, meaning the other 83.3% of his carries went for 3 or more yards. We often hear announcers and fans say things like "He was averaging 7 yards per carry, why wouldn't they run it!" Based on the median for Lamar Miller's Week 8, it may be clear that's not the best choice. Obviously play-callers weigh a bunch of circumstances when they make their choices, but the main point is that relying on one single summarizing statistic -- in this case the mean -- is often misleading.

"How misleading can it be, Jerome?" is probably what you're screaming at this moment. "I want to test it out!", you exclaim. Luckily for you, I made another interactive visual that can help show how resistant the median is compared to the mean and how sample size has an effect on this. Below you can set a sample size for the number of carries you want to test, then click the button below that to generate a random sample of yardages. The sample is drawn from every running play of the 2018 regular season. Once that pops up, just click on the plot to add a new point of that yardage to the data and see how that changes both the mean and the median. The slider will help zoom in and out if you want, and the "Clear Added Points" button will remove what you added (Duh). The default number of carries is set at 20, which can be taken to represent one game. Add a 90-yard carry to see how much that changes the mean and the median, respectively. Change the number of Rush Attempts to something like 250 to represent a season.
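If the interactive plot isn't handy, the same experiment can be roughly sketched in a few lines of Python. This is only an approximation of the idea, not the code behind the visual: the yardage pool below is made up (the real tool samples actual 2018 rushing plays), and the function names `shift_after_adding` and `simulate` are my own.

```python
import random
import statistics

# HYPOTHETICAL yardage pool: made-up values standing in for the real
# 2018 rushing plays, which are not reproduced here.
HYPOTHETICAL_POOL = [-2, -1, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3,
                     4, 4, 5, 5, 6, 7, 9, 12, 15, 22]


def shift_after_adding(sample, new_value):
    """Return (mean shift, median shift) caused by appending new_value."""
    extended = sample + [new_value]
    mean_shift = statistics.mean(extended) - statistics.mean(sample)
    median_shift = statistics.median(extended) - statistics.median(sample)
    return mean_shift, median_shift


def simulate(n_carries, new_value=90, seed=0):
    """Draw n_carries yardages with replacement, then add one long run."""
    rng = random.Random(seed)
    sample = rng.choices(HYPOTHETICAL_POOL, k=n_carries)
    return shift_after_adding(sample, new_value)


game_mean_shift, game_median_shift = simulate(20)        # roughly one game
season_mean_shift, season_median_shift = simulate(250)   # roughly one season

print(f"20 carries:  mean moves {game_mean_shift:.2f}, "
      f"median moves {game_median_shift:.2f}")
print(f"250 carries: mean moves {season_mean_shift:.2f}, "
      f"median moves {season_median_shift:.2f}")
```

Adding one value x to n carries with mean m moves the mean by (x - m) / (n + 1), so the same 90-yard run matters far less over 250 carries than over 20. The median can only slide to a neighboring value in the sorted list.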

Hopefully playing around with this has shed some light on how vulnerable YPC, and any mean, is to extreme values. Now, is the median always better than the mean? Not necessarily. Keep in mind what each measures and what information is really needed to calculate it. To find the mean, all you need is the total and the number of values, and adding one point will probably change it. To find the median, you need the whole data set, though adding a value might not change it. The conclusion here is that it's better to know both of these instead of just one. So next time someone is trying to make a big point using a player's YPC, Points Per Game, or any "something-per-something", ask if they know the median value. When they reply no, say that since the mean isn't a resistant statistic and is heavily influenced by extreme values, you can't buy what they are selling.
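The difference in what each statistic needs can be sketched as follows. This is a minimal illustration, not anything from the original analysis: the helper `updated_mean` is my own, the extra 4-yard carry is hypothetical, and the short list of carries used for the median is made up.

```python
import statistics


def updated_mean(old_mean, old_count, new_value):
    """Update a mean using only the old mean, the count, and the new value."""
    return (old_mean * old_count + new_value) / (old_count + 1)


# The mean can be updated from summary numbers alone.
# 133 yards on 18 carries is the Lamar Miller line quoted earlier;
# the extra 4-yard carry is HYPOTHETICAL.
new_ypc = updated_mean(133 / 18, 18, 4)
print(f"YPC after one more 4-yard carry: {new_ypc:.2f}")  # 7.21

# The median needs every individual value. These carries are MADE UP.
hypothetical_carries = [1, 2, 2, 3, 5, 7, 40]
before = statistics.median(hypothetical_carries)
after = statistics.median(hypothetical_carries + [3])
print(f"Median before: {before}, after adding a 3-yard carry: {after}")
```

Here the mean moves even though the new carry is unremarkable, while the median stays at 3 yards.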

As always, let me know what you think by email or Twitter. If the interactive plot isn't working well on your device, try getting to it directly using the link below. Hope you enjoyed this venture into statistics! And if I just tricked you into learning more about stats, then sorry, not sorry.
