Fun with Correlations
Correlations are a basic tool in data science. But sometimes the numbers can be deceiving.
Today’s topic is a familiar idea: correlations. A correlation is a simple number between -1 and 1 that tells you how strongly two sets of data move together. As one variable moves, is there a pattern to how the other moves as well? Do they move together (positively) or in opposite directions (negatively)? Or maybe there is no relationship at all (0.0)? I thought it would be fun to illustrate with an example that combines data from two of my favorite things: Star Wars and the sport of baseball. I’ve collected the win totals of each major league baseball team in each year when a major Star Wars film was released. I’ve also collected the film review ratings from an online rating site. Let’s see if there are any correlations between certain teams winning more games and having a good Star Wars movie in that year.
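For readers who want to see where that number comes from, here is a minimal sketch of one common definition, the Pearson correlation coefficient. That choice is an assumption on my part: this post does not say which coefficient the analysis tool reports. The `pearson` helper and the short lists are made up for illustration and are not the baseball or movie data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance-like sum: positive when the two series move together.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    # Dividing by the spread of each series scales the result into [-1, 1].
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative numbers, NOT the data used in this post.
ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]    # moves with ratings
falling = [10.0, 8.0, 6.0, 4.0, 2.0]   # moves against ratings

r_up = pearson(ratings, rising)
r_down = pearson(ratings, falling)
print(r_up, r_down)
```

A perfectly rising straight line gives a value of 1 and a perfectly falling one gives -1. Real data lands somewhere in between.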
Let’s start with positive correlations. A positive correlation means a team tends to win more games in years when that year’s Star Wars film is more highly reviewed. The image below from Altair HyperStudy shows the sorted correlation coefficients on the right side. The left side shows scatter plots for the four teams with the highest correlations. Note that the scatter points generally follow a rising trend, like the line f(x) = x. Fans of the Orioles and Altair’s hometown Tigers should hope for winning seasons if they want better Star Wars movies!
We can show the same kind of results for the most negative correlations. Here, fans of teams like the Diamondbacks or Mariners can take solace that even if their team loses, at least the movie should be good.
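The sorted list of coefficients can be reproduced in a few lines of code. This is only a sketch of the idea, not how HyperStudy does it: the `rank_by_correlation` helper is hypothetical, the coefficient is assumed to be Pearson's, and the team names and numbers are placeholders rather than the real win totals or ratings.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(target, columns):
    """Return (name, r) pairs sorted from most positive to most negative."""
    scores = [(name, pearson(values, target)) for name, values in columns.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder data, NOT the real ratings or win totals.
ratings = [60, 90, 75, 50, 85]          # one review score per film year
wins = {
    "Team A": [70, 95, 80, 65, 90],     # rises with the ratings
    "Team B": [90, 62, 78, 99, 70],     # falls as the ratings rise
    "Team C": [81, 80, 70, 75, 72],     # no clear pattern
}

ranking = rank_by_correlation(ratings, wins)
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```

The top of the list holds the most positive correlations and the bottom holds the most negative ones. These are the two ends of the ranking discussed above.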
But wait, are any of these relationships causal? Can a baseball team’s record influence the quality of a movie? Of course not. Even where the numbers line up, there is no cause-and-effect relationship behind them. That’s why people always say: “There is a difference between correlation and causation”.
Investigating correlations is a vitally important part of data analytics, but this silly example shows that a strong correlation alone is rarely enough to draw a conclusion. The analyst should use correlations to find trends in the data, but must also explain why each correlation makes sense, if it does at all.
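One way to see why a strong correlation can appear with no cause behind it is to correlate pure noise. The sketch below generates 30 unrelated random series (one per major league team) and compares each against a random "ratings" series. The number of films (11), the seed, and the `noise_correlations` helper are all assumptions for illustration, and nothing here uses the real data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def noise_correlations(n_series, n_points, seed):
    """Correlate one random target against n_series unrelated random series."""
    rng = random.Random(seed)
    target = [rng.random() for _ in range(n_points)]
    return [
        pearson(target, [rng.random() for _ in range(n_points)])
        for _ in range(n_series)
    ]


N_TEAMS = 30   # one series per major league team
N_FILMS = 11   # assumed number of data points, for illustration only

results = noise_correlations(N_TEAMS, N_FILMS, seed=4)
strongest = max(results, key=abs)
print(f"strongest correlation among {N_TEAMS} unrelated series: {strongest:+.2f}")
```

With only a handful of data points and many series to compare, some of the random series will usually look noticeably correlated with the target by chance alone. That is why the ends of a sorted list deserve extra skepticism.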
May the fourth be with you.