Which MLB Players Are Most Similar?

One of my favorite aspects of baseball has always been its grounding in statistical reality - while every player has their own style, everybody is shooting for the same things: Hits, Runs, and RBIs. What I find particularly interesting is that there can be an enormous difference in what kinds of players achieve the same statistical milestones. Sure, Giancarlo Stanton hit 39 home runs last season - he’s a beast. But did you know that all 190 pounds of Francisco Lindor managed the same feat?

In this vein, I thought it might be fun to create a simple model that lets us see which players in baseball are most similar, based on their overall offensive production, with no bias for position/age/body type/etc. Player Similarity Scores, first introduced by Bill James, are a well-known concept, but most of those models deal with career statistics and are determined with a point scoring system, not an algorithm. To my knowledge, baseball doesn’t have an easily referenced “in season” set of player similarity scores - so I set out to make one!

Modeling Hitting Data

To begin with, I retrieved a CSV export of player hitting data from Fangraphs - just standard hitting numbers. I then fed this data into a Jupyter notebook and worked through the steps below to calculate every player’s closest matches!

Loading and Rescaling the Data

To start, we’ll load the columns that we want to compare players by. I’m going with straightforward offensive production numbers for now, but I can totally see this analysis getting even deeper if one were to feed in some more advanced statistics.

import pandas as p

hitting = p.read_csv("standard hitting.csv")[['AB', 'PA', 'H', '1B', '2B', 
		  '3B', 'HR', 'R', 'RBI', 'BB', 'SO', 'SB', 'AVG']]

It’s important to rescale each column of the data so that differences in scale in the initial data (e.g. 5-20 home runs vs. 0.200-0.400 batting averages) don’t erroneously skew our distances between points.

from sklearn.preprocessing import StandardScaler  
import numpy as np

scaler = StandardScaler()  
scaled_hitting = scaler.fit_transform(hitting) 
np_hitting = np.array(scaled_hitting)
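
To make the rescaling point concrete, here is a small sketch on made-up numbers (the three hitters and their stats are hypothetical, not from the Fangraphs export). Before scaling, home runs dominate the distance and batting average barely registers; after scaling, both columns carry comparable weight.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns are HR and AVG for three made-up hitters.
toy = np.array([
    [40.0, 0.250],   # hitter A
    [38.0, 0.320],   # hitter B: similar power, much better average
    [20.0, 0.251],   # hitter C: half the power, same average as A
])

def dist(a, b):
    """Euclidean distance between two stat lines."""
    return float(np.linalg.norm(a - b))

# Raw distances: the 70-point gap in AVG between A and B is almost invisible.
raw_ab = dist(toy[0], toy[1])
raw_ac = dist(toy[0], toy[2])

# After standardizing each column to mean 0 and unit variance,
# the AVG gap counts about as much as the HR gap.
scaled = StandardScaler().fit_transform(toy)
scaled_ab = dist(scaled[0], scaled[1])
scaled_ac = dist(scaled[0], scaled[2])

print(f"raw:    A-B = {raw_ab:.3f}, A-C = {raw_ac:.3f}")
print(f"scaled: A-B = {scaled_ab:.3f}, A-C = {scaled_ac:.3f}")
```

On the raw numbers A looks roughly ten times closer to B than to C; once scaled, the two distances are in the same ballpark.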

Building the Nearest Neighbors Model

To obtain the closest matches for any given player, I decided to use a Nearest Neighbors model, which finds the points that sit closest to each point in vector space. Distance calculations like these also underpin classification and clustering algorithms such as k-nearest neighbors and k-means, but we’ll just be using the distances themselves to get each player’s three nearest matches.

from sklearn.neighbors import NearestNeighbors

# Ask for 4 neighbors: each player's closest point is himself, leaving 3 real matches
nbrs = NearestNeighbors(n_neighbors=4, algorithm='ball_tree').fit(np_hitting)
distances, indices = nbrs.kneighbors(np_hitting)

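The indices array will now hold the row numbers of the nearest neighbors for each row (player) in our dataset, and the distances array holds the matching distance to each of those neighbors. Because we queried the model with the same points it was fit on, the first neighbor in each row is the player himself, at a distance of zero.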
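
Here is a minimal sketch of what those two arrays look like, using five made-up points rather than the real hitting data (the `toy_` names are just for illustration).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy points, one row per "player", already on a common scale.
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [1.0, 1.2],
    [5.0, 5.0],
])

model = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(points)
toy_distances, toy_indices = model.kneighbors(points)

# Both arrays have one row per point and one column per requested neighbor.
# With no duplicate points, column 0 is the point itself at distance 0.
self_matches = toy_indices[:, 0]
nearest_other = toy_indices[:, 1]

print(toy_indices)
print(toy_distances.round(3))
```

Each row of `toy_indices` is sorted from closest to farthest, which is why slicing off the first column leaves only the real matches.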


Output Player Listings

Read in the entire hitting dataframe again so that we can get the names of each player (right now we know that row 13’s nearest neighbors are 71, 8, and 102, but that doesn’t do us much good!), then create a dictionary of the nearest neighbors.

hitting_full = p.read_csv("standard hitting.csv")

players = {}

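for result in indices:
    # result[0] is the player's own row; result[1:4] are the three closest other players
    closest = []
    for player in result[1:4]:
        # column 0 of the full export is assumed to hold the player's name
        closest.append(hitting_full.iloc[player, 0])
    players[hitting_full.iloc[result[0], 0]] = closest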
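
One caveat with the loop above: it trusts that the first neighbor is always the player himself, which can break if two players have identical stat lines. A slightly more defensive sketch, on a made-up four-player table (names and numbers are hypothetical), keys each entry by row position, filters the player's own row out explicitly, and keeps the distance alongside each name.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the hitting export: name first, then two stats.
toy_full = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "HR": [30, 31, 5, 6],
    "SB": [2, 3, 40, 38],
})
features = toy_full[["HR", "SB"]].to_numpy(dtype=float)

model = NearestNeighbors(n_neighbors=3).fit(features)
toy_distances, toy_indices = model.kneighbors(features)

matches = {}
for row, (dists, idxs) in enumerate(zip(toy_distances, toy_indices)):
    # Drop the row's own index explicitly rather than assuming it comes first,
    # then keep the two closest remaining players with their distances.
    others = [(toy_full.iloc[i, 0], round(float(d), 3))
              for d, i in zip(dists, idxs) if i != row][:2]
    matches[toy_full.iloc[row, 0]] = others

print(matches["Player A"])
```

Keeping the distances around is handy later if you want to show how close a match really is, not just who it is.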

The last thing we’ll do is write the results to a CSV that we can take wherever we like to visualize the data!

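# One row per player (the dictionary keys become the index), one column per match
savable_players = p.DataFrame.from_dict(players, orient='index',
                                        columns=['similar1', 'similar2', 'similar3'])
savable_players.to_csv("most_similar_players.csv")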
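
For a feel of what ends up in that file, here is a small sketch that turns a made-up dictionary of matches into a table and writes it to an in-memory buffer instead of disk. The player names and the `player` index label are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical dictionary in the same shape as the one built above:
# player name -> list of three most similar players.
toy_players = {
    "Player A": ["Player B", "Player D", "Player C"],
    "Player B": ["Player A", "Player D", "Player C"],
}

# orient="index" makes each key a row; columns names the three list positions.
table = pd.DataFrame.from_dict(toy_players, orient="index",
                               columns=["similar1", "similar2", "similar3"])

# Write to a string buffer and read it back to check the round trip.
buffer = io.StringIO()
table.to_csv(buffer, index_label="player")
csv_text = buffer.getvalue()
round_trip = pd.read_csv(io.StringIO(csv_text), index_col="player")

print(csv_text)
```

Giving the index a label means the player-name column has a proper header when the CSV is opened in another tool.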



Here is MVP front-runner Cody Bellinger with some other fantastic baseball players. I thought this was a cool chart, but was craving an interactive aspect... so I kept digging.

Similar Players to Cody Bellinger


Searchable Similar Players Table

Shiny Application

Throwing together a simple web app, I decided to merge my nice faceted chart with the datatable view to form a one-stop shop for each player and his closest matches. Have fun playing around - I’ve found that a lot of the matches feel very true to life!