It’s Time for Dodger Baseball! Los Angeles Dodgers Data Analysis

Using Machine Learning & Plotly for the 2018, 2019, and 2020 Seasons

Los Angeles Dodgers Logo — https://upload.wikimedia.org/wikipedia/en/thumb/a/a0/Los_Angeles_Dodgers_logo_%28low_res%29.svg/1200px-Los_Angeles_Dodgers_logo_%28low_res%29.svg.png

“Hello everybody, and a very pleasant good evening to you, wherever you may be!” was the legendary introduction by former Hall of Fame Dodgers broadcaster Vin Scully that I had the pleasure to listen to hundreds of times until his retirement in 2016. And I welcome you, reader, to my Dodgers Data Analysis post! I have been a diehard Dodgers fan for 12 years and knowing that baseball season is in full swing brings so much joy into my life. However, this season is different… yet, special.

As a result of COVID-19, this year’s Major League Baseball season is a 60 game sprint, as opposed to the typical 162 game marathon. This brings some unique and interesting challenges: added pressure, limited time for slumps, and a real sense that every game and every series matters. There are no days off, even on their days off!

Given these unique circumstances, I asked myself the following big question: What are we supposed to expect during a 60 game season? This big question, of course, brings along many smaller questions, such as: What will stat lines look like? Will there be super inflated winning percentages… or losing percentages? Will big-market teams, like the Dodgers, struggle with no fans cheering in the stands? How are managers going to use their pitchers? With a truly once-in-a-lifetime season like this, along with some fascinating rule changes, like the universal designated hitter and expanded playoffs, I decided to take a stab at figuring out what to make of this crazy and fascinating season.

This post will consist primarily of three parts: machine learning, data visualization, and data-driven recommendations. I will summarize all that I’ve worked on, but for anyone interested, feel free to explore everything that is available in my GitHub Repo.

All of the data that I used in this analysis was obtained using either APIs or webscraping. My sources included baseball-reference.com, mlb-data.p.rapidapi.com, and baseballsavant.mlb.com. All pertinent information was collected for the 2018, 2019, and 2020 seasons, with 2020 being up to date as of today.

Pertinent information for this analysis included the following:

  • Box Scores — General information about each game, including stats such as runs scored, runs allowed, day/night game, and winning pitcher
  • Season Hitting Stats — Collective hitting stats for position players over the course of a season.
  • Season Pitching Stats — Collective pitching stats for pitchers over the course of the season.
  • Game Logs — A representation of a player’s performance on a game to game basis.

The main category I did not pursue is defensive stats. However, that is something I would like to analyze in the future.

*Note* This will be a continuously evolving project. I believe that what I’ve worked on so far is merely just the beginning, and I fully plan on writing a follow-up post to this after the Dodgers win the World Series… I mean, after the season ends, providing a full recap and explanation of how the team and players did compared to their performance in 2018 and 2019.

Rather than go through the entire data cleaning process in one fell swoop, I will address my process when I cover each aspect of this project. With all that being said, it’s time to play ball!

I decided that I wanted to find out if I could see into the future or not. So, I created four machine learning models to try and predict the likelihood of the Dodgers winning or losing their next game by training on previous game data, using box scores. The four types of models were:

  • Random Forest
  • XGBoost
  • Naive Bayes
  • K-Nearest Neighbors

Yep — I’m trying to be every bookie in Las Vegas. I jest, but nonetheless I gave it a shot, and the results were… well, bad. Inconclusive. Whatever sciency term you want to use to describe poorness in results. I believe there are fair reasons for this, though, and I will get into those momentarily. But for now, let’s start with the preprocessing.

As mentioned above, the data used for these predictions was the box scores for each game in 2018, 2019, and 2020, to date, which were obtained through webscraping. Once the data was fully obtained, I then feature engineered two new columns, which were called “Next_Game” and a second W/L column based on Next_Game. I will display my code below:

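import pandas as pd

# Scrape the 2020 schedule and results table from Baseball-Reference
boxscore = pd.read_html('https://www.baseball-reference.com/teams/LAD/2020-schedule-scores.shtml')
box_score = boxscore[0]
box_score = box_score.set_index('Gm#')

# Drop unnecessary columns
box_score.drop(['Unnamed: 2', 'Unnamed: 4', 'Inn', 'Rank', 'Time', 'Attendance', 'Streak', 'Orig. Scheduled', 'Save'], axis=1, inplace=True)

# Drop the rows whose index label is 'Gm#' (header rows repeated inside the scraped table)
box_score.drop(['Gm#'], axis=0, inplace=True)

# Keep the first 38 rows; copy so the column assignment below does not warn
box_score = box_score[0:38].copy()

# Next_Game holds the game number of the game that follows each row
next_game = range(2, box_score.shape[0] + 2)
box_score['Next_Game'] = next_game

# Filter out walk-off results
box_score = box_score.loc[box_score['W/L'] != 'W-wo']
box_score = box_score.loc[box_score['W/L'] != 'L-wo']
box_score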

The two lines that build next_game and assign it to box_score['Next_Game'] are the steps taken to create my Next_Game column. Next is the code that creates the second W/L column based on Next_Game. At first, it’s going to look like I just pulled the wins and losses as they are, which is correct. The .join() step is where the magic happens:

# Keep only the W/L column, indexed by game number
next_game_outcome_2020 = box_score.drop(['Date', 'Tm', 'Opp', 'R', 'RA', 'W-L', 'GB', 'Win', 'Loss', 'D/N', 'Next_Game'], axis=1)
next_game_outcome_2020 = next_game_outcome_2020.loc[next_game_outcome_2020['W/L'] != 'W-wo']
next_game_outcome_2020 = next_game_outcome_2020.loc[next_game_outcome_2020['W/L'] != 'L-wo']

# Cast the game numbers to integers so they match the Next_Game column
next_game_outcome_2020.index = next_game_outcome_2020.index.astype('int64')
next_game_outcome_2020

# Look up each row's Next_Game in the outcome table to attach the next game's result
boxscore_2020 = box_score.join(next_game_outcome_2020, on='Next_Game', lsuffix='_df', rsuffix='_other')
boxscore_2020 = boxscore_2020.drop(['Date', 'Tm', 'W-L', 'GB'], axis=1)
boxscore_2020

And with that, my dataframe now looks like this for 2020:

The column “W/L_other” is now one game result ahead of where the Dodgers actually are (due to the .join() method), and that column will become my target in modeling. So, the models are using game data from the previous games to try and predict the likelihood of them winning or losing the next game they play!
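
To make the one-game shift concrete, here is a minimal sketch on a made-up five-game table (the results are invented for illustration, not real Dodgers games). It builds the target with the same Next_Game join described above, and also shows that on gap-free data a plain shift(-1) would give the same column. The shift is an alternative I am assuming would work here, not something taken from the project.

```python
import pandas as pd

# Toy box scores indexed by game number; results are invented for illustration.
toy = pd.DataFrame(
    {"R": [3, 5, 1, 7, 2], "RA": [1, 6, 0, 2, 4], "W/L": ["W", "L", "W", "W", "L"]},
    index=pd.Index([1, 2, 3, 4, 5], name="Gm#"),
)

# Same idea as the post: point each row at the game that follows it.
toy["Next_Game"] = range(2, len(toy) + 2)

# Join each game to the result of the game named in Next_Game.
outcomes = toy[["W/L"]]
joined = toy.join(outcomes, on="Next_Game", lsuffix="_df", rsuffix="_other")

# The last game has no "next game" yet, so its target is missing and gets dropped.
model_rows = joined.dropna(subset=["W/L_other"])

# On gap-free data, shifting the column up by one row gives the same target.
shifted = toy["W/L"].shift(-1)

print(joined[["W/L_df", "Next_Game", "W/L_other"]])
```

The join version has one advantage over the shift: if a game is filtered out, the rows around it still line up by game number instead of by position.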

I did this same process for the box scores of 2018 and 2019 as well, and finally put them all into one, big dataframe using .concat().

all_boxscores = [boxscore_2020, boxscore_2019, boxscore_2018]
boxscore_df = pd.concat(all_boxscores)
boxscore_df

Now that the dataframe has been created, it’s time for cleaning. I won’t go into great detail about how I cleaned my data, since I want to dive deeper into other areas, but I did perform the following steps:

  • Using the .info() method to check datatypes and changing any as needed
  • Using the .isna().sum() method to check and resolve null values
  • Checking the value counts to see if there is any class imbalance
  • Creating dummy variables
  • Dropping all unnecessary columns
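
As a rough illustration of those steps, here is a sketch on a tiny invented table. The column names mirror the box scores, but the values, the choice to fill a missing value with the most common one, and the set of dummy columns are all assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for the combined box scores; values are invented.
df = pd.DataFrame(
    {
        "R": ["3", "5", "1", "7"],           # runs scraped as strings
        "RA": [1, 6, 0, 2],
        "D/N": ["D", "N", "N", None],        # one missing day/night flag
        "Opp": ["SFG", "SFG", "HOU", "HOU"],
        "W/L_other": ["L", "W", "W", "L"],   # next-game result (the target)
    }
)

# 1. Inspect datatypes, then fix a column that was stored as text.
df.info()
df["R"] = df["R"].astype(int)

# 2. Count missing values per column and resolve them (here: most common value).
null_counts = df.isna().sum()
df["D/N"] = df["D/N"].fillna(df["D/N"].mode()[0])

# 3. Check the class balance of the target.
class_balance = df["W/L_other"].value_counts(normalize=True)

# 4 and 5. Separate the target, then turn categorical columns into dummies.
features = pd.get_dummies(df.drop(columns=["W/L_other"]), columns=["D/N", "Opp"])
target = (df["W/L_other"] == "W").astype(int)  # 1 = win, 0 = loss
```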

Now, my data is shiny and clean and ready for modeling. As I mentioned earlier, I ran four different models to try and get the best results possible. I will be displaying the results for the best ensemble method and the best non-ensemble method, which were the XGBoost and the K-Nearest Neighbors models.

*Note* Since these models are updated frequently, which model performs best can change over time: more data is being added, and the features are weighted differently as a season progresses. These results are as of today’s date, so the models you see here versus the models you may see a week from now in my repo could (and probably will) change.

Confusion Matrix and ROC-AUC Curve for XGBoost
Confusion Matrix and ROC-AUC Curve for K-Nearest Neighbors

To my surprise, the K-Nearest Neighbors model has performed the best of the four models on a consistent basis. I believed XGBoost would have been the best since, well, it has a reputation for being great at classifying pretty much anything. But here we are.

The models predicted losses accurately 45% of the time for XGBoost and 52% of the time for KNN, while predicting wins correctly 46% and 57% of the time, respectively. XGBoost has an AUC score of .41, which is… again, pretty bad. In fact, it sits 9 points below the .50 you would expect from flipping a coin. So, yeah… definitely not a Vegas bookie yet. KNN, however, has an AUC score of .54, which puts it 4 points above random chance! Better, but still not very reliable. I think there is one major reason for the low performance: lack of data.
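
For readers who want a feel for what those AUC numbers mean, here is a small sketch that computes AUC by hand as the share of (win, loss) pairs in which the win received the higher predicted score. The labels and scores are invented for illustration, and the helper name auc_by_pairs is my own.

```python
from itertools import product

def auc_by_pairs(y_true, scores):
    """AUC as the share of (win, loss) pairs where the win has the higher score."""
    wins = [s for y, s in zip(y_true, scores) if y == 1]
    losses = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair counts 0.
    total = sum(
        1.0 if w > l else 0.5 if w == l else 0.0
        for w, l in product(wins, losses)
    )
    return total / (len(wins) * len(losses))

# Invented labels (1 = win) and predicted win probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
good = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
constant = [0.5] * 6                      # a model that cannot tell games apart
reversed_scores = [1 - s for s in good]   # a model that ranks games backwards

print(auc_by_pairs(y_true, good))
print(auc_by_pairs(y_true, constant))
print(auc_by_pairs(y_true, reversed_scores))
```

A model with no information lands at .50, and a score below .50 means the ranking is leaning backwards. With a test set this small, a result like .41 or .54 is plausibly just noise around .50.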

Overall, there are just 329 samples (to date) for the machine to process. And with a testing size set at 20%, that means it is training on 263 samples and only testing on 66 samples. This is a problem that’s easy to fix though, as I would just need to scrape more box scores from past seasons. So that will be the first place I start going forward. As the models improve in performance, showing off insights, such as the most important features, will have more validity and can be trusted more.
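
The split arithmetic can be checked in a few lines. This sketch assumes the split was done with scikit-learn's train_test_split, which rounds the test count up and gives the rest to training. It also adds a rough standard error for an accuracy near 50%, which is my own back-of-the-envelope addition.

```python
import math

n_samples = 329
test_size = 0.20

# Round the test count up; the remaining rows go to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Rough standard error of an accuracy estimate near 50% measured on n_test games.
std_err = math.sqrt(0.5 * 0.5 / n_test)

print(n_train, n_test, round(std_err, 3))
```

With only 66 test games, an accuracy estimate near 50% carries a standard error of about 6 percentage points, so small gaps from a coin flip are hard to tell apart from luck.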

I think there are other preprocessing adjustments I can make to improve performance. For example, I initially dropped the “Loss” column, which designates who the losing pitcher was for a game. I did this because I was initially only curious about which pitchers were the most impactful during wins. But re-adding it could be valuable when predicting losses. So, I think that is another step I will take.

I used GridSearch whenever possible to determine the best parameters to use. But, I think there are some other parameters I could explore using in GridSearch that I haven’t explored yet that could perhaps induce better results. As data science students and professionals know, modeling is a continuous series of adjustments on the fly — like a baseball game!
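
Here is a minimal sketch of what a grid search for the KNN model could look like, assuming “GridSearch” refers to scikit-learn's GridSearchCV. The data is synthetic, and the parameter grid is a hypothetical example, not the grid used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the box-score features (329 rows to match the sample count).
X, y = make_classification(n_samples=329, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid; the values are examples only.
param_grid = {
    "n_neighbors": [3, 5, 11, 21],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

# Score each combination by cross-validated ROC AUC on the training data.
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_auc = search.score(X_test, y_test)  # uses the same roc_auc scorer
print(best_params, round(test_auc, 3))
```

Scoring on ROC AUC keeps the search aligned with the metric reported above, rather than the default accuracy.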

Although the models haven’t performed to the standards I was hoping, I think this project’s value is truly derived from the data visualizations I’ve been able to create using the fantastic package, plotly.express. In total, I created 6 functions for 6 different statistics that users can visualize and analyze:

For Pitchers:

  • ERA (Earned Run Average)
  • WHIP (Walks plus Hits per Inning Pitched)

For Position Players:

  • BA (Batting Average)
  • OBP% (On Base Percentage)
  • SLG% (Slugging Percentage)
  • OPS (On Base + Slugging)
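
For anyone newer to these stats, the standard formulas can be written as small functions. This is a sketch for reference; the function names and the sample stat line are invented, and innings are passed as a decimal number rather than in box-score notation.

```python
def era(earned_runs, innings):
    """Earned runs allowed per nine innings."""
    return 9 * earned_runs / innings

def whip(walks, hits, innings):
    """Walks plus hits allowed per inning pitched."""
    return (walks + hits) / innings

def batting_average(hits, at_bats):
    return hits / at_bats

def on_base_pct(hits, walks, hbp, at_bats, sac_flies):
    """Times on base divided by the plate appearances that count toward OBP."""
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

def slugging_pct(singles, doubles, triples, home_runs, at_bats):
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

def ops(obp, slg):
    return obp + slg

# Invented hitter: 50 hits (30 singles, 10 doubles, 2 triples, 8 home runs) in 180 at-bats.
ba = batting_average(50, 180)
obp = on_base_pct(hits=50, walks=20, hbp=2, at_bats=180, sac_flies=3)
slg = slugging_pct(30, 10, 2, 8, 180)
print(round(ba, 3), round(obp, 3), round(slg, 3), round(ops(obp, slg), 3))

# Invented pitcher: 20 earned runs, 15 walks, and 45 hits over 60 innings.
print(era(20, 60), whip(15, 45, 60))
```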

The data for these visualizations was obtained by webscraping the game logs for each player for all of the 2018 and 2019 seasons, and the 2020 season to date. Then, I created the functions for the visuals using an incredibly handy tool called try and except statements. The code for my batting average function is displayed below:

import numpy as np
import plotly.express as px

# game_logs_2018_position, game_logs_2019_position, and game_logs_2020_position
# are the scraped game-log DataFrames, looked up here by player name.

def position_BA(player_name, games):
    # Flat reference line at .250
    fig = px.line(x=list(range(games)), y=.250*np.ones(games))

    try:
        df2018_position = game_logs_2018_position.loc[player_name]
        fig.add_scatter(x=list(range(len(df2018_position.Date))), y=df2018_position.AVG, name='2018')
    except KeyError:
        # No 2018 game logs for this player
        pass

    try:
        df2019_position = game_logs_2019_position.loc[player_name]
        fig.add_scatter(x=list(range(len(df2019_position.Date))), y=df2019_position.AVG, name='2019')
    except KeyError:
        # No 2019 game logs for this player
        pass

    try:
        df2020_position = game_logs_2020_position.loc[player_name]
        fig.add_scatter(x=list(range(len(df2020_position.Date))), y=df2020_position.AVG, name='2020')
    except KeyError:
        # No 2020 game logs for this player
        pass

    fig.update_layout(title='Batting Average — ' + player_name, xaxis_title="Game", yaxis_title="BA", legend_title="Season")
    fig.show()

Essentially, this checks whether a player has game log info for a particular season. If they do, it collects that info and displays it on the graph. If they don’t, it passes on to the next checkpoint, which is the next try statement. This is not a loop in the strict sense: the three try statements simply run one after another, so each season is checked in turn. Then the function displays the full visual, even if the player hasn’t played in all three seasons. This allows for young players and rookies on the roster to still have their performance data be visualized and analyzed.
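
The same check could also be written as an actual loop over the seasons. Here is a sketch of that idea on toy data, assuming the game logs are indexed by player name (as the .loc[player_name] lookups suggest). The helper name seasons_for_player and the sample players are invented for the example.

```python
import pandas as pd

def seasons_for_player(player_name, logs_by_season):
    """Return {season: game log rows} for the seasons in which the player appears."""
    found = {}
    for season, logs in logs_by_season.items():
        try:
            # Passing a list keeps the result a DataFrame, even for a single game.
            found[season] = logs.loc[[player_name]]
        except KeyError:
            # No rows for this player in this season, so move on to the next one.
            continue
    return found

# Toy game logs indexed by player name; the averages are invented.
logs_2019 = pd.DataFrame({"AVG": [0.250, 0.300]}, index=["Veteran", "Veteran"])
logs_2020 = pd.DataFrame(
    {"AVG": [0.200, 0.333, 0.286]}, index=["Rookie", "Veteran", "Rookie"]
)
logs = {"2019": logs_2019, "2020": logs_2020}

rookie = seasons_for_player("Rookie", logs)
veteran = seasons_for_player("Veteran", logs)
print(list(rookie), list(veteran))
```

Each returned frame could then be handed to fig.add_scatter, so adding a fourth season would only mean adding one entry to the dictionary.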

I will show two examples of what these visuals look like below. ERA for the 🐐, Clayton Kershaw, and BA for newest Dodgers superstar, Mookie Betts:

Up to Date Batting Average Stats for Mookie Betts
Up to Date Earned Run Average Stats for Clayton Kershaw

There are a couple of things that I should point out. First, the blue line in each graph represents the MLB average for that particular stat. For batting average, the average is around .250, and for ERA the average is around 4. So let’s just say, both players have been outstanding over the last few seasons! You want to be above the average for BA and below the average for ERA.

Next, make note of the “games” argument in the function. This can be changed however the user wishes, but there are some limitations:

  • Starting pitchers have no more than 12 games available in a season
  • Relievers have no more than 30; players who started and relieved were capped at 25
  • Position players are limited to 60 games

Remember, the design of this project is to imagine a world where the 2018 and 2019 seasons were also 60 games, so that we can appropriately compare performance. Position players can play every game; that’s why they have the full allowance. Relievers, typically, will pitch on average every other day, at most. And starters pitch once every 5th game, meaning the most starts they can make in a 60 game season would be 12.
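
The caps follow directly from the 60-game schedule, as this short sketch shows. The role labels (including “swingman” for pitchers who both started and relieved) and the function names are my own, chosen for illustration.

```python
SEASON_GAMES = 60

def max_appearances(role):
    """Upper bound on appearances in a 60-game season under the assumptions above."""
    if role == "position":
        return SEASON_GAMES        # can play every game
    if role == "starter":
        return SEASON_GAMES // 5   # one start every fifth game
    if role == "reliever":
        return SEASON_GAMES // 2   # at most every other day
    if role == "swingman":
        return 25                  # cap stated in the list above
    raise ValueError(f"unknown role: {role}")

def first_n_games(game_log, role):
    """Keep only the appearances that would fit in a shortened season."""
    return game_log[: max_appearances(role)]

# A starter with 33 starts in a full season keeps only the first 12.
full_season_starts = list(range(1, 34))
print(first_n_games(full_season_starts, "starter"))
```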

If you decide to go into the notebook and explore for yourself, there is a list of pitcher names and player names to choose from for making your visuals. :)

From these visuals, you can analyze any player’s performance from the Dodgers’ 40-man roster and form your own opinions, which is something I think is really cool. I have aspirations of creating dashboards in my notebook so that users can just access them and not have to run the code themselves, with the end goal of having the dashboards hosted on the internet so that anyone in the world can see them. With time and patience, we shall see if it happens.

For anyone reading this who may be a baseball nerd like me, it’s important to acknowledge that the stats available here are some of the simpler, yet popular, ones that can be used for analysis. Baseball has been at the forefront of using data for game strategy and player development for many years now, and data scientists & scouts for teams use many different types of statistics, such as situational hitting just to name one, when making strategy and development decisions. Therefore, the recommendations I will be presenting are meant to be generic and open to interpretation. If a conversation is started based on what I say, good!

Starting with position players, I think one way this information could be useful is to help determine batting orders. For example, if you have a player with a low batting average, but slugs really well, you can think of having them hitting 5th or 6th. They’re still in a spot where they can drive in runs and do damage, but they don’t have to feel extra pressure for getting on base, whether it be by a hit or a walk. Now, if a player has a high batting average and high on base percentage, but isn’t hitting them for extra bases or home runs, you can think about hitting them in the first or second spot, especially if they have some speed. You want these guys to get on base often from hits and walks, while putting them in a great position to steal bases. Lastly, if you have a player that has a good batting average, slugs well, and is also good at drawing walks, then they should probably be hitting third or fourth. Their ability to slug means they can drive in the guys at the top of the order, while their ability to get on base means the batters hitting below them have the ability to drive them in.

For pitchers, it’s a little tougher to create game-time recommendations since there are only two statistics. However, I believe that a pitcher’s performance in ERA and WHIP can create maybe the most important thing a manager needs during a game: trust. If a starter is not allowing runs or baserunners, then you can trust them to stay in the game longer and pitch more innings. For relievers, whoever is excelling in these two stats may be the one who pitches in a high-leverage situation. A high-leverage example would be if your team is up by one run in the 8th inning, there are 2 outs, and the bases are loaded. You need one guy to get one out. Who’s gonna be the guy? You can’t go to your closer yet because it’s not a save situation, but whoever has the best combo of not allowing runs and baserunners would be the best person for the job.

As I mentioned in the beginning, I truly believe my work so far is just the beginning of something that could become magical. I look forward to continuing my work on this, and I think the possibilities on where it could go are seemingly endless. Some ideas I have in mind are:

  • Including more seasons
  • Including more (or all) MLB teams
  • Including defensive statistics
  • Exploring advanced statistics, such as xBA, xSLG, Hard Hit %, K%
  • Types of pitches thrown, average velocities, pitching quadrants
  • More complex prediction models, such as season W/L predictions or predicting player performance projections before a season starts

And so many more.

I love baseball. I love sports statistics. And I love the Dodgers.

I look forward to continuing making this project even better and working on more projects in the future. Thank you for reading!

“So, pull up a chair because it’s time for Dodger baseball!” — Vin Scully

Go Dodgers!

Github Repo
