It’s Time for Dodger Baseball! Los Angeles Dodgers Data Analysis

Using Machine Learning & Plotly for the 2018, 2019, and 2020 Seasons

Los Angeles Dodgers Logo — https://upload.wikimedia.org/wikipedia/en/thumb/a/a0/Los_Angeles_Dodgers_logo_%28low_res%29.svg/1200px-Los_Angeles_Dodgers_logo_%28low_res%29.svg.png

Part 1 — Data Obtainment

All of the data used in this analysis was obtained through either APIs or web scraping. My sources were baseball-reference.com, mlb-data.p.rapidapi.com, and baseballsavant.mlb.com. I collected the pertinent data for the 2018, 2019, and 2020 seasons, with 2020 current as of the time of writing.

  • Season Hitting Stats — Collective hitting stats for position players over the course of a season.
  • Season Pitching Stats — Collective pitching stats for pitchers over the course of the season.
  • Game Logs — A representation of a player’s performance on a game to game basis.

Part 2 — Machine Learning

I wanted to find out whether I could see into the future, so I built four machine learning models to predict whether the Dodgers would win or lose their next game, training on box scores from previous games. The four types of models were:

  • XGBoost
  • Naive Bayes
  • K-Nearest Neighbors
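
As a rough illustration of what one of these classifiers does, here is a minimal K-Nearest Neighbors vote in plain Python. It is a sketch, not the code used in this project: the two features (runs scored and runs allowed in the previous game), the sample rows, and the helper name knn_predict are all made up for the example.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k closest training rows."""
    # Euclidean distance from the query to every training row
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    # Keep the k nearest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features: (runs scored, runs allowed) in the previous game
train_X = [(7, 2), (5, 1), (6, 3), (1, 4), (2, 6), (0, 3)]
# Hypothetical target: result of the following game
train_y = ["W", "W", "W", "L", "L", "L"]

prediction = knn_predict(train_X, train_y, query=(6, 2), k=3)
print(prediction)
```

In practice a library implementation would normally be used instead; the point here is only that the prediction comes from the labels of the most similar past games.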

Machine Learning Preprocessing

As mentioned above, the data used for these predictions was the box scores for each game in 2018, 2019, and 2020 to date, obtained through web scraping. Once I had the data, I engineered two new columns: “Next_Game”, and a second W/L column holding the result of that next game. My code is below:

import pandas as pd

# Scrape the 2020 schedule and results table
boxscore = pd.read_html('https://www.baseball-reference.com/teams/LAD/2020-schedule-scores.shtml')
box_score = boxscore[0]
box_score = box_score.set_index('Gm#')

# Drop unnecessary columns
box_score.drop(['Unnamed: 2', 'Unnamed: 4', 'Inn', 'Rank', 'Time', 'Attendance', 'Streak', 'Orig. Scheduled', 'Save'], axis=1, inplace=True)
# Drop the repeated header rows embedded in the table
box_score.drop(['Gm#'], axis=0, inplace=True)
# Keep the first 38 rows (copy so the new column can be assigned safely)
box_score = box_score[0:38].copy()

# Next_Game holds the game number of the following game
next_game = range(2, box_score.shape[0] + 2)
box_score['Next_Game'] = next_game
box_score = box_score.loc[box_score['W/L'] != 'W-wo']
box_score = box_score.loc[box_score['W/L'] != 'L-wo']
box_score

# Build a frame containing only each game's W/L result, indexed by game number
next_game_outcome_2020 = box_score.drop(['Date', 'Tm', 'Opp', 'R', 'RA', 'W-L', 'GB', 'Win', 'Loss', 'D/N', 'Next_Game'], axis=1)
next_game_outcome_2020 = next_game_outcome_2020.loc[next_game_outcome_2020['W/L'] != 'W-wo']
next_game_outcome_2020 = next_game_outcome_2020.loc[next_game_outcome_2020['W/L'] != 'L-wo']
next_game_outcome_2020.index = next_game_outcome_2020.index.astype('int64')
next_game_outcome_2020

# Join on Next_Game: W/L_df is this game's result, W/L_other is the next game's result
boxscore_2020 = box_score.join(next_game_outcome_2020, on='Next_Game', lsuffix='_df', rsuffix='_other')
boxscore_2020 = boxscore_2020.drop(['Date', 'Tm', 'W-L', 'GB'], axis=1)
boxscore_2020

# Combine the three seasons into one DataFrame
all_boxscores = [boxscore_2020, boxscore_2019, boxscore_2018]
boxscore_df = pd.concat(all_boxscores)
boxscore_df
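
The join above lines each game up with the result of the game that follows it. A shorter way to build that kind of next-game label, shown here on a small made-up frame, is shift(-1). The column name Next_W/L and the sample rows are assumptions for the example, not the project's data.

```python
import pandas as pd

# Toy box scores (made-up rows), indexed by game number like 'Gm#'
toy = pd.DataFrame(
    {"W/L": ["W", "L", "W", "W"], "R": [5, 2, 7, 4], "RA": [3, 6, 1, 2]},
    index=pd.Index([1, 2, 3, 4], name="Gm#"),
)

# shift(-1) moves each result up one row, so every game carries
# the outcome of the game that follows it
toy["Next_W/L"] = toy["W/L"].shift(-1)

# The last game has no following game yet, so its label is missing; drop it
labeled = toy.dropna(subset=["Next_W/L"])
print(labeled)
```

Unlike a join on the game number, shift pairs rows by position, so the frame should already be sorted and filtered before shifting.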
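
With the combined box scores in hand, the remaining preprocessing steps were:

  • Checking for null values with the .isna().sum() method and resolving any that were found
  • Checking the value counts of the target to see if there is any class imbalance
  • Creating dummy variables for the categorical columns
  • Dropping all unnecessary columns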
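
The steps listed above could look roughly like this in pandas. The frame is made up for illustration, and the column names (Opp, D/N, Win, Next_W/L) stand in for whatever the real box score frame contains; dropping incomplete rows is just one possible way to resolve nulls.

```python
import pandas as pd

# Made-up rows standing in for the combined box scores
df = pd.DataFrame({
    "Opp": ["SFG", "SDP", "SFG", "COL", "ARI"],
    "D/N": ["N", "D", "N", None, "N"],
    "R": [5, 2, 7, 4, 3],
    "Win": ["Pitcher A", "Pitcher B", "Pitcher C", "Pitcher D", "Pitcher E"],
    "Next_W/L": ["L", "W", "W", "L", "W"],
})

# 1. Count nulls per column, then resolve them (here: drop incomplete rows)
null_counts = df.isna().sum()
df = df.dropna()

# 2. Check the class balance of the target
class_balance = df["Next_W/L"].value_counts(normalize=True)

# 3. Create dummy variables for the categorical columns
encoded = pd.get_dummies(df, columns=["Opp", "D/N"])

# 4. Drop columns that are not used as features
features = encoded.drop(columns=["Win", "Next_W/L"])
target = (encoded["Next_W/L"] == "W").astype(int)

print(null_counts, class_balance, features.columns.tolist(), sep="\n")
```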
Confusion Matrix and ROC-AUC Curve for XGBoost
Confusion Matrix and ROC-AUC Curve for K-Nearest Neighbors
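
For readers less familiar with these two diagnostics, the sketch below computes a confusion matrix and the area under the ROC curve by hand in plain Python. The labels and predicted probabilities are invented for the example and are not results from the models in this article.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = win, 0 = loss."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def roc_auc(y_true, scores):
    """AUC as the share of (win, loss) pairs where the win gets the higher score."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a win and a loss counts as half a correct ordering
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return correct / (len(pos) * len(neg))

# Invented outcomes (1 = win) and predicted win probabilities
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.8, 0.3, 0.6, 0.4, 0.5, 0.2]

# Turn probabilities into predictions with a 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in scores]

counts = confusion_counts(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(counts, round(auc, 3))
```

An AUC of 0.5 corresponds to guessing at random, which is the useful baseline when judging a win/loss model.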

Part 3 — Data Visualization Using Plotly

Although the models didn’t perform as well as I had hoped, I think this project’s real value comes from the data visualizations I was able to create with the fantastic plotly.express package. In total, I created 6 functions for 6 different statistics that users can visualize and analyze:

  • WHIP (Walks plus Hits per Inning Pitched)
  • OBP (On-Base Percentage)
  • SLG (Slugging Percentage)
  • OPS (On-Base Plus Slugging)
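
For reference, the stats listed above follow standard formulas. Here is a small sketch with invented season totals; the function and argument names are my own for this example and do not come from the project code.

```python
def whip(walks, hits, innings_pitched):
    # innings_pitched should be a true decimal (6 1/3 innings = 6.333...), not "6.1"
    return (walks + hits) / innings_pitched

def obp(hits, walks, hit_by_pitch, at_bats, sac_flies):
    # Times on base divided by plate appearances that count toward OBP
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)

def slg(singles, doubles, triples, home_runs, at_bats):
    # Total bases per at-bat
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

def ops(on_base, slugging):
    return on_base + slugging

# Invented hitter: 500 AB, 150 H (90 1B, 35 2B, 5 3B, 20 HR), 60 BB, 5 HBP, 5 SF
hitter_obp = obp(hits=150, walks=60, hit_by_pitch=5, at_bats=500, sac_flies=5)
hitter_slg = slg(singles=90, doubles=35, triples=5, home_runs=20, at_bats=500)
hitter_ops = ops(hitter_obp, hitter_slg)

# Invented pitcher: 40 BB and 150 H over 190 innings
pitcher_whip = whip(walks=40, hits=150, innings_pitched=190)

print(round(hitter_obp, 3), round(hitter_slg, 3), round(hitter_ops, 3), pitcher_whip)
```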
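import numpy as np
import plotly.express as px

# Assumes game_logs_2018_position, game_logs_2019_position, and game_logs_2020_position
# are DataFrames of game logs indexed by player name.
def position_BA(player_name, games):
    # Reference line at a .250 batting average across the requested number of games
    fig = px.line(x=list(range(games)), y=.250 * np.ones(games))

    # Add one trace per season; a season with no usable logs for the player is skipped
    try:
        df2018_position = game_logs_2018_position.loc[player_name]
        fig.add_scatter(x=list(range(len(df2018_position.Date))), y=df2018_position.AVG, name='2018')
    except Exception:
        pass

    try:
        df2019_position = game_logs_2019_position.loc[player_name]
        fig.add_scatter(x=list(range(len(df2019_position.Date))), y=df2019_position.AVG, name='2019')
    except Exception:
        pass

    try:
        df2020_position = game_logs_2020_position.loc[player_name]
        fig.add_scatter(x=list(range(len(df2020_position.Date))), y=df2020_position.AVG, name='2020')
    except Exception:
        pass

    fig.update_layout(title='Batting Average — ' + player_name, xaxis_title="Game", yaxis_title="BA", legend_title="Season")

    fig.show()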
Up to Date Batting Average Stats for Mookie Betts
Up to Date Earned Run Average Stats for Clayton Kershaw
  • Relievers are limited to 30 games; pitchers who both started and relieved are capped at 25
  • Position players are limited to 60 games

Part 4 — Data Driven Recommendations & Future Works

For any fellow baseball nerds reading this, it’s important to acknowledge that the stats available here are some of the simpler, yet more popular, ones used for analysis. Baseball has been at the forefront of using data for game strategy and player development for many years, and teams’ data scientists and scouts draw on many other kinds of statistics, situational hitting to name just one, when making strategy and development decisions. The recommendations I present are therefore meant to be generic and open to interpretation. If what I say starts a conversation, good!

  • Including more (or all) MLB teams
  • Including defensive statistics
  • Exploring advanced statistics, such as xBA, xSLG, Hard Hit %, and K%
  • Types of pitches thrown, average velocities, pitching quadrants
  • More complex prediction models, such as season W/L predictions or predicting player performance projections before a season starts

