It’s Time for Dodger Baseball! Los Angeles Dodgers Data Analysis

Part 1 — Data Obtainment

  • Box Scores — General information about each game, including stats such as runs scored, runs allowed, day/night game, and winning pitcher
  • Season Hitting Stats — Collective hitting stats for position players over the course of a season.
  • Season Pitching Stats — Collective pitching stats for pitchers over the course of the season.
  • Game Logs — A representation of a player’s performance on a game to game basis.

Part 2 — Machine Learning

  • Random Forest
  • XGBoost
  • Naive Bayes
  • K-Nearest Neighbors

Machine Learning Preprocessing

import pandas as pd

boxscore = pd.read_html('')
box_score = boxscore[0]
box_score = box_score.set_index('Gm#')
# Drop unnecessary columns
box_score.drop(['Unnamed: 2', 'Unnamed: 4', 'Inn', 'Rank', 'Time', 'Attendance', 'Streak', 'Orig. Scheduled', 'Save'], axis=1, inplace=True)
# Drop rows labeled 'Gm#' (header rows repeated inside the table)
box_score.drop(['Gm#'], axis=0, inplace=True)
box_score = box_score[0:38]
# Number each row with the game that follows it
next_game = range(2, box_score.shape[0] + 2)
box_score['Next_Game'] = next_game
# Exclude walk-off results
box_score = box_score.loc[box_score['W/L'] != 'W-wo']
box_score = box_score.loc[box_score['W/L'] != 'L-wo']
# Keep only the outcome column, indexed by game number
next_game_outcome_2020 = box_score.drop(['Date', 'Tm', 'Opp', 'R', 'RA', 'W-L', 'GB', 'Win', 'Loss', 'D/N', 'Next_Game'], axis=1)
next_game_outcome_2020 = next_game_outcome_2020.loc[next_game_outcome_2020['W/L'] != 'W-wo']
next_game_outcome_2020 = next_game_outcome_2020.loc[next_game_outcome_2020['W/L'] != 'L-wo']
next_game_outcome_2020.index = next_game_outcome_2020.index.astype('int64')
next_game_outcome_2020
# Attach the next game's outcome to each game
boxscore_2020 = box_score.join(next_game_outcome_2020, on='Next_Game', lsuffix='_df', rsuffix='_other')
boxscore_2020 = boxscore_2020.drop(['Date', 'Tm', 'W-L', 'GB'], axis=1)
boxscore_2020
all_boxscores = [boxscore_2020, boxscore_2019, boxscore_2018]
boxscore_df = pd.concat(all_boxscores)
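To make the Next_Game join above easier to follow, here is a minimal sketch on a made-up five-game table. The rows and results are invented for illustration and are not Dodgers data. Each game is numbered with the game that follows it, and joining on that number pulls the following game's W/L in as a label column.

```python
import pandas as pd

# Toy stand-in for the scraped box scores; values are invented for illustration.
toy = pd.DataFrame(
    {"W/L": ["W", "L", "W", "W", "L"], "R": [5, 2, 7, 4, 1]},
    index=pd.Index([1, 2, 3, 4, 5], name="Gm#"),
)

# Each row points at the game that follows it.
toy["Next_Game"] = range(2, toy.shape[0] + 2)

# Keep only the outcome column to use as the label source.
outcome = toy[["W/L"]]

# Join on Next_Game so each row carries the result of the following game.
# The overlapping "W/L" column is split into "W/L_df" (this game) and
# "W/L_other" (next game) by the suffixes.
labeled = toy.join(outcome, on="Next_Game", lsuffix="_df", rsuffix="_other")

# The last game has no following game, so its label is missing; drop it.
labeled = labeled.dropna(subset=["W/L_other"])
print(labeled)
```

The final game of a season never has a label under this scheme, so dropping it (or handling it some other way) is a step worth doing before the seasons are concatenated.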
  • Using the .info() method to check datatypes and changing any as needed
  • Using the .isna().sum() method to check and resolve null values
  • Checking the value counts to see if there is any class imbalance
  • Creating dummy variables
  • Dropping all unnecessary columns
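The checklist above can be sketched on a small invented table. The column names mirror the box score columns used earlier, but the values, the median fill for nulls, and the 0/1 target encoding are assumptions made for illustration, not necessarily the choices made in this project.

```python
import numpy as np
import pandas as pd

# Invented sample rows; column names mirror the box score table.
df = pd.DataFrame({
    "R": ["5", "2", "7", "4"],          # runs read in as strings
    "RA": [3, 6, np.nan, 1],            # one missing value
    "D/N": ["D", "N", "N", "D"],        # day/night game
    "W/L_other": ["W", "L", "W", "W"],  # next-game outcome (target)
})

# 1. Check dtypes and fix any that came in wrong.
df.info()
df["R"] = df["R"].astype("int64")

# 2. Count nulls per column, then resolve them (median fill is one option).
print(df.isna().sum())
df["RA"] = df["RA"].fillna(df["RA"].median())

# 3. Look at the class balance of the target.
print(df["W/L_other"].value_counts(normalize=True))

# 4-5. Drop the target from the features and one-hot encode the categorical column.
X = pd.get_dummies(df.drop(columns=["W/L_other"]), columns=["D/N"])
y = (df["W/L_other"] == "W").astype(int)
```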
Confusion Matrix and ROC-AUC Curve for XGBoost
Confusion Matrix and ROC-AUC Curve for K-Nearest Neighbors
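For readers unfamiliar with the two diagnostics named in the captions above, this sketch computes a confusion matrix and a ROC AUC score by hand with NumPy. The labels and scores are made up for illustration and are not this project's model output. The helper names are hypothetical; in practice scikit-learn's confusion_matrix and roc_auc_score do the same job.

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (1 = win, 0 = loss)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return [[tn, fp], [fn, tp]]

def roc_auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative (ties count half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Invented example: true outcomes and predicted win probabilities.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 decision threshold

cm = confusion_matrix_2x2(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(cm, round(auc, 3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which is why the ROC curve is a useful companion to the confusion matrix: it does not depend on the 0.5 threshold.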

Part 3 — Data Visualization Using Plotly

  • ERA (Earned Run Average)
  • WHIP (Walks plus Hits per Inning Pitched)
  • BA (Batting Average)
  • OBP% (On-Base Percentage)
  • SLG% (Slugging Percentage)
  • OPS (On-Base Plus Slugging)
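As a quick reference for how the six statistics above are calculated, here is a small sketch using their standard definitions. The function names and the sample numbers are made up for illustration. Innings pitched must be passed as a true fraction (6 1/3 innings is 6.333, not the box score shorthand 6.1).

```python
def era(earned_runs, innings_pitched):
    """Earned runs allowed per nine innings."""
    return 9 * earned_runs / innings_pitched

def whip(walks, hits, innings_pitched):
    """Walks plus hits allowed per inning pitched."""
    return (walks + hits) / innings_pitched

def batting_average(hits, at_bats):
    return hits / at_bats

def on_base_pct(hits, walks, hit_by_pitch, at_bats, sac_flies):
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)

def slugging_pct(singles, doubles, triples, home_runs, at_bats):
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

def ops(obp, slg):
    return obp + slg

# Invented pitcher line: 20 earned runs, 15 walks, 45 hits over 60 innings.
print(era(20, 60), whip(15, 45, 60))

# Invented hitter line: 54 hits (30 1B, 12 2B, 2 3B, 10 HR) in 200 at-bats,
# with 25 walks, 2 hit-by-pitches, and 3 sacrifice flies.
obp = on_base_pct(54, 25, 2, 200, 3)
slg = slugging_pct(30, 12, 2, 10, 200)
print(batting_average(54, 200), round(obp, 3), slg, round(ops(obp, slg), 3))
```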
import numpy as np
import plotly.express as px

def position_BA(player_name, games):
    # Flat reference line at a .250 batting average
    fig = px.line(x=list(range(games)), y=.250*np.ones(games))

    # One trace per season, indexed by game number
    df2018_position = game_logs_2018_position.loc[player_name]
    fig.add_scatter(x=list(range(len(df2018_position.Date))), y=df2018_position.AVG, name='2018')

    df2019_position = game_logs_2019_position.loc[player_name]
    fig.add_scatter(x=list(range(len(df2019_position.Date))), y=df2019_position.AVG, name='2019')

    df2020_position = game_logs_2020_position.loc[player_name]
    fig.add_scatter(x=list(range(len(df2020_position.Date))), y=df2020_position.AVG, name='2020')

    fig.update_layout(title='Batting Average — ' + player_name, xaxis_title="Game", yaxis_title="BA", legend_title="Season")
    fig.show()  # render the chart
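The AVG column plotted above is assumed here to be a season-to-date batting average, which is how game logs usually report it. Under that assumption, this sketch shows how such a column can be derived from per-game hits and at-bats. The log is invented and "Player A" is a hypothetical player; the table is indexed by player name so that .loc[player_name] returns that player's games, as in the function above.

```python
import pandas as pd

# Invented per-game log for a hypothetical player; not real data.
log = pd.DataFrame(
    {
        "Date": ["Jul 23", "Jul 24", "Jul 25", "Jul 26"],
        "AB": [4, 5, 3, 4],
        "H": [1, 2, 0, 1],
    },
    index=["Player A"] * 4,
)

# Season-to-date average after each game: cumulative hits / cumulative at-bats.
log["AVG"] = (log["H"].cumsum() / log["AB"].cumsum()).round(3)

# Select one player's games the same way the plotting function does.
player_log = log.loc["Player A"]
print(player_log[["Date", "AVG"]])
```

Because the average is cumulative, early-season points swing widely and the line settles as at-bats accumulate, which is worth keeping in mind when comparing seasons of different lengths.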
Up to Date Batting Average Stats for Mookie Betts
Up to Date Earned Run Average Stats for Clayton Kershaw
  • Starting pitchers have no more than 12 games available in a season
  • Relievers have no more than 30; players who started and relieved were capped at 25
  • Position players are limited to 60 games

Part 4 — Data Driven Recommendations & Future Works

  • Including more seasons
  • Including more (or all) MLB teams
  • Including defensive statistics
  • Exploring advanced statistics, such as xBA, xSLG, Hard Hit %, and K%
  • Types of pitches thrown, average velocities, pitching quadrants
  • More complex prediction models, such as season W/L predictions or predicting player performance projections before a season starts




Acusio Bivona
