In [1]:
# Setting up
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import itertools
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# ignore warning 
import warnings
warnings.filterwarnings("ignore")
# hide the code cell
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:

Predicting a Season with a Machine Learning Model

Winter 2019, INFO 370

Members

William Kwok, Wonjo Barng, Kangwoo Choi, Vincent Widjaya

Introduction

Sports are an integral part of nearly all cultures. One of the most widely followed sports in the world is soccer, which we will call football throughout this paper. For years, people have used statistics to figure out what makes teams win, or to find out whether their favorite teams are the best. Razali suggests that "the research for predicting the results of football matches outcome started as early as 1977 by Stafani R". The English Premier League (EPL) is one of the largest and most popular leagues in the world. In 2017, five EPL clubs were among the top 10 teams by revenue.

In order to gain an understanding of what makes teams win at football, we will be exploring a few research questions:

  • RQ1: Is there an association between the betting odds and the end results of a game? We may be able to find cases of illegal match fixing such as the scandals that occurred in late 2013 and produce a model that predicts if match fixing is happening.
  • RQ2: Is there an association between the betting odds and statistics of a game? Studying this may also result in finding cases of illegal match fixing. For example, we can see if certain teams are under-performing in some games where bets are higher.
  • RQ3: Is there an association between the game stats and the end results of a game? With this, we will be able to identify which factors are associated with winning the game. Coaches can use this information to their advantage and train their players with those factors in mind.

In answering these questions, we will be working with data from all English Premier League seasons from 2014/2015 to the current 2018/2019 season, provided by Football Data. These datasets contain statistics for each game as well as betting odds from many bookmakers for that game.

Data Preparation

To answer these research questions, we selected features that could affect the result of a game. There are no null/NaN values in our dataset. Based on our domain knowledge, we chose the following features:

  • HomeTeam - Name of the home team
  • AwayTeam - Name of the away team
  • FTR - Full time result (Home win, Away win, Draw)
  • FTHG - Full time home team goals
  • FTAG - Full time away team goals
  • HS - Home team shots
  • AS - Away team shots
  • HST - Home team shots on target
  • AST - Away team shots on target
  • HSGR - Home team goals per shot (FTHG / HS, calculated)
  • ASGR - Away team goals per shot (FTAG / AS, calculated)

We also include the average betting odds from six bookmakers: Bet365, Bet&win, Interwetten, Pinnacle, VC Bet, and William Hill. We selected these companies because they have valid data in every season. The outcome with the lowest odds is the one considered most likely: the team with the lower odds is expected to win, and if the odds for a draw are the lowest, a draw is expected. For example, if the odds for the home team are 1.13 and the home team wins, each bettor receives 1.13 times the amount they staked. These are the resulting variables:

  • odd_home - Average betting odds for Home Teams
  • odd_draw - Average betting odds of betting on draw
  • odd_away - Average betting odds for Away Teams
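
As a minimal sketch of how decimal odds like these can be read, the snippet below turns a set of odds into a payout and into implied probabilities. The function name `implied_probabilities`, the stake, and the rescaling step that makes the probabilities sum to 1 are illustrative assumptions and are not part of our analysis; the odds are example values.

```python
def implied_probabilities(odd_home, odd_draw, odd_away):
    """Convert decimal odds into probabilities that sum to 1 (illustrative helper)."""
    raw = [1.0 / odd_home, 1.0 / odd_draw, 1.0 / odd_away]
    # The raw inverses usually add up to slightly more than 1 (the bookmaker's
    # margin), so we rescale them. This rescaling is an assumption of the sketch.
    total = sum(raw)
    return [p / total for p in raw]


# A bet of 10 units at odds of 1.13 returns stake * odds if it wins.
stake = 10.0
payout = stake * 1.13

# Example odds for home win, draw, and away win.
probs = implied_probabilities(1.26, 5.87, 12.09)
print(payout, [round(p, 3) for p in probs])
```

Lower odds translate into a higher implied probability, which is why we treat the outcome with the lowest odds as the expected result.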
In [2]:
# Load the previous four seasons and the current season.
df_1415 = pd.read_csv('./data/1415.csv')
df_1516 = pd.read_csv('./data/1516.csv')
df_1617 = pd.read_csv('./data/1617.csv')
df_1718 = pd.read_csv('./data/1718.csv')
df_1819 = pd.read_csv('./data/1819.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
# sort=True keeps the columns in alphabetical order when the seasons' columns differ.
df_total = pd.concat([df_1415, df_1516, df_1617, df_1718, df_1819], sort=True)

Combined Dataset

In [3]:
df_total.head()
Out[3]:
AC AF AR AS AST AY AwayTeam B365A B365D B365H ... Referee SJA SJD SJH VCA VCD VCH WHA WHD WHH
0 3.0 19.0 1.0 4.0 2.0 2.0 Crystal Palace 15.0 6.5 1.25 ... J Moss 12.00 5.75 1.25 10.50 6.25 1.25 12.0 5.5 1.25
1 6.0 10.0 0.0 13.0 3.0 1.0 Everton 2.4 3.4 3.20 ... M Jones 2.38 3.30 3.00 2.40 3.40 3.20 2.4 3.1 3.10
2 0.0 20.0 0.0 5.0 4.0 4.0 Swansea 11.0 5.0 1.36 ... M Dean 8.00 5.00 1.36 10.00 5.20 1.36 9.0 4.5 1.36
3 9.0 10.0 0.0 11.0 4.0 2.0 Hull 3.1 3.3 2.50 ... C Pawson 2.88 3.25 2.50 3.12 3.20 2.55 2.9 3.0 2.60
4 8.0 9.0 0.0 7.0 2.0 3.0 Aston Villa 4.5 3.5 1.95 ... A Taylor 4.00 3.40 1.95 4.75 3.30 1.95 4.2 3.2 1.95

5 rows × 68 columns

Dataset Format after data cleaning
In [4]:
# Clean data function
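def clean_data(df):
    # Copy the selected columns so the assignments below do not modify a view of df
    data = df[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST']].copy()
    # Goals per shot; a division by zero shots gives inf, which is replaced by 0
    data['HSGR'] = data['FTHG']/data['HS']
    data['ASGR'] = data['FTAG']/data['AS']
    data = data.replace([np.inf, -np.inf], 0)
    # Average the odds of the six bookmakers for each outcome
    bet_home = df[['B365H','BWH','IWH','PSH','VCH','WHH']].mean(axis=1)
    bet_draw = df[['B365D','BWD','IWD','PSD','VCD','WHD']].mean(axis=1)
    bet_away = df[['B365A','BWA','IWA','PSA','VCA','WHA']].mean(axis=1)
    
    data['odd_home'] = bet_home
    data['odd_draw'] = bet_draw
    data['odd_away'] = bet_away
    
    data.dropna()  # no-op: the result is not assigned, so no rows are dropped
    
    return data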
In [5]:
# clean data 
df_total = clean_data(df_total)
df_total.head()
Out[5]:
HomeTeam AwayTeam FTHG FTAG FTR HS AS HST AST HSGR ASGR odd_home odd_draw odd_away
0 Arsenal Crystal Palace 2.0 1.0 H 14.0 4.0 6.0 2.0 0.142857 0.250000 1.260000 5.866667 12.085000
1 Leicester Everton 2.0 2.0 D 11.0 13.0 3.0 3.0 0.181818 0.153846 3.073333 3.296667 2.393333
2 Man United Swansea 1.0 2.0 A 14.0 5.0 5.0 4.0 0.071429 0.400000 1.363333 4.925000 9.600000
3 QPR Hull 0.0 1.0 A 19.0 11.0 6.0 4.0 0.000000 0.090909 2.488333 3.193333 3.015000
4 Stoke Aston Villa 0.0 1.0 A 12.0 7.0 2.0 2.0 0.000000 0.142857 1.958333 3.345000 4.250000
Dataset with Engineered Features for RQ2

For RQ2, we realized it was better to create running metrics that look at each team's previous games, up to the game being observed, rather than using the statistics of individual games. By the time a game's statistics have been finalized, the bettors have already locked in their odds, so we would be predicting the odds from values that could not have influenced them.

To address this concern, we turned to feature engineering, creating variables that represent the statistics of past games in the season to predict the odds of the current game. This approach allows us to make more relevant predictions. Below is our prepared dataset for the models that answer RQ2. All variables below are updated after every game played by the home and away teams (i.e. they represent each team's statistics for all games in the season before the current one).

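  • H_W - Home team's wins so far (a draw counts as 0.5)
  • H_WR - Home team's win rate so far
  • H_avg_diff - Home team's average goal difference against its opponents so far
  • A_W - Away team's wins so far (a draw counts as 0.5)
  • A_WR - Away team's win rate so far
  • A_avg_diff - Away team's average goal difference against its opponents so far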

We show rows from the middle of the dataframe rather than the top because a team has no previous games at the start of a season, so its running statistics are all set to zero. The rows shown here are a better representation of the majority of the data.
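
The idea of a running metric that only looks at earlier games can be sketched on a toy example. The column names `team`, `goal_diff`, and `win_rate_before` are hypothetical, and this sketch uses pandas `shift` and `expanding` rather than the explicit loop in our notebook, so it illustrates the concept and is not the code we ran.

```python
import pandas as pd

# Toy data: one team's goal difference in four consecutive games.
games = pd.DataFrame({
    "team": ["A", "A", "A", "A"],
    "goal_diff": [1, 0, -2, 3],
})

# A win counts as 1, a draw as 0.5, and a loss as 0.
points = (games["goal_diff"] > 0).astype(float) + 0.5 * (games["goal_diff"] == 0)

# shift() hides the current game, so each row only sees earlier games.
# The first game has no history, so its value is filled with 0.
games["win_rate_before"] = (
    points.groupby(games["team"])
    .transform(lambda s: s.shift().expanding().mean())
    .fillna(0)
)
print(games)
```

Shifting before averaging is what keeps the current game's result out of its own features.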

In [6]:
def engineer_features(df):
    df = df.copy()
    df['goals_h_a'] = df['FTHG'] - df['FTAG']
    df['total_h_a'] = df['HS'] - df['AS']

    H_GT = [] # home games total so far
    H_W = [] # home wins so far
    H_WR = [] # home win rate so far
    H_avg_diff = [] # home avg goals diff

    A_GT = [] # away games total so far
    A_W = [] # away wins so far
    A_WR = [] #away win rate so far
    A_avg_diff = [] # away avg goals diff

    for i in range(len(df)):
        home = df.loc[i, 'HomeTeam']
        away = df.loc[i, 'AwayTeam']
        
        # Goal differences from each team's earlier games, from its own point of view.
        # Series.append was removed in pandas 2.0; pd.concat is the equivalent.
        home_h_games = df[df['HomeTeam'] == home].loc[:i-1]['goals_h_a']
        home_a_games = df[df['AwayTeam'] == home].loc[:i-1]['goals_h_a'] * -1
        home_games = pd.concat([home_h_games, home_a_games])
        
        away_h_games = df[df['HomeTeam'] == away].loc[:i-1]['goals_h_a']
        away_a_games = df[df['AwayTeam'] == away].loc[:i-1]['goals_h_a'] * -1
        away_games = pd.concat([away_h_games, away_a_games])
        
        H_GT.append(len(home_games))
        A_GT.append(len(away_games))
        
        H_W.append((home_games > 0).sum() + (home_games == 0).sum() * 0.5)
        A_W.append((away_games > 0).sum() + (away_games == 0).sum() * 0.5)
        
        if H_GT[i] > 0:
            H_WR.append(H_W[i] / H_GT[i])
            H_avg_diff.append(home_games.mean())
        else:
            H_WR.append(0)
            H_avg_diff.append(0)
            
        if A_GT[i] > 0:
            A_WR.append(A_W[i] / A_GT[i])
            A_avg_diff.append(away_games.mean())
        else:
            A_WR.append(0)
            A_avg_diff.append(0)
    
    df['H_GT'] = H_GT
    df['H_W'] = H_W
    df['H_WR'] = H_WR
    df['H_avg_diff'] = H_avg_diff
    df['A_GT'] = A_GT
    df['A_W'] = A_W
    df['A_WR'] = A_WR
    df['A_avg_diff'] = A_avg_diff
    return df
In [7]:
# clean data for each individual dataset
data_1415 = clean_data(pd.read_csv('./data/1415.csv'))
data_1415.drop(data_1415.tail(1).index,inplace=True)
data_1516 = clean_data(pd.read_csv('./data/1516.csv'))
data_1617 = clean_data(pd.read_csv('./data/1617.csv'))
data_1718 = clean_data(pd.read_csv('./data/1718.csv'))
data_1819 = clean_data(pd.read_csv('./data/1819.csv'))

df_1415_feat_engr = engineer_features(data_1415)
df_1516_feat_engr = engineer_features(data_1516)
df_1617_feat_engr = engineer_features(data_1617)
df_1718_feat_engr = engineer_features(data_1718)
df_1819_feat_engr = engineer_features(data_1819)

# Combine the four completed seasons (DataFrame.append was removed in pandas 2.0)
df_past_seasons = pd.concat(
    [df_1415_feat_engr, df_1516_feat_engr, df_1617_feat_engr, df_1718_feat_engr],
    ignore_index=True
)
df_past_seasons = df_past_seasons[['HomeTeam', 'AwayTeam', 
                                   'H_W', 'H_WR', 'H_avg_diff', 
                                   'A_W', 'A_WR', 'A_avg_diff', 
                                   'odd_home', 'odd_draw', 'odd_away']]
In [8]:
df_past_seasons[150:155]
Out[8]:
HomeTeam AwayTeam H_W H_WR H_avg_diff A_W A_WR A_avg_diff odd_home odd_draw odd_away
150 Arsenal Newcastle 8.5 0.566667 0.400000 8.5 0.566667 -0.066667 1.446667 4.471667 7.563333
151 Burnley Southampton 5.0 0.333333 -0.933333 9.0 0.600000 0.866667 5.041667 3.591667 1.761667
152 Chelsea Hull 12.5 0.833333 1.400000 5.5 0.366667 -0.400000 1.181667 7.028333 17.410000
153 Crystal Palace Stoke 5.5 0.366667 -0.333333 6.5 0.433333 -0.200000 2.588333 3.198333 2.926667
154 Leicester Man City 4.0 0.266667 -0.733333 11.5 0.766667 1.200000 6.545000 4.528333 1.493333
In [9]:
# Fraction of games whose actual result matched the outcome with the lowest average odds
def average_betting(df):
    betting_accuracies = []
    for index, row in df.iterrows():
        if(row['FTR'] == 'H' and row['odd_home'] < row['odd_away'] and row['odd_home'] < row['odd_draw']):
            betting_accuracies.append(1)
        elif(row['FTR'] == 'D' and row['odd_draw'] < row['odd_away'] and row['odd_draw'] < row['odd_home']):
            betting_accuracies.append(1)
        elif(row['FTR'] == 'A' and row['odd_away'] < row['odd_home'] and row['odd_away'] < row['odd_draw']):
            betting_accuracies.append(1)
        else:
            betting_accuracies.append(0)
        
    return np.mean(betting_accuracies)
In [10]:
# Betting accuracy for each season, in chronological order
time = [average_betting(data_1415),
        average_betting(data_1516), 
        average_betting(data_1617), 
        average_betting(data_1718), 
        average_betting(data_1819)]

Finding patterns from Dataset

Accuracy of betting odds

We expected the accuracy of the betting odds to increase over time, but there is no clear trend across seasons.

In [11]:
# Plot betting accuracy by season
plt.plot(["14/15", "15/16", "16/17", "17/18", "18/19"], time, label='betting accuracy')
plt.legend()
plt.title('betting accuracy')
plt.xlabel('time')
plt.ylabel('accuracy')
plt.show()

Actual Result vs Betting odds

We compare the average betting odds with the actual results. When the home team wins, the outcome with the lowest odds matches the actual result about 84% of the time. When the game ends in a draw, the match rate is around 0%, because a draw is almost never the outcome with the lowest odds. When the away team wins, the match rate is around 57%.
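
The same comparison can be sketched without explicit loops. The toy values and the `favourite` column are made up for illustration, and `idxmin` breaks ties by taking the first column, which differs slightly from the strict comparisons in our own functions.

```python
import pandas as pd

# Toy games: actual result plus average odds for each outcome.
toy = pd.DataFrame({
    "FTR": ["H", "A", "D", "H"],
    "odd_home": [1.5, 4.0, 2.1, 3.0],
    "odd_draw": [4.0, 3.5, 3.2, 3.3],
    "odd_away": [6.0, 1.8, 3.6, 2.4],
})

# Map each odds column to the result code it stands for.
odds_cols = {"odd_home": "H", "odd_draw": "D", "odd_away": "A"}

# The favourite is the outcome with the lowest odds in each row.
toy["favourite"] = toy[list(odds_cols)].idxmin(axis=1).map(odds_cols)

# Overall share of games where the favourite matched the actual result.
hit_rate = (toy["favourite"] == toy["FTR"]).mean()

# For each actual result (rows), the share of games per favourite (columns).
table = pd.crosstab(toy["FTR"], toy["favourite"], normalize="index")
print(hit_rate)
print(table)
```

Each row of `table` corresponds to one of the pie charts: it shows how often each outcome was the favourite, given the actual result.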

In [12]:
# Share of games in which home, draw, or away had the lowest betting odds (ties fall into the away bucket)
def accuracy(df):
    home = 0
    draw = 0
    away = 0
    for index, row in df.iterrows():
        if (row['odd_home'] < row['odd_away'] and row['odd_home'] < row['odd_draw']):
            home = home+1
        elif (row['odd_draw'] < row['odd_away'] and row['odd_draw'] < row['odd_home']):
            draw = draw+1
        else:
            away = away+1
            
    return [home / len(df), draw / len(df), away / len(df)]
In [13]:
# Draw pie charts showing which outcome had the lowest odds, grouped by actual result.
plt.pie(accuracy(df_total.loc[df_total['FTR'] == 'H']), labels=['Home', 'Draw', 'Away'])
plt.title('Betting odds when Home won the game')
plt.show()
plt.pie(accuracy(df_total.loc[df_total['FTR'] == 'D']), labels=['Home', 'Draw', 'Away'])
plt.title('Betting odds when draw')
plt.show()
plt.pie(accuracy(df_total.loc[df_total['FTR'] == 'A']), labels=['Home', 'Draw', 'Away'])
plt.title('Betting odds when Away won the game')
plt.show()

Home vs Away

This section compares the number of home wins, draws, and away wins in each season. Home teams tend to win more often.
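
As a hedged alternative to counting results in a loop, pandas can tally them directly. The toy `results` series is made up; `value_counts` skips missing values by default, so a row without a result is not counted as an away win.

```python
import pandas as pd

# Toy full-time results, including one missing value.
results = pd.Series(["H", "D", "A", "H", "H", None], name="FTR")

# Count each outcome and fix the order to Home, Draw, Away.
counts = results.value_counts().reindex(["H", "D", "A"], fill_value=0)
print(counts.to_dict())
```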

In [14]:
## get distribution of the result
def getDistResult(data):
    arr = [0,0,0]
    for index, row in data.iterrows():
        if row.FTR == 'H':
            arr[0] += 1
        elif row.FTR == 'D':
            arr[1] += 1
        elif row.FTR == 'A':  # rows with a missing result are not counted
            arr[2] += 1
    return arr
In [15]:
# by season home or away
result = ['Home Wins', 'Draw', 'Away Wins']
plt.barh(result, getDistResult(df_1415))
plt.xlabel('Frequency')
plt.title('14/15 Season Result stats')
plt.show()

plt.barh(result, getDistResult(df_1516))
plt.xlabel('Frequency')
plt.title('15/16 Season Result stats')
plt.show()

plt.barh(result, getDistResult(df_1617))
plt.xlabel('Frequency')
plt.title('16/17 Season Result stats')
plt.show()

plt.barh(result, getDistResult(df_1718))
plt.xlabel('Frequency')
plt.title('17/18 Season Result stats')
plt.show()

plt.barh(result, getDistResult(df_1819))
plt.xlabel('Frequency')
plt.title('18/19 Season Result stats')
plt.show()

Modeling

In [16]:
# Clean data function from Kangwoo's notebook
def clean_data_poisson(df):
    # Copy the selected columns so the assignments below do not modify a view of df
    data = df[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST']].copy()
    data['HSGR'] = data['FTHG']/data['HS']
    data['ASGR'] = data['FTAG']/data['AS']
    data = data.replace([np.inf, -np.inf], 0)
#     bet_home = df[['B365H','BWH','IWH','PSH','VCH','WHH']].mean(axis=1)
#     bet_draw = df[['B365D','BWD','IWD','PSD','VCD','WHD']].mean(axis=1)
#     bet_away = df[['B365A','BWA','IWA','PSA','VCA','WHA']].mean(axis=1)
    
    
#     data['odd_home'] = bet_home
#     data['odd_draw'] = bet_draw
#     data['odd_away'] = bet_away
    
    data.dropna()  # no-op: the result is not assigned, so no rows are dropped
    
    ############### New stuff
    team_scores = {}
    team_shots = {}
    HomeAvgAllTimeSoFar = []
    HomeHighAllTimeSoFar = []
    HomeLowAllTimeSoFar = []
    HomeTotalGoals = []
    HomeTotalShots = []
    HomeTotalAccuracy = []
    AwayAvgAllTimeSoFar = []
    AwayHighAllTimeSoFar = []
    AwayLowAllTimeSoFar = []
    AwayTotalGoals = []
    AwayTotalShots = []
    AwayTotalAccuracy = []
    for index, row in data.iterrows():
        # Add values to all the rows before adding to the team scores
        home_team = row["HomeTeam"]
        away_team = row["AwayTeam"]
        if home_team not in team_scores:
            team_scores[home_team] = []
        if away_team not in team_scores:
            team_scores[away_team] = []
        if home_team not in team_shots:
            team_shots[home_team] = []
        if away_team not in team_shots:
            team_shots[away_team] = []
        home_team_scores = team_scores[home_team]
        away_team_scores = team_scores[away_team]
        home_team_shots = team_shots[home_team]
        away_team_shots = team_shots[away_team]
        if len(home_team_scores) < 1:
            HomeAvgAllTimeSoFar.append(0)
            HomeHighAllTimeSoFar.append(0)
            HomeLowAllTimeSoFar.append(0)
        else: 
            HomeAvgAllTimeSoFar.append(np.mean(home_team_scores))
            HomeHighAllTimeSoFar.append(np.max(home_team_scores).astype("float"))
            HomeLowAllTimeSoFar.append(np.min(home_team_scores).astype("float"))
        if len(away_team_scores) < 1:
            AwayAvgAllTimeSoFar.append(0)
            AwayHighAllTimeSoFar.append(0)
            AwayLowAllTimeSoFar.append(0)
        else: 
            AwayAvgAllTimeSoFar.append(np.mean(away_team_scores))
            AwayHighAllTimeSoFar.append(np.max(away_team_scores).astype("float"))
            AwayLowAllTimeSoFar.append(np.min(away_team_scores).astype("float"))
        s_Home_Scores = np.sum(home_team_scores)
        s_Home_Shots = np.sum(home_team_shots)
        s_Away_Scores = np.sum(away_team_scores)
        s_Away_Shots = np.sum(away_team_shots)
        HomeTotalGoals.append(s_Home_Scores)
        HomeTotalShots.append(s_Home_Shots)
        HomeTotalAccuracy.append(np.nan_to_num(s_Home_Scores/s_Home_Shots))
        AwayTotalGoals.append(s_Away_Scores)
        AwayTotalShots.append(s_Away_Shots)
        AwayTotalAccuracy.append(np.nan_to_num(s_Away_Scores/s_Away_Shots))
        # Add to team scores
        team_scores[home_team].append(row["FTHG"])
        team_scores[away_team].append(row["FTAG"])
        team_shots[home_team].append(row["HS"])
        team_shots[away_team].append(row["AS"])
        
    data["HomeAvgAllTimeSoFar"] = HomeAvgAllTimeSoFar
    data["HomeHighAllTimeSoFar"] = HomeHighAllTimeSoFar
    data["HomeLowAllTimeSoFar"] = HomeLowAllTimeSoFar
    data["AwayAvgAllTimeSoFar"] = AwayAvgAllTimeSoFar
    data["AwayHighAllTimeSoFar"] = AwayHighAllTimeSoFar
    data["AwayLowAllTimeSoFar"] = AwayLowAllTimeSoFar 
    data["HomeTotalGoals"] = HomeTotalGoals
    data["HomeTotalShots"] = HomeTotalShots
    data["HomeTotalAccuracy"] = HomeTotalAccuracy
    data["AwayTotalGoals"] = AwayTotalGoals
    data["AwayTotalShots"] = AwayTotalShots
    data["AwayTotalAccuracy"] = AwayTotalAccuracy
    #####################
    data.dropna()  # no-op: the result is not assigned, so no rows are dropped
    
    return data
In [17]:
# clean data for each individual dataset
data_1415_poi = clean_data_poisson(pd.read_csv('./data/1415.csv'))
data_1415_poi.drop(data_1415_poi.tail(1).index,inplace=True)
data_1516_poi = clean_data_poisson(pd.read_csv('./data/1516.csv'))
data_1617_poi = clean_data_poisson(pd.read_csv('./data/1617.csv'))
data_1718_poi = clean_data_poisson(pd.read_csv('./data/1718.csv'))
data_1819_poi = clean_data_poisson(pd.read_csv('./data/1819.csv'))

(1) Using Poisson

We attempted to fit a Poisson model to the data to predict whether a game would end in a home win, an away win, or a draw. We encoded the FTR values using sklearn's LabelEncoder, then split the data into training and test sets. We then fit a generalized linear model with a Poisson family using the following three formulas:

  • FTR ~ HomeTeam + AwayTeam
  • FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + AwayTotalAccuracy
  • FTR ~ HomeAvgAllTimeSoFar + HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + AwayTotalAccuracy

Here is a list of the values we used in these models:

  • HomeTeam - Name of the home team
  • AwayTeam - Name of the away team
  • HomeAvgAllTimeSoFar - Average of all scores from each game the home team has played by that point in time.
  • HomeHighAllTimeSoFar - Highest of all scores from each game the home team has played by that point in time.
  • HomeLowAllTimeSoFar - Lowest of all scores from each game the home team has played by that point in time.
  • AwayAvgAllTimeSoFar - Average of all scores from each game the away team has played by that point in time.
  • AwayHighAllTimeSoFar - Highest of all scores from each game the away team has played by that point in time.
  • AwayLowAllTimeSoFar - Lowest of all scores from each game the away team has played by that point in time.
  • HomeTotalGoals - Total home goals to date
  • HomeTotalShots - Total home shots to date
  • HomeTotalAccuracy - Home accuracy to date
  • AwayTotalGoals - Total away goals to date
  • AwayTotalShots - Total away shots to date
  • AwayTotalAccuracy - Away accuracy to date
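
To make the encoding and rounding steps concrete, here is a minimal sketch. The `predicted_means` values are hypothetical and do not come from our fitted models, and the clipping step is an addition of this sketch that our notebook does not perform.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# LabelEncoder stores classes in sorted order, so A -> 0, D -> 1, H -> 2.
encoder = LabelEncoder().fit(["H", "D", "A"])
codes = encoder.transform(["H", "D", "A"])
print(list(encoder.classes_), codes)

# Hypothetical predicted means from a count model (not real model output).
predicted_means = np.array([0.4, 1.2, 1.7, 2.6])

# Round to the nearest class code; clip so values above 2 still map to a class.
rounded = np.clip(np.round(predicted_means), 0, 2).astype(int)
labels = encoder.inverse_transform(rounded)
print(labels)
```

Because the codes are treated as counts, the model implicitly orders the outcomes as away win, draw, home win.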

For each year

First we ran this model on each season's dataset separately. We began with just the home team and away team names as factors in the model, similar (but not identical) to David Sheehan's study. We received these accuracy scores.

  • 14-15: 0.368
  • 15-16: 0.329
  • 16-17: 0.395
  • 17-18: 0.382
  • 18-19: 0.431

This very simple model guesses correctly more than a third of the time in most seasons. However, that is not as good as we would like: it is only slightly better than guessing at random among the three outcomes. In 2015-2016, it would have been worse than guessing at random.

We then tried with only the factors we generated and found a dip in performance for most of the years.

  • 14-15: 0.289
  • 15-16: 0.289
  • 16-17: 0.355
  • 17-18: 0.382
  • 18-19: 0.362

Finally, we tried both the home team and away team names as factors combined with our factors, and we saw a slight improvement in most of the scores.

  • 14-15: 0.368
  • 15-16: 0.382
  • 16-17: 0.408
  • 17-18: 0.395
  • 18-19: 0.362

We theorize that the 18-19 accuracy is lower because fewer games have been played so far in that season.

Combined

We also ran the model on all the combined datasets. Note that while combined, the "All Time" factors are localized to the season the data point came from, meaning there are no "lifetime" data points besides the name of the team (doing so doesn't provide any increase in accuracy). For the model that just had team names as factors, we received an accuracy of 0.387. For the model with just our factors, we received an accuracy of 0.362. For the model with those combined, we received an accuracy of 0.398.

If we take a look at the summary of the model with the team names as factors alongside our factors, we see that some of the factors are significant. In contrast to David Sheehan's study, we did not obtain as many significant p-values because we did not combine HomeTeam and AwayTeam into a single team variable for this specific analysis.

In [18]:
# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1415 = LabelEncoder() 
encoder_1415.fit(data_1415_poi["FTR"])
data_1415_poi["FTR"] = encoder_1415.transform(data_1415_poi["FTR"])

# split into training and test data
train_1415, valid_1415, train_labels_1415, valid_labels_1415 = train_test_split(
    data_1415_poi.drop("FTR", axis=1),
    data_1415_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1415_all = train_1415
train_1415_all["FTR"] = train_labels_1415

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1415_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1415))
a1 = accuracy_score(valid_labels_1415, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1415_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1415))
a2 = accuracy_score(valid_labels_1415, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam ", 
            data=train_1415_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1415))
a3 = accuracy_score(valid_labels_1415, pred)

a1, a2, a3
Out[18]:
(0.3684210526315789, 0.2894736842105263, 0.3684210526315789)
In [19]:
# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1516 = LabelEncoder()
encoder_1516.fit(data_1516_poi["FTR"])
data_1516_poi["FTR"] = encoder_1516.transform(data_1516_poi["FTR"])

# split into training and test data
train_1516, valid_1516, train_labels_1516, valid_labels_1516 = train_test_split(
    data_1516_poi.drop("FTR", axis=1),
    data_1516_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1516_all = train_1516
train_1516_all["FTR"] = train_labels_1516

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1516_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1516))
a1 = accuracy_score(valid_labels_1516, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1516_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1516))
a2 = accuracy_score(valid_labels_1516, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_1516_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1516))
a3 = accuracy_score(valid_labels_1516, pred)

a1, a2, a3
Out[19]:
(0.3815789473684211, 0.2894736842105263, 0.32894736842105265)
In [20]:
# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1617 = LabelEncoder()
encoder_1617.fit(data_1617_poi["FTR"])
data_1617_poi["FTR"] = encoder_1617.transform(data_1617_poi["FTR"])

# split into training and test data
train_1617, valid_1617, train_labels_1617, valid_labels_1617 = train_test_split(
    data_1617_poi.drop("FTR", axis=1),
    data_1617_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1617_all = train_1617
train_1617_all["FTR"] = train_labels_1617

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1617_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1617))
a1 = accuracy_score(valid_labels_1617, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1617_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1617))
a2 = accuracy_score(valid_labels_1617, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_1617_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1617))
a3 = accuracy_score(valid_labels_1617, pred)

a1, a2, a3
Out[20]:
(0.40789473684210525, 0.35526315789473684, 0.39473684210526316)
In [21]:
# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1718 = LabelEncoder()
encoder_1718.fit(data_1718_poi["FTR"])
data_1718_poi["FTR"] = encoder_1718.transform(data_1718_poi["FTR"])

# split into training and test data
train_1718, valid_1718, train_labels_1718, valid_labels_1718 = train_test_split(
    data_1718_poi.drop("FTR", axis=1),
    data_1718_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1718_all = train_1718
train_1718_all["FTR"] = train_labels_1718

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1718_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1718))
a1 = accuracy_score(valid_labels_1718, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1718_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1718))
a2 = accuracy_score(valid_labels_1718, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_1718_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1718))
a3 = accuracy_score(valid_labels_1718, pred)

a1, a2, a3
Out[21]:
(0.39473684210526316, 0.3815789473684211, 0.3815789473684211)
In [22]:
# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1819 = LabelEncoder()
encoder_1819.fit(data_1819_poi["FTR"])
data_1819_poi["FTR"] = encoder_1819.transform(data_1819_poi["FTR"])

# split into training and test data
train_1819, valid_1819, train_labels_1819, valid_labels_1819 = train_test_split(
    data_1819_poi.drop("FTR", axis=1),
    data_1819_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1819_all = train_1819.copy()
train_1819_all["FTR"] = train_labels_1819

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1819_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1819))
a1 = accuracy_score(valid_labels_1819, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1819_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1819))
a2 = accuracy_score(valid_labels_1819, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_1819_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1819))
a3 = accuracy_score(valid_labels_1819, pred)

a1, a2, a3
Out[22]:
(0.3620689655172414, 0.3620689655172414, 0.43103448275862066)
In [23]:
# concatenate all data
all_data_poi = pd.concat([data_1415_poi, data_1516_poi, data_1617_poi, data_1718_poi, data_1819_poi], axis=0, ignore_index=True)
In [24]:
encoder_all = LabelEncoder()
encoder_all.fit(all_data_poi["FTR"])
all_data_poi["FTR"] = encoder_all.transform(all_data_poi["FTR"])
train_all, valid_all, train_labels_all, valid_labels_all = train_test_split(
    all_data_poi.drop("FTR", axis=1),
    all_data_poi["FTR"],
    test_size=0.2,
    random_state=123
)

train_all_allcols = train_all.copy()
train_all_allcols["FTR"] = train_labels_all
m1 = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_all_allcols, family=sm.families.Poisson()).fit()
pred = np.round(m1.predict(valid_all))
pred[np.isnan(pred)] = 1
a1 = accuracy_score(valid_labels_all, pred)

m2 = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_all_allcols, family=sm.families.Poisson()).fit()

pred = np.round(m2.predict(valid_all))
pred[np.isnan(pred)] = 1
a2 = accuracy_score(valid_labels_all, pred)

m3 = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_all_allcols, family=sm.families.Poisson()).fit()

pred = np.round(m3.predict(valid_all))
pred[np.isnan(pred)] = 1
a3 = accuracy_score(valid_labels_all, pred)


a1, a2, a3
Out[24]:
(0.39779005524861877, 0.36187845303867405, 0.3867403314917127)
In [25]:
m1.summary()
Out[25]:
Generalized Linear Model Regression Results
Dep. Variable: FTR No. Observations: 1447
Model: GLM Df Residuals: 1378
Model Family: Poisson Df Model: 68
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -1782.6
Date: Wed, 13 Mar 2019 Deviance: 1145.1
Time: 02:08:52 Pearson chi2: 865.
No. Iterations: 5 Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 0.2067 0.216 0.956 0.339 -0.217 0.630
HomeTeam[T.Aston Villa] -0.8236 0.267 -3.083 0.002 -1.347 -0.300
HomeTeam[T.Bournemouth] -0.3208 0.171 -1.879 0.060 -0.655 0.014
HomeTeam[T.Brighton] -0.1526 0.217 -0.704 0.482 -0.578 0.273
HomeTeam[T.Burnley] -0.3196 0.189 -1.692 0.091 -0.690 0.051
HomeTeam[T.Cardiff] -0.6427 0.344 -1.870 0.061 -1.316 0.031
HomeTeam[T.Chelsea] -0.0966 0.133 -0.725 0.468 -0.358 0.165
HomeTeam[T.Crystal Palace] -0.5529 0.171 -3.232 0.001 -0.888 -0.218
HomeTeam[T.Everton] -0.2673 0.152 -1.762 0.078 -0.565 0.030
HomeTeam[T.Fulham] -0.7869 0.398 -1.977 0.048 -1.567 -0.007
HomeTeam[T.Huddersfield] -0.6784 0.278 -2.440 0.015 -1.223 -0.133
HomeTeam[T.Hull] -0.4553 0.221 -2.061 0.039 -0.888 -0.022
HomeTeam[T.Leicester] -0.2157 0.155 -1.391 0.164 -0.520 0.088
HomeTeam[T.Liverpool] -0.1082 0.135 -0.800 0.424 -0.373 0.157
HomeTeam[T.Man City] -0.0918 0.142 -0.647 0.518 -0.370 0.186
HomeTeam[T.Man United] -0.0234 0.135 -0.174 0.862 -0.288 0.241
HomeTeam[T.Middlesbrough] -0.4627 0.311 -1.487 0.137 -1.073 0.147
HomeTeam[T.Newcastle] -0.3958 0.180 -2.199 0.028 -0.749 -0.043
HomeTeam[T.Norwich] -0.6440 0.323 -1.996 0.046 -1.276 -0.012
HomeTeam[T.QPR] -0.4678 0.284 -1.646 0.100 -1.025 0.089
HomeTeam[T.Southampton] -0.3345 0.155 -2.163 0.031 -0.638 -0.031
HomeTeam[T.Stoke] -0.3278 0.171 -1.914 0.056 -0.664 0.008
HomeTeam[T.Sunderland] -0.6836 0.218 -3.132 0.002 -1.111 -0.256
HomeTeam[T.Swansea] -0.4001 0.182 -2.196 0.028 -0.757 -0.043
HomeTeam[T.Tottenham] -0.1366 0.137 -0.997 0.319 -0.405 0.132
HomeTeam[T.Watford] -0.2621 0.171 -1.529 0.126 -0.598 0.074
HomeTeam[T.West Brom] -0.5356 0.186 -2.887 0.004 -0.899 -0.172
HomeTeam[T.West Ham] -0.3562 0.157 -2.274 0.023 -0.663 -0.049
HomeTeam[T.Wolves] -0.2620 0.301 -0.871 0.384 -0.851 0.328
AwayTeam[T.Aston Villa] 0.3467 0.219 1.581 0.114 -0.083 0.776
AwayTeam[T.Bournemouth] 0.2235 0.182 1.231 0.218 -0.132 0.579
AwayTeam[T.Brighton] 0.2752 0.238 1.156 0.248 -0.191 0.742
AwayTeam[T.Burnley] 0.1318 0.193 0.685 0.494 -0.246 0.509
AwayTeam[T.Cardiff] 0.2860 0.289 0.989 0.323 -0.281 0.853
AwayTeam[T.Chelsea] -0.2379 0.179 -1.332 0.183 -0.588 0.112
AwayTeam[T.Crystal Palace] -0.0014 0.181 -0.008 0.994 -0.356 0.353
AwayTeam[T.Everton] 0.1402 0.172 0.814 0.416 -0.197 0.478
AwayTeam[T.Fulham] 0.5538 0.290 1.909 0.056 -0.015 1.123
AwayTeam[T.Huddersfield] 0.2565 0.232 1.107 0.268 -0.198 0.711
AwayTeam[T.Hull] 0.3666 0.214 1.713 0.087 -0.053 0.786
AwayTeam[T.Leicester] 0.0848 0.176 0.481 0.631 -0.261 0.431
AwayTeam[T.Liverpool] -0.1102 0.175 -0.628 0.530 -0.454 0.234
AwayTeam[T.Man City] -0.4459 0.203 -2.200 0.028 -0.843 -0.049
AwayTeam[T.Man United] -0.2398 0.181 -1.325 0.185 -0.594 0.115
AwayTeam[T.Middlesbrough] 0.2716 0.306 0.889 0.374 -0.327 0.871
AwayTeam[T.Newcastle] 0.2413 0.184 1.310 0.190 -0.120 0.602
AwayTeam[T.Norwich] 0.3895 0.251 1.555 0.120 -0.101 0.881
AwayTeam[T.QPR] 0.4945 0.262 1.890 0.059 -0.018 1.007
AwayTeam[T.Southampton] 0.1077 0.173 0.621 0.535 -0.232 0.448
AwayTeam[T.Stoke] 0.1731 0.185 0.936 0.349 -0.189 0.536
AwayTeam[T.Sunderland] 0.2262 0.200 1.131 0.258 -0.166 0.618
AwayTeam[T.Swansea] 0.2137 0.186 1.147 0.251 -0.151 0.579
AwayTeam[T.Tottenham] -0.3941 0.189 -2.083 0.037 -0.765 -0.023
AwayTeam[T.Watford] 0.2499 0.176 1.416 0.157 -0.096 0.596
AwayTeam[T.West Brom] 0.1332 0.191 0.698 0.485 -0.241 0.507
AwayTeam[T.West Ham] 0.0943 0.172 0.550 0.582 -0.242 0.431
AwayTeam[T.Wolves] 0.1480 0.324 0.457 0.647 -0.486 0.782
HomeAvgAllTimeSoFar 0.1749 0.146 1.202 0.229 -0.110 0.460
HomeHighAllTimeSoFar 0.0191 0.034 0.557 0.577 -0.048 0.086
HomeLowAllTimeSoFar -0.0132 0.097 -0.136 0.892 -0.204 0.177
AwayAvgAllTimeSoFar -0.1817 0.166 -1.091 0.275 -0.508 0.145
AwayHighAllTimeSoFar 0.0228 0.033 0.688 0.492 -0.042 0.088
AwayLowAllTimeSoFar -0.0558 0.110 -0.505 0.614 -0.272 0.161
HomeTotalGoals -0.0023 0.006 -0.380 0.704 -0.014 0.010
HomeTotalShots 0.0003 0.001 0.321 0.748 -0.001 0.002
HomeTotalAccuracy -1.3185 1.531 -0.861 0.389 -4.319 1.682
AwayTotalGoals -0.0006 0.007 -0.094 0.925 -0.014 0.013
AwayTotalShots -6.407e-05 0.001 -0.069 0.945 -0.002 0.002
AwayTotalAccuracy 1.4854 1.756 0.846 0.398 -1.956 4.927
In [26]:
m2.summary()
Out[26]:
Generalized Linear Model Regression Results
Dep. Variable: FTR No. Observations: 1447
Model: GLM Df Residuals: 1434
Model Family: Poisson Df Model: 12
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -1818.3
Date: Wed, 13 Mar 2019 Deviance: 1216.6
Time: 02:08:53 Pearson chi2: 896.
No. Iterations: 5 Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -0.0160 0.122 -0.131 0.896 -0.255 0.223
HomeAvgAllTimeSoFar 0.3790 0.114 3.314 0.001 0.155 0.603
HomeHighAllTimeSoFar 0.0088 0.031 0.282 0.778 -0.052 0.070
HomeLowAllTimeSoFar -0.1273 0.093 -1.363 0.173 -0.310 0.056
AwayAvgAllTimeSoFar -0.4953 0.134 -3.688 0.000 -0.759 -0.232
AwayHighAllTimeSoFar 0.0206 0.030 0.690 0.490 -0.038 0.079
AwayLowAllTimeSoFar 0.0202 0.107 0.188 0.851 -0.190 0.230
HomeTotalGoals -0.0016 0.006 -0.274 0.784 -0.013 0.010
HomeTotalShots 0.0011 0.001 1.368 0.171 -0.000 0.003
HomeTotalAccuracy -1.6430 1.321 -1.244 0.213 -4.231 0.946
AwayTotalGoals -0.0033 0.006 -0.516 0.606 -0.016 0.009
AwayTotalShots -0.0007 0.001 -0.864 0.387 -0.002 0.001
AwayTotalAccuracy 3.6994 1.458 2.537 0.011 0.842 6.557

(2) Predicting Betting Odds

In [27]:
def forward_selected(data, response):
    """Forward selection for OLS: greedily add the predictor that most improves adjusted R-squared."""
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model

def getLastTeamStats(df):
    """Return each team's most recent W, WR and avg_diff values found in df."""
    # Reverse once so the first match found for a team is its latest game
    temp_df = df[::-1].reset_index(drop=True)
    team_stats = {}
    for team in df.HomeTeam.unique():
        for i in range(len(temp_df)):
            game = temp_df.loc[i]
            if team == game['HomeTeam']:
                prefix = 'H'
            elif team == game['AwayTeam']:
                prefix = 'A'
            else:
                continue
            team_stats[team] = {
                'W': game[prefix + '_W'],
                'WR': game[prefix + '_WR'],
                'avg_diff': game[prefix + '_avg_diff'],
            }
            break
    return team_stats

As explained in the data preparation section, predicting betting odds reasonably required engineering new variables to run models on. The steps described in this section support that reasoning. We chose a regressor rather than a classifier because betting odds are continuous variables, so classifying them would not make sense. We also chose not to use a percentile feature selector because we believe each of the six engineered variables carries its own crucial information.

The first model we created was a grid search over a pipeline that uses k-neighbors regression and 10-fold cross validation, with scaled training values. The variables it took into account were the number of goals scored by each team (FTHG, FTAG) as well as their ratios of goals to number of tries (HSGR, ASGR). The results were acceptable and not far off the models we chose to showcase below, but it was here we realized it didn't make sense to predict odds with game statistics.

Our second model was the same grid search and pipeline, but using all six newly engineered features and nothing else (H_W, H_WR, H_avg_diff, A_W, A_WR, A_avg_diff). As expected, we got better results! There are two main takeaways from the visualizations of the results. One is that this model does not seem to be overfitted, since we are predicting values for the 18-19 season, which is not included in the training data, and the model under-predicts outliers. The second is that our engineered features seem to be doing their job; the plots show that the accuracy of the model increases over time because the engineered features are running statistics that are updated after each game. Visually, this is represented by the straight line at the start of all 3 plots (predictions for odd_home, odd_draw, odd_away). The straight line represents all teams' first games played in the season, when the model makes no assumptions about team/game statistics. However, the model's predicted values almost perfectly align with the actual values towards the end of the plots, except for the obvious outliers.
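
The under-prediction of outliers follows from how k-neighbors regression with uniform weights works: each prediction is the plain average of the k nearest training targets, so a single extreme value is always pulled toward its neighbors. The sketch below illustrates this with a hand-rolled, one-feature regressor on made-up numbers; the `knn_predict` helper and the data are hypothetical and are not part of our pipeline.

```python
def knn_predict(train_x, train_y, x, k):
    """Uniform-weight k-NN regression on a single feature (illustrative only)."""
    # Sort the training points by distance to x and keep the k closest
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    # The prediction is the unweighted mean of their targets
    return sum(y for _, y in nearest) / k

# Made-up win rates and home odds; the last game is an outlier with very long odds
train_x = [0.2, 0.3, 0.4, 0.5, 0.6, 0.05]
train_y = [3.5, 3.0, 2.5, 2.0, 1.6, 12.0]

# Predict at the outlier's own feature value
outlier_prediction = knn_predict(train_x, train_y, x=0.05, k=3)
print(outlier_prediction)
```

Even when queried at the outlier's own feature value, the prediction stays well below the outlier's target because it is averaged with two ordinary games. With the larger k values chosen by our grid search, this smoothing effect would be stronger still.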

The third model we created was a simple multivariate linear regression using forward selection on the six engineered features. This model was made to further compare against the performance of our second model, just in case an approach without a grid search might perform just as well. Its mean absolute error was about 20% to 40% higher than that of the second model. Also, the forward selection picked pretty much all six variables! It eliminated just one of them for draw betting odds, but it could be an outlier outcome. All this validates our reasoning to use k-neighbors with cross-validation and to skip feature selection!
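
The 20% to 40% figure can be reproduced from the neg MAE values printed by the k-neighbors and forward-selection cells further down. The snippet below is only a worked calculation on those printed numbers; the dictionary names are ours for illustration.

```python
# Neg MAE values as printed by the k-neighbors grid search cell
knn_neg_mae = {
    'odd_home': -0.821804049724465,
    'odd_draw': -0.645195167850422,
    'odd_away': -1.6805184240879012,
}
# Neg MAE values as printed by the forward-selection cell
lin_neg_mae = {
    'odd_home': -1.0074253538704099,
    'odd_draw': -0.9187168020761359,
    'odd_away': -2.234560739741936,
}

relative_increase = {}
for odd_type in knn_neg_mae:
    knn_mae = -knn_neg_mae[odd_type]   # flip the sign to get a plain MAE
    lin_mae = -lin_neg_mae[odd_type]
    # How much larger the linear model's error is, relative to k-neighbors
    relative_increase[odd_type] = lin_mae / knn_mae - 1
    print(odd_type, round(100 * relative_increase[odd_type], 1), '% higher MAE')
```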

Our second model performed best, so we will use that to predict the betting odds for home, draw, and away in the future games of the 18-19 season.

In [28]:
scaler = MinMaxScaler()
knn_reg = KNeighborsRegressor()

df_future_games = pd.read_csv('./data/prediction.csv')
columns_to_use = ['H_W', 'H_WR', 'H_avg_diff', 'A_W', 'A_WR', 'A_avg_diff']

K-Neighbors Regression with Scaler and 10-Fold Cross Validation (Engineered Features)

In [30]:
# GRID SEARCH (K-NEIGHBORS REGRESSOR, SCALER)

pipe = make_pipeline(scaler, knn_reg)
param_grid = {
    'kneighborsregressor__n_neighbors':range(1, 20), 
    'kneighborsregressor__weights':['uniform', 'distance']
}
grid = GridSearchCV(pipe, param_grid, cv=10, scoring="neg_mean_absolute_error")

for odd_type in ['odd_home', 'odd_draw', 'odd_away']:
    grid.fit(df_past_seasons[columns_to_use], df_past_seasons[odd_type])
    predictions = grid.predict(df_1819_feat_engr[columns_to_use])
    score = grid.score(df_1819_feat_engr[columns_to_use], df_1819_feat_engr[odd_type])
    
    plt.figure(figsize=(16, 4))
    plt.plot(np.arange(len(predictions)), predictions, alpha=0.8, label='predictions')
    plt.plot(np.arange(len(predictions)), df_1819_feat_engr[odd_type].values, alpha=0.8, label='actual')
    plt.title('Predicted '+odd_type+' for 2018-2019 Season (so far)', fontsize=15)
    plt.xlabel('Game of Season', fontsize=15)
    plt.ylabel(odd_type, fontsize=15)
    plt.legend(fontsize=15)
    plt.show()
    print(odd_type, ', neg MAE: ', score)
    print(grid.cv_results_['params'][grid.best_index_])
odd_home , neg MAE:  -0.821804049724465
{'kneighborsregressor__n_neighbors': 18, 'kneighborsregressor__weights': 'uniform'}
odd_draw , neg MAE:  -0.645195167850422
{'kneighborsregressor__n_neighbors': 19, 'kneighborsregressor__weights': 'uniform'}
odd_away , neg MAE:  -1.6805184240879012
{'kneighborsregressor__n_neighbors': 19, 'kneighborsregressor__weights': 'uniform'}

Multivariate Linear Regression with Forward Selection (Engineered Features)

In [31]:
# FORWARD SELECTION

for odd_type in ['odd_home', 'odd_draw', 'odd_away']:
    lin_model = forward_selected(df_past_seasons[np.append(columns_to_use, odd_type)], odd_type)
    predictions = lin_model.predict(df_1819_feat_engr[np.append(columns_to_use, odd_type)])
    score = 0-mean_absolute_error(df_1819_feat_engr[odd_type].values, predictions.values)

    plt.figure(figsize=(16, 4))
    plt.plot(np.arange(len(predictions)), predictions, alpha=0.8, label='predictions')
    plt.plot(np.arange(len(predictions)), df_1819_feat_engr[odd_type].values, alpha=0.8, label='actual')
    plt.title('Predicted '+odd_type+' for 2018-2019 Season (so far)', fontsize=15)
    plt.xlabel('Game of Season', fontsize=15)
    plt.ylabel(odd_type, fontsize=15)
    plt.legend(fontsize=15)
    plt.show()
    print(odd_type, ', neg MAE: ', score)
    print(lin_model.params)
odd_home , neg MAE:  -1.0074253538704099
Intercept     3.033046
A_avg_diff    1.253130
H_avg_diff   -0.648939
H_W          -0.118847
A_W           0.116370
A_WR         -1.234196
H_WR          0.828804
dtype: float64
odd_draw , neg MAE:  -0.9187168020761359
Intercept     4.217888
H_avg_diff    0.687597
H_W           0.069843
A_W          -0.054330
H_WR         -0.954750
A_WR          0.242711
dtype: float64
odd_away , neg MAE:  -2.234560739741936
Intercept     5.238418
H_avg_diff    2.421358
A_avg_diff   -1.248026
H_W           0.325487
A_W          -0.290895
H_WR         -3.063447
A_WR          1.510094
dtype: float64

Predictions of Betting Odds for Future Games In 2018/2019 Season

In [33]:
# Preparation for predicting future games

team_stats = getLastTeamStats(df_1819_feat_engr)
H_W = []
H_WR = []
H_avg_diff = []
A_W = []
A_WR = []
A_avg_diff = []

for i in range(len(df_future_games)):
    game = df_future_games.loc[i]
    H_W.append(team_stats[game['HomeTeam']]['W'])
    H_WR.append(team_stats[game['HomeTeam']]['WR'])
    H_avg_diff.append(team_stats[game['HomeTeam']]['avg_diff'])
    A_W.append(team_stats[game['AwayTeam']]['W'])
    A_WR.append(team_stats[game['AwayTeam']]['WR'])
    A_avg_diff.append(team_stats[game['AwayTeam']]['avg_diff'])

df_future_games['H_W'] = H_W
df_future_games['H_WR'] = H_WR
df_future_games['H_avg_diff'] = H_avg_diff
df_future_games['A_W'] = A_W
df_future_games['A_WR'] = A_WR
df_future_games['A_avg_diff'] = A_avg_diff

higher_wr = []
for i in range(len(df_future_games)):
    if df_future_games.loc[i, 'H_WR'] > df_future_games.loc[i, 'A_WR']:
        higher_wr.append('H')
    elif df_future_games.loc[i, 'H_WR'] < df_future_games.loc[i, 'A_WR']:
        higher_wr.append('A')
    else:
        higher_wr.append('D')
df_future_games['higher_wr'] = higher_wr
In [34]:
# PREDICT ODDS OF FUTURE GAMES USING GRIDSEARCH MODEL

pipe = make_pipeline(scaler, knn_reg)
param_grid = {
    'kneighborsregressor__n_neighbors':range(1, 20), 
    'kneighborsregressor__weights':['uniform', 'distance']
}
grid = GridSearchCV(pipe, param_grid, cv=10, scoring="neg_mean_absolute_error")

for odd_type in ['odd_home', 'odd_draw', 'odd_away']:
    grid.fit(df_past_seasons[columns_to_use], df_past_seasons[odd_type])
    predictions = grid.predict(df_future_games[columns_to_use])
    df_future_games['predicted_'+odd_type] = predictions

lower_odds = []
for i in range(len(df_future_games)):
    if df_future_games.loc[i, 'predicted_odd_home'] > df_future_games.loc[i, 'predicted_odd_away']:
        lower_odds.append('A')
    elif df_future_games.loc[i, 'predicted_odd_home'] < df_future_games.loc[i, 'predicted_odd_away']:
        lower_odds.append('H')
    else:
        lower_odds.append('D')
df_future_games['lower_odds'] = lower_odds

Below is a subset of the dataset representing future games along with their three predicted betting odds. Using our second model, we used the last-calculated statistics for each team to predict the odds below.

To draw further observations, we added higher_wr and lower_odds, which record which team currently has the higher win rate and which team has the lower predicted betting odds. In the real world, the team with the higher win rate usually has the lower betting odds. This is intuitive, as bettors are much more likely to bet on the team more likely to win.
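
One way to see why lower odds go with the likelier outcome is to convert odds into implied probabilities. The sketch below assumes our odds columns are decimal odds (an assumption; the values in our tables are consistent with it), in which case the implied probability is 1 / odds, rescaled so the three outcomes sum to 1. The `implied_probabilities` helper is hypothetical, and the inputs are the predicted odds from the Man City vs Watford row of the table that follows.

```python
def implied_probabilities(odd_home, odd_draw, odd_away):
    """Convert decimal odds to probabilities that sum to 1 (illustrative only)."""
    # Assumption: decimal odds, so the raw implied probability is 1 / odds
    raw = [1 / odd_home, 1 / odd_draw, 1 / odd_away]
    total = sum(raw)
    # Rescale so the three outcomes form a proper probability distribution
    return [p / total for p in raw]

# Predicted odds for Man City (home) vs Watford (away)
p_home, p_draw, p_away = implied_probabilities(1.380926, 5.542368, 10.513684)
print(round(p_home, 3), round(p_draw, 3), round(p_away, 3))
```

The outcome with the lowest odds (a Man City home win) receives the largest implied probability, which matches Man City's higher win rate in the same row.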

In [35]:
# Predicted odds of future games
df_future_games[['HomeTeam', 'AwayTeam', 
                 'H_WR', 'A_WR',
                 'predicted_odd_home', 'predicted_odd_draw', 'predicted_odd_away', 
                 'higher_wr', 'lower_odds']].head()
Out[35]:
HomeTeam AwayTeam H_WR A_WR predicted_odd_home predicted_odd_draw predicted_odd_away higher_wr lower_odds
0 Cardiff West Ham 0.321429 0.464286 2.665556 3.300877 3.101404 A H
1 Crystal Palace Brighton 0.392857 0.370370 2.231759 3.344211 3.743860 H H
2 Huddersfield Bournemouth 0.196429 0.428571 3.110556 3.361754 2.630263 A A
3 Leicester Fulham 0.446429 0.232143 1.815000 3.637368 5.066667 H H
4 Man City Watford 0.821429 0.517857 1.380926 5.542368 10.513684 H H

Finally, to end this section, we looked at how often the team with the higher win rate is also the team with the lower predicted odds. We found this is true about 81% of the time, meaning there are games where the team with the lower win rate is predicted to have the lower betting odds!
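
To make the comparison concrete, the sketch below repeats it on just the five games shown in the table above. The `favoured_side` helper is hypothetical and written in plain Python for illustration; the notebook itself computes the same thing with the higher_wr and lower_odds columns.

```python
# (home, away, H_WR, A_WR, predicted_odd_home, predicted_odd_away) from the table above
games = [
    ('Cardiff', 'West Ham', 0.321429, 0.464286, 2.665556, 3.101404),
    ('Crystal Palace', 'Brighton', 0.392857, 0.370370, 2.231759, 3.743860),
    ('Huddersfield', 'Bournemouth', 0.196429, 0.428571, 3.110556, 2.630263),
    ('Leicester', 'Fulham', 0.446429, 0.232143, 1.815000, 5.066667),
    ('Man City', 'Watford', 0.821429, 0.517857, 1.380926, 10.513684),
]

def favoured_side(home_value, away_value, higher_is_better):
    """Return 'H', 'A' or 'D' depending on which side the value favours."""
    if home_value == away_value:
        return 'D'
    home_favoured = (home_value > away_value) == higher_is_better
    return 'H' if home_favoured else 'A'

mismatches = []
for home, away, h_wr, a_wr, odd_home, odd_away in games:
    by_win_rate = favoured_side(h_wr, a_wr, higher_is_better=True)
    by_odds = favoured_side(odd_home, odd_away, higher_is_better=False)
    if by_win_rate != by_odds:
        mismatches.append((home, away))

agreement = 1 - len(mismatches) / len(games)
print(agreement, mismatches)
```

Among these five games, Cardiff vs West Ham is the only mismatch: West Ham has the higher win rate, but Cardiff, playing at home, is predicted to have the lower odds.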

In [36]:
(df_future_games['higher_wr'] == df_future_games['lower_odds']).mean()
Out[36]:
0.8111111111111111

(3) Using Poisson with different features

This model also uses Poisson regression, but we replicated it from David Sheehan's study. First we modified the dataset so that it shows the number of goals each team scored when playing at home or away. We only chose the features that were available to us. In addition to the features that Sheehan used, we also added the betting odds; for future games, these come from the predictions in the previous section.

  • Team
  • Opponent
  • Goals = number of goals scored in the match
  • Home = 1: HomeTeam & 0: AwayTeam
  • odd_team = betting odds for the team
  • odd_draw = betting odds for a draw
  • odd_opponent = betting odds for the opponent

Dataset for the model

In [37]:
# Build one row per team per match: first from the home side, then from the away side
team_opponent_data_home = data_1819[['HomeTeam', 'AwayTeam', 'FTHG', 'odd_home', 'odd_draw', 'odd_away']].copy()
team_opponent_data_home.columns = ['Team', 'Opponent', 'Goals', 'odd_team', 'odd_draw', 'odd_opponent']
team_opponent_data_home['Home'] = 1
team_opponent_data_away = data_1819[['AwayTeam', 'HomeTeam', 'FTAG', 'odd_away', 'odd_draw', 'odd_home']].copy()
team_opponent_data_away.columns = ['Team', 'Opponent', 'Goals', 'odd_team', 'odd_draw', 'odd_opponent']
team_opponent_data_away['Home'] = 0

# Note: the home rows are concatenated twice (probably unintentionally), so home
# games are double-weighted; the 867 observations in the summary below reflect this
team_opponent_data = pd.concat([team_opponent_data_home, team_opponent_data_home, team_opponent_data_away])
team_opponent_data.head()
Out[37]:
Team Opponent Goals odd_team odd_draw odd_opponent Home
0 Man United Leicester 2 1.561667 3.905000 7.083333 1
1 Bournemouth Cardiff 2 1.895000 3.538333 4.388333 1
2 Fulham Crystal Palace 0 2.466667 3.360000 2.950000 1
3 Huddersfield Chelsea 0 6.276667 3.970000 1.590000 1
4 Newcastle Tottenham 1 3.821667 3.420000 2.053333 1
In [38]:
#Perform poisson model
poisson_model = smf.glm(formula="Goals ~ Home + Team + Opponent + odd_team + odd_draw + odd_opponent", data=team_opponent_data, 
                        family=sm.families.Poisson()).fit()
poisson_model.summary()
Out[38]:
Generalized Linear Model Regression Results
Dep. Variable: Goals No. Observations: 867
Model: GLM Df Residuals: 824
Model Family: Poisson Df Model: 42
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -1203.1
Date: Wed, 13 Mar 2019 Deviance: 811.73
Time: 02:11:20 Pearson chi2: 692.
No. Iterations: 5 Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 1.5703 0.416 3.779 0.000 0.756 2.385
Team[T.Bournemouth] -0.5065 0.176 -2.877 0.004 -0.852 -0.161
Team[T.Brighton] -1.0300 0.214 -4.806 0.000 -1.450 -0.610
Team[T.Burnley] -1.0552 0.223 -4.724 0.000 -1.493 -0.617
Team[T.Cardiff] -1.2511 0.229 -5.452 0.000 -1.701 -0.801
Team[T.Chelsea] -0.2131 0.161 -1.320 0.187 -0.530 0.103
Team[T.Crystal Palace] -0.9951 0.201 -4.957 0.000 -1.389 -0.602
Team[T.Everton] -0.6696 0.178 -3.767 0.000 -1.018 -0.321
Team[T.Fulham] -0.9722 0.204 -4.769 0.000 -1.372 -0.573
Team[T.Huddersfield] -1.9282 0.281 -6.869 0.000 -2.478 -1.378
Team[T.Leicester] -0.8200 0.191 -4.284 0.000 -1.195 -0.445
Team[T.Liverpool] 0.1286 0.166 0.777 0.437 -0.196 0.453
Team[T.Man City] 0.4775 0.194 2.464 0.014 0.098 0.857
Team[T.Man United] -0.2308 0.157 -1.472 0.141 -0.538 0.076
Team[T.Newcastle] -1.1982 0.217 -5.529 0.000 -1.623 -0.773
Team[T.Southampton] -0.9464 0.200 -4.725 0.000 -1.339 -0.554
Team[T.Tottenham] -0.2406 0.157 -1.535 0.125 -0.548 0.067
Team[T.Watford] -0.6563 0.184 -3.573 0.000 -1.016 -0.296
Team[T.West Ham] -0.6505 0.180 -3.608 0.000 -1.004 -0.297
Team[T.Wolves] -0.7613 0.189 -4.030 0.000 -1.132 -0.391
Opponent[T.Bournemouth] 0.3064 0.181 1.694 0.090 -0.048 0.661
Opponent[T.Brighton] -0.0174 0.219 -0.079 0.937 -0.447 0.412
Opponent[T.Burnley] 0.1679 0.221 0.759 0.448 -0.266 0.601
Opponent[T.Cardiff] 0.2289 0.219 1.047 0.295 -0.200 0.657
Opponent[T.Chelsea] -0.3939 0.196 -2.008 0.045 -0.778 -0.009
Opponent[T.Crystal Palace] -0.1222 0.199 -0.613 0.540 -0.513 0.269
Opponent[T.Everton] -0.1937 0.190 -1.019 0.308 -0.566 0.179
Opponent[T.Fulham] 0.5029 0.189 2.666 0.008 0.133 0.873
Opponent[T.Huddersfield] 0.0878 0.229 0.384 0.701 -0.360 0.536
Opponent[T.Leicester] -0.1570 0.198 -0.793 0.428 -0.545 0.231
Opponent[T.Liverpool] -1.3267 0.258 -5.137 0.000 -1.833 -0.820
Opponent[T.Man City] -1.0643 0.280 -3.798 0.000 -1.614 -0.515
Opponent[T.Man United] -0.2619 0.186 -1.406 0.160 -0.627 0.103
Opponent[T.Newcastle] -0.2989 0.220 -1.357 0.175 -0.731 0.133
Opponent[T.Southampton] 0.0245 0.200 0.122 0.903 -0.367 0.416
Opponent[T.Tottenham] -0.4211 0.197 -2.138 0.033 -0.807 -0.035
Opponent[T.Watford] -0.0696 0.195 -0.356 0.722 -0.453 0.313
Opponent[T.West Ham] -0.0401 0.193 -0.207 0.836 -0.419 0.339
Opponent[T.Wolves] -0.3295 0.206 -1.600 0.110 -0.733 0.074
Home 0.3330 0.088 3.773 0.000 0.160 0.506
odd_team 0.1958 0.062 3.178 0.001 0.075 0.317
odd_draw -0.4822 0.173 -2.783 0.005 -0.822 -0.143
odd_opponent 0.1429 0.058 2.454 0.014 0.029 0.257

The GLM table shows that home status and betting odds are both associated with the number of goals a team scores. Because the model uses a log link, the Home coefficient acts multiplicatively: we are 95% confident that a team playing at home scores between exp(0.160) ≈ 1.17 and exp(0.506) ≈ 1.66 times as many goals as when it plays away, holding the other variables fixed. Also, surprisingly, with p-values below 0.05, the betting odds are statistically significant predictors of the number of goals a team scores.
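
Since the Poisson GLM uses a log link, its coefficients are on the log scale and need to be exponentiated before they can be read in terms of goals. The short calculation below does this for the Home coefficient and its 95% interval, using the rounded values printed in the summary table; the variable names are ours for illustration.

```python
import math

# Home coefficient and its 95% confidence interval, copied from the GLM summary
home_coef = 0.3330
ci_low, ci_high = 0.160, 0.506

# Exponentiating a log-link coefficient gives a multiplicative rate ratio
rate_ratio = math.exp(home_coef)
ratio_low, ratio_high = math.exp(ci_low), math.exp(ci_high)

print(round(rate_ratio, 2), round(ratio_low, 2), round(ratio_high, 2))
```

A rate ratio of roughly 1.4 means the expected number of goals at home is about 40% higher than away, with the other predictors held fixed.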

EPL standing table (March 2nd, 2019)

The table below shows the ranking of the teams through round 29. At this point, Man City led the EPL with 71 points. Using the Poisson regression we fitted, we will calculate the number of goals each team is expected to score in each remaining game and predict the result of each match.

In [39]:
ranking=pd.read_csv('./data/CurrentRanking.csv')
ranking
Out[39]:
Team Point
0 Man City 71
1 Liverpool 70
2 Tottenham 61
3 Man United 58
4 Arsenal 57
5 Chelsea 56
6 Wolves 43
7 Watford 43
8 West Ham 39
9 Everton 37
10 Leicester 35
11 Bournemouth 34
12 Newcastle 31
13 Crystal Palace 33
14 Brighton 30
15 Burnley 30
16 Southampton 27
17 Cardiff 25
18 Fulham 17
19 Huddersfield 14
In [40]:
future_matches = pd.read_csv('./data/betting_odds_prediction.csv')

This is our input dataset. We listed the home team and its opponent (the away team) for all remaining games. For each game, we predicted the number of goals the home team and the away team might score and gave 3 points to the team predicted to score more. A limitation of our model is that the predicted goal counts are continuous expected values, so it was nearly impossible for the home and away teams to have the same predicted score; as a result, there were no draws in our results. After predicting the result of each match, we predicted that Manchester City will win the season with 98 points.
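
To illustrate how one game's prediction is formed, the sketch below rebuilds the expected goals for Man City vs Watford by hand from the rounded coefficients in the GLM summary and the predicted odds in the input dataset. It then goes one step beyond what our notebook does: assuming the two goal counts are independent Poisson variables (an assumption added here for illustration), it turns the two expected values into win, draw and loss probabilities, which is one possible way around the no-draw limitation. All helper names are hypothetical.

```python
import math

# Rounded coefficients copied from the GLM summary; only the two teams
# needed for this example are included
INTERCEPT, HOME = 1.5703, 0.3330
ODD_TEAM, ODD_DRAW, ODD_OPPONENT = 0.1958, -0.4822, 0.1429
TEAM = {'Man City': 0.4775, 'Watford': -0.6563}
OPPONENT = {'Man City': -1.0643, 'Watford': -0.0696}

def expected_goals(team, opponent, home, odd_team, odd_draw, odd_opponent):
    """Exponentiate the linear predictor (log link) to get expected goals."""
    linear = (INTERCEPT + TEAM[team] + OPPONENT[opponent] + HOME * home
              + ODD_TEAM * odd_team + ODD_DRAW * odd_draw
              + ODD_OPPONENT * odd_opponent)
    return math.exp(linear)

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probabilities(lam_home, lam_away, max_goals=20):
    """Sum the score grid, assuming independent Poisson goal counts."""
    p_home = p_draw = p_away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                p_home += p
            elif h == a:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

# Predicted odds for Man City (home) vs Watford (away) from the input dataset
odd_home, odd_draw, odd_away = 1.441855, 5.221275, 9.741579
lam_home = expected_goals('Man City', 'Watford', 1, odd_home, odd_draw, odd_away)
lam_away = expected_goals('Watford', 'Man City', 0, odd_away, odd_draw, odd_home)
p_home, p_draw, p_away = outcome_probabilities(lam_home, lam_away)
print(lam_home, lam_away, p_home, p_draw, p_away)
```

Because the coefficients are rounded, the expected goals will differ slightly from what `poisson_model.predict` returns. Comparing the two expected values directly, as our notebook does, always picks a winner, whereas the probability grid assigns some weight to a draw.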

Input dataset

In [41]:
future_matches.head()
Out[41]:
HomeTeam AwayTeam predicted_odd_home predicted_odd_draw predicted_odd_away
0 Cardiff West Ham 2.759022 3.306794 2.971140
1 Crystal Palace Brighton 2.331841 3.308602 3.557281
2 Huddersfield Bournemouth 3.177163 3.411348 2.602807
3 Leicester Fulham 1.789015 3.662285 5.169737
4 Man City Watford 1.441855 5.221275 9.741579
In [42]:
# Iterate over the future matches and award points from the predicted goals
result = []
for index, row in future_matches.iterrows():
    home_score = poisson_model.predict(pd.DataFrame(data={'Team': row['HomeTeam'], 'Opponent': row['AwayTeam'],
                                       'Home':1, 'odd_team':row['predicted_odd_home'], 'odd_draw':row['predicted_odd_draw'],
                                        'odd_opponent':row['predicted_odd_away']},index=[1]))
    away_score = poisson_model.predict(pd.DataFrame(data={'Team': row['AwayTeam'], 'Opponent': row['HomeTeam'],
                                       'Home':0, 'odd_team':row['predicted_odd_away'], 'odd_draw':row['predicted_odd_draw'],
                                        'odd_opponent':row['predicted_odd_home']},index=[1]))
    # Boolean masks selecting each team's row in the ranking table
    is_home = ranking['Team'] == row.HomeTeam
    is_away = ranking['Team'] == row.AwayTeam
    if home_score[1] > away_score[1]:
        # Predicted home win: 3 points to the home team
        ranking.loc[is_home, 'Point'] += 3
        result.append('H')
    elif home_score[1] < away_score[1]:
        # Predicted away win: 3 points to the away team
        ranking.loc[is_away, 'Point'] += 3
        result.append('A')
    else:
        # Predicted draw: 1 point each
        ranking.loc[is_home, 'Point'] += 1
        ranking.loc[is_away, 'Point'] += 1
        result.append('D')
future_matches['predicted_result'] = result

Predicted Standing table of 18/19 season

In [43]:
# Prediction
ranking.sort_values('Point', ascending=False)
Out[43]:
Team Point
0 Man City 98
1 Liverpool 97
4 Arsenal 84
2 Tottenham 82
3 Man United 79
5 Chelsea 77
7 Watford 58
6 Wolves 55
8 West Ham 54
10 Leicester 50
9 Everton 49
11 Bournemouth 46
13 Crystal Palace 45
14 Brighton 42
12 Newcastle 40
15 Burnley 33
16 Southampton 33
17 Cardiff 25
18 Fulham 20
19 Huddersfield 14