# Setting up
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
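# 'seaborn-whitegrid' was renamed 'seaborn-v0_8-whitegrid' in matplotlib 3.6; use whichever name exists
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid')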
import itertools
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# ignore warning 
import warnings
warnings.filterwarnings("ignore")
# hide the code cell
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

Predicting a season based on Machine Learning Model

Winter 2019, INFO 370

Members

William Kwok, Wonjo Barng, Kangwoo Choi, Vincent Widjaya

Introduction

Sports are an integral part of nearly all cultures. One sport with a large worldwide following is soccer, which we will call football throughout this paper. For years, people have used statistics to work out what makes teams win, or to find out whether their favorite teams are the best. Razali suggests that "the research for predicting the results of football matches outcome started as early as 1977 by Stafani R". The English Premier League (EPL) is one of the largest and most popular leagues in the world. In 2017, five EPL clubs were among the top 10 teams by revenue.

In order to gain an understanding of what makes teams win at football, we will be exploring a few research questions:

RQ1: Is there an association between the betting odds and the end results of a game? We may be able to find cases of illegal match fixing such as the scandals that occurred in late 2013 and produce a model that predicts if match fixing is happening.
RQ2: Is there an association between the betting odds and statistics of a game? Studying this may also result in finding cases of illegal match fixing. For example, we can see if certain teams are under-performing in some games where bets are higher.
RQ3: Is there an association between the game stats and the end results of a game? With this, we will be able to predict what factors are associated with winning the game. Coaches can use this information to their advantage and train their players against those conditions.

In answering these questions, we will be working with data from all English Premier League seasons from the 2014/2015 season to the current 2018/2019 season, provided by Football Data. These datasets contain data about the game itself as well as many betting odds associated with each game.

Data Preparation

To answer these research questions, we selected features that we expected to affect the result of a game. There are no null/NaN values in our dataset. Based on our domain knowledge, we chose the following features:

HomeTeam - Name of the home team
AwayTeam - Name of the away team
FTR - Full time result (Home win, Away win, Draw)
FTHG - Full time home team goals
FTAG - Full time away team goals
HS - Home team shots
AS - Away team shots
HST - Home team shots on target
AST - Away team shots on target
HSGR - Home team goals per shot, FTHG / HS (calculated)
ASGR - Away team goals per shot, FTAG / AS (calculated)

We also include the average betting odds from six companies: Bet365, Bet&win, Interwetten, Pinnacle, VC Bet, and William Hill. We selected these companies because they have valid data in every season. The outcome with the lowest odds is the one considered most likely: the team with the lower odds is expected to win, and if the odds for a draw are the lowest, the game is expected to end in a draw. For example, if the odds for the home team are 1.13 and the home team wins, each bettor receives 1.13 times the amount they bet (the stake plus a profit of 0.13 times the stake). These are the resulting variables:

odd_home - Average betting odds for Home Teams
odd_draw - Average betting odds of betting on draw
odd_away - Average betting odds for Away Teams
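
The odds can also be read as probabilities. Assuming the values are decimal odds, as the 1.13 example suggests, the reciprocal 1/odds is the implied chance of an outcome, and the three reciprocals add up to slightly more than 1 because of the bookmaker's margin. The sketch below is an illustration only and not part of the original analysis: `implied_probabilities` is a hypothetical helper and the odds are made up.

```python
def implied_probabilities(odd_home, odd_draw, odd_away):
    """Convert decimal odds into probabilities that sum to 1."""
    raw = [1.0 / odd_home, 1.0 / odd_draw, 1.0 / odd_away]
    total = sum(raw)  # exceeds 1 by the bookmaker's margin
    return [p / total for p in raw]

# Made-up odds for a single match (not taken from the dataset)
home, draw, away = implied_probabilities(1.50, 4.00, 7.00)
print(round(home, 3), round(draw, 3), round(away, 3))
```

The outcome with the lowest odds always gets the highest implied probability, which is the reading used throughout this notebook.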

# load the previous four seasons and the current season
df_1415 = pd.read_csv('./data/1415.csv')
df_1516 = pd.read_csv('./data/1516.csv')
df_1617 = pd.read_csv('./data/1617.csv')
df_1718 = pd.read_csv('./data/1718.csv')
df_1819 = pd.read_csv('./data/1819.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the seasons the same way
df_total = pd.concat([df_1415, df_1516, df_1617, df_1718, df_1819])

Combined Dataset

df_total.head()

Dataset Format after data cleaning

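# Clean data function
def clean_data(df):
    # copy so the new columns are added to an independent frame, not a slice of df
    data = df[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST']].copy()
    # goals per shot; dividing by zero shots can give inf, which is replaced by 0
    data['HSGR'] = data['FTHG']/data['HS']
    data['ASGR'] = data['FTAG']/data['AS']
    data = data.replace([np.inf, -np.inf], 0)
    # average the home/draw/away odds across the six bookmakers
    bet_home = df[['B365H','BWH','IWH','PSH','VCH','WHH']].mean(axis=1)
    bet_draw = df[['B365D','BWD','IWD','PSD','VCD','WHD']].mean(axis=1)
    bet_away = df[['B365A','BWA','IWA','PSA','VCA','WHA']].mean(axis=1)

    data['odd_home'] = bet_home
    data['odd_draw'] = bet_draw
    data['odd_away'] = bet_away

    # note: dropna() returns a new frame and the result is not assigned,
    # so no rows are actually removed here
    data.dropna()

    return data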

# clean data 
df_total = clean_data(df_total)
df_total.head()

Dataset with Engineered Features for RQ2

For RQ2, we realized it was better to create running metrics that look at each team's previous games, up to the game being observed, rather than using the statistics of individual games. By the time a game's statistics are final, the bettors have already locked in their odds, so we would be predicting the odds from values that could not have influenced them.

To address this concern, we turned to feature engineering, creating variables that summarize the past games of the season in order to predict the odds of the current game. This approach allows us to make more relevant predictions. Below is our prepared dataset for the models that answer RQ2. All variables below are updated after every game played by the home and away team (i.e., they represent each team's statistics over all games before the current one).

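H_W - Home team's wins so far (a draw counts as 0.5)
H_WR - Home team's win rate so far
H_avg_diff - Home team's average goal difference over its opponents so far
A_W - Away team's wins so far (a draw counts as 0.5)
A_WR - Away team's win rate so far
A_avg_diff - Away team's average goal difference over its opponents so far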

We show rows from the middle of the dataframe rather than the top because a team has no previous games when it first plays, so its running statistics are all zeros. The rows shown here are a better representation of the majority of the data.
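
To make the "only games before the current one" rule concrete, here is a minimal sketch on three made-up matches. It is not the notebook's implementation: the team names and scores are invented, and it keeps a running history per team instead of filtering the dataframe. What it shares with the features described above is that each team's record is read before the current result is added, and that a draw counts as 0.5 of a win.

```python
from collections import defaultdict

# Made-up matches in chronological order: (home, away, home_goals, away_goals)
matches = [
    ("A", "B", 2, 0),
    ("B", "C", 1, 1),
    ("C", "A", 0, 3),
]

history = defaultdict(list)  # team -> goal differences of its earlier games
rows = []
for home, away, hg, ag in matches:
    # Read each team's record BEFORE adding the current game
    feats = {}
    for side, team in (("H", home), ("A", away)):
        past = history[team]
        wins = sum(1.0 if d > 0 else 0.5 if d == 0 else 0.0 for d in past)
        feats[side + "_W"] = wins
        feats[side + "_WR"] = wins / len(past) if past else 0
        feats[side + "_avg_diff"] = sum(past) / len(past) if past else 0
    rows.append(feats)
    # Only now record the current result, seen from each team's side
    history[home].append(hg - ag)
    history[away].append(ag - hg)

for row in rows:
    print(row)
```

The first row is all zeros because neither team has played yet, which is why rows from the top of the dataframe are not representative.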

def engineer_features(df):
    df = df.copy()
    df['goals_h_a'] = df['FTHG'] - df['FTAG']
    df['total_h_a'] = df['HS'] - df['AS']

    H_GT = [] # home games total so far
    H_W = [] # home wins so far
    H_WR = [] # home win rate so far
    H_avg_diff = [] # home avg goals diff

    A_GT = [] # away games total so far
    A_W = [] # away wins so far
    A_WR = [] #away win rate so far
    A_avg_diff = [] # away avg goals diff

    for i in range(len(df)):
        home = df.loc[i, 'HomeTeam']
        away = df.loc[i, 'AwayTeam']
        
        # goal differences of each team's earlier games (rows before i), seen from its own side
        home_h_games = df[df['HomeTeam'] == home].loc[:i-1]['goals_h_a']
        home_a_games = df[df['AwayTeam'] == home].loc[:i-1]['goals_h_a'] * -1
        # Series.append was removed in pandas 2.0; pd.concat is the equivalent
        home_games = pd.concat([home_h_games, home_a_games])
        
        away_h_games = df[df['HomeTeam'] == away].loc[:i-1]['goals_h_a']
        away_a_games = df[df['AwayTeam'] == away].loc[:i-1]['goals_h_a'] * -1
        away_games = pd.concat([away_h_games, away_a_games])
        
        H_GT.append(len(home_games))
        A_GT.append(len(away_games))
        
        H_W.append((home_games > 0).sum() + (home_games == 0).sum() * 0.5)
        A_W.append((away_games > 0).sum() + (away_games == 0).sum() * 0.5)
        
        if H_GT[i] > 0:
            H_WR.append(H_W[i] / H_GT[i])
            H_avg_diff.append(home_games.mean())
        else:
            H_WR.append(0)
            H_avg_diff.append(0)
            
        if A_GT[i] > 0:
            A_WR.append(A_W[i] / A_GT[i])
            A_avg_diff.append(away_games.mean())
        else:
            A_WR.append(0)
            A_avg_diff.append(0)
    
    df['H_GT'] = H_GT
    df['H_W'] = H_W
    df['H_WR'] = H_WR
    df['H_avg_diff'] = H_avg_diff
    df['A_GT'] = A_GT
    df['A_W'] = A_W
    df['A_WR'] = A_WR
    df['A_avg_diff'] = A_avg_diff
    return df

# clean data for each individual dataset
data_1415 = clean_data(pd.read_csv('./data/1415.csv'))
data_1415.drop(data_1415.tail(1).index,inplace=True)
data_1516 = clean_data(pd.read_csv('./data/1516.csv'))
data_1617 = clean_data(pd.read_csv('./data/1617.csv'))
data_1718 = clean_data(pd.read_csv('./data/1718.csv'))
data_1819 = clean_data(pd.read_csv('./data/1819.csv'))

df_1415_feat_engr = engineer_features(data_1415)
df_1516_feat_engr = engineer_features(data_1516)
df_1617_feat_engr = engineer_features(data_1617)
df_1718_feat_engr = engineer_features(data_1718)
df_1819_feat_engr = engineer_features(data_1819)

# stack the four completed seasons (pd.concat replaces the removed DataFrame.append)
df_past_seasons = pd.concat([df_1415_feat_engr, df_1516_feat_engr,
                             df_1617_feat_engr, df_1718_feat_engr])
df_past_seasons = df_past_seasons.reset_index(drop=True)
df_past_seasons = df_past_seasons[['HomeTeam', 'AwayTeam', 
                                   'H_W', 'H_WR', 'H_avg_diff', 
                                   'A_W', 'A_WR', 'A_avg_diff', 
                                   'odd_home', 'odd_draw', 'odd_away']]

df_past_seasons[150:155]

# fraction of games in which the outcome with the strictly lowest average odds was the actual result
def average_betting(df):
    betting_accuracies = []
    for index, row in df.iterrows():
        if(row['FTR'] == 'H' and row['odd_home'] < row['odd_away'] and row['odd_home'] < row['odd_draw']):
            betting_accuracies.append(1)
        elif(row['FTR'] == 'D' and row['odd_draw'] < row['odd_away'] and row['odd_draw'] < row['odd_home']):
            betting_accuracies.append(1)
        elif(row['FTR'] == 'A' and row['odd_away'] < row['odd_home'] and row['odd_away'] < row['odd_draw']):
            betting_accuracies.append(1)
        else:
            betting_accuracies.append(0)
        
    return np.mean(betting_accuracies)

# set time trend on plot
time = [average_betting(data_1415),
        average_betting(data_1516), 
        average_betting(data_1617), 
        average_betting(data_1718), 
        average_betting(data_1819)]

Finding patterns from Dataset

Accuracy of betting odds

We expected the accuracy of the betting odds to increase over time, but there is no clear trend across the five seasons.

# betting accuracy by season
plt.plot(["14/15", "15/16", "16/17", "17/18", "18/19"], time, label='betting accuracy')
plt.legend()
plt.title('betting accuracy')
plt.xlabel('time')
plt.ylabel('accuracy')
plt.show()

Actual Result vs Betting odds

We compare the average betting odds with the actual results. When the home team wins, the lowest odds match the actual result in around 84% of games. When the game ends in a draw, the lowest odds match the result in around 0% of games, because bettors expect a draw less often than a win for either side. When the away team wins, the lowest odds match the actual result in around 57% of games.
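
The same comparison can be written without an explicit loop. The sketch below is an alternative formulation on made-up rows, not the code that produced the percentages above. It assumes the notebook's column names (`FTR`, `odd_home`, `odd_draw`, `odd_away`) and uses `idxmin` to pick the outcome with the lowest odds in each row.

```python
import pandas as pd

# Made-up rows using the notebook's column names
toy = pd.DataFrame({
    "FTR":      ["H", "H", "D", "A", "A"],
    "odd_home": [1.4, 2.9, 2.1, 3.5, 1.8],
    "odd_draw": [4.5, 3.2, 3.3, 3.4, 3.6],
    "odd_away": [8.0, 2.5, 3.6, 2.1, 4.4],
})

# Outcome with the lowest average odds in each row (the bookmakers' favorite)
odds_cols = ["odd_home", "odd_draw", "odd_away"]
favorite = toy[odds_cols].idxmin(axis=1).map(
    {"odd_home": "H", "odd_draw": "D", "odd_away": "A"}
)

# Share of games, per actual result, in which the favorite was correct
match_rate = (favorite == toy["FTR"]).groupby(toy["FTR"]).mean()
print(match_rate)
```

One difference to keep in mind: `idxmin` resolves ties in favor of the first column, while the loop-based `accuracy` function in the next cell counts any row without a strict home or draw minimum as away.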

# share of games in which home, draw, or away had the lowest betting odds
def accuracy(df):
    home = 0
    draw = 0
    away = 0
    for index, row in df.iterrows():
        if (row['odd_home'] < row['odd_away'] and row['odd_home'] < row['odd_draw']):
            home = home+1
        elif (row['odd_draw'] < row['odd_away'] and row['odd_draw'] < row['odd_home']):
            draw = draw+1
        else:
            away = away+1
            
    return [home / len(df), draw / len(df), away / len(df)]

# pie charts showing which outcome had the lowest odds, split by the actual result
plt.pie(accuracy(df_total.loc[df_total['FTR'] == 'H']), labels=['Home', 'Draw', 'Away'])
plt.title('Betting odds when Home won the game')
plt.show()
plt.pie(accuracy(df_total.loc[df_total['FTR'] == 'D']), labels=['Home', 'Draw', 'Away'])
plt.title('Betting odds when draw')
plt.show()
plt.pie(accuracy(df_total.loc[df_total['FTR'] == 'A']), labels=['Home', 'Draw', 'Away'])
plt.title('Betting odds when Away won the game')
plt.show()

Home vs Away

The charts below show the number of home wins, draws, and away wins in each season. The home team tends to have the advantage.

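# get distribution of the result as [home wins, draws, away wins]
def getDistResult(data):
    arr = [0,0,0]
    for index, row in data.iterrows():
        if row.FTR == 'H':
            arr[0] += 1
        elif row.FTR == 'D':
            arr[1] += 1
        elif row.FTR == 'A':
            # match 'A' explicitly so rows with a missing FTR are not counted as away wins
            arr[2] += 1
    return arr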

# by season home or away
result = ['Home Wins', 'Draw', 'Away Wins']
plt.barh(result, getDistResult(df_1415))
plt.xlabel('Frequency')
plt.title('14/15 Season Result stats')
plt.show()

plt.barh(result, getDistResult(df_1516))
plt.xlabel('Frequency')
plt.title('15/16 Season Result stats')
plt.show()

plt.barh(result, getDistResult(df_1617))
plt.xlabel('Frequency')
plt.title('16/17 Season Result stats')
plt.show()

plt.barh(result, getDistResult(df_1718))
plt.xlabel('Frequency')
plt.title('17/18 Season Result stats')
plt.show()

plt.barh(result, getDistResult(df_1819))
plt.xlabel('Frequency')
plt.title('18/19 Season Result stats')
plt.show()

Modeling

# Clean data function from Kangwoo's notebook
def clean_data_poisson(df):
    data = df[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST']].copy()
    data['HSGR'] = data['FTHG']/data['HS']
    data['ASGR'] = data['FTAG']/data['AS']
    data = data.replace([np.inf, -np.inf], 0)
#     bet_home = df[['B365H','BWH','IWH','PSH','VCH','WHH']].mean(axis=1)
#     bet_draw = df[['B365D','BWD','IWD','PSD','VCD','WHD']].mean(axis=1)
#     bet_away = df[['B365A','BWA','IWA','PSA','VCA','WHA']].mean(axis=1)
    
    
#     data['odd_home'] = bet_home
#     data['odd_draw'] = bet_draw
#     data['odd_away'] = bet_away
    
    data.dropna()
    
    ############### New stuff
    team_scores = {}
    team_shots = {}
    HomeAvgAllTimeSoFar = []
    HomeHighAllTimeSoFar = []
    HomeLowAllTimeSoFar = []
    HomeTotalGoals = []
    HomeTotalShots = []
    HomeTotalAccuracy = []
    AwayAvgAllTimeSoFar = []
    AwayHighAllTimeSoFar = []
    AwayLowAllTimeSoFar = []
    AwayTotalGoals = []
    AwayTotalShots = []
    AwayTotalAccuracy = []
    for index, row in data.iterrows():
        # Add values to all the rows before adding to the team scores
        home_team = row["HomeTeam"]
        away_team = row["AwayTeam"]
        if home_team not in team_scores:
            team_scores[home_team] = []
        if away_team not in team_scores:
            team_scores[away_team] = []
        if home_team not in team_shots:
            team_shots[home_team] = []
        if away_team not in team_shots:
            team_shots[away_team] = []
        home_team_scores = team_scores[home_team]
        away_team_scores = team_scores[away_team]
        home_team_shots = team_shots[home_team]
        away_team_shots = team_shots[away_team]
        if len(home_team_scores) < 1:
            HomeAvgAllTimeSoFar.append(0)
            HomeHighAllTimeSoFar.append(0)
            HomeLowAllTimeSoFar.append(0)
        else: 
            HomeAvgAllTimeSoFar.append(np.mean(home_team_scores))
            HomeHighAllTimeSoFar.append(np.max(home_team_scores).astype("float"))
            HomeLowAllTimeSoFar.append(np.min(home_team_scores).astype("float"))
        if len(away_team_scores) < 1:
            AwayAvgAllTimeSoFar.append(0)
            AwayHighAllTimeSoFar.append(0)
            AwayLowAllTimeSoFar.append(0)
        else: 
            AwayAvgAllTimeSoFar.append(np.mean(away_team_scores))
            AwayHighAllTimeSoFar.append(np.max(away_team_scores).astype("float"))
            AwayLowAllTimeSoFar.append(np.min(away_team_scores).astype("float"))
        s_Home_Scores = np.sum(home_team_scores)
        s_Home_Shots = np.sum(home_team_shots)
        s_Away_Scores = np.sum(away_team_scores)
        s_Away_Shots = np.sum(away_team_shots)
        HomeTotalGoals.append(s_Home_Scores)
        HomeTotalShots.append(s_Home_Shots)
        HomeTotalAccuracy.append(np.nan_to_num(s_Home_Scores/s_Home_Shots))
        AwayTotalGoals.append(s_Away_Scores)
        AwayTotalShots.append(s_Away_Shots)
        AwayTotalAccuracy.append(np.nan_to_num(s_Away_Scores/s_Away_Shots))
        # Add to team scores
        team_scores[home_team].append(row["FTHG"])
        team_scores[away_team].append(row["FTAG"])
        team_shots[home_team].append(row["HS"])
        team_shots[away_team].append(row["AS"])
        
    data["HomeAvgAllTimeSoFar"] = HomeAvgAllTimeSoFar
    data["HomeHighAllTimeSoFar"] = HomeHighAllTimeSoFar
    data["HomeLowAllTimeSoFar"] = HomeLowAllTimeSoFar
    data["AwayAvgAllTimeSoFar"] = AwayAvgAllTimeSoFar
    data["AwayHighAllTimeSoFar"] = AwayHighAllTimeSoFar
    data["AwayLowAllTimeSoFar"] = AwayLowAllTimeSoFar 
    data["HomeTotalGoals"] = HomeTotalGoals
    data["HomeTotalShots"] = HomeTotalShots
    data["HomeTotalAccuracy"] = HomeTotalAccuracy
    data["AwayTotalGoals"] = AwayTotalGoals
    data["AwayTotalShots"] = AwayTotalShots
    data["AwayTotalAccuracy"] = AwayTotalAccuracy
    #####################
    data.dropna()
    
    return data

# clean data for each individual dataset
data_1415_poi = clean_data_poisson(pd.read_csv('./data/1415.csv'))
data_1415_poi.drop(data_1415_poi.tail(1).index,inplace=True)
data_1516_poi = clean_data_poisson(pd.read_csv('./data/1516.csv'))
data_1617_poi = clean_data_poisson(pd.read_csv('./data/1617.csv'))
data_1718_poi = clean_data_poisson(pd.read_csv('./data/1718.csv'))
data_1819_poi = clean_data_poisson(pd.read_csv('./data/1819.csv'))

(1) Using Poisson

We attempted to fit a Poisson model to the data to predict whether a game would end in a home win, an away win, or a draw. We encoded the FTR values as integers using sklearn's LabelEncoder and split the data into test and training data. We then fitted a generalized linear model with a Poisson family and rounded its predictions to the nearest integer. We used the following formulas:

FTR ~ HomeTeam + AwayTeam
FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + AwayTotalAccuracy
FTR ~ HomeAvgAllTimeSoFar + HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + AwayTotalAccuracy

Here is a list of the variables we used in these models:

HomeTeam - Name of the home team
AwayTeam - Name of the away team
HomeAvgAllTimeSoFar - Average of all scores from each game the home team has played by that point in time.
HomeHighAllTimeSoFar - Highest of all scores from each game the home team has played by that point in time.
HomeLowAllTimeSoFar - Lowest of all scores from each game the home team has played by that point in time.
AwayAvgAllTimeSoFar - Average of all scores from each game the away team has played by that point in time.
AwayHighAllTimeSoFar - Highest of all scores from each game the away team has played by that point in time.
AwayLowAllTimeSoFar - Lowest of all scores from each game the away team has played by that point in time.
HomeTotalGoals - Total home goals to date
HomeTotalShots - Total home shots to date
HomeTotalAccuracy - Home accuracy to date
AwayTotalGoals - Total away goals to date
AwayTotalShots - Total away shots to date
AwayTotalAccuracy - Away accuracy to date
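
Because the outcome is label-encoded before fitting, it may help to see what the integer codes look like and how a rounded prediction is compared with them. The sketch below is an illustration under assumptions: it uses `np.unique(..., return_inverse=True)` as a stand-in for `LabelEncoder` (both assign codes in sorted label order, so A, D, H become 0, 1, 2), and the predicted means are made up rather than taken from a fitted model.

```python
import numpy as np

results = np.array(["H", "A", "D", "H", "A"])

# Sorted unique labels get codes 0, 1, 2 (the same ordering LabelEncoder uses)
classes, codes = np.unique(results, return_inverse=True)
print(classes, codes)

# Made-up predicted means for five validation games
predicted_mean = np.array([1.6, 0.4, 1.2, 2.7, 0.9])
predicted_code = np.round(predicted_mean).astype(int)

# A rounded mean above 2 matches no class, so it always counts as a miss
accuracy = np.mean(predicted_code == codes)
print(predicted_code, accuracy)
```

A Poisson mean has no upper bound, so a rounded prediction can fall outside the three valid codes; in the accuracy score such a prediction is simply wrong.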

For each year

First we ran this model on each season's dataset separately. We started with just the home team and away team names as factors in the model, similar (but not identical) to David Sheehan's study. We received these accuracy scores:

14-15: 0.368
15-16: 0.329
16-17: 0.395
17-18: 0.382
18-19: 0.431

This very simple model guesses correctly a little more than a third of the time in most seasons. That is not as good as we would like: it is only slightly better than guessing at random among the three outcomes (about 0.333), and in 2015-2016 it was worse than random (0.329).
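
The comparison with random guessing can be made explicit. The short sketch below takes the team-names-only accuracies reported above and measures each against a uniform guess among the three outcomes. The uniform baseline is an assumption of this sketch; a baseline that always predicts the most common result would be a stricter comparison.

```python
# Accuracy of the team-names-only model, as reported above
team_only = {"14-15": 0.368, "15-16": 0.329, "16-17": 0.395,
             "17-18": 0.382, "18-19": 0.431}

random_baseline = 1 / 3  # uniform guess among home win, draw, away win

for season, acc in team_only.items():
    margin = acc - random_baseline
    label = "above" if margin > 0 else "below"
    print(f"{season}: {acc:.3f} ({abs(margin):.3f} {label} random)")

# Seasons in which the model did worse than the uniform guess
below = [s for s, acc in team_only.items() if acc < random_baseline]
print(below)
```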

We then tried only the factors we generated and found a dip in accuracy in four of the five seasons (17-18 was unchanged).

14-15: 0.289
15-16: 0.289
16-17: 0.355
17-18: 0.382
18-19: 0.362

Finally, we combined the home team and away team names with our factors. Compared with the model that used team names alone, accuracy improved slightly in three seasons, was unchanged in 14-15, and dropped in 18-19.

14-15: 0.368
15-16: 0.382
16-17: 0.408
17-18: 0.395
18-19: 0.362

We theorize that the accuracy for 18-19 is low because fewer games have been played so far in that season.

Combined

We also ran the model on all the datasets combined. Note that even when combined, the "All Time" factors are computed within the season each data point came from, so there are no "lifetime" data points besides the name of the team (computing them across seasons did not increase accuracy). The model with only team names as factors had an accuracy of 0.387, the model with only our factors had 0.362, and the model with both had 0.398.

If we look at the summary of the model with the team names alongside our factors, we see that some of the factors are significant. In contrast to David Sheehan's study, we did not obtain as many significant p-values because we did not combine HomeTeam and AwayTeam into a single team variable for this specific analysis.

# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1415 = LabelEncoder() 
encoder_1415.fit(data_1415_poi["FTR"])
data_1415_poi["FTR"] = encoder_1415.transform(data_1415_poi["FTR"])

# split into training and test data
train_1415, valid_1415, train_labels_1415, valid_labels_1415 = train_test_split(
    data_1415_poi.drop("FTR", axis=1),
    data_1415_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1415_all = train_1415
train_1415_all["FTR"] = train_labels_1415

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1415_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1415))
a1 = accuracy_score(valid_labels_1415, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1415_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1415))
a2 = accuracy_score(valid_labels_1415, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam ", 
            data=train_1415_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1415))
a3 = accuracy_score(valid_labels_1415, pred)

a1, a2, a3

(0.3684210526315789, 0.2894736842105263, 0.3684210526315789)

# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1516 = LabelEncoder()
encoder_1516.fit(data_1516_poi["FTR"])
data_1516_poi["FTR"] = encoder_1516.transform(data_1516_poi["FTR"])

# split into training and test data
train_1516, valid_1516, train_labels_1516, valid_labels_1516 = train_test_split(
    data_1516_poi.drop("FTR", axis=1),
    data_1516_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1516_all = train_1516
train_1516_all["FTR"] = train_labels_1516

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1516_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1516))
a1 = accuracy_score(valid_labels_1516, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1516_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1516))
a2 = accuracy_score(valid_labels_1516, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_1516_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1516))
a3 = accuracy_score(valid_labels_1516, pred)

a1, a2, a3

(0.3815789473684211, 0.2894736842105263, 0.32894736842105265)

# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1617 = LabelEncoder()
encoder_1617.fit(data_1617_poi["FTR"])
data_1617_poi["FTR"] = encoder_1617.transform(data_1617_poi["FTR"])

# split into training and test data
train_1617, valid_1617, train_labels_1617, valid_labels_1617 = train_test_split(
    data_1617_poi.drop("FTR", axis=1),
    data_1617_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1617_all = train_1617
train_1617_all["FTR"] = train_labels_1617

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1617_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1617))
a1 = accuracy_score(valid_labels_1617, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1617_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1617))
a2 = accuracy_score(valid_labels_1617, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_1617_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1617))
a3 = accuracy_score(valid_labels_1617, pred)

a1, a2, a3

(0.40789473684210525, 0.35526315789473684, 0.39473684210526316)

# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1718 = LabelEncoder()
encoder_1718.fit(data_1718_poi["FTR"])
data_1718_poi["FTR"] = encoder_1718.transform(data_1718_poi["FTR"])

# split into training and test data
train_1718, valid_1718, train_labels_1718, valid_labels_1718 = train_test_split(
    data_1718_poi.drop("FTR", axis=1),
    data_1718_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1718_all = train_1718
train_1718_all["FTR"] = train_labels_1718

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1718_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1718))
a1 = accuracy_score(valid_labels_1718, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1718_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1718))
a2 = accuracy_score(valid_labels_1718, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_1718_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1718))
a3 = accuracy_score(valid_labels_1718, pred)

a1, a2, a3

(0.39473684210526316, 0.3815789473684211, 0.3815789473684211)

# create label encoder to convert H, A, or D into factors for who wins (Home, Away, or Draw)
encoder_1819 = LabelEncoder()
encoder_1819.fit(data_1819_poi["FTR"])
data_1819_poi["FTR"] = encoder_1819.transform(data_1819_poi["FTR"])

# split into training and test data
train_1819, valid_1819, train_labels_1819, valid_labels_1819 = train_test_split(
    data_1819_poi.drop("FTR", axis=1),
    data_1819_poi["FTR"],
    test_size=0.2,
    random_state=123
)

# Combine the training data and labels to fit into generalized linear model
train_1819_all = train_1819
train_1819_all["FTR"] = train_labels_1819

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1819_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1819))
a1 = accuracy_score(valid_labels_1819, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_1819_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1819))
a2 = accuracy_score(valid_labels_1819, pred)

# Use Poisson to calculate
m = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_1819_all, family=sm.families.Poisson()).fit()

# Predict and find accuracy
pred = np.round(m.predict(valid_1819))
a3 = accuracy_score(valid_labels_1819, pred)

a1, a2, a3

(0.3620689655172414, 0.3620689655172414, 0.43103448275862066)

# concatenate all data
all_data_poi = pd.concat([data_1415_poi, data_1516_poi, data_1617_poi, data_1718_poi, data_1819_poi], axis=0, ignore_index=True)

encoder_all = LabelEncoder()
encoder_all.fit(all_data_poi["FTR"])
all_data_poi["FTR"] = encoder_all.transform(all_data_poi["FTR"])
train_all, valid_all, train_labels_all, valid_labels_all = train_test_split(
    all_data_poi.drop("FTR", axis=1),
    all_data_poi["FTR"],
    test_size=0.2,
    random_state=123
)

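# Combine the training data and labels to fit into generalized linear model
train_all_allcols = train_all.copy()  # copy so the feature frame is not modified in place
train_all_allcols["FTR"] = train_labels_all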
m1 = smf.glm(formula="FTR ~ HomeTeam + AwayTeam + HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_all_allcols, family=sm.families.Poisson()).fit()
pred = np.round(m1.predict(valid_all))
pred[np.isnan(pred)] = 1
a1 = accuracy_score(valid_labels_all, pred)

m2 = smf.glm(formula="FTR ~ HomeAvgAllTimeSoFar + \
                    HomeHighAllTimeSoFar + HomeLowAllTimeSoFar + AwayAvgAllTimeSoFar + \
                    AwayHighAllTimeSoFar + AwayLowAllTimeSoFar + HomeTotalGoals + \
                    HomeTotalShots + HomeTotalAccuracy + AwayTotalGoals + AwayTotalShots + \
                    AwayTotalAccuracy", 
            data=train_all_allcols, family=sm.families.Poisson()).fit()

pred = np.round(m2.predict(valid_all))
pred[np.isnan(pred)] = 1
a2 = accuracy_score(valid_labels_all, pred)

m3 = smf.glm(formula="FTR ~ HomeTeam + AwayTeam", 
            data=train_all_allcols, family=sm.families.Poisson()).fit()

pred = np.round(m3.predict(valid_all))
pred[np.isnan(pred)] = 1
a3 = accuracy_score(valid_labels_all, pred)


a1, a2, a3

(0.39779005524861877, 0.36187845303867405, 0.3867403314917127)

m1.summary()

m2.summary()

(2) Predicting Betting Odds¶

def forward_selected(data, response):
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model

def getLastTeamStats(df):
    team_stats = {}
    for team in df.HomeTeam.unique():
        temp_df = df[::-1].reset_index(drop=True)
        for i in range(len(temp_df)):
            game = temp_df.loc[i]
            if team == game['HomeTeam']:
                stats = {}
                stats['W'] = game['H_W']
                stats['WR'] = game['H_WR']
                stats['avg_diff'] = game['H_avg_diff']
                team_stats[team] = stats
            elif team == game['AwayTeam']:
                stats = {}
                stats['W'] = game['A_W']
                stats['WR'] = game['A_WR']
                stats['avg_diff'] = game['A_avg_diff']
                team_stats[team] = stats
            if team in team_stats:
                break
    return team_stats

As explained in the data preparation section, predicting betting odds sensibly required engineering new variables to run the models on. The steps described in this section support that reasoning. We chose a regressor instead of a classifier because betting odds are continuous values, so treating them as classes would not make sense. We also chose not to use a percentile feature selector because we believe each of the six engineered variables carries its own crucial information.

The first model we created was a grid search over a pipeline that combines k-neighbors regression with scaled training values, evaluated with 10-fold cross-validation. The variables it took into account were the number of goals scored by each team (FTHG, FTAG) as well as their ratios of goals to shots (HSGR, ASGR). The results were acceptable and not far off the models we showcase below, but it was here that we realized it didn't make sense to predict odds with game statistics, since those statistics are only known once a game has been played.

Our second model was the same grid search and pipeline, but using all six newly engineered features and nothing else (H_W, H_WR, H_avg_diff, A_W, A_WR, A_avg_diff). As expected, we got better results! There are two main takeaways from the visualizations of the results. First, this model does not seem to be overfitted, since we are predicting values for the 18-19 season, which is not included in the training data, and the model under-predicts outliers. Second, our engineered features seem to be doing their job: the plots show the accuracy of the model increasing over time, because the engineered features are running statistics that are updated after every game. Visually, this shows up as the straight line at the start of all 3 plots (predictions for odd_home, odd_draw, odd_away). The straight line covers every team's first game of the season, when the model has no team or game statistics to work with. Towards the end of the plots, however, the model's predicted values align almost perfectly with the actual values, except for the obvious outliers.
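The flat segment at the start of each plot can be reproduced in isolation. The sketch below is only an illustration: it uses synthetic data as a stand-in for the six engineered features, and both the generated odds and the starting value of 0.5 are hypothetical rather than taken from our dataset. It assumes every team enters its first game with identical statistics. A k-neighbors regressor given identical inputs finds the same neighbors each time, so it returns one constant prediction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for six engineered features and one odds column (hypothetical data)
X_train = rng.uniform(0, 1, size=(200, 6))
y_train = 1.5 + 3 * X_train[:, 4] + rng.normal(0, 0.2, size=200)

# Same kind of pipeline as our second model: scaler followed by k-neighbors regression
model = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=18))
model.fit(X_train, y_train)

# Opening fixtures: every row carries the same (hypothetical) starting statistics
opening_games = np.tile([0.5, 0.5, 0.5, 0.5, 0.5, 0.5], (10, 1))
opening_predictions = model.predict(opening_games)

# All ten predictions are identical, which plots as a straight horizontal line
print(opening_predictions)
```

Once the running statistics start to differ between teams, the neighbors differ too, and the predictions begin to follow the actual odds.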

The third model we created was a simple multivariate linear regression using forward selection on the six engineered features. We built it as a point of comparison for our second model, in case an approach without a grid search performed just as well. Its results were about 23% to 42% worse in terms of mean absolute error on the 18-19 season (for example, 1.007 versus 0.822 for odd_home). Also, the forward selection kept nearly all six variables! It eliminated just one of them (A_avg_diff) for the draw betting odds, which may be a one-off outcome. All this supports our decision to use k-neighbors with cross-validation and to skip feature selection!
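The size of the gap between the two models can be checked directly from the scores printed for each approach. The short calculation below copies the reported negative MAE values (with the sign flipped) and computes how much larger the linear regression error is for each type of odds. The dictionary names are ours and exist only for this check.

```python
# Mean absolute errors on the 18-19 season, copied from the printed neg MAE scores
knn_mae = {
    'odd_home': 0.821804049724465,
    'odd_draw': 0.645195167850422,
    'odd_away': 1.6805184240879012,
}
linear_mae = {
    'odd_home': 1.0074253538704099,
    'odd_draw': 0.9187168020761359,
    'odd_away': 2.234560739741936,
}

# Relative increase in error when moving from k-neighbors to linear regression
pct_worse = {
    odd_type: 100 * (linear_mae[odd_type] / knn_mae[odd_type] - 1)
    for odd_type in knn_mae
}

for odd_type, pct in pct_worse.items():
    print(f"{odd_type}: linear regression MAE is {pct:.1f}% higher")
```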

Our second model performed best, so we will use that to predict the betting odds for home, draw, and away in the future games of the 18-19 season.

scaler = MinMaxScaler()
knn_reg = KNeighborsRegressor()

df_future_games = pd.read_csv('./data/prediction.csv')
columns_to_use = ['H_W', 'H_WR', 'H_avg_diff', 'A_W', 'A_WR', 'A_avg_diff']

K-Neighbors Regression with Scaler and 10-Fold Cross Validation (Engineered Features)¶

# GRID SEARCH (K-NEIGHBORS REGRESSOR, SCALER)

pipe = make_pipeline(scaler, knn_reg)
param_grid = {
    'kneighborsregressor__n_neighbors':range(1, 20), 
    'kneighborsregressor__weights':['uniform', 'distance']
}
grid = GridSearchCV(pipe, param_grid, cv=10, scoring="neg_mean_absolute_error")

for odd_type in ['odd_home', 'odd_draw', 'odd_away']:
    grid.fit(df_past_seasons[columns_to_use], df_past_seasons[odd_type])
    predictions = grid.predict(df_1819_feat_engr[columns_to_use])
    score = grid.score(df_1819_feat_engr[columns_to_use], df_1819_feat_engr[odd_type])
    
    plt.figure(figsize=(16, 4))
    plt.plot(np.arange(len(predictions)), predictions, alpha=0.8, label='predictions')
    plt.plot(np.arange(len(predictions)), df_1819_feat_engr[odd_type].values, alpha=0.8, label='actual')
    plt.title('Predicted '+odd_type+' for 2018-2019 Season (so far)', fontsize=15)
    plt.xlabel('Game of Season', fontsize=15)
    plt.ylabel(odd_type, fontsize=15)
    plt.legend(fontsize=15)
    plt.show()
    print(odd_type, ', neg MAE: ', score)
    print(grid.cv_results_['params'][grid.best_index_])

odd_home , neg MAE:  -0.821804049724465
{'kneighborsregressor__n_neighbors': 18, 'kneighborsregressor__weights': 'uniform'}

odd_draw , neg MAE:  -0.645195167850422
{'kneighborsregressor__n_neighbors': 19, 'kneighborsregressor__weights': 'uniform'}

odd_away , neg MAE:  -1.6805184240879012
{'kneighborsregressor__n_neighbors': 19, 'kneighborsregressor__weights': 'uniform'}

Multivariate Linear Regression with Forward Selection (Engineered Features)¶

# FORWARD SELECTION

for odd_type in ['odd_home', 'odd_draw', 'odd_away']:
    lin_model = forward_selected(df_past_seasons[np.append(columns_to_use, odd_type)], odd_type)
    predictions = lin_model.predict(df_1819_feat_engr[np.append(columns_to_use, odd_type)])
    score = 0-mean_absolute_error(df_1819_feat_engr[odd_type].values, predictions.values)

    plt.figure(figsize=(16, 4))
    plt.plot(np.arange(len(predictions)), predictions, alpha=0.8, label='predictions')
    plt.plot(np.arange(len(predictions)), df_1819_feat_engr[odd_type].values, alpha=0.8, label='actual')
    plt.title('Predicted '+odd_type+' for 2018-2019 Season (so far)', fontsize=15)
    plt.xlabel('Game of Season', fontsize=15)
    plt.ylabel(odd_type, fontsize=15)
    plt.legend(fontsize=15)
    plt.show()
    print(odd_type, ', neg MAE: ', score)
    print(lin_model.params)

odd_home , neg MAE:  -1.0074253538704099
Intercept     3.033046
A_avg_diff    1.253130
H_avg_diff   -0.648939
H_W          -0.118847
A_W           0.116370
A_WR         -1.234196
H_WR          0.828804
dtype: float64

odd_draw , neg MAE:  -0.9187168020761359
Intercept     4.217888
H_avg_diff    0.687597
H_W           0.069843
A_W          -0.054330
H_WR         -0.954750
A_WR          0.242711
dtype: float64

odd_away , neg MAE:  -2.234560739741936
Intercept     5.238418
H_avg_diff    2.421358
A_avg_diff   -1.248026
H_W           0.325487
A_W          -0.290895
H_WR         -3.063447
A_WR          1.510094
dtype: float64

Predictions of Betting Odds for Future Games In 2018/2019 Season¶

# Preparation for predicting future games

team_stats = getLastTeamStats(df_1819_feat_engr)
H_W = []
H_WR = []
H_avg_diff = []
A_W = []
A_WR = []
A_avg_diff = []

for i in range(len(df_future_games)):
    game = df_future_games.loc[i]
    H_W.append(team_stats[game['HomeTeam']]['W'])
    H_WR.append(team_stats[game['HomeTeam']]['WR'])
    H_avg_diff.append(team_stats[game['HomeTeam']]['avg_diff'])
    A_W.append(team_stats[game['AwayTeam']]['W'])
    A_WR.append(team_stats[game['AwayTeam']]['WR'])
    A_avg_diff.append(team_stats[game['AwayTeam']]['avg_diff'])

df_future_games['H_W'] = H_W
df_future_games['H_WR'] = H_WR
df_future_games['H_avg_diff'] = H_avg_diff
df_future_games['A_W'] = A_W
df_future_games['A_WR'] = A_WR
df_future_games['A_avg_diff'] = A_avg_diff

higher_wr = []
for i in range(len(df_future_games)):
    if df_future_games.loc[i, 'H_WR'] > df_future_games.loc[i, 'A_WR']:
        higher_wr.append('H')
    elif df_future_games.loc[i, 'H_WR'] < df_future_games.loc[i, 'A_WR']:
        higher_wr.append('A')
    else:
        higher_wr.append('D')
df_future_games['higher_wr'] = higher_wr

# PREDICT ODDS OF FUTURE GAMES USING GRIDSEARCH MODEL

pipe = make_pipeline(scaler, knn_reg)
param_grid = {
    'kneighborsregressor__n_neighbors':range(1, 20), 
    'kneighborsregressor__weights':['uniform', 'distance']
}
grid = GridSearchCV(pipe, param_grid, cv=10, scoring="neg_mean_absolute_error")

for odd_type in ['odd_home', 'odd_draw', 'odd_away']:
    grid.fit(df_past_seasons[columns_to_use], df_past_seasons[odd_type])
    predictions = grid.predict(df_future_games[columns_to_use])
    df_future_games['predicted_'+odd_type] = predictions

lower_odds = []
for i in range(len(df_future_games)):
    if df_future_games.loc[i, 'predicted_odd_home'] > df_future_games.loc[i, 'predicted_odd_away']:
        lower_odds.append('A')
    elif df_future_games.loc[i, 'predicted_odd_home'] < df_future_games.loc[i, 'predicted_odd_away']:
        lower_odds.append('H')
    else:
        lower_odds.append('D')
df_future_games['lower_odds'] = lower_odds

Below is a subset of the dataset of future games along with their 3 predicted betting odds. We fed the last-calculated statistics for each team into our second model to produce these predictions.

To draw further observations, we added higher_wr and lower_odds, which record which team currently has the higher win rate and which has the lower predicted betting odds. In the real world, the team with the higher win rate usually has the lower betting odds. This is intuitive, as bettors are much more likely to back the team that is more likely to win.
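The link between odds and likelihood can be made explicit. Assuming the odds in our data are decimal odds (an assumption on our part), the implied probability of an outcome is the reciprocal of its odds. The sketch below rescales the reciprocals so they sum to 1, since bookmaker odds usually include a margin. The helper name `implied_probabilities` is ours, and the numbers are the predicted odds for Man City vs Watford.

```python
def implied_probabilities(odd_home, odd_draw, odd_away):
    """Convert decimal odds into outcome probabilities that sum to 1."""
    raw = {'H': 1 / odd_home, 'D': 1 / odd_draw, 'A': 1 / odd_away}
    total = sum(raw.values())
    return {outcome: value / total for outcome, value in raw.items()}

# Predicted odds for Man City (home) vs Watford (away)
probs = implied_probabilities(1.380926, 5.542368, 10.513684)

# The outcome with the lowest odds has the highest implied probability
favorite = max(probs, key=probs.get)
print(probs, favorite)
```

Lower odds therefore mean a higher implied chance of winning, which is why we compare lower_odds against higher_wr.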

# Predicted odds of future games
df_future_games[['HomeTeam', 'AwayTeam', 
                 'H_WR', 'A_WR',
                 'predicted_odd_home', 'predicted_odd_draw', 'predicted_odd_away', 
                 'higher_wr', 'lower_odds']].head()

Finally, to end this section, we looked at how often the team with the higher win rate is also the team with the lower predicted odds. This holds for 81% of the future games, meaning that in the remaining 19% the team with the higher win rate is not the predicted favorite!

(df_future_games['higher_wr'] == df_future_games['lower_odds']).mean()

0.8111111111111111
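To see what a disagreement looks like, the same comparison can be run on the five games from our preview table. The small frame below is retyped by hand from that table and is only a sketch; on the full df_future_games the same filter would list every mismatch.

```python
import pandas as pd

# The five future games shown in our preview table
preview = pd.DataFrame({
    'HomeTeam': ['Cardiff', 'Crystal Palace', 'Huddersfield', 'Leicester', 'Man City'],
    'AwayTeam': ['West Ham', 'Brighton', 'Bournemouth', 'Fulham', 'Watford'],
    'higher_wr': ['A', 'H', 'A', 'H', 'H'],
    'lower_odds': ['H', 'H', 'A', 'H', 'H'],
})

# Share of games where the higher win rate and the lower odds point to the same team
agreement = (preview['higher_wr'] == preview['lower_odds']).mean()

# Games where the two indicators disagree
disagreements = preview[preview['higher_wr'] != preview['lower_odds']]
print(agreement)
print(disagreements)
```

In this subset the only mismatch is Cardiff vs West Ham: West Ham has the higher win rate (0.464 against 0.321), yet Cardiff, playing at home, receives the lower predicted odds (2.67 against 3.10).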

(3) Using Poisson with different features¶

This model also uses Poisson regression, but here we replicated the model from David Sheehan's study. First, we reshaped the dataset so that each row shows the number of goals a team scored in a match, whether at home or away. We only used the features that are available to us. In addition to the features that Sheehan used, we also added the betting odds, using the predictions from the previous section for future games.

Team
Opponent
Goals = number of goals scored in the match
Home = 1: HomeTeam & 0: AwayTeam
odd_team = betting odds for the team
odd_draw = betting odds for a draw
odd_opponent = betting odds for the opponent

Dataset for the model¶

# Build one row per team per match: home sides first, then away sides
team_opponent_data_home = data_1819[['HomeTeam', 'AwayTeam', 'FTHG', 'odd_home', 'odd_draw', 'odd_away']].copy()
team_opponent_data_home.columns = ['Team', 'Opponent', 'Goals', 'odd_team', 'odd_draw', 'odd_opponent']
team_opponent_data_home['Home'] = 1
team_opponent_data_away = data_1819[['AwayTeam', 'HomeTeam', 'FTAG', 'odd_away', 'odd_draw', 'odd_home']].copy()
team_opponent_data_away.columns = ['Team', 'Opponent', 'Goals', 'odd_team', 'odd_draw', 'odd_opponent']
team_opponent_data_away['Home'] = 0

# Stack the home and away rows so each match appears exactly once per side
team_opponent_data = pd.concat([team_opponent_data_home, team_opponent_data_away])
team_opponent_data.head()

#Perform poisson model
poisson_model = smf.glm(formula="Goals ~ Home + Team + Opponent + odd_team + odd_draw + odd_opponent", data=team_opponent_data, 
                        family=sm.families.Poisson()).fit()
poisson_model.summary()

The GLM table shows that home status and betting odds are both associated with the number of goals a team scores. Because the model uses a log link, the 95% confidence interval for the Home coefficient, [0.160, 0.506], is on the log scale: we are 95% confident that playing at home multiplies a team's expected goals by a factor between about 1.17 and 1.66, holding the other variables fixed. Also, surprisingly, the betting odds have p-values below 0.05, so they are significantly associated with the number of goals a team scores in a game.
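Since the Poisson GLM uses a log link, its coefficients and confidence intervals live on the log scale. Exponentiating them gives a multiplicative effect on the expected number of goals. The snippet below applies this to the Home interval quoted above; the variable names are ours.

```python
import math

# 95% confidence interval for the Home coefficient, on the log scale
home_ci_log = (0.160, 0.506)

# exp() turns a log-scale coefficient into a multiplicative rate ratio
home_ci_ratio = tuple(math.exp(bound) for bound in home_ci_log)
print(home_ci_ratio)
```

The resulting bounds are roughly 1.17 and 1.66, so the home effect reads as a 17% to 66% increase in expected goals rather than a fixed number of extra goals.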

EPL standing table (March 2nd, 2019)¶

The table below shows the team rankings through round 29. At this point, Man City was leading the EPL with 71 points. Using the Poisson regression we created, we will calculate the number of goals each team is expected to score in each remaining game and predict the result of each match.

ranking=pd.read_csv('./data/CurrentRanking.csv')
ranking

future_matches = pd.read_csv('./data/betting_odds_prediction.csv')

This is our input dataset. We listed the home team and its opponent (the away team) for every remaining game. For each game, we predicted the number of goals that the home team and the away team might score and gave 3 points to the team expected to score more. A limitation of our model is that the two predicted goal counts are continuous values that are almost never exactly equal, so no draws appear in our results. After predicting the result of each match, we predict that Manchester City will win the season with 98 points.
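One possible way around the missing draws, which we did not use for the standings in this report, is to treat each predicted goal count as the mean of a Poisson distribution and add up the probability of every scoreline, assuming the two teams' goals are independent. The helper names and the example means (1.8 and 1.1) below are illustrative assumptions, not outputs of our model.

```python
import math

def poisson_pmf(k, mu):
    """Probability of exactly k goals when the expected number of goals is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def outcome_probabilities(home_mu, away_mu, max_goals=10):
    """Return (P(home win), P(draw), P(away win)), truncating at max_goals per side."""
    p_home = p_draw = p_away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            # Independence assumption: joint probability is the product of the two
            p = poisson_pmf(h, home_mu) * poisson_pmf(a, away_mu)
            if h > a:
                p_home += p
            elif h == a:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

# Hypothetical expected goals for a home and an away team
p_home, p_draw, p_away = outcome_probabilities(1.8, 1.1)

# Expected league points for the home team: 3 for a win, 1 for a draw
expected_home_points = 3 * p_home + 1 * p_draw
print(p_home, p_draw, p_away, expected_home_points)
```

With this approach a draw gets a non-zero probability in every game, and points could be assigned in expectation instead of all-or-nothing.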

Input dataset¶

future_matches.head()

## iterate over the future matches and award league points
def predict_goals(team, opponent, home, odd_team, odd_draw, odd_opponent):
    # expected goals for `team` in one match, from the fitted Poisson model
    match = pd.DataFrame(data={'Team': team, 'Opponent': opponent, 'Home': home,
                               'odd_team': odd_team, 'odd_draw': odd_draw,
                               'odd_opponent': odd_opponent}, index=[1])
    return poisson_model.predict(match).iloc[0]

result = []
for index, row in future_matches.iterrows():
    home_score = predict_goals(row['HomeTeam'], row['AwayTeam'], 1, row['predicted_odd_home'],
                               row['predicted_odd_draw'], row['predicted_odd_away'])
    away_score = predict_goals(row['AwayTeam'], row['HomeTeam'], 0, row['predicted_odd_away'],
                               row['predicted_odd_draw'], row['predicted_odd_home'])
    is_home = ranking['Team'] == row['HomeTeam']
    is_away = ranking['Team'] == row['AwayTeam']
    if home_score > away_score:
        # home win: 3 points to the home team
        ranking.loc[is_home, 'Point'] += 3
        result.append('H')
    elif home_score < away_score:
        # away win: 3 points to the away team
        ranking.loc[is_away, 'Point'] += 3
        result.append('A')
    else:
        # draw: 1 point each
        ranking.loc[is_home | is_away, 'Point'] += 1
        result.append('D')
future_matches['predicted_result'] = result

Predicted Standing table of 18/19 season¶

# Prediction
ranking.sort_values('Point', ascending=False)

	AC	AF	AR	AS	AST	AY	AwayTeam	B365A	B365D	B365H	...	Referee	SJA	SJD	SJH	VCA	VCD	VCH	WHA	WHD	WHH
0	3.0	19.0	1.0	4.0	2.0	2.0	Crystal Palace	15.0	6.5	1.25	...	J Moss	12.00	5.75	1.25	10.50	6.25	1.25	12.0	5.5	1.25
1	6.0	10.0	0.0	13.0	3.0	1.0	Everton	2.4	3.4	3.20	...	M Jones	2.38	3.30	3.00	2.40	3.40	3.20	2.4	3.1	3.10
2	0.0	20.0	0.0	5.0	4.0	4.0	Swansea	11.0	5.0	1.36	...	M Dean	8.00	5.00	1.36	10.00	5.20	1.36	9.0	4.5	1.36
3	9.0	10.0	0.0	11.0	4.0	2.0	Hull	3.1	3.3	2.50	...	C Pawson	2.88	3.25	2.50	3.12	3.20	2.55	2.9	3.0	2.60
4	8.0	9.0	0.0	7.0	2.0	3.0	Aston Villa	4.5	3.5	1.95	...	A Taylor	4.00	3.40	1.95	4.75	3.30	1.95	4.2	3.2	1.95

	HomeTeam	AwayTeam	FTHG	FTAG	FTR	HS	AS	HST	AST	HSGR	ASGR	odd_home	odd_draw	odd_away
0	Arsenal	Crystal Palace	2.0	1.0	H	14.0	4.0	6.0	2.0	0.142857	0.250000	1.260000	5.866667	12.085000
1	Leicester	Everton	2.0	2.0	D	11.0	13.0	3.0	3.0	0.181818	0.153846	3.073333	3.296667	2.393333
2	Man United	Swansea	1.0	2.0	A	14.0	5.0	5.0	4.0	0.071429	0.400000	1.363333	4.925000	9.600000
3	QPR	Hull	0.0	1.0	A	19.0	11.0	6.0	4.0	0.000000	0.090909	2.488333	3.193333	3.015000
4	Stoke	Aston Villa	0.0	1.0	A	12.0	7.0	2.0	2.0	0.000000	0.142857	1.958333	3.345000	4.250000

	HomeTeam	AwayTeam	H_W	H_WR	H_avg_diff	A_W	A_WR	A_avg_diff	odd_home	odd_draw	odd_away
150	Arsenal	Newcastle	8.5	0.566667	0.400000	8.5	0.566667	-0.066667	1.446667	4.471667	7.563333
151	Burnley	Southampton	5.0	0.333333	-0.933333	9.0	0.600000	0.866667	5.041667	3.591667	1.761667
152	Chelsea	Hull	12.5	0.833333	1.400000	5.5	0.366667	-0.400000	1.181667	7.028333	17.410000
153	Crystal Palace	Stoke	5.5	0.366667	-0.333333	6.5	0.433333	-0.200000	2.588333	3.198333	2.926667
154	Leicester	Man City	4.0	0.266667	-0.733333	11.5	0.766667	1.200000	6.545000	4.528333	1.493333

Dep. Variable:	FTR	No. Observations:	1447
Model:	GLM	Df Residuals:	1378
Model Family:	Poisson	Df Model:	68
Link Function:	log	Scale:	1.0000
Method:	IRLS	Log-Likelihood:	-1782.6
Date:	Wed, 13 Mar 2019	Deviance:	1145.1
Time:	02:08:52	Pearson chi2:	865.
No. Iterations:	5	Covariance Type:	nonrobust

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	0.2067	0.216	0.956	0.339	-0.217	0.630
HomeTeam[T.Aston Villa]	-0.8236	0.267	-3.083	0.002	-1.347	-0.300
HomeTeam[T.Bournemouth]	-0.3208	0.171	-1.879	0.060	-0.655	0.014
HomeTeam[T.Brighton]	-0.1526	0.217	-0.704	0.482	-0.578	0.273
HomeTeam[T.Burnley]	-0.3196	0.189	-1.692	0.091	-0.690	0.051
HomeTeam[T.Cardiff]	-0.6427	0.344	-1.870	0.061	-1.316	0.031
HomeTeam[T.Chelsea]	-0.0966	0.133	-0.725	0.468	-0.358	0.165
HomeTeam[T.Crystal Palace]	-0.5529	0.171	-3.232	0.001	-0.888	-0.218
HomeTeam[T.Everton]	-0.2673	0.152	-1.762	0.078	-0.565	0.030
HomeTeam[T.Fulham]	-0.7869	0.398	-1.977	0.048	-1.567	-0.007
HomeTeam[T.Huddersfield]	-0.6784	0.278	-2.440	0.015	-1.223	-0.133
HomeTeam[T.Hull]	-0.4553	0.221	-2.061	0.039	-0.888	-0.022
HomeTeam[T.Leicester]	-0.2157	0.155	-1.391	0.164	-0.520	0.088
HomeTeam[T.Liverpool]	-0.1082	0.135	-0.800	0.424	-0.373	0.157
HomeTeam[T.Man City]	-0.0918	0.142	-0.647	0.518	-0.370	0.186
HomeTeam[T.Man United]	-0.0234	0.135	-0.174	0.862	-0.288	0.241
HomeTeam[T.Middlesbrough]	-0.4627	0.311	-1.487	0.137	-1.073	0.147
HomeTeam[T.Newcastle]	-0.3958	0.180	-2.199	0.028	-0.749	-0.043
HomeTeam[T.Norwich]	-0.6440	0.323	-1.996	0.046	-1.276	-0.012
HomeTeam[T.QPR]	-0.4678	0.284	-1.646	0.100	-1.025	0.089
HomeTeam[T.Southampton]	-0.3345	0.155	-2.163	0.031	-0.638	-0.031
HomeTeam[T.Stoke]	-0.3278	0.171	-1.914	0.056	-0.664	0.008
HomeTeam[T.Sunderland]	-0.6836	0.218	-3.132	0.002	-1.111	-0.256
HomeTeam[T.Swansea]	-0.4001	0.182	-2.196	0.028	-0.757	-0.043
HomeTeam[T.Tottenham]	-0.1366	0.137	-0.997	0.319	-0.405	0.132
HomeTeam[T.Watford]	-0.2621	0.171	-1.529	0.126	-0.598	0.074
HomeTeam[T.West Brom]	-0.5356	0.186	-2.887	0.004	-0.899	-0.172
HomeTeam[T.West Ham]	-0.3562	0.157	-2.274	0.023	-0.663	-0.049
HomeTeam[T.Wolves]	-0.2620	0.301	-0.871	0.384	-0.851	0.328
AwayTeam[T.Aston Villa]	0.3467	0.219	1.581	0.114	-0.083	0.776
AwayTeam[T.Bournemouth]	0.2235	0.182	1.231	0.218	-0.132	0.579
AwayTeam[T.Brighton]	0.2752	0.238	1.156	0.248	-0.191	0.742
AwayTeam[T.Burnley]	0.1318	0.193	0.685	0.494	-0.246	0.509
AwayTeam[T.Cardiff]	0.2860	0.289	0.989	0.323	-0.281	0.853
AwayTeam[T.Chelsea]	-0.2379	0.179	-1.332	0.183	-0.588	0.112
AwayTeam[T.Crystal Palace]	-0.0014	0.181	-0.008	0.994	-0.356	0.353
AwayTeam[T.Everton]	0.1402	0.172	0.814	0.416	-0.197	0.478
AwayTeam[T.Fulham]	0.5538	0.290	1.909	0.056	-0.015	1.123
AwayTeam[T.Huddersfield]	0.2565	0.232	1.107	0.268	-0.198	0.711
AwayTeam[T.Hull]	0.3666	0.214	1.713	0.087	-0.053	0.786
AwayTeam[T.Leicester]	0.0848	0.176	0.481	0.631	-0.261	0.431
AwayTeam[T.Liverpool]	-0.1102	0.175	-0.628	0.530	-0.454	0.234
AwayTeam[T.Man City]	-0.4459	0.203	-2.200	0.028	-0.843	-0.049
AwayTeam[T.Man United]	-0.2398	0.181	-1.325	0.185	-0.594	0.115
AwayTeam[T.Middlesbrough]	0.2716	0.306	0.889	0.374	-0.327	0.871
AwayTeam[T.Newcastle]	0.2413	0.184	1.310	0.190	-0.120	0.602
AwayTeam[T.Norwich]	0.3895	0.251	1.555	0.120	-0.101	0.881
AwayTeam[T.QPR]	0.4945	0.262	1.890	0.059	-0.018	1.007
AwayTeam[T.Southampton]	0.1077	0.173	0.621	0.535	-0.232	0.448
AwayTeam[T.Stoke]	0.1731	0.185	0.936	0.349	-0.189	0.536
AwayTeam[T.Sunderland]	0.2262	0.200	1.131	0.258	-0.166	0.618
AwayTeam[T.Swansea]	0.2137	0.186	1.147	0.251	-0.151	0.579
AwayTeam[T.Tottenham]	-0.3941	0.189	-2.083	0.037	-0.765	-0.023
AwayTeam[T.Watford]	0.2499	0.176	1.416	0.157	-0.096	0.596
AwayTeam[T.West Brom]	0.1332	0.191	0.698	0.485	-0.241	0.507
AwayTeam[T.West Ham]	0.0943	0.172	0.550	0.582	-0.242	0.431
AwayTeam[T.Wolves]	0.1480	0.324	0.457	0.647	-0.486	0.782
HomeAvgAllTimeSoFar	0.1749	0.146	1.202	0.229	-0.110	0.460
HomeHighAllTimeSoFar	0.0191	0.034	0.557	0.577	-0.048	0.086
HomeLowAllTimeSoFar	-0.0132	0.097	-0.136	0.892	-0.204	0.177
AwayAvgAllTimeSoFar	-0.1817	0.166	-1.091	0.275	-0.508	0.145
AwayHighAllTimeSoFar	0.0228	0.033	0.688	0.492	-0.042	0.088
AwayLowAllTimeSoFar	-0.0558	0.110	-0.505	0.614	-0.272	0.161
HomeTotalGoals	-0.0023	0.006	-0.380	0.704	-0.014	0.010
HomeTotalShots	0.0003	0.001	0.321	0.748	-0.001	0.002
HomeTotalAccuracy	-1.3185	1.531	-0.861	0.389	-4.319	1.682
AwayTotalGoals	-0.0006	0.007	-0.094	0.925	-0.014	0.013
AwayTotalShots	-6.407e-05	0.001	-0.069	0.945	-0.002	0.002
AwayTotalAccuracy	1.4854	1.756	0.846	0.398	-1.956	4.927

	HomeTeam	AwayTeam	H_WR	A_WR	predicted_odd_home	predicted_odd_draw	predicted_odd_away	higher_wr	lower_odds
0	Cardiff	West Ham	0.321429	0.464286	2.665556	3.300877	3.101404	A	H
1	Crystal Palace	Brighton	0.392857	0.370370	2.231759	3.344211	3.743860	H	H
2	Huddersfield	Bournemouth	0.196429	0.428571	3.110556	3.361754	2.630263	A	A
3	Leicester	Fulham	0.446429	0.232143	1.815000	3.637368	5.066667	H	H
4	Man City	Watford	0.821429	0.517857	1.380926	5.542368	10.513684	H	H

	Team	Opponent	Goals	odd_team	odd_draw	odd_opponent	Home
0	Man United	Leicester	2	1.561667	3.905000	7.083333	1
1	Bournemouth	Cardiff	2	1.895000	3.538333	4.388333	1
2	Fulham	Crystal Palace	0	2.466667	3.360000	2.950000	1
3	Huddersfield	Chelsea	0	6.276667	3.970000	1.590000	1
4	Newcastle	Tottenham	1	3.821667	3.420000	2.053333	1

Dep. Variable:	Goals	No. Observations:	867
Model:	GLM	Df Residuals:	824
Model Family:	Poisson	Df Model:	42
Link Function:	log	Scale:	1.0000
Method:	IRLS	Log-Likelihood:	-1203.1
Date:	Wed, 13 Mar 2019	Deviance:	811.73
Time:	02:11:20	Pearson chi2:	692.
No. Iterations:	5	Covariance Type:	nonrobust

	Team	Point
0	Man City	71
1	Liverpool	70
2	Tottenham	61
3	Man United	58
4	Arsenal	57
5	Chelsea	56
6	Wolves	43
7	Watford	43
8	West Ham	39
9	Everton	37
10	Leicester	35
11	Bournemouth	34
12	Newcastle	31
13	Crystal Palace	33
14	Brighton	30
15	Burnley	30
16	Southampton	27
17	Cardiff	25
18	Fulham	17
19	Huddersfield	14

	HomeTeam	AwayTeam	predicted_odd_home	predicted_odd_draw	predicted_odd_away
0	Cardiff	West Ham	2.759022	3.306794	2.971140
1	Crystal Palace	Brighton	2.331841	3.308602	3.557281
2	Huddersfield	Bournemouth	3.177163	3.411348	2.602807
3	Leicester	Fulham	1.789015	3.662285	5.169737
4	Man City	Watford	1.441855	5.221275	9.741579

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-0.0160	0.122	-0.131	0.896	-0.255	0.223
HomeAvgAllTimeSoFar	0.3790	0.114	3.314	0.001	0.155	0.603
HomeHighAllTimeSoFar	0.0088	0.031	0.282	0.778	-0.052	0.070
HomeLowAllTimeSoFar	-0.1273	0.093	-1.363	0.173	-0.310	0.056
AwayAvgAllTimeSoFar	-0.4953	0.134	-3.688	0.000	-0.759	-0.232
AwayHighAllTimeSoFar	0.0206	0.030	0.690	0.490	-0.038	0.079
AwayLowAllTimeSoFar	0.0202	0.107	0.188	0.851	-0.190	0.230
HomeTotalGoals	-0.0016	0.006	-0.274	0.784	-0.013	0.010
HomeTotalShots	0.0011	0.001	1.368	0.171	-0.000	0.003
HomeTotalAccuracy	-1.6430	1.321	-1.244	0.213	-4.231	0.946
AwayTotalGoals	-0.0033	0.006	-0.516	0.606	-0.016	0.009
AwayTotalShots	-0.0007	0.001	-0.864	0.387	-0.002	0.001
AwayTotalAccuracy	3.6994	1.458	2.537	0.011	0.842	6.557

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	1.5703	0.416	3.779	0.000	0.756	2.385
Team[T.Bournemouth]	-0.5065	0.176	-2.877	0.004	-0.852	-0.161
Team[T.Brighton]	-1.0300	0.214	-4.806	0.000	-1.450	-0.610
Team[T.Burnley]	-1.0552	0.223	-4.724	0.000	-1.493	-0.617
Team[T.Cardiff]	-1.2511	0.229	-5.452	0.000	-1.701	-0.801
Team[T.Chelsea]	-0.2131	0.161	-1.320	0.187	-0.530	0.103
Team[T.Crystal Palace]	-0.9951	0.201	-4.957	0.000	-1.389	-0.602
Team[T.Everton]	-0.6696	0.178	-3.767	0.000	-1.018	-0.321
Team[T.Fulham]	-0.9722	0.204	-4.769	0.000	-1.372	-0.573
Team[T.Huddersfield]	-1.9282	0.281	-6.869	0.000	-2.478	-1.378
Team[T.Leicester]	-0.8200	0.191	-4.284	0.000	-1.195	-0.445
Team[T.Liverpool]	0.1286	0.166	0.777	0.437	-0.196	0.453
Team[T.Man City]	0.4775	0.194	2.464	0.014	0.098	0.857
Team[T.Man United]	-0.2308	0.157	-1.472	0.141	-0.538	0.076
Team[T.Newcastle]	-1.1982	0.217	-5.529	0.000	-1.623	-0.773
Team[T.Southampton]	-0.9464	0.200	-4.725	0.000	-1.339	-0.554
Team[T.Tottenham]	-0.2406	0.157	-1.535	0.125	-0.548	0.067
Team[T.Watford]	-0.6563	0.184	-3.573	0.000	-1.016	-0.296
Team[T.West Ham]	-0.6505	0.180	-3.608	0.000	-1.004	-0.297
Team[T.Wolves]	-0.7613	0.189	-4.030	0.000	-1.132	-0.391
Opponent[T.Bournemouth]	0.3064	0.181	1.694	0.090	-0.048	0.661
Opponent[T.Brighton]	-0.0174	0.219	-0.079	0.937	-0.447	0.412
Opponent[T.Burnley]	0.1679	0.221	0.759	0.448	-0.266	0.601
Opponent[T.Cardiff]	0.2289	0.219	1.047	0.295	-0.200	0.657
Opponent[T.Chelsea]	-0.3939	0.196	-2.008	0.045	-0.778	-0.009
Opponent[T.Crystal Palace]	-0.1222	0.199	-0.613	0.540	-0.513	0.269
Opponent[T.Everton]	-0.1937	0.190	-1.019	0.308	-0.566	0.179
Opponent[T.Fulham]	0.5029	0.189	2.666	0.008	0.133	0.873
Opponent[T.Huddersfield]	0.0878	0.229	0.384	0.701	-0.360	0.536
Opponent[T.Leicester]	-0.1570	0.198	-0.793	0.428	-0.545	0.231
Opponent[T.Liverpool]	-1.3267	0.258	-5.137	0.000	-1.833	-0.820
Opponent[T.Man City]	-1.0643	0.280	-3.798	0.000	-1.614	-0.515
Opponent[T.Man United]	-0.2619	0.186	-1.406	0.160	-0.627	0.103
Opponent[T.Newcastle]	-0.2989	0.220	-1.357	0.175	-0.731	0.133
Opponent[T.Southampton]	0.0245	0.200	0.122	0.903	-0.367	0.416
Opponent[T.Tottenham]	-0.4211	0.197	-2.138	0.033	-0.807	-0.035
Opponent[T.Watford]	-0.0696	0.195	-0.356	0.722	-0.453	0.313
Opponent[T.West Ham]	-0.0401	0.193	-0.207	0.836	-0.419	0.339
Opponent[T.Wolves]	-0.3295	0.206	-1.600	0.110	-0.733	0.074