Soccer and Machine Learning: 2 hot topics

You’ve probably heard about the 2018 FIFA World Cup in Russia everywhere during the last few months. And, if you are a techie, I guess you have also noticed that Machine Learning and Artificial Intelligence are buzzwords too. So, what better way to start off 2018 than by writing a post that combines these two hot topics in a machine learning tutorial? To do that, I’m going to leverage a dataset from the FIFA 18 video game.

My goal is to show you how to create a predictive model that is able to forecast how good a soccer player is based on their game statistics (using Python in a Jupyter Notebook). Fifa is one of the most well known video games around the world. You’ve probably played it at least once, right? Although I’m not a fan of video games, when I saw the dataset collected by Aman Srivastava, I immediately thought that it was great for practicing some of the basics of any Machine Learning Project. The Fifa 18 dataset was scraped from the website sofifa.com containing statistics and more than 70 attributes for each player in the Full version of FIFA 18. In this Github Project you can access the csv files that compose the dataset and some jupyter notebooks with the python code used to collect the data. Having said this, now let’s start!

Getting started with the machine learning tutorial

In our recently published Machine Learning e-book we explained most of the basic concepts related to smart systems and how machine learning techniques could add smart capabilities to many kinds of systems in almost any domain that you can imagine. Among other things, we learned that a typical workflow for a Machine Learning Project usually looks like the one shown in the image below:

In this post we’ll go through a simplified view of this whole process, with a practical implementation of each phase. The main objective is to show most of the common steps performed during any machine learning project. Therefore, you could use it as a starting point in case you need to address a machine learning project from scratch.

In what follows, we will:

Apply some preprocessing steps to prepare the data.
Then, we will perform a descriptive analysis of the data to better understand the main characteristics that they have.
We will continue by practicing how to train different machine learning models using scikit-learn. It is one of the most popular python libraries for machine learning. We will also use a subset of the dataset for training purposes.
Then, we will iterate and evaluate the learned models by using unseen data. Later, we will compare them until we find a good model that meets our expectations.
Once we have chosen the candidate model, we will use it to perform predictions and to create a simple web application that consumes this predictive model.

At the end, we will arrive at a fun smart app like the one below. It will be able to predict how good a soccer player is based on their game statistics. Sounds cool, yeah? Well, let’s dive in!

1. Preparing the Data

Generally, any machine learning project has an initial stage known as data preparation, data cleaning or the preprocessing phase.

Its main objective is to collect and prepare the data that the learning algorithms will use during the training phase. In our practical and concrete example, an important part of this was already addressed by Aman Srivastava when he scraped different pages from the website sofifa.com. In his Github Project you can access some of the jupyter notebooks with the python code that acts as the data preprocessing modules that were applied to get and generate the original dataset for our project. Below, as an example, we can see the module that does the web scraping of the raw data (html format) and how it transforms the data into a Pandas dataframe (Pandas is a famous Python library for data processing). Finally it generates a csv file with the results. In a way, this data preparation step resembles the classic ETL (extract, transform, load) database processes.

Python Preprocessing module from crawler.ipynb

Many times, multiple sources need to be consumed to collect relevant data for our algorithms. The problem is that different sources have different data quality, different formats, languages, units, etc.

Common issues that we generally face during the data preparation phase:

Format and structure normalization
Detect and fix missing values
Duplicates removal
Units normalization
Constraints validations
Anomaly detection and removal
Study of features importance/relevance
Dimensionality reduction, feature selection & extraction
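
A couple of the issues listed above can be spotted with one-liners in Pandas. The following is a minimal sketch on a made-up toy table (the column names and values are assumptions for illustration only, not part of the FIFA dataset):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "name":   ["Ann", "Bob", "Bob", "Cid"],
    "height": [170.0, 180.0, 180.0, None],    # one missing value
})

missing_per_column = toy.isnull().sum()       # missing values per column
duplicate_rows = int(toy.duplicated().sum())  # rows that repeat an earlier row

# Drop repeated rows first, then rows that still have missing values
clean = toy.drop_duplicates().dropna()

print(missing_per_column)
print("duplicates:", duplicate_rows)
print(clean)
```

Whether to drop incomplete rows or fill them in depends on the project; dropping is simply the easiest option to show.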

Below, we can see a second python script that we added specifically for our project; it continues the pipeline of the preprocessing phase. It takes the complete dataset from the merged CompleteDataset.csv file and applies some cleanup (anomaly removal) and additional transformations (units normalization) to the data for some of the initial features (columns) that we are going to use.

The Overall column refers to a player’s current rating/ability. That’s what we are going to use as a measure of how good the player is. We will consider Overall as the observable dependent variable that we want to understand. After learning how it relates to the player’s characteristics (also known as features, predictors or independent variables, which explain the dependent value), we will predict the results. In order to keep it simple, for now we will avoid any automated dimensionality reduction techniques to select the most relevant features of a player. We’ll follow our instincts and common sense and choose the Value column as a good feature to start with, since we can easily imagine a strong relationship between a player’s market value and how good this player is. Later we will add other features, like the age or how good the player is at finishing, to try to improve our predictions.

Take a look:

import numpy as np # Linear algebra
import pandas as pd # Data processing

col_types = {'Overall': np.int32, 'Age': np.int32}

#read only the columns that we will use (name and photo are loaded only to visualize the results at the end)
df = pd.read_csv("CompleteDataset.csv", usecols=['Name', 'Photo', 'Value', 'Overall', 'Age', 'Finishing'], dtype=col_types)

#remove € character, leave just numbers
df['Value'] = df['Value'].str.replace('€', '')

#parse string for millions and thousands to numeric values
def parseValue(strVal):
    if 'M' in strVal:
        return int(float(strVal.replace('M', '')) * 1000000)
    elif 'K' in strVal:
        return int(float(strVal.replace('K', '')) * 1000)
    else:
        return int(strVal)   

df['Value'] = df['Value'].apply(lambda x: parseValue(x))
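
One detail worth knowing about the conversion above: int() truncates, so a floating point product that lands a hair below a whole number would lose one euro. The sketch below is a hypothetical standalone variant (parse_value is my own name, it is not part of the original script) that rounds instead and can be tried on a few sample strings:

```python
def parse_value(text):
    """Convert strings such as '€95.5M' or '€50K' to an integer number of euros."""
    text = text.replace("€", "").strip()
    multipliers = {"M": 1_000_000, "K": 1_000}
    suffix = text[-1:]
    if suffix in multipliers:
        # round() instead of int() so that tiny floating point errors are not truncated
        return round(float(text[:-1]) * multipliers[suffix])
    return round(float(text))

for sample in ["€95.5M", "€50K", "€0"]:
    print(sample, "->", parse_value(sample))
```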

#check if there are null/missing values and how many in each column
df.isnull().sum()

Name         0
Age          0
Photo        0
Overall      0
Value        0
Finishing    0
dtype: int64

Great!

We can see that we do not have null or missing values. If we had any, we could remove them before continuing. To complete this phase we are going to look for anomalous entries. For instance, we know that nobody can have a market value lower than or equal to zero. Such values are bad entries and we need to remove them, since they would only add noise and could distort what the model learns.

#Nobody can have a value lower than or equal to zero, so those values are bad entries and we need to remove them
df = df.loc[df.Value > 0]

Also, we have observed that there are a few non-numeric entries in the Finishing column, so we are going to exclude them.

def between_1_and_99(s):
    try:
        n = int(s)
        return 1 <= n <= 99
    except (ValueError, TypeError):
        #not a number at all (for instance free text or a missing value)
        return False

#remove not valid entries for Finishing (copy() avoids a SettingWithCopyWarning in the assignment below)
df = df.loc[df['Finishing'].apply(between_1_and_99)].copy()

#now we can define Finishing as integers
df['Finishing'] = df['Finishing'].astype('int')

We could continue executing many other validations and transformations to the data, for instance, check that all values in the Overall column are in the range between 0 and 100, that there are no duplicated entries, etc. But, remember that this is a simplified view of a typical workflow of a machine learning project because we simply want to demonstrate the fundamental ideas today.
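
As a rough idea of what those two extra validations could look like, here is a minimal sketch on a made-up toy table (the rows are invented for illustration; only the Name and Overall column names come from our dataset):

```python
import pandas as pd

# Toy data, invented for illustration only
players = pd.DataFrame({
    "Name":    ["A", "B", "B", "C"],
    "Overall": [88, 72, 72, 140],    # 140 is outside the valid range
})

in_range = players["Overall"].between(0, 100)   # inclusive on both ends
is_repeat = players.duplicated()                # True for rows that repeat an earlier row

print("out of range rows:", int((~in_range).sum()))
print("duplicated rows:", int(is_repeat.sum()))

# Keep only rows that pass both checks
validated = players[in_range & ~is_repeat]
print(validated)
```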

Let’s see how our data looks in the first few rows:

df.head()

	Name	Age	Photo	Overall	Value	Finishing
0	Cristiano Ronaldo	32	https://cdn.sofifa.org/48/18/players/20801.png	94	95500000	94
1	L. Messi	30	https://cdn.sofifa.org/48/18/players/158023.png	93	105000000	95
2	Neymar	25	https://cdn.sofifa.org/48/18/players/190871.png	92	123000000	89
3	L. Suárez	30	https://cdn.sofifa.org/48/18/players/176580.png	92	97000000	94
4	M. Neuer	31	https://cdn.sofifa.org/48/18/players/167495.png	92	61000000	13

If you find some other interesting preprocessing steps, go ahead! We encourage you to practice with Python and Pandas and let us know how it goes!

Expert corner:

If you are starting with this kind of project, I strongly recommend using the Anaconda distribution, which simplifies the installation of the most popular and important packages for big-data projects, data-science projects and predictive analysis.

Some important packages included in the Anaconda distribution:

NumPy
SciPy
Matplotlib
Jupyter
Scikit-learn (the library that we will use later in this post when creating the predictive models)

Some good IDEs to start with are:

Spyder (included in Anaconda)
Jupyter Notebook (we actually used it for this post!)
Python Tools for Visual Studio (I personally like it very much)

So, take a look at them and choose the one suitable for you and that you’re comfortable with!

1.1 Understanding the data

Now that the data is ready, before we start applying machine learning algorithms, a good approach is to first explore, play with, and query the data to get to know it better. This process is known as descriptive analysis. The main objective here is to have a very good understanding of our data. This means understanding what kind of distribution it has and getting some statistics, among other things. If you skip this phase, you’ll feel like you’re on a blind date with it later. So, let’s explore the data and get to know what we’re working with!

df.describe()

	Age	Overall	Value	Finishing
count	17611.000000	17611.000000	1.761100e+04	17611.000000
mean	25.106013	66.229175	2.417993e+06	45.256487
std	4.602144	7.003240	5.392879e+06	19.447567
min	16.000000	46.000000	1.000000e+04	2.000000
25%	21.000000	62.000000	3.250000e+05	29.000000
50%	25.000000	66.000000	7.000000e+05	48.000000
75%	28.000000	71.000000	2.100000e+06	61.000000
max	44.000000	94.000000	1.230000e+08	95.000000

As you can see above, by using the describe operation provided by any Pandas dataframe we can get a summary of some important properties of our data like:

the number of rows (a.k.a observations)
average values
minimums and maximums
percentile values
standard deviation

This operation can be applied to the whole dataframe or you can select particular features, for instance, if you want to see statistics only for the Overall feature you can do it in this way:

df.Overall.describe()

count    17611.000000
mean        66.229175
std          7.003240
min         46.000000
25%         62.000000
50%         66.000000
75%         71.000000
max         94.000000
Name: Overall, dtype: float64

The Overall column refers to a player’s current rating/ability and since it’s the variable that we will use as a measure of how good or bad a player is, then we can get some initial interpretations of our data from the above statistics like the following statements:

We have 17611 observations (players) under study
The average player’s Overall value is about 66
The worst player has an Overall value of 46
The best player has an Overall value of 94
Roughly one quarter of the players have an Overall value of 71 or more
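
Each of these statements maps to a single Pandas call. Here is a minimal sketch on a handful of made-up ratings (the numbers are invented for illustration, they are not taken from the dataset):

```python
import pandas as pd

# Toy ratings, invented for illustration only
overall = pd.Series([46, 60, 62, 66, 66, 70, 71, 94], name="Overall")

print("observations:", overall.count())
print("average:", overall.mean())
print("worst / best:", overall.min(), "/", overall.max())

# 75th percentile: about three quarters of the values are at or below it
q75 = overall.quantile(0.75)
print("75th percentile:", q75)
print("share above it:", (overall > q75).mean())
```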

Right now I’m a bit curious to know who the best and worst players are.

Let’s see who they are by querying our data.

df.nlargest(5, columns='Overall')

	Name	Age	Photo	Overall	Value	Finishing
0	Cristiano Ronaldo	32	https://cdn.sofifa.org/48/18/players/20801.png	94	95500000	94
1	L. Messi	30	https://cdn.sofifa.org/48/18/players/158023.png	93	105000000	95
2	Neymar	25	https://cdn.sofifa.org/48/18/players/190871.png	92	123000000	89
3	L. Suárez	30	https://cdn.sofifa.org/48/18/players/176580.png	92	97000000	94
4	M. Neuer	31	https://cdn.sofifa.org/48/18/players/167495.png	92	61000000	13

df.nsmallest(5, columns='Overall')

	Name	Age	Photo	Overall	Value	Finishing
17973	T. Sawyer	18	https://cdn.sofifa.org/48/18/players/240403.png	46	50000	35
17974	J. Keeble	18	https://cdn.sofifa.org/48/18/players/240404.png	46	40000	15
17975	T. Käßemodel	28	https://cdn.sofifa.org/48/18/players/235352.png	46	30000	40
17976	A. Kelsey	17	https://cdn.sofifa.org/48/18/players/237463.png	46	50000	5
17978	J. Young	17	https://cdn.sofifa.org/48/18/players/231381.png	46	60000	47

Another thing that could be helpful when understanding the data is to see its distribution and try to figure out if it fits some well-known theoretical distribution. In order to do that, histograms can give us a good idea of the underlying data distribution. What they basically do is split the possible values/results into different bins and count the number of occurrences (observations) where the variable under study falls into each bin. We can do this easily by using matplotlib, one of the most popular Python libraries for 2D plotting.

import matplotlib.pyplot as plt

plt.hist(df.Overall, bins=16, alpha=0.6, color='y')
plt.title("#Players per Overall")
plt.xlabel("Overall")
plt.ylabel("Count")

plt.show()
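
If you want to see the binning itself rather than the picture, NumPy’s histogram function returns the counts and the bin edges directly. A minimal sketch on a few made-up ratings (invented for illustration only):

```python
import numpy as np

# Toy ratings, invented for illustration only
ratings = np.array([46, 52, 58, 61, 64, 66, 66, 67, 70, 73, 81, 94])

# Four equal-width bins between the minimum and the maximum
counts, edges = np.histogram(ratings, bins=4)

# Every bin includes its left edge; only the last one also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"from {left:.0f} to {right:.0f}: {count}")
```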

This histogram probably reminds you of the bell shape that comes with a normal distribution. Normal distributions are very common when studying people’s traits (height, intelligence, etc). In this context, the concept of “normality” reflects the fact that the majority of the individuals have values near the mean. Similarly, it means that the number of individuals decreases quickly as we move away from that mean. Let’s try to see how well our data fits a normal distribution. In order to do this, we can leverage the information provided previously when executing the describe operation, in particular the standard deviation and the mean value. Let’s see those important values again:

overall_mean = df.Overall.mean()
overall_std = df.Overall.std()
print('The mean value for the Overall feature is ', overall_mean, ' and the standard deviation is ', overall_std)

The mean value for the Overall feature is  66.22917494747601  and the standard deviation is  7.003240305890887

Let’s remember how the theoretical normal distribution looks and how the mean and standard deviation relate.

Source: Wikipedia

We can observe that the theoretical normal distribution is symmetrical around the mean: approximately 68% of the values fall within +/- 1 std of the mean, 95% within +/- 2 std and 99.7% within +/- 3 std. As explained in the 68-95-99.7 rule, this can be expressed mathematically as follows:
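
In its usual form, with the standard rounded percentages, the rule reads:

```latex
\begin{aligned}
\Pr(\mu - 1\sigma \le X \le \mu + 1\sigma) &\approx 68.27\% \\
\Pr(\mu - 2\sigma \le X \le \mu + 2\sigma) &\approx 95.45\% \\
\Pr(\mu - 3\sigma \le X \le \mu + 3\sigma) &\approx 99.73\%
\end{aligned}
```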

Where X is an observation from a normally distributed random variable, μ is the mean of the distribution, and σ is its standard deviation. Previously, when describing the main properties of our data, we observed that the 50th percentile was 66 and the mean was 66.2. A median this close to the mean suggests that about half of the observations fall on each side of the mean, as in a normal distribution. Now, let’s see how well our data follows the previous rule:

#number of observations in +/-1 std, +/- 2std and +/- 3 std
std1_count = (df[(df.Overall >= (overall_mean - 1*overall_std)) & (df.Overall <= overall_mean + 1*overall_std)]['Overall']).count()
std2_count = (df[(df.Overall >= (overall_mean - 2*overall_std)) & (df.Overall <= overall_mean + 2*overall_std)]['Overall']).count()
std3_count = (df[(df.Overall >= (overall_mean - 3*overall_std)) & (df.Overall <= overall_mean + 3*overall_std)]['Overall']).count()

#percentage of observations in each range
overall_total_count = df.Overall.count()
percentage_std1 = std1_count/overall_total_count * 100 #empirically it should be 68% approx
percentage_std2 = std2_count/overall_total_count * 100 #empirically it should be 95% approx
percentage_std3 = std3_count/overall_total_count * 100 #empirically it should be 99.7% approx

print('1 std % : ', percentage_std1, ', 2 std % : ', percentage_std2, ', 3 std % : ', percentage_std3)

1 std % :  68.951223667 , 2 std % :  94.9236272784 , 3 std % :  99.8126171143

Nice! The Overall feature looks very normal!
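
If you would like to compare those percentages with the exact theoretical ones, they can be derived from the error function in Python’s standard library. This is a small sketch of mine (normal_share_within is a hypothetical helper name), based on the fact that for a normal variable the probability of falling within k standard deviations of the mean is erf(k/√2):

```python
import math

def normal_share_within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"+/- {k} std: {normal_share_within(k) * 100:.2f}%")
```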

In some way, this was expected if we think that it’s a property for which the concept of “normality” could apply perfectly (like other traits of a person, for instance height, weight, intelligence, etc). So, let’s confirm visually how well our data fits a normal distribution:

from scipy.stats import norm

#plot the histogram as a density, so that it is comparable with the curve below
#(density=True was called normed=True in old Matplotlib versions)
plt.hist(df.Overall, bins=16, density=True, alpha=0.6, color='g')
plt.xlabel("Overall")
plt.ylabel("Density")

# Plot the probability density function for norm
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, overall_mean, overall_std)
plt.plot(x, p, linewidth=2, color='r')
title = "#Players per Overall, Fit results: mean = %.2f,  std = %.2f" % (overall_mean, overall_std)
plt.title(title)

plt.show()

Before ending this section, I’d like to highlight that sometimes the data preparation stage is underestimated (mainly when you are starting with your first machine learning projects) but you should know that this innocent-seeming first stage most of the time takes more than half of the total project time (sometimes even up to 60-80%!). So, keep that in mind and give this stage the importance it deserves.

2. Machine learning algorithms for building our predictive model

Now it’s time to create our predictive model. That is, to create a mathematical model which links our observed/dependent value/response (the Overall in our example) with the other features available (also called predictors or independent variables, like the Value column in this example). Machine Learning models are created during a learning phase, also known as the training process. As we described in our Machine Learning e-book, in its simplest form, the algorithms used to generate these models can be supervised or unsupervised (depending on their training mode).

In this case, we are working on a regression problem, whose algorithms fall in the supervised learning category (because we can train a model using observations where the expected result is already known, and we can “teach” it to our algorithm). There are several algorithms that can be used to solve a regression problem, as we are about to see. But first, let’s split our dataset into two subsets. This is a common technique where we choose part of the dataset (generally 80%) for training purposes and keep the rest (approximately 20%) as unseen data to evaluate how good our model is.

There are also other techniques, like cross-validation, that give a more reliable estimate of how the model behaves on unseen data and so help you detect overfitting. For that, you can take a look at this article that gives a nice introduction to the Train/Test split technique, cross-validation, and the overfitting issue. That said, let’s continue by splitting our dataset. We can do this very easily by using a pre-built function in the model_selection module as follows:

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.20, random_state=99)

xtrain = train[['Value']]
ytrain = train[['Overall']]

xtest = test[['Value']]
ytest = test[['Overall']]
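
As for the cross-validation mentioned earlier, here is a minimal sketch of how it looks in scikit-learn. We don’t use it in the rest of this post, and the data here is synthetic (generated only for illustration), so take it as a starting point rather than as part of our pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, invented for illustration: y depends linearly on x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, size=200)

# 5 folds: each fold is used once for validation while the other 4 train the model
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=folds, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print("mean R2:", scores.mean())
```

Compared with a single train/test split, the average over the folds is less dependent on which rows happened to land in the test set.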

Since it’s reasonable to think that there could be a linear relationship between a player’s market value and how good they are, then we can create an initial model applying linear regression, one of the simplest regression models.

2.1 Linear Regression

For this and also for the rest of the models that we will see in this post, I’m going to use scikit-learn, a popular Python package that provides implementations of most of the commonly used state-of-the-art machine learning algorithms. We can train a default linear regression model using scikit-learn with just a couple of lines of code as follows:

# Create linear regression object
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(xtrain, ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

As you can see, we use the fit method to train our model. We just need to pass the independent features (xtrain) and the dependent values for them (ytrain). Later we will see that scikit-learn uses the same approach to train different kinds of models. You’ll love scikit-learn because of this: it lets you swap and train different models with a common approach and very similar code. Now we can use the trained model and the predict method (also available for other kinds of predictive models) to predict how good the players in the test set are. Remember that we reserved 20% of the whole dataset for evaluation/testing purposes and that it was not part of the training process.

A good model should be able to understand generic hidden patterns in the data and also work well for unseen data.

# Make predictions using the testing set
y_pred = regr.predict(xtest)
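
Behind the scenes, a fitted linear regression is nothing more than a slope per feature and an intercept, which scikit-learn exposes as coef_ and intercept_. A minimal sketch on noise-free synthetic data (invented for illustration, so that the expected line y = 2x + 1 is known in advance):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data invented for illustration: y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0

model = LinearRegression().fit(X, y)
prediction = model.predict(np.array([[10.0]]))[0]

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("prediction for x=10:", prediction)
```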

Now let’s plot the line that represents our linear model!

We will draw it on top of the real expected values that we know, so that we can compare both.

plt.scatter(xtest, ytest,  color='black')
plt.plot(xtest, y_pred, color='blue', linewidth=3)
plt.xlabel("Value")
plt.ylabel("Overall")
plt.show()

At first glance, it may seem like this model is not very good 🙁 But if we can accept a model that produces a few bad predictions while most of them are good, then this model is not as bad as we thought. That’s because the predictions seem to be pretty accurate for players with a value lower than €30M. Although it cannot be appreciated very well in the plot, those players represent 99.33% of the total!

print('% of players with a value lower than €30M: ', df[df.Value <= 30000000].Value.count() / df.Value.count() * 100, '%')

% of players with a value lower than €30M:  99.3299642269 %

We can also visualize a histogram that shows the number of players in different ranges of values. It’s easy to see that most players have a value lower than €10M and that the number of players with a value greater than €30M is almost insignificant.

plt.hist(test.Value)
plt.title("#Players per Value")
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()

At this point, you’re probably thinking that having some metric to represent how good our model is would be fantastic.

If so, you are right!

There are different metrics to evaluate this. For regression models, two of the most common are the mean squared error (MSE) and the R2 score. In general, we will want low values for the MSE and high values for the R2. That said, let’s use them and see the numbers for our model.

import numpy as np # Linear algebra
from sklearn.metrics import mean_squared_error, r2_score #common metrics to evaluate regression models

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(ytest, y_pred))

# R2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(ytest, y_pred))

Mean squared error: 28.88
Variance score: 0.41
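
To make these two numbers less of a black box, here is how both metrics can be computed by hand. This is a plain Python sketch with hypothetical helper names (mse and r2) and made-up values, not what scikit-learn runs internally:

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between real and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 minus the model's squared error relative to always predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values, invented for illustration only
real = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

print("MSE:", mse(real, predicted))
print("R2:", round(r2(real, predicted), 4))
```

Read this way, an R2 of 0.41 means that our linear model removes about 41% of the squared error we would get by always predicting the average Overall.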

Now it’s time to decide whether this model is good enough for an initial app prototype. If it isn’t, you would look for other algorithms and techniques to see if you can improve on the initial model. In what follows, we will practice with some other techniques in order to illustrate a general machine learning project workflow. With this objective, we will train a Ridge regression model that approximates the model’s function through polynomial interpolation, and then an SVR model, which is also a common option for nonlinear problems.

2.2 Polynomial interpolation and Ridge Regression

If you’re familiar with calculus, you’ll know that a common and efficient way of computing a complex function is to approximate it with polynomials. We can take this idea and assume that the data points in our dataset are points of a complex math function. Our goal is to find a polynomial that fits the curve of that function quite well. In linear regression models, we can use a trick known as basis functions that allows us to model nonlinear problems in terms of something linear. The trick consists in transforming the basic linear regression model for a feature vector X = (x1,…,xn) from a weighted sum of the features themselves into a weighted sum of functions of those features (the basis functions).

Note that the basic model is actually a special case of the general one, obtained when each basis function simply returns one of the original features.

What is really interesting about the previous transformation is that we can use nonlinear functions, and the model itself is still linear in its coefficients (because the coefficients never multiply or divide each other).
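
For reference, this is how the transformation is usually written in textbooks. The symbols w_j for the coefficients, φ_j for the basis functions and M for the number of coefficients are my notational assumptions here:

```latex
\begin{aligned}
\text{basic model:} \quad & y(X, w) = w_0 + w_1 x_1 + \dots + w_n x_n \\
\text{with basis functions:} \quad & y(X, w) = w_0 + \sum_{j=1}^{M-1} w_j \, \phi_j(X) \\
\text{special case:} \quad & \phi_j(X) = x_j \quad \text{with } M - 1 = n
\end{aligned}
```

In both forms the output is a plain weighted sum; only the quantities being weighted change.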

Now let’s move this math to code!

In this case, we’ll use polynomials as our basis functions and the Ridge model to solve the regression problem. Ridge regression penalizes large coefficients, so using it (instead of the standard linear regression model) can help minimize overfitting, which is a possible side effect of adding basis functions to our regression model.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

pol = make_pipeline(PolynomialFeatures(6), linear_model.Ridge())
pol.fit(xtrain, ytrain)
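
To get a feeling for what the PolynomialFeatures step does inside that pipeline, here is a tiny sketch with a single made-up sample (degree 3 instead of 6, just to keep the output short):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with a single feature x = 2 (invented for illustration)
X = np.array([[2.0]])

# Each row is expanded into the columns 1, x, x^2, x^3
expanded = PolynomialFeatures(degree=3).fit_transform(X)
print(expanded)
```

The Ridge model that follows in the pipeline then fits an ordinary linear model on those expanded columns.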

I settled on 6 for the degree parameter after some manual experimentation. If you have some time, I recommend reading up on hyperparameter optimization, the phase where you search for the best values to configure when training a model.
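
If you want to automate that experimentation, scikit-learn’s GridSearchCV can try several combinations and score each of them with cross-validation. This is only a sketch under my own assumptions: the data is synthetic and the candidate values in the grid are arbitrary, so it does not reproduce how the degree above was chosen:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data, invented for illustration: a cubic curve plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(150, 1))
y = X[:, 0] ** 3 - X[:, 0] + rng.normal(0, 0.2, size=150)

pipeline = make_pipeline(PolynomialFeatures(), Ridge())

# make_pipeline names each step after its lowercased class name
param_grid = {
    "polynomialfeatures__degree": [1, 2, 3, 4, 5, 6],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

# Every combination is trained and scored with 5-fold cross-validation
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated R2:", search.best_score_)
```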

y_pol = pol.predict(xtest)
plt.scatter(xtest, ytest,  color='black')
plt.scatter(xtest, y_pol,  color='blue')
plt.xlabel("Value")
plt.ylabel("Overall")
plt.show()

Visually, the polynomial regression looks better than the standard linear regression. Like before, let’s see how good this model is by calculating the MSE and R2 metrics.

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(ytest, y_pol))

# R2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(ytest, y_pol))

Mean squared error: 13.07
Variance score: 0.73

Very good! We have reduced the MSE and increased the R2 score, as we intended 🙂
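As a quick reminder of what these two numbers measure, here is a small sketch that computes both by hand and checks them against scikit-learn. The five values in y_true and y_hat are made up for illustration and have nothing to do with the FIFA data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up example values (assumption), only to illustrate the formulas
y_true = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
y_hat = np.array([62.0, 64.0, 71.0, 73.0, 79.0])

residuals = y_true - y_hat

# MSE: the average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("Manual MSE: %.3f, sklearn MSE: %.3f" % (mse_manual, mean_squared_error(y_true, y_hat)))
print("Manual R2: %.3f, sklearn R2: %.3f" % (r2_manual, r2_score(y_true, y_hat)))
```

A lower MSE means smaller errors on average, while an R2 closer to 1 means the model explains more of the spread of the target around its mean.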

2.3 Support Vector Regression

The last model that we will try is Support Vector Regression, which can be seen as an extension of the Support Vector Machine method used for classification problems. In particular, we will use the SVR implementation (one of the three available implementations). Using SVR with an RBF (radial basis function) kernel is a common approach for solving nonlinear problems. We can define and train an SVR model with an RBF kernel as follows:

from sklearn.svm import SVR

svr_rbf = SVR(kernel='rbf', gamma=1e-3, C=100, epsilon=0.1)
svr_rbf.fit(xtrain, ytrain.values.ravel())

I’m leaving the different parameters of the SVR model out of this explanation. I defined the values above after a couple of manual experiments, but you could probably find a better combination. For now this configuration is good enough. To keep this simple, we are not including a hyper-parameter optimization phase here. However, it’s something that you probably should do in a real project. As with the previous models, we have the predict method available to get predictions:

y_rbf = svr_rbf.predict(xtest)

Same as before, let’s plot both the actual values and the predicted ones.

plt.scatter(xtest, ytest,  color='black')
plt.scatter(xtest, y_rbf,  color='blue')
plt.xlabel("Value")
plt.ylabel("Overall")
plt.show()

Following the same approach again, let’s calculate the MSE and R2 metrics:

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(ytest, y_rbf))

# R2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(ytest, y_rbf))

Mean squared error: 6.00
Variance score: 0.88

This is really very good! We were able to reduce the MSE considerably (from 13.07 to 6.00) and also increase the R2 score (from 0.73 to 0.88).
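One thing worth experimenting with: RBF kernels depend on distances between samples, so a feature measured in millions (like Value) can behave very differently from one that has been rescaled. The results in this post were obtained without scaling, so treat the following as an optional variation and not as the method used here. It is a minimal sketch on synthetic data (value and overall are made-up stand-ins), assuming scikit-learn's StandardScaler inside a pipeline.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): a single feature on a very large scale
rng = np.random.default_rng(42)
value = rng.uniform(1e4, 1e7, size=300)
overall = 60 + 20 * np.sqrt(value / 1e7) + rng.normal(0, 0.5, size=300)
x_demo = value.reshape(-1, 1)
x_fit, x_eval = x_demo[:200], x_demo[200:]
y_fit, y_eval = overall[:200], overall[200:]

# Same hyper-parameters as in the post, applied to the raw (unscaled) feature
raw_svr = SVR(kernel="rbf", gamma=1e-3, C=100, epsilon=0.1)
raw_svr.fit(x_fit, y_fit)
r2_raw = r2_score(y_eval, raw_svr.predict(x_eval))

# Standardize the feature first; gamma is left at scikit-learn's default here
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100, epsilon=0.1))
scaled_svr.fit(x_fit, y_fit)
r2_scaled = r2_score(y_eval, scaled_svr.predict(x_eval))

print("R2 without scaling: %.2f" % r2_raw)
print("R2 with scaling: %.2f" % r2_scaled)
```

On this synthetic data the unscaled model has almost no nearby training points to lean on, so scaling helps a lot. On the real dataset the effect may be different, which is why it is worth measuring instead of assuming.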

2.4 Adding more features to improve predictions

Now let’s add some more features to train our model. We are going to add the player’s age and how good the player is at scoring (the Finishing attribute).

xtrain = train[['Value', 'Age', 'Finishing']]
xtest = test[['Value', 'Age', 'Finishing']]

xtrain.head()

	Value	Age	Finishing
1141	12000000	23	76
13167	650000	19	55
17890	60000	18	43
5393	1400000	29	27
8268	550000	30	41

Let’s see if we can improve the models by using the new set of features as input:

Ordinary least squares regression using more features

regr_more_features = linear_model.LinearRegression()
regr_more_features.fit(xtrain, ytrain)
y_pred_more_features = regr_more_features.predict(xtest)
print("Mean squared error: %.2f" % mean_squared_error(ytest, y_pred_more_features))
print('Variance score: %.2f' % r2_score(ytest, y_pred_more_features))

Mean squared error: 20.19
Variance score: 0.59

Polynomial regression using more features

pol_more_features = make_pipeline(PolynomialFeatures(4), linear_model.Ridge())
pol_more_features.fit(xtrain, ytrain)
y_pol_more_features = pol_more_features.predict(xtest)
print("Mean squared error: %.2f" % mean_squared_error(ytest, y_pol_more_features))
print('Variance score: %.2f' % r2_score(ytest, y_pol_more_features))

Mean squared error: 8.32
Variance score: 0.83

Support Vector regression using more features

svr_rbf_more_features = SVR(kernel='rbf', gamma=1e-3, C=100, epsilon=0.1)
svr_rbf_more_features.fit(xtrain, ytrain.values.ravel())
y_rbf_more_features = svr_rbf_more_features.predict(xtest)
print("Mean squared error: %.2f" % mean_squared_error(ytest, y_rbf_more_features))
print('Variance score: %.2f' % r2_score(ytest, y_rbf_more_features))

Mean squared error: 1.23
Variance score: 0.97

As you can see, we were able to improve the accuracy of all three models even more. But wait, it’s important to note here that you won’t always get better results by adding more and more features. Adding redundant features, or features that carry no information relevant to the target, could end up decreasing the quality and accuracy of predictions by overfitting the model. It also makes the model more complex and slower to train.
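One simple way to screen candidate features before adding them is a univariate test. The sketch below is just one possible approach, not something used in this post: it assumes scikit-learn's SelectKBest with f_regression, and the column names (informative_a, informative_b, pure_noise) are hypothetical, generated synthetically so that one column is pure noise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): two useful columns and one irrelevant column
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "informative_a": rng.normal(size=n),
    "informative_b": rng.normal(size=n),
    "pure_noise": rng.normal(size=n),
})
target = 3 * demo["informative_a"] - 2 * demo["informative_b"] + rng.normal(scale=0.5, size=n)

# Keep the k columns with the strongest univariate linear relationship to the target
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(demo, target)

selected = list(demo.columns[selector.get_support()])
scores = dict(zip(demo.columns, selector.scores_))
print("Selected features:", selected)
print("F-scores:", scores)
```

Keep in mind that a univariate F-test only captures linear, one-feature-at-a-time relationships, so it is a first filter and not a replacement for evaluating the full model.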

Feature engineering techniques can help to choose good features for our models. Studying feature importance or relevance, feature selection and feature extraction, and applying dimensionality reduction techniques are all important when looking for an optimal set of features. Having said that, here is an exercise: try adding many more features to train these same models and see if you can improve them.

Next, let’s take the Support Vector Regression (SVR) model trained with more features (value, age and finishing) as our best candidate model. Let’s calculate the predictions, and the error percentage that they represent, and add both to the test dataframe (unseen data).

pd.options.mode.chained_assignment = None
test['Overall_Prediction_RBF'] = y_rbf_more_features
test['Error_Percentage'] =  np.abs((test.Overall - y_rbf_more_features) / test.Overall * 100)

At this point I’m a bit curious about which players have the highest error rates in the predictions.

Let’s query the results to take a look:

test[['Name', 'Age', 'Value', 'Overall', 'Overall_Prediction_RBF','Error_Percentage']].nlargest(15, columns='Error_Percentage')

	Name	Age	Value	Overall	Overall_Prediction_RBF	Error_Percentage
9	G. Higuaín	29	77000000	90	74.356462	17.381709
10	Sergio Ramos	31	52000000	90	74.356462	17.381709
15	G. Bale	27	69500000	89	74.356462	16.453413
21	A. Griezmann	26	75000000	88	74.356462	15.504020
23	P. Aubameyang	28	61000000	88	74.419167	15.432765
20	J. Oblak	24	57000000	88	74.734834	15.074052
52	T. Müller	27	47500000	86	74.356462	13.538998
48	Isco	25	56500000	86	74.356462	13.538998
69	Y. Carrasco	23	51500000	85	74.356462	12.521809
77	B. Leno	25	34000000	85	74.690997	12.128238
30	Thiago Silva	32	34000000	88	77.808358	11.581411
105	K. Manolas	26	31500000	84	75.177893	10.502509
132	T. Lemar	21	38500000	83	74.356462	10.413901
17759	K. Tokushige	33	20000	51	56.037744	9.877930
57	David Luiz	30	33000000	86	77.895012	9.424404

As you can see, only 13 players in the test dataset (13 of 3523 players, 0.37%) have a prediction error greater than 10%. We can also take a look at the histogram for the error variable.

plt.hist(test.Error_Percentage, bins=16)
plt.title("#Players per %error")
plt.xlabel("%error")
plt.ylabel("Count")
plt.show()

It’s easy to see that most players have a prediction error lower than 2%. Besides, the number of players with an error greater than 5% is almost insignificant. Now let’s use the trained model to make predictions for all the players in the complete dataset:

y_rbf_all = svr_rbf_more_features.predict(df[['Value', 'Age', 'Finishing']])
print("Mean squared error: %.2f" % mean_squared_error(df[['Overall']], y_rbf_all))
print('Variance score: %.2f' % r2_score(df[['Overall']], y_rbf_all))

Mean squared error: 0.61
Variance score: 0.99

The metrics have improved even further. That makes sense, because the complete dataset includes the data that the model was trained on, so these figures are more optimistic than the ones measured on the test set alone.
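If you want an estimate that never scores a model on data it was trained on, k-fold cross-validation is a common alternative. Here is a minimal sketch, assuming scikit-learn's cross_val_score and a small synthetic dataset (x_demo and y_demo are made-up stand-ins, not the FIFA dataframe).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): one feature in [0, 1], nonlinear target
rng = np.random.default_rng(7)
x_demo = rng.uniform(0, 1, size=(300, 1))
y_demo = 60 + 20 * np.sqrt(x_demo[:, 0]) + rng.normal(0, 1, size=300)

model = SVR(kernel="rbf", C=100, epsilon=0.1)

# Each of the 5 folds is held out once while the model trains on the other 4
cv_r2 = cross_val_score(model, x_demo, y_demo, cv=5, scoring="r2")

print("R2 per fold:", np.round(cv_r2, 2))
print("Mean R2: %.2f" % cv_r2.mean())
```

Every score in cv_r2 comes from a fold that the model did not see during fitting, so the mean is a more honest summary than a score computed on the full dataset.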

3. Building the application

Just as an illustrative step, we are now going to build a simple web application. It will let you search for players and list them along with their predictions, so you can play a bit with the results of this experiment. This prototype has one main disadvantage: it can only display predictions for players in this dataset, not for new players whose features (value, age and finishing) you might know. The reason for this simplification is that we are not hosting the real model on a backend. For now, I just generated a JavaScript model that fits a basic AngularJS application, with code that you can download from this GitHub. The JavaScript model for the players was built by adding the prediction and error to the main dataframe.

Later we used the “to_json” method as follows:

#from IPython.html import widgets

pd.options.mode.chained_assignment = None
df['Overall_Prediction_RBF'] = y_rbf_all
df['Error_Percentage'] =  np.abs((df.Overall - y_rbf_all) / df.Overall * 100)
jsonDf = df.to_json(orient='records')
#widgets.HTML(value = ''' players = ''' + jsonDf)
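To make the shape of that output concrete, here is a tiny sketch of what to_json with orient='records' produces. The two players and their numbers are hypothetical, and the variable name players simply mirrors the commented-out line above.

```python
import json

import pandas as pd

# Hypothetical players (assumption), only to show the shape of the output
demo = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Overall": [80, 70],
    "Overall_Prediction_RBF": [78.0, 70.7],
})
demo["Error_Percentage"] = (
    (demo["Overall"] - demo["Overall_Prediction_RBF"]).abs() / demo["Overall"] * 100
)

# orient='records' gives a JSON array with one object per row
json_text = demo.to_json(orient="records")

# Parse it back to confirm the structure a front end would receive
records = json.loads(json_text)

# Prefixing an assignment turns the JSON into a JavaScript source snippet
js_source = "players = " + json_text
print(js_source)
```

Each row becomes one JSON object keyed by column name, which is convenient for a front end that just needs to filter and display a list.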

In a real project, a more realistic architecture for hosting, consuming and updating a predictive model should be considered. But for a prototype I believe this is good enough. So, here you have the prototype!

4. Where to go next?

A well-known recommendation when someone presents results in any research is to try to replicate those results yourself. By doing this you can understand them better, validate them, find errors and improve things. So, if you liked this post and you are getting started with predictive analytics and machine learning, I encourage you to install the recommended software and environment. Then, execute the provided code to get your own results. If you felt a bit concerned about the math and statistics involved, you can go through a lot of content available on the web. There are posts, videos, courses, books and more, such as:

Complete Course on Linear Algebra by MIT
Complete Course on Multivariable Calculus by MIT
Mathematics at Khan Academy
Full Cheatsheet on Probability
Here, some other book resources are also recommended

In particular, I liked the approach given in the following chapters of the Deep Learning Book (Goodfellow-et-al-2016) that summarizes very well all the required background knowledge:

Continue to explore:

Relationship between machine learning and big data and how to perform distributed processing
Performance and the usage of GPGPU to train models
Hyper-parameter tuning/optimization
Cross-validation approach
Feature engineering, feature relevance, feature selection, and extraction
Dimensionality reduction techniques
Apply and compare other techniques for regression problems
Use of categorical variables
Cold start problem (how to start when no data is available)
Deployment of trained models
Maintenance and update of models
System’s architecture

So, if you try any of these things, please share your experience with us 🙂

End notes

During this machine learning tutorial, we went through a simplified view of a typical ML process, like the one presented in the diagram “The Machine Learning Process” at the beginning of this post. We gave a practical implementation of each phase, showing most of the common steps that are generally performed. We leveraged the raw data provided by sofifa.com and some modules for preparing the data created by Aman Srivastava, and then we added new modules to the pipeline to continue adapting the data to our needs.

While studying the characteristics of our data, we were able to get some relevant information like mean, standard deviation, percentiles, maximum, minimum, etc. Also, we observed that a normal distribution described the distribution of our data pretty well. In order to practice with regression problems, we created different machine learning models: linear regression, polynomial regression and support vector regression.

We used Python and scikit-learn, starting with just one feature (value) and then adding some new features for training the models (age and finishing). We noted that adding new features to a model does not always lead to better results, and we mentioned common approaches to address this problem, like feature selection, feature extraction and dimensionality reduction. As evaluation metrics for regression models, we applied two of the most common ones: the mean squared error (MSE) and the R2 score. We were able to improve these metrics considerably when comparing the first basic model and the last one.

Let us know your experience!

At the end of this journey, we ended up with a good candidate model. We embedded it in a pure front-end AngularJS web application that allows the user to search for players. It’s also possible to compare how good the players are against the system’s predictions. Even if you are not Peter Brand in the film Moneyball (I strongly recommend this movie, based on a true story, if you haven’t seen it yet), the final app is still helpful for practicing your machine learning knowledge (or if you are thinking of developing a sports betting site!). You also might want to try it during the upcoming 2018 FIFA World Cup. Don’t forget to comment with your results 🙂

Wondering how to apply machine learning to something other than soccer? Check out our experience using AI to help with the difficult task of story points estimation!
