Use Machine Learning for Software Development Estimation
In this work, we will present some ideas on how to build a smart component that is able to predict the complexity of a software development task. In particular, we will try to automate the process of sizing a task based on the information that is provided as part of its title and description and also leveraging the historical data of previous estimations.
In agile development, this technique is known as story point estimation, and it differs from classic estimation techniques that predict hours: the goal of this software development estimation is not to estimate how long a task will take, but to predict how complex it is.
Typically, this software development estimation process is done by agile teams in order to decide which tasks the team can commit to completing in the next sprint, generally a period of two weeks. Based on the previous two or three sprints, the team knows in advance the average number of points it is able to complete, and it takes that average as the threshold for the next sprint. Then, during the estimation process (generally through a fun activity like planning poker), each member of the team gives a number of points that reflects how complex they think the task is.
There is a set of tasks that the team has decided are the "base stories": well-known tasks already labeled with the complexity the team has agreed on, which can be used later as a basis for comparison. After each sprint, new tasks can be added to this list. Over time, the list collects examples of tasks with different complexities that the team can use to compare against and estimate new tasks. Teams generally reach a high level of accuracy in their estimations after some time, thanks to continuous improvement based on accumulated experience as they collect more and more software development estimations.
Generally, the mental process that each team member goes through when estimating a new task is:
- Based on their previous experience, they look for similar tasks they have done in the past.
- They give the same number of points that those similar tasks were assigned in the past.
- If there isn't a similar task, they start an ordered comparison from the least to the most complex base tasks. The reasoning is something like: "Is this task more complex than this one?" If so, they move on through the set of well-known base tasks in order of increasing complexity, repeating the process until the new task falls into one of the complexity categories. By convention, if a new task looks more complex than size X but less complex than size Y (Y being the next size in order of complexity after X), the size assigned to the task's estimation is Y.
If we look at this process, we can find many similarities with a classic machine learning problem, where performance at a task, T, measured by a performance measure, P, improves over time with experience, E. In our case, T is the task of estimating/predicting the complexity of a new ticket (bug, new feature, improvement, support, etc.), the experience, E, is the historical data of previous estimations, and the performance measure, P, is the difference between the actual level of complexity and the software development estimation.
In the following, we present a machine learning approach to predict the complexity of a new task based on the historical data of previously estimated tasks. We will use Facebook's FastText tool to learn text vector representations (word and sentence embeddings) that will be implicitly used as input for a text classifier that classifies a task into three categories: easy, medium, and complex. Note that we are changing things a bit, going from a point-based estimation to a category-based estimation.
This is because, unfortunately, the distribution of estimates was very imbalanced in our dataset, and by grouping tasks into these three categories we can slightly simplify our problem. In any case, we can think of each of these classes (easy, medium, complex) as points in a simplified version of the story point estimation process (in a real story point estimation, sizes generally follow a Fibonacci sequence 1, 2, 3, 5, 8, 13, 21, etc., or some minor variation of it).
In the end, we will build a basic web application like the one below that uses the model we trained, so you can see it in action. It allows you to search and pick stories from the backlog (testing set) so you can compare the team's average software development estimation vs. the AI estimation (to see the AI's prediction vs. the team's, click the cards on the right side!). Sounds cool, right? Well, let's dive in!
Having said that, now let’s start!
Preparing the data
Let’s start by loading the appceleratorstudio dataset
In [22]:
import pandas as pd
import numpy as np

df = pd.read_csv("appceleratorstudio.csv", usecols=['issuekey', 'title', 'description', 'storypoint'])
In [23]:
df.isnull().sum()
Out [23]:
issuekey        0
title           0
description    43
storypoint      0
dtype: int64
In [24]:
df = df.dropna(how='any')
Now, let’s see how our data looks in the first few rows:
In [25]:
df.head()
A very good approach is to take a look at the main characteristics of the data that you are going to be working on. In order to do this, we can use the describe operation available in any pandas dataframe:
In [26]:
df.storypoint.describe()
Out[26]:
Among other things, describe gives us:
- the number of rows (a.k.a. observations)
- average values
- minimums and maximums
- percentile values
- standard deviation
Another good idea is to plot a histogram. A histogram can give us a good notion of the underlying data distribution. What it basically does is split the possible values/results into different bins and counts the number of occurrences (observations) where the variable under study falls into each bin. We can do this easily by using matplotlib, one of the most popular Python libraries for 2D plotting.
In [27]:
import matplotlib.pyplot as plt

plt.hist(df.storypoint, bins=20, alpha=0.6, color='y')
plt.title("#Items per Point")
plt.xlabel("Points")
plt.ylabel("Count")
plt.show()
We can easily see that the number of occurrences is not uniform throughout the different size categories (points).
Let’s see the amount of items per point:
In [28]:
df.groupby('storypoint').size()
Out[28]:
In our case, we will start by grouping points into three different categories to reduce the class imbalance.
In [29]:
df.loc[df.storypoint <= 2, 'storypoint'] = 0                             # small
df.loc[(df.storypoint > 2) & (df.storypoint <= 5), 'storypoint'] = 1     # medium
df.loc[df.storypoint > 5, 'storypoint'] = 2                              # big
In [30]:
df.groupby('storypoint').size()
Out[30]:
At this point, it's important to note that in this work the goal is to solve a classification problem (predict the class associated with the complexity of a task: 0-easy, 1-medium, or 2-complex) instead of a regression problem (predict a continuous real value) as in the paper A deep learning model for estimating story points.
Before we continue, let's do some cleanup of our data. This is a common step that almost any machine learning process needs, because of issues generally faced during the data preparation phase (a few of them are sketched in the short example after this list):
- Format and structure normalization
- Detect and fix missing values
- Remove duplicates
- Normalize units
- Validate constraints
- Detect and remove anomalies
- Study features importance/relevance
- Dimensionality reduction, feature selection & extraction
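As a quick illustration of a few of these steps, here is a minimal pandas sketch. It is purely illustrative: it works on a small made-up df_raw dataframe and is not part of the pipeline used in this work.

# Hypothetical examples of common cleanup steps with pandas
# (not executed in this work; our dataset only needed the text cleanup shown below)
df_raw = pd.DataFrame({'title': [' Fix crash ', 'fix crash', None],
                       'description': ['editor crashes', 'editor crashes', 'n/a']})

df_clean = df_raw.dropna(how='any')                                        # fix missing values
df_clean = df_clean.assign(title=df_clean.title.str.strip().str.lower())   # format normalization
df_clean = df_clean.drop_duplicates(subset=['title', 'description'])       # remove duplicates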
For this work, most of these issues were already addressed by the authors of A deep learning model for estimating story points when collecting the dataset. Still, we need to do some extra cleanup for our purpose: remove some HTML tags as well as English stop words (words like the, this, that, etc.), because they can add noise to our problem and it's better to remove them.
In [31]:
import numpy as np
import csv
from nltk.corpus import stopwords

# Define some known html tokens that appear in the data, to be removed later
htmltokens = ['{html}', '<div>', '<pre>', '<p>', '</div>', '</pre>', '</p>']

# Clean operation
# Remove English stop words and html tokens
def cleanData(text):
    result = ''
    for w in htmltokens:
        text = text.replace(w, '')
    text_words = text.split()
    resultwords = [word for word in text_words if word not in stopwords.words('english')]
    if len(resultwords) > 0:
        result = ' '.join(resultwords)
    else:
        print('Empty transformation for: ' + text)
    return result

def formatFastTextClassifier(label):
    return "__label__" + str(label) + " "
Important: Since we are removing stop words and html tags in our dataset, later when we want to predict with some unseen data we will need to apply the same transformation before requesting the model’s prediction for that input.
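As a minimal sketch of what this looks like for a hypothetical unseen ticket (the ticket text is made up, and clf stands for a trained instance of the classifier we define later):

# Hypothetical new ticket: apply the same cleanup used at training time
new_ticket = "<p>Editor crashes when I open the project settings dialog {html}</p>"
prepared = cleanData(new_ticket.lower())
print(prepared)  # editor crashes open project settings dialog
# prediction = clf.predict([prepared])  # clf: a trained FastTextClassifier (defined later)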
Next, we will create two new columns in the dataframe:
- One new column called "title_desc" that is just the concatenation of the title and description columns
- A second column called "label_title_desc" that contains the number of points with the specific prefix expected by FastText to recognize it as the label (class)
While doing this, we will also change everything to lower case to make the training phase case insensitive. These new columns will be used later for training our learning algorithms.
In [32]:
df['title_desc'] = df['title'].str.lower() + ' - ' + df['description'].str.lower()
df['label_title_desc'] = df['storypoint'].apply(lambda x: formatFastTextClassifier(x)) + df['title_desc'].apply(lambda x: cleanData(str(x)))
In [33]:
df = df.reset_index(drop=True)
Dealing with the imbalanced dataset – Oversampling
Since our classes are imbalanced, we will oversample the minority classes. More complex oversampling techniques exist, like SMOTE, where artificial datapoints (called synthetic datapoints) are created by taking two datapoints in the minority class (one datapoint and one of its k nearest neighbors) and creating a new artificial point in the space between the two real points. If we think about this technique in a 2D scenario, the new datapoint is created at some random place on the line segment between the two points.
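As a rough sketch of that interpolation step (not used in this work; a real SMOTE implementation such as the one in imbalanced-learn also handles neighbor selection and sampling ratios):

import numpy as np

def smote_like_point(x, x_neighbor, rng=np.random):
    """Create one synthetic point between a minority-class point and one of its neighbors."""
    lam = rng.uniform(0, 1)            # random position along the segment
    return x + lam * (x_neighbor - x)  # lies on the line between the two real points

# Toy 2D example
synthetic = smote_like_point(np.array([1.0, 2.0]), np.array([3.0, 4.0]))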
Anyway, for this work, a basic oversampling technique that creates copies of the existing data was used. The main reason is simplicity: dealing with synthetic datapoints created artificially implies finding a text representation of a sentence that maps to the new vector representation, because in the end the FastText tool expects sentences of text, not embeddings. Possible workarounds exist, for instance approximating the synthetic point with a new sentence generated by averaging the embeddings of the words used by the k sentences nearest to the synthetic point. This could be something interesting to try, so if you do it, please let us know your results!
Note: Basic random downsampling of the majority class, also a common and simple technique, was combined with the oversampling, but it didn't improve the results. So, in the end, just basic oversampling was used in order to minimize the effect of the imbalanced dataset.
In [34]:
from collections import Counter

def SimpleOverSample(_xtrain, _ytrain):
    xtrain = list(_xtrain)
    ytrain = list(_ytrain)
    samples_counter = Counter(ytrain)
    max_samples = sorted(samples_counter.values(), reverse=True)[0]
    for sc in samples_counter:
        init_samples = samples_counter[sc]
        samples_to_add = max_samples - init_samples
        if samples_to_add > 0:
            # collect indices to oversample for the current class
            index = list()
            for i in range(len(ytrain)):
                if(ytrain[i] == sc):
                    index.append(i)
            # select samples to copy for the current class
            copy_from = [xtrain[i] for i in index]
            index_copy = 0
            for i in range(samples_to_add):
                xtrain.append(copy_from[index_copy % len(copy_from)])
                ytrain.append(sc)
                index_copy += 1
    return xtrain, ytrain
Creating our classifier
Without entering into much detail, we can say that embedding techniques learn vector representations of words, sentences, or documents, such that the representations of similar and semantically related words, sentences, or documents are close together in the high-dimensional vector space. By leveraging this property of the learned vectors, we can use them as features for any kind of machine learning algorithm: to train a classifier, as input to a clustering algorithm, etc.
Word embedding techniques are not new in Natural Language Processing (NLP), although in recent years new embedding techniques based on predictive neural network models have become very popular and have revolutionized machine learning in many domains, not just NLP. Word embeddings have started to be used in other domains like e-commerce and recommender systems with variations known as prod2vec and meta-prod2vec, or in mobile applications with app2vec, among others. Recently, I've applied different embedding techniques to create Internet Domain Name embeddings from DNS trace logs, and they have proven to be a good approach for learning semantic similarities and analogy tasks between Internet Domain Names. You can see details about how I have used word2vec for learning Internet Domain Names in Vector representation of Internet Domain Names using a Word Embedding technique.
FastText
Regarding FastText, its main advantage over word2vec is that it considers the subwords inside a word. Instead of treating each word as a single token, a word is split into a set of substrings called n-grams; the training phase then considers each subword of a word, and the vector representation of a word is formed by averaging the vector representations of its subwords (and the word itself). The most important parameters to tune when using FastText are minn and maxn, which define the minimum and maximum length of the n-grams used when splitting words.
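To make the subword idea concrete, here is a simplified sketch of how a word can be split into character n-grams with minn=4 and maxn=6 (an approximation: the real FastText additionally keeps the whole word as its own token and hashes n-grams into buckets):

def char_ngrams(word, minn=4, maxn=6):
    # FastText-style character n-grams; '<' and '>' mark the word boundaries
    w = "<" + word + ">"
    grams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

char_ngrams("estimate")
# e.g. {'<est', 'esti', 'stim', 'tima', 'imat', 'mate', 'ate>', '<esti', ...}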
Additionally, FastText can be used either in supervised or unsupervised mode. When using FastText in supervised mode, you can train a supervised model by using a dataset specially prepared (labeled) using a set of sentences (one per line) along with a label that acts as the class to which the sentence belongs. So, by training a FastText model in supervised mode, you can later perform classification tasks over new unseen sentences, which is very helpful for a lot of text classification and sentiment analysis problems.
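To make the expected input format concrete, here is what a single labeled training line would look like for a made-up ticket title, using the cleanData and formatFastTextClassifier helpers defined earlier:

# Hypothetical ticket title labeled as class 1 (medium)
line = formatFastTextClassifier(1) + cleanData("studio hangs when opening large projects".lower())
print(line)  # __label__1 studio hangs opening large projects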
Having said this, we present a simple custom Python wrapper for the supervised mode of the native FastText interface. Although there is already a wrapper for FastText and a native module in the well-known Gensim package, neither of them includes support for the supervised mode of FastText (only the unsupervised mode). So, I decided to create a custom and very basic wrapper with the minimum that we need for our purpose, that is:
- a constructor to create new instances of the wrapper with its own state
- a fit method to trigger the training process by calling the executable file and passing the required parameters (*)
- a predict method that receives an array with a list of sentences and returns another array (of the same size) with the integer predictions in {0, 1, 2} for each sentence
(*) The training process is executed with the following parameters:
- 500 epochs (iterations over the corpus)
- Vector size of 300 dimensions
- minn=4 and maxn=6 (minimum and maximum numbers of n-grams respectively)
- a pretrained vectors file used to transfer previous knowledge of the English language and domain-specific knowledge (I've tried using the pretrained vectors for the English language provided by FastText, but in the end, generating my own pretrained models using other system datasets in the same domain worked better. You can download these other datasets from the same github repository in order to build your own pre-trained model.)
In [35]:
import uuid
import subprocess

class FastTextClassifier:
    rand = ""
    inputFileName = ""
    outputFileName = ""
    testFileName = ""

    def __init__(self):
        self.rand = str(uuid.uuid4())
        self.inputFileName = "issues_train_" + self.rand + ".txt"
        self.outputFileName = "supervised_classifier_model_" + self.rand
        self.testFileName = "issues_test_" + self.rand + ".txt"

    def fit(self, xtrain, ytrain):
        outfile = open(self.inputFileName, mode="w", encoding="utf-8")
        for i in range(len(xtrain)):
            #line = "__label__" + str(ytrain[i]) + " " + xtrain[i]
            line = xtrain[i]
            outfile.write(line + '\n')
        outfile.close()
        p1 = subprocess.Popen(["cmd", "/C", "fasttext supervised -input " + self.inputFileName +
                               " -output " + self.outputFileName +
                               " -epoch 500 -wordNgrams 4 -dim 300 -minn 4 -maxn 6 -pretrainedVectors pretrain_model.vec"],
                              stdout=subprocess.PIPE)
        p1.communicate()[0].decode("utf-8").split("\r\n")

    def predict(self, xtest):
        #save test file
        outfile = open(self.testFileName, mode="w", encoding="utf-8")
        for i in range(len(xtest)):
            outfile.write(xtest[i] + '\n')
        outfile.close()
        #get predictions
        p1 = subprocess.Popen(["cmd", "/C", "fasttext predict " + self.outputFileName + ".bin " + self.testFileName],
                              stdout=subprocess.PIPE)
        output_lines = p1.communicate()[0].decode("utf-8").split("\r\n")
        test_pred = [int(p.replace('__label__','')) for p in output_lines if p != '']
        return test_pred
About pre-trained models
In our particular case, we base our predictions on understanding the meaning of the natural language information provided in the title and description of each entry. So, having a pretrained model of English text could be very helpful.
My first attempt was to use the pre-trained model for the English language available in the FastText GitHub repository. Although this improved the overall accuracy of my solution a bit, the improvement was not significant.
Deep Learning
The second approach I followed was to apply the same idea already used in A deep learning model for estimating story points: use the pre-training csv files (issues with title and description but without story points) to train an unsupervised FastText model in the same domain as my problem (software development issues), and use it later when training the supervised FastText classifier on my concrete problem and specific dataset. By doing this we start with some basic knowledge, such as vector representations for words and sentences in the software development domain, so we get a good parameter initialization without using the labeled data.
Using the second approach was much better than the first one (the final results are shown at the end). Although the pre-trained vectors from Wikipedia had been trained with considerably more data than the pre-trained vectors I obtained from the pre-training csv files of other open source repositories, the second approach achieved the best results. This confirms something already known: pre-trained vectors trained on a similar domain can achieve better results even if they are trained on much less data than pre-trained vectors from a less similar domain.
The code below shows the files that were used to train the pre-trained vectors using other open source repositories and how to join them into a single pandas dataframe.
In [36]:
import pandas as pd
import numpy as np

pretrain_files = ['apache_pretrain.csv', 'jira_pretrain.csv', 'spring_pretrain.csv',
                  'talendforge_pretrain.csv', 'moodle_pretrain.csv', 'appcelerator_pretrain.csv',
                  'duraspace_pretrain.csv', 'mulesoft_pretrain.csv', 'lsstcorp_pretrain.csv']

pretrained = None
for file in pretrain_files:
    df_pretrain = pd.read_csv('PretrainData/' + file, usecols=['issuekey', 'title', 'description'])
    if(pretrained is not None):
        pretrained = pd.concat([pretrained, df_pretrain])
    else:
        pretrained = df_pretrain

pretrained = pretrained.dropna(how='any')
In [16]:
pretrained['title_desc'] = (pretrained['title'].str.lower() + ' - ' + pretrained['description'].str.lower()).apply(lambda x: cleanData(str(x)))
In [17]:
outfile = open("issues_pretrain.txt", mode="w", encoding="utf-8")
for line in pretrained.title_desc.values:
    outfile.write(line + '\n')
outfile.close()
Finally, the unsupervised pre-trained model is built from the command line:
fasttext skipgram -input issues_pretrain.txt -output pretrain_model -epoch 100 -wordNgrams 4 -dim 300 -minn 4 -maxn 6 -lr 0.01
Defining the training and evaluation strategy
Selecting Training and Testing sets
Generally, we would like to have as much representative data as possible for training, and also a large number of different examples to put our model under test. But, sometimes it’s difficult to have both, mainly when we have small datasets like in our scenario. When this occurs, a technique called k-fold cross-validation can help us.
When using this approach, the data is split into k partitions (folds); then, in each of the k iterations, you select a different fold as the testing set and join the remaining folds to create the new training set. In the end, the evaluation criteria averaged over the k iterations is used to measure the quality of our model.
This is a common technique that helps evaluate models and ensures that the results do not depend on how you selected your training data. K-fold cross-validation is also helpful when your dataset is small (like in our case), because you can ensure that every datapoint will be part of the training set and also part of the testing set at some point. You therefore do not risk leaving important (but few) examples out of the training set, as could happen with a fixed 80-20 split into training and testing sets.
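For reference, this is roughly what the same kind of split looks like with scikit-learn's built-in KFold (a minimal sketch; below we code our own version instead, to make the mechanics explicit):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=99)
for train_idx, test_idx in kf.split(df):
    training_set = df.iloc[train_idx]   # k-1 folds joined together
    testing_set = df.iloc[test_idx]     # the held-out fold
    # ... train on training_set and evaluate on testing_set ...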
Although there are some pre-built functionalities in sklearn to work with cross-validation, let's code our own simple method that, given a fold index i (0 <= i < k), returns a testing set (the dataframe in fold i) and a training set that is the union of all the folds different from i. By doing this, you can clearly visualize how it works in the background. Let's do it!
In [37]:
def rebuild_kfold_sets(folds, k, i):
    training_set = None
    testing_set = None
    for j in range(k):
        if(i==j):
            testing_set = folds[i]
        elif(training_set is not None):
            training_set = pd.concat([training_set, folds[j]])
        else:
            training_set = folds[j]
    return training_set, testing_set
Defining the evaluation criteria
Confusion matrix, precision, recall, and F1 score metrics
The confusion matrix is a square matrix with one row and one column for each class in the classifier. The x-axis is used for the predicted values and the y-axis for the true values. A cell (i,j) in this matrix represents the number of predictions where class j was predicted but the true class was class i. Note that when i=j, cell(i,j) = cell(i,i); these are the cells on the diagonal and represent the number of correct predictions made for class i.
Some important metrics that we can calculate from the values in this matrix are:
- The overall accuracy (# of correct predictions / # total predictions)
- Precision for class i (# of correct predictions for class i / # total predictions for class i)
- Recall for class i (# of correct predictions for class i / # total true items in class i )
Also, since it’s easy to have a high recall with low precision and the opposite (high precision with low recall) it’s usual to add an F1 score metric (harmonic average of precision and recall) in order to combine precision and recall in just one metric, both being important to increase its value.
F1 = 2 x (precision x recall) / (precision + recall)
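As a small illustration of how these metrics fall out of a confusion matrix, here is a helper sketch with a made-up toy matrix (this is not the code used to produce the results below):

import numpy as np

def metrics_from_confusion_matrix(cm):
    # cm[i, j] = number of items of true class i predicted as class j
    cm = np.asarray(cm, dtype=float)
    accuracy = np.trace(cm) / cm.sum()
    precision = np.diag(cm) / cm.sum(axis=0)   # per class, over predicted totals (columns)
    recall = np.diag(cm) / cm.sum(axis=1)      # per class, over true totals (rows)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy 2-class example
acc, prec, rec, f1 = metrics_from_confusion_matrix([[8, 2],
                                                    [3, 7]])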
With this introduction to what a confusion matrix is, and the metrics that we can calculate from it, let's define a helper method to plot a pretty confusion matrix. Later, after obtaining the final results, we will see how to examine a confusion matrix and the metrics in more detail.
In [38]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

def plot_confusion_matrix_with_accuracy(classes, y_true, y_pred, title, sum_overall_accuracy, total_predictions):
    cm = ConfusionMatrix(y_true, y_pred)
    print('Current Overall accuracy: ' + str(cm.stats()['overall']['Accuracy']))
    if total_predictions != 0:
        print('Total Overall Accuracy: ' + str(sum_overall_accuracy/total_predictions))
    else:
        print('Total Overall Accuracy: ' + str(cm.stats()['overall']['Accuracy']))
    conf_matrix = confusion_matrix(y_true, y_pred)
    plt.figure()
    plot_confusion_matrix(conf_matrix, classes=classes, title=title)
    plt.show()
Main method (k-fold cross-validation, with oversampling in training folders)
In [39]:
from sklearn.metrics import confusion_matrix
from pandas_ml import ConfusionMatrix

# K-fold cross validation
# K=5 or K=10 are generally used.
# Note that the overall execution time increases linearly with k
k = 5

# Define the classes for the classifier
classes = ['0', '1', '2']

# Make the dataset random before starting
df_rand = df.sample(df.storypoint.count(), random_state=99)

# Number of examples in each fold
fsamples = int(df_rand.storypoint.count() / k)

# Fill folds (obs: the last fold could contain fewer than fsamples datapoints)
folds = list()
for i in range(k):
    folds.append(df_rand.iloc[i * fsamples : (i + 1) * fsamples])

# Init
sum_overall_accuracy = 0
total_predictions = 0

# Repeat k times and average results
for i in range(k):
    # 1 - Build the training and testing sets for iteration i
    training_set, testing_set = rebuild_kfold_sets(folds, k, i)
    y_true = testing_set.storypoint.tolist()

    # 2 - Oversample (ONLY TRAINING DATA)
    X_resampled, y_resampled = SimpleOverSample(training_set.label_title_desc.values.tolist(),
                                                training_set.storypoint.values.tolist())

    # 3 - Train
    clf = FastTextClassifier()
    clf.fit(X_resampled, y_resampled)

    # 4 - Predict
    y_pred = clf.predict(testing_set.label_title_desc.values.tolist())

    # 5 - Update overall accuracy
    for num_pred in range(len(y_pred)):
        if(y_pred[num_pred] == y_true[num_pred]):
            sum_overall_accuracy += 1
        total_predictions += 1

    # 6 - Plot confusion matrix and accuracy
    plot_confusion_matrix_with_accuracy(classes, y_true, y_pred,
                                        'Confusion matrix (testing-set folder = ' + str(i) + ')',
                                        sum_overall_accuracy, total_predictions)
Evaluating the results
Sometimes it's not enough to look at the overall accuracy. Suppose you have a dataset with datapoints divided into two classes, A and B, and that 90% of the datapoints belong to class A. Then you could create a dummy classifier that always predicts class A (without any learning algorithm) and it would achieve 90% accuracy! For that reason, other metrics like precision, recall, or the F1 score for each class can be helpful in scenarios like this (recall for class B in this example would be 0, showing a clear problem predicting items of the minority class).
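A quick sketch of that scenario with scikit-learn's DummyClassifier (toy data, just to illustrate the point):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 90% of the datapoints belong to class A (0), 10% to class B (1)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = dummy.predict(X)

print(accuracy_score(y, y_pred))             # 0.9 -> looks great...
print(recall_score(y, y_pred, pos_label=1))  # 0.0 -> ...but class B is never found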
Having said this, we can take a look at the different confusion matrices that were generated in each iteration of our cross-validation process in order to average the important metrics to see how well our model performs for the different categories.
Class 0
- Average Precision for class 0: (12 + 8 + 9 + 10 + 8)/ (2 + 4 + 12 + 0 + 6 + 8 + 4 + 7 + 9 + 3 + 5 + 10 + 0 + 5 + 8) = 47/83 = 0.5663 (57%)
- Average Recall for class 0: (12 + 8 + 9 + 10 + 8) / (12 + 32 + 8 + 8 + 26 + 8 + 9 + 43 + 9 + 10 + 29 + 13 + 8 + 28 + 17) = 47/260 = 0.1808 (18%)
- Average F1 score for class 0: 2 x (0.5663 x 0.1808)/(0.5663 + 0.1808) = 0.2741
Class 1
- Average Precision for class 1: (281 + 289 + 267 + 273 + 270) / (98 + 281 + 32 + 69 + 289 + 26 + 108 + 267 + 43 + 83 + 273 + 29 + 90 + 270 + 28) = 1380/1986 = 0.6949 (69%)
- Average Recall for class 1: (281 + 289 + 267 + 273 + 270) / (4 + 281 + 48 + 6 + 289 + 72 + 7 + 267 + 46 + 5 + 273 + 63 + 5 + 270 + 60) = 1380/1696 = 0.8137 (81%)
- Average F1 score for class 1: 2 x (0.6949 x 0.8137)/(0.6949 + 0.8137) = 0.7496
Class 2
- Average Precision for class 2: (90 + 97 + 82 + 96 + 97)/ (90 + 48 + 8 + 97 + 72 + 8 + 82 + 46 + 9 + 96 + 63 + 13 + 97 + 60 + 17) = 462/806 = 0.5732 (57%)
- Average Recall for class 2: (90 + 97 + 82 + 96 + 97) / (2 + 98 + 90 + 0 + 69 + 97 + 4 + 108 + 82 + 3 + 83 + 96 + 0 + 90 + 97) = 462/919 = 0.5027 (50%)
- Average F1 score for class 2: 2 x (0.5732 x 0.5027)/(0.5732 + 0.5027) = 0.5356
About precision
Precision for class X represents how good the model (classifier) is when it predicts that a data point belongs to class X.
We can see that the average precision for class 1 is close to 70%, which means that for roughly 7 out of 10 predictions of class 1, the true value was indeed class 1. The other 30% were 'confused', mainly with items of class 2. You can repeat the same reasoning to see how the model performs when predicting classes 0 and 2, achieving 57% precision in both, where the main confusion is with items of class 1 (medium). In some way this could be expected: small sizes are confused with medium, and some big ones are also confused with medium, but it's good to see little confusion between small and big.
About recall
Recall for class X represents how good the model (classifier) is at finding data points that belong to class X.
We can observe that the main issue of our model is finding items from class 0 (small). Although almost 57% of the model's predictions for class 0 are accurate, it still fails to give us as many predictions for class 0 as it should. For that reason, the average recall of class 0 is not as good as we'd like. On the other hand, our model is very good at finding class 1 (medium) items, successfully finding more than 80% of the items in that class. Finally, the recall for class 2 (big) shows that our model successfully finds half of the items in that class, which is better than its recall for class 0 (small) but not as great as its recall for class 1 (medium).
About the F1 score
The F1 score metric (the harmonic mean of precision and recall) combines precision and recall into a single metric, where both need to be high for the score to be high. Since the F1 score for class 0 is much lower than the F1 score for class 2, which in turn is lower than the F1 score for class 1, we can confirm that our previous observations are consistent: our model performs very well at classifying items of class 1 (medium) and is also good at classifying items of class 2 (big). And although the precision for class 0 (small) suggests we can reasonably trust the model when it predicts that an item is small, a high percentage of the small items are not being found by our model. This is probably the first point to improve in the future.
Building the application
To build the demo application, we first serialized the testing set together with the model's predictions. We selected only the columns we wanted to expose and added the predictions as a new column:

ts = testing_set[["issuekey", "title", "description", "storypoint"]]  # select only the columns to be serialized
ts["prediction"] = y_pred  # add the predictions to the dataframe
Later we used the “to_json” method as follows:
from IPython.html import widgets

jsonDf = ts.to_json(orient='records')
widgets.HTML(value = ''' backlog_items = ''' + jsonDf)
Final thoughts on software development estimation
Some of the important things that we can conclude from this work are:
- Imbalanced datasets can be difficult to deal with in classification problems.
- The use of random downsampling did not achieve better results in our concrete problem.
- The use of a basic oversampling improved our results a bit. Some more complex oversampling techniques probably could get even better results and it’s something interesting to try out.
- The use of general pre-trained vectors for the English language from Wikipedia improved the results a bit, but not considerably.
- The use of more domain-specific pre-trained vectors from other open source repositories obtained better results than using general pre-trained vectors. It's better to have less data of good quality that is very representative of our domain, than to have tons of data that is not representative, of poor quality, or noisy.
- K-fold cross-validation is a helpful approach for model evaluation, mainly when the dataset is not too large, as in our case.
- Overall accuracy is important, but it’s not the only metric to be considered when evaluating a classifier. Metrics like Precision, Recall and F1 score can be a good option for evaluation metrics in classification problems.
We were able to train a model that can be taken as a baseline from which to start and optimize by tuning the different parameters, which needs more experimenting. Other things that could help improve the performance of our model are a smarter approach to oversampling and, possibly, adding more real, representative examples for the minority classes; in particular, more examples of small-size items.
Also, it's worth mentioning that if we are bad at estimating and the quality of the estimations in our dataset is poor, then the learning algorithm will probably produce poor estimations as well. For this reason, something interesting to test would be to use a curated base history of items for training; that is, items whose actual size (a metric of the effort, or how many hours they took) is added to the historical information after they have been completed.
Finally, although the results were not excellent, they were pretty good. We validated that a machine learning classification approach can be a good option for addressing the common yet difficult issue of estimating how complex a software development task is.
If you improve upon some of the results presented here, we'd love to know about it! Also, share your experience if you use some of these ideas to develop your own Jira plugin to make your planning meetings more fun 🙂
About us
UruIT works with US companies, from startups to Fortune 500 enterprises, as a nearshore partner for project definition, application design, and technical development. Our Uruguay- and Colombia-based teams have worked on over 150 design and development projects to date.
Are you ready to make the leap in your software development project? Tell us about your project and let’s talk!