R is used mostly by people with a background in math and statistics. Many Data Scientists work in R and feel very comfortable with the language. If you don’t have any R experience and you are curious about it, we recommend the interactive course available in Code School and the open source IDE RStudio to run your first scripts. You can also find many resources about R in the following material: “CRAN: Manuals”, “CRAN: Contributed Documentation” and “R Resources for Beginners”. An interesting feature of R is that, if you use SQL Server as your database engine (from SQL Server 2016 onward), you can deploy your models directly into the database and call them from your SQL queries by embedding R scripts. This way, you avoid moving datasets out of the database and into a separate R application. Similarly, starting with the CTP 2.0 release of SQL Server 2017, Microsoft extended in-database Machine Learning support by adding Python as a second supported language. That is why Microsoft renamed “R Services” to “Machine Learning Services”: both R and Python are now options for this tool.

I think the move from R Services to Machine Learning Services has been a very good strategy for Microsoft. It shortens the distance between predictive analysis and the data, which is where the real value lies. This makes adoption easier and leads to better-performing solutions and safer integration.

The Ultimate Introduction to Machine Learning


Machine Learning and Artificial Intelligence are no longer simply the subject matter of Sci-Fi novels and comic books. They are quickly transforming the technology that we use every day. In reality, it’s difficult not to be in contact with these technologies, whether you realize it or not. Don’t believe it? From the more obvious applications, like the up-and-coming self-driving cars, to more common ones, such as streaming apps like Spotify and Netflix that learn your tastes over time and give you personal recommendations, there is plenty of evidence that we have entered a new era. It’s incredible how much progress has been made, especially in the past 10 years, and it’s exciting to realize that we have only scratched the surface of what these new technologies can do to better the way in which we live. In truth, any company that creates systems or applications will eventually have to adopt one “smart” technology or another, or else go the way of the floppy disk!

At UruIT we are fascinated by and heavily involved in the development of smart applications, which is why, through this guide, we want to show how information systems such as traditional web pages and business apps can benefit significantly from the use of Machine Learning (ML) techniques. We also want to show how these can be combined with classic Information Retrieval (IR) techniques, Recommender Systems (RS) and good practices in user interface design and user experience (UX) to transform a traditional information system into a smart system. If you are in any way involved with the software industry, but you don’t know much about smart mobile and web applications, this guide is for you!

Here we'll explore the unlimited potential of this new generation of applications:
  • First, we’ll go over the basic definitions and concepts,
  • then show common examples of usage,
  • and review some of the main tools and products to use when building your smart application.


Imagine a traditional news website in which everybody sees the same content. Meaning, every single visitor sees the same home page, whether they are a 15 year-old girl from Los Angeles or a 75 year-old man from Boise, Idaho.

Then, imagine instead, a news site which learns from its readers’ past interactions: the news they read, the ratings they gave (explicit feedback), the amount of time spent reading each article (implicit feedback), the semantics of what they write in the comments (also implicit feedback), or how similar users act. This information will allow the site to:

  • Select which news to show in the main section based on personal interests.
  • Adjust and personalize the user experience (layout, colors, font size, help, etc.) based on the user’s profile and demographic data (age, sex, nationality, educational level, etc.) and on context, such as the day of the week and time of day, or emotions that the user has shared on social networks.
  • Personalize the advertising shown to each user.
  • Analyze if there are discernable groups/segments with similar interests or not. Among other things, this analysis may provide valuable data to make smarter marketing decisions. For example, this data could be used to send better email updates that contain only the news that the reader cares about the most, so that they will love the site’s emails and anticipate them over time, instead of viewing them as more junk in their inbox.

Which news site would you rather visit? Which news site would you rather own? Clearly, the advantages of the “smart” news site are enjoyed by both the owner and the readers.

In addition to the capabilities mentioned above, let’s suppose that this smart news site, based on information such as the day of the week, the time of day and the type and impact of the news, among other things, could predict the number of visits during the next few hours. This would allow the site to prepare for possible traffic peaks and to automatically launch more nodes in the web farm when needed. Now, that is a really smart system! If you find this smart system example interesting, take a minute to check out Google News or News360 to see it all in action in real life.

Before illustrating more examples, we will first define and stress some concepts that you should know about smart applications, many of which we just revealed through this news site example!


In the previous section, we saw how a system could benefit from predictions and user preferences to implement suggestions, maximize opportunities, minimize risks and take action to fully leverage its potential.

Some of the fundamental concepts when implementing intelligent systems are prediction, predictive models and predictive analytics. These are related to the creation of mathematical and statistical models that link the observed values of a variable of interest (sometimes called the dependent or response variable) to contextual information, in other words, the available data (also called predictors or independent variables, which explain what was observed).

That is why identifying what information is important is crucial. Data mining and feature selection/extraction help in this sense, as they simplify the complexity of the problem. This weeding out of the unimportant information is known as Dimensionality Reduction.

Some dimensionality reduction techniques are:
  • Principal Components Analysis (PCA)
  • Linear discriminant analysis (LDA)
  • Decision Trees
  • Random Forests
  • Factor Analysis
  • And More
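To make the first technique in this list more concrete, here is a minimal sketch of PCA that uses only the Python standard library. It finds the first principal component with power iteration and projects 2-D points onto it. The data, the function names and the choice of power iteration are our own assumptions for illustration; in a real project you would normally rely on a tested library instead.

```python
import math
import random


def first_principal_component(rows, iterations=200):
    """Return the column means and the unit vector of greatest variance."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]

    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]

    # Power iteration: repeatedly multiplying a vector by the covariance
    # matrix converges to the matrix's dominant eigenvector.
    vector = [1.0] * dims
    for _ in range(iterations):
        vector = [sum(cov[i][j] * vector[j] for j in range(dims))
                  for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vector))
        vector = [v / norm for v in vector]
    return means, vector


def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((row[j] - means[j]) * component[j] for j in range(len(component)))
            for row in rows]


# Hypothetical 2-D data lying close to the line y = 2x.
random.seed(0)
data = [[x, 2 * x + random.uniform(-0.1, 0.1)] for x in range(10)]

means, component = first_principal_component(data)
reduced = project(data, means, component)  # 10 points, now 1-D
```

Because the points lie almost on a line, a single coordinate per point keeps nearly all of the information, which is exactly the situation in which dimensionality reduction pays off.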

To get familiar with these techniques, the beginner’s guide by S. Ray, K. Jain and D. Gupta is a good read, as well as Seven Techniques for Data Dimensionality Reduction. In the Recommender Systems domain, techniques such as Matrix Factorization and Singular Value Decomposition (SVD) have received a lot of attention in recent years because of their great results, as seen in the famous Netflix competition.

One of the main advantages of dimensionality reduction is that it simplifies the problem: eliminating insignificant variables brings a marked improvement in performance and scalability when working with large amounts of data. Furthermore, if the number of dimensions can be reduced to 2 or 3, the data becomes much easier to visualize and understand. In this sense, a dimensionality reduction technique that stands out for projecting and visualizing data is t-SNE, winner of the Merck Visualization Challenge by Kaggle in 2012.

Recently, I used this technique in the research work Vector representation of Internet Domain Names using a Word Embedding technique, presented at SLIOIA 2017 (Latin American Symposium on Operations Research and Artificial Intelligence), with the purpose of finding semantic similarities between web sites. It made it possible to reduce the vector representations learned for Internet domains from more than 100 dimensions to just 2 or 3, and to visualize them. Figure 1 shows an example of two-dimensional projections obtained by applying t-SNE to some Internet domain vectors. Similar sites appear as nearby points and dissimilar sites as distant points.

Fig 1 - Two-dimensional projections of the vector representations of Internet domains

Now we can start to link these concepts to understand that a smart system is, usually, heavily based on predictions. These predictions are generated from:

  • The recognition of patterns in data, which often times come from multiple heterogeneous sources and need preprocessing or transformation towards a common model.
  • The selection and extraction of the most significant characteristics which allow for reducing the complexity of the problem and relating it in some way to generate a predictive model. This will then be displayed and used from an application or a system to maximize opportunities and minimize risks.

All this is what is known as the process of Machine Learning, which is summarized in Figure 2.


Machine Learning (ML) is a term that, though it has existed for many decades, has recently become a hot topic, mainly due to the progress made in Artificial Neural Networks (ANN) and Deep Learning (DL). These are specific Machine Learning techniques which have enabled us to move significantly forward in cognitive areas such as image processing, text analysis and natural language processing (NLP). Furthermore, as shown in Figure 3, Machine Learning is a field inside Artificial Intelligence (AI).

In the information age, the Internet of Things (IoT) and Big Data have empowered the development of Machine Learning and smart applications, due to their great capacity to generate, process and store large amounts of data. Data is the fuel that Machine Learning needs in order to work.

Fig 3 - Deep Learning and Machine Learning as subsets of Artificial Intelligence

Just as people learn and improve their skills by gathering more experience and information, programs based on Machine Learning improve the accuracy of their results through constant use of the system, data about user interactions and contextual information obtained from multiple external sources.

Machine Learning algorithms can be supervised or unsupervised, depending on how they are trained. An algorithm is supervised if, when training the model, we indicate the expected output (also called label, target or response) for each input. In this case, if the output is a numeric value from a continuous range of possible values, we are talking about a regression problem. If, on the contrary, the output takes one of a finite set of discrete values, each value is called a class and the problem to solve is one of classification. If there are only 2 possible outcomes (yes/no, true/false, etc.) the problem is called binary classification.

In contrast, the algorithm is unsupervised if, when training our model, the expected outputs for the different inputs are not taught. In this case, the unsupervised algorithms are useful to understand the structure and the organization of the data. They are usually used to group the inputs into groups or clusters (clustering) or to apply dimensionality reduction. The unsupervised algorithms are difficult to evaluate because there are no expected results with which to compare a prediction.
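As an illustration of clustering, the following is a minimal k-means sketch for one-dimensional data in plain Python. The reading times and the starting centroids are hypothetical values chosen by us; note that no labels are given to the algorithm, it only groups inputs that are close to each other.

```python
def k_means(points, centroids, iterations=10):
    """Group 1-D points around centroids, starting from the given guesses."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(cluster) / len(cluster) if cluster else centroids[i]
                     for i, cluster in enumerate(clusters)]
    return centroids, clusters


# Hypothetical reading times in minutes, with no labels attached.
reading_minutes = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means(reading_minutes, centroids=[0.0, 5.0])
```

With this data the algorithm separates the quick readers from the long readers, even though nobody told it that those two groups exist.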

To evaluate the models generated from the supervised algorithms, we divide the dataset in two parts, one called the training set, used to train the model (80% or 70% of the total of the entire dataset), and another called the evaluation/validation set, used to evaluate the accuracy of the generated model (20% or 30% of the dataset). In this way, we can observe for each input in the evaluation set the difference between the model’s prediction and the actual expected result. Figure 4 shows some of the main algorithms of Machine Learning and in which category they could be employed. We can see that some may be used in more than one category.
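The split-and-evaluate procedure described above can be sketched in a few lines. This is only an illustration under our own assumptions: the dataset is made up, the 80/20 split is one of the proportions mentioned in the text, and a 1-nearest-neighbour classifier stands in for whatever model you would actually train.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


def predict_nearest(train, features):
    """1-nearest-neighbour: return the label of the closest training row."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train, key=squared_distance)[1]


def accuracy(train, test):
    """Fraction of evaluation rows whose prediction matches the real label."""
    hits = sum(predict_nearest(train, features) == label
               for features, label in test)
    return hits / len(test)


# Hypothetical labelled data: (minutes read, articles opened) -> reader type.
dataset = (
    [((m, a), "casual") for m in range(1, 11) for a in (1, 2)]
    + [((m, a), "avid") for m in range(30, 40) for a in (8, 9)]
)

train, test = train_test_split(dataset)
score = accuracy(train, test)
```

The important design point is that the evaluation rows are never shown to the model during training, so the score reflects how it behaves on inputs it has not seen before.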

In the news site example, predicting the number of users is a regression problem, knowing to which category they belong is a classification problem and studying them to know if there are groups/segments or not is a clustering problem.

Fig 4 - Algorithms of Machine Learning

We already know that data is the fuel needed to make Machine Learning work. In other words, the more the system is used and the more data is available for pattern identification, the better. Hence, if you are about to launch a system, or already have systems in production to which you would like to add Predictive Analysis and Machine Learning features in the future, make sure you are recording all the relevant information. When in doubt about the relevance of something, it is preferable to store it: it’s better to overestimate than to underestimate and end up needing data you never kept. This information often takes the form of access logs, web pages visited, usage time, etc. Due to the volume of data this can generate, in most cases it does not make sense to keep it in the transactional database of the system. Thus, at an early stage of any Machine Learning project it is crucial to create a good logging strategy, decide what kind of storage will be used and whether it will be centralized or distributed, define a backup plan, etc.
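As a rough idea of what recording interactions can look like, the sketch below writes each event as one JSON line, a format that is easy to ship to separate storage later. The field names, the event types and the use of an in-memory stream are all assumptions made for this example, not a prescribed schema.

```python
import io
import json
from datetime import datetime, timezone


def log_event(stream, user_id, action, **details):
    """Append one interaction event to the stream as a JSON line."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "action": action,
        **details,
    }
    stream.write(json.dumps(event) + "\n")


# An in-memory stream stands in for a log file or a log-shipping client.
log = io.StringIO()
log_event(log, user_id=42, action="article_view", article_id="a-17", seconds_read=95)
log_event(log, user_id=42, action="rating", article_id="a-17", stars=4)

# Later, the same lines can be read back as training data.
events = [json.loads(line) for line in log.getvalue().splitlines()]
```

Keeping the events outside the transactional database, as suggested above, means the log can grow without affecting the main system.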

So, considering that data is of paramount importance, you are probably wondering: how could Machine Learning be applied to a new system if there is no data history? For example, during the first days of the news site, we could wonder: how can we identify patterns in the data to predict the number of users that will access the site next weekend? Or which news should we suggest to a new user without a record of visits? These problems are known as “lack of data” and “cold start”. A common way to solve the initial lack of data is to take external datasets from the same application domain and use them as input for the initial model. Then, periodically, we can update the generated model to take into account the new data collected by the system since the last update. UCI Machine Learning Repository and Kaggle Datasets are two points of reference where we can find a great variety of public datasets to use in Machine Learning projects. For the cold start problem, other techniques are often combined to compensate for the lack of history. For example, we could infer a user profile from social networks and (if we have the news metadata or properties) apply algorithms to find which pieces of news have the properties that best suit that profile. This is common in Recommender Systems, which is another interesting field inside Machine Learning.
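The profile-matching idea for the cold start problem can be sketched as follows. We assume, purely for illustration, that an inferred user profile and the metadata of each article are both expressed as topic weights, and we rank the articles by cosine similarity to the profile. The topics, weights and titles are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse topic-weight vectors."""
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical profile inferred for a brand-new user (no visit history yet).
profile = {"sports": 1.0, "technology": 0.5}

# Hypothetical article metadata.
articles = {
    "Cup final recap": {"sports": 1.0},
    "New phone review": {"technology": 1.0},
    "Election results": {"politics": 1.0},
}

ranked = sorted(
    articles,
    key=lambda title: cosine_similarity(profile, articles[title]),
    reverse=True,
)
```

Once the user starts reading and rating articles, this content-based ranking can be blended with models that learn from the real interaction history.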

A Recommender System tries to predict users’ preferences in order to make suggestions. Personalizing the advertising on a news web page or predicting which kind of news to suggest to a specific user are typical problems that can be solved with recommender systems. The techniques most used by recommender systems are collaborative filtering, content-based filtering and Matrix Factorization. However, the Netflix Prize contest has shown that a hybrid approach, which combines different techniques, is the one that achieves the best results.

To learn more about Recommender Systems we suggest reading the Recommender Systems Handbook or the research work I published in CLEI 2014 with my University of the Republic (UdelaR) colleagues, where we combined Slope One (one of the simplest and most effective collaborative filtering algorithms) with a strategy known as Multiplicative Utilitarian to generate recommendations for transitory groups of people who go to the movies.
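To give a feel for how simple Slope One is, here is a minimal sketch of its weighted variant in plain Python. It covers only the single-user prediction step, not the group strategy from the research work mentioned above, and the users and ratings are made up for the example.

```python
def slope_one_predict(ratings, user, target):
    """Predict the user's rating for the target item with weighted Slope One."""
    numerator = 0.0
    denominator = 0
    for item, user_rating in ratings[user].items():
        if item == target:
            continue
        # Rating differences (target - item) among users who rated both.
        diffs = [r[target] - r[item]
                 for r in ratings.values() if target in r and item in r]
        if diffs:
            deviation = sum(diffs) / len(diffs)
            # Each estimate is weighted by how many users support it.
            numerator += (user_rating + deviation) * len(diffs)
            denominator += len(diffs)
    return numerator / denominator if denominator else None


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana": {"A": 5, "B": 3, "C": 2},
    "bruno": {"A": 3, "B": 4},
    "carla": {"B": 2, "C": 5},
}

prediction = slope_one_predict(ratings, "bruno", "C")
```

In this example only one user rated both A and C, while two users rated both B and C, so the estimate based on B counts twice as much in the final prediction.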


Here is the fun part! Until now we have used the example of the smart news site to introduce the main concepts, techniques and algorithms which come up when talking about smart applications and Machine Learning. Now, to broaden your understanding of the potential of these kinds of applications we will explore other common use cases.


As we previously mentioned, nowadays systems based on serving up recommendations or suggestions can be seen everywhere. From streaming services such as Spotify or Netflix to e-commerce platforms such as eBay and Amazon, countless applications search for better matches between their users and what they seek. You can find examples of matching in applications from the Information Retrieval (IR) area, which order search results according to which items the user will probably prefer. Another example is new HR technology, which searches for the best candidates for a position or the best job suggestions for job seekers.


The news site is a clear example of personalizing content with the aid of Machine Learning. Based on a user’s profile, specific ads and articles can appear. Analyzing user profiles to show specific content is a technique employed in many applications, such as on-demand content services and personalized advertising or marketing campaigns. Besides the content, after analyzing the user’s profile we can personalize other things, from the type of help shown, the font and the colors to the organization of the layout on the page. All this is done to improve the user’s experience. At UruIT we are experts in usability; if you want to know how usable your application is, try our free usability evaluation tool.


Classifying emails as spam or not, recognizing the language of a text, knowing whether a comment is positive or negative (sentiment analysis) and identifying whether a text was written by a specific person are only some everyday examples in which Machine Learning techniques have been used to solve text classification problems. In our news site, for example, sentiment analysis can be used to process our users’ comments and obtain implicit feedback, which can then be used as input for the model behind the recommendation engine.
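As a small illustration of text classification, the sketch below implements a naïve Bayes spam filter with Laplace smoothing using only the standard library. The four training messages are invented and far too few for real use; the point is only to show how word counts per class turn into a prediction.

```python
import math
from collections import Counter, defaultdict


def train(messages):
    """Count words per label and messages per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocabulary = set().union(*word_counts.values())
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_messages)
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages.
training = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

word_counts, label_counts = train(training)
first = classify("claim your free prize", word_counts, label_counts)
second = classify("see you at the meeting on monday", word_counts, label_counts)
```

The same structure works for other text classification tasks mentioned above, such as labelling comments as positive or negative, by changing the labels and the training messages.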


Machine Learning techniques are also used in banking. In this area, these techniques deal with problems such as fraud detection, predicting customer churn and even optimizing the cash available in ATMs.


Estimating the real value of a product on the market is another interesting application of Machine Learning. The problem here is predicting the value of the product from its characteristics, taking into account the present and future contexts.

A specific example is real estate appraisal based on location, year of construction, number of rooms, economic projections, etc. In its simplest form this could be framed as a linear regression problem, but achieving highly accurate predictions can be more complex and far more interesting to solve. So much so that the popular platform Kaggle (which recently joined Google Cloud) hosts the competition “Zillow Prize: Zillow’s Home Value Prediction (Zestimate)”, sponsored by Zillow Inc. In it, the data science community competes for a one million dollar prize for finding the best algorithms to improve the accuracy of the Zestimate estimator and advance its predictions of future real estate sale prices. Would you give it a try?
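In its simplest form, the appraisal problem looks like the sketch below: a least-squares line fitted to a single feature. The areas and prices are invented and deliberately lie on a perfect line, and a real appraisal model would of course use many more features than floor area.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sales: floor area in square meters and sale price in dollars.
areas = [50, 70, 90, 110]
prices = [110_000, 150_000, 190_000, 230_000]

slope, intercept = fit_line(areas, prices)

# Estimated price of a 100 square meter property under this toy model.
estimate = slope * 100 + intercept
```

The fitted slope can be read as the price added by each extra square meter, which is part of why linear models remain a popular first step before trying more complex algorithms.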


You may or may not use it, but Snapchat is the perfect example of an app that leverages facial recognition by creating filters that can alter the appearance of one’s face in real time, often with hilarious results. Because of Snapchat and other products, facial and image recognition is perhaps one of the best-known applications of deep learning. It is achieved by using a special type of artificial neural network (ANN) known as a Convolutional Neural Network (CNN), inspired by the neuronal connections of the visual cortex.

A very high tech country, Uruguay is one of the best locations to find software development talent (as our CEO, Marcelo López, explains in this article about outsourcing to Uruguay) and is also very fond of football. In 2017, the country launched a security system that uses facial identification technology to deny violent people access to football stadiums.

Other examples of these types of applications are the recognition of text in images with Optical Character Recognition (OCR) techniques, Facebook’s tag suggestions, Google’s SafeSearch filter of pornographic images or its new Similar Items feature so users can get information or buy products straight from Google Images search results.


Some common tasks in which speech recognition is applied are dictating to a word processor, controlling computers or phones through voice commands, correcting pronunciation through a language assistant, detecting whether a voice belongs to a given person, etc. Even though this is nothing new, huge progress was made in August 2017, when Microsoft Research achieved a 5.1% error rate, which is equal to the human error rate. This is the result of many years of effort from both the academic community and the industry. It represents a true milestone in Artificial Intelligence.


Machine Learning techniques for anomaly detection are used in several fields to detect “strange” patterns in data. Fraud detection, network intrusion detection and system monitoring are some of their most common uses.

These anomaly detection techniques can also be used during the data pre-processing stage (see Fig 2 - The Machine Learning Process), before running supervised learning, to eliminate anomalous entries that could harm the accuracy of the learned model, for example by contributing to what is known as overfitting.
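One of the simplest ways to filter anomalous entries before training is a z-score rule, sketched below with the standard library. The visit counts and the threshold of 2 standard deviations are assumptions for this example; more robust methods exist and are usually preferable when the data contains many outliers.

```python
import statistics


def remove_outliers(values, threshold=2.0):
    """Keep only values within `threshold` standard deviations of the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= threshold * stdev]


# Hypothetical hourly visit counts with one suspicious spike.
visits = [100, 102, 98, 101, 99, 103, 97, 500]
cleaned = remove_outliers(visits)
```

Here the spike of 500 visits is dropped, while the ordinary readings are kept and can be passed on to the training step.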

This article from the IBM blog and the course video about Machine Learning by Coursera are a good starting point for understanding the overfitting problem. The article "Introduction to Anomaly Detection" is also a good reference if you want more information about anomaly detection.


We can find examples or opportunities to apply Machine Learning in almost any field that pops up in our minds. For example:


Cloud solutions work quite well in general; we don’t need to know a lot to start using them. Many times they have visual aids which help create our models, some even allow starting for free and paying after increased usage.

One of the Cloud’s main advantages is that it provides a highly scalable and safe infrastructure. Not only does it offer the possibility to create predictive models to solve common problems, but it also provides specific services to work in cognitive areas such as image processing or natural language. This simplifies the task of adding features such as face or object detection or voice recognition. In this way, we aim to maximize the user experience of our application.

Here are some of the most common Cloud solutions:


Sometimes we cannot use cloud services because we need to manipulate our algorithms at a lower level or because data confidentiality restrictions prevent us from training models in the cloud. In these cases it is better to implement the solution ourselves, either by programming the technique directly if it is simple (such as a basic collaborative filter or a naïve Bayes classifier) or by using a framework or specialized library. Frameworks and libraries usually have optimized operations and are well tested, so they considerably reduce the testing needed compared with programming the techniques directly.

In the case of programming languages, if you have experience developing with .NET and love LINQ as much as we do, the frameworks Accord.NET and NUml are interesting options to consider, as is the library Math.NET. Furthermore, we recommend using the functional language F# and reading the books Machine Learning Projects for .NET Developers and Mastering .NET Machine Learning. But even though any general purpose programming language could be used, the two most popular languages are Python and R.


Furthermore, it is worth mentioning some of the principal products and tools that should be taken into account when implementing a deep learning solution.

My first contact with deep learning libraries was with TensorFlow, through my Master’s thesis, which included an implementation of the word2vec family of algorithms. These algorithms (Skip-Gram and CBOW) are widely used in Natural Language Processing (NLP). TensorFlow was developed by Google and shared with the community as an open source project at the end of 2015. In less than two years it has gained significant popularity, especially because it is used in production by some Google products, which gives it good backing and reputation.
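To show what a Skip-Gram model is trained on, the sketch below generates (target, context) word pairs from a sentence in plain Python. It covers only this data preparation step, not the neural network itself or the thesis implementation, and the sentence and window size are arbitrary choices for the example.

```python
def skip_gram_pairs(tokens, window=1):
    """Pair each word with the words at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical sentence, split on whitespace.
tokens = "the site learns quickly".split()
pairs = skip_gram_pairs(tokens, window=1)
```

Skip-Gram learns word vectors by predicting the context word from the target word in each pair, while CBOW reverses the task and predicts the target from its surrounding context.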

Among the main features of TensorFlow, we highlight its ability to run on GPUs, which speeds up vector and matrix operations, and its support for large-scale deployment of distributed solutions. It comes with APIs for Python, C++ and Java (among others); however, the Python API is the most complete and the recommended one. From my experience with this library, I’d like to point out the support for Windows systems, which worked perfectly for me and was easy to use. The tools included in TensorBoard have also been of great help for visualizing the data generated in TensorFlow. Regarding its negative aspects, I believe the learning curve is not easy because it’s a low level library, and both the library and the community around it are only starting to mature: the 1.0 version was launched in mid-February 2017.

Since it is so new, it can still be difficult to find documentation or examples that illustrate the specific problems we might face. For this reason, it’s helpful to evaluate using a wrapper over TensorFlow that provides a higher level of abstraction, to mitigate some of these problems. Among them, we highlight:

  • Keras (it also includes support for Theano)
  • Skflow (recommended if you have experience with scikit-learn; it was originally a separate project, but it was later included in the Learn module inside TensorFlow)

Due to all this, if you already know Python and have a background in Machine Learning and Deep Learning, TensorFlow (or one of its wrappers) is probably one of the best options for developing smart solutions. There are also other active deep learning frameworks that you can consider nowadays, besides TensorFlow:

On Wikipedia you can find a rather complete (although not very up-to-date) list of deep learning frameworks, and you may be interested in reading a very good analysis by R.G. Gomez (BEEVA) of several of them.


When I started working in the software development industry, in the 2000s, we witnessed a huge movement in which business information systems that were executed as desktop applications transformed into web applications. One could access them outside the office, at any time and from any web browser regardless of the operating system. Wow! (That is, theoretically, because who doesn’t remember those dear IE6 hacks!)

This was a real turning point and it meant that developers had to adapt to the new challenges that were cropping up, such as understanding the concept of distributed applications, stateless protocols such as HTTP and how to simulate stateful applications over them, the new requirements for security and performance, the concepts of client-side and server-side development, etc.

Then, with the arrival of the iPhone in 2007 and the first iPads in 2010, the trend was to start creating mobile applications or to adapt existing web sites in some way to work and look good on those devices. So much so that the concept of “mobile first” was born in the world of software development. After that, in 2014, mobile access overtook desktop. This was another turning point, which made us learn to develop applications for mobile devices (or at least to work with everyday concepts such as responsive design and native or hybrid development) in addition to desktop and web applications.

In a similar way, we now face a new “normal” in software development: smart applications. An abundance of data is crucial for recognizing patterns, and in the information age the use of Machine Learning has been empowered by the Internet of Things (IoT) and Big Data, which make it possible to generate, store and process great volumes of data in real time and at low cost. In some places the idea of “Machine Learning First” is already appearing.

Even though to a developer, this may seem like an Alfred Hitchcock or Stephen King movie, we should not be afraid, because it’s part of the natural evolution of the software industry and we extend it a warm welcome! Just like those developers that needed to learn and adapt in the previously mentioned turning points, now we must adapt again and be prepared for this new era.

We hope we helped you understand the potential that exists in the development of smart applications, the main concepts and definitions that appear when addressing this type of solution, the typical fields of use and the tools and products used to build them. Also, if you reached this point, we suppose that it’s because you are excited about the topic. Are we right? If so, we encourage you to dig deeper into these topics by subscribing to our newsletter or contacting us today to learn more about how you too can put these new technologies to practical use.
