A Brief Introduction to Big Data Applications and Hadoop


Usually, big data applications are one of two types: data at rest and data in motion. The difference between them is the same as the difference between a lake and a river: a lake is a place that contains a lot of water, while a river is a place that a lot of water travels through.

For this reason, data-at-rest applications are often referred to as data lakes, while data-in-motion applications are called data streams or data rivers.

Fun Fact 

The Río de la Plata, located next to our office in Uruguay, is the widest river in the world, measuring 220 km (140 mi) across 🙂

We’re now going to give an overview of the most used technologies and tools for big data, some of the concepts and challenges behind them, and a gentle introduction to developing and testing big data applications in a virtualized environment.

For this article, we’ll focus mainly on data at rest applications and on the Hadoop ecosystem specifically.

Hadoop and big data platforms

Hadoop is a framework developed by Apache used for the distributed processing of big data sets across multiple computers (called a cluster). At its core, Hadoop uses the MapReduce programming model to process and generate large amounts of data. A MapReduce program consists of a Map procedure, used to filter and/or sort the elements of the dataset, and a Reduce procedure, which performs aggregations.
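To make the model a bit more concrete, here is a minimal word-count sketch in Python, written in the Hadoop Streaming style (the file names and the plain-text input are assumptions for illustration): the Map step emits a count of 1 for every word it reads, and the Reduce step adds up the counts for each word.

# mapper.py: the Map step; read text from stdin and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py: the Reduce step; sum the counts emitted by the mappers for each word
import sys

counts = {}
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    counts[word] = counts.get(word, 0) + int(count)

for word, total in counts.items():
    print(word + "\t" + str(total))

On a real cluster, these two scripts would be handed to the Hadoop Streaming jar as the mapper and reducer programs; locally, you can simulate the same flow with cat input.txt | python mapper.py | sort | python reducer.py.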

Other than providing a programming model suitable for distributed computing, Hadoop provides high availability, as the data can be replicated easily across the nodes of the cluster. By default, Hadoop replicates the data three times.

A common Hadoop cluster is composed of at least the following components:

  • An Edge Node or Gateway node
  • A Name Node or Control/Master Node
  • A Data Node or Worker Node

Edge Node

The purpose of the edge node is to provide access to the cluster, that’s why it’s also known as a gateway node. On the edge node, you can find the services needed to retrieve or transform data; some of those services are Impala, Hive, Spark, and Sqoop. 

Name Node

Often called the master node, the name node stores the metadata of the Hadoop Distributed File System (HDFS): the directory tree and all the files in the system. It doesn’t store the actual data, but it’s a sort of index for the distributed file system. The name node also executes the file system namespace operations, like opening, closing, renaming, and deleting files and directories.

Data Node

As its name would suggest, the data node is where the data is kept. Each node is responsible for serving read and write requests and for performing data-block creation, deletion, and replication. Hadoop stores each file as blocks of data (the default block size is 128MB).

References: 

https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html


Starting with Hadoop

Not everyone can afford a huge cluster, so how can you try Hadoop? Luckily, you can download and use the Cloudera quickstart Docker image that simulates a single node Hadoop “cluster”. 

The following steps require you to have docker and docker-compose installed. If you don’t have them, follow the installation instructions for your OS here: https://docs.docker.com/install/.

Step 1

Download the Cloudera quickstart image using the following command:

docker pull cloudera/quickstart

Step 2

Create a file called docker-compose.yml and add the following content:
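A minimal sketch of such a file is shown below; the hostname, privileged flag, and /usr/bin/docker-quickstart command follow Cloudera’s own docker instructions, while the port mappings (Hue on 8888, the Impala ports 21000/21050 used later in this post, and the HDFS web UI on 50070) are just an example set you can adapt:

version: "2"
services:
  quickstart:
    image: cloudera/quickstart
    hostname: quickstart.cloudera
    privileged: true
    command: /usr/bin/docker-quickstart
    tty: true
    stdin_open: true
    ports:
      - "8888:8888"    # Hue web UI (Impala query editor)
      - "21000:21000"  # Impala Thrift port
      - "21050:21050"  # Impala JDBC/HiveServer2 port
      - "50070:50070"  # HDFS NameNode web UI / WebHDFS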

With this docker-compose file, you decide which services are exposed to the host: simply add or remove the port mappings of the services you need or don’t need.

Step 3

Run the docker-compose file with the following command:

docker-compose up

Step 4

At this point, you can start using the Hadoop environment.

Tools and Services of the Hadoop ecosystem

Impala

Impala is a service that allows you to create distributed databases over the Hadoop Distributed File System. By using it, it’s possible to query Hadoop as if it were a relational database server, using SQL syntax.

As Hadoop stores data as documents, the data is stored in a way similar to NoSQL databases. By using Impala, you can add a relational layer on top of it (the relations are stored in the metadata). That said, Impala and Hadoop are not ACID databases: they do not support transactions natively, due to the CAP theorem and the distributed nature of Hadoop.

Hadoop, instead, satisfies the BASE properties of databases:

  • Basically Available
  • Soft state
  • Eventually consistent

Note that some SQL functions may vary from the MySQL, PostgreSQL, or MS SQL specifications.

To access the Impala service of the cloudera/quickstart image, click the following link: http://localhost:8888/impala 

Here you can use typical SQL commands to create databases and tables or to query the existing databases. Impala exposes port 21000, which can be used to connect with a Thrift connector, and port 21050 to connect via a JDBC connector. The following are some code examples showing how you can connect your NodeJS or Python app to Impala.

Python

Install the impyla library with pip or conda:

pip install impyla

Create a file called app.py with the following content:
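A minimal sketch of such a script, assuming the impyla DB-API client and the quickstart defaults (localhost and the HiveServer2 port 21050), could look like this:

# app.py: connect to Impala through the HiveServer2 port and run a simple query
from impala.dbapi import connect

# host and port are assumptions for the local quickstart container
conn = connect(host='localhost', port=21050)
cursor = conn.cursor()

cursor.execute('SHOW DATABASES')
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()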

Execute the file with the Python interpreter: python app.py 

NodeJS

First, install the node-impala module:

npm i -s node-impala

Create a file called app.js with the following content:
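Again as a sketch, following the createClient/connect/query usage documented by the node-impala module, and assuming the local quickstart host and the Impala Thrift port 21000:

// app.js: connect to Impala via the Thrift port and run a simple query
const impala = require('node-impala');

const client = impala.createClient();

client.connect({
  host: 'localhost',       // assumption: local quickstart container
  port: 21000,             // Impala Thrift port exposed by the image
  resultType: 'json-array'
});

client.query('SHOW DATABASES')
  .then(result => console.log(result))
  .catch(err => console.error(err));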

Execute the app.js file using Node interpreter: node app.js

Although there are libraries that connect Node with Impala, we can’t recommend Node for this task, as the available connectors do not support Kerberos authentication, something essential for a production environment. But, if you’re a developer who’s willing to invest time in learning how to interact with a distributed system, this could be a good start.

Loading data in Hadoop

We’ve shown how to connect to a simple Hadoop cluster, but how do you ingest data?

The Hadoop ecosystem provides some utilities that help you do that, one of them being Apache Sqoop. Using Sqoop, you can import data from any JDBC-compatible database into Hadoop.

The following is an example of how to run a Sqoop import command:
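The sketch below uses Sqoop’s import-all-tables tool with --hive-import, which makes the imported tables visible to Impala through the shared Hive metastore; the connection string, credentials, and database name are placeholders:

sqoop import-all-tables \
  --connect jdbc:mysql://localhost:3306/db_name \
  --username db_user \
  --password db_password \
  --hive-import \
  --hive-database db_name \
  -m 1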

The command will import all the tables in the database, db_name, of the MySQL service running on localhost into a database called db_name in Impala.

Sometimes you just want to create a table from an exported file, like a CSV. In this case, you can use Python and the Ibis-framework library to do it in just six lines of code.

First, install the ibis-framework library using pip or conda:

pip install ibis-framework
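As a sketch, assuming an ibis-framework release with the Impala backend, the quickstart’s default ports (WebHDFS on 50070, Impala on 21050), and hypothetical file, table, and database names:

# load_csv.py: create an Impala table from a local CSV file using ibis and pandas
import ibis
import pandas as pd

hdfs = ibis.hdfs_connect(host='localhost', port=50070)
client = ibis.impala.connect(host='localhost', port=21050, hdfs_client=hdfs)
df = pd.read_csv('data.csv')
client.create_table('my_table', df, database='db_name')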

Another way to load data into Hadoop is to use the HDFS utilities directly. Basically, you can load the raw data in different file formats (CSV, Parquet, Feather, Avro, txt, images).
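For example, from a shell inside the quickstart container you can copy a local file into HDFS with the hdfs dfs commands (the paths are placeholders):

hdfs dfs -mkdir -p /user/cloudera/data
hdfs dfs -put data.csv /user/cloudera/data/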

The following code demonstrates how to upload a file to hdfs using the python-hdfs library.

Install the library using the pip command:

pip install python-hdfs
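The exact client API depends on the library version, so the following is only a sketch; it uses the WebHDFS-style InsecureClient exposed by the HdfsCLI package (imported as hdfs), and the NameNode URL, user, and paths are assumptions for the quickstart container:

# upload.py: upload a local file to HDFS over WebHDFS
from hdfs import InsecureClient

# NameNode WebHDFS endpoint and HDFS user are assumptions for the quickstart container
client = InsecureClient('http://localhost:50070', user='cloudera')

# copy the local file data.csv into /user/cloudera/data/ on HDFS
client.upload('/user/cloudera/data/data.csv', 'data.csv')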

Using this method, Hadoop can also be used as a distributed and highly available file system.

Summary: a big data introduction

Every day, big data is becoming more and more relevant as companies generate and collect huge amounts of data. As an example, here at UruIT, we’re developing a system for the multinational telecom Telefonica/Movistar to find anomalies in its databases. The system is based on a combination of the big data technologies described above plus some web technologies like NodeJS and React, in order to link the two worlds (web and big data) together and provide an elegant user experience for both technical and non-technical users.

Moreover, big data helps in the development of Machine Learning and Deep Learning models, since having a large amount of data can increase the likelihood of finding hidden patterns. Taking our anomaly detection system as an example, once we collect enough data, we could introduce features such as error prediction in the system and autocorrection capabilities.

In this blog post, we’ve provided a brief introduction to the Hadoop ecosystem, showing how to set up and run a small instance of Hadoop using the Cloudera quickstart Docker image, and how to connect to it in different ways and from different programming languages.

This is just the tip of the big data “iceberg,” and we’ve only focused on big data at rest. When handling large volumes of data, other problems come into play, like partitioning data in the most performant way or reducing the amount of data shuffling between the nodes of the cluster to improve performance.

We have to say that setting up a Hadoop cluster on premises from scratch involves a lot of infrastructure work to physically provision the machines. To avoid this, you can rely on cloud services like Google Dataproc and Amazon EMR, where you can spin up your distributed cluster in the cloud.

Please leave us a comment letting us know about your experience with big data and Hadoop, and happy data munging! Also, feel free to contact us if you’d like to know more about our experience building big data apps. If you’re interested in learning about big data and Machine Learning, we recommend the post in the link 🙂

Davide Andreazzini

Hello, it's me, Davide … I'm a Full Stack Dev, but most of all I'm a tinkerer and a hacker, and I love coding. I've been coding web applications for 10 years, and in the last two I started to get interested in big data applications and applied data science. Here are some of the technologies that I like to work with: Javascript, Python, Scala, Clojure, Hadoop, Impala, Apache Spark, Tensorflow, Keras, MongoDB, RethinkDB, Redis, Docker, Kubernetes … I like to stay updated on new technologies and, in my free time, I play guitar and I'm always tinkering on projects involving microcontrollers like Arduino and Raspberry Pi.
