{"id":9855,"date":"2020-03-25T21:36:43","date_gmt":"2020-03-25T21:36:43","guid":{"rendered":"https:\/\/blog.uruit.com\/?p=9855"},"modified":"2023-06-26T08:46:58","modified_gmt":"2023-06-26T11:46:58","slug":"big-data-introduction","status":"publish","type":"post","link":"https:\/\/uruit.com\/blog\/big-data-introduction\/","title":{"rendered":"A Brief Introduction to Big Data Applications and Hadoop"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Usually, big data applications are one of two types: data at rest and data in motion. <strong>The difference between them is the same as the difference between a lake and a river<\/strong><\/span><span style=\"font-weight: 400;\">\u2014<\/span><span style=\"font-weight: 400;\">a lake holds a large body of still water, while a river is a channel that a large volume of water travels through.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For this reason, data at rest applications are often referred to as data lakes, while data in motion applications are called data streams or data rivers.\u00a0<\/span><\/p>\n<blockquote><p><b><i>Fun Fact<\/i><\/b><\/p>\n<p><i><span style=\"font-weight: 400;\">The \u201cR\u00edo de la Plata\u201d, located next to our office in Uruguay, is the widest river in the world, measuring 220 km (140 mi) across \ud83d\ude42<\/span><\/i><\/p><\/blockquote>\n<p><span style=\"font-weight: 400;\">We\u2019re now going to give an overview of the most widely used big data technologies and tools, some of the concepts and challenges behind them, and a gentle introduction to developing and testing big data applications in a virtualized environment.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For this article, we\u2019ll focus mainly on data at rest applications and on the Hadoop ecosystem specifically.<\/span><\/p>\n
<h3><span style=\"font-weight: 400;\">Hadoop and big data platforms<\/span><\/h3>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-9856 size-medium\" src=\"https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/hadoop-logo-300x163.png\" alt=\"\" width=\"300\" height=\"163\" srcset=\"https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/hadoop-logo-300x163.png 300w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/hadoop-logo-768x417.png 768w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/hadoop-logo-750x408.png 750w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/hadoop-logo-20x11.png 20w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/hadoop-logo.png 920w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/hadoop.apache.org\/\" class=\"external\" rel=\"nofollow\">Hadoop<\/a> is a framework developed by Apache for the distributed processing of big data sets across multiple computers (called a cluster). <strong>At its core, Hadoop uses the MapReduce programming model to process and generate large amounts of data.<\/strong> A MapReduce program consists of a Map procedure that filters and\/or sorts the elements of the dataset, and a Reduce procedure that performs aggregations.\u00a0<\/span><\/p>\n
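<p><span style=\"font-weight: 400;\">To make the model concrete, here\u2019s a toy word count in plain Python. It\u2019s only an illustration of the Map, Shuffle, and Reduce steps, not actual Hadoop code; a real job would be written against the Hadoop APIs or Hadoop Streaming.<\/span><\/p>\n<pre class=\"lang:default decode:true \">from functools import reduce\r\nfrom itertools import groupby\r\n \r\ndocs = ['big data at rest', 'big data in motion']\r\n \r\n# Map: emit a (word, 1) pair for every word of every document\r\nmapped = [(word, 1) for doc in docs for word in doc.split()]\r\n \r\n# Shuffle: group the pairs by key (Hadoop does this between Map and Reduce)\r\nmapped.sort(key=lambda pair: pair[0])\r\ngrouped = groupby(mapped, key=lambda pair: pair[0])\r\n \r\n# Reduce: aggregate the values of each group\r\ncounts = {word: reduce(lambda a, b: a + b, (n for _, n in pairs))\r\n          for word, pairs in grouped}\r\n \r\nprint(counts)  # {'at': 1, 'big': 2, 'data': 2, 'in': 1, 'motion': 1, 'rest': 1}<\/pre>\n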
<p><span style=\"font-weight: 400;\">Besides providing a programming model suitable for distributed computing, Hadoop offers high availability, since data can easily be replicated across the nodes of the cluster. By default, Hadoop keeps three replicas of the data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A common Hadoop cluster is composed, at least, of the following components:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">An Edge Node or Gateway Node<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">A Name Node or Control\/Master Node<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">A Data Node or Worker Node<\/span><\/li>\n<\/ul>\n<p><img loading=\"lazy\" class=\"aligncenter size-full wp-image-9857\" src=\"https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/post-davide-scaled-e1585171413521.jpg\" alt=\"\" width=\"1199\" height=\"848\" srcset=\"https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/post-davide-scaled-e1585171413521.jpg 1199w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/post-davide-scaled-e1585171413521-300x212.jpg 300w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/post-davide-scaled-e1585171413521-1024x724.jpg 1024w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/post-davide-scaled-e1585171413521-768x543.jpg 768w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/post-davide-scaled-e1585171413521-750x530.jpg 750w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/post-davide-scaled-e1585171413521-1140x806.jpg 1140w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/post-davide-scaled-e1585171413521-20x14.jpg 20w\" sizes=\"(max-width: 1199px) 100vw, 1199px\" \/><\/p>\n<h4><span style=\"font-weight: 400;\">Edge Node<\/span><\/h4>\n<p><span style=\"font-weight: 400;\">The purpose of the edge node is to provide access to the cluster, which is why it\u2019s also known as a gateway node. <strong>On the edge node, you can find the services needed to retrieve or transform data<\/strong>; some of those services are Impala, Hive, Spark, and Sqoop.\u00a0<\/span><\/p>\n<h4><span style=\"font-weight: 400;\">Name Node<\/span><\/h4>\n<p><span style=\"font-weight: 400;\">Often called the master node, the name node stores the metadata of the Hadoop Distributed File System (HDFS): the directory tree and the location of every file in the system. It doesn\u2019t store the actual data, but <strong>it\u2019s a sort of index for the distributed file system<\/strong>. The name node also executes file system namespace operations like opening, closing, renaming, and deleting files and directories.<\/span><\/p>\n<h4><span style=\"font-weight: 400;\">Data Node<\/span><\/h4>\n<p><span style=\"font-weight: 400;\">As its name suggests, <strong>the data node is where the data is kept<\/strong>. Each data node is responsible for serving read and write requests and for performing data-block creation, deletion, and replication. Hadoop stores each file as a sequence of blocks (the default block size is 128 MB).<\/span><\/p>\n
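<p><span style=\"font-weight: 400;\">You can inspect this metadata yourself. The following sketch asks the name node for a file\u2019s block size and replication factor over WebHDFS, using the Python <b>hdfs<\/b> client that we\u2019ll also use later in this post (the file path is just a hypothetical example):<\/span><\/p>\n<pre class=\"lang:default decode:true \">from hdfs import InsecureClient\r\n \r\n# connect to the name node's WebHDFS endpoint (port 50070 in the quickstart image)\r\nclient = InsecureClient('http:\/\/localhost:50070', user='cloudera')\r\n \r\n# the FileStatus metadata is served by the name node, not by the data nodes\r\nstatus = client.status('\/user\/cloudera\/sales.csv')\r\n \r\nprint(status['blockSize'])    # e.g. 134217728 (128 MB)\r\nprint(status['replication'])  # e.g. 3<\/pre>\n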
<p><span style=\"font-weight: 400;\">References:\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html\" class=\"external\" rel=\"nofollow\"><span style=\"font-weight: 400;\">https:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html<\/span><\/a><\/p>\n<p><a href=\"https:\/\/hadoop.apache.org\/docs\/r1.2.1\/mapred_tutorial.html\" class=\"external\" rel=\"nofollow\"><span style=\"font-weight: 400;\">https:\/\/hadoop.apache.org\/docs\/r1.2.1\/mapred_tutorial.html<\/span><\/a><\/p>\n<hr \/>\n<h3><span style=\"font-weight: 400;\">Starting with Hadoop<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Not everyone can afford a huge cluster, so how can you try Hadoop? Luckily, you can download and use the Cloudera quickstart Docker image, which simulates a single-node Hadoop \u201ccluster\u201d.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The following steps require you to have docker and docker-compose installed. If you don\u2019t have them, follow the installation instructions for your OS here: <\/span><a href=\"https:\/\/docs.docker.com\/install\/\" class=\"external\" rel=\"nofollow\"><span style=\"font-weight: 400;\">https:\/\/docs.docker.com\/install\/<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><b>Step 1<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Download the Cloudera quickstart image using the following command:<\/span><\/p>\n<p><b>docker pull cloudera\/quickstart<\/b><\/p>\n<p><b>Step 2<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Create a file called <\/span><b>docker-compose.yml<\/b><span style=\"font-weight: 400;\"> and add the following content:<\/span><\/p>\n<pre class=\"lang:default decode:true \">version: '2'\r\n \r\nservices:\r\n  cloudera:\r\n    image: cloudera\/quickstart:latest\r\n    restart: always\r\n    privileged: true\r\n    hostname: quickstart.cloudera\r\n    command: \/usr\/bin\/docker-quickstart\r\n    ports:\r\n      - \"8020:8020\"   # HDFS\r\n      - \"8022:22\"     # SSH\r\n      - \"7180:7180\"   # Cloudera Manager\r\n      - \"21050:21050\" # Impala\r\n      - \"21000:21000\" # Impala Thrift\r\n      - \"10001:10001\" # Hive\r\n      - \"8888:8888\"   # Hue\r\n      - \"11000:11000\" # Oozie\r\n      - \"50070:50070\" # HDFS Rest Namenode\r\n      - \"50075:50075\" # HDFS Rest Datanode\r\n      - \"2181:2181\"   # Zookeeper\r\n      - \"8088:8088\"   # YARN Resource Manager\r\n      - \"19888:19888\" # MapReduce Job History\r\n      - \"50030:50030\" # MapReduce Job Tracker\r\n      - \"8983:8983\"   # Solr\r\n      - \"16000:16000\" # Sqoop Metastore\r\n      - \"8042:8042\"   # YARN Node Manager\r\n      - \"60010:60010\" # HBase Master\r\n      - \"60030:60030\" # HBase Region\r\n      - \"9090:9090\"   # HBase Thrift\r\n      - \"8081:8080\"   # HBase Rest\r\n      - \"7077:7077\"   # Spark Master\r\n    tty: true\r\n    stdin_open: true<\/pre>\n<p><span style=\"font-weight: 400;\">This docker-compose file exposes all the services; you can choose which ones to expose by simply removing the ports you don\u2019t need.<\/span><\/p>\n<p><b>Step 3<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Run the docker-compose file with the following command:<\/span><\/p>\n<p><b>docker-compose up<\/b><\/p>\n
<p><img loading=\"lazy\" class=\"aligncenter size-full wp-image-9865\" src=\"https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/docker-compose.gif\" alt=\"\" width=\"1200\" height=\"675\" \/><\/p>\n<p><b>Step 4<\/b><\/p>\n<p><span style=\"font-weight: 400;\">At this point, you can start using the Hadoop environment.<\/span><\/p>\n
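<p><span style=\"font-weight: 400;\">As a quick sanity check that the cluster is up, you can hit the WebHDFS REST API exposed on port 50070. Here\u2019s a minimal sketch using the Python <b>requests<\/b> library, assuming the port mappings from the docker-compose file above:<\/span><\/p>\n<pre class=\"lang:default decode:true \">import requests\r\n \r\n# list the HDFS root directory through the WebHDFS REST API\r\nresp = requests.get('http:\/\/localhost:50070\/webhdfs\/v1\/?op=LISTSTATUS')\r\nresp.raise_for_status()\r\n \r\nfor entry in resp.json()['FileStatuses']['FileStatus']:\r\n    print(entry['pathSuffix'], entry['type'])<\/pre>\n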
<h3><span style=\"font-weight: 400;\">Tools and Services of the Hadoop ecosystem<\/span><\/h3>\n<h4><span style=\"font-weight: 400;\">Impala<\/span><\/h4>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-9858 size-medium\" src=\"https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/584808d2cef1014c0b5e48f3-125x300.png\" alt=\"\" width=\"125\" height=\"300\" srcset=\"https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/584808d2cef1014c0b5e48f3-125x300.png 125w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/584808d2cef1014c0b5e48f3-8x20.png 8w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/584808d2cef1014c0b5e48f3.png 214w\" sizes=\"(max-width: 125px) 100vw, 125px\" \/><\/p>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/impala.apache.org\/\" class=\"external\" rel=\"nofollow\">Impala<\/a> is a service that allows you to create distributed databases over the Hadoop File System. By using it, it\u2019s possible to query Hadoop as if it were a relational database server, using SQL syntax.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Data in Hadoop is stored as documents (files), much like in a NoSQL database. <strong>By using Impala, you can add a relational layer on top (the relations are stored in the metadata)<\/strong>. That said, Impala and Hadoop are not ACID databases: given Hadoop\u2019s distributed nature and the trade-offs described by the CAP theorem, they don\u2019t support transactions natively.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Hadoop instead satisfies the BASE properties of databases: <b>B<\/b>asically <b>A<\/b>vailable, <b>S<\/b>oft state, <b>E<\/b>ventually consistent.<\/span><\/p>\n<p><b>Note:<\/b><span style=\"font-weight: 400;\"> some SQL functions may differ from the MySQL\/PostgreSQL or MSSQL specs.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To access the Impala service of the cloudera\/quickstart image, open the following link: <\/span><a href=\"http:\/\/localhost:8888\/impala\" class=\"external\" rel=\"nofollow\"><span style=\"font-weight: 400;\">http:\/\/localhost:8888\/impala<\/span><\/a><\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-full wp-image-9864\" src=\"https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/image-1.png\" alt=\"\" width=\"1255\" height=\"673\" srcset=\"https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/image-1.png 1255w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/image-1-300x161.png 300w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/image-1-1024x549.png 1024w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/image-1-768x412.png 768w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/image-1-750x402.png 750w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/image-1-1140x611.png 1140w, https:\/\/uruit.com\/blog\/wp-content\/uploads\/2020\/03\/image-1-20x11.png 20w\" sizes=\"(max-width: 1255px) 100vw, 1255px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Here you can use the typical SQL commands to create databases and tables or to query the existing databases. Impala exposes port 21000, which can be used with a Thrift connector, and port 21050 for JDBC connections.<\/span><\/p>\n
<p><span style=\"font-weight: 400;\">The following code examples show how to connect your NodeJS or Python app to Impala.<\/span><\/p>\n<h4><b>Python<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Install the <\/span><b>impyla<\/b><span style=\"font-weight: 400;\"> library with pip or conda:<\/span><\/p>\n<p><b>pip install impyla<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Create a file called <\/span><b>app.py<\/b><span style=\"font-weight: 400;\"> with the following content:<\/span><\/p>\n<pre class=\"lang:default decode:true \">from impala.dbapi import connect\r\nconn = connect(host='localhost', port=21050) # setup the connection\r\ncursor = conn.cursor() # instantiate a cursor\r\ncursor.execute('SELECT 1+1') # execute a query\r\nprint(cursor.description)  # prints the result set's schema\r\nresults = cursor.fetchall() # fetch all the rows of the result set\r\nprint(results) # print the results<\/pre>\n<p><span style=\"font-weight: 400;\">Execute the file with the Python interpreter: <\/span><b>python app.py<\/b><\/p>\n<h4><b>NodeJS<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">First, install the <\/span><b>node-impala<\/b><span style=\"font-weight: 400;\"> module:<\/span><\/p>\n<p><b>npm i -s node-impala<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Create a file called <\/span><b>app.js<\/b><span style=\"font-weight: 400;\"> with the following content:<\/span><\/p>\n<pre class=\"lang:default decode:true \">const { createClient } = require(\"node-impala\");\r\n \r\nconst client = createClient();\r\n \r\n\/\/ connect to the Impala thrift port 21000\r\nclient\r\n  .connect({\r\n    host: \"localhost\",\r\n    port: 21000,\r\n    resultType: \"json-array\"\r\n  })\r\n  .then(() =&gt; {\r\n    \/\/ execute the query and print the result\r\n    client\r\n      .query(\"select 1+1\")\r\n      .then(console.log)\r\n  })\r\n  .catch(error =&gt; console.error(error));<\/pre>\n<p><span style=\"font-weight: 400;\">Execute the app.js file with the Node interpreter: <\/span><b>node app.js<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Although there are libraries that connect Node with Impala, we can\u2019t recommend Node for this task, as the available connectors don\u2019t support Kerberos authentication, which is essential in a production environment. But if you\u2019re a developer willing to invest time in learning how to interact with a distributed system, this could be a good start.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Loading data in Hadoop<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">We\u2019ve shown how to connect to a simple Hadoop cluster, but how do you ingest data?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The Hadoop ecosystem provides some utilities that help you do that, one of them being <\/span><a href=\"https:\/\/sqoop.apache.org\/\" class=\"external\" rel=\"nofollow\"><b>Apache Sqoop<\/b><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n
<p><span style=\"font-weight: 400;\">Using Sqoop, you can import data from any JDBC-compatible database into Hadoop.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The following is an example of how to run a Sqoop import command:<\/span><\/p>\n<pre class=\"lang:default decode:true \">sqoop \\\r\nimport-all-tables \\\r\n--hive-import \\\r\n--create-hive-table \\\r\n--connect \"jdbc:mysql:\/\/host_name\/db_name\" \\\r\n--username user \\\r\n--password secret \\\r\n-m 1<\/pre>\n<p><span style=\"font-weight: 400;\">The command will import all the tables of the <\/span><b>db_name<\/b><span style=\"font-weight: 400;\"> database from the MySQL server running on <\/span><b>host_name<\/b><span style=\"font-weight: 400;\"> into Hive, where they can also be queried through Impala.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Sometimes you just want to create a table from an exported file, like a CSV. In this case, you can use Python and the <\/span><b>ibis-framework<\/b><span style=\"font-weight: 400;\"> library to do it in just a few lines of code.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">First, install the ibis-framework library using pip or conda:<\/span><\/p>\n<p><b>pip install ibis-framework<\/b><\/p>\n<pre class=\"lang:default decode:true\">import pandas as pd\r\nimport ibis\r\nimport re\r\n \r\n# create a connection with hdfs\r\nhdfs = ibis.hdfs_connect(host=\"localhost\", port=50070)\r\n# create a connection with impala\r\nc = ibis.impala.connect(host=\"localhost\", port=21050, hdfs_client=hdfs)\r\n# select the database\r\ndb = c.database('default')\r\n# read a csv file into a pandas DataFrame\r\ndata = pd.read_csv(\".\/data\/datasets\/sales.csv\")\r\n# replace spaces with underscores in the column names\r\ndata.columns = [re.sub(\" \",\"_\", col) for col in data.columns]\r\n# create a table called \"sales\" and load the DataFrame into it\r\ndb.create_table('sales', data)<\/pre>\n
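<p><span style=\"font-weight: 400;\">If everything went well, the new table is immediately queryable through the same ibis connection. As a quick (hypothetical) check on the <b>sales<\/b> table created above:<\/span><\/p>\n<pre class=\"lang:default decode:true \"># build an expression referencing the new table\r\nsales = db.table('sales')\r\n \r\n# count the imported rows and fetch the first five as a pandas DataFrame\r\nprint(sales.count().execute())\r\nprint(sales.limit(5).execute())<\/pre>\n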
<p><span style=\"font-weight: 400;\">Another way to load data into Hadoop is to use the HDFS utilities directly: you can load raw data in different file formats (csv, parquet, feather, avro, txt, images).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The following code demonstrates how to upload a file to HDFS using the Python <\/span><b>hdfs<\/b><span style=\"font-weight: 400;\"> library.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Install the library using the pip command:<\/span><\/p>\n<p><b>pip install hdfs<\/b><\/p>\n<pre class=\"lang:default decode:true \">from hdfs import InsecureClient\r\n \r\n# connect to the webhdfs port 50070\r\nweb_hdfs_interface = InsecureClient('http:\/\/localhost:50070', user='cloudera')\r\n \r\n# upload the csv file in .\/data\/datasets\/sales.csv to \/user\/cloudera\/sales.csv\r\nweb_hdfs_interface.upload(\r\n    \"\/user\/cloudera\/sales.csv\", \r\n    \".\/data\/datasets\/sales.csv\",\r\n    overwrite=True)\r\n \r\n# list the files in the \/user\/cloudera directory on hdfs\r\nfiles = web_hdfs_interface.list('\/user\/cloudera\/')\r\n \r\nprint(files)<\/pre>\n<p><span style=\"font-weight: 400;\">Using this method, Hadoop can also be used as a distributed and highly available file system.<\/span><\/p>\n
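<p><span style=\"font-weight: 400;\">Reading a file back is just as simple; for example, with the same client as above:<\/span><\/p>\n<pre class=\"lang:default decode:true \"># stream the uploaded file back from hdfs\r\nwith web_hdfs_interface.read('\/user\/cloudera\/sales.csv') as reader:\r\n    print(reader.read()[:200])  # print the first 200 bytes<\/pre>\n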
<h3><span style=\"font-weight: 400;\">Summary: a big data introduction<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Every day, <strong>big data becomes more relevant as companies generate and collect huge amounts of data.<\/strong> As an example, here at UruIT, we\u2019re developing a system for the multinational telecom Telefonica\/Movistar to find anomalies in its databases. The system is based on a combination of the big data technologies described above, plus web technologies like NodeJS and React, in order to link the two worlds (web and big data) and provide an elegant user experience for both technical and non-technical users.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Moreover, <strong>big data helps in the development of Machine Learning and Deep Learning models<\/strong>, since having a large amount of data increases the likelihood of finding hidden patterns. Taking our anomaly detection system as an example, once we collect enough data, we could introduce features such as error prediction and autocorrection capabilities.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this blog post, we\u2019ve provided a brief introduction to the Hadoop ecosystem, showing how to set up and run a small instance of Hadoop using the Cloudera quickstart Docker image, and how to connect to it in different ways and from different programming languages.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is just the tip of the big data \u201ciceberg,\u201d and we\u2019ve only focused on big data at rest. <strong>When handling large volumes of data, other problems come into play<\/strong>, like partitioning the data in the most performant way or reducing the amount of data shuffled between the nodes of the cluster to improve performance.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Setting up an on-premise Hadoop cluster from scratch involves a lot of infrastructure work to physically provision the machines. To avoid this, you can rely on cloud services like <\/span><a href=\"https:\/\/cloud.google.com\/dataproc\/\" class=\"external\" rel=\"nofollow\"><b>Google DataProc<\/b><\/a>\u00a0<span style=\"font-weight: 400;\">and<\/span>\u00a0<a href=\"https:\/\/aws.amazon.com\/emr\/\" class=\"external\" rel=\"nofollow\"><b>Amazon EMR<\/b><\/a><span style=\"font-weight: 400;\">, where you can spin up a distributed cluster in the cloud.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Please leave us a comment letting us know your experience with big data and Hadoop, and happy data munging! Also, feel free to contact us if you\u2019d like to know more about our experience building big data apps. If you\u2019re interested in learning about <a href=\"https:\/\/uruit.com\/blog\/introduction-to-machine-learning\/\">big data and Machine Learning<\/a>, we recommend the post in the link \ud83d\ude42<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Usually, big data applications are one of two types: data at rest and data in motion. The difference between them is the same as the difference between a lake and&#8230;<\/p>\n","protected":false},"author":35,"featured_media":9859,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[290],"tags":[],"_links":{"self":[{"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/posts\/9855"}],"collection":[{"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/comments?post=9855"}],"version-history":[{"count":6,"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/posts\/9855\/revisions"}],"predecessor-version":[{"id":11337,"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/posts\/9855\/revisions\/11337"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/media\/9859"}],"wp:attachment":[{"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/media?parent=9855"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/categories?post=9855"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/uruit.com\/blog\/wp-json\/wp\/v2\/tags?post=9855"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}