{"id":54217,"date":"2018-01-17T12:06:38","date_gmt":"2018-01-17T06:36:38","guid":{"rendered":"https:\/\/www.whizlabs.com\/?p=54217"},"modified":"2024-05-17T17:39:09","modified_gmt":"2024-05-17T12:09:09","slug":"apache-hive-faster-better-sql-on-hadoop","status":"publish","type":"post","link":"https:\/\/www.whizlabs.com\/blog\/apache-hive-faster-better-sql-on-hadoop\/","title":{"rendered":"Apache Hive &#8211; A Faster and Better SQL on Hadoop"},"content":{"rendered":"<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Hadoop is a hot technology that primarily deals with petabytes of data for high-level analysis in enterprise applications. However, enterprises often work in a time-bound situation that requires fast analysis of collected data over a limited period. Hadoop MapReduce is a complicated tool for analysis purpose. Along with it, you need some programming language skills to explore data in MapReduce.<\/span><\/p>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Here comes the need for data query language like SQL for data extraction, analysis, and processing. Although tried and tested with big data front end, certainly SQL does not fit well with Hadoop data store. It is a general purpose database language and not built solely for analytical purpose. On the other hand, the query language HQL or HiveQL of Apache Hive works well on historical data for analytical querying. 
Moreover, Apache Hive gives you the opportunity to take better, more complete control over your data.<\/span><\/p>\n<p><a href=\"https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2024\/05\/Apache-hive-Architecture.png\"><img decoding=\"async\" class=\"size-full wp-image-55194 aligncenter\" src=\"https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2024\/05\/Apache-hive-Architecture.png\" alt=\"Apache hive Architecture\" width=\"560\" height=\"315\" \/><\/a><\/p>\n<h2 style=\"text-align: justify;\"><span lang=\"EN-GB\">Why is Apache Hive Faster than SQL?<\/span><\/h2>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Organizations\u2019 affinity towards open source brings the addition of Apache Hive to the Hadoop family for analytical querying. It is important to realize that Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop. The primary use of Apache Hive in Hadoop is to summarize data, run ad-hoc queries, and analyze large datasets. In addition, you get the chance to project a structure onto the data collected in Hadoop. Furthermore, Apache Hive provides a familiar, SQL-like interface to query data in the Hadoop cluster. Hence, it is a great way to start analyzing data faster.<\/span><\/p>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Let\u2019s look into the technical aspects that make Hive faster at processing queries.<\/span><\/p>\n<h4 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Partitioning of Table<\/span><\/b><\/h4>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Hive significantly improves query performance through its partitioning techniques, which can be static or dynamic. The Hadoop data store (that is, HDFS) holds petabytes of data that a Hadoop user needs to query for data analysis. No doubt, that is a significant burden! With partitioning, Hive divides all the stored table data into multiple partitions. 
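<\/span><\/p>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">As a minimal sketch (the table and column names here are illustrative, not from the original post), a partitioned Hive table can be declared and queried like this:<\/span><\/p>\n<pre><code>-- Partition the table by date; each distinct date value gets its own HDFS directory\nCREATE TABLE sales (id INT, amount DOUBLE)\nPARTITIONED BY (sale_date STRING)\nSTORED AS ORC;\n\n-- A filter on the partition column lets Hive scan only the matching partition\nSELECT SUM(amount) FROM sales WHERE sale_date = '2018-01-01';<\/code><\/pre>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">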
Each partition holds the records that share a specific value of the partition column and is stored in its own location in HDFS. Hence, when you query the table with a filter on that column, you are in fact querying only a partition of the table, not the whole of it. <\/span><\/p>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Apache Hive converts the SQL query into a MapReduce job and then submits it to the Hadoop cluster, which returns the query result. The partitioning process decreases operational I\/O time and execution load. As a result, the overall performance increases markedly.<\/span><\/p>\n<h4 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Bucketing<\/span><\/b><\/h4>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">While dealing with large tables, it is possible that even the partition sizes do not match the expected file size, and partitioning alone does not help. To manage this, Hive allows users to divide the data set further into smaller, more manageable parts called buckets. Hive uses a hash function on a chosen column to subdivide the partitions into these buckets.<\/span><\/p>\n<p><a href=\"https:\/\/www.whizlabs.com\/big-data-certifications\/\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" class=\"size-full wp-image-55205 aligncenter\" src=\"https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2024\/05\/FLAT-40-OFF-ON-PSM-I-EXAM-SIMULATOR-6.jpg\" alt=\"BIG DATA OFFER\" width=\"336\" height=\"280\" \/><\/a><\/p>\n<h4 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Use of TEZ Engine<\/span><\/b><\/h4>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">You can use Apache TEZ instead of MapReduce as the execution engine for Apache Hive. TEZ, a developer API and framework for writing native YARN applications, provides highly optimized data processing. 
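<\/span><\/p>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Switching engines is a session-level setting; as a sketch, you can select TEZ (or fall back to MapReduce) with the standard <code>hive.execution.engine<\/code> property:<\/span><\/p>\n<pre><code>-- Run subsequent queries in this session on TEZ instead of MapReduce\nSET hive.execution.engine=tez;\n\n-- Switch back to classic MapReduce if needed\nSET hive.execution.engine=mr;<\/code><\/pre>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">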
With Apache Hive, TEZ works as a faster query engine. It follows a shared-nothing architecture in which each processing unit has its own memory and disk resources and works independently.<\/span><\/p>\n<h4 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Use of ORCFILE<\/span><\/b><\/h4>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">ORCFile is a columnar table storage format that brings significant speed improvements through techniques like predicate push-down and compression. Apache Hive supports the ORCFile format for every Hive table, and using ORCFiles is extremely helpful in speeding up query execution.<\/span><\/p>\n<h4 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Implementing Vectorization<\/span><\/b><\/h4>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Vectorization is an essential feature available from Hive 0.13 onwards. It processes rows in batches rather than one at a time, which speeds up operations such as scans, filters, joins, and aggregations.<\/span><\/p>\n<h4 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Cost-based Optimization (CBO)<\/span><\/b><\/h4>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Cost-based optimization is Hive\u2019s way of optimizing the logical and physical execution plan for each query. It uses estimated query cost to decide, among other things, the order of joins and the type of join to perform.<\/span><\/p>\n<h4 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Dynamic Runtime Filtering<\/span><\/b><\/h4>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Dynamic Runtime Filtering in Apache Hive provides a fully dynamic way to prune table data. A bloom filter is built automatically from the actual dimension-table values, and it eliminates the rows and skips the records that do not match. No further join or shuffle operations happen on that data. 
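<\/span><\/p>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Several of the optimizations above are switched on through session-level properties. A hedged sketch follows; the property names are taken from the Hive configuration documentation, and their availability and defaults vary by Hive version:<\/span><\/p>\n<pre><code>SET hive.vectorized.execution.enabled=true;  -- vectorized (batched) execution\nSET hive.cbo.enable=true;                    -- cost-based optimization\nSET hive.tez.dynamic.partition.pruning=true; -- runtime partition pruning on TEZ<\/code><\/pre>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">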
As a result, it saves a considerable amount of CPU time and network consumption.<\/span><\/p>\n<figure id=\"attachment_54222\" aria-describedby=\"caption-attachment-54222\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2024\/05\/Hive_Pic2.png\"><img decoding=\"async\" class=\"wp-image-54222 size-full\" src=\"https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2024\/05\/Hive_Pic2.png\" alt=\"Dynamic Runtime Filtering\" width=\"1024\" height=\"446\" \/><\/a><figcaption id=\"caption-attachment-54222\" class=\"wp-caption-text\">(Image Source: https:\/\/hortonworks.com)<\/figcaption><\/figure>\n<h2 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Introduction of Hive LLAP<\/span><\/b><\/h2>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">LLAP (Low Latency Analytical Processing) is a second-generation big data system that combines optimized in-memory caching with persistent query executors. The LLAP SSD cache, a combination of RAM and SSD, creates a giant pool of memory and makes computation happen in memory rather than on disk. Furthermore, it caches data intelligently and shares computed data among clients. Hive 2.0 with the LLAP implementation launches queries almost instantly, and in-memory caching avoids unnecessary disk I\/O. 
This reportedly makes Hive 2 up to 26x faster than Hive 1.<\/span><\/p>\n<h4 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Explore Apache Hive Career to become a Hadoop Professional<\/span><\/b><\/h4>\n<p><a href=\"https:\/\/www.whizlabs.com\/blog\/top-50-hadoop-interview-questions\/\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" class=\"wp-image-55201 alignright\" src=\"https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2024\/05\/FLAT-40-OFF-ON-PSM-I-EXAM-SIMULATOR-5.jpg\" alt=\"\" width=\"318\" height=\"265\" \/><\/a><\/p>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">Building a Hadoop career is a common ambition in today\u2019s IT industry. Hence, if you&#8217;re already familiar with SQL but not a programmer, this blog may have shown you a roadmap for acquiring Hadoop skills. In reality, you can learn Apache Hive as the data-processing option in your Hadoop tool belt. Although SQL and Hive have some working differences, you will find the transition smooth once you enter the Hive world.<\/span><\/p>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">However, proper Apache Hive preparation is a must to succeed as a Hadoop professional. There are many differences in constructs and syntax between Hive and SQL, and you also need to know the architectural details and gain hands-on experience. In this scenario, Hortonworks leads the industry with its data platform (HDP) for Hadoop. 
Its certification for HDPCA (HDP Certified Administrator) is an excellent choice if you want rock-solid Apache Hive preparation.<\/span><\/p>\n<h2 style=\"text-align: justify;\"><b><span lang=\"EN-GB\">Conclusion<\/span><\/b><\/h2>\n<p style=\"text-align: justify;\"><span lang=\"EN-GB\">To conclude, Whizlabs\u2019 aim is to provide a <a href=\"https:\/\/www.whizlabs.com\/hdpca-certification\/\" target=\"_blank\" rel=\"noopener noreferrer\">complete guide<\/a> for the HDPCA exam with hands-on exercises. With this exam guide, you will receive full coverage of Automated Ambari Installation and Capacity Planning, through which you will learn how to create the Hive Metastore. Our core training team ensures that the training materials stay up to date and in sync with the Hortonworks certification syllabus. The guide gives you complete coverage not only of Apache Hive in the Hadoop ecosystem but also of the other components that work integrally with Hive. We are confident the guide will help you build your Apache Hive career.<\/span><\/p>\n<p>[divider \/]<\/p>\n<p><i><span lang=\"EN-IN\">If you have any questions or doubts, just write below in the comments section or write\u00a0<a href=\"https:\/\/help.whizlabs.com\/hc\/en-us\" target=\"_blank\" rel=\"noopener nofollow external noreferrer\">here<\/a>; we will be happy to answer!<\/span><\/i><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hadoop is a hot technology that primarily deals with petabytes of data for high-level analysis in enterprise applications. However, enterprises often work in a time-bound situation that requires fast analysis of collected data over a limited period. Hadoop MapReduce is a complicated tool for analysis purpose. Along with it, you need some programming language skills to explore data in MapReduce. Here comes the need for data query language like SQL for data extraction, analysis, and processing. 
Although tried and tested with big data front end, certainly SQL does not fit well with Hadoop data store. It is a general purpose [&hellip;]<\/p>\n","protected":false},"author":220,"featured_media":55590,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"default","adv-header-id-meta":"","stick-header-meta":"default","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"set","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[6],"tags":[146,147,853,866,878,1485],"class_list":["post-54217","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data","tag-apache-hive-career","tag-apache-hive-preparation","tag-hadoop-professional","tag-hdpca","tag-hive","tag-sql"],"uagb_featured_image_src":{"full":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",560,315,false],"thumbnail":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive-150x150.png",150,150,true],"medium":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive-300x169.png",300,169,true],"medium_large":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",560,315,false],"large":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",560,315,false],"1536x1536":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",560,315,false],"2048x2048":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",560,315,false],"profile_24":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",24,14,false],"profile_48":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",48,27,false],"profile_96":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",96,54,false],"profile_150":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",150,84,false],"profile_300":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",300,169,false],"tptn_thumbnail":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive-250x250.png",250,250,true],"web-stories-poster-po
rtrait":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",560,315,false],"web-stories-publisher-logo":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",96,54,false],"web-stories-thumbnail":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/01\/apachehive.png",150,84,false]},"uagb_author_info":{"display_name":"Aditi Malhotra","author_link":"https:\/\/www.whizlabs.com\/blog\/author\/aditi\/"},"uagb_comment_info":3,"uagb_excerpt":"Hadoop is a hot technology that primarily deals with petabytes of data for high-level analysis in enterprise applications. However, enterprises often work in a time-bound situation that requires fast analysis of collected data over a limited period. Hadoop MapReduce is a complicated tool for analysis purpose. Along with it, you need some programming language skills&hellip;","_links":{"self":[{"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/posts\/54217","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/users\/220"}],"replies":[{"embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/comments?post=54217"}],"version-history":[{"count":4,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/posts\/54217\/revisions"}],"predecessor-version":[{"id":96185,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/posts\/54217\/revisions\/96185"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/media\/55590"}],"wp:attachment":[{"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/media?parent=54217"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/categories?post=54217"},{"taxonomy":"p
ost_tag","embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/tags?post=54217"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}