{"id":66485,"date":"2018-06-05T09:48:41","date_gmt":"2018-06-05T09:48:41","guid":{"rendered":"https:\/\/www.whizlabs.com\/blog\/?p=66485"},"modified":"2019-05-09T08:19:19","modified_gmt":"2019-05-09T08:19:19","slug":"spark-rdd","status":"publish","type":"post","link":"https:\/\/www.whizlabs.com\/blog\/spark-rdd\/","title":{"rendered":"What is Spark RDD and Why Do We Need it?"},"content":{"rendered":"<p class=\"p1\" style=\"text-align: justify;\"><span class=\"s1\">Over time, Big Data analysis has reached a new magnitude and, in turn, changed both its mode of operation and its expectations. Today\u2019s big data analysis not only deals with massive data but also targets fast turnaround times. Though Hadoop is the core technology behind big data analysis, it has some shortfalls concerning fast processing. However, with the entry of Spark into the Hadoop world, data processing speed has met most expectations.<\/span><\/p>\n<blockquote>\n<p style=\"text-align: justify;\">New to the world of Apache Spark? Let&#8217;s dive deep with this exclusive <a href=\"https:\/\/www.whizlabs.com\/blog\/learn-apache-spark\/\" target=\"_blank\" rel=\"noopener noreferrer\">Apache Spark Guide<\/a>!<\/p>\n<\/blockquote>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">Moreover, when we talk about Spark, the first term that comes to mind is the Resilient Distributed Dataset (RDD), or Spark RDD, which makes data processing faster. It is also the key feature of Spark that enables logical partitioning of datasets during computation.<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">In this blog, we will discuss the technical aspects of Spark RDD, which we believe will help you as a developer understand Spark RDD along with its underlying technical details. 
In addition, this blog will give you an overview of the use of RDD in Spark.<\/span><\/p>\n<h2 class=\"p4\" style=\"text-align: justify;\"><span class=\"s1\"><b>Spark RDD and its features<\/b><\/span><\/h2>\n<p class=\"p1\" style=\"text-align: justify;\"><span class=\"s1\">RDD stands for <b>Resilient Distributed Dataset<\/b>, where each term signifies one of its features. <\/span><\/p>\n<ul class=\"ul1\" style=\"text-align: justify;\">\n<li class=\"li1\"><span class=\"s1\"><strong><i>Resilient:<\/i><\/strong> it is fault tolerant, using the RDD lineage graph (DAG) to recompute lost partitions in case of node failure.<\/span><\/li>\n<li class=\"li1\"><span class=\"s1\"><strong><i>Distributed:<\/i><\/strong> the datasets of a Spark RDD reside on multiple nodes.<\/span><\/li>\n<li class=\"li1\"><span class=\"s1\"><strong><i>Dataset:<\/i><\/strong> the records of data that you will work with.<\/span><\/li>\n<\/ul>\n<p class=\"p1\" style=\"text-align: justify;\"><span class=\"s1\">Designing such an abstraction in Hadoop is a challenge. However, with Spark RDD the solution is very effective due to its lazy evaluation<\/span><span class=\"s4\">. <\/span><span class=\"s1\">RDDs in Spark work on an on-demand basis. 
Hence, they save a lot of data processing time and improve the efficiency of the whole process.<\/span><\/p>\n<p class=\"p1\" style=\"text-align: justify;\"><span class=\"s1\">Hadoop MapReduce has many shortcomings that Spark RDD overcomes through its features, and this is the main reason for Spark RDD\u2019s popularity.<\/span><\/p>\n<h4 class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\"><b>Spark RDD Core Features in a Nutshell<\/b><\/span><\/h4>\n<ul class=\"ul2\" style=\"text-align: justify;\">\n<li class=\"li2\"><span class=\"s1\">In-memory Computation<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Lazy Evaluation<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Fault Tolerance<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Immutability<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Partitioning<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Persistence<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Coarse-grained Operations<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Location-Stickiness<\/span><\/li>\n<\/ul>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">We will discuss these points in the next sections.<\/span><\/p>\n<blockquote>\n<p style=\"text-align: justify;\"><b style=\"color: #222222; font-family: 'Open Sans', arial, sans-serif; font-size: 27px; letter-spacing: -0.02em;\">Understanding Spark RDD Technical Features<\/b><\/p>\n<\/blockquote>\n<p class=\"p1\" style=\"text-align: justify;\"><span class=\"s1\">A Spark RDD represents a dataset distributed across multiple nodes that can be operated on in parallel. In other words, the RDD is the main fault-tolerant abstraction of Apache Spark and also its fundamental data structure. 
The RDD in Spark is an immutable distributed collection of objects that supports data caching through two methods \u2013<\/span><\/p>\n<ul class=\"ul2\" style=\"text-align: justify;\">\n<li class=\"li1\"><span class=\"s1\">cache()<\/span><\/li>\n<li class=\"li1\"><span class=\"s1\">persist()<\/span><\/li>\n<\/ul>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">The in-memory caching technique of Spark RDD works on the logical partitions of a dataset. The beauty of in-memory caching is that if the data doesn\u2019t fit in memory, Spark spills the excess partitions to disk or recomputes them when needed. This is part of why it is called resilient. As a result, you can access an RDD in Spark as and when you require it, which makes overall data processing faster.<\/span><\/p>\n<blockquote>\n<p style=\"text-align: justify;\"><span class=\"s1\"><i>Spark can be up to 100 times faster than Hadoop in terms of data processing. Here are the <a href=\"https:\/\/www.whizlabs.com\/blog\/why-is-apache-spark-faster\/\" target=\"_blank\" rel=\"noopener noreferrer\">factors that make Apache Spark faster<\/a>!<\/i><\/span><\/p>\n<\/blockquote>\n<h2 class=\"p5\" style=\"text-align: justify;\"><span class=\"s1\"><b>Operations Supported by Spark RDD<\/b><\/span><\/h2>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">RDD in Spark supports two types of operations: <\/span><\/p>\n<ol class=\"ol1\" style=\"text-align: justify;\">\n<li class=\"li2\"><span class=\"s1\">Transformations<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Actions<\/span><\/li>\n<\/ol>\n<h4 class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\"><b>Transformation<\/b><\/span><\/h4>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">In the case of a <i>transformation,<\/i> Spark creates a new dataset from an existing one. 
As a Spark RDD example of a <i>transformation<\/i>, <\/span><span class=\"s7\">map<\/span><span class=\"s1\"> passes each dataset element through a function and returns a new RDD that represents the result.<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">The programmatic view of the above example would be:<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">In Scala:<\/span><\/p>\n<table class=\"t1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td class=\"td1\" valign=\"top\">\n<p class=\"p7\"><span class=\"s1\">val l = sc.textFile(&quot;example.txt&quot;)<\/span><\/p>\n<p class=\"p7\"><span class=\"s1\">val lLengths = l.map(s =&gt; s.length)<\/span><\/p>\n<p class=\"p7\"><span class=\"s1\">val totalLength = lLengths.reduce((a, b) =&gt; a + b)<\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p class=\"p1\" style=\"text-align: justify;\"><span class=\"s1\">Now, if you want to use lLengths later, you can use the <i>persist()<\/i> method as below:<\/span><\/p>\n<table class=\"t1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td class=\"td2\" valign=\"top\">\n<p class=\"p9\"><span class=\"s1\">lLengths.
persist()<\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p class=\"p10\" style=\"text-align: justify;\"><span class=\"s1\">You can refer to the API docs for the detailed list of transformations supported by Spark RDD at <a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\"><span class=\"s2\">https:\/\/spark.apache.org\/.<\/span><\/a><\/span><\/p>\n<p class=\"p10\" style=\"text-align: justify;\"><span class=\"s1\">There are two types of transformations supported by Spark RDD:<\/span><\/p>\n<ol class=\"ol1\" style=\"text-align: justify;\">\n<li class=\"li10\"><span class=\"s1\">Narrow transformation<\/span><\/li>\n<li class=\"li10\"><span class=\"s1\">Wide transformation<\/span><\/li>\n<\/ol>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">In a narrow transformation, each partition of the output RDD depends on a single partition of the parent RDD. In a wide transformation, each output partition may be the result of many parent RDD partitions; this is also known as a <i>shuffle<\/i> transformation.<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">All Spark RDD transformations are <i>lazy<\/i>, as they do not compute their results right away. Instead, they remember the transformations applied to some base dataset (e.g., the file in the example above). Only when an action requires a result are the transformations computed in Spark RDD. This, in turn, results in faster and more efficient data processing.<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">By default, each transformed RDD is recomputed every time you run an action on it. However, with the <\/span><span class=\"s5\"><i>persist<\/i><\/span><span class=\"s1\"> method, Spark can keep the elements around on the cluster for much faster access the next time you query it. 
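<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">A minimal sketch of this caching behaviour (it assumes an existing SparkContext named sc; the dataset and the chosen storage level are illustrative, not prescriptive):<\/span><\/p>\n
```scala
// Assumes a live SparkContext `sc`; MEMORY_AND_DISK is one illustrative choice.
import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000000)        // base RDD, nothing computed yet
val squares = nums.map(n => n.toLong * n)      // lazy transformation

squares.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if it does not fit in memory
val total = squares.reduce(_ + _)              // first action: computes and caches
val count = squares.count()                    // second action: reuses the cached partitions
```
\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">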
There is also support for persisting Spark RDDs on disk, or replicating them across multiple nodes.<\/span><\/p>\n<h4 class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\"><b>Actions<\/b><\/span><\/h4>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">During <\/span><span class=\"s5\"><i>actions<\/i><\/span><span class=\"s1\">, the RDD returns a value to the driver program after performing a computation on the dataset. For example,<\/span><span class=\"s7\"> reduce<\/span><span class=\"s1\"> is an action that aggregates all the RDD elements using some function and returns the final result to the driver program.<\/span><\/p>\n<h2 class=\"p5\" style=\"text-align: justify;\"><span class=\"s1\"><b>How to Create RDD in Spark?<\/b><\/span><\/h2>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">There are three ways to create a Spark RDD:<\/span><\/p>\n<ol class=\"ol1\" style=\"text-align: justify;\">\n<li class=\"li2\"><span class=\"s1\">Using parallelized collections<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">From external datasets (i.e., external storage systems such as a shared file system, HBase, or HDFS)<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">From existing Apache Spark RDDs<\/span><\/li>\n<\/ol>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">Next, we will discuss each of these methods to see how they are used to create Spark RDDs.<\/span><\/p>\n<blockquote>\n<p style=\"text-align: justify;\"><span class=\"s1\">The Resilient Distributed Dataset (<\/span>RDD) is the key feature that makes Apache Spark important. 
Let&#8217;s understand the <a href=\"https:\/\/www.whizlabs.com\/blog\/importance-of-apache-spark\/\" target=\"_blank\" rel=\"noopener noreferrer\">importance of Apache Spark<\/a> in the Big Data industry.<\/p>\n<\/blockquote>\n<h4 class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\"><b>Parallelized Collections<\/b><\/span><\/h4>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">You can create parallelized collections by calling the parallelize method of the SparkContext interface<\/span> <span class=\"s1\">on an existing collection in your driver program in Java, Scala, or Python. The collection elements are copied to form a distributed dataset that can be operated on in parallel. <\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">Spark RDD example of parallelized collections in Scala:<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">To hold the numbers 2 to 6 as a parallelized collection: <\/span><\/p>\n<table class=\"t1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td class=\"td3\" valign=\"top\">\n<p class=\"p12\"><span class=\"s1\">val collection = Array(2, 3, 4, 5, 6)<\/span><\/p>\n<p class=\"p12\"><span class=\"s1\">val prData = spark.sparkContext.parallelize(collection)<\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">Here, the created distributed dataset prData can be operated on in parallel. Hence, you can call prData.reduce((a, b) =&gt; a + b) to add up the elements of the array.<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">One of the key parameters for parallelized collections is the number of <\/span><span class=\"s5\"><i>partitions<\/i><\/span><span class=\"s1\"> to cut the dataset into. Spark runs one task for each partition of the cluster. Typically, 2-4 partitions for each CPU in your cluster are ideal. 
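<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">As a short sketch of choosing the partition count (it assumes an existing SparkContext named sc; the range and the count of 8 are illustrative):<\/span><\/p>\n
```scala
// Assumes a live SparkContext `sc`; the data and partition count are illustrative.
val data = 1 to 100
val rddDefault = sc.parallelize(data)      // partition count chosen by Spark
val rddFixed = sc.parallelize(data, 8)     // second argument requests 8 partitions
println(rddFixed.getNumPartitions)         // prints 8
```
\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">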
Spark automatically sets the number of partitions based on the cluster; however, users can also set it manually by passing it as the second parameter to parallelize.<\/span><\/p>\n<h4 class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\"><b>External Datasets<\/b><\/span><\/h4>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">Apache Spark can create distributed datasets from any Hadoop-supported file storage, which may include:<\/span><\/p>\n<ul class=\"ul2\" style=\"text-align: justify;\">\n<li class=\"li2\"><span class=\"s1\">Local file system<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">HDFS<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Cassandra<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">HBase<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Amazon S3<\/span><\/li>\n<\/ul>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">Spark supports file formats such as: <\/span><\/p>\n<ul class=\"ul2\" style=\"text-align: justify;\">\n<li class=\"li2\"><span class=\"s1\">Text files<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Sequence Files<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">CSV<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">JSON<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">Any Hadoop Input Format<\/span><\/li>\n<\/ul>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">For example, you can create text file RDDs by using the textFile method of the SparkContext interface. This method takes the URI of the file (either a local path on the system or an hdfs:\/\/ URI, etc.) and reads the file as a collection of lines. <\/span><\/p>\n<blockquote>\n<p style=\"text-align: justify;\">Seeking a better career in Apache Spark? 
Choose one of the 5 best <a href=\"https:\/\/www.whizlabs.com\/blog\/5-best-apache-spark-certification\/\" target=\"_blank\" rel=\"noopener noreferrer\">Apache Spark Certifications<\/a> to boost your career.<\/p>\n<\/blockquote>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">An important point here: if you are using a path on the local file system, the file must be accessible at the same path on the worker nodes. Hence, you either have to copy the data file to all worker nodes or use a network-mounted shared file system.<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">You can also use the DataFrameReader interface to load external datasets and then use the <i>.rdd<\/i> method to convert the resulting Dataset&lt;Row&gt; into an RDD&lt;Row&gt;.<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">Let&#8217;s see the below example, where a text file is read as a dataset of strings and then converted to an RDD.<\/span><\/p>\n<table class=\"t1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td class=\"td4\" valign=\"top\">\n<p class=\"p13\"><span class=\"s1\">val exDataRDD = spark.read.textFile(&quot;path\/of\/text\/file&quot;).rdd<\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h4 class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\"><b>From Existing RDDs<\/b><\/span><\/h4>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">An RDD is immutable; hence, you can&#8217;t change it. However, using a transformation, you can create a new RDD from an existing one. Since no in-place mutation takes place, consistency is maintained over the cluster. 
A few of the transformations used for this purpose are:<\/span><\/p>\n<ul class=\"ul1\" style=\"text-align: justify;\">\n<li class=\"li2\"><span class=\"s1\">map<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">filter<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">distinct<\/span><\/li>\n<li class=\"li2\"><span class=\"s1\">flatMap<\/span><\/li>\n<\/ul>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">Example:<\/span><\/p>\n<table class=\"t1\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td class=\"td5\" valign=\"top\">\n<p class=\"p14\"><span class=\"s1\">val seasons = spark.sparkContext.parallelize(Seq(&quot;summer&quot;, &quot;monsoon&quot;, &quot;spring&quot;, &quot;winter&quot;))<\/span><\/p>\n<p class=\"p14\"><span class=\"s1\">val seasons1 = seasons.map(s =&gt; (s.charAt(0), s))<\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h4 class=\"p2\" style=\"text-align: justify;\"><span class=\"s10\"><b>Conclusion<\/b><\/span><\/h4>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">If you are an aspiring candidate preparing for the\u00a0<a href=\"https:\/\/www.whizlabs.com\/spark-developer-certification\/\" target=\"_blank\" rel=\"noopener noreferrer\"><span class=\"s2\">Hortonworks Spark developer certification (HDPCD)<\/span><\/a>, then covering all the above RDD features is a must for both the theoretical and practical aspects of the certification.<\/span><\/p>\n<p class=\"p2\" style=\"text-align: justify;\"><span class=\"s1\">Whizlabs offers complete coverage of RDD features for the certification exam with its training videos and study materials. The training guide covers the programming aspects mostly in Scala. So, join the course today and get yourself acquainted with Spark RDD.<\/span><\/p>\n<p><em><strong>Have any question\/suggestion? 
Just mention below in the comment box or write <a href=\"https:\/\/help.whizlabs.com\/hc\/en-us\/requests\/new\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>, we&#8217;ll be happy to answer!<\/strong><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Over the time, Big Data analysis has reached a new magnitude which, in turn, has changed its mode of operation and expectation as well. Today\u2019s big data analysis is not only dealing with massive data but also with a set target of fast turnaround time. Though Hadoop is the unbeatable technology behind the big data analysis, it has some shortfalls concerning fast processing. However, with the entry of Spark in the Hadoop world, data processing speed has met up most expectations. New to the world of Apache Spark? Let&#8217;s dive deep with this exclusive Apache Spark Guide! Moreover, when we [&hellip;]<\/p>\n","protected":false},"author":220,"featured_media":66529,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[6],"tags":[898,1358,1479,1574,1608],"class_list":["post-66485","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data","tag-how-to-create-rdd-in-spark","tag-rdd-in-spark","tag-spark-rdd-example","tag-understanding-spark-rdd","tag-what-is-rdd"],"uagb_featured_image_src":{"full":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",640,315,false],"thumbnail":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd-150x150.png",150,150,true],"medium":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd-300x148.png",300,148,true],"medium_large":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",640,315,false],"large":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",640,315,false],"1536x1536":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",640,315,false],"2048x2048":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",640,315,false],"profile_24":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",24,12,false],"profile_48":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",48,24,false],"profile_96":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",96,47,false],"profile_150":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",150,74,false],"profile_300":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",300,148,false],"tptn_thumbnail":["https:\/\/www.whizlabs.com\/blog\/wp-content
\/uploads\/2018\/06\/use-of-spark-rdd-250x250.png",250,250,true],"web-stories-poster-portrait":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",640,315,false],"web-stories-publisher-logo":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",96,47,false],"web-stories-thumbnail":["https:\/\/www.whizlabs.com\/blog\/wp-content\/uploads\/2018\/06\/use-of-spark-rdd.png",150,74,false]},"uagb_author_info":{"display_name":"Aditi Malhotra","author_link":"https:\/\/www.whizlabs.com\/blog\/author\/aditi\/"},"uagb_comment_info":4,"uagb_excerpt":"Over the time, Big Data analysis has reached a new magnitude which, in turn, has changed its mode of operation and expectation as well. Today\u2019s big data analysis is not only dealing with massive data but also with a set target of fast turnaround time. Though Hadoop is the unbeatable technology behind the big data&hellip;","_links":{"self":[{"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/posts\/66485","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/users\/220"}],"replies":[{"embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/comments?post=66485"}],"version-history":[{"count":1,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/posts\/66485\/revisions"}],"predecessor-version":[{"id":71894,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/posts\/66485\/revisions\/71894"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/media\/66529"}],"wp:attachment":[{"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/media?parent=66485"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog
\/wp-json\/wp\/v2\/categories?post=66485"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.whizlabs.com\/blog\/wp-json\/wp\/v2\/tags?post=66485"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}