Apache Beam Tutorial

Introduction to Apache Beam

Apache Beam is one of the top big data tools used for data management. Check out this Apache Beam tutorial to learn the basics of Apache Beam.

With the rising prominence of DevOps in the field of cloud computing, enterprises have to face many challenges. The management and maintenance of various technologies is a noticeable pain point for developers as well as enterprises. One of the prominent burdens on enterprises in the DevOps era is the management of Big Data. You can find many tools for managing big data, such as Apache Spark, Apache Flink, Apache Hadoop, Apache Beam, and many others.

Therefore, it is very easy to get lost in the search for an ideal tool for processing big data. If you want to avoid all ambiguities in selecting a reliable processing tool for big data, then Apache Beam could be the right choice for you. The following discussion takes you through a brief Apache Beam tutorial, explaining its definition, features, and basic concepts related to it.

Enroll Now: Apache Beam Basics Training Course

Why is Apache Beam Important?

Some of the initial questions that arise when you select a tool for big data management can include the following.

  • Which tool would be suitable for real-time streaming?
  • What are the available options for integrating different data sources?
  • Can the speed of one specific tool cope with your use case requirements?

Apache Beam offers a strong answer to these questions, and you will find plenty of reasons for that in this Apache Beam tutorial. The first pointer in our discussion is the definition of Apache Beam: it is an open-source, unified programming model for defining and executing both batch and streaming data processing pipelines.

Apache Beam is the culmination of a series of events that started with the Dataflow model of Google, which was tailored for processing huge volumes of data. The name Apache Beam itself signifies its role as a unified platform for batch and stream data processing (Batch + strEAM). Check out the Apache Beam documentation to learn more about Apache Beam.

In 2016, Google donated the Dataflow SDK to the Apache Software Foundation alongside a set of connectors for accessing Google Cloud Platform. The project entered the Apache Incubator, and Beam became a top-level Apache project in early 2017. Since then, the project has grown steadily in terms of both features and community.

Apache Beam provides software development kits (SDKs) for defining and developing data processing pipelines, alongside runners that execute them. Together these form a portable programming layer: Beam pipeline runners translate the data processing pipeline into the API of the backend the user prefers. Apache Beam currently supports the following distributed processing backends.

  • Apache Apex
  • Apache Flink
  • Apache Spark
  • Apache Gearpump
  • Apache Samza
  • Hazelcast Jet
  • Google Cloud Dataflow

Also Read: Top Real-time Data Streaming Tools

Important Concepts in Apache Beam

Now, let us reflect on some of the important concepts pertaining to Apache Beam in this Apache Beam tutorial. You would need basic knowledge of the following concepts to get started with Apache Beam.

  • Pipeline

A pipeline in Apache Beam encapsulates the entire data processing task you want to specify. You define all the components of the processing task within the scope of the pipeline. Most importantly, the pipeline also carries execution options that specify where and how it should run.
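As a minimal sketch (assuming the Python SDK and the bundled DirectRunner), a pipeline can be constructed and run like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Execution options decide where and how the pipeline runs.
options = PipelineOptions(runner='DirectRunner')

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create(['hello', 'beam'])
     | 'Print' >> beam.Map(print))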

  • PCollection

PCollection stands for a data set on which the pipeline operates in Apache Beam. The data set can be bounded or unbounded, depending on the source. A bounded data set comes from a fixed source such as a database table or a file. An unbounded data set, as the name implies, can keep growing, with new data arriving at any moment. PCollections serve as the inputs and outputs for every PTransform.
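For illustration (a hypothetical sketch using the Python SDK), a bounded PCollection can be created from an in-memory list or read from a fixed source such as a file:

import apache_beam as beam

with beam.Pipeline() as p:
    # Bounded PCollection built from an in-memory list
    words = p | 'CreateWords' >> beam.Create(['to', 'be', 'or', 'not', 'to', 'be'])
    # A bounded PCollection could also be read from a file:
    # lines = p | beam.io.ReadFromText('input.txt')
    words | 'Print' >> beam.Map(print)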

  • PTransform

A PTransform in Apache Beam defines a particular data processing operation. A PTransform takes one or more PCollections as input, performs a defined operation on every element, and returns zero or more PCollections as output. Beam provides built-in basic PTransforms such as the following:

  • ParDo
  • GroupByKey
  • CoGroupByKey
  • Combine
  • Flatten
  • Partition

Working with these built-in PTransforms helps you become familiar with how transforms are written; the sketch below illustrates the idea. Gradually, you can develop fluency in writing your own transforms for different processing operations.
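As a brief sketch (Python SDK assumed; the names here are only illustrative), a ParDo transform can be combined with CombinePerKey to count words:

import apache_beam as beam

class ExtractWords(beam.DoFn):
    # A ParDo DoFn: each input element may yield zero or more outputs.
    def process(self, element):
        for word in element.lower().split():
            yield (word, 1)

with beam.Pipeline() as p:
    (p
     | beam.Create(['the quick brown fox', 'the lazy dog'])
     | beam.ParDo(ExtractWords())
     | beam.CombinePerKey(sum)   # sum the ones for each word
     | beam.Map(print))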

Enhance your Big Data skills with the experts. Here is the Complete List of Big Data Blogs where you can find the latest news, trends, updates, and concepts of Big Data.

Apache Beam SDKs

The most important pointer answering the question “why use Apache Beam” is the Apache Beam SDKs. Beam SDKs provide a unified programming model capable of representing and transforming data sets of any size, whether the input data set is finite or infinite. Beam currently offers language-specific SDKs for Java, Python, and Go.

Pipeline Runners

As discussed above, pipeline runners are essential for the functioning of Apache Beam. They translate Apache Beam streaming and batch processing pipelines into the API of the distributed processing backend of your choice. When you run an Apache Beam program, you specify a runner appropriate for the backend that will execute your pipeline.

We have already outlined the distributed processing backends supported by pipeline runners in this Apache Beam tutorial above. Beam enables pipelines to remain portable across various runners. Although the capabilities of individual runners differ, all of them implement the core concepts of the Beam model.
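As a small sketch (Python SDK assumed), the runner is just another pipeline option, so the same pipeline code can target a different backend by changing only that option:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 'DirectRunner' executes locally; values such as 'FlinkRunner' or
# 'DataflowRunner' target their respective backends, provided they are set up.
options = PipelineOptions(runner='DirectRunner')

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)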

Now, you might wonder whether there are additional reasons to opt for Apache Beam beyond its unified handling of data streams. First of all, Apache Beam resolves the problems caused by the lack of a unified API that ties different frameworks and data sources together. In addition, it provides an abstraction layer for the application logic within the big data ecosystem.

This abstraction between the application logic and the underlying big data technology improves usability. The next important reason to use Apache Beam, as you may have noticed in this Apache Beam tutorial, is that you only have to write the application logic once. Just make sure not to mix runner-specific or input-specific parameters into that code.
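A common way to keep those parameters out of the application logic (a sketch following the usual Beam pattern; the flag names and file name here are only illustrative) is to read them from command-line flags, so the same script can be submitted to any runner:

import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    # Input- and runner-specific parameters arrive as flags, so the
    # pipeline logic itself stays unchanged across backends.
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', default='input.txt')
    known_args, pipeline_args = parser.parse_known_args(argv)
    options = PipelineOptions(pipeline_args)  # picks up --runner and other flags

    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromText(known_args.input)
         | beam.Map(print))

if __name__ == '__main__':
    run()

You could then invoke the script with, for example, --input=data.txt --runner=DirectRunner, and switch backends purely through the flag.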

Also Read: Why is Big Data Analytics so Important?

Example Code for Using Apache Beam

The next important step in an introduction to Apache Beam is to walk through an example, so that you know the basic approach for getting started. The example below is a pipeline written with the Python SDK that reads a text file and calculates the frequency of letters in the text. Here is the example code.

from __future__ import print_function

from string import ascii_lowercase

import apache_beam as beam


class CalculateFrequency(beam.DoFn):

    def process(self, element, total_characters):
        letter, counts = element
        yield letter, '{:.2%}'.format(counts / float(total_characters))


def run():
    with beam.Pipeline() as p:
        letters = (p | beam.io.ReadFromText('romeojuliet.txt')
                     | beam.FlatMap(lambda line: (ch for ch in line.lower()
                                                  if ch in ascii_lowercase))
                     | beam.Map(lambda x: (x, 1)))

        counts = (letters | beam.CombinePerKey(sum))

        total_characters = (letters | beam.MapTuple(lambda x, y: y)
                                    | beam.CombineGlobally(sum))

        (counts | beam.ParDo(CalculateFrequency(),
                             beam.pvalue.AsSingleton(total_characters))
                | beam.Map(lambda x: print(x)))


if __name__ == '__main__':
    run()

Now, a step-by-step evaluation of the above code can help us understand the basic use of Apache Beam.

letters = (p | beam.io.ReadFromText('romeojuliet.txt')
  • In this line, you specify the data source. The ReadFromText transform provides a PCollection as output that contains all lines from the file.
| beam.FlatMap(lambda line: (ch for ch in line.lower() if ch in ascii_lowercase))
  • In this step, Apache Beam processes all the lines and emits each English lowercase letter as a separate element.
| beam.Map(lambda x: (x, 1)))
  • This step maps each letter to a two-tuple of the letter and the number one. The Map transform is similar to FlatMap, but it returns exactly one element for every call.
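The difference between Map and FlatMap is easy to see in a small, hypothetical snippet (not part of the letter-frequency pipeline):

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.Create(['hello world'])
    # Map emits exactly one output per input element: here, one list per line.
    (lines
     | 'SplitAsList' >> beam.Map(lambda line: line.split())
     | 'PrintList' >> beam.Map(print))
    # FlatMap flattens the returned iterable: one output element per word.
    (lines
     | 'SplitAsWords' >> beam.FlatMap(lambda line: line.split())
     | 'PrintWords' >> beam.Map(print))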
counts = (letters | beam.CombinePerKey(sum))
  • This step combines all pairs that share the same key and sums the ones for each key. The results form the PCollection 'counts'.
total_characters = (letters | beam.MapTuple(lambda x, y: y) | beam.CombineGlobally(sum))
  • In the above line, the CombineGlobally transform applies sum across all elements of its input PCollection. Python's built-in sum accepts only numbers, so the preceding MapTuple step keeps only the count and discards the letter from each tuple.
(counts | beam.ParDo(CalculateFrequency(), beam.pvalue.AsSingleton(total_characters))
        | beam.Map(lambda x: print(x)))
  • The above step finally performs the frequency calculation. The transform takes the 'counts' PCollection as its main input and 'total_characters' as a side input via beam.pvalue.AsSingleton. For each 'counts' record, it divides the count by the total number of characters. As a result, you can see output like the following printed on the screen.
(u'n', '6.19%')
(u'o', '8.20%')
(u'l', '4.58%')
(u'm', '3.29%')
(u'j', '0.27%')
(u'k', '0.81%')
(u'h', '6.60%')
(u'i', '6.42%')
(u'f', '2.00%')
(u'g', '1.77%')
(u'd', '3.74%')
(u'e', '11.89%')
(u'b', '1.66%')
(u'c', '2.05%')
(u'a', '7.78%')
(u'z', '0.03%')
(u'x', '0.13%')
(u'y', '2.50%')
(u'v', '1.01%')
(u'w', '2.47%')
(u't', '9.12%')
(u'u', '3.42%')
(u'r', '6.20%')
(u's', '6.33%')
(u'p', '1.46%')
(u'q', '0.06%')

Preparing for a Big Data Interview? Prepare with these top Big Data Interview Questions and get ready to ace the interview.

Final Words

As this Apache Beam tutorial has shown, Apache Beam is a powerful unified tool that simplifies big data management and data processing. It integrates with various data processing engines and offers SDKs in multiple languages, giving enterprises excellent opportunities to boost their productivity.

However, Apache Beam is still evolving, and certain features are not yet supported on all runners. That said, the Beam community is working actively to address these gaps. Apart from the guidance in this tutorial, you should also explore the official Apache Beam documentation for an in-depth understanding.

Enroll now into the Apache Beam Basics Training Course and start learning more about Apache Beam right now!
