
25 Free Questions on AWS Data Analytics Specialty 

If you’re considering pursuing the AWS Data Analytics Specialty certification, you’re in the right place! I have compiled a list of 25 free questions that will help you test your knowledge and prepare for the exam.

If you’re considering a career in AWS data analytics, you need to be comfortable with statistical analysis, data visualization, and machine learning. You also need to have strong problem-solving skills and be able to think creatively.

What does an AWS Data Analyst do?

AWS Data Analytics Specialists are responsible for managing and analyzing data on the AWS platform. They work with data from a variety of sources, including relational databases, NoSQL databases, and streaming data.

AWS Data Analytics Specialists use a variety of tools and techniques to gain insights into data and to make recommendations to businesses.

What to expect in AWS Data Analytics Specialty exam?

If you’re considering taking the AWS Data Analytics Specialty exam, here’s what you can expect.

The exam is divided into five sections: collection, storage and data management, processing, analysis and visualization, and security. Each section carries a different weight.


To pass the exam, you’ll need to demonstrate your knowledge and skills in collection systems, as well as in analytics and visualization. The processing section will test your ability to build data processing solutions. The analysis and visualization section will test your ability to use data to generate insights and to present those insights in a way that is easy to understand.

To prepare for the AWS Data Analytics Specialty exam, it is recommended that you have experience working with AWS data analytics services, such as Amazon Kinesis, Amazon Redshift, and Amazon Athena. You should also be familiar with common data analysis techniques, such as regression, classification, and clustering.

The AWS Data Analytics Specialty exam is challenging, but if you prepare properly, for example by taking an AWS Data Analytics Specialty practice exam, you can pass it with flying colors.

Let us start learning through these free AWS Data Analytics Specialty exam questions and answers!

Domain : Security

Question 1 : You work as a data engineer for an international banking firm where you are responsible for building a Redshift data warehouse to allow the bank management team to produce business insights via dashboards and reports. Much of the data stored in your Redshift data warehouse is highly confidential, for example Personally Identifiable Information (PII). Also, some of the data needed to produce the management insights is stored in S3 and accessed via Redshift Spectrum. To achieve the highest level of security, the Glue data catalog used by Redshift Spectrum to access your tables on S3 is encrypted. 
What must you do to gain access to the S3 tables via Redshift Spectrum?

A. Use the KMS key for Redshift to access the Glue data catalog
B. Nothing, Redshift and Redshift Spectrum can access the data in S3 via the Glue data catalog regardless of whether the Glue data catalog is encrypted or not
C. Create a KMS key for Redshift Spectrum and use it to access the Glue data catalog
D. Use the KMS key for Glue to access the Glue data catalog

Correct Answer: D

Explanation:

Option A is incorrect. If the Glue catalog is encrypted, you need the KMS key for Glue to access the Glue data catalog. A key associated with Redshift will not allow you to access the encrypted Glue data catalog.

Option B is incorrect. If the Glue catalog is encrypted, you need the KMS key for Glue to access the Glue data catalog.

Option C is incorrect.  If the Glue catalog is encrypted, you need the KMS key for Glue to access the Glue data catalog. A key associated with Redshift will not allow you to access the encrypted Glue data catalog.

Option D is correct. If the Glue catalog is encrypted, you need the KMS key for Glue to access the Glue data catalog.
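
One way to picture what "using the KMS key for Glue" can look like in practice is an IAM policy that lets the role used by Redshift Spectrum decrypt with that key. The sketch below is an illustration only: the key ARN is a placeholder, the function name is hypothetical, and it assumes that `kms:Decrypt` on the catalog key is the permission needed to read the encrypted catalog.

```python
import json


def glue_catalog_key_policy(kms_key_arn):
    """Build an IAM policy document allowing decryption with the Glue catalog key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                # The key that encrypts the Glue data catalog, not a Redshift key.
                "Resource": kms_key_arn,
            }
        ],
    }


# Placeholder ARN for illustration; substitute the ARN of the Glue catalog key.
policy = glue_catalog_key_policy(
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
)
print(json.dumps(policy, indent=2))
```

The resulting document would be attached to the IAM role that Redshift Spectrum assumes when it reads the catalog.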

Reference: Please see the Amazon Redshift database developer guide titled Querying external data using Amazon Redshift Spectrum (https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html)  

 

Domain : Processing

Question 2 : You work as a data engineer for a financial services company that receives near real-time streaming data for stock and derivative security master data, including prices, symbols, contract information, etc. You receive these data feeds from several data providers. You have set up Kinesis Data Firehose delivery streams to receive the data from your providers. Currently, your delivery streams are configured to send the streaming data to an S3 destination.
Your management team wants you to change the delivery stream destination for one of your feeds from S3 to Redshift. How would you do this in the least disruptive manner?

A. Use the StopDeliveryStream API call to stop the delivery stream, then change the destination to Redshift using the UpdateDestination API call, then use the StartDeliveryStream API to restart the delivery stream.
B. Use the UpdateDestination API call to change the destination from S3 to Redshift. The target delivery stream remains active while the configuration is updated; data writes to the delivery stream can continue during the change. The updated configuration completes within a few minutes.
C. Create a new delivery stream using the CreateDeliveryStream API call that has Redshift as its destination, use the StopDeliveryStream API call to stop the delivery stream that writes to S3, and start the new delivery stream using the StartDeliveryStream API call.
D. Use the ChangeDeliveryStream API call to change the destination from S3 to Redshift. The target delivery stream remains active while the configuration is updated; data writes to the delivery stream can continue during the change. The updated configuration completes within a few minutes.

Correct Answer: B

Explanation:

Option A is incorrect. There is no StopDeliveryStream API call. Also, there is no StartDeliveryStream API call. 

Option B is correct. You can change the delivery stream destination without interrupting the flow of data through the delivery stream by using the UpdateDestination API call.

Option C is incorrect.  There is no StopDeliveryStream or StartDeliveryStream API call.

Option D is incorrect. There is no ChangeDeliveryStream API call.
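
As a rough illustration of Option B, the sketch below calls UpdateDestination through a boto3-style Firehose client. The function name, stream name, and the contents of `redshift_update` are assumptions made for the example; the client is passed in so the logic can be exercised without AWS credentials.

```python
def switch_destination_to_redshift(firehose, stream_name, redshift_update):
    """Repoint an existing delivery stream at Redshift without stopping it."""
    # UpdateDestination requires the stream's current version ID and the ID
    # of the destination being changed, so look both up first.
    description = firehose.describe_delivery_stream(
        DeliveryStreamName=stream_name
    )["DeliveryStreamDescription"]

    return firehose.update_destination(
        DeliveryStreamName=stream_name,
        CurrentDeliveryStreamVersionId=description["VersionId"],
        DestinationId=description["Destinations"][0]["DestinationId"],
        RedshiftDestinationUpdate=redshift_update,
    )


# Example call (all values are placeholders):
#
#   import boto3
#   switch_destination_to_redshift(
#       boto3.client("firehose"),
#       "security-master-feed",
#       {
#           "RoleARN": "arn:aws:iam::111122223333:role/example-firehose-role",
#           "ClusterJDBCURL": "jdbc:redshift://example-host:5439/exampledb",
#           "CopyCommand": {"DataTableName": "security_master"},
#           "Username": "example_user",
#           "Password": "example_password",
#       },
#   )
```

Producers keep writing to the same stream name throughout, which is why no stop or start step appears in the sketch.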


References: Please see the Amazon Kinesis Data Firehose developer guide titled Creating an Amazon Kinesis Data Firehose Delivery Stream (https://docs.aws.amazon.com/firehose/latest/dev/basic-create.html), and the Amazon Kinesis Data Firehose API reference titled UpdateDestination (https://docs.aws.amazon.com/firehose/latest/APIReference/API_UpdateDestination.html), and the Amazon Kinesis Data Firehose API reference titled Actions (https://docs.aws.amazon.com/firehose/latest/APIReference/API_Operations.html)

 

Domain : Collection

Question 3 : You work as a data engineer for an online retailer that wishes to capture customer clickstream activity for its website and mobile platforms. Your marketing department plans to gain insights from the clickstream data through the use of your data warehouse. You have built a Kinesis Data Streams pipeline that streams your data to Redshift through the use of a Kinesis Producer Library (KPL) application and a Kinesis Client Library (KCL) application that uses the Kinesis Connector Library to write the clickstream data to Redshift.
As your KPL code writes your clickstream data to your Kinesis data stream, you need to monitor the load to ensure it is distributed evenly across the fleet of EC2 instances running your KPL application. How can you most efficiently monitor the load distribution across your KPL fleet of EC2 instances?

A. Use the CloudWatch metrics published by the KPL with a metric level of DETAILED and a granularity of STREAM to monitor your load distribution
B. Use the CloudWatch metrics published by the KPL with a metric level of SUMMARY and a granularity of SHARD to monitor your load distribution
C. Use the CloudWatch metrics published by the KPL with a metric level of DETAILED and a granularity of SHARD to monitor your load distribution; add the EC2 hostname as a dimension
D. Use the CloudWatch metrics published by the KPL with the default metric level and granularity

Correct Answer: C

Explanation:

Option A is incorrect. The granularity level of STREAM is not a granular enough level of metric to efficiently monitor for uneven distribution across your EC2 fleet. You need to use fine-grained metrics so that you can capture an identifier, like the hostname of the KPL instance, to allow you to identify an uneven load distribution across your fleet.

Option B is incorrect. The metric level of SUMMARY will not send granular-level metrics to CloudWatch. You need granular-level metrics to efficiently monitor for uneven distribution across your EC2 fleet. You need to use fine-grained metrics so that you can capture an identifier, like the hostname of the KPL instance, to allow you to identify an uneven load distribution across your fleet. 

Option C is correct. The DETAILED metric level and the SHARD granularity give you fine-grained metrics. Adding an identifier, such as the hostname of the KPL instance, as a dimension on your CloudWatch metrics lets you see how the load is distributed and identify an uneven distribution across your fleet.

Option D is incorrect. The default metric level (DETAILED) and granularity (SHARD) give you the level of granularity needed to monitor load distribution. However, adding the hostname as a dimension to your CloudWatch metrics is a more efficient way to identify uneven load distribution.
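
To make the value of the hostname dimension concrete, here is a small sketch of the check it enables. It assumes the per-host record counts have already been retrieved from CloudWatch; the hostnames, the counts, the 25% tolerance, and the function name are all hypothetical.

```python
from statistics import median


def find_uneven_hosts(records_put_by_host, tolerance=0.25):
    """Return hosts whose record count differs from the fleet median
    by more than `tolerance` (expressed as a fraction of the median)."""
    typical = median(records_put_by_host.values())
    return sorted(
        host
        for host, count in records_put_by_host.items()
        if abs(count - typical) > tolerance * typical
    )


# Hypothetical per-host totals for one monitoring period.
sample = {"kpl-host-1": 1000, "kpl-host-2": 1050, "kpl-host-3": 400}
print(find_uneven_hosts(sample))  # ['kpl-host-3']
```

Without a per-host dimension the three counts would be merged into one figure per shard, and the lagging host would not be visible.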

References: Please see the Amazon Kinesis Data Streams developer guide titled Developing Producers Using the Amazon Kinesis Producer Library (https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html), and the Amazon Kinesis Data Streams developer guide titled Writing Data to Amazon Kinesis Data Stream (https://docs.aws.amazon.com/streams/latest/dev/building-producers.html), and the Amazon Kinesis Data Streams developer guide titled Monitoring Amazon Kinesis Data Streams (https://docs.aws.amazon.com/streams/latest/dev/monitoring.html), and the Amazon Kinesis Data Streams developer guide titled Monitoring the Kinesis Producer Library with Amazon CloudWatch (https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-kpl.html), and the Amazon Kinesis Data Streams developer guide titled Using the Kinesis Client Library (https://docs.aws.amazon.com/streams/latest/dev/shared-throughput-kcl-consumers.html), and the Amazon Kinesis Data Streams page titled Getting started with Amazon Kinesis Data Streams (https://aws.amazon.com/kinesis/data-streams/getting-started/#:~:text=Amazon%20Kinesis%20Client%20Library%20(KCL,S3%2C%20and%20Amazon%20Elasticsearch%20Service.) 

 

Domain : Security

Question 4 : You work as a data engineer for a government agency that compiles data on the Gross Domestic Product (GDP) for the country. To facilitate the building of the data lake that houses the GDP data, your team is responsible for managing the configuration and deployment of all EMR clusters used by your government analysts. You are responsible for centralizing governance and compliance requirements, and providing a common set of policies on how EMR instances should be set up. Your goal is to enable your analysts to be able to quickly deploy only your agency’s approved EMR cluster configurations on a self-service basis while staying within the governance and compliance requirements of your agency.
What is the most efficient way to implement your EMR cluster management system?

A. Create a set of CloudFormation templates, one for each configuration, to be used as self-service deployment configurations
B. Use AWS Systems Manager to create a portfolio of products used by your analysts to provision the products needed to build their EMR clusters.
C. Use AWS OpsWorks and build a Puppet master server to create a portfolio of products used by your analysts to provision the products needed to build their EMR clusters.
D. Use AWS Service Catalog to create a portfolio of products used by your analysts to provision the products needed to build their EMR clusters.

Correct Answer: D

Explanation:

Option A is incorrect. This approach is very inefficient. Your team would have to write and maintain all of the templates. Using AWS Service Catalog to create and manage your deployment configurations as products is much more efficient.

Option B is incorrect. AWS Systems Manager is not used to create portfolios of products for systematic distribution; AWS Service Catalog is used for this purpose.

Option C is incorrect. AWS OpsWorks Puppet gives you a set of tools for enforcing the desired state of your infrastructure, and automating on-demand tasks. However, it would be far more time consuming to use this approach than using AWS Service Catalog.

Option D is correct. You can use AWS Service Catalog to centrally manage your analysts’ commonly deployed EMR cluster configurations. This approach helps you achieve consistent governance and meet your compliance requirements, while at the same time enabling your analysts to quickly deploy only the approved EMR cluster configurations on a self-service basis.

References: Please see the AWS Big Data blog titled Build a self-service environment for each line of business using Amazon EMR and AWS Service Catalog (https://aws.amazon.com/blogs/big-data/build-a-self-service-environment-for-each-line-of-business-using-amazon-emr-and-aws-service-catalog/), and the AWS OpsWorks user guide titled What Is AWS OpsWorks? (https://docs.aws.amazon.com/opsworks/latest/userguide/welcome.html)

 

Domain : Security

Question 5 : You work as a data engineer for an automobile manufacturer. Your company is building a data lake using EMR as the big data platform to enable company analysts to run petabyte-scale analysis on the company’s car sales and manufacturing data. For your analysts’ access to your EMR cluster nodes, you need to provide strong authentication so that passwords or other credentials aren’t sent over the network in an unencrypted format, so you have chosen to use Kerberos authentication. You also need to allow your analysts to connect to your EMR cluster nodes, but you do not want them to use an EC2 private key file when connecting to your EMR cluster.
Which Kerberos architecture options allow you to meet your security requirements? (Select TWO)

A. Cluster-dedicated KDC (KDC on master node)
B. Cross-realm trust
C. External KDC – MIT KDC
D. External KDC – master node on a different cluster
E. External KDC – cluster KDC on a different cluster with Active Directory cross-realm trust

Correct Answers: B and E

Explanation:

Option A is incorrect. With this architecture your analysts would have to use an EC2 private key file and kinit credentials to connect to the cluster.

Option B is correct. Cross-realm trusts are most commonly implemented using Active Directory. With this architecture, if your analysts are in your Active Directory domain, they can use kinit credentials to access your clusters that are protected via Kerberos, without using an EC2 private key file.

Option C is incorrect. With this architecture your analysts would have to use an EC2 private key file and kinit credentials to connect to the cluster.

Option D is incorrect. With this architecture your analysts would have to use an EC2 private key file and kinit credentials to connect to the cluster.

Option E is correct. An Active Directory cross-realm trust is implemented using Active Directory. With this architecture analysts in the Active Directory domain can access your Kerberized clusters using kinit credentials, without the EC2 private key file.

References: Please see the Amazon EMR management guide titled Kerberos architecture options (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos-options.html), and the Amazon EMR management guide titled Use Kerberos authentication (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos.html)  

 

Domain : Analysis and Visualization

Question 6 : You work as a data scientist for a data analytics firm. Your firm collects data about product usage, consumer behavior, global supply chains, etc. As you gather the data from your sources, you need to transform and aggregate the data for use by your clients. You have written AWS Glue jobs to perform the transformation and aggregation. You need to gather metrics from your Glue jobs to ensure they are performing as expected by tracking runtime metrics such as bytes read and written, memory usage and CPU load of the driver and executors, and data shuffles among executors. How do you enable the gathering of Glue metrics?

A. Enable the job metrics option in your Glue job definition, resulting in the job script initializing a GlueContext class
B. Enable the job metrics option in your Glue job definition, resulting in the job script initializing a GlueTransform class
C. Enable the job metrics option in your Glue job definition, resulting in the job script initializing a DynamicFrame class
D. Enable the job metrics option in your Glue job definition, resulting in the job script initializing a GlueMetrics class

Correct Answer: A

Explanation:

Option A is correct. When you enable job metrics in your Glue job definition, the job script initializes a GlueContext class which is then used to initialize the Spark session.

Option B is incorrect. The GlueTransform class is used to transform data, not to gather metrics.

Option C is incorrect. The DynamicFrame class is used to manipulate your data in a dataframe.

Option D is incorrect. There is no GlueMetrics class.
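
For readers who define Glue jobs through the API rather than the console, the sketch below shows one way the job metrics option might be switched on, using the `--enable-metrics` special job parameter in the job's default arguments. The job name, role ARN, script path, and function name are placeholders chosen for the example.

```python
def build_glue_job_definition(name, role_arn, script_location):
    """Build keyword arguments for a Glue CreateJob call with job metrics on."""
    return {
        "Name": name,
        "Role": role_arn,
        # "glueetl" is the command name for an Apache Spark ETL job.
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        # Special job parameter that turns on job metrics collection.
        "DefaultArguments": {"--enable-metrics": "true"},
    }


# Placeholder values for illustration only.
job_definition = build_glue_job_definition(
    "aggregate-usage-data",
    "arn:aws:iam::111122223333:role/example-glue-role",
    "s3://example-bucket/scripts/aggregate.py",
)

# With boto3 available, the definition could be submitted like this:
#   import boto3
#   boto3.client("glue").create_job(**job_definition)
#
# Inside the job script itself, the GlueContext is initialized like this:
#   from pyspark.context import SparkContext
#   from awsglue.context import GlueContext
#   glue_context = GlueContext(SparkContext.getOrCreate())
#   spark = glue_context.spark_session
```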

References: Please see the AWS Glue developer guide titled Monitoring AWS Glue Using Amazon CloudWatch Metrics (https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html), and the AWS announcement titled AWS Glue now provides additional ETL job metrics (https://aws.amazon.com/about-aws/whats-new/2018/07/aws-glue-now-provides-additional-ETL-job-metrics/), and the AWS Glue developer guide titled Job Monitoring and Debugging (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html), and the AWS Glue developer guide titled GlueContext Class (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html)

 

Domain : Collection

Question 7 : You work as a data engineer for an international airline. Your company is building a data lake to house flight data including airplane travel routes, passenger capacity, weather patterns, fuel consumption, etc. You and your data engineering team have decided to use AWS Lake Formation to build your data lake. You are loading data from your operational systems and you need to decide which Lake Formation blueprint to use. If you choose to use a Database Snapshot type of Lake Formation blueprint, which of the following characteristics describe your requirements? (Select TWO)

A. Schema evolution is flexible, e.g. columns are re-named, previous columns are deleted, and new columns are added in their place
B. Schema evolution is incremental, e.g. there is only successive addition of columns
C. Complete consistency is needed between the source and the destination
D. Only new rows are added; previous rows are not updated
E. Only new rows are added; previous rows are updated

Correct Answers: A and C

Explanation:

Option A is correct. A Database Snapshot blueprint loads or reloads data from all tables into the data lake from a JDBC source. Therefore, schema evolution can be flexible.

Option B is incorrect. Incremental schema evolution is more suited to the Incremental Database blueprint.

Option C is correct. A Database Snapshot blueprint loads or reloads data from all tables into the data lake from a JDBC source. Therefore, you can achieve complete consistency between your source and your destination.

Option D is incorrect. When only adding new rows and not updating previous rows, the Incremental Database blueprint is a better choice. 

Option E is incorrect. This option does not match a use case for any of the three Lake Formation blueprints: Database Snapshot, Incremental Database, or Log File.

References: Please see the AWS Lake Formation developer guide titled AWS Lake Formation: How It Works (https://docs.aws.amazon.com/lake-formation/latest/dg/how-it-works.html), and the AWS Lake Formation developer guide titled Importing Data Using Workflows in Lake Formation (https://docs.aws.amazon.com/lake-formation/latest/dg/workflows.html), and the AWS Lake Formation developer guide titled Blueprints and Workflows in Lake Formation (https://docs.aws.amazon.com/lake-formation/latest/dg/workflows-about.html)

 

Domain : Processing

Question 8 : You work as a data engineer for a security surveillance company that provides video security for business and residential properties. You are building a live video streaming service that will be used for real-time video analysis of security camera footage. The camera devices you are using run a proprietary operating system and do not run a Java virtual machine. You and your team are writing the video processing code that extracts data from your camera video and sends the video fragments to your Kinesis video stream. Which coding approach is the most efficient for you to use?

A. Use the Kinesis Video Streams Producer Client
B. Use the Kinesis Producer Library (KPL)
C. Use the Kinesis Video Streams Producer Library
D. Use the Kinesis Video Streams Media Source Library 

Correct Answer: C

Explanation:

Option A is incorrect. The Kinesis Video Streams Producer Client is used when your producing device, or camera in your case, runs either Java or Android applications. Your devices run a proprietary operating system and don’t run a Java virtual machine.

Option B is incorrect. Since you are sending your data to a Kinesis Video stream, you should use one of the Kinesis Video Streams producer libraries. Using these libraries that are built for Kinesis Video Streams will be more efficient than using the KPL.

Option C is correct. You should use the Kinesis Video Streams Producer Library directly when the device on which you are running the application doesn’t have a Java virtual machine, and when your application is running on a device with a proprietary operating system.

Option D is incorrect. There is no Kinesis Video Streams Media Source Library. 

References: Please see the Amazon Kinesis Video Streams developer guide titled Amazon Kinesis Video Streams: How It Works (https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/how-it-works.html), and the Amazon Kinesis Video Streams developer guide titled Kinesis Video Streams API and Producer Libraries Support (https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/how-it-works-kinesis-video-api-producer-sdk.html), and the Amazon Kinesis Video Streams developer guide titled Kinesis Video Streams Producer Libraries (https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/producer-sdk.html), and the Amazon Kinesis Video Streams developer guide titled Using the C++ Producer Library (https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/producer-sdk-cpp.html)

 

Domain : Storage and Data Management

Question 9 : You work as a data scientist for an online retailer where you are responsible for managing the company’s product catalog. The catalog data of approximately 500,000 products is stored in their DynamoDB database. The DynamoDB tables that hold the product catalog data use 2GB of storage. You are building an Elasticsearch cluster to allow your analysts to efficiently search the product catalog. You are assuming a compression ratio of 1.0 for your indexed data. Also, you plan to use an m4.2xlarge Elasticsearch instance node type. You need to make sure the Elasticsearch cluster is highly available and that it is configured for optimal performance. How many Elasticsearch shards should you set for your index shard count and how many nodes should you create?

A. 1 shard and 2 nodes
B. 2 shards and 2 nodes
C. 2 shards and 1 node
D. 1 shard and 1 node 

Correct Answer: A

Explanation:

Option A is correct. The calculation for your shards and nodes: total storage to be indexed 2GB multiplied by your compression ratio of 1.0 gives you 2GB for your index size. To get the number of required shards, divide your index storage by 30GB. Therefore, 2GB/30GB means you can get away with having only one shard. Also, you need to have one replica for redundancy so your index storage becomes 2 X 2GB = 4GB. Your m4.2xlarge instance has 8GB storage so you could use one node. However, you need a highly available solution so you should add an additional node. Therefore, you need 2 nodes.

Option B is incorrect. Based on the shard calculation number of shards = index size / 30GB, your calculation is 2GB/30GB. Therefore, you only need one shard. 

Option C is incorrect. Based on the shard calculation number of shards = index size / 30GB, your calculation is 2GB/30GB. Therefore, you only need one shard. You should have two nodes for high availability.

Option D is incorrect. You should have two nodes for high availability. 
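
The sizing arithmetic in the explanation for Option A can be written out as a short calculation. This is a sketch under the same assumptions as the explanation (a 30GB target shard size and one replica); the function name is hypothetical, and the rule of one node per copy of the data is a simplification used here to express the high availability requirement.

```python
import math


def size_cluster(source_gb, compression_ratio=1.0, target_shard_gb=30, replicas=1):
    """Return (primary_shards, total_storage_gb, min_nodes) for an index."""
    index_gb = source_gb * compression_ratio
    # Round up, and never go below one shard.
    primary_shards = max(1, math.ceil(index_gb / target_shard_gb))
    # Every replica stores a full extra copy of the index.
    total_storage_gb = index_gb * (1 + replicas)
    # Simplification: one node per copy, so a replica never shares a node
    # with its primary.
    min_nodes = 1 + replicas
    return primary_shards, total_storage_gb, min_nodes


print(size_cluster(2))  # (1, 4.0, 2)
```

For the 2GB product catalog this gives 1 shard, 4GB of total index storage, and 2 nodes, matching Option A.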

References: Please see the AWS Database blog titled Get Started with Amazon Elasticsearch Service: How Many Shards Do I Need? (https://aws.amazon.com/blogs/database/get-started-with-amazon-elasticsearch-service-how-many-shards-do-i-need/), and the AWS Database blog titled Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need? (https://aws.amazon.com/blogs/database/get-started-with-amazon-elasticsearch-service-how-many-data-instances-do-i-need/)  

 

Domain : Analysis and Visualization

Question 10 :  You work as a data engineer for a marketing firm. You and your engineering team have been given the task of creating a dashboard of the data behind the marketing firm’s Objectives and Key Results (OKRs) and the progress toward achieving those OKRs. The source data for the OKR dashboard comes from many of the firm’s operational systems. You have created the initial visualization of your dashboard in QuickSight. How would you construct an architecture to visualize your dashboard in QuickSight that refreshes the data in your dashboard as soon as the data is available?

A. Operational data loaded into an S3 bucket; an EventBridge rule triggers a Lambda function which uses the CreateIngestion API operation to refresh the data in QuickSight SPICE
B. Operational data loaded into an S3 bucket; use the options on Datasets page to refresh the data in QuickSight SPICE
C. Operational data loaded into an S3 bucket; an EventBridge rule triggers a Lambda function which uses the CreateDataSet API operation to refresh the data in QuickSight SPICE
D. Operational data loaded into an S3 bucket; schedule refreshes in the dataset settings to refresh the data in QuickSight SPICE 

Correct Answer: A

Explanation:

Option A is correct. To have the latest data displayed in your dashboard, you need to refresh the SPICE data. There are 4 ways to refresh the SPICE data: use the options on Datasets page in the QuickSight UI, refresh the dataset by editing the dataset, schedule refreshes in the dataset settings, or use the CreateIngestion API operation to refresh the data. The best option to have the data refresh as soon as the data is available is to trigger a Lambda function to run the CreateIngestion API operation to refresh the data in QuickSight SPICE.

Option B is incorrect. This option requires manual intervention, therefore making it very inefficient and very unlikely that it would allow you to visualize the data as soon as it’s available. 

Option C is incorrect. The CreateDataSet API operation creates a new SPICE dataset; you wouldn’t use this API operation to refresh an existing dataset.

Option D is incorrect. Scheduled refreshes would eventually bring the new data into your dashboard, but your requirement is to visualize the data as soon as it is available.
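
A minimal sketch of the Lambda logic behind Option A follows. It assumes a boto3-style QuickSight client is passed in; the account ID, dataset ID, and function name are placeholders, and the use of a random UUID as the ingestion ID is a choice made for this example.

```python
import uuid


def refresh_spice_dataset(quicksight, account_id, dataset_id):
    """Start a SPICE refresh for one dataset and return the ingestion ID."""
    # Each refresh needs its own ingestion ID; a random UUID is used here.
    ingestion_id = str(uuid.uuid4())
    quicksight.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        IngestionId=ingestion_id,
    )
    return ingestion_id


# Inside a Lambda function triggered by the EventBridge rule, the handler
# could look like this (IDs are placeholders):
#
#   import boto3
#
#   def lambda_handler(event, context):
#       return refresh_spice_dataset(
#           boto3.client("quicksight"), "111122223333", "example-dataset-id"
#       )
```

Because the rule fires when new data lands in the S3 bucket, the refresh starts without waiting for a schedule or a manual step.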


References: Please see the Amazon QuickSight user guide titled Refreshing Data (https://docs.aws.amazon.com/quicksight/latest/user/refreshing-imported-data.html), and the Amazon QuickSight API reference titled CreateIngestion (https://docs.aws.amazon.com/quicksight/latest/APIReference/API_CreateIngestion.html), and the AWS Big Data blog titled Event-driven refresh of SPICE datasets in Amazon QuickSight (https://aws.amazon.com/blogs/big-data/event-driven-refresh-of-spice-datasets-in-amazon-quicksight/)

 

Domain : Storage and Data Management

Question 11 : You work as a data engineer for a shipping company. Your company tracks shipments, shipping containers, shipping contractors, and other related operational data in your data warehouse. You and your engineering team have chosen to use Redshift to house your data warehouse. Your company ingests its operational data every day but the initial storage requirement is relatively small. You have estimated that your data warehouse will grow over time, but will never exceed 1 petabyte in size. Your management team has mandated that you build the most cost effective storage and processing architecture for your Redshift cluster. Which storage node type gives you the best price/performance ratio?

A. ra3.4xlarge nodes
B. ra3.xlplus nodes
C. ds2.xlarge nodes
D. ds2.8xlarge nodes

Correct Answer: B

Explanation:

Option A is incorrect. The ra3.4xlarge node type gives you a total managed storage capacity of 8 petabytes, but you don’t expect that your storage requirement will ever exceed 1 petabyte. Also, the ra3.4xlarge node type costs $3.26 per hour. This is far more expensive than the ra3.xlplus node type.

Option B is correct. The ra3.xlplus node type gives you a total managed storage capacity of 1 petabyte, and it costs $1.086 per hour. Also, the ra3 node types use a distributed, hardware-accelerated cache that enables Redshift to run much faster than on the ds2 node types. Therefore, the ra3.xlplus node type gives you the best price/performance ratio.

Option C is incorrect. The ra3 node types use distributed, hardware-accelerated cache that enables Redshift to run much faster than the ds2 node type.

Option D is incorrect. The ra3 node types use distributed, hardware-accelerated cache that enables Redshift to run much faster than the ds2 node type. 

References: Please see the AWS Big Data blog titled Introducing Amazon Redshift RA3.xlplus nodes with managed storage (https://aws.amazon.com/blogs/big-data/introducing-amazon-redshift-ra3-xlplus-nodes-with-managed-storage/), and the Amazon Redshift cluster management guide titled Amazon Redshift clusters (https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html)

 

Domain : Security

Question 12 : You work as a data engineer for a financial services firm where you are responsible for the firm’s data warehouse and the associated security of the data warehouse. The data warehouse contains information about the firm’s clients and information about the firm’s trading activity. This data must be monitored for security auditing, specifically authentication attempts and connections/disconnections to the data warehouse. You have enabled audit logging on your Redshift cluster. Which Redshift audit log captures the authentication method for user activity in the data warehouse? 

A. User Activity log
B. User log
C. Connection log
D. Security log

Correct Answer: C

Explanation:

Option A is incorrect. The User Activity log captures information about the types of queries that users and system tasks perform in the database.

Option B is incorrect. The User log captures information about changes to database user definitions.

Option C is correct. The Connection log captures information about the users who are connecting to the database, including related connection information, such as the type of authentication used.

Option D is incorrect. There is no Redshift audit log named Security log. 

References: Please see the Amazon Redshift cluster management guide titled Database audit logging (https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html), and the Amazon Redshift cluster management guide titled Logging and monitoring in Amazon Redshift (https://docs.aws.amazon.com/redshift/latest/mgmt/security-incident-response.html)  

 

Domain : Collection

Question 13 : You work as a data engineer for a large regional medical insurance firm. Your firm gathers medical and insurance data from several sources that is loaded into your data lake. The data needs to be transformed as you load it into your data lake. You are designing your data loading process and you have identified the need for several small to medium-sized generic tasks that will be part of your ETL (extract, transform, load) workflow. You have chosen to use AWS Glue for your ETL workflow. Which of the types of AWS Glue jobs is the most cost effective, in terms of DPUs (data processing units), for your design?

A. Apache Spark
B. Python shell
C. Spark Streaming
D. Scala shell

Correct Answer: B

Explanation:

Option A is incorrect. Using Apache Spark would be more expensive than Python shell scripts. An Apache Spark job run in Glue requires a minimum of 2 DPUs. Each DPU costs $0.44 per DPU-hour in increments of 1 second, rounded up to the nearest second, with a 1-minute minimum billing duration. For Python scripts, Glue allocates 0.0625 DPU to each Python shell job. You are billed $0.44 per DPU-Hour in increments of 1 second, rounded up to the nearest second, with a 1-minute minimum duration for each job of type Python shell.

Option B is correct. The type of job you are running, small to medium-sized generic tasks, is best suited to Python shell scripts. Also, Python shell jobs are less expensive in terms of DPU allocation per job. Python shell jobs use either 1 or 0.0625 DPUs, whereas a Spark Streaming or Apache Spark job requires a minimum of 2 DPUs.

Option C is incorrect. A Spark Streaming job run in Glue requires a minimum of 2 DPUs. Each DPU costs $0.44 per DPU-hour in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum billing duration.

Option D is incorrect. There is no Scala shell script type of Glue job.

References: Please see the AWS Glue product page titled AWS Glue pricing (https://aws.amazon.com/glue/pricing/), and the AWS Announcement titled Introducing Python Shell Jobs in AWS Glue (https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/)   
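
The cost difference in Question 13 can be checked with a quick calculation. The sketch below uses only the figures quoted in the explanation ($0.44 per DPU-hour, per-second billing, a 1-minute minimum, 2 DPUs for Apache Spark, 0.0625 DPU for Python shell). The helper name glue_job_cost and the 5-minute runtime are illustrative assumptions, and current rates should be confirmed on the AWS Glue pricing page.

```python
import math

DPU_HOUR_RATE = 0.44  # USD per DPU-hour, as quoted in the explanation


def glue_job_cost(dpus, runtime_seconds, minimum_seconds=60):
    """Estimate the cost of one Glue job run (hypothetical helper).

    Billing is per second, rounded up, with a minimum billed duration.
    """
    billed_seconds = max(math.ceil(runtime_seconds), minimum_seconds)
    return dpus * DPU_HOUR_RATE * billed_seconds / 3600


# Assumed example: a small task that runs for 5 minutes (300 seconds).
spark_cost = glue_job_cost(dpus=2, runtime_seconds=300)
python_shell_cost = glue_job_cost(dpus=0.0625, runtime_seconds=300)

print(f"Apache Spark (2 DPUs):     ${spark_cost:.4f}")
print(f"Python shell (0.0625 DPU): ${python_shell_cost:.4f}")
```

Under these assumptions the Spark run costs 32 times as much as the Python shell run, simply because it is allocated 32 times as many DPUs for the same billed duration.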

 

Domain : Processing

Question 14 : You work as a data engineer for a transportation company. Your company streams data from several operational sources and data providers to build a data lake. Your management team uses the data in the data lake to create business intelligence dashboards. Your machine learning specialists also use the data lake as the source data for their machine learning models. You have built a real-time streaming data pipeline using Amazon Managed Streaming for Apache Kafka (Amazon MSK). You have created your MSK cluster and have configured MSK to create broker nodes in each Availability Zone in your region. Which of the Amazon MSK components coordinates cluster tasks and maintains state for resources interacting with your Apache Kafka cluster?

A. Broker Nodes
B. Zookeeper Nodes
C. Data Producer
D. Cluster Operator

Correct Answer: B

Explanation:

Option A is incorrect. In Amazon MSK, Apache Kafka partitions topics and replicates the partitions across multiple nodes called broker nodes. Apache Kafka runs the broker nodes.

Option B is correct. In Amazon MSK, the Zookeeper nodes coordinate cluster tasks and maintain state for resources interacting with an Apache Kafka cluster.

Option C is incorrect. In Amazon MSK, Data Producers are the applications that produce streaming data and send it to the cluster.

Option D is incorrect. There is no Cluster Operator component in Amazon MSK.

References: Please see the Amazon MSK product page titled Amazon Managed Streaming for Apache Kafka (Amazon MSK) (https://aws.amazon.com/msk/), and the Amazon Managed Streaming for Apache Kafka developer guide titled What Is Amazon MSK? (https://docs.amazonaws.cn/en_us/msk/latest/developerguide/what-is-msk.html), and the Amazon MSK FAQs (https://aws.amazon.com/msk/faqs/)

 

Domain : Security

Question 15 : You work as a data engineer for a company that offers a property rental service app. Your company’s data analysts need access to your data lake of property information to analyze the rental property data and produce dashboards and operational intelligence visualizations. The analysts need to be able to search through your property information using a fast search engine, so you have set up Elasticsearch for search and Kibana as your visualization tool. Your company uses single sign-on (SSO) technology for access to your internal applications. You need to control access to your Kibana service. Which access control method is the most efficient for your organization?

A. IP-based access policy
B. IAM users and roles
C. SAML authentication
D. Cognito authentication

Correct Answer: C

Explanation:

Option A is incorrect. An IP-based access policy is used for public access domains. You are running your Kibana service for your internal analysts. Also, Kibana is a JavaScript application that originates its requests from the user’s IP address. IP-based access control is impractical due to the sheer number of IP addresses you would need to allow in order for each user to have access to Kibana. You could solve this with a proxy server, but using SAML authentication is a simpler approach that also allows for single sign-on.

Option B is incorrect. Kibana does not natively support IAM users and roles.

Option C is correct. SAML authentication for Kibana lets you use your existing identity provider to offer single sign-on (SSO) for Kibana on your Elasticsearch domain.

Option D is incorrect. While you could use Cognito user and identity pools, it will be more efficient for you to use SAML authentication since your company already uses SSO.

References: Please see the Amazon Elasticsearch Service developer guide titled Using Kibana with Amazon Elasticsearch Service (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-kibana.html), and the Amazon Elasticsearch Service developer guide titled Configuring Amazon Cognito authentication for Kibana (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-cognito-auth.html), and the Amazon Elasticsearch Service developer guide titled SAML authentication for Kibana (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/saml.html)  

 

Domain : Analysis and Visualization

Question 16 : You work as a data engineer for an online retailer where you are setting up sessionization to track the clickstream data of your users. Your marketing department plans to use clickstream data analysis to help assess the effectiveness of your company’s new online features and marketing campaigns. Your clickstream data arrives at the rate of thousands of messages per second. Your marketing department wants to assess the data in real time so that they can be very nimble in their use of targeted marketing and features. How should you perform your sessionization?

A. Send the clickstream data through Kinesis Data Streams to Glue, use Glue to perform data sessionization
B. Send the clickstream data through Kinesis Data Streams to EMR, use EMR to perform data sessionization
C. Send the clickstream data through Kinesis Data Firehose to S3, use Athena to perform data sessionization
D. Send the clickstream data through Kinesis Data Streams to Kinesis Data Analytics, use Kinesis Data Analytics to perform data sessionization

Correct Answer: D

Explanation:

Option A is incorrect. You could perform the sessionization in batch jobs using Glue or Amazon EMR. However, this will not give you real-time access to the data. 

Option B is incorrect. You could perform the sessionization in batch jobs using Glue or Amazon EMR. However, this will not give you real-time access to the data.

Option C is incorrect. Streaming the data directly to S3 using Kinesis Data Firehose and accessing the data with Athena will not allow you to efficiently sessionize your clickstream data.

Option D is correct. Using Kinesis Data Analytics to sessionize your clickstream data is much faster than the other options. This configuration allows you to provide real-time sessionized data.

References: Please see the AWS Big Data blog titled Create real-time clickstream sessions and run analytics with Amazon Kinesis Data Analytics, AWS Glue, and Amazon Athena (https://aws.amazon.com/blogs/big-data/create-real-time-clickstream-sessions-and-run-analytics-with-amazon-kinesis-data-analytics-aws-glue-and-amazon-athena/), and the AWS What’s New page titled New Kinesis Analytics stream processing functions for time series analytics, real time sessionization, and more (https://aws.amazon.com/about-aws/whats-new/2017/09/new-kinesis-analytics-stream-processing-functions-for-time-series-analytics-real-time-sessionization-and-more/)
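
To make the term concrete: sessionization groups each user's clicks into sessions, starting a new session whenever the user has been inactive for longer than some timeout. The sketch below is a small in-memory illustration of that idea only. It is not how Kinesis Data Analytics implements it, and the 30-minute timeout, the (user_id, timestamp) event layout, and the function name are assumptions made for the example.

```python
from collections import defaultdict

SESSION_TIMEOUT_SECONDS = 30 * 60  # assumed inactivity gap that ends a session


def sessionize(events, timeout=SESSION_TIMEOUT_SECONDS):
    """Group (user_id, timestamp) click events into per-user sessions.

    A new session starts whenever the gap since the user's previous
    click exceeds `timeout` seconds. Returns {user_id: [[ts, ...], ...]}.
    """
    sessions = defaultdict(list)
    for user_id, ts in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions[user_id]
        if user_sessions and ts - user_sessions[-1][-1] <= timeout:
            user_sessions[-1].append(ts)  # continue the current session
        else:
            user_sessions.append([ts])    # start a new session
    return dict(sessions)


# Hypothetical clickstream: (user_id, epoch seconds)
clicks = [
    ("u1", 0), ("u1", 600), ("u1", 4000),
    ("u2", 100), ("u2", 200),
]
result = sessionize(clicks)
print(result)
```

In a streaming service the same grouping is applied continuously over time windows as records arrive, which is what makes the real-time requirement in the question achievable.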

 

Domain: Analysis and Visualization

Question 17 : You work as a data scientist for a logistics company. Your company has a fleet of thousands of trucks on the road at any given time delivering temperature-sensitive freight. You are responsible for building a dashboard that shows any anomaly in the temperatures of any of the trucks on the road. Each truck has a temperature sensor onboard that streams the current temperature of the onboard freight at 1-minute intervals. Which option gives you a streaming data pipeline, including real-time analytics, anomaly detection, and visualization, in the most efficient manner?

A. Sensors send temperature data to Kinesis Data Firehose, Kinesis Data Firehose performs anomaly detection using a Lambda function, Kinesis Data Firehose writes the processed anomaly data to S3, use QuickSight to visualize the processed anomaly data
B. Sensors send temperature data to Kinesis Data Streams, Kinesis Data Streams sends the temperature data to Kinesis Data Analytics, use the built-in Random Cut Forest function in Kinesis Data Analytics to detect anomalies in real time, Kinesis Data Analytics sends the processed anomaly data to a Kinesis Data Firehose delivery stream, Kinesis Data Firehose sends the processed anomaly data to an Elasticsearch cluster where you use Kibana to visualize the anomaly data
C. Sensors send temperature data to Kinesis Data Firehose, Kinesis Data Firehose streams the data to an S3 bucket, a SageMaker Random Cut Forest model detects anomalies in the data and writes the resulting processed anomaly data to another S3 bucket, use QuickSight to visualize the processed anomaly data
D. Sensors send temperature data to Kinesis Data Firehose, Kinesis Data Firehose streams the data to an S3 bucket, a SageMaker Random Cut Forest model detects anomalies in the data and writes the resulting processed anomaly data to an Elasticsearch cluster where you use Kibana to visualize the anomaly data

Correct Answer: B

Explanation:

Option A is incorrect. This option writes your streaming temperature data to an S3 bucket. This step will add latency in the processing, so you won’t get real-time anomaly detection. Also, writing a Lambda function to do anomaly detection is far less efficient than using a Random Cut Forest machine learning model. 

Option B is correct. Using Kinesis Data Analytics and its built-in Random Cut Forest feature you can detect temperature anomalies in real-time. Using Elasticsearch and Kibana you can easily visualize the anomaly data and provide a real-time dashboard.

Option C is incorrect. This option writes your streaming temperature data and your processed anomaly data to S3 buckets. These steps will add latency in the processing, so you won’t get real-time anomaly detection. 

Option D is incorrect. This option writes your streaming temperature data to an S3 bucket. This step will add latency in the processing, so you won’t get real-time anomaly detection.

References: Please see the AWS Big Data blog titled Perform Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch Service (https://aws.amazon.com/blogs/big-data/perform-near-real-time-analytics-on-streaming-data-with-amazon-kinesis-and-amazon-elasticsearch-service/), and the AWS Machine Learning blog titled Building a visual search application with Amazon SageMaker and Amazon ES (https://aws.amazon.com/blogs/machine-learning/building-a-visual-search-application-with-amazon-sagemaker-and-amazon-es/)

 

Domain: Storage and Data Management

Question 18 : You work as a data engineer for an international airline. Your data engineering team is responsible for the company’s data warehouse, which you have built on a Redshift cluster. The data warehouse stores information about the airline’s travel patterns, customer preferences, miles programs, etc. It is important that the Redshift cluster remains very highly available so you have configured automatic snapshots and automatic snapshot copy from your corporate headquarters (your source region) to your European regional office (your destination region). Due to changes in your corporate strategy, you now need to change your destination region to the Asia Pacific region. Which are the most efficient options (Select TWO)?

A. In the AWS console, select your Redshift cluster and specify the new destination AWS Region
B. Use the AWS CLI to select your Redshift cluster and specify the new destination AWS Region
C. Use the AWS console to disable the automatic copy feature, then re-enable it, specifying the new destination AWS Region
D. Use the AWS console to disable the automatic snapshot feature, then re-enable it, specifying the new destination AWS Region
E. Use the AWS CLI to disable the automatic copy feature, then re-enable it, specifying the new destination AWS Region

Correct Answers: C and E

Explanation:

Option A is incorrect. Through the console or the CLI, you can’t change the destination region while the Redshift automatic copy feature is enabled. You must first disable the automatic copy feature, then re-enable it, specifying the new destination region.

Option B is incorrect. Through the console or the CLI, you can’t change the destination region while the Redshift automatic copy feature is enabled. You must first disable the automatic copy feature, then re-enable it, specifying the new destination region.

Option C is correct. You can use the AWS console to first disable the automatic copy feature, then re-enable it, specifying the new destination region.

Option D is incorrect. You need to change the automatic copy feature, not the automatic snapshot feature.

Option E is correct. You can use the AWS CLI to first disable the automatic copy feature, then re-enable it, specifying the new destination region.

References: Please see the Amazon Redshift cluster management guide titled Amazon Redshift snapshots (https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html#cross-region-snapshot-copy), and the AWS CLI Command Reference titled disable-snapshot-copy (https://docs.aws.amazon.com/cli/latest/reference/redshift/disable-snapshot-copy.html), and the AWS CLI Command Reference titled enable-snapshot-copy (https://docs.aws.amazon.com/cli/latest/reference/redshift/enable-snapshot-copy.html)   
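
The disable-then-re-enable sequence from Question 18 can also be scripted. The sketch below assumes a client object that behaves like boto3.client("redshift"), whose disable_snapshot_copy and enable_snapshot_copy methods mirror the CLI commands referenced above. The cluster identifier, the region, and the RecordingClient stand-in (used so the example runs without AWS credentials) are hypothetical.

```python
def change_snapshot_copy_region(redshift, cluster_id, new_region):
    """Switch the cross-region snapshot copy destination for a cluster.

    `redshift` is expected to behave like boto3.client("redshift").
    The copy feature is disabled first, then re-enabled with the new region.
    """
    redshift.disable_snapshot_copy(ClusterIdentifier=cluster_id)
    redshift.enable_snapshot_copy(
        ClusterIdentifier=cluster_id,
        DestinationRegion=new_region,
    )


class RecordingClient:
    """Stand-in client (an assumption for this sketch) that only records calls."""

    def __init__(self):
        self.calls = []

    def disable_snapshot_copy(self, **kwargs):
        self.calls.append(("disable_snapshot_copy", kwargs))

    def enable_snapshot_copy(self, **kwargs):
        self.calls.append(("enable_snapshot_copy", kwargs))


client = RecordingClient()
change_snapshot_copy_region(client, "my-warehouse", "ap-southeast-1")
for name, kwargs in client.calls:
    print(name, kwargs)
```

Against a real cluster you would pass an actual Redshift client instead of the stand-in, and you may need to wait for the cluster to finish applying the first change before issuing the second call.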

 

Domain : Processing

Question 19 : You work as a data engineer for a global travel agency. Your company collects data from resorts across the globe to ingest into your data lake. Your marketing analysts use the data lake to generate insights to help produce the most effective marketing campaigns. You and your engineering team use AWS Glue to ingest your travel data into your data lake. You have configured your Glue jobs and development endpoints to use the Glue Data Catalog as an external Apache Hive metastore by checking the Use AWS Glue Data Catalog as the Hive metastore check box in the Catalog options group on the Add job and Add endpoint pages on the console. Which permissions should the IAM role used for your jobs and development endpoints have to allow use of the Glue Data Catalog as the Hive metastore?

A. glue:CreateDatabase
B. glue:CreateConnection
C. glue:CreateEndpoint
D. glue:CreateJob

Correct Answer: A

Explanation:

Option A is correct. To enable the Data Catalog access, the IAM role used for your jobs and development endpoints should have glue:CreateDatabase permissions.

Option B is incorrect. To enable the Data Catalog access, the IAM role used for your jobs and development endpoints should have glue:CreateDatabase permissions, not the glue:CreateConnection permissions.

Option C is incorrect. There is no glue:CreateEndpoint permission defined in IAM.

Option D is incorrect. To enable the Data Catalog access, the IAM role used for your jobs and development endpoints should have glue:CreateDatabase permissions, not the glue:CreateJob permissions.

References: Please see the AWS Glue developer guide titled AWS Glue Data Catalog Support for Spark SQL Jobs (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-data-catalog-hive.html), and the AWS Glue developer guide titled AWS Glue API Permissions: Actions and Resources Reference (https://docs.aws.amazon.com/glue/latest/dg/api-permissions-reference.html)
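
As a rough illustration of Question 19, an IAM policy statement granting the glue:CreateDatabase permission might look like the sketch below. It is built as a Python dictionary and printed as JSON. The wildcard resource is an assumption made for brevity, and a real role would normally be scoped more tightly and carry whatever other permissions the job needs.

```python
import json

# Minimal identity-policy statement granting only the permission named above.
# "Resource": "*" is an assumption made for brevity; scope it down in practice.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["glue:CreateDatabase"],
            "Resource": "*",
        }
    ],
}

print(json.dumps(policy, indent=2))
```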

 

Domain : Collection

Question 20 : You work as a data engineer for a data analytics company that sells data analytics solutions to marketing companies interested in targeted online marketing. Your data engineering department ingests real-time data streams from multiple sources to produce your data lake. You are building a data ingestion pipeline where you need to split data between multiple S3 buckets in your data lake in near real-time. Which is the best option for implementing your data ingestion requirement?

A. S3 Replication
B. S3 Batch
C. Snowball Edge
D. DataSync

Correct Answer: D

Explanation:

Option A is incorrect. You should use S3 Replication for continuous replication of data to a specific destination bucket, not to split data across multiple S3 buckets in your data lake.

Option B is incorrect. S3 Batch will not meet your near real-time requirement.

Option C is incorrect. Snowball Edge is used for offline data transfers where you are transferring data from remote or disconnected environments. 

Option D is correct. DataSync is the best option for splitting data between multiple buckets.

References: Please see the AWS DataSync FAQs page, specifically the question “To transfer objects between my buckets, when do I use AWS DataSync, when do I use S3 Replication, and when do I use S3 Batch Operations?” (https://aws.amazon.com/datasync/faqs/), and the AWS DataSync user guide titled What is AWS DataSync? (https://docs.aws.amazon.com/datasync/latest/userguide/what-is-datasync.html)  

 

Domain : Analysis and Visualization

Question 21 : You work as a data analyst for an online retail service. Your company uses a data lake housed on S3 to store user clickstream data, client account information, product information, etc. You have been given the assignment of creating a dashboard in QuickSight using your clickstream data to gain insight into user behavior. Which of the following data sources are NOT valid choices to build your QuickSight dashboard? (Select TWO)

A. Hive
B. Presto
C. DynamoDB
D. Redshift Spectrum
E. S3 Analytics

Correct Answers: A and C

Explanation:

Option A is correct. Hive is NOT a supported data source for QuickSight. 

Option B is incorrect. Presto is a supported data source for QuickSight.

Option C is correct. DynamoDB is NOT a supported data source for QuickSight.

Option D is incorrect. Redshift Spectrum is a supported data source for QuickSight.

Option E is incorrect. S3 Analytics is a supported data source for QuickSight.

References: Please see the Amazon QuickSight user guide titled Supported Data Sources (https://docs.aws.amazon.com/quicksight/latest/user/supported-data-sources.html), and the AWS Database blog titled How to perform advanced analytics and build visualizations of your Amazon DynamoDB data by using Amazon Athena (https://aws.amazon.com/blogs/database/how-to-perform-advanced-analytics-and-build-visualizations-of-your-amazon-dynamodb-data-by-using-amazon-athena/)

 

Domain : Security

Question 22 : You work as a data analyst for a global financial services company. Your company stores client information in their data lake for clients located in different countries around the world. In order to comply with data sovereignty laws you are required to store data in separate AWS accounts and you are barred from letting your client data leave their specific region. How can you ensure your data in your data lake is highly available? 

A. Use S3 Cross-Region Replication
B. Use S3 Same-Region Replication
C. Use S3 Time Control Replication
D. Use S3 Batch Replication

Correct Answer: B

Explanation:

Option A is incorrect. You have the requirement to keep your client data in the AWS region that is within the client’s country of origin. Cross-region replication could move the client data out of the client’s country of origin. 

Option B is correct. Same-region replication allows you to replicate data between buckets within the same region, thus satisfying the requirement to keep your client data in the AWS region that is within the client’s country of origin while also giving you high availability.

Option C is incorrect. S3 Time Control replication allows you to meet replication service level agreements (SLAs).

Option D is incorrect. There is no S3 Batch Replication.

Reference: Please see the Amazon S3 features page titled Amazon S3 Replication (https://aws.amazon.com/s3/features/replication/#:~:text=When%20to%20use%20S3%20Replication,and%20data%20sharing%20across%20accounts.)

 

Domain: Processing

Question 23 : You work as a data engineer for a hedge fund that trades on the global derivatives markets. Your firm gathers data from various streaming data services to populate its data lake on S3. The data frequently needs to be transformed before it’s stored in your data lake. You and your engineering team have built a data ingestion pipeline using Kinesis Data Firehose. Your Kinesis Data Firehose stream leverages Lambda functions to perform the necessary transformations. Sometimes your pipeline processes so much data at such a high rate that your AWS account reaches the Lambda invocation limit. What happens when your pipeline reaches the Lambda invocation limit?

A. Kinesis Data Firehose skips the failed batch of records, which are treated as unsuccessfully processed records and the records are lost
B. Kinesis Data Firehose retries the Lambda invocation three times by default. If the invocation still fails, Kinesis Data Firehose skips the failed batch of records, which are treated as unsuccessfully processed records, and the records are lost.
C. Kinesis Data Firehose retries the Lambda invocation three times by default. If the invocation still fails, Kinesis Data Firehose skips the failed batch of records, which are treated as unsuccessfully processed records, and the unsuccessfully processed records are delivered to your S3 bucket in the processing-failed folder.
D. Kinesis Data Firehose retries the Lambda invocation three times by default. If the invocation still fails, Kinesis Data Firehose skips the failed batch of records, which are treated as unsuccessfully processed records, and the unsuccessfully processed records are delivered to your SQS queue and tagged with the processing-failed label.

Correct Answer: C

Explanation:

Option A is incorrect. The records are not lost. Kinesis Data Firehose first retries the Lambda invocation 3 times by default. If the invocation still fails, Kinesis Data Firehose delivers the unsuccessfully processed records to one of your S3 buckets.

Option B is incorrect. The records are not lost. Kinesis Data Firehose first retries the Lambda invocation 3 times by default. If the invocation still fails, Kinesis Data Firehose delivers the unsuccessfully processed records to one of your S3 buckets.

Option C is correct. Kinesis Data Firehose ensures that your data is not lost. Kinesis Data Firehose first retries the Lambda invocation 3 times by default. If the invocation still fails, Kinesis Data Firehose delivers the unsuccessfully processed records to one of your S3 buckets.

Option D is incorrect. Kinesis Data Firehose delivers your unsuccessfully processed records to one of your S3 buckets, not an SQS queue.

Reference: Please see the Amazon Kinesis Data Firehose developer guide titled Amazon Kinesis Data Firehose Data Transformation (https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html)

 

Domain : Storage and Data Management

Question 24 : You work as a data engineer for a social media software company. You stream data from the company’s websites and mobile apps into your data lake. You also stream data from marketing analytics firms into your data lake. This data is transformed and aggregated and then loaded into your Redshift data warehouse for use in business intelligence dashboards and queries. You are now streaming a new data source (which is in the CSV format) using Kinesis Data Firehose and you have decided that the best format for this new data is parquet, since the source data is large and you can take advantage of partitioning and columnar query performance. Which option describes the most optimal way to transform the data and then load it from your data lake to your data warehouse?

A. Use Kinesis Data Firehose to transform the streaming data from CSV to parquet and set the destination of the transformed parquet data to your Redshift cluster.
B. Have your Kinesis Data Firehose stream leverage a Lambda function to transform the CSV data to JSON, then have your Kinesis Data Firehose stream convert the JSON data to parquet and set the destination of the transformed parquet data to your Redshift cluster.
C. Use Kinesis Data Firehose to transform the streaming data from CSV to parquet, then set the destination of the transformed parquet data to an S3 bucket, then use the Redshift COPY command to copy your parquet data to your Redshift cluster.
D. Have your Kinesis Data Firehose stream leverage a Lambda function to transform the CSV data to JSON, then have your Kinesis Data Firehose stream convert the JSON data to parquet and set the destination of the transformed parquet data to an S3 bucket, then use the Redshift COPY command to copy your parquet data to your Redshift cluster.

Correct Answer: D

Explanation:

Option A is incorrect. Kinesis Data Firehose cannot convert from CSV directly to parquet. It needs to leverage a Lambda function to first convert the data to JSON.

Option B is incorrect. When you enable record format conversion in Kinesis Data Firehose, you can’t set your Kinesis Data Firehose destination to your Redshift cluster. With format conversion enabled, S3 is the only destination that you can use for your Kinesis Data Firehose delivery stream.

Option C is incorrect. Kinesis Data Firehose cannot convert from CSV directly to parquet. It needs to leverage a Lambda function to first convert the data to JSON.

Option D is correct. Kinesis Data Firehose needs to leverage a Lambda function to first convert the data to JSON. You then need to set your destination to S3 because when you enable record format conversion in Kinesis Data Firehose, you can’t set your Kinesis Data Firehose destination to your Redshift cluster. With format conversion enabled, S3 is the only destination that you can use for your Kinesis Data Firehose delivery stream. Once your data is on S3, you can use the Redshift COPY command to load your data into your Redshift tables.

References: Please see the Amazon Kinesis Data Firehose developer guide titled Converting Your Input Record Format in Kinesis Data Firehose (https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html), and the AWS What’s New page titled Amazon Redshift Can Now COPY from Parquet and ORC File Formats (https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/), and the Amazon Redshift database developer guide titled Tutorial: Loading data from Amazon S3 (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html)
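
The CSV-to-JSON step in Question 24 is typically a small Lambda function attached to the delivery stream. The sketch below follows the Firehose data-transformation record contract (each incoming record has a recordId and base64-encoded data, and each returned record carries the same recordId, a result status, and base64-encoded data). The column names and the sample event are hypothetical, and a production function would need error handling suited to the real feed.

```python
import base64
import csv
import io
import json

COLUMNS = ["user_id", "event_type", "timestamp"]  # hypothetical CSV layout


def lambda_handler(event, context):
    """Firehose data-transformation handler: CSV line in, JSON line out."""
    output = []
    for record in event["records"]:
        try:
            line = base64.b64decode(record["data"]).decode("utf-8")
            values = next(csv.reader(io.StringIO(line)))
            if len(values) != len(COLUMNS):
                raise ValueError("unexpected column count")
            payload = json.dumps(dict(zip(COLUMNS, values))) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, StopIteration):
            # Hand the original record back marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


# Local check with a fabricated Firehose-style event.
sample = {"records": [{
    "recordId": "1",
    "data": base64.b64encode(b"42,click,2021-01-01T00:00:00Z\n").decode("utf-8"),
}]}
response = lambda_handler(sample, None)
print(base64.b64decode(response["records"][0]["data"]).decode("utf-8"))
```

With the records in JSON, Firehose record format conversion can then write parquet to S3, and the Redshift COPY command loads it from there, as described in Option D.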

 

Domain: Storage and Data Management  

Question 25: You work as a data scientist for an analytics company that receives streaming data from a marketing data source, which is combined with reference data from your S3 data lake. The streaming data is received via a Kinesis Data Firehose delivery stream which is fed into your Kinesis Data Analytics in-application input stream. You need to define the schema for your in-application input stream to deliver the data needed for your analytics application. Which of the following are valid selection criteria for defining what part of the streaming input is transformed into a data column in the in-application input stream? (Select TWO)

A. A JSONPath expression
B. A column number for input streams in CSV format
C. A row number for input streams in CSV format
D. A column number and a SQL data type for input streams in CSV format
E. An XPath expression

Correct Answers: A and B

Explanation:

Option A is correct. There are three streaming input types that are valid for selection criteria when defining the schema selection criteria for a Kinesis Data Analytics in-application input stream: a JSONPath expression, a column number for input streams in CSV format, and a column name and a SQL data type for input streams in CSV format.

Option B is correct. There are three streaming input types that are valid for selection criteria when defining the schema selection criteria for a Kinesis Data Analytics in-application input stream: a JSONPath expression, a column number for input streams in CSV format, and a column name and a SQL data type for input streams in CSV format.

Option C is incorrect. A row number for CSV input streams is not a valid streaming input type.

Option D is incorrect. A column number and a SQL data type for input streams in CSV format is not a valid streaming input type. A column name and a SQL data type for input streams in CSV format is a valid input type.

Option E is incorrect. An XPath expression is not a valid streaming input type.

References: Please see the Amazon Kinesis Data Analytics for SQL Applications developer guide titled Working with the Schema Editor (https://docs.aws.amazon.com/kinesisanalytics/latest/dev/console-summary-edit-schema.html), and the Amazon Kinesis Data Analytics for SQL Applications developer guide titled DiscoverInputSchema (https://docs.aws.amazon.com/kinesisanalytics/latest/dev/API_DiscoverInputSchema.html), and the Amazon Kinesis Data Analytics for SQL Applications developer guide titled Configuring Application Input (https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works-input.html)
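
To illustrate the JSONPath selection criterion from Question 25, the sketch below builds an input schema in the general shape used by the Kinesis Data Analytics for SQL Applications API, where each record column has a Name, a SqlType, and a Mapping. The helper name, column names, and JSONPath expressions are hypothetical, and the exact fields should be checked against the API reference before use.

```python
import json


def build_json_input_schema(columns):
    """Build a source-schema style dict for a JSON streaming source.

    `columns` is a list of (column_name, sql_type, jsonpath) tuples.
    """
    return {
        "RecordFormat": {
            "RecordFormatType": "JSON",
            "MappingParameters": {
                "JSONMappingParameters": {"RecordRowPath": "$"}
            },
        },
        "RecordEncoding": "UTF-8",
        "RecordColumns": [
            {"Name": name, "SqlType": sql_type, "Mapping": jsonpath}
            for name, sql_type, jsonpath in columns
        ],
    }


# Hypothetical columns: each JSONPath selects the part of the input record
# that becomes one column of the in-application input stream.
schema = build_json_input_schema([
    ("CAMPAIGN_ID", "VARCHAR(16)", "$.campaign.id"),
    ("CLICKS", "INTEGER", "$.metrics.clicks"),
])
print(json.dumps(schema, indent=2))
```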

FAQ

1. What is the job market like for AWS data analytics specialists?

The job market for AWS data analytics specialists is strong. According to the latest data from the U.S. Bureau of Labor Statistics, the median salary for this occupation is $86,010. The job market is expected to grow at a rate of 21 percent through 2026, which is much faster than the average for all occupations.

2. What are the most common AWS data analytics tools?

The most common AWS data analytics tools are Amazon Redshift, Amazon Athena, and Amazon EMR. These tools are used to manage and analyze data in the cloud.

3. What are the most common use cases for AWS data analytics?

Common use cases for AWS data analytics include data warehousing, data lakes, data mining, and business intelligence.

Summary

We hope you have enjoyed working through these AWS Data Analytics Specialty exam questions and answers. Also, it is recommended not to use any AWS Data Analytics Specialty dumps available online. Those questions are quite out of date, and AWS has the right to permanently ban you and cancel your certification at any moment.

Hence, spend more time learning the exam objectives and try out the AWS practice exam for the AWS Data Analytics Specialty.

We at Whizlabs provide AWS Data Analytics Specialty exam preparation guidance along with all of the training resources you need to pass the AWS Data Analytics Specialty certification exam successfully, such as video courses, practice tests, hands-on labs, and AWS sandboxes for real-time experiments.

Happy Learning!

About Dharmendra Digari

Dharmalingam carries years of experience as a product manager. He pursued his MBA, which honed his skills of seeing products differently than others perceive. He specialises in products from the information technology and services domain, with a proven history of expertise. His skills include AWS, Google Cloud Platform, Customer Relationship Management, IT Business Analysis and Customer Service Operations. He has specifically helped many companies in the e-commerce domain establish themselves with refined and well-developed products, carving a niche for themselves.