Domain: Spark ML
Question 1: What is the reason behind the compatibility of pandas API syntax within a Pandas UDF function when applied to a Spark DataFrame?
A. The Pandas UDF invokes Pandas Function APIs internally
B. The Pandas UDF utilizes pandas API on Spark within its function
C. The pandas API syntax cannot be implemented within a Pandas UDF function on a Spark DataFrame
D. The Pandas UDF automatically translates the function into Spark DataFrame syntax
E. The Pandas UDF leverages Apache Arrow to convert data between Spark and pandas formats
Correct Answer: E
Explanation
Apache Arrow is used by Pandas UDF to efficiently transfer data between Spark and pandas formats. This enables the Pandas UDF to perform operations using the pandas API on data in Spark DataFrames. It’s a critical component in providing high performance and user-friendly interoperability between pandas and Spark.
Option A is incorrect as Pandas Function APIs (such as mapInPandas() and applyInPandas()) are a separate family of APIs; a Pandas UDF does not invoke them internally.
Option B is incorrect as the function receives ordinary pandas objects and runs plain pandas code on them; it does not use the pandas API on Spark.
Option C is incorrect as a Pandas UDF does allow pandas API syntax within the function applied to a Spark DataFrame.
Option D is incorrect as the Pandas UDF does not translate the function into Spark DataFrame syntax. It uses Apache Arrow to exchange data between the JVM and the Python workers with very low (de)serialization cost, and pandas inside the function to work with pandas instances and APIs.
Domain: ML Workflows
Question 2: A data scientist is carrying out hyperparameter optimization using an iterative optimization algorithm. Each assessment of unique hyperparameter values is being trained on a distinct compute node. They are conducting eight evaluations in total on eight compute nodes. Although the accuracy of the model varies across the eight evaluations, they observe that there’s no consistent pattern of enhancement in the accuracy.
What modifications could the data scientist make to enhance their model’s accuracy throughout the tuning process?
A. Adjust the count of compute nodes to be half or fewer than half of the number of evaluations.
B. Switch the iterative optimization algorithm used to aid the tuning process.
C. Adjust the count of compute nodes to be double or more than double the number of evaluations.
D. Alter both the number of compute nodes and evaluations to be considerably smaller.
E. Adjust both the number of compute nodes and evaluations to be substantially larger.
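Correct Answer: A
Explanation
An iterative optimization algorithm chooses each new set of hyperparameter values based on the results of earlier evaluations. With eight evaluations running at once on eight compute nodes, every evaluation starts before any result is available, so the algorithm cannot learn from previous trials and the search behaves like a random search. That is why the accuracy varies without any trend of improvement. Reducing the number of compute nodes to half or fewer of the number of evaluations forces the evaluations to run in several rounds, so later rounds can build on earlier results.
Option B is incorrect as switching the algorithm does not help while all evaluations still run simultaneously; any iterative method would be equally starved of feedback. Options C and E are incorrect as adding compute nodes does not create sequential rounds, and option D is incorrect as shrinking both numbers together still runs every evaluation in a single round.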
Reference: https://spark.apache.org/docs/latest/tuning.html
Domain: Spark ML
Question 3: What is the primary use case for mapInPandas() in Databricks?
A. Executing multiple models in parallel
B. Applying a function to each partition of a DataFrame
C. Applying a function to grouped data within a DataFrame
D. Applying a function to co-grouped data from two DataFrames
Correct Answer: B
Explanation
mapInPandas() is used for applying a function to each partition of a DataFrame in Databricks. This function allows you to efficiently process large datasets by dividing the data into smaller partitions and applying the function in parallel.
Option A is incorrect as mapInPandas() is not specifically designed for executing multiple models in parallel. It is primarily used for applying a Python function to each partition of a Spark DataFrame. If you want to execute multiple models in parallel, other Spark functionalities or distributed computing approaches might be more suitable.
Option C is incorrect as this functionality belongs to groupBy().applyInPandas(), which operates on grouped data. mapInPandas() applies a function to each partition of a DataFrame, regardless of any grouping.
Option D is incorrect as mapInPandas() works on a single DataFrame. For co-grouped data from two DataFrames, the corresponding Pandas Function API is groupBy().cogroup().applyInPandas().
Reference: https://docs.databricks.com/en/pandas/pandas-function-apis.html
Domain: ML Workflows
Question 4: In which of the following scenarios should you put the CrossValidator inside the Pipeline?
A. When there are estimators or transformers in the pipeline
B. When there is a risk of data leakage from earlier steps in the pipeline
C. When you want to refit in the pipeline
D. When you want to train models in parallel
Correct Answer: C
Explanation
Placing the CrossValidator inside the Pipeline means the stages that precede it are fit once on the full training set, and only the estimator wrapped by the CrossValidator is refit for each fold and each hyperparameter combination. This is the arrangement to choose when refitting should be limited to that estimator, for example because the earlier stages are expensive to recompute.
The cost of this arrangement is a risk of data leakage. If earlier stages learn from the data during fitting (e.g., imputing missing values, scaling), fitting them once on all the training data lets information from the validation folds influence the training folds. When that matters, the Pipeline should be placed inside the CrossValidator instead, so that every stage is refit on each fold.
Therefore, when the goal is to tune and refit the estimator within the pipeline without recomputing the earlier stages for every fold, placing the CrossValidator inside the Pipeline is the appropriate approach.
Option A is incorrect as almost every pipeline contains estimators or transformers, so their presence alone does not determine where the CrossValidator belongs.
Option B is incorrect as a risk of data leakage from earlier steps is the reason to do the opposite and put the Pipeline inside the CrossValidator.
Option D is incorrect as parallel training of models is a characteristic of CrossValidator itself and does not depend on whether it is placed inside a Pipeline.
In summary, option C addresses the reason for placing the CrossValidator inside the Pipeline: refitting is confined to the estimator being tuned, while the earlier stages are fit only once.
Reference: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html
Domain: Databricks Machine Learning
Question 5: In Databricks Model Registry, how are different versions of a model with the same model name distinguished from each other?
A. By assigning a unique version number to each model
B. By appending a timestamp to the model name
C. By appending the user’s name to the model name
D. By using a unique model ID
Correct Answer: A
Explanation
In Databricks Model Registry, different versions of a model with the same model name are distinguished from each other by assigning a unique version number to each model. This version number is incremented for each new model registered under the same model name, allowing users to easily identify and manage different versions of the same model.
Other options are not relevant in the context of Databricks Model Registry.
Reference: https://docs.databricks.com/en/mlflow/models.html
Domain: Databricks Machine Learning
Question 6: A novice data scientist has recently joined an ongoing machine learning project. The project operates as a daily retraining scheduled job, housed in a Databricks Repository. The scientist’s task is to enhance the feature engineering of the pipeline’s preprocessing phase. They aim to amend the code in a way that will seamlessly integrate into the project without altering the daily operations.
Which strategy should the data scientist adopt to successfully execute this task?
A. Clone the project’s notebooks into a separate Databricks Repository and implement the required alterations there
B. Temporarily halt the project’s automatic daily operations and modify the existing code in its original location
C. Generate a new branch in Databricks, commit the changes there, and then push these modifications to the associated Git provider
D. Duplicate the project’s notebooks into a Databricks Workspace folder and implement the necessary adjustments there
E. Initiate a new Git repository, integrate this into Databricks, and transfer the original code from the current repository to this new one before making modifications
Correct Answer: C
Explanation
The most efficient approach for the data scientist to implement changes without disrupting the project’s daily operations would be to establish a new branch within Databricks, commit the modifications to this branch, and then push these updates to the associated Git provider. This methodology permits the review and testing of changes before they are merged with the main codebase, ensuring the project’s daily routine remains unaffected until the changes are confirmed and incorporated.
Other options are not correct with respect to Databricks Repos.
Reference: https://docs.databricks.com/repos/index.html
Domain: ML Workflows
Question 7: A data scientist has constructed a random forest regressor pipeline and integrated it as the final stage in a Spark ML Pipeline. They’ve initiated a cross-validation process, setting the pipeline with Random forest regressor method inside of it. What potential downside could arise from making a pipeline inside the cross-validation process?
A. fixed_price_df.filter(col("price") > 0)
B. fixed_price_df.filter("price" > 0)
C. fixed_price_df.contains(col("price") > 0)
D. fixed_price_df.where("price" > 0)
Correct Answer: A
Explanation
In Databricks, you can use the filter method with a column expression to filter a DataFrame based on a condition. The snippet fixed_price_df.filter(col("price") > 0) keeps only the rows of fixed_price_df whose price value is greater than 0.
Options B and D are incorrect as "price" > 0 compares a Python string with an integer instead of building a column expression, which raises an error. Option C is incorrect as a DataFrame has no contains method.
Domain: ML Workflows
Question 8: What is the primary advantage of parallelizing hyperparameter tuning?
A. It improves model performance
B. It reduces the dimensionality of the dataset
C. It speeds up the tuning process by evaluating multiple configurations simultaneously
D. It ensures the model is deployed correctly
Correct Answer: C
Explanation
The primary advantage of parallelizing hyperparameter tuning is that it speeds up the tuning process by evaluating multiple configurations simultaneously. This can significantly reduce the time required to identify the best-performing set of hyperparameters, particularly for large search spaces or computationally expensive models.
Other options are not relevant in the context of hyperparameter tuning.
Reference: https://docs.databricks.com/en/machine-learning/automl-hyperparam-tuning/index.html
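The idea of evaluating several configurations at the same time can be sketched in plain Python. This is only an illustration under stated assumptions: validation_loss is a hypothetical stand-in for training and scoring a model, the grid values are invented, and a thread pool takes the place of a cluster (for CPU-bound training in real code, separate processes or Spark workers would do the parallel work).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_loss(config):
    """Toy stand-in for 'train a model and score it'; lower is better."""
    learning_rate, depth = config
    return (learning_rate - 0.1) ** 2 + (depth - 4) ** 2


# Hypothetical search grid: 3 learning rates x 3 depths = 9 configurations.
grid = list(product([0.01, 0.1, 1.0], [2, 4, 8]))

# Evaluate up to four configurations at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(validation_loss, grid))

# Pick the configuration with the lowest loss.
best_loss, best_config = min(zip(losses, grid))
print(best_config, best_loss)
```

The selected configuration is the same one a sequential loop would find; parallelism changes how long the search takes, not which configuration wins.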
Domain: Databricks Machine Learning
Question 9: In Databricks AutoML, how can you navigate to the best model code across all of the model iterations?
A. Click on the “View Best Model” link after running the AutoML experiment
B. Click on the “View notebook for best model” link after running the AutoML experiment
C. Click on the “Get Best Model” link after running the AutoML experiment
D. Click on the “Top Model” link after running the AutoML experiment
Correct Answer: B
Explanation
In Databricks AutoML, you can navigate to the code for the best model by clicking the “View notebook for best model” link in the user interface. This opens the notebook that contains the best trial, allowing you to review the model’s details, hyperparameters, and performance metrics.
Other options are not available in AutoML.
Reference: https://docs.databricks.com/en/machine-learning/automl/index.html
Domain: ML Workflows
Question 10: Which of the following techniques can reduce overfitting?
A. Early stopping: a form of regularization that halts training with an iterative method, such as gradient descent, before the model overfits
B. Data augmentation: increasing the amount of training data using only information already in the training data (e.g., scaling or rotating images when detecting a dog in an image)
C. Regularization: a technique that reduces the complexity of the model
D. Dropout: a regularization technique that prevents overfitting
E. All of the above
Correct Answer: E
Explanation
All the mentioned techniques (Early stopping, Data augmentation, Regularization, and Dropout) can be used to reduce overfitting in a machine learning model.
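Of the techniques listed, early stopping is the easiest to show in a few lines. The sketch below is a minimal illustration, not a particular library’s implementation: the function name, the patience parameter, and the validation-loss history are all assumptions made for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


# Hypothetical validation losses: improving at first, then overfitting sets in.
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(history))
```

With this history the loop stops after two epochs without improvement and keeps the epoch with the lowest validation loss, instead of training through the epochs where the loss climbs again.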
Domain: Databricks Machine Learning
Question 11: A machine learning engineer has evaluated a new Staging version of a model in the MLflow Model Registry. After passing all the tests, the engineer would like to move this model to production by transitioning it to the Production stage in the Model Registry. From which section in Databricks Machine Learning can the engineer achieve this?
A. From the Run page in the Experiments section
B. From the Model page in the MLflow Model Registry
C. From the comment feature on the notebook page where the model was developed
D. From the Model Version page in the MLflow Model Registry
E. From the Experiment page in the Experiments section
Correct Answer: D
Explanation
The Model Version page in the MLflow Model Registry provides an interface for managing different versions of the model. It is from here that an engineer can transition a model from the Staging stage to the Production stage.
Other options are not relevant in the context of Databricks Model Registry.
Reference: https://docs.databricks.com/mlflow/index.html
Domain: ML Workflows
Question 12: A machine learning engineer attempts to scale an ML pipeline by distributing its single-node model tuning procedure. After broadcasting the entire training data onto each core, each core in the cluster is capable of training one model at once. As the tuning process is still sluggish, the engineer plans to enhance the parallelism from 4 to 8 cores to expedite the process. Unfortunately, the total memory in the cluster can’t be increased. Under which conditions would elevating the parallelism from 4 to 8 cores accelerate the tuning process?
A. When the data has a lengthy shape
B. When the data has a broad shape
C. When the model can’t be parallelized
D. When the tuning process is randomized
E. When the entire data can fit on each core
Correct Answer: E
Explanation
Increasing the parallelism from 4 to 8 cores can speed up the tuning process only if the entire dataset still fits in the memory available to each core. Because the total memory in the cluster is fixed, doubling the number of concurrent training tasks halves the memory each one can use. If the broadcast data still fits, eight models train at the same time instead of four, roughly doubling the speed of the tuning process. If it does not fit, the extra parallelism can slow the process down or cause it to fail with memory errors.
Option A is incorrect as the length of the data (number of rows) is not directly related to the parallelism in this context. The crucial factor is whether the entire data can fit on each core, regardless of the shape.
Option B is incorrect as the shape of the data (number of columns) is not the primary consideration for increasing parallelism. The key is whether the entire data can fit on each core.
Option C is incorrect because if the model cannot be parallelized, increasing the number of cores might not significantly accelerate the tuning process. The bottleneck could still be the model training itself, rather than the parallelism.
Option D is incorrect as the randomness in the tuning process does not necessarily depend on the number of cores. It’s more related to the nature of the algorithm being used for tuning. Increasing parallelism might not have a direct impact on the randomness of the tuning process.
Reference: https://spark.apache.org/docs/latest/tuning.html
Domain: ML Workflows
Question 13: Binning is the process of converting numeric data into categorical data. State True or False.
A. True
B. False
Correct Answer: A
Explanation
Binning is the process of converting numeric data into categorical data by grouping continuous data into discrete bins or intervals.
Reference: https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/mllib/tree/model/Bin.html
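As a small illustration of binning, the sketch below maps numeric ages to category labels with the standard-library bisect module. The bin edges and labels are hypothetical; in Spark ML, the Bucketizer transformer performs the same kind of mapping on a DataFrame column.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels for an "age" feature.
edges = [18, 35, 60]
labels = ["minor", "young_adult", "adult", "senior"]


def to_bin(value):
    """Return the label of the interval that contains value."""
    return labels[bisect_right(edges, value)]


ages = [5, 18, 34.5, 60, 72]
binned = [to_bin(a) for a in ages]
print(binned)
```

Each edge belongs to the interval on its right, so an age of exactly 18 lands in "young_adult" and an age of exactly 60 lands in "senior".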
Domain: Scaling ML Models
Question 14: How does Spark ML tackle a linear regression problem for an extraordinarily large dataset? Which one of the options is correct?
A. Brute Force Algorithm
B. Matrix decomposition
C. Singular value decomposition
D. Least square method
E. Gradient descent
Correct Answer: E
Explanation
Gradient descent is a simple but effective iterative optimization method for finding the minimum of a function. For problems too large to solve in closed form, Spark ML minimizes the error of the linear regression model with iterative, gradient-based optimization (in current versions, the l-bfgs solver of LinearRegression, a quasi-Newton relative of gradient descent), which only needs distributed passes over the data.
Option A is incorrect as brute force algorithms typically involve exhaustive search and are not efficient for large datasets. Linear regression on extraordinarily large datasets requires more scalable methods, such as iterative optimization algorithms.
Option B is incorrect as, although matrix decomposition techniques can be used in certain scenarios, they can become computationally expensive and memory-intensive, which makes them less practical for extremely large datasets.
Option C is incorrect as singular value decomposition is a matrix factorization technique and is not the primary method used for linear regression on large datasets in Spark ML. It may not scale well to extraordinarily large datasets.
Option D is incorrect as the least squares method is a general approach for solving linear regression problems, but for extraordinarily large datasets, iterative optimization algorithms like gradient descent are often preferred for their efficiency and scalability.
Reference: https://spark.apache.org/docs/latest/ml-classification-regression.html
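To make the iterative approach concrete, here is a minimal single-machine sketch of batch gradient descent for a one-feature linear regression. It is not Spark’s implementation; the toy data (generated from y = 2x + 1), the learning rate, and the step count are assumptions chosen for illustration.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every data point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_linear(xs, ys)
print(round(w, 3), round(b, 3))
```

Each step only needs sums over the data points, which is the property that makes this family of methods easy to distribute: partial sums can be computed per partition and then combined.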
Domain: Scaling ML Models
Question 15: Which of the following tools can be used to parallelize the hyperparameters tuning process for single node machine learning models using a Spark cluster?
A. MLflow Experiment tracking
B. Delta Lake
C. HyperOpt
D. AutoScaling Clusters
Correct Answer: C
Explanation
HyperOpt works with both distributed ML algorithms such as Apache Spark MLlib and single-machine ML models such as scikit-learn and TensorFlow. For single-node models, its SparkTrials class distributes the tuning trials across the workers of a Spark cluster.
Option A is incorrect as MLflow experiment tracking is a tracking tool, not a tuning tool.
Option B is incorrect as Delta Lake is an open-format storage layer that delivers reliability, security and performance on your data lake. It is not a tuning tool.
Option D is incorrect as autoscaling clusters are a resource management feature that scales computing resources up and down; they do not tune hyperparameters.
Reference: https://docs.databricks.com/en/machine-learning/automl-hyperparam-tuning/index.html
Domain: Spark ML
Question 16: Which of the following is a benefit of using vectorized pandas UDFs instead of the standard PySpark UDFs?
A. The vectorized pandas UDFs process data in memory rather than splitting it into tasks
B. The vectorized pandas UDFs allow the use of type hints
C. The vectorized pandas UDFs allow for pandas API use inside of the function
D. The vectorized pandas UDFs work on distributed dataframe
E. The vectorized pandas UDFs process data in batches rather than one row at a time
Correct Answer: E
Explanation
Vectorized pandas UDFs take advantage of pandas’ built-in optimizations, which can yield substantial performance improvements over the row-at-a-time processing of standard PySpark UDFs. Batch processing lets the UDFs benefit from the pandas and NumPy libraries, which are optimized for vectorized operations.
Option A is incorrect because vectorized pandas UDFs are not distinguished by processing data in memory. The data is still distributed across multiple machines, split into tasks, and processed in parallel.
Option B is incorrect because both vectorized pandas UDFs and standard PySpark UDFs support type hints.
Option C is not the best answer. Vectorized pandas UDFs do let users call the pandas API inside their functions (for example apply, groupby and rolling), which is useful for certain data processing tasks, but it’s not the primary benefit over standard PySpark UDFs.
Option D is incorrect as both vectorized pandas UDFs and standard PySpark UDFs work on distributed DataFrames.
Domain: Spark ML
Question 17: Which of the following describes the relationship between the native spark Dataframe and pandas API on the spark Dataframe?
A. pandas API on Spark Dataframes are single-node versions of Spark Dataframe with additional metadata
B. pandas API on Spark Dataframes are unrelated to Spark Dataframes
C. pandas API on Spark Dataframes are less mutable versions of Spark Dataframes
D. pandas API on Spark Dataframes are more performant than Spark Dataframes
E. pandas API on Spark Dataframes are made up of Spark Dataframes and additional metadata
Correct Answer: E
Explanation
The pandas API on Spark Dataframes allows users to utilize the familiar pandas API to perform operations on data stored in Spark Dataframes. It is implemented as a thin wrapper around the Spark Dataframes API, which means that it is built on top of that API and shares the same underlying data and metadata. As a result, a pandas API on Spark Dataframe can be thought of as a combination of a Spark Dataframe and additional metadata that allows the API to behave in a way that is familiar to users of pandas.
This can make it easier for users who are familiar with the pandas API to work with data stored in Spark Dataframes, and can also make it easier to integrate pandas-based code with Spark-based code.
Option A is incorrect as the pandas API on Spark is designed to work in a distributed environment, just like native Spark Dataframes.
Option B is incorrect as the pandas API on Spark is built on top of Spark Dataframes to provide a more pandas-like API.
Option C is incorrect as mutability is not what separates the two: a pandas API on Spark Dataframe wraps an immutable Spark Dataframe rather than being a less mutable version of it.
Option D is incorrect as the performance of the pandas API on Spark depends largely on the underlying Spark Dataframe operations, and the additional layer can introduce some overhead.
Reference: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/pandas_pyspark.html
Domain: Databricks Machine Learning
Question 18: In PySpark, _________ library is provided which makes integrating Python with Apache Spark easy.
A. Py3j
B. Py5j
C. Py2j
D. Py4j
Correct Answer: D
Explanation
In PySpark, the Py4J library is provided, which makes integrating Python with Apache Spark easy. It lets Python code access objects that live in the JVM.
Reference: https://spark.apache.org/docs/latest/api/python/index.html
Domain: Databricks Machine Learning
Question 19: A data scientist is using a feature store. In one of the feature tables, the data scientist wants to replace missing values with each respective feature variable’s median value.
A colleague suggests that the data scientist is throwing away valuable information by doing this. Which of the following approaches can they take to include as much information as possible in the feature set?
A. Create a binary feature variable for each feature that contains missing values indicating where each row’s value has been imputed
B. Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them
C. Create a constant feature variable for each feature that contains missing values indicating the percentage of rows from the feature that was originally missing
D. Impute the missing values using each respective feature variable’s mean value instead of the median value
E. Remove all feature variables that originally contained missing values from the feature set
Correct Answer: A
Explanation
Option A is correct because a binary indicator retains the information about which values were imputed, allowing the model to learn whether missingness itself is predictive.
Option B is incorrect as this might lead to issues as many machine learning algorithms struggle with missing data. It could result in biased or inaccurate model predictions.
Option C is incorrect because this would provide some information about the extent of missing values, but it might not be as informative as option A, which explicitly indicates whether a value has been imputed or not.
Option D is incorrect as using the mean might be sensitive to outliers, and the median is generally a better choice for imputation, especially if the distribution of the feature is skewed.
Option E is incorrect as this approach discards potentially valuable information and reduces the size of the feature set, which may negatively impact the model’s performance.
References: https://machinelearningmastery.com/binary-flags-for-missing-values-for-machine-learning/ , https://docs.databricks.com/en/machine-learning/feature-store/index.html
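The combination of median imputation and binary indicator features described in option A can be sketched in plain Python. The rows, column names, and the "_imputed" suffix are hypothetical choices for this example; a feature table would apply the same logic column by column.

```python
from statistics import median

# Hypothetical feature rows; None marks a missing value.
rows = [
    {"age": 30, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 40, "income": None},
    {"age": 50, "income": 58000},
]


def impute_with_flags(rows, columns):
    """Median-impute each column and add a 0/1 '<column>_imputed' flag."""
    medians = {
        c: median(r[c] for r in rows if r[c] is not None) for c in columns
    }
    result = []
    for r in rows:
        new_row = dict(r)
        for c in columns:
            missing = r[c] is None
            # Record whether this value was originally missing.
            new_row[f"{c}_imputed"] = int(missing)
            if missing:
                new_row[c] = medians[c]
        result.append(new_row)
    return result


imputed = impute_with_flags(rows, ["age", "income"])
print(imputed[1])
```

After imputation the filled-in values are indistinguishable from observed ones, so the flag columns are the only place where the original missingness survives.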
Domain: Databricks Machine Learning
Question 20: A data scientist is calculating the importance of features as an MLflow run. The feature importance values are stored in the pandas dataframe importance_df and are written as a CSV to DBFS location importance_path.
They would like to log these values with their active MLflow run. Which of the following lines of code can the data scientist use to log the feature importance values with their MLflow run?
A. mlflow.log_artifact(importance_df)
B. mlflow.log_artifact(importance_path, "importance.csv")
C. mlflow.log_artifact(importance_df, "importance_df")
D. mlflow.log_metric(importance_df, "importance_df")
Correct Answer: B
Explanation
To log a local file or directory as an artifact of the currently active run, the syntax in mlflow is as follows
mlflow.log_artifact(local_path: str, artifact_path: Optional[str])
Parameters
local_path – Path to the file to write.
artifact_path – If provided, the directory in artifact_uri to write to.
This method will create a new active run if no run is active.
Options A and C are incorrect as they pass the DataFrame itself instead of the path to the file. Option D is incorrect as mlflow.log_metric logs a single numeric value under a key, not a file or a DataFrame.
Domain: ML Workflows
Question 21: A data scientist has created two regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable.
When evaluating the Root Mean Squared Error (RMSE) of each model by comparing the label prediction to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following explanations for these differences is valid?
A. The second model is much more accurate than the first model
B. The data scientist failed to take the log of the predictions in the first model before computing the RMSE
C. The data scientist failed to exponentiate the predictions in the second model before computing the RMSE
D. The RMSE is an invalid evaluation metric for regression problems
E. The first model is much more accurate than the second model
Correct Answer: C
Explanation
When the second model uses log(price) as the label variable, the predictions made by the model are in the log scale. To compare the RMSE accurately with the actual price values, the predictions need to be exponentiated (reverse of taking the log) so that they are on the same scale as the original prices. If the data scientist failed to exponentiate the predictions in the second model, it could result in a much larger RMSE compared to the first model.
Option A is incorrect as this cannot be concluded solely based on the difference in RMSE. The scale of the label variable (price vs. log(price)) impacts the interpretation of RMSE, and a direct comparison may not accurately reflect model accuracy.
Option B is incorrect as this is not a valid explanation for the observed difference. If the first model uses price as the label variable, there’s no need to take the log of the predictions before computing RMSE.
Option D is incorrect as RMSE is a valid evaluation metric for regression problems. However, the interpretation might be affected if the scale of the label variable is transformed (e.g., using log) and predictions are not appropriately adjusted.
Option E is incorrect for the same reason as option A: a direct comparison of RMSE values is not valid when the predictions of the two models are on different scales, so it does not imply that the first model is more accurate.
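The effect of forgetting to exponentiate can be reproduced with a few invented numbers. In the sketch below, the prices and the nearly perfect log-scale predictions are assumptions made for illustration; only the comparison between the two RMSE values matters.

```python
import math


def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical prices and a model that predicts log(price) almost perfectly.
prices = [100.0, 250.0, 400.0]
log_predictions = [math.log(p) + 0.01 for p in prices]

# Wrong: log-scale predictions compared with prices on the original scale.
rmse_wrong = rmse(log_predictions, prices)

# Right: exponentiate first so both sides are in price units.
rmse_right = rmse([math.exp(lp) for lp in log_predictions], prices)

print(round(rmse_wrong, 2), round(rmse_right, 2))
```

The model is the same in both computations; the first RMSE is large only because values around 5 are being compared with prices in the hundreds.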
Domain: ML Workflows
Question 22: A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization’s leaders want to maximize the number of cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?
A. Accuracy
B. Precision
C. Recall
D. RMSE
E. Area under the ROC curve
Correct Answer: C
Explanation
In a medical context, especially when dealing with infections, maximizing the number of identified cases (minimizing false negatives) is often crucial. Recall (also known as sensitivity or true positive rate) is a metric that measures the ability of a classification model to capture all the positive cases. In this scenario, a high recall value would mean that the model is good at identifying individuals who have the specific type of infection, which aligns with the organization’s goal of maximizing the number of cases identified.
Option A is incorrect as accuracy counts both true positives and true negatives. In imbalanced datasets, which are common in medical contexts, high accuracy can be achieved by simply predicting the majority class, so it might not reflect the model’s effectiveness in identifying cases of the specific infection.
Option B is incorrect as precision is the ratio of true positives to all predicted positives. While precision is important, it is not the primary focus here: maximizing precision can lower recall, which means missing some positive cases.
Option D is incorrect as RMSE (Root Mean Squared Error) is a metric used for regression problems, not classification. It measures the average magnitude of the errors between predicted and actual values on a continuous scale.
Option E is incorrect as, while AUC-ROC is a good metric for binary classification, it doesn’t directly emphasize maximizing the number of positive cases identified. It summarizes the trade-off between sensitivity and specificity across thresholds but may not align with the organization’s goal of maximizing identified cases.
Reference: https://www.geeksforgeeks.org/metrics-for-machine-learning-model/
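The difference between recall and precision is easiest to see on concrete counts. The two confusion-matrix summaries below are hypothetical, each describing 100 infected patients, and the model names are just labels for the example.

```python
def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


# Hypothetical counts for two models scored on 100 infected patients.
cautious = {"tp": 60, "fp": 5, "fn": 40}    # flags few patients
sensitive = {"tp": 95, "fp": 45, "fn": 5}   # flags many patients

for name, m in [("cautious", cautious), ("sensitive", sensitive)]:
    print(name, recall(m["tp"], m["fn"]), round(precision(m["tp"], m["fp"]), 3))
```

The second model finds more of the infected patients at the price of more false alarms, which is the trade-off the organization accepts when it asks to maximize the number of cases identified.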
Domain: ML Workflows
Question 23: In which of the following scenarios should you put the Pipeline inside the Cross Validator?
A. When there are estimators or transformers in the pipeline
B. When there is a risk of data leakage from earlier steps in the pipeline
C. When you want to refit the estimators in the pipeline every time
D. When you want to train models in parallel
Correct Answer: B
Explanation
When we put a pipeline inside the cross validator, the cross validator first splits the data and then fits the pipeline, ensuring that there is no leakage of information.
Option A is incorrect as the mere presence of estimators or transformers does not decide the placement, since every pipeline is made up of such stages. What matters is whether fitting those stages on the full dataset could leak information into the validation folds.
Option C is incorrect as refitting the estimators in the pipeline on every fold is a consequence, and a compute cost, of putting the Pipeline inside the CrossValidator. It is not in itself a reason to choose that placement.
Option D is incorrect as parallel training of models is a characteristic of CrossValidator itself, controlled by its parallelism parameter. It does not depend on whether the Pipeline is placed inside the CrossValidator.
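The leakage argument can be illustrated with a toy sketch in plain Python. It does not use Spark ML. The helpers `k_folds` and `fit_centerer` are hypothetical stand-ins for a cross-validator and for an estimator stage that learns a statistic from data, and the numbers are invented.

```python
from statistics import mean

def k_folds(rows, k):
    """Yield (train, validation) splits by assigning every k-th row to a fold."""
    for i in range(k):
        validation = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, validation

def fit_centerer(rows):
    """Toy estimator: 'fitting' means learning the mean of the rows it sees."""
    return mean(rows)

# Invented feature values; 100.0 is an outlier that lands in the last fold.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

# Fit once on all rows (preprocessing outside the cross-validation loop):
# every validation row, including the outlier, has shaped this mean.
leaky_mean = fit_centerer(data)

# Refit on each training split (pipeline inside the cross-validation loop):
# each mean is learned without seeing that fold's validation rows.
fold_means = [fit_centerer(train) for train, _ in k_folds(data, 3)]

print(leaky_mean)
print(fold_means)
```

In the last fold, which holds out 100.0, the refit mean is 3.0, while the mean fitted on all rows is pulled up by the very value it will later be validated on. Refitting per fold costs more compute, which is the trade-off behind option C.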
Reference: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html
Domain: Databricks Machine Learning
Question 24: In Databricks, what information can you find on the run detail page?
A. The input parameters used for the run
B. The performance metrics recorded during the run
C. The model artifacts generated by the run
D. All of the above
Correct Answer: D
Explanation
In Databricks, the run detail page provides various information about a specific run, including the input parameters used for the run, the performance metrics recorded during the run, and the model artifacts generated by the run. This information can be useful for understanding the performance of different runs and comparing their results.
Reference: https://docs.databricks.com/en/mlflow/index.html
Domain: ML Workflows
Question 25: A data analyst is working on a project where they need to generate a detailed report on a DataFrame to be presented to stakeholders. The report should include count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each numerical column. Which Databricks command should they use?
A. Databricks Describe
B. Databricks Summary
C. Both Databricks Describe and Databricks Summary would work
D. Neither Databricks Describe nor Databricks Summary
E. Databricks Aggregation
Correct Answer: B
Explanation
In this scenario, the data analyst needs to generate a detailed report on a DataFrame, including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each numerical column. They should use the Databricks Summary command, which provides a more detailed summary of numerical columns than Databricks Describe.
The following are options for obtaining summary statistics on a Spark DataFrame:
- describe(): Provides count, mean, standard deviation, minimum, and maximum values.
- summary(): Includes everything in describe() and, by default, adds the approximate 25th, 50th (median), and 75th percentiles.
- dbutils.data.summarize: Calculates and displays summary statistics of an Apache Spark DataFrame or pandas DataFrame. This command is available for Python, Scala and R.
References: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.summary.html, https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.describe.html & https://docs.databricks.com/en/dev-tools/databricks-utils.html#summarize-command-dbutilsdatasummarize
Conclusion
We hope this collection of free questions serves as a valuable resource for anyone preparing for the Databricks Certified Machine Learning Associate certification exam. By working through these questions, you can assess your knowledge and skills in using Databricks for fundamental machine learning tasks.
Additionally, supplementing your preparation with further study materials and practical hands-on labs will contribute to your readiness to tackle the challenges of the certification exam.
- Free 25 Databricks Machine Learning Associate Exam Questions - March 21, 2024