
Pass Your Databricks Exam with Associate-Developer-Apache-Spark Exam Dumps (Updated 179 Questions)
Associate-Developer-Apache-Spark Exam Dumps - Databricks Practice Test Questions
NEW QUESTION 27
Which of the following code blocks returns a single-column DataFrame of all entries in the Python list throughputRates, which contains only float-type values?
- A. spark.createDataFrame(throughputRates, FloatType)
- B. spark.createDataFrame((throughputRates), FloatType)
- C. spark.createDataFrame(throughputRates)
- D. spark.DataFrame(throughputRates, FloatType)
- E. spark.createDataFrame(throughputRates, FloatType())
Answer: E
Explanation:
spark.createDataFrame(throughputRates, FloatType())
Correct! spark.createDataFrame is the correct operator to use here and the type FloatType() which is passed in for the command's schema argument is correctly instantiated using the parentheses.
Remember that it is essential in PySpark to instantiate types when passing them to SparkSession.createDataFrame. And, in Databricks, spark returns a SparkSession object.
spark.createDataFrame((throughputRates), FloatType)
No. While packing throughputRates in parentheses does not do anything to the execution of this command, not instantiating the FloatType with parentheses as in the previous answer will make this command fail.
spark.createDataFrame(throughputRates, FloatType)
Incorrect. Given that it does not matter whether you pass throughputRates in parentheses or not, see the explanation of the previous answer for further insights.
spark.DataFrame(throughputRates, FloatType)
Wrong. There is no SparkSession.DataFrame() method in Spark.
spark.createDataFrame(throughputRates)
False. Omitting the schema argument makes PySpark try to infer the schema. However, as you can see in the documentation (linked below), the inference expects data (the first argument to createDataFrame) to be an "RDD of either Row, namedtuple, or dict", that is, a collection of row-like elements. Since throughputRates is a plain Python list of bare float values, Spark's schema inference will fail.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation
NEW QUESTION 28
Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?
- A. itemsDf.cache(eager=True)
- B. itemsDf.cache().count()
- C. itemsDf.cache().filter()
- D. itemsDf.rdd.storeCopy()
- E. cache(itemsDf)
Answer: B
Explanation:
Caching means storing a copy of a partition on an executor so that subsequent operations can access it more quickly instead of having to recompute it. cache() is a lazily evaluated method of the DataFrame. Since count() is an action (while filter() is not), it triggers the caching process.
More info: pyspark.sql.DataFrame.cache - PySpark 3.1.2 documentation, Learning Spark, 2nd Edition, Chapter 7
NEW QUESTION 29
Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?
- A.
  counter = 0

  for index, row in itemsDf.iterrows():
      if 'Inc.' in row['supplier']:
          counter = counter + 1

  print(counter)
- B. print(itemsDf.foreach(lambda x: 'Inc.' in x))
- C.
  accum=sc.accumulator(0)

  def check_if_inc_in_supplier(row):
      if 'Inc.' in row['supplier']:
          accum.add(1)

  itemsDf.foreach(check_if_inc_in_supplier)
  print(accum.value)
- D.
  counter = 0

  def count(x):
      if 'Inc.' in x['supplier']:
          counter = counter + 1

  itemsDf.foreach(count)
  print(counter)
- E. print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())
Answer: C
Explanation:
Correct code block:
accum=sc.accumulator(0)

def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)
To answer this question correctly, you need to know both about the DataFrame.foreach() method and accumulators.
When Spark runs the code, it executes it on the executors. Each executor works with its own copy of any variable defined on the driver, so changes made on an executor never reach the driver's variable. This is why simply using a Python variable counter, like in the two examples that start with counter = 0, will not work. You need to tell Spark explicitly that the counter is a special shared variable, an Accumulator, which is managed by the driver and which all executors can add to.
If you have used Pandas in the past, you might be familiar with the iterrows() command. Notice that there is no such command in PySpark.
The two examples that start with print do not work, since DataFrame.foreach() does not have a return value.
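Independently of Spark, the counter = counter + 1 pattern inside a function already fails in plain Python, because the assignment makes counter a local name. The following is a minimal sketch of that behaviour only; it uses no Spark at all, and the supplier values are hypothetical sample data.

```python
counter = 0

def count(row):
    # Assigning to `counter` here makes it local to the function, so reading
    # it on the right-hand side fails before any value has been bound.
    if 'Inc.' in row['supplier']:
        counter = counter + 1

# Hypothetical rows standing in for the records of itemsDf.
rows = [{'supplier': 'Sports Company Inc.'}, {'supplier': 'YetiX'}]

errors = []
for row in rows:
    try:
        count(row)
    except UnboundLocalError as exc:
        errors.append(type(exc).__name__)

print(errors)   # one error, raised for the row that contains 'Inc.'
print(counter)  # the module-level counter is unchanged
```

An accumulator avoids this problem because the function calls a method (accum.add(1)) on a shared object instead of rebinding a name.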
More info: pyspark.sql.DataFrame.foreach - PySpark 3.1.2 documentation
NEW QUESTION 30
Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?
- A. transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
- B. transactionsDf.select(sqrt(predError))
- C. transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
- D. transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
- E. transactionsDf.select(sqrt("predError"))
Answer: D
Explanation:
transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
Correct. The DataFrame.withColumn() method is used to add a new column to a DataFrame. It takes two arguments: the name of the new column (here: predErrorSqrt) and a Column expression that defines the values of the new column. In PySpark, you can refer to an existing column as a Column object using col("predError") or by other means, for example transactionsDf.predError.
The question asks for the square root. sqrt() is a function in pyspark.sql.functions that calculates the square root. It takes a Column or a column name as input and returns a Column. Here, the input is the predError column of DataFrame transactionsDf, expressed through col("predError").
transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
Incorrect. In this expression, sqrt(predError) is incorrect syntax. You cannot refer to predError in this way - to Spark it looks as if you are trying to refer to the non-existent Python variable predError.
You could pass transactionsDf.predError, col("predError") (as in the correct solution), or even just "predError" instead.
transactionsDf.select(sqrt(predError))
Wrong. Here, the explanation just above this one about how to refer to predError applies.
transactionsDf.select(sqrt("predError"))
No. While this is correct syntax, it will return a single-column DataFrame only containing a column showing the square root of column predError. However, the question asks for a column to be added to the original DataFrame transactionsDf.
transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
No. The issue with this statement is that column col("predError") has no sqrt() method. sqrt() is a member of pyspark.sql.functions, but not of pyspark.sql.Column.
More info: pyspark.sql.DataFrame.withColumn - PySpark 3.1.2 documentation and pyspark.sql.functions.sqrt - PySpark 3.1.2 documentation
NEW QUESTION 31
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?
Excerpt of DataFrame transactionsDf:
- A. transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
- B. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
- C. transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
- D. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
- E. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
Answer: B
Explanation:
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
Correct. This code block adds a new column named transactionDateFormatted to DataFrame transactionsDf, using Spark's from_unixtime function to transform the values in column transactionDate into strings that follow the format requested in the question.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
No. Although almost correct, this uses the wrong format for the timestamp-to-date conversion: day/month/year instead of month/day/year.
transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
Incorrect. This answer uses wrong syntax. The command DataFrame.withColumnRenamed() is for renaming an existing column and only takes two string parameters, specifying the old and the new name of the column.
transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
Wrong. Although this answer looks very tempting, it is actually incorrect Spark syntax. In Spark, there is no method DataFrame.apply(). Spark has an apply() method that can be used on grouped data, but this is irrelevant for this question, since we do not deal with grouped data here.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
No. Although this is valid Spark syntax, the strings in column transactionDateFormatted would look like 2020-04-26 15:35:32, which is the default format of from_unixtime and not what is asked for in the question.
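To see how the three format choices differ, the same conversion can be sketched with Python's standard library. This is only an analogy: strftime directives such as %m/%d/%Y play the role of Spark's MM/dd/yyyy pattern, the epoch value is a hypothetical one chosen to match the example string above, and UTC is assumed as the time zone.

```python
from datetime import datetime, timezone

# Hypothetical epoch value; in UTC it corresponds to 2020-04-26 15:35:32.
ts = 1587915332

moment = datetime.fromtimestamp(ts, tz=timezone.utc)

month_day_year = moment.strftime("%m/%d/%Y")          # like MM/dd/yyyy
day_month_year = moment.strftime("%d/%m/%Y")          # like dd/MM/yyyy
default_style = moment.strftime("%Y-%m-%d %H:%M:%S")  # like the default format

print(month_day_year)  # 04/26/2020
print(day_month_year)  # 26/04/2020
print(default_style)   # 2020-04-26 15:35:32
```

Only the first string matches the month/day/year format the question asks for.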
More info: pyspark.sql.functions.from_unixtime - PySpark 3.1.1 documentation and pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation
NEW QUESTION 32
Which of the following is a problem with using accumulators?
- A. Only numeric values can be used in accumulators.
- B. Accumulator values can only be read by the driver, but not by executors.
- C. Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.
- D. Only unnamed accumulators can be inspected in the Spark UI.
- E. Accumulators do not obey lazy evaluation.
Answer: B
Explanation:
Accumulator values can only be read by the driver, but not by executors.
Correct. So, for example, you cannot use an accumulator variable for coordinating workloads between executors. The typical, canonical, use case of an accumulator value is to report data, for example for debugging purposes, back to the driver. For example, if you wanted to count values that match a specific condition in a UDF for debugging purposes, an accumulator provides a good way to do that.
Only numeric values can be used in accumulators.
No. While PySpark's Accumulator only supports numeric values by default (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (documentation linked below).
Accumulators do not obey lazy evaluation.
Incorrect - accumulators do obey lazy evaluation. This has implications in practice: When an accumulator is encapsulated in a transformation, that accumulator will not be modified until a subsequent action is run.
Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.
Wrong. A concern with accumulators is in fact that under certain conditions they can be updated more than once per task. For example, if a hardware failure occurs during a task after an accumulator variable has been increased but before the task has finished, and Spark relaunches the task on a different worker in response to the failure, the accumulator increases that were already executed will be repeated.
Only unnamed accumulators can be inspected in the Spark UI.
No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.
More info: Aggregating Results with Spark Accumulators | Sparkour, RDD Programming Guide - Spark 3.1.2 Documentation, pyspark.Accumulator - PySpark 3.1.2 documentation, and pyspark.AccumulatorParam - PySpark 3.1.2 documentation
NEW QUESTION 33
Which of the following statements about stages is correct?
- A. Tasks in a stage may be executed by multiple machines at the same time.
- B. Different stages in a job may be executed in parallel.
- C. Stages ephemerally store transactions, before they are committed through actions.
- D. Stages may contain multiple actions, narrow, and wide transformations.
- E. Stages consist of one or more jobs.
Answer: A
Explanation:
Tasks in a stage may be executed by multiple machines at the same time.
This is correct. Within a single stage, tasks do not depend on each other. Executors on multiple machines may execute tasks belonging to the same stage on the respective partitions they are holding at the same time.
Different stages in a job may be executed in parallel.
No. Different stages in a job depend on each other and cannot be executed in parallel. The nuance is that every task in a stage may be executed in parallel by multiple machines.
For example, if a job consists of Stage A and Stage B, tasks belonging to those stages may not be executed in parallel. However, tasks from Stage A may be executed on multiple machines at the same time, with each machine running it on a different partition of the same dataset. Then, afterwards, tasks from Stage B may be executed on multiple machines at the same time.
Stages may contain multiple actions, narrow, and wide transformations.
No, stages may not contain multiple wide transformations. Wide transformations mean that shuffling is required. Shuffling typically terminates a stage though, because data needs to be exchanged across the cluster. This data exchange often causes partitions to change and rearrange, making it impossible to perform tasks in parallel on the same dataset.
Stages ephemerally store transactions, before they are committed through actions.
No, this does not make sense. Stages do not "store" any data. Transactions are not "committed" in Spark.
Stages consist of one or more jobs.
No, it is the other way around: Jobs consist of one or more stages.
More info: Spark: The Definitive Guide, Chapter 15.
NEW QUESTION 34
Which of the following code blocks returns approximately 1000 rows, some of them potentially being duplicates, from the 2000-row DataFrame transactionsDf that only has unique rows?
- A. transactionsDf.sample(True, 0.5)
- B. transactionsDf.take(1000).distinct()
- C. transactionsDf.sample(True, 0.5, force=True)
- D. transactionsDf.sample(False, 0.5)
- E. transactionsDf.take(1000)
Answer: A
Explanation:
To solve this question, you need to know that DataFrame.sample() is not guaranteed to return the exact fraction of the number of rows specified as an argument. Furthermore, since duplicates may be returned, you should understand that the operator's withReplacement argument should be set to True. A force= argument for the operator does not exist.
While the take() method returns an exact number of rows, it just takes the first specified number of rows (1000 in this question) from the DataFrame, and it returns them as a Python list of Row objects rather than as a DataFrame. Since the DataFrame does not include duplicate rows, none of the rows returned by take() can be duplicates, so the correct answer cannot involve take().
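The statistics behind "approximately 1000 rows, some of them potentially being duplicates" can be sketched in plain Python. This is a simplified model and not Spark's implementation: it assumes that sampling with replacement draws an independent Poisson-distributed count per row, and that sampling without replacement keeps each row with an independent coin flip. The helper names are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_with_replacement(rows, fraction, seed):
    # Each row is emitted 0, 1, 2, ... times, `fraction` times on average.
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        sampled.extend([row] * poisson(fraction, rng))
    return sampled

def sample_without_replacement(rows, fraction, seed):
    # Each row is kept at most once, with probability `fraction`.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(2000))  # 2000 unique rows
with_dupes = sample_with_replacement(rows, 0.5, seed=42)
no_dupes = sample_without_replacement(rows, 0.5, seed=42)

print(len(with_dupes), len(set(with_dupes)))  # total rows vs distinct rows
print(len(no_dupes), len(set(no_dupes)))      # these two counts are equal
```

In both cases the row count only lands near 1000 rather than exactly on it, and only the with-replacement variant can return the same row more than once.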
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
NEW QUESTION 35
Which of the following code blocks returns a copy of DataFrame transactionsDf that only includes columns transactionId, storeId, productId and f?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
- A. transactionsDf.drop("predError", "value")
- B. transactionsDf.drop(value, predError)
- C. transactionsDf.drop(["predError", "value"])
- D. transactionsDf.drop([col("predError"), col("value")])
- E. transactionsDf.drop(col("value"), col("predError"))
Answer: A
Explanation:
Output of correct code block:
+-------------+-------+---------+----+
|transactionId|storeId|productId|   f|
+-------------+-------+---------+----+
|            1|     25|        1|null|
|            2|      2|        2|null|
|            3|     25|        3|null|
+-------------+-------+---------+----+
To solve this question, you should be familiar with the drop() API. The order of column names does not matter; in this question the order differs in some answers just to confuse you. Also, drop() does not take a list. The *cols parameter in the documentation means that drop() accepts any number of positional arguments, each of which is interpreted as a column name.
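What a *cols signature means in Python can be shown with a toy function. The drop function below is a hypothetical stand-in that only returns the arguments it received; it is not Spark's implementation.

```python
def drop(*cols):
    """Hypothetical stand-in for a method declared as drop(*cols)."""
    return cols

names = ["predError", "value"]

separate = drop("predError", "value")  # two arguments, one per column name
as_list = drop(names)                  # one argument: the list itself
unpacked = drop(*names)                # unpacking turns the list into two arguments

print(separate)
print(as_list)
print(unpacked)
```

Passing a list therefore hands the function a single argument that is not a column name, which is why the list-based answer options can be ruled out.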
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
NEW QUESTION 36
The code block shown below should return a copy of DataFrame transactionsDf with an added column cos.
This column should have the values in column value converted to degrees and having the cosine of those converted values taken, rounded to two decimals. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__, round(__3__(__4__(__5__)),2))
- A. 1. withColumn
2. col("cos")
3. cos
4. degrees
5. col("value")
- B. 1. withColumn
2. "cos"
3. cos
4. degrees
5. transactionsDf.value
- C. 1. withColumn
2. col("cos")
3. cos
4. degrees
5. transactionsDf.value
- D. 1. withColumnRenamed
2. "cos"
3. cos
4. degrees
5. "transactionsDf.value"
- E. 1. withColumn
2. "cos"
3. degrees
4. cos
5. col("value")
Answer: B
Explanation:
Correct code block:
transactionsDf.withColumn("cos", round(cos(degrees(transactionsDf.value)),2))
This question is especially confusing because col, "cos", and cos look so similar. Similar-looking answer options can also appear in the exam and, just like in this question, you need to pay attention to the details to identify the correct answer option.
The first answer option to throw out is the one that starts with withColumnRenamed: The question speaks specifically of adding a column. The withColumnRenamed operator only renames an existing column, however, so you cannot use it here.
Next, you will have to decide what should be in gap 2, the first argument of transactionsDf.withColumn().
Looking at the documentation (linked below), you can find out that the first argument of withColumn actually needs to be a string with the name of the column to be added. So, any answer that includes col("cos") as the option for gap 2 can be disregarded.
This leaves you with two possible answers. The real difference between these two answers is where the cos and degrees functions are, either in gaps 3 and 4, or vice versa. From the question you can find out that the new column should have "the values in column value converted to degrees and having the cosine of those converted values taken". This prescribes a clear order of operations: First, you convert values from column value to degrees and then you take the cosine of those values. So, the inner parentheses (gap 4) should contain the degrees function and then, logically, gap 3 holds the cos function. This leaves you with just one possible correct answer.
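That the nesting order matters can be checked with Python's math module, which offers functions of the same names. This is a sketch on a single number rather than on a Spark column, and the input value is a hypothetical example.

```python
import math

value = math.pi  # hypothetical entry of column value

# Order asked for in the question: convert to degrees first, then take the cosine.
cos_of_degrees = round(math.cos(math.degrees(value)), 2)

# Swapped order: take the cosine first, then convert the result to degrees.
degrees_of_cos = round(math.degrees(math.cos(value)), 2)

print(cos_of_degrees)  # -0.6
print(degrees_of_cos)  # -57.3
```

The innermost call runs first, so the function named first in the question text (degrees) has to sit in the inner gap.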
More info: pyspark.sql.DataFrame.withColumn - PySpark 3.1.2 documentation
NEW QUESTION 37
Which of the following describes Spark's standalone deployment mode?
- A. Standalone mode uses a single JVM to run Spark driver and executor processes.
- B. Standalone mode is how Spark runs on YARN and Mesos clusters.
- C. Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.
- D. Standalone mode uses only a single executor per worker per application.
- E. Standalone mode means that the cluster does not contain the driver.
Answer: D
Explanation:
Standalone mode uses only a single executor per worker per application.
This is correct and a limitation of Spark's standalone mode.
Standalone mode is a viable solution for clusters that run multiple frameworks.
Incorrect. A limitation of standalone mode is that Apache Spark must be the only framework running on the cluster. If you would want to run multiple frameworks on the same cluster in parallel, for example Apache Spark and Apache Flink, you would consider the YARN deployment mode.
Standalone mode uses a single JVM to run Spark driver and executor processes.
No, this is what local mode does.
Standalone mode is how Spark runs on YARN and Mesos clusters.
No. YARN and Mesos modes are two deployment modes that are different from standalone mode. These modes allow Spark to run alongside other frameworks on a cluster. When Spark is run in standalone mode, only the Spark framework can run on the cluster.
Standalone mode means that the cluster does not contain the driver.
Incorrect, the cluster does not contain the driver in client mode, but in standalone mode the driver runs on a node in the cluster.
More info: Learning Spark, 2nd Edition, Chapter 1
NEW QUESTION 38
The code block shown below should return a DataFrame with all columns of DataFrame transactionsDf, but at most 2 rows in which column productId has at least the value 2. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__).__3__
- A. 1. where
2. productId >= 2
3. limit(2)
- B. 1. filter
2. col("productId") >= 2
3. limit(2)
- C. 1. filter
2. productId > 2
3. max(2)
- D. 1. where
2. transactionsDf[productId] >= 2
3. limit(2)
- E. 1. where
2. "productId" > 2
3. max(2)
Answer: B
Explanation:
Correct code block:
transactionsDf.filter(col("productId") >= 2).limit(2)
The filter and where operators in gap 1 are just aliases of one another, so you cannot use them to pick the right answer.
The column definition in gap 2 is more helpful. The DataFrame.filter() method takes an argument of type Column or str. Of all possible answers, only the one including col("productId") >= 2 fits this profile, since it returns a Column.
The answer option using "productId" > 2 is invalid: Python compares the plain string "productId" with the number 2 before Spark is involved, so Spark never learns that "productId" refers to column productId. The answer option using transactionsDf[productId] >= 2 is wrong because productId is not quoted, so Python treats it as a variable name; with quotes, transactionsDf["productId"] would be a valid way to refer to the column. In all other options, productId is likewise referred to as a Python variable, so they are relatively easy to eliminate.
Also note that the question asks for the value in column productId being at least 2. This translates to a "greater or equal" sign (>= 2), not a "greater" sign (> 2).
Another thing worth noting is that there is no DataFrame.max() method. If you picked any option including this, you may be confusing it with the pyspark.sql.functions.max function. The correct method to limit the number of rows is the DataFrame.limit() method.
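Two of the gap-2 candidates fail in Python itself, before Spark receives anything. The sketch below evaluates just those expressions in plain Python, assuming no variable named productId has been defined.

```python
outcomes = {}

try:
    "productId" > 2   # a plain str compared with an int
except TypeError as exc:
    outcomes["quoted"] = type(exc).__name__

try:
    productId >= 2    # no Python variable called productId exists
except NameError as exc:
    outcomes["unquoted"] = type(exc).__name__

print(outcomes)
```

Wrapping the name in col(...) avoids both problems, because the comparison then happens on a Column object and yields a Column expression instead of a Python boolean.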
More info:
- pyspark.sql.DataFrame.filter - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.limit - PySpark 3.1.2 documentation
NEW QUESTION 39
Which of the following describes characteristics of the Spark driver?
- A. If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.
- B. In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.
- C. The Spark driver requests the transformation of operations into DAG computations from the worker nodes.
- D. The Spark driver processes partitions in an optimized, distributed fashion.
- E. The Spark driver's responsibility includes scheduling queries for execution on worker nodes.
Answer: E
Explanation:
The Spark driver requests the transformation of operations into DAG computations from the worker nodes.
No, the Spark driver transforms operations into DAG computations itself.
If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.
No. There is always a single driver per application, but one or more executors.
The Spark driver processes partitions in an optimized, distributed fashion.
No, this is what executors do.
In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.
Wrong. In a non-interactive Spark application, you need to create the SparkSession object. In an interactive Spark shell, the Spark driver instantiates the object for you.
NEW QUESTION 40
Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms where column season is of data type string and column wind_speed_ms is of data type double?
- A. spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
- B. from pyspark.sql import types as T
spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))
- C. spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
- D. spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
- E. spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
Answer: A
Explanation:
spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
Correct. This command uses the SparkSession's createDataFrame method to create a new DataFrame. Notice how rows, columns, and column names are passed in here: The rows are specified as a Python list. Every entry in the list is a new row, written as a Python tuple (for example ("summer", 4.5)). Every entry in a tuple is the value for one column.
The column names are specified as the second argument to createDataFrame(). The documentation (link below) shows that "when schema is a list of column names, the type of each column will be inferred from data" (the first argument). Since values 4.5 and 7.5 are both Python floats, Spark will correctly infer the double type for column wind_speed_ms. Given that all values in column season are strings, Spark will infer the string type for that column.
Find out more about SparkSession.createDataFrame() via the link below.
spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
No, the SparkSession does not have a newDataFrame method.
from pyspark.sql import types as T
spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season",
T.CharType()), T.StructField("season", T.DoubleType())]))
No. pyspark.sql.types does not have a CharType type. See link below for available data types in Spark.
spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
No, this is not correct Spark syntax. If you have considered this option to be correct, you may have some experience with Python's pandas package, in which this would be correct syntax. To create a Spark DataFrame from a Pandas DataFrame, you can simply use spark.createDataFrame(pandasDf) where pandasDf is the Pandas DataFrame.
Find out more about Spark syntax options using the examples in the documentation for SparkSession.createDataFrame linked below.
spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
No, the Spark Session (indicated by spark in the code above) does not have a DataFrame method.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.1 documentation and Data Types - Spark 3.1.2 Documentation
NEW QUESTION 41
The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.
Code block:
transactionsDf.format("parquet").option("mode", "append").save(path)
- A. save() is evaluated lazily and needs to be followed by an action.
- B. The code block is missing a bucketBy command that takes care of partitions.
- C. The mode option should be omitted so that the command uses the default mode.
- D. The code block is missing a reference to the DataFrameWriter.
- E. Given that the DataFrame should be saved as parquet file, path is being passed to the wrong method.
Answer: D
Explanation:
Correct code block:
transactionsDf.write.format("parquet").option("mode", "append").save(path)
NEW QUESTION 42
The code block shown below should return the number of columns in the CSV file stored at location filePath.
From the CSV file, only lines should be read that do not start with a # character. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__(__2__.__3__.csv(filePath, __4__).__5__)
- A. 1. len
2. pyspark
3. DataFrameReader
4. comment='#'
5. columns
- B. 1. size
2. spark
3. read()
4. escape='#'
5. columns
- C. 1. DataFrame
2. spark
3. read()
4. escape='#'
5. shape[0]
- D. 1. len
2. spark
3. read
4. comment='#'
5. columns
- E. 1. size
2. pyspark
3. DataFrameReader
4. comment='#'
5. columns
Answer: D
Explanation:
Correct code block:
len(spark.read.csv(filePath, comment='#').columns)
This is a challenging question with difficulties in an unusual context: The boundary between DataFrame and the DataFrameReader. It is unlikely that a question of this difficulty level appears in the exam. However, solving it helps you get more comfortable with the DataFrameReader, a subject you will likely have to deal with in the exam.
Before dealing with the inner parentheses, it is easier to figure out the outer parentheses, gaps 1 and 5. Given the code block, the object in gap 5 would have to be evaluated by the object in gap 1, returning the number of columns in the read-in CSV. One answer option includes DataFrame in gap 1 and shape[0] in gap 5. Since DataFrame cannot be used to evaluate shape[0], we can discard this answer option.
Other answer options include size in gap 1. size() is not a built-in Python function, so if we use it, it would have to come from somewhere else. pyspark.sql.functions includes a size() function, but it only returns the length of an array or map stored within a column (documentation linked below).
So, using size() is not an option here. This leaves us with two potentially valid answers.
We have to pick between gaps 2 and 3 being spark.read or pyspark.DataFrameReader. Looking at the documentation (linked below), DataFrameReader is a class in the pyspark.sql module, which means that we cannot access it as pyspark.DataFrameReader. Moreover, spark.read makes sense because on Databricks, spark references the current Spark session (pyspark.sql.SparkSession) and spark.read therefore returns a DataFrameReader (also see documentation below). This leaves only one correct answer option.
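What the correct code block computes can be mimicked with Python's csv module: skip lines that start with the comment character, then count the fields of the first remaining line. This is a sketch of the idea only, not of Spark's CSV reader; the sample text and the count_columns helper are hypothetical.

```python
import csv
import io

# Hypothetical file content with two comment lines.
raw = """# exported by a hypothetical tool
season,wind_speed_ms
summer,4.5
# a comment line in the middle
winter,7.5
"""

def count_columns(text, comment="#"):
    """Count the fields of the first line that does not start with `comment`."""
    data_lines = (line for line in io.StringIO(text) if not line.startswith(comment))
    first_row = next(csv.reader(data_lines))
    return len(first_row)

print(count_columns(raw))  # 2
```

As in the Spark version, len() is applied to a list of columns, which is why gap 1 has to be Python's built-in len rather than a size function.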
More info:
- pyspark.sql.functions.size - PySpark 3.1.2 documentation
- pyspark.sql.DataFrameReader.csv - PySpark 3.1.2 documentation
- pyspark.sql.SparkSession.read - PySpark 3.1.2 documentation
NEW QUESTION 43
Which of the following describes Spark's way of managing memory?
- A. Storage memory is used for caching partitions derived from DataFrames.
- B. Spark uses a subset of the reserved system memory.
- C. Disabling serialization potentially greatly reduces the memory footprint of a Spark application.
- D. As a general rule for garbage collection, Spark performs better on many small objects than few big objects.
- E. Spark's memory usage can be divided into three categories: Execution, transaction, and storage.
Answer: A
Explanation:
Spark's memory usage can be divided into three categories: Execution, transaction, and storage.
No, it is either execution or storage.
As a general rule for garbage collection, Spark performs better on many small objects than few big objects.
No, Spark's garbage collection runs faster on fewer big objects than many small objects.
Disabling serialization potentially greatly reduces the memory footprint of a Spark application.
The opposite is true - serialization reduces the memory footprint, but may impact performance in a negative way.
Spark uses a subset of the reserved system memory.
No, the reserved system memory is separate from Spark memory. Reserved memory stores Spark's internal objects.
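To make the split between reserved, execution, and storage memory more concrete, here is a rough arithmetic sketch. It assumes the unified memory model from the Spark tuning documentation with what should be its default values (about 300 MB reserved, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5). The 4096 MB heap size and the function name memory_regions are made up for illustration.

```python
# Assumed defaults of Spark's unified memory model (check your version's docs).
RESERVED_MB = 300        # reserved system memory, separate from Spark memory
MEMORY_FRACTION = 0.6    # assumed default of spark.memory.fraction
STORAGE_FRACTION = 0.5   # assumed default of spark.memory.storageFraction


def memory_regions(heap_mb):
    """Split an executor heap (in MB) into its approximate memory regions."""
    usable = heap_mb - RESERVED_MB         # what is left after reserved memory
    unified = usable * MEMORY_FRACTION     # shared by execution and storage
    storage = unified * STORAGE_FRACTION   # storage share protected from eviction
    execution = unified - storage          # shuffles, joins, sorts, aggregations
    user = usable - unified                # user data structures and metadata
    return {
        "reserved": RESERVED_MB,
        "storage": storage,
        "execution": execution,
        "user": user,
    }


regions = memory_regions(4096)  # hypothetical 4 GB executor heap
for name, size in regions.items():
    print(f"{name:>10}: {size:8.1f} MB")
```

Keep in mind that the boundary between storage and execution is soft: each side can borrow unused memory from the other, so the numbers are starting points rather than hard limits.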
More info: Tuning - Spark 3.1.2 Documentation, Spark Memory Management | Distributed Systems Architecture, Learning Spark, 2nd Edition, Chapter 7
NEW QUESTION 44
Which of the following are valid execution modes?
- A. Server, Standalone, Client
- B. Standalone, Client, Cluster
- C. Client, Cluster, Local
- D. Kubernetes, Local, Client
- E. Cluster, Server, Local
Answer: C
Explanation:
This is a tricky question to get right, since it is easy to confuse execution modes and deployment modes. Even in literature, both terms are sometimes used interchangeably.
There are only 3 valid execution modes in Spark: client, cluster, and local. Execution modes do not refer to specific frameworks, but to where the parts of Spark's infrastructure are located relative to each other.
In client mode, the driver sits on a machine outside the cluster. In cluster mode, the driver sits on a machine inside the cluster. Finally, in local mode, all Spark infrastructure is started in a single JVM (Java Virtual Machine) in a single computer which then also includes the driver.
Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses specific frameworks outside Spark. Valid deployment modes are standalone, Apache YARN, Apache Mesos and Kubernetes.
Client, Cluster, Local
Correct, all of these are the valid execution modes in Spark.
Standalone, Client, Cluster
No, standalone is not a valid execution mode. It is a valid deployment mode, though.
Kubernetes, Local, Client
No, Kubernetes is a deployment mode, but not an execution mode.
Cluster, Server, Local
No, Server is not an execution mode.
Server, Standalone, Client
No, standalone and server are not execution modes.
More info: Apache Spark Internals - Learning Journal
NEW QUESTION 45
Which of the following statements about data skew is incorrect?
- A. Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
- B. In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
- C. Salting can resolve data skew.
- D. Spark will not automatically optimize skew joins by default.
- E. To mitigate skew, Spark automatically disregards null values in keys when joining.
Answer: E
Explanation:
To mitigate skew, Spark automatically disregards null values in keys when joining.
This statement is incorrect, and thus the correct answer to the question. Joining keys that contain null values is of particular concern with regard to data skew.
In real-world applications, a table may contain a great number of records that do not have a value assigned to the column used as a join key. During the join, the data is at risk of being heavily skewed. This is because all records with a null-value join key are then evaluated as a single large partition, standing in stark contrast to the potentially diverse key values (and therefore small partitions) of the non-null-key records.
Spark specifically does not handle this automatically. However, there are several strategies to mitigate this problem like discarding null values temporarily, only to merge them back later (see last link below).
In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
This statement is correct. In fact, having very different partition sizes is the very definition of skew. Skew can degrade Spark performance because the largest partition occupies a single executor for a long time. This blocks a Spark job and is an inefficient use of resources, since other executors that processed smaller partitions need to idle until the large partition is processed.
Salting can resolve data skew.
This statement is correct. The purpose of salting is to provide Spark with an opportunity to repartition data into partitions of similar size, based on a salted partitioning key.
A salted partitioning key typically is a column that consists of uniformly distributed random numbers. The number of unique entries in the partitioning key column should match your desired number of partitions. After repartitioning by the salted key, all partitions should have roughly the same size.
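The effect of salting can be sketched without a cluster. The following plain-Python simulation is only an illustration under simplifying assumptions: partition assignment is modelled as key modulo number of partitions (not Spark's actual partitioner), and the data set, the 8 partitions, and the helper name partition_sizes are invented for the example.

```python
import random
from collections import Counter

NUM_PARTITIONS = 8  # desired number of partitions (arbitrary for this sketch)


def partition_sizes(keys, num_partitions=NUM_PARTITIONS):
    """Count how many records land in each partition under modulo partitioning."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Skewed data: 90% of the 10,000 records share the same key (0).
skewed_keys = [0] * 9000 + [rng.randrange(1, 1000) for _ in range(1000)]

# Salt: one uniformly distributed random integer per record, with as many
# distinct values as there are desired partitions.
salts = [rng.randrange(NUM_PARTITIONS) for _ in skewed_keys]

before = partition_sizes(skewed_keys)  # partitioned by the original key
after = partition_sizes(salts)         # partitioned by the salted key

print("largest partition before salting:", max(before))
print("largest partition after salting: ", max(after))
```

Partitioning by the original key puts at least 9,000 records into one partition, while partitioning by the salt spreads the same records roughly evenly.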
Spark will not automatically optimize skew joins by default.
This statement is correct. Automatic skew join optimization is a feature of Adaptive Query Execution (AQE).
In Spark 3.0 and 3.1, the versions covered by the documentation linked below, AQE is disabled by default (from Spark 3.2 onward it is enabled by default). To enable it, Spark's spark.sql.adaptive.enabled configuration option needs to be set to true instead of leaving it at false.
To automatically optimize skew joins, Spark's spark.sql.adaptive.skewJoin.enabled option also needs to be set to true, which it is by default.
When skew join optimization is enabled, Spark recognizes skew joins and optimizes them by splitting the bigger partitions into smaller partitions which leads to performance increases.
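As a minimal sketch, the two options named above could be applied to an existing session like this. The helper name enable_skew_join_optimization and the settings dictionary are invented for illustration; the only Spark call used is spark.conf.set(key, value), and spark is assumed to be a SparkSession such as the one Databricks provides.

```python
# Settings taken from the explanation above; values are passed as strings.
AQE_SKEW_JOIN_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",           # turn on AQE
    "spark.sql.adaptive.skewJoin.enabled": "true",  # skew join handling within AQE
}


def enable_skew_join_optimization(spark):
    """Apply the AQE skew join settings to `spark` and return the session."""
    for key, value in AQE_SKEW_JOIN_SETTINGS.items():
        spark.conf.set(key, value)
    return spark
```

On Databricks, calling enable_skew_join_optimization(spark) in a notebook cell would be enough, since spark already refers to the active session.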
Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
This statement is correct. Broadcast joins can indeed help increase join performance for skewed data, under some conditions. One of the DataFrames to be joined needs to be small enough to fit into each executor's memory, alongside a partition from the other DataFrame. If this is the case, a broadcast join increases join performance over a sort-merge join.
The reason is that a sort-merge join with skewed data involves excessive shuffling. During shuffling, data is sent around the cluster, ultimately slowing down the Spark application. For skewed data, the amount of data, and thus the slowdown, is particularly big.
Broadcast joins, however, help reduce shuffling data. The smaller table is directly stored on all executors, eliminating a great amount of network traffic, ultimately increasing join performance relative to the sort-merge join.
It is worth noting that for optimizing skew join behavior it may make sense to manually adjust Spark's spark.sql.autoBroadcastJoinThreshold configuration property if the smaller DataFrame is bigger than the 10 MB set by default.
More info:
- Performance Tuning - Spark 3.0.0 Documentation
- Data Skew and Garbage Collection to Improve Spark Performance
- Section 1.2 - Joins on Skewed Data * GitBook
NEW QUESTION 46
The code block displayed below contains an error. The code block should return a new DataFrame that only contains rows from DataFrame transactionsDf in which the value in column predError is at least 5. Find the error.
Code block:
transactionsDf.where("col(predError) >= 5")
- A. The expression returns the original DataFrame transactionsDf and not a new DataFrame. To avoid this, the code block should be transactionsDf.toNewDataFrame().where("col(predError) >= 5").
- B. Instead of where(), filter() should be used.
- C. The argument to the where method should be "predError >= 5".
- D. Instead of >=, the SQL operator GEQ should be used.
- E. The argument to the where method cannot be a string.
Answer: C
Explanation:
The argument to the where method cannot be a string.
It can be a string, no problem here.
Instead of where(), filter() should be used.
No, that does not matter. In PySpark, where() and filter() are equivalent.
Instead of >=, the SQL operator GEQ should be used.
Incorrect. Spark SQL has no GEQ operator; >= is the valid comparison operator here.
The expression returns the original DataFrame transactionsDf and not a new DataFrame. To avoid this, the code block should be transactionsDf.toNewDataFrame().where("col(predError) >= 5").
No, Spark returns a new DataFrame.
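The explanation above does not spell out why "predError >= 5" is the right argument. When where() receives a string, Spark parses it as a SQL expression, so the column is referenced by its bare name; col() is a Python function from pyspark.sql.functions and is not meant to appear inside that string. A minimal sketch follows, where the function name rows_with_large_error and its threshold parameter are made up for illustration and transactionsDf is assumed to be an existing DataFrame with a predError column.

```python
def rows_with_large_error(transactionsDf, threshold=5):
    """Return a new DataFrame with rows whose predError is at least `threshold`."""
    # The string is parsed as a SQL expression: refer to the column by name only.
    return transactionsDf.where(f"predError >= {threshold}")
```

The equivalent Column-based form would be transactionsDf.where(col("predError") >= 5), with col imported from pyspark.sql.functions. Either way the original DataFrame is left untouched and a new one is returned.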
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/27.html ,
https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION 47
Which of the following code blocks reads in the JSON file stored at filePath, enforcing the schema expressed in JSON format in variable json_schema, shown in the code block below?
Code block:
json_schema = """
{"type": "struct",
 "fields": [
   {
     "name": "itemId",
     "type": "integer",
     "nullable": true,
     "metadata": {}
   },
   {
     "name": "supplier",
     "type": "string",
     "nullable": true,
     "metadata": {}
   }
 ]
}
"""
- A. spark.read.json(filePath, schema=json_schema)
- B. spark.read.json(filePath, schema=spark.read.json(json_schema))
- C. spark.read.schema(json_schema).json(filePath)
- schema = StructType.fromJson(json.loads(json_schema))
  spark.read.json(filePath, schema=schema)
- D. spark.read.json(filePath, schema=schema_of_json(json_schema))
Answer: C
Explanation:
Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this question is beneficial to your exam preparation, since it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in - a topic within the scope of the exam.
The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.
With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. This also eliminates the option where schema=json_schema.
The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.
Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator's documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string '{a: 1}' to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.
In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.
This leaves the option that uses the StructType.fromJson() method. After json.loads() has turned json_schema into a Python dictionary, StructType.fromJson() returns an object of type StructType - exactly the type which the schema parameter of spark.read.json expects.
Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.
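The two steps of the correct option can be sketched as follows. The JSON schema is the one from the question; the function name read_json_with_schema and the variables schema_dict and field_names are invented for this sketch, and spark and file_path are assumed to be an active SparkSession and a valid path. The PySpark import sits inside the function only so that the parsing part can be tried without a Spark installation.

```python
import json

json_schema = """
{"type": "struct",
 "fields": [
   {"name": "itemId", "type": "integer", "nullable": true, "metadata": {}},
   {"name": "supplier", "type": "string", "nullable": true, "metadata": {}}
 ]
}
"""

# Step 1: json.loads() turns the JSON string into a plain Python dictionary,
# which is the input that StructType.fromJson() works with.
schema_dict = json.loads(json_schema)
field_names = [field["name"] for field in schema_dict["fields"]]
print(field_names)


def read_json_with_schema(spark, file_path):
    """Read the JSON file at `file_path`, enforcing the schema defined above."""
    from pyspark.sql.types import StructType  # needs PySpark installed

    # Step 2: build a StructType from the dictionary and pass it as the schema.
    schema = StructType.fromJson(schema_dict)
    return spark.read.json(file_path, schema=schema)
```

Passing json_schema itself as the schema would not work, because a string schema is expected to be DDL-formatted rather than JSON.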
More info:
- pyspark.sql.DataFrameReader.schema - PySpark 3.1.2 documentation
- pyspark.sql.DataFrameReader.json - PySpark 3.1.2 documentation
- pyspark.sql.functions.schema_of_json - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 48
Which of the following code blocks concatenates rows of DataFrames transactionsDf and transactionsNewDf, omitting any duplicates?
- A. transactionsDf.union(transactionsNewDf).distinct()
- B. transactionsDf.concat(transactionsNewDf).unique()
- C. transactionsDf.join(transactionsNewDf, how="union").distinct()
- D. transactionsDf.union(transactionsNewDf).unique()
- E. spark.union(transactionsDf, transactionsNewDf).distinct()
Answer: A
Explanation:
DataFrame.unique() and DataFrame.concat() do not exist and union() is not a method of the SparkSession. In addition, there is no union option for the join method in the DataFrame.join() statement.
More info: pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 49
The code block shown below should read all files with the file ending .png in directory path into Spark.
Choose the answer that correctly fills the blanks in the code block to accomplish this.
spark.__1__.__2__(__3__).option(__4__, "*.png").__5__(path)
- A. 1. read()
  2. format
  3. "binaryFile"
  4. "recursiveFileLookup"
  5. load
- B. 1. open
  2. as
  3. "binaryFile"
  4. "pathGlobFilter"
  5. load
- C. 1. read
  2. format
  3. binaryFile
  4. pathGlobFilter
  5. load
- D. 1. open
  2. format
  3. "image"
  4. "fileType"
  5. open
- E. 1. read
  2. format
  3. "binaryFile"
  4. "pathGlobFilter"
  5. load
Answer: E
Explanation:
Correct code block:
spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load(path)
Spark can deal with binary files, like images. Using the binaryFile format specification in the SparkSession's read API is the way to read in those files. Remember that, to access the read API, you need to start the command with spark.read (without parentheses, since read is a property and not a method). The pathGlobFilter option is a great way to filter files by name (and ending). Finally, the path can be specified using the load operator - the open operator shown in some of the answers does not exist.
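Wrapped into a small helper, the read could look like the sketch below. The function name read_png_files is made up for illustration; spark is assumed to be an active SparkSession and path a directory that Spark can access. The calls themselves (read, format, option, load) and the pathGlobFilter option are the ones named in the correct answer.

```python
def read_png_files(spark, path):
    """Read all files ending in .png under `path` as binary files."""
    return (
        spark.read                           # read is a property: no parentheses
        .format("binaryFile")                # format name must be a quoted string
        .option("pathGlobFilter", "*.png")   # keep only file names matching the glob
        .load(path)
    )
```

The resulting DataFrame should hold one row per file, with the raw bytes in a content column next to metadata columns such as path, modificationTime, and length.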