[Sep 12, 2022] Databricks-Certified-Professional-Data-Engineer Exam Dumps - Try Best Databricks-Certified-Professional-Data-Engineer Exam Questions - PassReview [Q11-Q28]

Verified Databricks-Certified-Professional-Data-Engineer exam dump Q&As with 61 correct questions and answers

NEW QUESTION 11
A data engineer has set up two Jobs that each run nightly. The first Job starts at 12:00 AM, and it usually
completes in about 20 minutes. The second Job depends on the first Job, and it starts at 12:30 AM. Sometimes,
the second Job fails when the first Job does not complete by 12:30 AM.
Which of the following approaches can the data engineer use to avoid this problem?

A. They can set up a retry policy on the first Job to help it run more quickly
B. They can set up the data to stream from the first Job to the second Job
C. They can utilize multiple tasks in a single job with a linear dependency
D. They can use cluster pools to help the Jobs run more efficiently
E. They can limit the size of the output in the second Job so that it will not fail as easily

Answer: C

NEW QUESTION 12
A data engineering team needs to query a Delta table to extract rows that all meet the same condition.
However, the team has noticed that the query is running slowly. The team has already tuned the size of the
data files. Upon investigating, the team has concluded that the rows meeting the condition are sparsely located
throughout each of the data files.
Based on the scenario, which of the following optimization techniques could speed up the query?

A. Z-Ordering
B. Data skipping
C. Write as a Parquet file
D. Tuning the file size
E. Bin-packing

Answer: A

NEW QUESTION 13
A data engineer wants to horizontally combine two tables as a part of a query. They want to use a shared
column as a key column, and they only want the query result to contain rows whose value in the key column is
present in both tables.
Which of the following SQL commands can they use to accomplish this task?

A. UNION
B. MERGE
C. OUTER JOIN
D. INNER JOIN
E. LEFT JOIN

Answer: D

NEW QUESTION 14
There are 5000 balls of different colors, of which 1200 are pink. What is the maximum
likelihood estimate for the proportion of "pink" items in the test set of colored balls?

A. 2.4
B. 24 0
C. .48
D. 4.8
E. .24

Answer: E

Explanation:
Given no additional information, the MLE for the probability of an item in the test set is exactly its frequency
in the training set, here 1200/5000 = 0.24. The method of maximum likelihood corresponds to many well-known estimation methods
in statistics. For example, one may be interested in the heights of adult female penguins, but be unable to
measure the height of every single penguin in a population due to cost or time constraints. Assuming that the
heights are normally (Gaussian) distributed with some unknown mean and variance, the mean and variance
can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE
would accomplish this by taking the mean and variance as parameters and finding particular parametric values
that make the observed results the most probable (given the model).
In general, for a fixed set of data and underlying statistical model the method of maximum likelihood selects
the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes
the "agreement" of the selected model with the observed data, and for discrete random variables it indeed
maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood
estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution
and many other problems. However, in some complicated problems, difficulties do occur: in such problems,
maximum-likelihood estimators are unsuitable or do not exist.
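
The closed-form estimate above can be checked numerically. The following is a minimal sketch, assuming each ball is an independent Bernoulli draw (pink or not pink); the function names and the grid of candidate proportions are illustrative choices, not part of the original question.

```python
import math
from fractions import Fraction


def bernoulli_log_likelihood(p, successes, trials):
    """Log-likelihood of seeing `successes` out of `trials` when each trial succeeds with probability p."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def mle_proportion(successes, trials):
    """Closed-form MLE of a Bernoulli proportion: the observed frequency."""
    return Fraction(successes, trials)


pink, total = 1200, 5000
p_hat = mle_proportion(pink, total)

# Sanity check: search a coarse grid of candidate proportions (0.01 .. 0.99)
# and keep the one that makes the observed counts most probable.
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda p: bernoulli_log_likelihood(p, pink, total))

print(float(p_hat))  # 0.24
print(best)          # 0.24
```

Both the closed form and the grid search land on the same proportion, which corresponds to option E.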

NEW QUESTION 15
Suppose there are three events. Which formula must always be equal to P(E1|E2,E3)?

A. P(E1,E2,E3)P(E2)P(E3)
B. P(E1,E2|E3)P(E2|E3)P(E3)
C. P(E1,E2|E3)P(E3)
D. P(E1,E2,E3)/P(E2,E3)
E. P(E1,E2,E3)P(E1)/P(E2,E3)

Answer: D

Explanation:
This is an application of conditional probability: P(E1,E2) = P(E1|E2)P(E2), so
P(E1|E2) = P(E1,E2)/P(E2)
Conditioning on both E2 and E3 in the same way gives
P(E1|E2,E3) = P(E1,E2,E3)/P(E2,E3)
If the events are A and B respectively, this is said to be "the probability of A given B".
It is commonly denoted by P(A|B), or sometimes P_B(A). When both A and B are categorical
variables, a conditional probability table is typically used to represent the conditional probability.
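
The identity can be verified by brute-force counting on a small sample space. This is only a sketch: the three coin flips and the particular events E1, E2, E3 are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: all outcomes of three fair coin flips, equally likely.
outcomes = list(product("HT", repeat=3))


def prob(event):
    """Probability of an event (a predicate on outcomes) under the uniform measure."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


# Illustrative events, chosen arbitrarily for this sketch.
def e1(o):
    return o[0] == "H"          # first flip is heads


def e2(o):
    return o.count("H") >= 2    # at least two heads


def e3(o):
    return o[2] == "H"          # third flip is heads


joint_all = prob(lambda o: e1(o) and e2(o) and e3(o))  # P(E1,E2,E3)
joint_cond = prob(lambda o: e2(o) and e3(o))           # P(E2,E3)
via_formula = joint_all / joint_cond

# Direct computation: restrict to outcomes where E2 and E3 hold, then count E1.
restricted = [o for o in outcomes if e2(o) and e3(o)]
direct = Fraction(sum(1 for o in restricted if e1(o)), len(restricted))

print(via_formula, direct)  # 2/3 2/3
```

Using exact fractions avoids floating-point noise, so the two routes can be compared for equality.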

NEW QUESTION 16
A data engineer has created a Delta table as part of a data pipeline. Downstream data analysts now need
SELECT permission on the Delta table.
Assuming the data engineer is the Delta table owner, which part of the Databricks Lakehouse Platform can
the data engineer use to grant the data analysts the appropriate access?

A. Jobs
B. Dashboards
C. Repos
D. Data Explorer
E. Databricks Filesystem

Answer: D

NEW QUESTION 17
You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex,
Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of
the clusters, you notice that there is significant overlap between the clusters. What should you do?

A. Decrease the number of clusters
B. Increase the number of clusters
C. Remove one of the measures
D. Identify additional measures to add to the analysis

Answer: A

NEW QUESTION 18
You are working on an email spam filtering assignment. While working on it, you find that a new word, e.g.
HadoopExam, appears in an email. Your solution has never come across this word before, so the estimated
probability of this word appearing in either kind of email (spam or non-spam) could be zero. Which of the
following algorithms can help you avoid zero probabilities?

A. Logistic Regression
B. All of the above
C. Laplace Smoothing
D. Naive Bayes

Answer: C

Explanation:
Laplace smoothing is a technique for parameter estimation which accounts for unobserved events. It is more
robust and will not fail completely when data that has never been observed in training shows up.
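
A minimal sketch of add-one (Laplace) smoothing for word probabilities follows. The token list, the vocabulary, and the function name are made up for illustration; a real spam filter would estimate these per class from its own training data.

```python
from collections import Counter


def smoothed_word_probs(tokens, vocabulary, alpha=1.0):
    """Estimate P(word) with additive smoothing: (count + alpha) / (N + alpha * |V|)."""
    counts = Counter(tokens)
    denominator = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


# Hypothetical training tokens from spam emails; "HadoopExam" never appears in them.
spam_tokens = ["win", "cash", "now", "win"]
vocabulary = {"win", "cash", "now", "HadoopExam"}

probs = smoothed_word_probs(spam_tokens, vocabulary)

print(probs["win"])         # 0.375
print(probs["HadoopExam"])  # 0.125
```

Without smoothing, the unseen word would get probability 0 and wipe out the whole product of word probabilities in a Naive Bayes score; with alpha = 1 it gets a small nonzero share instead.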

NEW QUESTION 19
A junior data engineer has ingested a JSON file into a table raw_table with the following schema:
    cart_id STRING,
    items ARRAY<item_id:STRING>
The junior data engineer would like to unnest the items column in raw_table to result in a new table with the
following schema:
    cart_id STRING,
    item_id STRING
Which of the following commands should the junior data engineer run to complete this task?

A.
    SELECT cart_id, filter(items) AS item_id
    FROM raw_table;
B.
    SELECT cart_id, slice(items) AS item_id
    FROM raw_table;
C.
    SELECT cart_id, explode(items) AS item_id
    FROM raw_table;
D.
    SELECT cart_id, reduce(items) AS item_id
    FROM raw_table;
E.
    SELECT cart_id, flatten(items) AS item_id
    FROM raw_table;

Answer: C
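
To make the unnesting concrete, here is a plain-Python sketch that mimics what explode does to each row. It is not the Spark API: the helper explode_rows and the sample rows are hypothetical, and the array elements are treated as plain strings for simplicity.

```python
def explode_rows(rows, array_column, output_column):
    """Yield one output row per array element, repeating the other columns."""
    for row in rows:
        for element in row[array_column]:
            out = {key: value for key, value in row.items() if key != array_column}
            out[output_column] = element
            yield out


# Hypothetical contents of raw_table.
raw_table = [
    {"cart_id": "c1", "items": ["i1", "i2"]},
    {"cart_id": "c2", "items": ["i3"]},
    {"cart_id": "c3", "items": []},  # an empty array produces no output rows
]

result = list(explode_rows(raw_table, "items", "item_id"))
for row in result:
    print(row)
# {'cart_id': 'c1', 'item_id': 'i1'}
# {'cart_id': 'c1', 'item_id': 'i2'}
# {'cart_id': 'c2', 'item_id': 'i3'}
```

The cart with an empty array disappears from the output, which matches how explode treats empty arrays in Spark SQL.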

NEW QUESTION 20
If E1 and E2 are two events, how do you represent the conditional probability that E2 occurs given that
E1 has occurred?

A. P(E2)/P(E1+E2)
B. P(E1+E2)/P(E1)
C. P(E1)/P(E2)
D. P(E2)/P(E1)

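Answer: B

Explanation:
By the definition of conditional probability, P(E2|E1) = P(E1,E2)/P(E1), reading E1+E2 as the joint
occurrence of both events.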

NEW QUESTION 21
A table customerLocations exists with the following schema:
    id STRING,
    date STRING,
    city STRING,
    country STRING
A senior data engineer wants to create a new table from this table using the following command:
    CREATE TABLE customersPerCountry AS
    SELECT country,
           COUNT(*) AS customers
    FROM customerLocations
    GROUP BY country;
A junior data engineer asks why the schema is not being declared for the new table. Which of the following
responses explains why declaring the schema is not necessary?

A. CREATE TABLE AS SELECT statements result in tables where schemas are optional
B. CREATE TABLE AS SELECT statements result in tables that do not support schemas
C. CREATE TABLE AS SELECT statements adopt schema details from the source table and query
D. CREATE TABLE AS SELECT statements infer the schema by scanning the data
E. CREATE TABLE AS SELECT statements assign all columns the type STRING

Answer: C

NEW QUESTION 22
An engineering manager uses a Databricks SQL query to monitor their team's progress on fixes related to
customer-reported bugs. The manager checks the results of the query every day, but they are manually
rerunning the query each day and waiting for the results.
Which of the following approaches can the manager use to ensure the results of the query are updated each
day?

A. They can schedule the query to run every 1 day from the Jobs UI
B. They can schedule the query to refresh every 1 day from the query's page in Databricks SQL
C. They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL
D. They can schedule the query to run every 12 hours from the Jobs UI
E. They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL

Answer: B

NEW QUESTION 23
A data engineer wants to create a relational object by pulling data from two tables. The relational object must
be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to
avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?

A. Spark SQL Table
B. Temporary view
C. Delta Table
D. Database
E. View

Answer: E

NEW QUESTION 24
Which of the following Structured Streaming queries is performing a hop from a Bronze table to a Silver
table?

A.
    (spark.table("sales")
        .agg(sum("sales"),
             sum("units"))
        .writeStream
        .option("checkpointLocation", checkpointPath)
        .outputMode("complete")
        .table("aggregatedSales")
    )
B.
    (spark.table("sales")
        .withColumn("avgPrice", col("sales") / col("units"))
        .writeStream
        .option("checkpointLocation", checkpointPath)
        .outputMode("append")
        .table("cleanedSales")
    )
C.
    (spark.readStream.load(rawSalesLocation)
        .writeStream
        .option("checkpointLocation", checkpointPath)
        .outputMode("append")
        .table("uncleanedSales")
    )
D.
    (spark.read.load(rawSalesLocation)
        .writeStream
        .option("checkpointLocation", checkpointPath)
        .outputMode("append")
        .table("uncleanedSales")
    )
E.
    (spark.table("sales")
        .groupBy("store")
        .agg(sum("sales"))
        .writeStream
        .option("checkpointLocation", checkpointPath)
        .outputMode("complete")
        .table("aggregatedSales")
    )

Answer: B

NEW QUESTION 25
A dataset has been defined using Delta Live Tables and includes an expectations clause:
    CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')
What is the expected behaviour when a batch of data containing data that violates these constraints is
processed?

A. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table
B. Records that violate the expectation cause the job to fail
C. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log
D. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log
E. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset

Answer: C

NEW QUESTION 26
A data engineer has ingested data from an external source into a PySpark DataFrame raw_df. They need to
briefly make this data available in SQL for a data analyst to perform a quality assurance check on the data.
Which of the following commands should the data engineer run to make this data available in SQL for only
the remainder of the Spark session?

A. raw_df.createTable("raw_df")
B. raw_df.createOrReplaceTempView("raw_df")
C. raw_df.write.save("raw_df")
D. There is no way to share data between PySpark and SQL
E. raw_df.saveAsTable("raw_df")

Answer: B

NEW QUESTION 27
Which of the following data workloads will utilize a Bronze table as its source?

A. A job that ingests raw data from a streaming source into the Lakehouse
B. A job that aggregates cleaned data to create standard summary statistics
C. A job that enriches data by parsing its timestamps into a human-readable format
D. A job that queries aggregated data to publish key insights into a dashboard
E. A job that develops a feature set for a machine learning application

Answer: C

NEW QUESTION 28
......

Databricks Databricks-Certified-Professional-Data-Engineer Test Engine PDF - All Free Dumps: https://www.passreview.com/Databricks-Certified-Professional-Data-Engineer_exam-braindumps.html
