The Best Practice Test Preparation for the Databricks-Certified-Professional-Data-Engineer Certification Exam
Databricks-Certified-Professional-Data-Engineer Exam Dumps, Practice Test Questions BUNDLE PACK
Databricks is a powerful data engineering and machine learning platform that is widely used across various industries. To ensure that professionals have the necessary skills and expertise to work with Databricks, the company offers a certification program. One of the certifications offered is the Databricks Certified Professional Data Engineer exam, also listed under the exam code Databricks-Certified-Professional-Data-Engineer.
The Databricks Certified Professional Data Engineer exam is a hands-on exam that requires the candidate to complete a set of tasks using Databricks. The exam evaluates the candidate's ability to design and implement data pipelines, work with data sources and sinks, and perform transformations using Databricks. It also tests the candidate's ability to optimize and tune data pipelines for performance and reliability.
Databricks is a cloud-based data engineering platform that allows organizations to process large amounts of data quickly and efficiently. The platform leverages Apache Spark to perform data processing tasks and offers a wide range of tools and services to support data engineering workflows. Databricks also provides certification programs for data professionals who want to demonstrate their expertise in using the platform. One of these certifications is the Databricks Certified Professional Data Engineer exam.
NEW QUESTION # 38
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day.
At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?
- A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
- B. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
- C. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
- D. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
- E. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
Answer: C
Explanation:
The adjustment that meets the requirement of processing records in less than 10 seconds is to decrease the trigger interval to 5 seconds. Triggering batches more frequently may prevent records from backing up and large batches from causing spill. Spill occurs when the data held in memory exceeds the available capacity and has to be written to disk, which slows processing and increases execution time [1]. With a shorter trigger interval, the streaming query processes smaller batches more quickly and is less likely to spill, which can also improve the latency of the streaming job [2].
The other options are not correct, because:
Option A is incorrect because triggering batches more frequently does not allow idle executors to begin processing the next batch while longer-running tasks from previous batches finish. The micro-batches of a single streaming query run one after another: the next batch does not start until the previous batch has completed.
Option B is incorrect because using the trigger once option and configuring a Databricks job to execute the query every 10 seconds does not ensure that all backlogged records are processed within the required time. The trigger once option means that the streaming query processes all the data available in the source and then stops [5]. This does not guarantee that the query will finish within 10 seconds, especially if there are a lot of records in the source. Moreover, scheduling a job to execute the query every 10 seconds may cause overlapping or missed runs, depending on the execution time of the query.
Option D is incorrect because the trigger interval can be modified without modifying the checkpoint directory. The checkpoint directory stores the metadata and state of the streaming query, such as the offsets, schema, and configuration [3]. Changing the trigger interval does not affect the state of the streaming query and does not require a new checkpoint directory. However, changing the number of shuffle partitions may affect the state of the streaming query and may require a new checkpoint directory [4].
Option E is incorrect because increasing the trigger interval to 30 seconds is not a good practice for ensuring that no records are dropped. A longer trigger interval means that the streaming query processes larger batches less frequently, which can increase the risk of spill, memory pressure, and timeouts [1][2]. It also increases latency beyond the 10-second requirement.
References: [1] Memory Management Overview; [2] Structured Streaming Performance Tuning Guide; [3] Checkpointing; [4] Recovery Semantics after Changes in a Streaming Query; [5] Triggers
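The effect of batch size on spill can be illustrated with a toy model. The sketch below is not Spark code: the arrival rate, per-record cost, spill threshold, and spill penalty are all hypothetical numbers chosen only to show how a shorter trigger interval keeps batches under a memory threshold, and it assumes batches run strictly one after another.

```python
def batch_seconds(records, secs_per_record=0.002, spill_threshold=2000, spill_penalty=10.0):
    """Hypothetical cost model: records beyond the threshold 'spill' and cost 10x more."""
    in_memory = min(records, spill_threshold)
    spilled = max(records - spill_threshold, 0)
    return in_memory * secs_per_record + spilled * secs_per_record * spill_penalty


def simulate(rate, trigger_interval, duration):
    """Return (records, seconds) per micro-batch for a constant arrival rate.

    A batch picks up everything that arrived since the previous batch started;
    the next batch starts at the trigger or when the previous batch finishes,
    whichever is later.
    """
    batches = []
    prev_start = 0.0
    start = float(trigger_interval)
    while start <= duration:
        records = rate * (start - prev_start)
        cost = batch_seconds(records)
        batches.append((records, cost))
        prev_start = start
        start = start + max(trigger_interval, cost)
    return batches


peak_rate = 300  # hypothetical records per second during peak hours
slowest_5s = max(cost for _, cost in simulate(peak_rate, 5, 120))
slowest_10s = max(cost for _, cost in simulate(peak_rate, 10, 120))
print(f"5 s trigger: slowest batch {slowest_5s:.1f} s")
print(f"10 s trigger: slowest batch {slowest_10s:.1f} s")
```

Under these assumed numbers the 5-second trigger keeps every batch below the spill threshold, while the 10-second trigger crosses it and then falls further behind with each batch.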
NEW QUESTION # 39
The sales team has asked the data engineering team to develop a dashboard that shows sales performance for all stores. The sales team would also like to be able to select an individual store location. Which of the following approaches can the data engineering team use to build this functionality into the dashboard?
- A. Use Dynamic views to filter the data based on the location
- B. Use Databricks REST API to create a dashboard for each location
- C. Use SQL UDF function to filter the data based on the location
- D. Use query Parameters which then allow user to choose any location
- E. Currently dashboards do not support parameters
Answer: D
Explanation:
The answer is D: use query parameters, which allow the user to choose any location.
Databricks supports several types of query parameters in dashboards; a drop-down list can be created from a query that returns the unique list of store locations.
Here is a simple query that takes a parameter:
SELECT * FROM sales WHERE field IN ( {{ Multi Select Parameter }} )
Or
SELECT * FROM sales WHERE field = {{ Single Select Parameter }}
Query parameter types
* Text
* Number
* Dropdown List
* Query Based Dropdown List
* Date and Time
NEW QUESTION # 40
To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.
The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.
Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?
- A. Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to sync up changes committed to one table to the corresponding table.
- B. Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.
- C. Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.
- D. Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.
- E. Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.
Answer: C
Explanation:
This is the correct answer because it addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed. The situation is that an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added, due to new requirements from a customer-facing application. By configuring a new table with all the requisite fields and new names and using this as the source for the customer-facing application, the data engineering team can meet the new requirements without affecting other teams that rely on the existing table schema and name. By creating a view that maintains the original data schema and table name by aliasing select fields from the new table, the data engineering team can also avoid duplicating data or creating additional tables that need to be managed. Verified References: [Databricks Certified Data Engineer Professional], under "Lakehouse" section; Databricks Documentation, under "CREATE VIEW" section.
NEW QUESTION # 41
You noticed that a colleague is manually copying a notebook with a _bkp suffix to store previous versions. Which of the following features would you recommend instead?
- A. Databricks notebooks should be copied to a local machine and setup source control locally to version the notebooks
- B. Databricks notebooks support change tracking and versioning
- C. Databricks notebook can be exported as HTML and imported at a later time
- D. Databricks notebooks can be exported into dbc archive files and stored in data lake
Answer: B
Explanation:
The answer is B: Databricks notebooks support automatic change tracking and versioning.
While editing a notebook, open the version history on the right side to view all changes; every change you make is captured and saved.
NEW QUESTION # 42
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.
If task A fails during a scheduled run, which statement describes the results of this run?
- A. Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.
- B. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.
- C. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
- D. Tasks B and C will be skipped; task A will not commit any changes because of stage failure.
- E. Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.
Answer: E
Explanation:
When a Databricks job runs multiple tasks with dependencies, the tasks are executed as a dependency graph. If a task fails, the downstream tasks that depend on it are skipped and marked as Upstream failed. However, the failed task may have already committed some changes to the Lakehouse before the failure occurred, and those changes are not rolled back automatically. Therefore, the job run may result in a partial update of the Lakehouse. Delta Lake's transactional writes make each individual write atomic, but a transaction does not span a whole task or job run, so writes that completed before the failure remain committed. You can also use the Run if condition to configure tasks to run even when some or all of their dependencies have failed, allowing your job to recover from failures and continue running. References:
* transactional writes: https://docs.databricks.com/delta/delta-intro.html#transactional-writes
* Run if: https://docs.databricks.com/en/workflows/jobs/conditional-tasks.html
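The behaviour described above can be sketched with a small pure-Python model. This is not the Databricks Jobs scheduler: run_job, the status strings, and the committed list are hypothetical stand-ins, and tasks are assumed to be listed in dependency order.

```python
def run_job(tasks, dependencies):
    """Run tasks in the given order; skip any task whose upstream did not succeed."""
    status = {}
    for name, task in tasks.items():
        upstream = dependencies.get(name, [])
        if any(status.get(up) != "SUCCESS" for up in upstream):
            status[name] = "UPSTREAM_FAILED"
            continue
        try:
            task()
            status[name] = "SUCCESS"
        except Exception:
            status[name] = "FAILED"
    return status


committed = []  # stand-in for writes already committed to the Lakehouse


def task_a():
    committed.append("task A, write 1")  # this write commits before the failure
    raise RuntimeError("task A fails before its second write")


def task_b():
    committed.append("task B")


def task_c():
    committed.append("task C")


status = run_job(
    {"A": task_a, "B": task_b, "C": task_c},
    {"B": ["A"], "C": ["A"]},
)
print(status)
print(committed)
```

Tasks B and C never run, yet the first write from task A remains in place, which matches option E.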
NEW QUESTION # 43
Data science team members are using a single cluster to perform data analysis. Although the cluster size was chosen to handle multiple users and auto-scaling was enabled, the team realized queries are still running slowly. What would be the suggested fix?
- A. Increase the size of the driver node
- B. Setup multiple clusters so each team member has their own cluster
- C. Disable the auto-scaling feature
- D. Use High concurrency mode instead of the standard mode
Answer: D
Explanation:
The answer is D: use High Concurrency mode instead of Standard mode.
High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc jobs. Databricks recommends enabling autoscaling for High Concurrency clusters.
https://docs.databricks.com/clusters/cluster-config-best-practices.html#cluster-mode
NEW QUESTION # 44
Which statement characterizes the general programming model used by Spark Structured Streaming?
- A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
- B. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
- C. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
- D. Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.
- E. Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
Answer: B
NEW QUESTION # 45
Which of the following SQL statements can be used to update a transactions table to set a flag on the table from Y to N?
- A. REPLACE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
- B. MERGE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
- C. UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
- D. MODIFY transactions SET active_flag = 'N' WHERE active_flag = 'Y'
Answer: C
Explanation:
The answer is
UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
Delta Lake supports UPDATE statements on Delta tables, and all changes made as part of the update are ACID compliant.
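The statement can be tried out locally. The sketch below uses SQLite from the Python standard library as a stand-in for a Delta table, since the UPDATE syntax in question is standard SQL; the id column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, active_flag TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, "Y"), (2, "N"), (3, "Y")],
)

# The statement from option C: flip every 'Y' flag to 'N'.
cursor = conn.execute("UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y'")
updated = cursor.rowcount

flags = [row[0] for row in conn.execute("SELECT active_flag FROM transactions ORDER BY id")]
print(updated, flags)
```

REPLACE, MERGE, and MODIFY as written in the other options are not valid forms of this statement.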
NEW QUESTION # 46
Which statement regarding stream-static joins and static Delta tables is correct?
- A. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.
- B. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.
- C. Stream-static joins cannot use static Delta tables because of consistency issues.
- D. The checkpoint directory will be used to track updates to the static Delta table.
- E. The checkpoint directory will be used to track state information for the unique keys present in the join.
Answer: A
Explanation:
Option A is correct. Structured Streaming supports stream-static joins in which the static side is a Delta table. "Static" here means that the table is read as a batch source rather than as a stream; it does not mean that the table can never change. Each microbatch of a stream-static join uses the most recent version of the static Delta table as of that microbatch, so it reflects any changes committed to the table before the microbatch starts. Verified References: [Databricks Certified Data Engineer Professional], under "Structured Streaming" section; Databricks Documentation, under "Stream and static joins" section.
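The "latest version per microbatch" behaviour can be modelled in a few lines of plain Python. This is a conceptual sketch, not Structured Streaming code: VersionedTable and join_microbatch are hypothetical names, and a left join is used so that unmatched keys stay visible.

```python
class VersionedTable:
    """Stand-in for a Delta table: every write produces a new version."""

    def __init__(self, rows):
        self.versions = [dict(rows)]

    def write(self, rows):
        self.versions.append({**self.versions[-1], **rows})

    def latest(self):
        return self.versions[-1]


def join_microbatch(batch_keys, static_table):
    """Join one microbatch against the table version that is current right now."""
    lookup = static_table.latest()  # re-resolved at the start of every microbatch
    return [(key, lookup.get(key)) for key in batch_keys]


stores = VersionedTable({1: "Berlin"})
first = join_microbatch([1, 2], stores)   # store 2 is not in the table yet
stores.write({2: "Lagos"})                # static table updated between microbatches
second = join_microbatch([1, 2], stores)  # the new version is picked up
print(first)
print(second)
```

Because the join is stateless, the unmatched row from the first microbatch is not revisited after the table changes.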
NEW QUESTION # 47
The data engineering team maintains the following code:
Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?
- A. The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.
- B. An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.
- C. The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.
- D. A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.
- E. An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.
Answer: A
Explanation:
This code is using the pyspark.sql.functions library to group the silver_customer_sales table by customer_id and then aggregate the data using the minimum sale date, maximum sale total, and sum of distinct order ids.
The resulting aggregated data is then written to the gold_customer_lifetime_sales_summary table, overwriting any existing data in that table. This is a batch job that does not use any incremental or streaming logic, and does not perform any merge or update operations. Therefore, the code will overwrite the gold table with the aggregated values from the silver table every time it is executed. References:
https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html
https://docs.databricks.com/spark/latest/dataframes-datasets/transforming-data-with-dataframes.html
https://docs.databricks.com/spark/latest/dataframes-datasets/aggregating-data-with-dataframes.html
NEW QUESTION # 48
A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.
Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.
Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?
- A. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable representing a PySpark DataFrame.
- B. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of strings.
- C. Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if this entity exists, Cmd 2 will succeed.
- D. Both commands will succeed. Executing SHOW TABLES will show that countries_af and sales_af have been registered as views.
- E. Both commands will fail. No new variables, tables, or views will be created.
Answer: B
Explanation:
This is the correct answer because Cmd 1 is written in Python and uses a list comprehension to extract the country names from the geo_lookup table and store them in a Python variable named countries_af. This variable will contain a list of strings, not a PySpark DataFrame or a SQL view. Cmd 2 is written in SQL and tries to create a view named sales_af by selecting from the sales table where city is in countries_af. However, this command will fail because countries_af is not a valid SQL entity and cannot be used in a SQL query. To fix this, a better approach would be to use spark.sql() to execute a SQL query in Python and pass the countries_af variable as a parameter. Verified References: [Databricks Certified Data Engineer Professional], under "Language Interoperability" section; Databricks Documentation, under "Mix languages" section.
NEW QUESTION # 49
You are asked to debug a Databricks job that is taking too long to run on Sundays. What steps are you going to take to identify the step that is taking longer to run?
- A. Enable debug mode in the Jobs to see the output activity of a job, output should be available to view.
- B. Under Workflow UI and jobs select job you want to monitor and select the run, notebook activity can be viewed.
- C. Once a job is launched, you cannot access the job's notebook activity.
- D. A notebook activity of job run is only visible when using all-purpose cluster.
- E. Use the compute's spark UI to monitor the job activity.
Answer: B
Explanation:
The answer is B: in the Workflows UI, under Jobs, select the job you want to monitor and select the run; the notebook activity can then be viewed.
You can view currently active runs or completed runs. Click on a run to view the notebook output.
NEW QUESTION # 50
An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:
user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history, which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?
- A. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger once job to batch update newly detected files into the account_current table.
- B. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
- C. Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.
- D. Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
- E. Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
Answer: B
Explanation:
This is the correct answer because it efficiently updates the account_current table with only the most recent value for each user_id. The code filters records in account_history using the last_updated field and the most recent hour processed, which means it will only process the latest batch of data. It also filters by the max last_login by user_id, which means it will only keep the most recent record for each user_id within that batch. Then, it writes a merge statement to update or insert the most recent value for each user_id into account_current, which means it will perform an upsert operation based on the user_id column. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Upsert into a table using merge" section.
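The filter, deduplicate, and merge steps can be sketched in plain Python with dictionaries standing in for the two tables. This is only an illustration of the logic, not a Delta Lake MERGE: the helper names, the hour boundaries, and the sample records are hypothetical, and the records carry only a subset of the schema.

```python
def latest_per_user(history, hour_start, hour_end):
    """From records updated within [hour_start, hour_end), keep the one with
    the max last_login for each user_id."""
    latest = {}
    for record in history:
        if not (hour_start <= record["last_updated"] < hour_end):
            continue  # outside the most recent hour processed
        best = latest.get(record["user_id"])
        if best is None or record["last_login"] > best["last_login"]:
            latest[record["user_id"]] = record
    return latest


def merge_type1(account_current, updates):
    """Upsert keyed on user_id: matched rows are overwritten, new rows inserted."""
    for user_id, record in updates.items():
        account_current[user_id] = record
    return account_current


account_history = [
    {"user_id": 1, "username": "ann_mid", "last_login": 100, "last_updated": 3605},
    {"user_id": 1, "username": "ann_new", "last_login": 200, "last_updated": 3700},
    {"user_id": 2, "username": "bob", "last_login": 50, "last_updated": 100},
    {"user_id": 3, "username": "cy", "last_login": 300, "last_updated": 4000},
]
account_current = {
    1: {"user_id": 1, "username": "ann_old", "last_login": 10, "last_updated": 50},
    2: {"user_id": 2, "username": "bob", "last_login": 50, "last_updated": 100},
}

updates = latest_per_user(account_history, hour_start=3600, hour_end=7200)
merge_type1(account_current, updates)
print(sorted(account_current))
print(account_current[1]["username"])
```

Only the rows touched in the latest hour are rewritten, which is why this approach scales better than overwriting the whole table when millions of accounts exist but only tens of thousands change per hour.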
NEW QUESTION # 51
An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:
df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above code block?
- A. dbutils.widgets.text("date", "null")
  date = dbutils.widgets.get("date")
- B. date = spark.conf.get("date")
- C. input_dict = input()
  date = input_dict["date"]
- D. import sys
  date = sys.argv[1]
- E. date = dbutils.notebooks.getParam("date")
Answer: A
Explanation:
Parameters passed to a notebook task by the Databricks Jobs API are read through widgets. dbutils.widgets.text("date", "null") declares a text widget named date with the default value "null", and dbutils.widgets.get("date") returns the current value of that widget as a string; when the job passes a date parameter, that value overrides the default. Option E is incorrect because dbutils has no notebooks.getParam method. Verified References: Databricks Certified Data Engineer Professional, under "Databricks Tooling" section; Databricks Documentation, under "Pass parameters to a notebook" section.
NEW QUESTION # 52
Which of the following SQL statements can replace python variables in Databricks SQL code, when the notebook is set in SQL mode?
%python
table_name = "sales"
schema_name = "bronze"

%sql
SELECT * FROM ____________________
- A. SELECT * FROM ${schema_name}.${table_name}
- B. SELECT * FROM schema_name.table_name
- C. SELECT * FROM f{schema_name.table_name}
- D. SELECT * FROM {schem_name.table_name}
Answer: A
Explanation:
The answer is, SELECT * FROM ${schema_name}.${table_name}
%python
table_name = "sales"
schema_name = "bronze"
%sql
SELECT * FROM ${schema_name}.${table_name}
${python variable} -> Python variables in Databricks SQL code
NEW QUESTION # 53
Which statement describes integration testing?
- A. Validates interactions between subsystems of your application
- B. Validates an application use case
- C. Requires an automated testing framework
- D. Validates behavior of individual elements of your application
- E. Requires manual intervention
Answer: A
Explanation:
This is the correct answer because it describes integration testing. Integration testing is a type of testing that validates interactions between subsystems of your application, such as modules, components, or services.
Integration testing ensures that the subsystems work together as expected and produce the correct outputs or results. Integration testing can be done at different levels of granularity, such as component integration testing, system integration testing, or end-to-end testing. Integration testing can help detect errors or bugs that may not be found by unit testing, which only validates behavior of individual elements of your application. Verified References: [Databricks Certified Data Engineer Professional], under "Testing" section; Databricks Documentation, under "Integration testing" section.
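The difference between the two kinds of test can be illustrated with Python's built-in unittest module. The functions parse_amount and total_sales are hypothetical stand-ins for two small subsystems (a parser and an aggregator); the unit test exercises one of them alone, and the integration test checks that they work together.

```python
import unittest


def parse_amount(text):
    """Parser subsystem: turn a string such as ' 1,250.50 ' into a float."""
    return round(float(text.strip().replace(",", "")), 2)


def total_sales(lines):
    """Aggregator subsystem: sum the amount field of 'store;amount' lines."""
    return sum(parse_amount(line.split(";")[1]) for line in lines)


class ParseAmountUnitTest(unittest.TestCase):
    # Unit test: validates one element in isolation.
    def test_strips_whitespace_and_commas(self):
        self.assertEqual(parse_amount(" 1,250.50 "), 1250.5)


class TotalSalesIntegrationTest(unittest.TestCase):
    # Integration test: validates the interaction between parser and aggregator.
    def test_parsing_and_aggregation_work_together(self):
        lines = ["store_1;1,000.00", "store_2; 250.50"]
        self.assertEqual(total_sales(lines), 1250.5)


loader = unittest.defaultTestLoader
suite = unittest.TestSuite([
    loader.loadTestsFromTestCase(ParseAmountUnitTest),
    loader.loadTestsFromTestCase(TotalSalesIntegrationTest),
])
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A bug in how total_sales splits each line would pass the unit test but fail the integration test, which is the class of error integration testing is meant to catch.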
NEW QUESTION # 54
A data engineer wants to run unit tests using common Python testing frameworks on Python functions defined across several Databricks notebooks currently used in production.
How can the data engineer run unit tests against functions that work with data in production?
- A. Run unit tests against non-production data that closely mirrors production
- B. Define and import unit test functions from a separate Databricks notebook
- C. Define and unit test functions using Files in Repos
- D. Define unit tests and functions within the same notebook
Answer: A
Explanation:
The best practice for running unit tests on functions that interact with data is to use a dataset that closely mirrors the production data. This approach allows data engineers to validate the logic of their functions without the risk of affecting the actual production data. It's important to have a representative sample of production data to catch edge cases and ensure the functions will work correctly when used in a production environment.
References:
* Databricks Documentation on Testing: Testing and Validation of Data and Notebooks
NEW QUESTION # 55
You had AUTO LOADER to process millions of files a day and noticed slowness in load process, so you scaled up the Databricks cluster but realized the performance of the Auto loader is still not improving, what is the best way to resolve this.
- A. Setup a second AUTO LOADER process to process the data
- B. Increase the maxFilesPerTrigger option to a sufficiently high number
- C. Merge files to one large file
- D. AUTO LOADER is not suitable to process millions of files a day
- E. Copy the data from cloud storage to local disk on the cluster for faster access
Answer: B
Explanation:
The default value of maxFilesPerTrigger is 1000; it can be increased to a much higher number, but doing so requires more compute to process each batch.
https://docs.databricks.com/ingestion/auto-loader/options.html
Prepare for the Actual Databricks Certification Databricks-Certified-Professional-Data-Engineer Exam Practice Materials Collection: https://exampasspdf.testkingit.com/Databricks/latest-Databricks-Certified-Professional-Data-Engineer-exam-dumps.html