
Professional-Data-Engineer Updated Exam Dumps [2021] Practice Valid Exam Dumps Question
Professional-Data-Engineer Sample with Accurate & Updated Questions
Certification Path
The Google Professional Data Engineer certification is one of the highest-level Google certifications and focuses on professional data engineering.
There is no prerequisite for this exam, but following a sequence of certifications is a good way to build and demonstrate the required knowledge.
You can complete the Google Associate certifications first and then attempt the professional certification. For more information, see the Google Cloud certification path.
NEW QUESTION 52
You are designing a cloud-native historical data processing system to meet the following conditions:
* The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
* A streaming data pipeline stores new data daily.
* Performance is not a factor in the solution.
* The solution design should maximize availability.
How should you design data storage for this solution?
- A. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
- B. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
- C. Store the data in BigQuery. Access the data using the BigQuery Connector on Cloud Dataproc and Compute Engine.
- D. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
Answer: A
NEW QUESTION 53
Which action can a Cloud Dataproc Viewer perform?
- A. List the jobs.
- B. Create a cluster.
- C. Delete a cluster.
- D. Submit a job.
Answer: A
Explanation:
A Cloud Dataproc Viewer is limited in its actions based on its role. A viewer can only list clusters, get cluster details, list jobs, get job details, list operations, and get operation details.
Reference:
https://cloud.google.com/dataproc/docs/concepts/iam#iam_roles_and_cloud_dataproc_operations_summary
NEW QUESTION 54
An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?
- A. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
- B. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
- C. Enable BigQuery monitoring in Google Stackdriver and create an alert.
- D. Use federated data sources, and check data in the SQL query.
Answer: A
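The dead-letter idea in option A can be illustrated without any cloud services. The following is a minimal sketch in plain Python, not a Cloud Dataflow pipeline: the three-column schema, the sample rows, and the function name split_rows are assumptions made for illustration. Rows that parse cleanly go to one output, and malformed rows are kept with their error message so they can be analyzed later instead of failing the whole load.

```python
import csv
import io


def split_rows(csv_text, expected_columns):
    """Parse CSV text and return (good_rows, dead_letter_rows)."""
    good, dead = [], []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        try:
            if len(row) != expected_columns:
                raise ValueError(f"expected {expected_columns} fields, got {len(row)}")
            # Hypothetical schema: (order_id: int, customer: str, amount: float)
            good.append((int(row[0]), row[1], float(row[2])))
        except ValueError as err:
            # Keep the raw row and the reason so it can be inspected later.
            dead.append({"line": line_no, "raw": row, "error": str(err)})
    return good, dead


sample = "1,acme,9.50\nbad-id,acme,1.00\n3,globex\n4,initech,2.25\n"
good_rows, dead_rows = split_rows(sample, expected_columns=3)
```

In a real pipeline the two lists would correspond to two sinks: the main BigQuery table and a separate dead-letter table.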
NEW QUESTION 55
You have data pipelines running on BigQuery, Cloud Dataflow, and Cloud Dataproc. You need to perform health checks and monitor their behavior, and then notify the team managing the pipelines if they fail. You also need to be able to work across multiple projects. Your preference is to use managed products or features of the platform. What should you do?
- A. Export the logs to BigQuery, and set up App Engine to read that information and send emails if you find a failure in the logs
- B. Develop an App Engine application to consume logs using GCP API calls, and send emails if you find a failure in the logs
- C. Run a Virtual Machine in Compute Engine with Airflow, and export the information to Stackdriver
- D. Export the information to Cloud Stackdriver, and set up an Alerting policy
Answer: D
Explanation:
Monitoring not only gives you access to Dataflow-related metrics, but also lets you create alerting policies and dashboards so you can chart time series of metrics and choose to be notified when these metrics reach specified values.
NEW QUESTION 56
What is the general recommendation when designing your row keys for a Cloud Bigtable schema?
- A. Include multiple time series values within the row key
- B. Keep your row key as long as the field permits
- C. Keep the row key as an 8-bit integer
- D. Keep your row key reasonably short
Answer: D
Explanation:
As a general guide, keep your row keys reasonably short. Long row keys take up additional memory and storage and increase the time it takes to get responses from the Cloud Bigtable server.
Reference: https://cloud.google.com/bigtable/docs/schema-design#row-keys
NEW QUESTION 57
You are building a new data pipeline to share data between two different types of applications: job generators and job runners. Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones. What should you do?
- A. Create a table on Cloud SQL, and insert and delete rows with the job information
- B. Create an API using App Engine to receive and send messages to the applications
- C. Create a table on Cloud Spanner, and insert and delete rows with the job information
- D. Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them
Answer: D
Explanation:
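Cloud Pub/Sub transmits messages in real time and scales automatically. Because publishers and subscribers are decoupled, new applications can be added through new subscriptions without affecting existing ones.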
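The decoupling between job generators and job runners can be pictured with a toy in-memory model. This is only a sketch of the publish/subscribe pattern, not the Cloud Pub/Sub client API; the class Topic and the subscription names "runners" and "audit" are hypothetical. Each subscription receives its own copy of every message, and consumers sharing one subscription split the messages between them.

```python
from collections import defaultdict, deque


class Topic:
    """Toy in-memory stand-in for a topic (not the Cloud Pub/Sub API)."""

    def __init__(self):
        self.subscriptions = defaultdict(deque)

    def subscribe(self, name):
        # Touching the key creates an empty queue for this subscription.
        self.subscriptions[name]

    def publish(self, message):
        # Every subscription gets its own copy of the message.
        for queue in self.subscriptions.values():
            queue.append(message)

    def pull(self, name):
        queue = self.subscriptions[name]
        return queue.popleft() if queue else None


jobs = Topic()
jobs.subscribe("runners")
jobs.subscribe("audit")  # a new application added later, with its own queue
jobs.publish("job-1")
jobs.publish("job-2")

runner_a = jobs.pull("runners")  # two runners share one subscription
runner_b = jobs.pull("runners")
audit_first = jobs.pull("audit")  # the audit app still sees every job
```

Adding the "audit" subscription does not change what the runners receive, which is the property the question asks for.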
NEW QUESTION 58
Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?
- A. Use a row key of the form <sensorid>#<timestamp>.
- B. Use a row key of the form <sensorid>.
- C. Use a row key of the form <timestamp>.
- D. Use a row key of the form <timestamp>#<sensorid>.
Answer: A
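Bigtable stores rows sorted lexicographically by row key, so a key that leads with the sensor ID and ends with the timestamp keeps each sensor's readings contiguous and spreads writes across sensors. The sketch below imitates that ordering with a sorted Python list; the sensor IDs, timestamps, and the helpers row_key and prefix_scan are made up for illustration and are not part of any Bigtable client library.

```python
import bisect


def row_key(sensor_id, timestamp):
    # Zero-pad the timestamp so lexicographic order matches numeric order.
    return f"{sensor_id}#{timestamp:010d}"


readings = [("s2", 100), ("s1", 300), ("s1", 100), ("s3", 200), ("s1", 200)]
table = sorted(row_key(s, t) for s, t in readings)  # rows sorted by key


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous slice of keys that start with the prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\xff")
    return sorted_keys[start:end]


s1_rows = prefix_scan(table, "s1#")
```

A dashboard query for one sensor becomes a single range scan, whereas a key that starts with the timestamp would send all new writes to the same end of the table.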
NEW QUESTION 59
Which of these operations can you perform from the BigQuery Web UI?
- A. Upload multiple files using a wildcard.
- B. Upload a file in SQL format.
- C. Upload a 20 MB file.
- D. Load data with nested and repeated fields.
Answer: D
Explanation:
You can load data with nested and repeated fields using the Web UI.
You cannot use the Web UI to:
- Upload a file greater than 10 MB in size
- Upload multiple files at the same time
- Upload a file in SQL format
All three of the above operations can be performed using the "bq" command.
Reference: https://cloud.google.com/bigquery/loading-data
NEW QUESTION 60
You are creating a model to predict housing prices. Due to budget constraints, you must run it on a single resource-constrained virtual machine. Which learning algorithm should you use?
- A. Recurrent neural network
- B. Linear regression
- C. Feedforward neural network
- D. Logistic classification
Answer: B
Explanation:
Linear regression predicts a continuous value such as a housing price and is cheap to train, which suits a single resource-constrained machine.
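To show how little computation a linear model needs, here is a minimal sketch of one-variable least-squares regression in plain Python. The data points (floor area against price) are invented for illustration, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: floor area in square metres vs. price in thousands.
areas = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(areas, prices)
predicted = slope * 90 + intercept  # estimated price for a 90 m^2 home
```

The fit is a single pass over the data with a closed-form solution, in contrast to the iterative training a neural network requires.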
NEW QUESTION 61
You are designing a cloud-native historical data processing system to meet the following conditions:
* The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
* A streaming data pipeline stores new data daily.
* Performance is not a factor in the solution.
* The solution design should maximize availability.
How should you design data storage for this solution?
- A. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
- B. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
- C. Store the data in BigQuery. Access the data using the BigQuery Connector on Cloud Dataproc and Compute Engine.
- D. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
Answer: D
NEW QUESTION 62
Flowlogistic Case Study
Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background
The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.
Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
* Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
* Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.
Existing Technical Environment
Flowlogistic architecture resides in a single data center:
* Databases
- 8 physical servers in 2 clusters
- SQL Server - user data, inventory, static data
- 3 physical servers
- Cassandra - metadata, tracking messages
- 10 Kafka servers - tracking message aggregation and batch insert
* Application servers - customer front end, middleware for order/customs
- 60 virtual machines across 20 physical servers
- Tomcat - Java services
- Nginx - static content
- Batch servers
* Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) - SQL server storage
- Network-attached storage (NAS) - image storage, logs, backups
* 10 Apache Hadoop /Spark servers
- Core Data Lake
- Data analysis workloads
* 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts
Business Requirements
* Build a reliable and reproducible environment with scaled parity of production.
* Aggregate data in a centralized Data Lake for analysis
* Use historical data to perform predictive analytics on future shipments
* Accurately track every shipment worldwide using proprietary technology
* Improve business agility and speed of innovation through rapid provisioning of new resources
* Analyze and optimize architecture for performance in the cloud
* Migrate fully to the cloud if all other requirements are met
Technical Requirements
* Handle both streaming and batch data
* Migrate existing Hadoop workloads
* Ensure architecture is scalable and elastic to meet the changing demands of the company.
* Use managed services whenever possible
* Encrypt data in flight and at rest
* Connect a VPN between the production data center and cloud environment
CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.
CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.
CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?
- A. Cloud Pub/Sub, Cloud SQL, and Cloud Storage
- B. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage
- C. Cloud Dataflow, Cloud SQL, and Cloud Storage
- D. Cloud Pub/Sub, Cloud Dataflow, and Local SSD
- E. Cloud Load Balancing, Cloud Dataflow, and Cloud Storage
Answer: B
NEW QUESTION 63
You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You've collected a labeled dataset that has on average 1000 examples for each unique component.
Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as a proof of concept) within a few working days. What should you do?
- A. Use Cloud Vision API by providing custom labels as recognition hints.
- B. Use Cloud Vision AutoML with the existing dataset.
- C. Train your own image recognition model leveraging transfer learning techniques.
- D. Use Cloud Vision AutoML, but reduce your dataset by half.
Answer: B
NEW QUESTION 64
When a Cloud Bigtable node fails, ____ is lost.
- A. all data
- B. the time dimension
- C. the last transaction
- D. no data
Answer: D
Explanation:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. Each tablet is associated with a specific Cloud Bigtable node. Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
- Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud Bigtable simply updates the pointers for each node.
- Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
- When a Cloud Bigtable node fails, no data is lost.
Reference: https://cloud.google.com/bigtable/docs/overview
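The separation between nodes and stored data described above can be modelled in a few lines. This is a toy sketch, not how Cloud Bigtable is implemented: the dictionaries storage and nodes and the function fail_node are hypothetical names. Tablet contents live in shared storage, nodes hold only tablet identifiers, and a node failure only moves identifiers.

```python
# Toy model: tablet data lives in shared storage; nodes hold only tablet IDs.
storage = {
    "tablet-1": ["row-a", "row-b"],
    "tablet-2": ["row-c"],
    "tablet-3": ["row-d"],
}
nodes = {"node-1": ["tablet-1", "tablet-2"], "node-2": ["tablet-3"]}


def fail_node(nodes, failed):
    """Reassign the failed node's tablet pointers to the surviving nodes."""
    orphaned = nodes.pop(failed)
    survivors = sorted(nodes)
    for i, tablet in enumerate(orphaned):
        nodes[survivors[i % len(survivors)]].append(tablet)
    return nodes


rows_before = sorted(r for rows in storage.values() for r in rows)
fail_node(nodes, "node-1")
rows_after = sorted(r for rows in storage.values() for r in rows)
```

Only the small pointer lists change; the stored rows are identical before and after the failure.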
NEW QUESTION 65
Which of these operations can you perform from the BigQuery Web UI?
- A. Upload multiple files using a wildcard.
- B. Upload a file in SQL format.
- C. Upload a 20 MB file.
- D. Load data with nested and repeated fields.
Answer: D
Explanation:
You can load data with nested and repeated fields using the Web UI.
You cannot use the Web UI to:
- Upload a file greater than 10 MB in size
- Upload multiple files at the same time
- Upload a file in SQL format
All three of the above operations can be performed using the "bq" command.
Reference: https://cloud.google.com/bigquery/loading-data
NEW QUESTION 66
Your team is working on a binary classification problem. You have trained a support vector machine (SVM) classifier with default parameters, and received an area under the curve (AUC) of 0.87 on the validation set.
You want to increase the AUC of the model. What should you do?
- A. Scale predictions you get out of the model (tune a scaling factor as a hyperparameter) in order to get the highest AUC
- B. Perform hyperparameter tuning
- C. Train a classifier with deep neural networks, because neural networks would always beat SVMs
- D. Deploy the model and measure the real-world AUC; it's always higher because of generalization
Answer: B
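One way to see why scaling the model's predictions (option A) cannot raise the AUC is that AUC depends only on how the scores rank positives against negatives. The sketch below computes AUC by pair counting in plain Python; the labels and scores are invented for illustration, and auc is a hypothetical helper rather than a library function.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]

base_auc = auc(labels, scores)
scaled_auc = auc(labels, [3.0 * s for s in scores])  # same ranking, same AUC
```

Multiplying every score by a positive constant leaves the ranking unchanged, so the AUC stays the same; changing the model itself, for example through hyperparameter tuning, is what can change the ranking.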
NEW QUESTION 67
Which SQL keyword can be used to reduce the number of columns processed by BigQuery?
- A. SELECT
- B. LIMIT
- C. WHERE
- D. BETWEEN
Answer: A
Explanation:
SELECT allows you to query specific columns rather than the whole table.
LIMIT, BETWEEN, and WHERE clauses will not reduce the number of columns processed by BigQuery.
Reference: https://cloud.google.com/bigquery/launch-checklist#architecture_design_and_development_checklist
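Because BigQuery stores data by column, the amount of data a query processes depends on which columns it names. The following is a rough sketch of that accounting in plain Python; the column names and byte sizes are invented, and bytes_scanned is a hypothetical helper, not a BigQuery API.

```python
# Hypothetical per-column storage sizes (in bytes) for one table.
column_bytes = {
    "order_id": 8_000_000,
    "customer": 40_000_000,
    "notes": 900_000_000,
}


def bytes_scanned(selected_columns):
    """In a columnar store, a query reads every selected column in full."""
    return sum(column_bytes[name] for name in selected_columns)


select_star = bytes_scanned(column_bytes)  # SELECT *
select_two = bytes_scanned(["order_id", "customer"])  # SELECT order_id, customer
```

Under this model, naming two small columns processes a small fraction of what SELECT * does, while a row filter or LIMIT would not change which columns are read.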
NEW QUESTION 68
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?
- A. Train on the new data while using the existing data as your test set.
- B. Continuously retrain the model on a combination of existing data and the new data.
- C. Train on the existing data while using the new data as your test set.
- D. Continuously retrain the model on just the new data.
Answer: B
NEW QUESTION 69
You are implementing security best practices on your data pipeline. Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster, and depositing the results into Google BigQuery.
How should you securely run this workload?
- A. Restrict the Google Cloud Storage bucket so only you can see the files
- B. Use a service account with the ability to read the batch files and to write to BigQuery
- C. Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the batch files and write to BigQuery
- D. Grant the Project Owner role to a service account, and run the job with it
Answer: B
NEW QUESTION 70
Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a Cassandra cluster on Google Compute Engine.
The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks. She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks. What should you do?
- A. Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.
- B. Host a visualization tool on a VM on Google Compute Engine.
- C. Run a local version of Jupyter on the laptop.
- D. Grant the user access to Google Cloud Shell.
Answer: A
Explanation:
Datalab provides Jupyter for this kind of work.
NEW QUESTION 71
You are operating a Cloud Dataflow streaming pipeline. The pipeline aggregates events from a Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Cloud Storage bucket.
The source has consistent throughput. You want to monitor and alert on the behavior of the pipeline with Cloud Stackdriver to ensure that it is processing data. Which Stackdriver alerts should you create?
- A. An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/num_undelivered_messages for the destination
- B. An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/used_bytes for the destination
- C. An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/used_bytes for the destination
- D. An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/num_undelivered_messages for the destination
Answer: C
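The logic behind option C is that a stalled pipeline shows up as a growing backlog at the source together with slower growth at the destination. Here is a minimal sketch of that condition in plain Python; the sample values and the functions rate_of_change and pipeline_stalled are hypothetical and do not use the Stackdriver API.

```python
def rate_of_change(samples):
    """Average change per interval across equally spaced samples."""
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def pipeline_stalled(backlog_samples, bucket_bytes_early, bucket_bytes_late):
    # Source side: undelivered messages are piling up.
    backlog_growing = rate_of_change(backlog_samples) > 0
    # Destination side: the bucket is growing more slowly than before.
    writes_slowing = rate_of_change(bucket_bytes_late) < rate_of_change(bucket_bytes_early)
    return backlog_growing and writes_slowing


healthy = pipeline_stalled([120, 118, 121, 119], [1000, 2000, 3000], [3000, 4000, 5000])
stalled = pipeline_stalled([120, 400, 900, 1500], [1000, 2000, 3000], [3000, 3100, 3150])
```

With consistent source throughput, a healthy pipeline keeps the backlog roughly flat and the bucket growing steadily, so neither condition fires.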
NEW QUESTION 72
Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow.
Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.
The data scientists have written the following code to read the data for new key features in the logs.
BigQueryIO.Read
.named("ReadLogData")
.from("clouddataflow-readonly:samples.log_data")
You want to improve the performance of this data read. What should you do?
- A. Specify the TableReference object in the code.
- B. Use .fromQuery operation to read specific fields from the table.
- C. Use of both the Google BigQuery TableSchema and TableFieldSchema classes.
- D. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
Answer: D
NEW QUESTION 73
Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low. You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)
- A. Introduce data compression for each file to increase the rate of file transfer.
- B. Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.
- C. Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
- D. Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
- E. Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
Answer: B,D
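The figures in the question can be worked through directly. The sketch below is back-of-the-envelope arithmetic in Python under a stated simplification: each file is assumed to cost at least one 200 ms round trip and files are assumed to be sent one at a time (a real SFTP session needs more round trips per file). The constant names are chosen for illustration.

```python
FILES_PER_HOUR = 20_000
FILE_SIZE_BYTES = 4 * 1024  # upper bound from the question
ROUND_TRIP_SECONDS = 0.2
LINK_MBPS = 50

# Bandwidth needed to move one hour of files within that hour.
total_bits = FILES_PER_HOUR * FILE_SIZE_BYTES * 8
required_mbps = total_bits / 3600 / 1_000_000

# Assumption: one round trip per file, files sent sequentially.
sequential_seconds = FILES_PER_HOUR * ROUND_TRIP_SECONDS

# Bundling 1,000 files per archive cuts the number of round trips.
archives_per_hour = FILES_PER_HOUR // 1000
bundled_seconds = archives_per_hour * ROUND_TRIP_SECONDS + total_bits / (LINK_MBPS * 1_000_000)
```

Under these assumptions the link needs well under 1 Mbps, yet sequential transfer takes more than the 3,600 seconds available each hour. That is consistent with the question's remark that the design barely keeps up while bandwidth utilization stays low: the limiting factor is per-file latency, which parallel transfers or bundling address.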
NEW QUESTION 74
Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently and new data will be streaming in all the time. Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data.
Which product should they use to store the data?
- A. Cloud Bigtable
- B. Google BigQuery
- C. Google Cloud Storage
- D. Google Cloud Datastore
Answer: A
Explanation:
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series
NEW QUESTION 75
Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?
- A. Use a row key of the form <sensorid>#<timestamp>.
- B. Use a row key of the form <sensorid>.
- C. Use a row key of the form <timestamp>.
- D. Use a row key of the form <timestamp>#<sensorid>.
Answer: A
NEW QUESTION 76
You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?
- A. Cloud Dataflow
- B. Cloud Composer
- C. Cloud Functions
- D. Cloud Scheduler
Answer: B
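The requirements in this question (ordered, interdependent steps with a fixed number of retries) describe a workflow orchestrator. The sketch below imitates that behavior with the Python standard library only; it is not Airflow or Cloud Composer code. The task names, the function run_workflow, and the deliberately flaky step are hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies, max_retries):
    """Run tasks in dependency order, retrying each failure a fixed number of times."""
    log = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt))
                break
            except RuntimeError:
                if attempt == max_retries + 1:
                    raise  # retries exhausted; stop the workflow
    return log


attempts = {"count": 0}


def flaky_hadoop_job():
    # Fails twice, then succeeds, to exercise the retry path.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")


tasks = {
    "shell_script": lambda: None,
    "hadoop_job": flaky_hadoop_job,
    "bigquery_query": lambda: None,
}
# Each key maps a task to the set of tasks that must finish first.
dependencies = {"hadoop_job": {"shell_script"}, "bigquery_query": {"hadoop_job"}}

run_log = run_workflow(tasks, dependencies, max_retries=3)
```

A managed orchestrator adds scheduling, monitoring, and operators for shell, Hadoop, and BigQuery steps on top of this ordering-and-retry core.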