[Jul 07, 2023] Prepare For The Professional-Data-Engineer Question Papers In Advance [Q13-Q31]

Professional-Data-Engineer PDF Dumps Real 2023 Recently Updated Questions


Who is the Professional Data Engineer Exam Intended for?

This exam is designed for individuals who are experts in designing, building, securing, and monitoring data processing systems, with a particular emphasis on compliance and security. A candidate who wants to take the Professional Data Engineer exam should be able to deploy, leverage, and train pre-existing machine learning models. Moreover, every applicant should have more than 3 years of experience, including at least 1 year designing and managing solutions using GCP.

 

NEW QUESTION # 13
What are all of the BigQuery operations that Google charges for?

  • A. Storage, queries, and loading data from a file
  • B. Queries and streaming inserts
  • C. Storage, queries, and streaming inserts
  • D. Storage, queries, and exporting data

Answer: C

Explanation:
Google charges for storage, queries, and streaming inserts. Loading data from a file and exporting data are free operations.


NEW QUESTION # 14
An organization maintains a Google BigQuery dataset that contains tables with user-level data. They want to expose aggregates of this data to other Google Cloud projects, while still controlling access to the user-level data. Additionally, they need to minimize their overall storage cost and ensure the analysis cost for other projects is assigned to those projects. What should they do?

  • A. Create and share an authorized view that provides the aggregate results.
  • B. Create and share a new dataset and view that provides the aggregate results.
  • C. Create dataViewer Identity and Access Management (IAM) roles on the dataset to enable sharing.
  • D. Create and share a new dataset and table that contains the aggregate results.

Answer: C


NEW QUESTION # 15
You want to archive data in Cloud Storage. Because some data is very sensitive, you want to use the "Trust No One" (TNO) approach to encrypt your data to prevent the cloud provider staff from decrypting your data.
What should you do?

  • A. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access.
  • B. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in Cloud Memorystore as permanent storage of the secret.
  • C. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key. Use gsutil cp to upload each encrypted file to the Cloud Storage bucket.
    Manually destroy the key previously used for encryption, and rotate the key once.
  • D. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key and unique additional authenticated data (AAD). Use gsutil cp to upload each encrypted file to the Cloud Storage bucket, and keep the AAD outside of Google Cloud.

Answer: C


NEW QUESTION # 16
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?

  • A. Rowkey: date
    Column data: device_id,data_point
  • B. Rowkey: date#data_point
    Column data: device_id
  • C. Rowkey: device_id
    Column data: date, data_point
  • D. Rowkey: date#device_id
    Column data: data_point
  • E. Rowkey: data_point
    Column data: device_id,date

Answer: D


NEW QUESTION # 17
What is the general recommendation when designing your row keys for a Cloud Bigtable schema?

  • A. Keep the row key as an 8-bit integer
  • B. Include multiple time series values within the row key
  • C. Keep your row key as long as the field permits
  • D. Keep your row key reasonably short

Answer: D

Explanation:
As a general guide, keep your row keys reasonably short. Long row keys take up additional memory and storage and increase the time it takes to get responses from the Cloud Bigtable server.
Reference: https://cloud.google.com/bigtable/docs/schema-design#row-keys
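
To make the row key advice concrete, here is a minimal sketch in plain Python. It does not call the Cloud Bigtable client; it only imitates the fact that rows are stored in lexicographic key order, so a short composite key turns a common lookup into one contiguous prefix range. The key layout `date#device_id`, the helper names, and the sample values are all assumptions chosen for illustration.

```python
from bisect import bisect_left


def row_key(date: str, device_id: str) -> str:
    """Build a short composite row key; '#' is only a conventional separator."""
    return f"{date}#{device_id}"


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with `prefix` from a sorted list of keys."""
    start = bisect_left(sorted_keys, prefix)  # first key >= prefix
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the matching range has ended
        matches.append(key)
    return matches


# Hypothetical dates and device IDs, stored in sorted order like a key-value table.
keys = sorted(
    row_key(date, device)
    for date in ("20230706", "20230707")
    for device in ("dev001", "dev002")
)

one_day = prefix_scan(keys, "20230707#")
one_device_one_day = prefix_scan(keys, row_key("20230707", "dev001"))
```

Because matching keys sit next to each other, the scan stops as soon as the prefix no longer matches, and shorter keys mean less data to store and compare per row.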


NEW QUESTION # 18
You architect a system to analyze seismic data. Your extract, transform, and load (ETL) process runs as a series of MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are computationally expensive. Then you discover that a sensor calibration step has been omitted. How should you change your ETL process to carry out sensor calibration systematically in the future?

  • A. Develop an algorithm through simulation to predict variance of data output from the last MapReduce job based on calibration factors, and apply the correction to all data.
  • B. Modify the transform MapReduce jobs to apply sensor calibration before they do anything else.
  • C. Add sensor calibration data to the output of the ETL process, and document that all users need to apply sensor calibration themselves.
  • D. Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.

Answer: B


NEW QUESTION # 19
You used Cloud Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes. What should you do?

  • A. Export the recipe as a Cloud Dataprep template, and create a job in Cloud Scheduler.
  • B. Create an App Engine cron job to schedule the execution of the Cloud Dataprep job.
  • C. Export the Cloud Dataprep job as a Cloud Dataflow template, and incorporate it into a Cloud Composer job.
  • D. Create a cron schedule in Cloud Dataprep.

Answer: C


NEW QUESTION # 20
You're training a model to predict housing prices based on an available dataset with real estate properties. Your plan is to train a fully connected neural net, and you've discovered that the dataset contains latitude and longitude of the property. Real estate professionals have told you that the location of the property is highly influential on price, so you'd like to engineer a feature that incorporates this physical dependency.
What should you do?

  • A. Provide latitude and longitude as input vectors to your neural net.
  • B. Create a numeric column from a feature cross of latitude and longitude.
  • C. Create a feature cross of latitude and longitude, bucketize at the minute level and use L1 regularization during optimization.
  • D. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L2 regularization during optimization.

Answer: C

Explanation:
Reference: https://cloud.google.com/bigquery/docs/gis-data
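
As a rough illustration of what "bucketize at the minute level and cross" means, the sketch below maps a latitude/longitude pair to a single grid-cell ID, which a model would then treat as a sparse categorical feature. This is plain Python rather than any particular ML library's feature API, and the function names and sample coordinates are assumptions made for the example.

```python
import math

BUCKETS_PER_DEGREE = 60  # one bucket per arcminute


def bucketize(value: float, lower_bound: float) -> int:
    """Index of the arcminute-wide bucket that contains `value`."""
    return math.floor((value - lower_bound) * BUCKETS_PER_DEGREE)


def crossed_cell(lat: float, lon: float) -> int:
    """Cross the latitude and longitude buckets into one grid-cell ID."""
    lat_bucket = bucketize(lat, -90.0)
    lon_bucket = bucketize(lon, -180.0)
    lon_buckets_total = 360 * BUCKETS_PER_DEGREE
    return lat_bucket * lon_buckets_total + lon_bucket


# Hypothetical properties: the first two fall in the same arcminute cell.
cell_a = crossed_cell(37.7749, -122.4194)
cell_b = crossed_cell(37.7750, -122.4195)
cell_c = crossed_cell(37.8100, -122.4194)
```

The cross produces a very large number of possible cells, most of which never appear in the data. L1 regularization tends to push the weights of such unused cells to exactly zero, which is the usual argument for pairing it with a sparse feature cross.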


NEW QUESTION # 21
Your United States-based company has created an application for assessing and responding to user actions.
The primary table's data volume grows by 250,000 records per second. Many third parties use your application's APIs to build the functionality into their own frontend applications. Your application's APIs should comply with the following requirements:
* Single global endpoint
* ANSI SQL support
* Consistent access to the most up-to-date data
What should you do?

  • A. Implement BigQuery with no region selected for storage or processing.
  • B. Implement Cloud Bigtable with the primary cluster in North America and secondary clusters in Asia and Europe.
  • C. Implement Cloud SQL for PostgreSQL with the master in North America and read replicas in Asia and Europe.
  • D. Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe.

Answer: D


NEW QUESTION # 22
Case Study 2 - MJTelco
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

  • A. The number of workers
  • B. The maximum number of workers
  • C. The disk size per worker
  • D. The zone

Answer: B
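
For orientation, the maximum number of workers is normally passed as a pipeline option when the job is launched. The sketch below only assembles the command-line flags as a list of strings. The flag spellings follow the Apache Beam Python SDK as far as we know (the Java SDK uses a different spelling), and the helper name and the value 50 are hypothetical.

```python
def dataflow_autoscaling_args(max_workers: int) -> list:
    """Build launch flags that let Dataflow scale up to `max_workers` workers."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return [
        "--runner=DataflowRunner",
        # Autoscaling adds workers as load grows...
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        # ...but never beyond this ceiling, which is the setting the question asks about.
        f"--max_num_workers={max_workers}",
    ]


args = dataflow_autoscaling_args(50)  # 50 is an arbitrary example ceiling
```

Setting a fixed number of workers pins the job size, whereas raising the ceiling leaves the service free to add capacity as the 50,000 installations come online.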


NEW QUESTION # 23
You have data pipelines running on BigQuery, Cloud Dataflow, and Cloud Dataproc. You need to perform health checks and monitor their behavior, and then notify the team managing the pipelines if they fail. You also need to be able to work across multiple projects. Your preference is to use managed products or features of the platform. What should you do?

  • A. Export the logs to BigQuery, and set up App Engine to read that information and send emails if you find a failure in the logs
  • B. Run a Virtual Machine in Compute Engine with Airflow, and export the information to Stackdriver
  • C. Export the information to Cloud Stackdriver, and set up an Alerting policy
  • D. Develop an App Engine application to consume logs using GCP API calls, and send emails if you find a failure in the logs

Answer: C

Explanation:
Monitoring not only gives you access to Dataflow-related metrics, but also lets you create alerting policies and dashboards, so you can chart time series of metrics and choose to be notified when these metrics reach specified values.


NEW QUESTION # 24
Your company is using WILDCARD tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:
# Syntax error : Expected end of statement but got "-" at [4:11]
SELECT age
FROM
bigquery-public-data.noaa_gsod.gsod
WHERE
age != 99
AND _TABLE_SUFFIX = '1929'
ORDER BY
age DESC
Which table name will make the SQL statement work correctly?

  • A. 'bigquery-public-data.noaa_gsod.gsod'
  • B. bigquery-public-data.noaa_gsod.gsod*
  • C. 'bigquery-public-data.noaa_gsod.gsod'*
  • D. `bigquery-public-data.noaa_gsod.gsod*`

Answer: D
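
The role of the trailing `*` and of `_TABLE_SUFFIX` can be imitated in a few lines of plain Python. This is only a toy model of the matching behavior, not BigQuery itself: the helper name is made up, and the table list is a hypothetical subset of a dataset.

```python
def match_wildcard(table_names, wildcard):
    """Map each table matching a `prefix*` wildcard to its _TABLE_SUFFIX value."""
    if not wildcard.endswith("*"):
        raise ValueError("a wildcard table name must end with '*'")
    prefix = wildcard[:-1]
    return {
        name: name[len(prefix):]  # the part of the name matched by '*'
        for name in table_names
        if name.startswith(prefix)
    }


# Hypothetical tables in the dataset.
tables = ["gsod1929", "gsod1930", "stations"]

suffixes = match_wildcard(tables, "gsod*")
# Equivalent of: WHERE _TABLE_SUFFIX = '1929'
selected = [name for name, suffix in suffixes.items() if suffix == "1929"]
```

In the real query, the whole wildcard name must additionally be enclosed in backticks, because the project ID contains hyphens that the parser would otherwise read as minus signs, which is what the syntax error at position [4:11] is pointing at.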


NEW QUESTION # 25
You are responsible for writing your company's ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?

  • A. PigLatin using Pig
  • B. Python using MapReduce
  • C. HiveQL using Hive
  • D. Java using MapReduce

Answer: A

Explanation:
Pig is a scripting language that can be used for checkpointing and splitting pipelines.


NEW QUESTION # 26
You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally. You also want to optimize data for range queries on non-key columns. What should you do?

  • A. Use Cloud SQL for storage. Add secondary indexes to support query patterns.
  • B. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.
  • C. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
  • D. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.

Answer: D

Explanation:
Cloud Spanner lets transactional tables scale horizontally, and its secondary indexes support range queries on non-key columns.


NEW QUESTION # 27
You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?

  • A. Cloud Dataflow
  • B. Cloud Composer
  • C. Cloud Scheduler
  • D. Cloud Functions

Answer: B
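
The two requirements that stand out in this question, steps that run in a fixed dependency order and a fixed number of retries per step, are what a workflow orchestrator provides. The sketch below imitates that behavior with the Python standard library only. It is not the Apache Airflow or Cloud Composer API, and the step names, the flaky step, and the helper are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, deps, max_retries=3):
    """Run steps in dependency order, retrying a failing step up to max_retries times."""
    executed = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1 + max_retries):
            try:
                steps[name]()
                executed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: fail the whole workflow
    return executed


attempts = {"hadoop_job": 0}


def flaky_hadoop_job():
    """Hypothetical step that fails once and then succeeds."""
    attempts["hadoop_job"] += 1
    if attempts["hadoop_job"] < 2:
        raise RuntimeError("transient failure")


steps = {
    "shell_script": lambda: None,
    "hadoop_job": flaky_hadoop_job,
    "bigquery_query": lambda: None,
}
# Each key depends on the steps listed in its value.
deps = {"hadoop_job": {"shell_script"}, "bigquery_query": {"hadoop_job"}}

executed = run_workflow(steps, deps, max_retries=3)
```

A plain cron-style scheduler can start a job at a given time, but it has no notion of the dependency graph or of per-step retries that this loop models.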


NEW QUESTION # 28
You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. Initially, design the application to use streaming inserts for individual postings. Your application also performs data aggregations right after the streaming inserts. You discover that the queries after streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?

  • A. Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.
  • B. Re-write the application to load accumulated data every 2 minutes.
  • C. Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.
  • D. Convert the streaming insert code to batch load for individual messages.

Answer: C

Explanation:
Streamed data first lands in a buffer and is then written to storage. If we run queries while the data is still in the buffer, we face the issues described above. If we wait for BigQuery to write the data to storage, we do not face the issue, so we need to wait until the data has been written to storage.


NEW QUESTION # 29
MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?

  • A. Rowkey: date#data_point
    Column data: device_id
  • B. Rowkey: date
    Column data: device_id, data_point
  • C. Rowkey: date#device_id
    Column data: data_point
  • D. Rowkey: device_id
    Column data: date, data_point
  • E. Rowkey: data_point
    Column data: device_id, date

Answer: C


NEW QUESTION # 30
You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property 'tags' have multiple values but the property 'date released' does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

  • A. Set the following in your entity options: exclude_from_indexes = 'date_published'
  • B. Manually configure the index in your index config as follows:
  • C. Manually configure the index in your index config as follows:
  • D. Set the following in your entity options: exclude_from_indexes = 'actors, tags'

Answer: C
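
The index definitions for options B and C are not reproduced above. Independently of them, the size of the problem can be sketched with simple arithmetic: a composite index that covers several multi-valued properties stores one entry per combination of their values, while separate indexes store one entry per value. The counts below (5 actors, 4 tags) and the helper name are hypothetical.

```python
from math import prod


def index_entries(value_counts):
    """Entries one entity adds to a composite index over the given properties.

    Each element is the number of values that property holds; a single-valued
    property such as date_released contributes a factor of 1.
    """
    return prod(value_counts)


actors, tags, date_released = 5, 4, 1  # hypothetical movie

# One index over actors, tags and date_released: one entry per combination.
combined = index_entries([actors, tags, date_released])

# Two indexes, (actors, date_released) and (tags, date_released).
separate = index_entries([actors, date_released]) + index_entries([tags, date_released])
```

The combined index grows multiplicatively with every extra value, whereas the two separate indexes grow additively and still serve both typical queries in the question.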


NEW QUESTION # 31
......


Building & Operationalizing Data Processing Systems

Within this subject area, the test takers should show that they know how to build and operationalize storage systems. Specifically, they need to be conversant with effective use of managed services (such as Cloud Bigtable, Cloud SQL, Cloud Spanner, BigQuery, Cloud Storage, Cloud Memorystore, Cloud Datastore), storage costs & performance, and lifecycle management of data. The students should also be capable of building as well as operationalizing pipelines, including such technical tasks as data cleansing, transformation, batch & streaming, data acquisition & import, and integrating with new data sources. Apart from that, the candidates need to have sufficient competency to build and operationalize the processing infrastructure. This includes a good comprehension of provisioning resources, adjusting pipelines, monitoring pipelines, as well as testing & quality control.


Achieving the Google Professional-Data-Engineer certification is an excellent way for data professionals to demonstrate their expertise and advance their careers. This certification is highly valued by employers and can lead to new job opportunities and higher salaries. Moreover, it provides individuals with the skills and knowledge they need to design, build, and maintain data processing systems that are reliable, scalable, and secure.

 

Professional-Data-Engineer Dumps and Practice Test (270 Exam Questions): https://actual4test.torrentvce.com/Professional-Data-Engineer-valid-vce-collection.html