Exam: GCP: Professional Data Engineer

Total Questions: 101

You have an Azure subscription that contains multiple virtual machines in the West US Azure region.
You need to use Traffic Analytics in Azure Network Watcher to monitor virtual machine traffic.
Which two resources should you create? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. a Log Analytics workspace
B. an Azure Monitor workbook
C. a storage account
D. a Microsoft Sentinel workspace
E. a Data Collection Rule (DCR) in Azure Monitor
Answer: AC ✅ Explanation -Traffic Analytics is built on flow logs, which are written to a storage account, and the analyzed data is stored and queried in a Log Analytics workspace, so both of those resources must be created.

You are building a model to make clothing recommendations. You know a user’s fashion preference is likely to change over time, so you build a data pipeline to
stream new data back to the model as it becomes available. How should you use this data to train the model?
A. Continuously retrain the model on just the new data.
B. Continuously retrain the model on a combination of existing data and the new data.
C. Train on the existing data while using the new data as your test set.
D. Train on the new data while using the existing data as your test set.
Answer: B ✅ Explanation
✅ B. Retrain on a combination of existing data and new data.
- Combats concept drift: incorporating new data helps the model adapt to recent trends.
- Preserves learned patterns: including historical data ensures the model retains long-term insights.
- Balances recency with robustness: prevents overfitting to recent, potentially noisy data.
- This is standard practice in incremental learning or online training pipelines.
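As a rough, illustrative sketch (not part of the original question), the retraining step below combines historical data with a newly streamed batch and refits the model; the arrays and the SGDClassifier choice are placeholder assumptions.

```python
# Minimal sketch: periodically retrain on historical + newly streamed data.
# The arrays and model below are illustrative placeholders.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_hist, y_hist = rng.normal(size=(5000, 20)), rng.integers(0, 2, 5000)  # existing data
X_new, y_new = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)      # freshly streamed batch

# Option B: retrain on the combination of existing and new data.
X_train = np.vstack([X_hist, X_new])
y_train = np.concatenate([y_hist, y_new])

model = SGDClassifier()
model.fit(X_train, y_train)
```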

Your company is streaming real-time sensor data from its factory floor into Bigtable and has noticed extremely poor performance. How should the row key
be redesigned to improve Bigtable performance on queries that populate real-time dashboards?
A. Use a row key of the form <timestamp>.
B. Use a row key of the form <sensorid>.
C. Use a row key of the form <timestamp>#<sensorid>.
D. Use a row key of the form <sensorid>#<timestamp>.
Answer: D ✅ Explanation
- In Cloud Bigtable, row key design is critical to performance, especially for real-time dashboards. The issue typically arises from hotspotting, where too many writes or reads are concentrated on the same Bigtable node.
- Why Option D is correct: using a row key of the form <sensorid>#<timestamp> distributes data evenly by sensor and then sorts it chronologically per sensor.
- This enables:
  - Efficient time-series queries per sensor (e.g., "latest values").
  - Even distribution of writes and reads across nodes.
- Bigtable stores rows in lexicographical order, so putting sensorid first helps group and access data efficiently by device.
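As an illustrative sketch only (the project, instance, table, and column-family names are placeholders, not values from the question), this is roughly how a reading could be written with a <sensorid>#<timestamp> row key using the google-cloud-bigtable Python client:

```python
# Minimal sketch: write one sensor reading keyed as <sensorid>#<timestamp>.
# Project, instance, table, and column-family names are placeholders.
import time
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("sensor-instance").table("sensor-readings")

sensor_id = "sensor-042"
ts_millis = int(time.time() * 1000)
# Zero-padding keeps lexicographic order aligned with chronological order per sensor.
row_key = f"{sensor_id}#{ts_millis:013d}".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("metrics", "temperature", b"21.7")
row.commit()
```

A dashboard query for one sensor then becomes a simple prefix scan on the sensor ID, avoiding the hotspotting a timestamp-first key would cause.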

Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other’s data. You want to ensure appropriate access to the data.

Which three steps should you take? (Choose three.)

A. Load data into different partitions.
B. Load data into a different dataset for each client.
C. Put each client’s BigQuery dataset into a different table.
D. Restrict a client’s dataset to approved users.
E. Only allow a service account to access the datasets.
F. Use the appropriate identity and access management (IAM) roles for each client’s users.
Answer: BDF ✅ Explanation
- You need to secure and isolate data for multiple clients in BigQuery, ensuring that no client can access another's data.
✅ B. Load data into a different dataset for each client: BigQuery access control is primarily applied at the dataset level, so separating data by dataset allows you to assign fine-grained IAM permissions on a per-client basis.
✅ D. Restrict a client's dataset to approved users: you must explicitly control who can access each dataset, ensuring that only authorized users from each client can view or query their own data.
✅ F. Use the appropriate identity and access management (IAM) roles for each client's users: roles such as bigquery.dataViewer, bigquery.dataEditor, or bigquery.user grant the proper level of access to each dataset. IAM roles enforce role-based access control (RBAC), which is essential for multi-tenant security.
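For illustration only (the project, dataset, and group names are hypothetical), a per-client dataset grant might look like this with the google-cloud-bigquery Python client:

```python
# Minimal sketch: grant one client's group read access to its own dataset.
# Project, dataset, and group names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
dataset = client.get_dataset("my-project.client_a")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="client-a-analysts@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```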

Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information
required to do their jobs. You want to enforce this requirement with Google BigQuery.

Which three approaches can you take? (Choose three.)

A. Disable writes to certain tables.
B. Restrict access to tables by role.
C. Ensure that the data is encrypted at all times.
D. Restrict BigQuery API access to approved users.
E. Segregate data across multiple tables or databases.
F. Use Google Stackdriver Audit Logging to determine policy violations.
Answer: BDE ✅ Explanation
- In a highly regulated industry, least-privilege access is critical: users should only access the data necessary for their roles. Here's how the correct options support this principle in BigQuery:
✅ B. Restrict access to tables by role: use IAM roles (e.g., bigquery.dataViewer or custom roles) to grant granular access. You can grant table-level permissions, limiting users to only the tables they are authorized to access. This is a direct implementation of least privilege.
✅ D. Restrict BigQuery API access to approved users: limiting API access ensures that only specific users or service accounts can interact with BigQuery programmatically, which helps prevent unauthorized or unmonitored data access.
✅ E. Segregate data across multiple tables or databases: data segregation lets you isolate sensitive data and apply different access controls to different tables or datasets, minimizing the risk of users accidentally or maliciously accessing unauthorized data.
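As a hedged sketch (the project, dataset, table, and user names are placeholders), table-level IAM can be used to limit a user to a single table:

```python
# Minimal sketch: grant a single user read access to one specific table
# via table-level IAM. Project, dataset, table, and user are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table = client.get_table("my-project.regulated_data.claims")

policy = client.get_iam_policy(table)
policy.bindings.append(
    {"role": "roles/bigquery.dataViewer", "members": {"user:analyst@example.com"}}
)
client.set_iam_policy(table, policy)
```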

Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams
have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to
discover what everyone is doing. What should you do first?
A. Use Google Stackdriver Audit Logs to review data access.
B. Get the identity and access management (IAM) policy of each table.
C. Use Stackdriver Monitoring to see the usage of BigQuery query slots.
D. Use the Google Cloud Billing API to see what account the warehouse is being billed to.
Answer: A ✅ Explanation
- To secure your BigQuery data warehouse, your first step should be to understand how it's currently being used. Since there's no formal policy or documentation, you need visibility into who is accessing what data and how. The best tool for this is:
✅ A. Use Google Stackdriver Audit Logs (now part of Cloud Logging). Audit logs capture detailed records of:
  - Who accessed BigQuery datasets and tables
  - When the access occurred
  - What operations were performed (e.g., queries, reads, writes)
- This allows you to identify over-permissioned users, understand team use cases, and establish a foundation for least-privilege access control.
- It provides historical, user-level insights, which are essential before you make policy changes.
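As a rough sketch (the project ID is a placeholder, and the filter shown is just one common shape for BigQuery audit-log filters), recent data-access entries can be pulled with the Cloud Logging Python client:

```python
# Minimal sketch: list recent BigQuery audit log entries.
# The project ID is a placeholder; the filter is one common form.
from google.cloud import logging

client = logging.Client(project="my-project")
log_filter = (
    'protoPayload.serviceName="bigquery.googleapis.com" '
    'AND logName:"data_access"'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, page_size=50
):
    print(entry.timestamp, entry.log_name)
```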

You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will be sent only once, but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively
querying data.

Which query type should you use?

A. Include ORDER BY DESC on the timestamp column and LIMIT to 1.
B. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
C. Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
D. Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.
Answer: D ✅ Explanation
- When streaming data into BigQuery with possible duplicates but a unique ID per row, you want to deduplicate the data in your queries.
- Why Option D? The ROW_NUMBER() window function assigns a unique sequential number to each row within a partition (here, partitioned by the unique ID). By ordering within the partition (usually by timestamp or other criteria), you can keep the first occurrence (ROW_NUMBER() = 1) of each unique ID. Filtering the query with WHERE row_number = 1 effectively removes duplicates, returning only one row per unique ID.
- Why the others are incorrect:
  - A. ORDER BY DESC on timestamp and LIMIT 1 returns only one row in total, not deduplicated data for all unique IDs.
  - B. GROUP BY unique ID and timestamp with SUM could combine duplicates, but it won't handle them well if timestamps differ slightly or the data is complex; grouping by timestamp may also create multiple groups for the same unique ID.
  - C. LAG with PARTITION BY unique ID and WHERE LAG IS NOT NULL helps compare previous rows, but filtering on LAG IS NOT NULL excludes the first occurrence and doesn't cleanly deduplicate.
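For illustration (the table and column names unique_id, event_timestamp, and my-project.events.stream_raw are hypothetical), a deduplicating query of this shape could be run through the BigQuery Python client:

```python
# Minimal sketch: interactive deduplication with ROW_NUMBER().
# Dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY unique_id
      ORDER BY event_timestamp DESC
    ) AS rn
  FROM `my-project.events.stream_raw`
)
WHERE rn = 1
"""

for row in client.query(sql).result():
    print(dict(row))
```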

You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update.

What should you do?

A. Update the current pipeline and use the drain flag.
B. Update the current pipeline and provide the transform mapping JSON object.
C. Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.
D. Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.
Answer: D ✅ Explanation
- When making incompatible updates to a running Cloud Dataflow streaming pipeline, you cannot update it in place because the job state or pipeline code changes are not compatible. To avoid data loss, you need to carefully transition from the old pipeline to the new one.
- Why Option D is correct:
  - Create a new subscription to the Pub/Sub topic for the new Dataflow pipeline. This way, the new pipeline reads from its own subscription independently, without interfering with the old pipeline.
  - Once the new pipeline is verified to be working properly, cancel the old pipeline and delete the old subscription if it is no longer needed.
- This process ensures no data loss because Pub/Sub retains messages until they are acknowledged by the subscription, and each subscription maintains its own message backlog independently.
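As a hedged sketch (project, topic, and subscription names are placeholders), the new subscription for the replacement pipeline could be created like this:

```python
# Minimal sketch: create a second subscription on the same topic for the
# replacement Dataflow pipeline. Project, topic, and subscription names
# are placeholders.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project", "sensor-events")
new_sub_path = subscriber.subscription_path("my-project", "pipeline-v2-sub")

subscriber.create_subscription(request={"name": new_sub_path, "topic": topic_path})
```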

You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training
speed by removing some features while having a minimum effect on model accuracy. What can you do?
A. Eliminate features that are highly correlated to the output labels.
B. Combine highly co-dependent features into one representative feature.
C. Instead of feeding in each feature individually, average their values in batches of 3.
D. Remove the features that have null values for more than 50% of the training records.
Answer: B ✅ Explanation
- You have thousands of input features and want to improve training speed by reducing dimensionality while maintaining accuracy. One effective way to do this is to:
  - Identify features that are highly co-dependent (i.e., strongly correlated or redundant).
  - Combine them into one representative feature (e.g., using feature engineering techniques such as PCA, feature aggregation, or domain-specific transformations).
- This reduces the number of features without losing much information, improving training speed while keeping accuracy relatively stable.
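As an illustrative sketch only (the random arrays stand in for the real feature matrix and rain labels), PCA is one way to collapse groups of co-dependent features into fewer representative ones:

```python
# Minimal sketch: collapse co-dependent features with PCA before training.
# X and y below are random stand-ins for the real features and labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))     # stand-in for thousands of features
y = rng.integers(0, 2, size=1000)    # stand-in for rain / no-rain labels

# Keep enough principal components to explain ~95% of the variance.
model = make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=1000))
model.fit(X, y)
print("components kept:", model.named_steps["pca"].n_components_)
```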

You want to process payment transactions in a point-of-sale application that will run on Google Cloud Platform. Your user base could grow exponentially, but you
do not want to manage infrastructure scaling.
Which Google database service should you use?
A. Cloud SQL
B. BigQuery
C. Cloud Bigtable
D. Cloud Datastore
Answer: D ✅ Explanation
- For payment transactions in a point-of-sale (POS) application with potential exponential growth and no desire to manage infrastructure scaling, the ideal choice is a fully managed, horizontally scalable NoSQL database like Cloud Datastore / Firestore.
- Why Cloud Datastore / Firestore (Option D) is correct:
  - Fully managed: no need to manage servers or scaling.
  - Automatic scaling: scales seamlessly as the user base grows.
  - Transactional support: supports ACID transactions at the document/entity-group level, suitable for many POS needs.
  - Real-time and global replication (if using Firestore in native mode): adds robustness and low-latency reads.
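For illustration only (the kind and field names are placeholders), a payment write inside a Datastore transaction might look like this with the google-cloud-datastore Python client:

```python
# Minimal sketch: record a payment atomically inside a Datastore transaction.
# Kind and field names are placeholders.
from google.cloud import datastore

client = datastore.Client(project="my-project")

with client.transaction():
    payment = datastore.Entity(key=client.key("Payment"))
    payment.update({
        "register_id": "pos-17",
        "amount_cents": 1299,
        "currency": "USD",
        "status": "CAPTURED",
    })
    client.put(payment)
```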