Exam: AWS Certified Data Engineer - Associate

Total Questions: 71

A data engineer is configuring an AWS Glue job to read data from an Amazon S3 bucket. The data engineer has set up the necessary AWS Glue connection details and an associated IAM role. However, when the data engineer attempts to run the AWS Glue job, the data engineer receives an
error message that indicates that there are problems with the Amazon S3 VPC gateway endpoint.
The data engineer must resolve the error and connect the AWS Glue job to the S3 bucket.

Which solution will meet this requirement?

A. Update the AWS Glue security group to allow inbound traffic from the Amazon S3 VPC gateway endpoint.
B. Configure an S3 bucket policy to explicitly grant the AWS Glue job permissions to access the S3 bucket.
C. Review the AWS Glue job code to ensure that the AWS Glue connection details include a fully qualified domain name.
D. Verify that the VPC's route table includes inbound and outbound routes for the Amazon S3 VPC gateway endpoint.
Answer: D
✅ Explanation
✅ D. Verify that the VPC's route table includes inbound and outbound routes for the Amazon S3 VPC gateway endpoint.
When AWS Glue runs inside a VPC and needs to access Amazon S3, it must connect through a VPC gateway endpoint if there is no NAT gateway or internet access. An error that points to the Amazon S3 VPC gateway endpoint usually means the route table is not configured properly.
To fix this, ensure that the subnet's route table used by the Glue connection has a route that sends S3 traffic through the VPC gateway endpoint. This route allows private communication with S3 without requiring public internet access.
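A minimal boto3 sketch of that check, assuming a placeholder VPC ID and Region: it lists S3 gateway endpoints in the VPC and shows which route tables each endpoint is associated with.

```python
# Hypothetical sketch: confirm the Glue job's VPC routes S3 traffic through a gateway endpoint.
# The VPC ID and Region below are placeholders, not values from the question.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
vpc_id = "vpc-0123456789abcdef0"  # VPC used by the Glue connection (placeholder)

# Find S3 gateway endpoints attached to the VPC.
endpoints = ec2.describe_vpc_endpoints(
    Filters=[
        {"Name": "vpc-id", "Values": [vpc_id]},
        {"Name": "service-name", "Values": ["com.amazonaws.us-east-1.s3"]},
        {"Name": "vpc-endpoint-type", "Values": ["Gateway"]},
    ]
)["VpcEndpoints"]

if not endpoints:
    print("No S3 gateway endpoint found; create one and associate the Glue subnet's route table.")
else:
    # Associating a route table with the endpoint is what adds the prefix-list route for S3.
    for ep in endpoints:
        print(ep["VpcEndpointId"], "associated route tables:", ep["RouteTableIds"])
```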

A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries use the data hub to support company-wide analytics. A governance team must ensure that the company's data analysts can access data only for customers who are within the same country
as the analysts.

Which solution will meet these requirements with the LEAST operational effort?

A. Create a separate table for each country's customer data. Provide access to each analyst based on the country that the analyst serves.
B. Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company's access policies.
C. Move the data to AWS Regions that are close to the countries where the customers are. Provide access to each analyst based on the country that the analyst serves.
D. Load the data into Amazon Redshift. Create a view for each country. Create separate IAM roles for each country to provide access to data from each country. Assign the appropriate roles to the analysts.
Answer: B
✅ Explanation
✅ B. Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company's access policies.
Why? AWS Lake Formation is designed to simplify data lake security and governance. It supports fine-grained access control, including row-level and column-level security, directly on data stored in S3. This lets you centrally manage permissions without duplicating data or building complex manual access controls, and it requires minimal ongoing operational effort compared to managing separate tables, Regions, or Redshift views.
Why not the other options?
-A. Creating a separate table for each country's customer data is operationally complex: more datasets and access controls to manage, plus data duplication.
-C. Moving the data to AWS Regions close to the customers does not address the access control requirement, and moving data across Regions is complex, expensive, and adds operational effort.
-D. Loading the data into Amazon Redshift and creating views per country involves data migration and managing Redshift clusters, which is more operational overhead than using Lake Formation on the existing S3 data.
Final answer: B.
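As a rough illustration of the row-level security approach, here is a boto3 sketch that creates a Lake Formation data cells filter limiting an assumed customers table to one country. The catalog ID, database, table, column, and filter names are placeholders, not details from the question.

```python
# Hypothetical sketch of a Lake Formation row-level filter for country-scoped access.
import boto3

lf = boto3.client("lakeformation")

lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",      # account that owns the Data Catalog (placeholder)
        "DatabaseName": "customer_hub",        # assumed database name
        "TableName": "customers",              # assumed table name
        "Name": "customers_de_only",
        "RowFilter": {"FilterExpression": "country = 'DE'"},  # assumed country column
        "ColumnWildcard": {},                  # expose all columns; restrict rows only
    }
)
# A follow-up lf.grant_permissions(...) call would grant SELECT on this filter
# to the IAM role used by the analysts in that country.
```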

A media company wants to improve a system that recommends media content to customers based on user behavior and preferences. To improve the recommendation system, the company needs to incorporate insights from third-party datasets into the company's existing analytics platform.
The company wants to minimize the effort and time required to incorporate third-party datasets.

Which solution will meet these requirements with the LEAST operational overhead?

A. Use API calls to access and integrate third-party datasets from AWS Data Exchange.
B. Use API calls to access and integrate third-party datasets from AWS DataSync.
C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories.
D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR).
Answer: A
✅ Explanation
✅ A. Use API calls to access and integrate third-party datasets from AWS Data Exchange.
AWS Data Exchange is a managed service that allows you to find, subscribe to, and use third-party data in AWS. It provides ready-to-use datasets with straightforward API access, reducing the need for manual data ingestion or transformation. Integration is simple, which reduces operational complexity and time. This fits a company that wants to minimize the effort and time required to incorporate external data.
Why the other options are incorrect:
-B. AWS DataSync is designed for data transfer and migration, primarily for moving large datasets between on-premises storage and AWS or between AWS storage services. It is not a source of third-party datasets.
-C. AWS CodeCommit is a source code repository service, not a source of third-party datasets, and Kinesis Data Streams is a real-time streaming service that is not suited to pulling datasets from CodeCommit.
-D. Amazon ECR is a container image registry, not a data source, and Kinesis Data Streams cannot pull datasets from ECR.
Final answer: A.
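A hedged boto3 sketch of the Data Exchange API flow: list the datasets the account is entitled to, then export one revision's assets to S3 for the analytics platform. The dataset ID, revision ID, and bucket name are placeholders.

```python
# Hypothetical sketch: consume an entitled AWS Data Exchange dataset via API calls.
import boto3

dx = boto3.client("dataexchange")

# Datasets the account is entitled to through a Data Exchange subscription.
for ds in dx.list_data_sets(Origin="ENTITLED")["DataSets"]:
    print(ds["Id"], ds["Name"])

# Export the assets of one revision to an S3 bucket used by the analytics platform.
job = dx.create_job(
    Type="EXPORT_REVISIONS_TO_S3",
    Details={
        "ExportRevisionsToS3": {
            "DataSetId": "dataset-id-placeholder",
            "RevisionDestinations": [
                {"RevisionId": "revision-id-placeholder", "Bucket": "analytics-landing-bucket"}
            ],
        }
    },
)
dx.start_job(JobId=job["Id"])
```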

A financial company wants to implement a data mesh. The data mesh must support centralized data governance, data analysis, and data access
control. The company has decided to use AWS Glue for data catalogs and extract, transform, and load (ETL) operations.
Which combination of AWS services will implement a data mesh? (Choose two.)
A. Use Amazon Aurora for data storage. Use an Amazon Redshift provisioned cluster for data analysis.
B. Use Amazon S3 for data storage. Use Amazon Athena for data analysis.
C. Use AWS Glue DataBrew for centralized data governance and access control.
D. Use Amazon RDS for data storage. Use Amazon EMR for data analysis.
E. Use AWS Lake Formation for centralized data governance and access control.
Answer: BE
✅ Explanation
Amazon S3 is the preferred data lake storage for a data mesh: it is scalable, cost-effective, and supports diverse data formats. Amazon Athena allows serverless, interactive querying of data stored in S3, enabling efficient data analysis across datasets. AWS Lake Formation provides centralized data governance, fine-grained access control, and security policies on top of data stored in S3, and it integrates with the AWS Glue Data Catalog. Together, these services provide decentralized data ownership with centralized governance, which is the core principle of a data mesh.
Why the other options are less suitable:
-A. Aurora and Redshift are great for specific workloads, but Aurora is a relational database, not data lake storage, and a provisioned Redshift cluster is a centralized data warehouse rather than mesh-style distributed data ownership.
-C. AWS Glue DataBrew is a visual data preparation tool; it does not provide centralized governance or access control.
-D. Amazon RDS is a relational database service, not scalable data lake storage, and pairing Amazon EMR with RDS is less aligned with a mesh architecture than S3 with Athena.
Final answers: B and E.
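For context, a small boto3 sketch of the analysis side of this pattern: an Athena query run against S3 data that Lake Formation governs. The workgroup, database, table, and results location are assumptions.

```python
# Hypothetical sketch: query a domain's S3 data through Athena in a data mesh setup.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS orders FROM orders GROUP BY region",  # assumed table
    QueryExecutionContext={"Database": "sales_domain"},          # assumed Glue database
    WorkGroup="sales-analysts",                                  # assumed workgroup
    ResultConfiguration={"OutputLocation": "s3://athena-results-placeholder/"},
)
print("Query execution ID:", response["QueryExecutionId"])
```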

A data engineer maintains custom Python scripts that perform a data formatting process that many AWS Lambda functions use. When the data engineer needs to modify the Python scripts, the data engineer must manually update all the Lambda functions. The data engineer requires a less manual way to update the Lambda functions.

Which solution will meet this requirement?

A. Store a pointer to the custom Python scripts in the execution context object in a shared Amazon S3 bucket.
B. Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.
C. Store a pointer to the custom Python scripts in environment variables in a shared Amazon S3 bucket.
D. Assign the same alias to each Lambda function. Call each Lambda function by specifying the function's alias.
Answer: B
✅ Explanation
Lambda layers are designed specifically for sharing common code (such as Python scripts or libraries) across multiple AWS Lambda functions. By packaging the shared scripts as a Lambda layer, you can attach that layer to all of the Lambda functions. When the scripts need to change, you update the layer, publish a new layer version, and update the Lambda functions to use the new version. This centralizes and simplifies the update process and avoids manually updating each function's code base.
Why the other options are incorrect:
-A. The execution context object does not manage code imports, and Lambda functions cannot execute code directly from S3 without downloading and importing it dynamically, which adds complexity and latency.
-C. Environment variables are meant for configuration values, not for pointing to or importing executable code, so they do not help manage shared logic.
-D. Aliases manage versions of a single Lambda function; they do not manage shared code across multiple functions.
Final answer: B.
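A minimal boto3 sketch of that update flow, assuming a placeholder layer name, zip file, and function names: publish a new layer version containing the shared scripts, then point each consumer function at it.

```python
# Hypothetical sketch: publish shared formatting scripts as a Lambda layer and attach it.
import boto3

lam = boto3.client("lambda")

# Publish a new layer version from a zip containing the shared Python package (placeholder path).
with open("formatting_layer.zip", "rb") as f:
    layer = lam.publish_layer_version(
        LayerName="shared-formatting",
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.12"],
    )

# Point every consumer function (placeholder names) at the new layer version.
for function_name in ["ingest-orders", "ingest-clicks"]:
    lam.update_function_configuration(
        FunctionName=function_name,
        Layers=[layer["LayerVersionArn"]],  # replaces the function's layer list
    )
```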

A company created an extract, transform, and load (ETL) data pipeline in AWS Glue. A data engineer must crawl a table that is in Microsoft SQL
Server. The data engineer needs to extract, transform, and load the output of the crawl to an Amazon S3 bucket. The data engineer also must
orchestrate the data pipeline.
Which AWS service or feature will meet these requirements MOST cost-effectively?

A. AWS Step Functions
B. AWS Glue workflows
C. AWS Glue Studio
D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
Answer: B
✅ Explanation
✅ B. AWS Glue workflows
AWS Glue workflows are designed specifically to orchestrate ETL jobs, crawlers, and triggers in AWS Glue. They are tightly integrated with the Glue ecosystem and provide a cost-effective, serverless way to manage an ETL pipeline from the data source (Microsoft SQL Server) to the destination (Amazon S3). With AWS Glue workflows, you can crawl the Microsoft SQL Server table, run transformations in AWS Glue jobs, load data to Amazon S3, and monitor the entire pipeline through a visual interface. This makes Glue workflows the most cost-effective and integrated option for orchestrating Glue-based pipelines.
Why the other options are less suitable:
-A. AWS Step Functions can orchestrate Glue jobs, but it is general purpose and can incur additional cost compared to Glue workflows for a Glue-only pipeline.
-C. AWS Glue Studio is a visual interface for building and managing Glue jobs, not for orchestrating full workflows.
-D. Amazon MWAA suits complex workflows but is overkill and more expensive for a simple Glue-based ETL pipeline.
✅ Final answer: B. AWS Glue workflows
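A hedged boto3 sketch of such a workflow: a scheduled trigger starts the SQL Server crawler, and a conditional trigger runs the ETL job after the crawl succeeds. The workflow, crawler, job, and schedule names are placeholders.

```python
# Hypothetical sketch: orchestrate a crawler and an ETL job with a Glue workflow.
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="sqlserver-to-s3")

# Scheduled trigger starts the crawler inside the workflow.
glue.create_trigger(
    Name="start-crawl",
    WorkflowName="sqlserver-to-s3",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",                          # placeholder daily schedule
    Actions=[{"CrawlerName": "sqlserver-orders-crawler"}],  # placeholder crawler
    StartOnCreation=True,
)

# Conditional trigger runs the ETL job once the crawler succeeds.
glue.create_trigger(
    Name="run-etl",
    WorkflowName="sqlserver-to-s3",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "sqlserver-orders-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "orders-to-parquet"}],             # placeholder ETL job
    StartOnCreation=True,
)
```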

A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application.

Which solution will meet these requirements with the LEAST operational overhead?

A. Establish WebSocket connections to Amazon Redshift.
B. Use the Amazon Redshift Data API.
C. Set up Java Database Connectivity (JDBC) connections to Amazon Redshift.
D. Store frequently accessed data in Amazon S3. Use Amazon S3 Select to run the queries.
Answer: B
✅ Explanation
✅ B. Use the Amazon Redshift Data API.
The Amazon Redshift Data API is designed to let applications, including web-based applications, run SQL queries directly against Amazon Redshift without managing persistent connections, drivers, or connection pools. It is serverless and fully managed, it is well suited to real-time or near-real-time querying from applications, and it requires minimal operational overhead compared to JDBC or WebSocket setups. This makes it the best fit for integrating Redshift with a web-based trading application with low overhead.
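A minimal boto3 sketch of calling the Data API from application code; the cluster identifier, database, secret ARN, table, and SQL are assumptions.

```python
# Hypothetical sketch: run a query from an application through the Redshift Data API.
import boto3

rsd = boto3.client("redshift-data")

run = rsd.execute_statement(
    ClusterIdentifier="trading-cluster",                    # placeholder cluster
    Database="markets",                                      # placeholder database
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-placeholder",
    Sql="SELECT symbol, last_price FROM quotes WHERE symbol = :symbol",  # assumed table
    Parameters=[{"name": "symbol", "value": "AMZN"}],
)

# Poll for completion, then fetch rows; no JDBC driver or persistent connection is needed.
desc = rsd.describe_statement(Id=run["Id"])
if desc["Status"] == "FINISHED":
    rows = rsd.get_statement_result(Id=run["Id"])["Records"]
    print(rows)
```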

A company uses Amazon Athena for one-time queries against data that is in Amazon S3. The company has several use cases. The company must implement permission controls to separate query processes and access to query history among users, teams, and applications that are in the same AWS account.

Which solution will meet these requirements?

A. Create an S3 bucket for each use case. Create an S3 bucket policy that grants permissions to appropriate individual IAM users. Apply the S3 bucket policy to the S3 bucket.
B. Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.
C. Create an IAM role for each use case. Assign appropriate permissions to the role for each use case. Associate the role with Athena.
D. Create an AWS Glue Data Catalog resource policy that grants permissions to appropriate individual IAM users for each use case. Apply the resource policy to the specific tables that Athena uses.
Answer: B
✅ Explanation
Amazon Athena workgroups are designed to isolate query execution, access control, and query history among different users, teams, or applications, even within the same AWS account. Workgroups let you:
-Separate query history and results.
-Apply IAM-based access controls to limit actions per workgroup.
-Use tags for permission scoping and billing.
By creating a separate workgroup per use case and writing IAM policies with condition keys on the workgroup tags, you can restrict what each user, team, or application can see and do in Athena, including access to query history and saved queries.
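A hedged boto3 sketch of creating one tagged workgroup per use case; the workgroup name, tag, and results bucket are assumptions.

```python
# Hypothetical sketch: create a tagged Athena workgroup so IAM policies can scope access by tag.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="marketing-adhoc",                                   # placeholder workgroup per use case
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://athena-results-marketing/"},
        "PublishCloudWatchMetricsEnabled": True,
    },
    Tags=[{"Key": "team", "Value": "marketing"}],             # tag used in IAM conditions
)
# An IAM policy can then allow Athena actions only when the workgroup's aws:ResourceTag/team
# matches the caller's team, isolating query execution and history per use case.
```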

A data engineer needs to schedule a workflow that runs a set of AWS Glue jobs every day. The data engineer does not require the Glue jobs to run or finish at a specific time.

Which solution will run the Glue jobs in the MOST cost-effective way?

A. Choose the FLEX execution class in the Glue job properties.
B. Use the Spot Instance type in Glue job properties.
C. Choose the STANDARD execution class in the Glue job properties.
D. Choose the latest version in the GlueVersion field in the Glue job properties.
Answer: A
✅ Explanation
AWS Glue offers two execution classes:
-STANDARD: Prioritizes speed, provisioning resources quickly for near-immediate execution. It is more expensive.
-FLEX: Offers lower cost by using spare capacity in AWS. Jobs may start with some delay, which makes FLEX ideal for non-urgent, cost-sensitive jobs.
Because the data engineer does not require the jobs to run or finish at a specific time, the FLEX execution class reduces cost while still meeting the business requirement.
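A small boto3 sketch of starting an existing job on the FLEX execution class; the job name is a placeholder, and FLEX applies to Glue Spark jobs on supported Glue versions.

```python
# Hypothetical sketch: run a Glue job on spare capacity with the FLEX execution class.
import boto3

glue = boto3.client("glue")

glue.start_job_run(
    JobName="daily-format-job",   # placeholder job name
    ExecutionClass="FLEX",        # STANDARD prioritizes start-up time; FLEX trades latency for cost
)
```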

A data engineer needs to create an AWS Lambda function that converts the format of data from .csv to Apache Parquet. The Lambda function must run only if a user uploads a .csv file to an Amazon S3 bucket.

Which solution will meet these requirements with the LEAST operational overhead?

A. Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
B. Create an S3 event notification that has an event type of s3:ObjectTagging:* for objects that have a tag set to .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
C. Create an S3 event notification that has an event type of s3:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
D. Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set an Amazon Simple Notification Service (Amazon SNS) topic as the destination for the event notification. Subscribe the Lambda function to the SNS topic.
Answer: A
✅ Explanation
To run the Lambda function only when a .csv file is uploaded to the S3 bucket, with the least operational overhead, the most direct approach is an S3 event notification with:
-Event type: s3:ObjectCreated:* → triggers the notification when a new object is created.
-Filter rule: .csv suffix → ensures that only .csv uploads generate notifications.
-Destination: the Lambda function's ARN → invokes the function directly and eliminates the need for intermediary services such as SNS or SQS.
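A minimal boto3 sketch of that notification configuration; the bucket name and function ARN are placeholders, and the function also needs a resource-based permission allowing s3.amazonaws.com to invoke it.

```python
# Hypothetical sketch: trigger a Lambda function directly for .csv uploads to an S3 bucket.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="upload-bucket-placeholder",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:csv-to-parquet",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}},
            }
        ]
    },
)
```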