Project Purpose
The goal is to create an enterprise data lake that simplifies data management and meets growing consumption and analytics needs. The answer is a data lake on AWS, which provides secure, reliable, and highly available storage for structured and unstructured data of any size. The entire solution is built with an infrastructure as code (IaC) approach together with a CI/CD strategy.
Before Setup
The traditional data storage and analytics tools that the customer had been using lacked the agility, scalability, and flexibility needed to generate useful business insights. The existing system also required significant manual intervention and carried high operational overhead.
Approach finalized
An enterprise data lake solution is required to centrally manage data assets, automate ingestion, provide consistent consumption patterns, and implement security guardrails.
AWS provides Lake Formation, a native service that brings centralized access management to AWS and enables quicker setup of the data lake. The service helps with searching, sharing, transforming, analyzing, and governing data within an organization or with outside users.
Architecture overview
The diagram below shows the high-level AWS cloud architecture; the necessary AWS services are provisioned according to this design.
Solution Components
This solution uses the Terraform IaC tool to automatically create the entire platform on the AWS Cloud. A separate AWS CodePipeline is created for each environment (DEV/TEST/PROD), and each pipeline deploys the Terraform code for its environment. The Terraform code creates and configures the following AWS services to build the data lake solution on the AWS Cloud.
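As a rough sketch of the per-environment split (not necessarily the exact setup used here), each pipeline can point Terraform at its own remote state before running plan and apply; the state bucket, key, and region below are hypothetical:

```hcl
# Hypothetical per-environment backend: each pipeline (DEV/TEST/PROD)
# points Terraform at its own state file before running plan/apply.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket = "datalake-terraform-state" # assumed state bucket name
    key    = "dev/terraform.tfstate"    # one key per environment
    region = "us-east-1"
  }
}

provider "aws" {
  region = "us-east-1"
}
```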
· S3 Buckets
The solution uses Amazon S3 buckets, which provide virtually unlimited storage for files and datasets in the cloud. Separate S3 buckets (landing, raw, and curated) are created for the different stages of the data. The landing bucket holds all incoming unstructured data and serves as secure temporary storage for source files. After validation tests finish, data moves from the landing bucket to the raw bucket, which securely stores the data in its unaltered, unprocessed state. The actual data processing runs against the data in the raw bucket, and once the data has been analyzed and transformed it is loaded into the curated bucket.
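A minimal Terraform sketch of the three buckets might look like the following; the bucket names are placeholders and must be globally unique in practice:

```hcl
# Landing, raw, and curated buckets for the three stages of the data.
locals {
  zones = ["landing", "raw", "curated"]
}

resource "aws_s3_bucket" "datalake" {
  for_each = toset(local.zones)
  bucket   = "example-corp-datalake-${each.key}" # illustrative names
}

# Block all public access on every bucket.
resource "aws_s3_bucket_public_access_block" "datalake" {
  for_each                = aws_s3_bucket.datalake
  bucket                  = each.value.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```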
· AWS Transfer Family
This solution configures AWS Transfer Family, a fully managed and secure file transfer service that scales in real time to meet demand, enabling files to be transferred into and out of Amazon Simple Storage Service (Amazon S3) over secure protocols such as SFTP and FTPS, which also encrypt the data in transit.
All unstructured data from the different sources is loaded into the landing S3 bucket through the Transfer Family service.
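A minimal sketch of an SFTP endpoint in front of S3 could look like this; the user name and IAM role are illustrative assumptions:

```hcl
# SFTP endpoint backed by Amazon S3; identities are managed by the
# service itself in this sketch.
resource "aws_transfer_server" "sftp" {
  domain                 = "S3"
  protocols              = ["SFTP"]
  identity_provider_type = "SERVICE_MANAGED"
}

resource "aws_transfer_user" "ingest" {
  server_id = aws_transfer_server.sftp.id
  user_name = "ingest-user"
  role      = aws_iam_role.transfer_user.arn # role granting S3 access (defined elsewhere)

  # Land incoming files directly in the landing bucket.
  home_directory = "/${aws_s3_bucket.datalake["landing"].id}"
}
```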
· AWS KMS
AWS Key Management Service (AWS KMS) lets you create, maintain, and control cryptographic keys. In this solution, Amazon S3 objects are encrypted with AWS KMS keys, so the KMS keys protect the data at rest.
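A sketch of the key and the default bucket encryption, reusing the buckets from the earlier example:

```hcl
# Customer managed KMS key used to encrypt objects at rest.
resource "aws_kms_key" "datalake" {
  description         = "Data lake S3 encryption key"
  enable_key_rotation = true
}

# Default server-side encryption with the KMS key on every bucket.
resource "aws_s3_bucket_server_side_encryption_configuration" "datalake" {
  for_each = aws_s3_bucket.datalake
  bucket   = each.value.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.datalake.arn
    }
  }
}
```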
· Lake Formation
The solution configures Lake Formation, a fully managed service that makes it easier to build, secure, and manage data lakes. It collects and stores any type of data, at any scale, and at low cost, and it secures the data against unauthorized access. From a centralized location, it manages fine-grained access for internal and external consumers at the table and column level in a scalable way.
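A hedged example of registering the curated bucket with Lake Formation and granting column-level access; the database, table, column, and role names are assumptions:

```hcl
# Register the curated bucket as a Lake Formation data location.
resource "aws_lakeformation_resource" "curated" {
  arn = aws_s3_bucket.datalake["curated"].arn
}

# Grant an analyst role SELECT on only two columns of one table.
resource "aws_lakeformation_permissions" "analyst_select" {
  principal   = aws_iam_role.analyst.arn # consumer role (defined elsewhere)
  permissions = ["SELECT"]

  table_with_columns {
    database_name = "curated_db"           # assumed Glue database
    name          = "customer_orders"      # assumed table
    column_names  = ["order_id", "region"] # only these columns are visible
  }
}
```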
· AWS Glue
The solution uses AWS Glue to make data in the data lake discoverable and to run the extract, transform, and load (ETL) jobs that prepare data for analysis. Glue builds a single metadata repository, the Data Catalog, covering a range of diverse data sources. To keep track of changes, the solution creates a Glue crawler for each data source and schedules a daily scan. Glue workflows schedule and execute the ETL jobs that perform the data transformations.
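A sketch of a daily crawler over the raw bucket and an on-demand workflow trigger; the database, job, and role names are placeholders:

```hcl
resource "aws_glue_catalog_database" "raw" {
  name = "raw_db"
}

# Crawl the raw bucket once a day to keep the catalog current.
resource "aws_glue_crawler" "raw" {
  name          = "raw-daily-crawler"
  database_name = aws_glue_catalog_database.raw.name
  role          = aws_iam_role.glue.arn # Glue service role (defined elsewhere)
  schedule      = "cron(0 2 * * ? *)"   # daily at 02:00 UTC

  s3_target {
    path = "s3://${aws_s3_bucket.datalake["raw"].id}/"
  }
}

resource "aws_glue_workflow" "etl" {
  name = "raw-to-curated"
}

# On-demand trigger so the workflow can be started programmatically.
resource "aws_glue_trigger" "start" {
  name          = "start-etl"
  type          = "ON_DEMAND"
  workflow_name = aws_glue_workflow.etl.name

  actions {
    job_name = "raw-to-curated-job" # assumed Glue job name
  }
}
```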
· AWS Athena
This solution configures Athena to make it simple to analyze data directly in Amazon S3 using standard SQL queries. Because Athena integrates with the AWS Glue Data Catalog, users can query the curated data in a SQL fashion.
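A possible workgroup and sample named query; the output location, database, and table names are assumptions:

```hcl
# Workgroup that writes query results to a dedicated S3 prefix.
resource "aws_athena_workgroup" "analytics" {
  name = "datalake-analytics"

  configuration {
    result_configuration {
      output_location = "s3://${aws_s3_bucket.datalake["curated"].id}/athena-results/"
    }
  }
}

# Saved query against a hypothetical curated table.
resource "aws_athena_named_query" "orders_by_region" {
  name      = "orders-by-region"
  workgroup = aws_athena_workgroup.analytics.id
  database  = "curated_db"
  query     = "SELECT region, count(*) AS orders FROM customer_orders GROUP BY region;"
}
```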
· DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service. The Data Lake on AWS solution uses DynamoDB tables to store metadata for the data packages, configuration options, and user items.
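A minimal table definition under these assumptions (table and key names are illustrative):

```hcl
# On-demand table for data-package metadata.
resource "aws_dynamodb_table" "packages" {
  name         = "datalake-packages"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "package_id"

  attribute {
    name = "package_id"
    type = "S"
  }
}
```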
· AWS Lambda
The solution uses Lambda to start the Glue workflows that execute the Glue jobs and the subsequent crawlers. Lambda can run this orchestration code without the need to provision or manage servers.
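One way to wire this up, assuming the deployment package and execution role exist elsewhere; the function and file names are hypothetical:

```hcl
# Lambda that starts the Glue workflow when objects arrive in the raw bucket.
resource "aws_lambda_function" "start_etl" {
  function_name = "start-glue-workflow"
  role          = aws_iam_role.lambda.arn # execution role (defined elsewhere)
  runtime       = "python3.12"
  handler       = "handler.lambda_handler"
  filename      = "start_etl.zip" # assumed package calling glue:StartWorkflowRun

  environment {
    variables = {
      WORKFLOW_NAME = aws_glue_workflow.etl.name
    }
  }
}

# Allow S3 to invoke the function.
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.start_etl.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.datalake["raw"].arn
}

# Fire the Lambda on every new object in the raw bucket.
resource "aws_s3_bucket_notification" "raw_events" {
  bucket = aws_s3_bucket.datalake["raw"].id

  lambda_function {
    lambda_function_arn = aws_lambda_function.start_etl.arn
    events              = ["s3:ObjectCreated:*"]
  }

  depends_on = [aws_lambda_permission.allow_s3]
}
```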
· AWS SNS
Amazon SNS is a web service that makes it simple to set up, operate, and send notifications from the cloud. At every stage of data processing, SNS notifications about the various processing steps are sent to subscribers.
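A sketch of the topic and an email subscription; the endpoint address is a placeholder that must confirm the subscription:

```hcl
# Topic for pipeline status notifications.
resource "aws_sns_topic" "pipeline_events" {
  name = "datalake-pipeline-events"
}

resource "aws_sns_topic_subscription" "ops_email" {
  topic_arn = aws_sns_topic.pipeline_events.arn
  protocol  = "email"
  endpoint  = "data-ops@example.com" # placeholder recipient
}
```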
Summary
This post discussed setting up an enterprise data lake on the AWS Cloud. It is only the tip of the iceberg, though: the solution should be revisited whenever requirements, costs, team skills, or the services offered by cloud platforms change.
Written by – Manish Juneja