AWS Glue
- Fully managed ETL service.
- AWS Glue is a cloud-optimized extract, transform, and load (ETL) service.
- Consists of:
- A central metadata repository: the Glue Data Catalog.
- A Spark ETL engine.
- A flexible scheduler.
- It lets you organize, locate, move, and transform all your data sets across your business so you can put them to use.
- Glue is serverless.
- Point Glue at your ETL jobs and hit run.
- We don't need to provision, configure, or spin up servers, or manage their lifecycle.
- AWS Glue offers a fully managed, serverless ETL tool. This removes the overhead and barriers to entry when an ETL service is required in AWS.
- Requirements
- S3 bucket
- Data folder and subfolders.
- Upload data
- Script location
- Temp directory
- IAM role
- Glue provides crawlers with automatic schema inference for your semi-structured and structured data sets.
- Crawlers automatically discover all your data sets, identify your file types, extract the schema, and store all this information in a centralized metadata catalog for later queries and analysis.
- Glue automatically generates the scripts that you need to extract, transform, and load your data from source to target, so we don't have to start from scratch.
- Set up
- Create a new bucket.
- Create a folder named data.
- Create a folder for the temp directory (temporary files).
- Create a folder named scripts.
- Inside data, create a folder named customer_database.
- In the customer_database folder, create a customer_csv folder.
- In customer_csv, create a folder named data_load_{current timestamp}.
- Next, we need a CSV file to load data from.
- Upload the CSV to S3 under data.
- Go to IAM.
- Create a new role for Glue to access the S3 bucket (a minimal boto3 sketch follows below).
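A minimal boto3 sketch of creating such a role; the role name, the AWS-managed policies, and the broad S3 access are assumptions for the demo rather than values from the console walk-through (scope S3 access down to your bucket in a real setup).

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Hypothetical role name; use your own naming convention.
iam.create_role(
    RoleName="glue-etl-demo-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Lets AWS Glue read/write the demo S3 bucket",
)

# AWS-managed service policy for Glue, plus S3 access for the demo.
iam.attach_role_policy(
    RoleName="glue-etl-demo-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
iam.attach_role_policy(
    RoleName="glue-etl-demo-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)
```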
- AWS Glue Data Catalog
- Persistent metadata store for:
- Data location
- Schema
- Data Types
- Data classification.
- It is a managed service that lets you store, annotate, and share metadata, which can be used to query and transform data.
- There is one AWS Glue Data Catalog per AWS region.
- Identity and Access Management (IAM) policies control access.
- Can be used for data governance.
- A Glue Data Catalog has the following components:
- AWS Glue Databases
- A set of associated Data Catalog table definitions, organized into a logical group.
- Create a new database (a boto3 sketch follows below).
- Give it a name.
- Give the S3 URI of the folder where you want to store the database.
- Give a description.
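A boto3 equivalent of these console steps could look like the sketch below; the bucket name is a placeholder, while customer_database matches the folder created earlier.

```python
import boto3

glue = boto3.client("glue")

# Creates a Glue Data Catalog database; Name, Description, and
# LocationUri mirror the console inputs (placeholder values).
glue.create_database(
    DatabaseInput={
        "Name": "customer_database",
        "Description": "Demo database for customer CSV data",
        "LocationUri": "s3://my-demo-bucket/data/customer_database/",
    }
)
```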
- AWS Glue Tables
- The metadata definition that represents your data.
- The data resides in its original store.
- This is just a representation of the schema.
- We can add tables using a crawler, manually, or from an external schema.
- Add a table manually (a boto3 sketch follows this list):
- Enter the table name.
- Select the database.
- Select the source type (S3).
- Select the S3 path.
- Select the classification as CSV.
- Select the delimiter.
- Add the columns from the source to be mapped.
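A hedged boto3 sketch of the same manual table definition; the column names and types are illustrative, not the actual schema of the CSV.

```python
import boto3

glue = boto3.client("glue")

# Manually registers a CSV table in the catalog. The path, database
# name, and columns are placeholders for the console inputs.
glue.create_table(
    DatabaseName="customer_database",
    TableInput={
        "Name": "customer_csv",
        "Parameters": {"classification": "csv", "delimiter": ","},
        "StorageDescriptor": {
            "Location": "s3://my-demo-bucket/data/customer_database/customer_csv/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
            "Columns": [
                {"Name": "customer_id", "Type": "bigint"},
                {"Name": "name", "Type": "string"},
                {"Name": "email", "Type": "string"},
            ],
        },
    },
)
```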
- Add tables using a crawler (a boto3 sketch follows this list).
- Give the crawler a name.
- Crawler source type.
- Repeat crawls of S3 data stores.
- Choose the data store.
- Choose the path.
- Add another data store.
- Choose the IAM role that allows Glue to access S3.
- Set the frequency.
- Configure the output:
- Database
- Table prefix
- Select and run the crawler; when it finishes, we see that the table name under Tables is the same as the S3 folder name.
- Click on the table and we will see the table data.
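A boto3 sketch of the crawler setup described above; the crawler name, role, schedule, and table prefix are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Defines a crawler over the raw S3 path. Role, bucket, prefix, and
# the daily cron schedule are placeholders for the console choices.
glue.create_crawler(
    Name="customer-csv-crawler",
    Role="glue-etl-demo-role",
    DatabaseName="customer_database",
    TablePrefix="raw_",
    Targets={"S3Targets": [{"Path": "s3://my-demo-bucket/data/customer_database/"}]},
    Schedule="cron(0 2 * * ? *)",  # omit Schedule to run on demand only
)

# Run it once on demand; the table it creates is named after the
# S3 folder (plus the prefix above).
glue.start_crawler(Name="customer-csv-crawler")
```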
- AWS Glue Connections
- A Data Catalog object that contains the properties required to connect to a particular data store.
- Adding a connection (a boto3 sketch follows this list):
- Give it a name.
- Select the connection type (e.g., JDBC).
- Select the database engine.
- Select the instance.
- Select the DB name.
- Username
- Password
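A boto3 sketch of registering such a JDBC connection; the endpoint, database, and credentials are placeholders (in practice, store the password in AWS Secrets Manager and add the VPC/subnet/security-group requirements your data store needs).

```python
import boto3

glue = boto3.client("glue")

# Registers a JDBC connection in the Data Catalog; all values below
# are placeholders for the console inputs.
glue.create_connection(
    ConnectionInput={
        "Name": "customers-mysql-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://my-rds-endpoint:3306/customers",
            "USERNAME": "glue_user",
            "PASSWORD": "change-me",
        },
    }
)
```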
- AWS Glue Crawlers
- A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema of your data, and then creates metadata tables in the AWS Glue Data Catalog.
- Schema registries.
- AWS Glue ETL has the following components:
- Blueprints
- Workflows
- AWS Glue Jobs
- The business logic that is required to perform ETL work. It is composed of a transformation script, data sources, and data targets. Job runs are initiated by triggers that can be scheduled or fired by events.
- Adding a job:
- Click Add job.
- Give it a name.
- Give it an IAM role.
- Choose the S3 path for scripts.
- Give the path to the temp directory.
- Choose data source.
- Choose transformation type.
- Choose data target.
- Choose the data target directory.
- You will see the schema mapping from source to target on the next screen.
- Save the job and edit the script (a sketch of a generated script follows this list).
- Save the script and run the job.
- We will have our Parquet file in S3.
- We can view this file in Athena.
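The script Glue generates is a PySpark script along these lines; the sketch below assumes the customer_database/customer_csv catalog table created earlier, illustrative column mappings, and a hypothetical target path.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name argument and
# initialise the Glue/Spark contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table that the crawler registered in the catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="customer_database",
    table_name="customer_csv",
)

# Map source columns to target columns/types (illustrative columns).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("customer_id", "string", "customer_id", "long"),
        ("name", "string", "name", "string"),
        ("email", "string", "email", "string"),
    ],
)

# Write the result as Parquet to the target folder in S3.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-demo-bucket/target/customer_parquet/"},
    format="parquet",
)

job.commit()
```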
- AWS Glue Triggers
- Initiates an ETL job. Triggers can be defined based on a scheduled time or an event.
- Adding a trigger (a boto3 sketch follows this list):
- Give it a name.
- Give the trigger type.
- Give the frequency.
- Start hour, minute.
- Select job
- Finish and activate trigger.
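A boto3 sketch of the same scheduled trigger; the trigger name, job name, and cron expression are placeholders.

```python
import boto3

glue = boto3.client("glue")

# A scheduled trigger that starts the job every day at 03:00 UTC.
glue.create_trigger(
    Name="daily-customer-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{"JobName": "customer-csv-to-parquet"}],
    StartOnCreation=True,  # equivalent to "activate trigger"
)
```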
- AWS Glue Dev Endpoints
- A development endpoint is an environment that you can use to develop and test your AWS Glue scripts.
- It’s essentially an abstracted cluster.
- NB the cost can add up.
- AWS Glue Studio.
- AWS Glue security has the following components:
- Security configurations.
- We can query the data using Amazon Athena.
- Create a folder for Athena results in S3.
- Go to Settings and set the query result location to the folder path in the bucket.
- Select data source.
- Select database.
- Select the table and click Preview table; we will get the results.
- Partitions in AWS Glue
- Folders where data is stored on S3, which are physical entities, are mapped to partitions, which are logical entities, i.e. columns in the Glue table.
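- For example (hypothetical layout): data stored under s3://my-demo-bucket/sales/year=2024/month=01/ and s3://my-demo-bucket/sales/year=2024/month=02/ can be crawled into a single sales table with year and month as partition columns, so queries can prune down to only the folders they need.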
- Example
- We have some logs stored in S3 and we have user data in our RDS.
- We need to summarize and move all this data to our Redshift data warehouse so that we can analyze and understand which demographics are actually contributing to our adoption.
- We can point crawlers at our S3 and RDS data; they will infer the data sets and data structures inside the files and store this information in a catalog.
- We can then point Glue at the tables in the catalog, and it will automatically generate the scripts that are needed to extract and transform the data into the Redshift data warehouse.
Build an ETL Pipeline using AWS Glue and Step Functions
- Typical Glue ETL pipeline
- We can have data in a JDBC data source.
- An AWS Glue job is used to ETL the data into S3.
- S3 is the raw bucket area, where the data is immutable; we don't do any business transformation on this data, we just load it from the source.
- We use a Glue crawler to catalog our data so the data is discoverable and sourceable.
- Once our data is in the S3 raw bucket, we use a Glue job to convert our data into cleansed data.
- The cleansing operation depends on business needs: it can be standardizing column names, standardizing currency formats or date formats, or removing duplicate data.
- Once we have cleansed the data, we can use a Glue crawler again to catalog the data.
- Then we run a Glue job again to convert the cleansed data into curated data.
- Curated data is created based on the use case only, like creating reports, dashboards, or machine learning.
- Again, we use a Glue crawler to catalog our data.
- Methods of building pipelines
- AWS event-based framework:
- AWS Lambda
- Amazon CloudWatch
- Amazon DB
- Amazon EventBridge
- AWS Glue workflows
- AWS Step Functions
- Pipeline workflow using Step Functions
- Step Functions-based pipeline benefits:
- Reusability of jobs and crawlers.
- Observability:
- Built-in, out-of-the-box visibility.
- Step-level tracking of the pipeline.
- Development effort:
- Early wire.
- Consistent development effort throughout.
- Integration with AWS services:
- Integration possible.
- Direct AWS service integration in the Step Functions workflow.
- Development environment:
- Step Functions visual workflow designer.
- Starting a Glue job (a minimal state-machine sketch follows this list):
- Job parameters:
- Job name
- Job metadata
- Synchronous or asynchronous run
- Next state
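A minimal sketch of such a state, written in Amazon States Language as a Python dict and deployed with boto3; the job name, state name, argument, and role ARN are assumptions. The .sync suffix makes the run synchronous (the state waits for the job to finish); drop it for an asynchronous, fire-and-forget run.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Minimal state machine with a single synchronous Glue job task.
definition = {
    "StartAt": "RunRawIngestJob",
    "States": {
        "RunRawIngestJob": {
            "Type": "Task",
            # ".sync" = wait for the job run to finish before moving on;
            # use "arn:aws:states:::glue:startJobRun" for fire-and-forget.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "customer-csv-to-parquet",
                "Arguments": {"--run_date.$": "$.run_date"},  # job metadata/arguments
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="glue-job-demo",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-glue-demo-role",
)
```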
- Starting a Glue crawler
- There is no out-of-the-box task for a Glue crawler.
- Crawler execution is a long-running process.
- Approach 1
- Use a task to invoke the crawler API.
- Event-based execution control using “wait for task token”
- Approach 2
- Leverage a Lambda task with custom code.
- You need two Lambda functions:
- Start crawler.
- Get status of crawler.
- The crawler goes from the READY state to the STARTING state.
- From STARTING to RUNNING.
- From RUNNING to STOPPING.
- From STOPPING back to READY.
- Once the crawler status turns back to READY, we move to the next state.
- We can invoke the Lambda functions from the Step Functions workflow to start and poll the crawler.
- Inside the Lambda functions, we code as follows (full sketches follow this list):
- Use boto3.client('glue').
- Use the start_crawler method.
- For getting the status of the crawler, we can use a Lambda function that gets the crawler name as input from the Step Functions workflow.
- get_crawler(Name=target)
- return response["Crawler"]["State"]
- We are using Python in the Lambda functions.
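Minimal sketches of the two Lambda handlers described above, assuming the Step Functions input carries the crawler name in a crawler_name field (the field name is an assumption):

```python
import boto3

glue = boto3.client("glue")


# Lambda 1: start the crawler named in the Step Functions input.
def start_crawler_handler(event, context):
    target = event["crawler_name"]
    glue.start_crawler(Name=target)
    return {"crawler_name": target}


# Lambda 2: return the crawler's current state so a Choice state can
# loop (Wait -> check again) until it is back to READY.
def get_crawler_status_handler(event, context):
    target = event["crawler_name"]
    response = glue.get_crawler(Name=target)
    return {
        "crawler_name": target,
        "state": response["Crawler"]["State"],  # READY | RUNNING | STOPPING
    }
```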
- Practical
- Create folders in S3 for the raw data, cleansed data, and scripts as follows:
- Cleansed
- Raw
- Scripts.
- Create tables in the catalog depending on the source.
- Create two crawlers: one to catalog the raw data and one to catalog the cleansed data.
- Create two jobs: one to ingest data from RDS (raw data) and a second to cleanse the data.
- Create two Lambda functions:
- Start crawler function.
- Get crawler status function.
- Create the state machine (a sketch of the definition follows below).
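A sketch of what the state machine definition could look like for this practical, combining synchronous Glue job tasks with a Lambda-based poll loop for the raw-data crawler; every job, crawler, function, and state name here is an assumption, and the cleansed-data crawler would follow the same start/wait/check pattern.

```python
# Amazon States Language definition built as a Python dict; pass it
# (json.dumps'd) to stepfunctions.create_state_machine as shown earlier.
definition = {
    "StartAt": "IngestRawData",
    "States": {
        # Glue job 1: pull data from RDS into the raw S3 bucket.
        "IngestRawData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "rds-to-raw-job"},
            "Next": "StartRawCrawler",
        },
        # Lambda that calls glue.start_crawler for the raw data.
        "StartRawCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
                "FunctionName": "start-crawler",
                "Payload": {"crawler_name": "raw-data-crawler"},
            },
            "ResultPath": None,  # discard the invoke result
            "Next": "WaitForRawCrawler",
        },
        "WaitForRawCrawler": {"Type": "Wait", "Seconds": 60, "Next": "GetRawCrawlerStatus"},
        # Lambda that calls glue.get_crawler and returns its state.
        "GetRawCrawlerStatus": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
                "FunctionName": "get-crawler-status",
                "Payload": {"crawler_name": "raw-data-crawler"},
            },
            "ResultSelector": {"state.$": "$.Payload.state"},
            "ResultPath": "$.crawler",
            "Next": "IsRawCrawlerReady",
        },
        # Loop (wait -> check) until the crawler is back to READY.
        "IsRawCrawlerReady": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.crawler.state", "StringEquals": "READY", "Next": "CleanseData"}
            ],
            "Default": "WaitForRawCrawler",
        },
        # Glue job 2: raw -> cleansed.
        "CleanseData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "raw-to-cleansed-job"},
            "End": True,
        },
    },
}
```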