AWS Glue enables you to collect and integrate data from different sources, such as databases, flat files, streaming data, etc.

AWS Glue: What is it? What’s it for?


21 Jan 2024


Cloud & Data Engineering

Melanie

AWS Glue is a fully managed, scalable data processing service that enables users to run serverless ETL (Extract, Transform, Load) workflows, freeing them from the need to manage the underlying infrastructure.

A reminder about ETL processes

ETL is a process designed to guarantee data quality and availability. It is divided into three phases:

Extraction: retrieving data from various sources.
Transformation: cleansing, normalizing and modifying data to make it usable.
Loading: writing the transformed data to a final environment, such as a database or data warehouse.
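
To make the three phases concrete, here is a minimal sketch in plain Python, independent of AWS Glue. The sample data, the table name `orders` and the function names are all assumptions made up for illustration; a real pipeline would read from and write to actual data stores.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for a source file or database export.
RAW_CSV = """order_id,customer,amount
1, Alice ,19.90
2,bob,5
3,,12.50
"""


def extract(text):
    """Extraction: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: drop incomplete rows, normalize names, cast types."""
    cleaned = []
    for row in rows:
        customer = row["customer"].strip()
        if not customer:
            continue  # cleansing: skip rows without a customer
        cleaned.append(
            (int(row["order_id"]), customer.title(), float(row["amount"]))
        )
    return cleaned


def load(rows, connection):
    """Loading: write the transformed rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


def run_etl(text, connection):
    """Chain the three phases and return the number of rows in the target."""
    load(transform(extract(text)), connection)
    return connection.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


if __name__ == "__main__":
    target = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
    print(run_etl(RAW_CSV, target), "rows loaded")
```

Keeping each phase in its own function mirrors how ETL tools separate sources, transforms and targets, which makes each step easier to test on its own.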

How is AWS Glue structured?

AWS Glue jobs perform the necessary extraction, transformation and loading of data from a source to a destination. The following diagram shows the architecture of AWS Glue, and then we describe the various elements:

Data Catalog: this is the persistent metadata store in AWS Glue. It contains table definitions, job definitions, etc.
Database: a set of associated Data Catalog table definitions, organized into a logical group.
Crawler: a program that connects to a data store, reads its data to determine the structure (schema), and then uses this information to create table definitions in the Data Catalog.
Connection: a Data Catalog object that contains the information needed to connect to a particular data store.
Classifier: determines the data schema. AWS Glue provides classifiers for the most common file types, such as CSV, JSON, etc.
Data store: a repository for persistent data storage, such as an Amazon S3 bucket or a relational database.
Data source: the data store used as input for the transformation process.
Data target: the data store to which the transformed data is written.
Job: the business logic required to perform the ETL work, made up of a transformation script, data sources and data targets.
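
As an illustration of how a crawler, a database and a data store fit together, the sketch below prepares the arguments for the `create_crawler` call of the boto3 Glue client. The helper names, the crawler name, the role ARN, the database name and the S3 path are placeholders assumed for this example; the actual AWS call is kept in a separate function because it needs the boto3 package, valid credentials and an existing IAM role.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for the boto3 Glue `create_crawler` call.

    The crawler scans the data store at `s3_path` and writes the table
    definitions it infers into the Data Catalog database `database`.
    """
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        # Optional cron expression, e.g. "cron(0 2 * * ? *)" for 02:00 UTC daily.
        request["Schedule"] = schedule
    return request


def create_and_start_crawler(request):
    """Create the crawler and start it (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    request = build_crawler_request(
        name="sales-crawler",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        database="sales_db",
        s3_path="s3://example-bucket/sales/",
    )
    print(request)
```

Once a crawler like this has run, the tables it created in the Data Catalog can serve as data sources for jobs.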

AWS Glue features

AWS Glue allows you to fully manage your ETL processes through a variety of features, the most important of which are listed below:

Feature	Description
Data Collection and Integration	AWS Glue allows for the collection and integration of data from various sources, including databases, flat files, streaming data, etc.
Data Transformation	Provides a set of tools for transforming data, including data processing functions, filtering, sorting, joining, and more.
Data Catalog	Allows for the creation and management of a metadata catalog that facilitates data discovery, search, and analysis.
ETL Task Execution and Scheduling	AWS Glue enables the scheduling and execution of ETL tasks to process data at scale.
Workflow Automation	Offers workflow automation features to orchestrate complex tasks involving multiple steps.
Custom Jobs	Enables the creation of custom jobs to address specific use cases. Custom jobs can be created using common programming languages such as Python and Scala.
Error Handling	Allows for the management of errors encountered during data processing, such as syntax errors or connectivity issues.
Monitoring	AWS Glue provides monitoring features to track ETL job performance, detect errors and performance issues, and optimize resource utilization.
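
The scheduling, error handling and monitoring features can be sketched with the boto3 Glue client. The example below builds the arguments for a scheduled trigger (`create_trigger`) and polls a job run with `get_job_run` until it reaches a final state. The helper names, the polling interval and the example job and trigger names are assumptions for illustration, and the `glue` argument is expected to be a client created with `boto3.client("glue")`.

```python
import time

# Job run states after which AWS Glue no longer changes the run.
FINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def build_scheduled_trigger(name, job_name, cron_expression):
    """Assemble keyword arguments for the boto3 Glue `create_trigger` call."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,  # e.g. "cron(0 2 * * ? *)"
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def wait_for_job_run(glue, job_name, run_id, poll_seconds=30, max_polls=120):
    """Poll a job run until it finishes.

    Returns a (state, error_message) tuple; error_message is None when the
    run reported no error.
    """
    for _ in range(max_polls):
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in FINAL_STATES:
            return state, run.get("ErrorMessage")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job run {run_id} did not finish after {max_polls} polls")


def run_and_monitor(job_name):
    """Start a job and wait for it (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    return wait_for_job_run(glue, job_name, run_id)
```

Passing the client in as an argument keeps the polling logic easy to exercise with a stand-in object before running it against a real AWS account.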

Advantages and disadvantages of AWS Glue

Before you start learning and using AWS Glue, it’s important to weigh both its advantages and disadvantages:

Advantages	Disadvantages
Large-scale data management	High costs for small businesses or small-scale projects, despite being a fully managed service
Fast data processing	Steep learning curve
Integration with other AWS services	Limited workflow customizations
Support for multiple programming languages	Requires expertise in data engineering
Fully managed platform
Built-in metadata catalog

Conclusion

As you’ve probably gathered by now, AWS Glue is a fully managed ETL workflow service from Amazon Web Services. Its power and flexibility come with a steep learning curve, and setting it up to meet your needs can require a substantial investment.

