기사

A Beginner's Guide to Orchestrating Pipelines With Apache Airflow™ and Teradata

Explore the process of handling complex data workflows.

Satish Chinthanippu
Satish Chinthanippu
2024년 4월 30일 6 최소 읽기
Image illustrative of workflow management

Workflow orchestration is a key architectural concern in data engineering. Tools as simple as the Cron scheduler in Unix-like operating systems, such as Linux, can be used for simple workflows. However, Cron jobs might be laborious to implement when the intention is the orchestration of a complex workflow.

For these more complex workflows, Apache Airflow is a commonly used orchestration tool. As part of Teradata’s efforts to provide integration tools to our developer community, we've developed a provider, which allows the querying of Teradata Vantage™ databases from Airflow workflows.

Airflow and the Airflow Teradata provider

“Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows”. An Airflow workflow is a sequence of operations, defined programmatically from start to finish with consideration of the dependencies between the involved operations. In the context of data engineering, Airflow allows users to programmatically define, schedule, and monitor data pipelines using Python.

The key concept of Airflow is the definition of workflows as code. The key abstraction included in Airflow workflows is the directed acyclic graph (DAG). A DAG allows the orchestration of tasks with clear reference to their dependencies and order of execution.

Airflow communicates with external systems and tools through external packages called providers. The Airflow Teradata provider enables the integration of Airflow with Vantage. The Teradata provider implements hooks and operators that allow users to seamlessly run Teradata SQL scripts for data definition, data manipulation, querying of data, and execution of stored procedures from within Airflow workflows.
 

Example project

The example project illustrates the orchestration of a data pipeline with Airflow and the Airflow Teradata provider. The example workflow consists of a DAG that includes both data definition and data manipulation statements, and the creation and execution of Teradata stored procedures.

The example project consists of three general steps:

  • Installation and launch of Airflow
  • Configuration of a Teradata connection in Airflow
  • Data pipeline definition through an Airflow DAG using the Teradata provider
     

Note: Airflow is supported on Linux, Mac, and Windows WSL. To follow the example, you need access to a system with a Linux distro, macOS, or the Windows Subsystem for Linux (WSL) utility installed. On Windows, all code below should be executed in the context of the WSL utility.


​​​​Prerequisites


​​​​Step 1: Installation and launch of Airflow

Airflow is a Python package, and it’s installed and managed as any other Python package.  

  1. Create and activate a Python virtual environment.

python3 -m venv airflow-env
source airflow-env/bin/activate

  1. Set Airflow Home. It is recommended to set the default working directory of Airflow as `~/airflow`.

export AIRFLOW_HOME=~/airflow

  1. Install Airflow with pip.

  • To ensure the development environment's compatibility with the necessary providers, we must set some environment variables, including one pointing to a constraints file. These environment variables take care of managing the versioning of the needed packages. 

AIRFLOW_VERSION=2.9.0

PYTHON_VERSION= " $(python3 --version | cut -d " " -f 2 | cut -d "." -f 1-2)"

CONSTRAINT_URL= "https://raw.githubusercontent.com/apache/airflow/constraints- ${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

  1. Install Airflow Teradata provider.

pip install apache-airflow-providers-teradata

  1. Run Airflow standalone. Run below commands on terminal.

export AIRFLOW__CORE__LOAD_EXAMPLES=false
airflow standalone

  • The environment variable will clean other Airflow samples, so we worked specifically with our sample scenario.
  • The terminal screen will show a message notifying you that Airflow is ready. Just below you will find the password to log in to the Airflow console. Take note of that password.

Airflow console credentials displayed in the console logs.
Airflow console credentials displayed in the console logs.

  1. Access the Airflow UI.

  • Visit http://localhost:8080 and log in with the admin account details shown in the terminal.
  • Make sure your local host connection is under the http:// prefix and not https://.
  • The username to log in to the Airflow console is `admin` by default. The password is found in the terminal logs, as mentioned above.

Step 3: Configure a Teradata connection in Airflow

In the Airflow UI:

  1. Select Connections under the admin menu.  
  2. Select the + sign to create a new connection. 
  3. Provide the required configuration details for the Teradata connection. 
  • Connection id: Unique ID for the Teradata connection—in this example, `teradata_connector`.
  • Connection Type: From the drop-down, choose Teradata.
  • Database Server URL (required): Teradata instance hostname—in our example, the host of URI provided by ClearScape Analytics Experience.
  • Database (optional): Database that the connection will use by default.
  • Login (required): The username used by the connection.
  • Password (required): The password used by the connection.
  1. Select Save.

Airflow connection UI to define Teradata connection details. 

Step 4: Sample data pipeline orchestration with Airflow using the Teradata provider

As mentioned, Airflow workflows are Python files that define a DAG data structure through the Airflow DAG object. An Airflow DAG, in its most basic form, holds the DAG metadata definition (DAG ID, start date, schedule, etc.) set when the DAG object is initialized, a set of tasks, and the tasks execution sequence.

In our example, we define the following tasks, all provided by the Teradata operator.

  • Create a table
  • Insert sample data into the table
  • Retrieve data from the table
  • Retrieve data from the table by applying a WHERE clause
  • Drop the table
  • Create and execute a stored procedure

Setting the example DAG:

  1. Under the directory defined as `AIRFLOW_HOME` in your Airflow installation, create a directory named `dags`
  2. Under the `dags` directory, create a Python file—in this case, with the name `example_teradata_pipeline.py`
  3. Paste the content below to your newly created Python file
     

"""
Example use of Teradata operator.
"""


from __future__ import annotations
import datetime
from airflow import DAG
from airflow.providers.teradata.operators.teradata import TeradataOperator

DAG_ID = "example_teradata"
CONN_ID = "teradata_connector"

with DAG(
dag_id=DAG_ID,
start_date=datetime.datetime(2024, 2, 2),
schedule="@once",
catchup=False,
default_args={"conn_id": CONN_ID},
) as dag:

create_table = TeradataOperator(
task_id="create_table",
sql=r"""
CREATE TABLE Users, FALLBACK (
username varchar(50),
description varchar(256)
);
"""
,
)

populate_table = TeradataOperator(
task_id="populate_table",
sql=r"""
INSERT INTO Users (username, description)
VALUES ( 'Danny', 'Musician');
INSERT INTO Users (username, description)
VALUES ( 'Simone', 'Chef');
INSERT INTO Users (username, description)
VALUES ( 'Lily', 'Florist');
INSERT INTO Users (username, description)
VALUES ( 'Tim', 'Pet shop owner');
"""
,
)

get_data = TeradataOperator(
task_id="get_data",
sql=r"""
SELECT * FROM Users;
"""
,
)

get_data_with_filter = TeradataOperator(
task_id="get_data_with_filter",
sql=r"""
SELECT * FROM Users WHERE {{ params.column }}='{{ params.value }}';
"""
,
params= {"column": "username", "value": "Danny"},
)

drop_table = TeradataOperator(
task_id="drop_table",
sql=r"""
DROP TABLE Users;
"""
,
)

create_sp = TeradataOperator(
task_id="create_sp",
sql=r"""replace procedure example_sp(in p1 integer, inout p2 integer)
begin
set p2 = p1 + p2;
end;
"""
,
)

execute_sp = TeradataOperator(
task_id="execute_sp",
conn_id=CONN_ID,
sql ="CALL example_sp(2,2) ",
)

(
create_table
>> populate_table
>> get_data
>> get_data_with_filter
>> drop_table
>> create_sp
>> execute_sp
)


Running the DAG: 

  1. Select DAGs in the menu, and you’ll see a list of DAGs 

Airflow DAGs UI, displays DAGs. 

  • If you don’t see your DAG, you might need to exit the server, refresh the list manually with `airflow list_dags`, and restart the server 
  1. Select the play icon in your DAG 

  • Follow the execution by selecting the DAG ID to see the graph 
  • Rectangles in the graph represent the DAG’s tasks 
  • If a task fails, review the task logs by selecting the corresponding square and selecting logs in the menu 

Airflow DAG Graph View. 


References

To learn more, check out:

Tags

약 Satish Chinthanippu

Satish Chinthanippu is a senior software engineer at Teradata, specializing in integrating Teradata with third-party developer tools. He has played a key role in developing Teradata connectors for data engineering tools such as Airbyte and Apache Airflow™. Prior to joining Teradata, Chinthanippu was a software engineer at Cyient, where he focused on developing desktop and web applications using the Java stack.

모든 게시물 보기Satish Chinthanippu

알고 있어

테라데이트의 블로그를 구독하여 주간 통찰력을 얻을 수 있습니다



I consent that Teradata Corporation, as provider of this website, may occasionally send me Teradata Marketing Communications emails with information regarding products, data analytics, and event and webinar invitations. I understand that I may unsubscribe at any time by following the unsubscribe link at the bottom of any email I receive.

Your privacy is important. Your personal information will be collected, stored, and processed in accordance with the Teradata Global Privacy Statement.