max_active_runs in Airflow

In Apache Airflow, if you want to ensure that a new run of a DAG doesn’t start before the previous one has completed, you can use the max_active_runs parameter in the DAG definition. Setting this parameter to 1 ensures that only one instance of the DAG is running at any given time.

Setting max_active_runs for a DAG

Here’s an example of how to set up a DAG with max_active_runs=1:

from airflow import DAG
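# Note: this import path is deprecated in Airflow 2.x; on Airflow 2.3+ the
# equivalent is `from airflow.operators.empty import EmptyOperator`.
from airflow.operators.dummy_operator import DummyOperator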
from datetime import datetime

# Define the DAG
with DAG(
    'my_dag',
    description='A sample DAG',
    schedule_interval='@daily',      # Set your schedule
    start_date=datetime(2023, 1, 1),
    catchup=False,
    max_active_runs=1                # This prevents new runs before the previous run completes
) as dag:

    # Define tasks
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    # Define task dependencies
    start >> end

Explanation of max_active_runs

  • max_active_runs=1: Ensures that only one DAG run is active at any given time. If there is already a run in progress, any new scheduled or triggered runs will be queued until the current one finishes.
  • Where to Use: This option is set directly in the DAG definition.
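To make the queueing behaviour concrete, here is a minimal sketch in plain Python. It is a toy model, not Airflow's scheduler: the function name simulate_runs, the fixed run_duration, and the first-come-first-served ordering are all assumptions made for illustration.

```python
def simulate_runs(arrival_times, run_duration, max_active_runs=1):
    """Toy model (not Airflow code): return (start, end) per run, in arrival order.

    A run starts when it arrives if fewer than max_active_runs runs are
    active; otherwise it waits until the earliest active run finishes.
    """
    active_end_times = []  # end times of runs that currently hold a slot
    schedule = []
    for arrival in sorted(arrival_times):
        # Free the slots of runs that have already finished by this time.
        active_end_times = [t for t in active_end_times if t > arrival]
        if len(active_end_times) < max_active_runs:
            start = arrival  # a slot is free, start immediately
        else:
            # All slots are taken: wait for the earliest one to free up.
            active_end_times.sort()
            start = active_end_times.pop(0)
        end = start + run_duration
        active_end_times.append(end)
        schedule.append((start, end))
    return schedule


# Three runs arrive at t=0, 1, 2 and each takes 3 time units.
print(simulate_runs([0, 1, 2], run_duration=3, max_active_runs=1))
# [(0, 3), (3, 6), (6, 9)]  -> strictly sequential, no overlap
print(simulate_runs([0, 1, 2], run_duration=3, max_active_runs=2))
# [(0, 3), (1, 4), (3, 6)]  -> up to two runs overlap
```

With max_active_runs=1, each run in the model starts only after the previous one ends, which is the behaviour the parameter is meant to give you for real DAG runs.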
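Two related settings are often mentioned alongside max_active_runs, but neither prevents overlapping DAG runs on its own:

  1. concurrency (DAG-Level Task Limit): You can set concurrency on the DAG to cap the number of task instances that run simultaneously across all active runs of that DAG. This is helpful if you want to limit task parallelism, but it doesn’t prevent overlapping DAG runs. In Airflow 2.2 and later this parameter is named max_active_tasks, and concurrency is deprecated.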

    DAG(
        'my_dag',
        concurrency=1,  # Limits task concurrency
        ...
    )
    
  2. depends_on_past=True (Task-Level Control): Setting depends_on_past=True on a task makes each instance of that task wait until the same task in the previous DAG run has succeeded or been skipped. This doesn’t limit the number of active DAG runs, but it forces individual tasks to execute in run order, which can add an extra layer of control.

    DummyOperator(
        task_id='my_task',
        depends_on_past=True,
        dag=dag
    )
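The depends_on_past rule can be sketched as a small predicate. This is an illustrative model only, not Airflow's implementation: the function name can_run and the use of plain strings for task states are assumptions, and None stands in for "there is no earlier run of this task".

```python
def can_run(previous_state, depends_on_past):
    """Toy check (not Airflow code): may this task instance start?

    previous_state is the state of the same task in the previous DAG run,
    or None if there is no previous run.
    """
    if not depends_on_past:
        return True  # the task ignores its own history
    if previous_state is None:
        return True  # the very first run has nothing to wait for
    # Otherwise the previous instance must have succeeded or been skipped.
    return previous_state in {"success", "skipped"}


for state in (None, "success", "skipped", "running", "failed"):
    print(state, can_run(state, depends_on_past=True))
```

Note that a failed previous instance blocks the task in later runs until that instance is cleared or marked successful, so depends_on_past is stricter than max_active_runs=1, which only waits for the previous run to finish.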
    

Summary

To prevent concurrent DAG runs, max_active_runs=1 is the most direct option. Airflow queues any new DAG runs until the current one completes, so executions stay sequential and never overlap.
