max_active_runs in Airflow

In Apache Airflow, if you want to ensure that a new run of a DAG doesn’t start before the previous one has completed, you can use the max_active_runs parameter in the DAG definition. Setting this parameter to 1 ensures that only one instance of the DAG is running at any given time.

Setting max_active_runs for a DAG

Here’s an example of how to set up a DAG with max_active_runs=1:

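from datetime import datetime

from airflow import DAG
# Deprecated import path (it still works on Airflow 2.x with a warning);
# Airflow 2.3+ provides EmptyOperator in airflow.operators.empty instead.
from airflow.operators.dummy_operator import DummyOperator

# Define the DAG
with DAG(
    'my_dag',
    description='A sample DAG',
    schedule_interval='@daily',      # Set your schedule (the argument is named `schedule` from Airflow 2.4)
    start_date=datetime(2023, 1, 1),
    catchup=False,
    max_active_runs=1,               # At most one active run; later runs wait for it to finish
) as dag: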

    # Define tasks
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    # Define task dependencies
    start >> end

Explanation of max_active_runs

  • max_active_runs=1: Ensures that only one DAG run is active at any given time. If a run is already in progress, new scheduled or triggered runs are queued until it finishes.
  • Where to Use: This option is set directly in the DAG definition.

Two related settings are sometimes confused with max_active_runs:

  1. concurrency (DAG-Level Limit on Tasks): Setting concurrency on the DAG caps the number of tasks that can run simultaneously within the DAG. This is helpful for limiting parallelism across tasks, but it doesn't prevent overlapping DAG runs. Since Airflow 2.2 this parameter is named max_active_tasks, and concurrency is deprecated.

    DAG(
        'my_dag',
        concurrency=1,  # Limits task concurrency
        ...
    )
    
  2. depends_on_past=True (Task-Level Control): Setting depends_on_past=True on a task means it runs only if the same task succeeded (or was skipped) in the previous DAG run. This doesn't limit the number of DAG runs; it chains each task to its own history, which adds an extra layer of control on top of max_active_runs.

    DummyOperator(
        task_id='my_task',
        depends_on_past=True,
        dag=dag
    )
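
The depends_on_past rule can be paraphrased as a small predicate. This is a plain-Python sketch for illustration only, not Airflow code; the function name may_start and the string states are hypothetical stand-ins for Airflow's task instance states.

```python
def may_start(previous_state, depends_on_past):
    """Decide whether a task may start, given the state of the same task
    in the previous DAG run (None means there is no previous run)."""
    if not depends_on_past or previous_state is None:
        # Without depends_on_past, or on the very first run, history is ignored.
        return True
    # With depends_on_past, the previous instance must have succeeded or been skipped.
    return previous_state in {"success", "skipped"}
```

Note that nothing here counts DAG runs: a second run can still be active while its tasks wait, which is why this setting complements max_active_runs rather than replacing it.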
    

Summary

For preventing concurrent DAG runs, max_active_runs=1 is the most effective option. This will ensure that Airflow queues any new DAG runs until the current one completes, helping you maintain sequential DAG executions without overlap.
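
To make the queueing behavior concrete, here is a toy simulation in plain Python. It is only a sketch, not Airflow's scheduler: it assumes every run takes the same fixed time and that queued runs start in arrival order, and simulate_runs and its parameters are hypothetical names chosen for this example.

```python
import heapq


def simulate_runs(arrival_times, run_duration, max_active_runs=1):
    """Return (run_id, arrival, start, end) for each run in arrival order."""
    active_end_times = []  # min-heap of end times of runs that hold a slot
    schedule = []
    for run_id, arrival in enumerate(sorted(arrival_times)):
        # Release slots held by runs that have finished by the time this run arrives.
        while active_end_times and active_end_times[0] <= arrival:
            heapq.heappop(active_end_times)
        if len(active_end_times) < max_active_runs:
            start = arrival  # a slot is free, so the run starts immediately
        else:
            # All slots are busy: the run stays queued until the earliest one ends.
            start = heapq.heappop(active_end_times)
        end = start + run_duration
        heapq.heappush(active_end_times, end)
        schedule.append((run_id, arrival, start, end))
    return schedule


for row in simulate_runs([0, 1, 2], run_duration=3, max_active_runs=1):
    print(row)
```

With runs arriving at times 0, 1, and 2 and each taking 3 time units, the model starts them at 0, 3, and 6 when max_active_runs=1, so no two runs overlap. Raising the limit to 2 lets the second run start on arrival while the third still waits for a free slot.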
