Airflow allows you to dynamically scale out tasks at runtime based on the output of an upstream task. This is heavily reliant on XComs.
@task def generate_config(): return "batch_size": 64, "threshold": 0.05 @task def process_batch(config): # Airflow automatically resolves 'config' via XCom behind the scenes print(f"Processing with batch size: config['batch_size']") # DAG Layout config_data = generate_config() process_batch(config_data) Use code with caution. 2. The Dangerous Pitfall: The Metadata Database Bottleneck
The TaskFlow API makes XComs implicit, clean, and Pythonic. When a PythonOperator (or @task decorator) returns a value, Airflow automatically pushes that value to XCom.
@task def process_customer_count(count_result): # count_result contains the XCom from sql_task's return_value print(f"Processing count_result customers")
@dag(start_date=datetime(2023,1,1), schedule=None, catchup=False) def xcom_exclusive_pipeline():
t1 >> t2 >> t3
Here is an overview of XCom exclusivity, limitations, and best practices.
At its simplest, an XCom is a key-value pair identified by a key , a task_id , and a dag_id . By default, when a task returns a value, Airflow automatically serializes that value and writes it into the metadata database (e.g., PostgreSQL or MySQL) in the xcom table.
: By default, XComs are accessible by any task within the same DAG run, but they aren't meant for massive datasets (like large CSVs); for those, external storage like S3 is preferred. Best Practices for an XCom-Heavy Workflow
Apache Airflow has become the de facto standard for orchestrating complex data pipelines, enabling data engineers to define workflows as directed acyclic graphs (DAGs). While Airflow excels at scheduling and managing task dependencies, one of its most powerful features for inter-task communication is (short for "cross-communication").
Airflow allows you to dynamically scale out tasks at runtime based on the output of an upstream task. This is heavily reliant on XComs.
@task def generate_config(): return "batch_size": 64, "threshold": 0.05 @task def process_batch(config): # Airflow automatically resolves 'config' via XCom behind the scenes print(f"Processing with batch size: config['batch_size']") # DAG Layout config_data = generate_config() process_batch(config_data) Use code with caution. 2. The Dangerous Pitfall: The Metadata Database Bottleneck
The TaskFlow API makes XComs implicit, clean, and Pythonic. When a PythonOperator (or @task decorator) returns a value, Airflow automatically pushes that value to XCom.
@task def process_customer_count(count_result): # count_result contains the XCom from sql_task's return_value print(f"Processing count_result customers")
@dag(start_date=datetime(2023,1,1), schedule=None, catchup=False) def xcom_exclusive_pipeline():
t1 >> t2 >> t3
Here is an overview of XCom exclusivity, limitations, and best practices.
At its simplest, an XCom is a key-value pair identified by a key , a task_id , and a dag_id . By default, when a task returns a value, Airflow automatically serializes that value and writes it into the metadata database (e.g., PostgreSQL or MySQL) in the xcom table.
: By default, XComs are accessible by any task within the same DAG run, but they aren't meant for massive datasets (like large CSVs); for those, external storage like S3 is preferred. Best Practices for an XCom-Heavy Workflow
Apache Airflow has become the de facto standard for orchestrating complex data pipelines, enabling data engineers to define workflows as directed acyclic graphs (DAGs). While Airflow excels at scheduling and managing task dependencies, one of its most powerful features for inter-task communication is (short for "cross-communication").