do not push stale update to related DagRun on TI update after task execution #44547

Open
mobuchowski wants to merge 1 commit into main from do-not-update-dr-after-ti-fails

Conversation

mobuchowski (Contributor)

When a TI is marked as failed via the UI, and later fails itself, the scheduler changes the DagRun state to failed twice.

(some extra logs for clarity)

[2024-11-28T14:48:11.276+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 75
[2024-11-28T14:48:11.284+0000] {scheduler_job_runner.py:1666} ERROR - Scheduling dag run <DagRun wait_to_fail @ 2024-11-28 14:48:04.731025+00:00: manual__2024-11-28T14:48:04.731025+00:00, state:running, queued_at: 2024-11-28 14:48:04.739184+00:00. externally triggered: True> state running
[2024-11-28T14:48:11.287+0000] {dagrun.py:906} ERROR - Marking run <DagRun wait_to_fail @ 2024-11-28 14:48:04.731025+00:00: manual__2024-11-28T14:48:04.731025+00:00, state:running, queued_at: 2024-11-28 14:48:04.739184+00:00. externally triggered: True> failed - from state running at 2024-11-28 14:48:11.286513+00:00
[2024-11-28T14:48:11.288+0000] {dagrun.py:1101} ERROR - NOTIFYING STATE CHANGED TO failed
[2024-11-28T14:48:11.289+0000] {dagrun.py:989} INFO - DagRun Finished: dag_id=wait_to_fail, logical_date=2024-11-28 14:48:04.731025+00:00, run_id=manual__2024-11-28T14:48:04.731025+00:00, run_start_date=2024-11-28 14:48:05.186397+00:00, run_end_date=2024-11-28 14:48:11.288084+00:00, run_duration=6.101687, state=failed, external_trigger=True, run_type=manual, data_interval_start=2024-11-28 14:48:04.731025+00:00, data_interval_end=2024-11-28 14:48:04.731025+00:00, dag_version_name=wait_to_fail-1
[2024-11-28T14:48:11.299+0000] {client.py:110} INFO - OpenLineageClient will use `http` transport
[2024-11-28T14:48:12.307+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 76
[2024-11-28T14:48:13.344+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 77
[2024-11-28T14:48:14.382+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 78
[2024-11-28T14:48:15.274+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 79
[2024-11-28T14:48:16.309+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 80
[2024-11-28T14:48:17.345+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 81
[2024-11-28T14:48:17.361+0000] {scheduler_job_runner.py:1666} ERROR - Scheduling dag run <DagRun wait_to_fail @ 2024-11-28 14:48:04.731025+00:00: manual__2024-11-28T14:48:04.731025+00:00, state:running, queued_at: 2024-11-28 14:48:04.739184+00:00. externally triggered: True> state running
[2024-11-28T14:48:17.366+0000] {dagrun.py:906} ERROR - Marking run <DagRun wait_to_fail @ 2024-11-28 14:48:04.731025+00:00: manual__2024-11-28T14:48:04.731025+00:00, state:running, queued_at: 2024-11-28 14:48:04.739184+00:00. externally triggered: True> failed - from state running at 2024-11-28 14:48:17.364879+00:00
[2024-11-28T14:48:17.366+0000] {dagrun.py:1101} ERROR - NOTIFYING STATE CHANGED TO failed
[2024-11-28T14:48:17.367+0000] {dagrun.py:989} INFO - DagRun Finished: dag_id=wait_to_fail, logical_date=2024-11-28 14:48:04.731025+00:00, run_id=manual__2024-11-28T14:48:04.731025+00:00, run_start_date=2024-11-28 14:48:05.186397+00:00, run_end_date=2024-11-28 14:48:17.366330+00:00, run_duration=12.179933, state=failed, external_trigger=True, run_type=manual, data_interval_start=2024-11-28 14:48:04.731025+00:00, data_interval_end=2024-11-28 14:48:04.731025+00:00, dag_version_name=wait_to_fail-1
[2024-11-28T14:48:17.370+0000] {client.py:110} INFO - OpenLineageClient will use `http` transport

This happens because, on handle_failure, TaskInstance.save_to_db not only updates the state of that task instance, it also pushes the stale DagRun state - the one it read when the TI started. So the actual DagRun state goes running -> failed -> running -> failed.
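
For anyone less familiar with the SQLAlchemy side, here is a self-contained toy sketch of the mechanism (this is not Airflow code; the Run/Task models and all names below are made up purely for illustration): merging an object whose related row is stale writes the stale state back.

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Run(Base):
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    state = Column(String)


class Task(Base):
    __tablename__ = "task"
    id = Column(Integer, primary_key=True)
    state = Column(String)
    run_id = Column(Integer, ForeignKey("run.id"))
    run = relationship(Run)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as setup:
    setup.add(Task(id=1, state="running", run=Run(id=1, state="running")))
    setup.commit()

# The "worker" loads the task and, via the relationship, a snapshot of the run.
with Session(engine) as worker:
    ti = worker.get(Task, 1)
    _ = ti.run.state  # lazy-load the related run while it is still "running"
# ti and ti.run are now detached, still holding the values loaded above.

# Meanwhile the "scheduler" marks the run failed in another session.
with Session(engine) as scheduler:
    scheduler.get(Run, 1).state = "failed"
    scheduler.commit()

# The worker finishes and saves its task. merge() cascades to the stale
# ti.run, and the commit writes state="running" back over "failed".
with Session(engine) as save_session:
    ti.state = "failed"
    save_session.merge(ti)
    save_session.commit()

with Session(engine) as check:
    print(check.get(Run, 1).state)  # prints "running" - the stale value won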

This causes other unintended behavior, such as calling on_dag_run_failed listeners twice.

The fix simply reloads the DagRun state from the DB so that the stale copy is not pushed along with the TI state (see the diff below). However, there is probably a better solution that someone with more knowledge of SQLAlchemy might help with.

Link to discussion on Airflow slack: https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1732805503889679

@mobuchowski mobuchowski force-pushed the do-not-update-dr-after-ti-fails branch from b134561 to cfb8b88 Compare December 1, 2024 20:23
@mobuchowski mobuchowski force-pushed the do-not-update-dr-after-ti-fails branch from cfb8b88 to a77aafa Compare December 2, 2024 10:13
@@ -1127,7 +1127,8 @@ def _handle_failure(
     )
 
     if not test_mode:
         TaskInstance.save_to_db(failure_context["ti"], session)
+        task_instance.dag_run.refresh_from_db()
Member

Is the DagRun always stale here? Or does this only apply to some code paths? I'm just trying to understand the context here, and I'm wondering whether we should check that task_instance and/or dag_run is in the session.
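
For example, something along these lines is what I have in mind (just a rough, untested fragment to illustrate the question, not a concrete proposal):

if not test_mode:
    TaskInstance.save_to_db(failure_context["ti"], session)
    # only touch the DagRun if it is actually attached to this session
    if task_instance.dag_run in session:
        task_instance.dag_run.refresh_from_db()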

Contributor Author

It is always stale if the DagRun was updated in the meantime by the scheduler/webserver, so when the TI gets manually marked as failed, the database is updated with stale data. The TI itself is updated a bit before that, in TaskInstance.fetch_handle_failure_context.

Contributor Author

@uranusjr do you think there's anything else to do here?

Contributor

I am not sure about this one. Here you refresh the DagRun from the DB, not the TI.

In the error description/PR header you write that the task instance might be updated twice. But usually, when you mark a TI as "failed" manually, it is terminated. Does this error situation ALWAYS happen, or only in a race when you mark it manually and the TI is close to completion?

And are you talking about the on_error callback on the DAG or on the TI? I assume that if you mark it failed, it is at the TI level and not the DAG itself?

Contributor Author

I am not sure about this one. Here you refresh the DagRun from the DB, not the TI. In the error description/PR header you write that the task instance might be updated twice.

Ah yes, I did not write this clearly enough - I meant for the description to be read together with the title - it's the DagRun that is updated twice.

But usually, when you mark a TI as "failed" manually, it is terminated. Does this error situation ALWAYS happen, or only in a race when you mark it manually and the TI is close to completion?

This also happens when the task is terminated; it still goes into the handle_failure method:

        except (AirflowTaskTimeout, AirflowException, AirflowTaskTerminated) as e:
            if not test_mode:
                ti.refresh_from_db(lock_for_update=True, session=session)
            # for case when task is marked as success/failed externally
            # or dagrun timed out and task is marked as skipped
            # current behavior doesn't hit the callbacks
            if ti.state in State.finished:
                ti.clear_next_method_args()
                TaskInstance.save_to_db(ti=ti, session=session)
                return None
            else:
                ti.handle_failure(e, test_mode, context, session=session)
                raise

And are you talking about the on_error callback on the DAG or on the TI? I assume that if you mark it failed, it is at the TI level and not the DAG itself?

I've been testing it with listeners (a minimal listener sketch is included after the DAG below), but it affects callbacks as well.

clear.py.log

  • You can see that the line starting with
DAG clear_test failed at 2024-12-17 12:23:56.974528+00:00. Run ID: manual__2024-12-17T12:23:56.974528+00:00

appears twice. This is the DAG I'm testing it with.

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.standard.operators.python import PythonOperator
from datetime import datetime, timedelta
import time


def wait_function():
    time.sleep(10)


def notify_failure(context):
    dag_id = context['dag'].dag_id
    run_id = context['run_id']
    logical_date = context['logical_date']
    log_url = context['task_instance'].log_url
    print(f"DAG {dag_id} failed at {logical_date}. Run ID: {run_id}. See logs: {log_url}")


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 10, 1),
    'email_on_failure': False,
    'email_on_retry': False
}

with DAG(
    'clear_test',
    default_args=default_args,
    description='A simple example DAG',
    on_failure_callback=notify_failure,
    catchup=False,
) as dag:
    start_task = EmptyOperator(
        task_id='start_task',
    )

    wait_task = PythonOperator(
        task_id='wait_task',
        python_callable=wait_function,
    )

    start_task >> wait_task
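
And this is roughly the listener I've been using to observe it (a minimal sketch; the plugin wiring in the trailing comment is only one way to register it, and the module/class names there are placeholders):

from airflow.listeners import hookimpl


@hookimpl
def on_dag_run_failed(dag_run, msg):
    # With the bug present this fires twice for a single manually failed run.
    print(f"DAG {dag_run.dag_id} failed. Run ID: {dag_run.run_id}. msg={msg}")


# Registered through a plugin, e.g. (names are placeholders):
#
#   from airflow.plugins_manager import AirflowPlugin
#   import my_failure_listener
#
#   class FailureListenerPlugin(AirflowPlugin):
#       name = "failure_listener_plugin"
#       listeners = [my_failure_listener]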

Contributor

Okay, as I am not an expert in this area... I used your DAG and the outcome is mixed.

When I let the task fail normally, the callback executed once. If I manually mark it as failed, the callback is executed twice - also with the patch applied. So in my view it is not fixing the problem.

I tested with main and the CeleryExecutor.

Contributor Author

@jscheffl can you recheck? I've moved the refresh into the save_to_db method - to cover more cases - and reverified.
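
The idea is roughly the following (a simplified, hypothetical sketch rather than the actual diff - the real save_to_db is a staticmethod on TaskInstance with more bookkeeping, and the exact placement of the refresh may differ):

from airflow.utils.session import NEW_SESSION, provide_session


@provide_session
def save_to_db(ti, session=NEW_SESSION):
    # Reload the DagRun row so that merging/flushing the TI does not also
    # push back the stale DagRun snapshot attached to it.
    ti.get_dagrun(session=session).refresh_from_db(session=session)
    session.merge(ti)
    session.flush()
    session.commit()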

Contributor

As said... I am not an expert, and the last feedback was just from manual tests. A naive view of the code looks good, though.

But some pytests are failing. Can you fix those? Then I can re-review.

@mobuchowski mobuchowski force-pushed the do-not-update-dr-after-ti-fails branch from a77aafa to 2aeb4db Compare December 2, 2024 14:56
@mobuchowski mobuchowski force-pushed the do-not-update-dr-after-ti-fails branch 2 times, most recently from cae5cd6 to c1e0087 Compare December 20, 2024 15:59
@mobuchowski mobuchowski force-pushed the do-not-update-dr-after-ti-fails branch from c1e0087 to ea2969b Compare December 24, 2024 14:41