feat: auto-resolve stale run locks [DATA-31226] #19

Open
quocnguyendinh wants to merge 5 commits into master from DATA-31226/auto-resolve-run-lock

@quocnguyendinh quocnguyendinh commented Apr 3, 2026

Context

Diffa's start_run() creates a RUNNING record in diffa_check_runs that acts as a run lock. When a Kubernetes pod is killed mid-run (e.g., during downgrade/redeployment), the signal handlers never fire, so the lock is never released. Future runs see the stale RUNNING record and throw RunningCheckRunsException, which requires manual intervention to clear.

The fix: before raising RunningCheckRunsException, check if the RUNNING records are older than a timeout threshold. If so, auto-mark them as FAILED and proceed.
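
The check described above can be sketched in memory as follows. This is a minimal illustration, not the code in this PR: CheckRun, resolve_stale_then_check, and the field names are hypothetical stand-ins, and only the RUNNING/FAILED statuses, RunningCheckRunsException, and the 3-hour default come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CheckRun:
    # Hypothetical stand-in for one row of diffa_check_runs.
    run_id: str
    status: str
    created_at: datetime


class RunningCheckRunsException(Exception):
    """Raised when a non-stale RUNNING record still holds the lock."""


def resolve_stale_then_check(runs, now, lock_timeout_hours=3.0):
    """Mark RUNNING records older than the timeout as FAILED, then raise
    if any RUNNING record is left. Returns the ids that were resolved."""
    cutoff = now - timedelta(hours=lock_timeout_hours)
    resolved = []
    for run in runs:
        if run.status == "RUNNING" and run.created_at < cutoff:
            run.status = "FAILED"  # stale lock: release it
            resolved.append(run.run_id)
    still_running = [r.run_id for r in runs if r.status == "RUNNING"]
    if still_running:
        # A fresh lock is left untouched and still blocks the new run.
        raise RunningCheckRunsException(f"active runs: {still_running}")
    return resolved
```

A record created five hours ago is flipped to FAILED and the run proceeds; a record created ten minutes ago is left alone and the exception is still raised.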

Summary

  • Auto-resolve RUNNING lock records older than a configurable timeout (default 3 hours) before checking for active locks
  • Adds --lock-timeout CLI option to configure the timeout in hours
  • Prevents stale locks from blocking future runs when pods are killed during redeployment
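
For readers unfamiliar with the option, here is a rough sketch of how a --lock-timeout flag in hours with a default of 3 could be parsed. It uses argparse from the standard library only to stay self-contained; the PR description does not say which CLI library cli.py uses. The program name, the acceptance of fractional hours, and the positive-value check are all assumptions made for illustration.

```python
import argparse


def build_parser():
    # "diffa-sketch" is a placeholder program name, not the real entry point.
    parser = argparse.ArgumentParser(prog="diffa-sketch")
    parser.add_argument(
        "--lock-timeout",
        dest="lock_timeout_hours",
        type=float,   # assumption: fractional hours are accepted
        default=3.0,  # default of 3 hours, as stated in the summary
        help="Age in hours after which a RUNNING lock is treated as stale.",
    )
    return parser


def parse_lock_timeout(argv):
    """Return the configured timeout in hours, rejecting non-positive values."""
    args = build_parser().parse_args(argv)
    if args.lock_timeout_hours <= 0:
        # Assumption: a zero or negative timeout is treated as a user error.
        raise ValueError("--lock-timeout must be greater than 0")
    return args.lock_timeout_hours
```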

Jira

DATA-31226

Changes

  • data_models.py — map created_at column in DiffaCheckRun ORM model
  • diffa_check_run.py — add resolve_stale_running_records() (repo) + resolve_stale_check_runs() (service)
  • config.py — add lock_timeout_hours to DiffaConfig, thread through ConfigManager.configure()
  • cli.py — add --lock-timeout option
  • run_manager.py — call resolve before checking for running records
  • test_run_manager.py — add tests for stale lock resolution + call ordering
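
At the repository layer, the resolution step amounts to a single conditional UPDATE keyed on status and created_at. The sketch below shows that idea against an in-memory SQLite table so it runs on its own. The real project uses an ORM model (DiffaCheckRun), and its schema, timestamp storage format, and the function signature here are assumptions; only the function name, the table name, and the created_at column come from the description.

```python
import sqlite3
from datetime import datetime, timedelta


def resolve_stale_running_records(conn: sqlite3.Connection,
                                  now: datetime,
                                  lock_timeout_hours: float = 3.0) -> int:
    """Mark RUNNING rows created before the cutoff as FAILED.

    Assumes created_at is stored as an ISO-8601 string with second
    precision, so string comparison matches chronological order.
    Returns the number of rows that were updated.
    """
    cutoff = (now - timedelta(hours=lock_timeout_hours)).isoformat(timespec="seconds")
    cursor = conn.execute(
        "UPDATE diffa_check_runs "
        "SET status = 'FAILED' "
        "WHERE status = 'RUNNING' AND created_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cursor.rowcount
```

Doing the age comparison in the WHERE clause keeps the resolution to one statement, so a fresh RUNNING row is never touched.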

Test plan

  • Existing tests pass (28/29 — 1 pre-existing failure unrelated to this change)
  • New test: resolve_stale_check_runs called before getting_running_check_runs
  • New test: stale locks resolved, fresh locks still raise RunningCheckRunsException
  • Manual: create a RUNNING record with old created_at, run diffa data-diff, verify auto-resolution
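
The call-ordering test can be approximated with unittest.mock, which records method calls on a mock in the order they happen. RunManagerSketch below is a hypothetical stand-in for the real RunManager, and the service method signatures are assumed; the two method names are taken from the test plan as written.

```python
from unittest.mock import MagicMock


class RunManagerSketch:
    """Hypothetical stand-in for RunManager; not the class in this PR."""

    def __init__(self, service, lock_timeout_hours=3.0):
        self.service = service
        self.lock_timeout_hours = lock_timeout_hours

    def start_run(self):
        # Resolve stale locks first, then look for locks that are still active.
        self.service.resolve_stale_check_runs(self.lock_timeout_hours)
        running = self.service.getting_running_check_runs()
        if running:
            raise RuntimeError("another check run is still RUNNING")
        return "started"


def recorded_call_order(manager_cls):
    """Run start_run() against a mocked service and return the method
    names in the order they were called."""
    service = MagicMock()
    service.getting_running_check_runs.return_value = []
    manager_cls(service).start_run()
    # Each entry in method_calls is a (name, args, kwargs) triple.
    return [recorded[0] for recorded in service.method_calls]
```

Asserting on the full ordered list, rather than on each call separately, is what catches a regression where the lock check runs before the stale records are resolved.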

🤖 Generated with Claude Code

quocnguyendinh and others added 2 commits April 3, 2026 17:34
Add CLAUDE.md and knowledge/ folder documenting architecture, design
patterns, conventions, and component layers to enable AI coding agents
to follow existing patterns when working on the codebase.

Refs: DATA-32873

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Move knowledge files from knowledge/ to .claude/rules/ so they are
automatically loaded into every Claude Code session. Replace root
CLAUDE.md with .claude/CLAUDE.md.

Refs: DATA-32873

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@quocnguyendinh quocnguyendinh changed the title from "feat: auto-resolve stale run locks" to "feat: auto-resolve stale run locks [DATA-31226]" on Apr 3, 2026
quocnguyendinh and others added 3 commits April 3, 2026 18:10
Each rule file now has a changelog table at the bottom tracking
date, PR, and description of changes for decision traceability.

Refs: DATA-32873

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When pods are killed during redeployment, RUNNING locks are left
unreleased, blocking future runs. This adds automatic resolution of
stale RUNNING records older than a configurable timeout (default 3h)
before checking for active locks.

- Add --lock-timeout CLI option (hours, default 3)
- Add resolve_stale_running_records() at repository and service layer
- Call resolve before checking for running records in RunManager
- Map created_at column in DiffaCheckRun ORM model

Refs: DATA-31226

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update architecture, cli, configuration, and database-layer rule files
with lock timeout feature details and changelog entries for PR #19.

Refs: DATA-31226

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@quocnguyendinh quocnguyendinh force-pushed the DATA-31226/auto-resolve-run-lock branch from 0a26841 to 281919c Compare April 3, 2026 11:12
@quocnguyendinh quocnguyendinh self-assigned this Apr 5, 2026