Fix region migration retries and peer recreation #17512

Open
Pengzna wants to merge 1 commit into apache:master from Pengzna:codex/iotv2-sync-lag-fix

Pengzna (Collaborator) commented Apr 17, 2026

Summary

  • rebase this branch onto the latest master
  • drop the local IoTConsensusV2 receiver fix because upstream already contains #17495
  • keep the remaining region migration fixes for bounded delete-old-peer retries and idempotent local peer recreation
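
For readers unfamiliar with the term, "bounded retries" means the caller gives up after a fixed number of attempts instead of looping until the RPC succeeds. The following is a minimal sketch of that idea only. `BoundedRetry`, `call`, `maxAttempts`, and `backoffMillis` are hypothetical names chosen for illustration and are not taken from `RegionMaintainHandler`.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of the IoTDB code base.
final class BoundedRetry {
  private BoundedRetry() {}

  /**
   * Runs task up to maxAttempts times and returns the first successful result.
   * Rethrows the last failure once the attempt budget is spent or the thread is interrupted.
   */
  static <T> T call(Supplier<T> task, int maxAttempts, long backoffMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return task.get();
      } catch (RuntimeException e) {
        last = e; // remember the failure and fall through to the backoff
      }
      if (attempt < maxAttempts && !pause(backoffMillis)) {
        break; // interrupted while waiting: stop retrying early
      }
    }
    throw last;
  }

  /** Sleeps for the given time; returns false (and restores the flag) if interrupted. */
  private static boolean pause(long millis) {
    try {
      Thread.sleep(millis);
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

A caller would wrap the RPC in a lambda, for example `BoundedRetry.call(() -> sendRequest(), 3, 1000L)`, where `sendRequest` stands in for whatever performs the request.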

Verification

  • mvn -pl iotdb-core/confignode,iotdb-core/consensus -am -DskipTests -Drat.skip=true -ntp compile

Copilot AI review requested due to automatic review settings April 17, 2026 16:36
Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses stability issues in consensus/region-migration workflows, focusing on bounding retry behavior, preventing duplicate peer lists during migration, and fixing a writer-allocation race in the IoTConsensusV2 pipe receiver.

Changes:

  • Bound DELETE_OLD_REGION_PEER RPC retries and avoid adding duplicate destination peers during IoT/IoTV2 region migration.
  • Make IoTConsensus#createLocalPeer tolerate an already-existing consensus directory after restart recovery (and validate it’s a directory).
  • Improve IoTConsensusV2 receiver tsfile-writer pooling by reserving writers more deterministically and waking waiters when writers are returned.
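
To illustrate the second bullet, here is a rough sketch of an idempotent "ensure directory" step: an existing directory is accepted (as it might be after restart recovery), while an existing non-directory path is rejected. `PeerDirs` and `ensureDirectory` are hypothetical names, and the actual `IoTConsensus#createLocalPeer` logic may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not the IoTConsensus implementation.
final class PeerDirs {
  private PeerDirs() {}

  /**
   * Makes sure dir exists as a directory.
   * Returns true if it was created by this call, false if it was already present.
   */
  static boolean ensureDirectory(Path dir) {
    if (Files.isDirectory(dir)) {
      return false; // left over from an earlier run: tolerate it
    }
    if (Files.exists(dir)) {
      // Something else occupies the path, so recreating the peer cannot be safe.
      throw new IllegalStateException(dir + " exists but is not a directory");
    }
    try {
      Files.createDirectories(dir);
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling `ensureDirectory` twice on the same path creates the directory on the first call and is a no-op on the second, which is the property that makes a retried peer creation safe.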

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/pipe/receiver/protocol/iotconsensusv2/IoTConsensusV2Receiver.java | Adjust tsfile-writer borrowing logic and signal waiters on writer release to mitigate writer-pool races. |
| iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/iot/IoTConsensus.java | Allow createLocalPeer to proceed when the consensus dir already exists; validate path type. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/procedure/env/RegionMaintainHandler.java | Avoid duplicate peers in create-peer requests; bound DELETE_OLD_REGION_PEER RPC retries per attempt. |
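
As a rough illustration of "avoid duplicate peers in create-peer requests", the sketch below builds a peer list that contains the destination exactly once, even if a retried migration step has already added it. `PeerLists`, `withDestination`, and the generic peer type `P` are assumptions made for this example; the real request types in `RegionMaintainHandler` are not shown on this page.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper for illustration; P stands in for whatever type identifies a peer.
final class PeerLists {
  private PeerLists() {}

  /** Returns the current peers plus the destination, without duplicates, in original order. */
  static <P> List<P> withDestination(List<P> currentPeers, P destination) {
    // LinkedHashSet drops repeated entries while preserving insertion order.
    LinkedHashSet<P> unique = new LinkedHashSet<>(currentPeers);
    unique.add(destination);
    return new ArrayList<>(unique);
  }
}
```

This relies on the peer type implementing `equals` and `hashCode` consistently, which is an assumption about `P` rather than a statement about IoTDB's classes.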

Comment on lines +1017 to +1031
// We should synchronously find the idle writer to avoid concurrency issues.
lock.lock();
try {
  // We need to check tsFileWriter.isPresent() here. Since there may be both retry-sent tsfile
  // events and real-time-sent tsfile events, causing the receiver's tsFileWriter load to
  // exceed IOTDB_CONFIG.getIoTConsensusV2PipelineSize().
  while (!tsFileWriter.isPresent()) {
  while (true) {
    tsFileWriter =
        iotConsensusV2TsFileWriterPool.stream()
            .filter(
                item ->
                    item.isUsed()
                        && Objects.equals(
                            commitId, item.getCommitIdOfCorrespondingHolderEvent()))
            .findFirst();
    if (tsFileWriter.isPresent()) {
      break;
    }
Comment on lines +1339 to +1345
tsFileWriter.returnSelf(consensusPipeName);
iotConsensusV2TsFileWriterPool.lock.lock();
try {
  iotConsensusV2TsFileWriterPool.condition.signalAll();
} finally {
  iotConsensusV2TsFileWriterPool.lock.unlock();
}
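
The snippet above signals a condition after a writer is returned. A generic sketch of that lock-and-condition pattern is shown here for context: borrowers wait on the condition while the pool is empty, and every release wakes them. `SignalingPool`, `borrow`, and `giveBack` are hypothetical names, and this is not the IoTConsensusV2 receiver code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical pool for illustration; not the IoTConsensusV2 tsfile-writer pool.
final class SignalingPool<T> {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition available = lock.newCondition();
  private final Deque<T> idle = new ArrayDeque<>();

  SignalingPool(Iterable<T> items) {
    for (T item : items) {
      idle.add(item);
    }
  }

  /** Returns an idle item, or null if none is released within the timeout or on interrupt. */
  T borrow(long timeoutMillis) {
    long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    lock.lock();
    try {
      // Re-check in a loop: another waiter may take the item first, and wakeups can be spurious.
      while (idle.isEmpty()) {
        if (nanos <= 0L) {
          return null;
        }
        nanos = available.awaitNanos(nanos);
      }
      return idle.poll();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return null;
    } finally {
      lock.unlock();
    }
  }

  /** Puts an item back and wakes every waiting borrower. */
  void giveBack(T item) {
    lock.lock();
    try {
      idle.add(item);
      available.signalAll();
    } finally {
      lock.unlock();
    }
  }
}
```

Signalling while holding the lock, as the reviewed snippet also does, is required by `Condition`: calling `signalAll` without owning the lock throws `IllegalMonitorStateException` for a `ReentrantLock` condition.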
Pengzna changed the title from "Fix IoTV2 receiver race and region migration retries" to "Fix region migration retries and peer recreation" on Apr 17, 2026
Pengzna force-pushed the codex/iotv2-sync-lag-fix branch from 1748fa3 to ee1d27e on April 17, 2026 16:52