Skip to content

Conversation

@JaySon-Huang
Copy link
Contributor

@JaySon-Huang JaySon-Huang commented Feb 4, 2026

What problem does this PR solve?

Issue Number: close #10683

Problem Summary:

  • TSAN reports a data race on FailPointHelper::fail_point_val when one thread disables a failpoint while another reads it.

What is changed and how it works?

Fix concurrency of accessing fail_point_val
  • Guard fail_point_val with std::shared_mutex and use shared/unique locks for read/write access.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • Bug Fixes
    • Improved fail-point reliability by adding robust thread-safe synchronization and guarded operations around fail-point state and wait/notify handling. Ensures safe concurrent enable/disable, atomic notify-before-clear semantics, protected per-point values, and a consistent error when waiting on a missing fail-point to prevent race conditions and unintended clears during notifications.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 4, 2026
@coderabbitai
Copy link

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

Introduces mutex-based synchronization in FailPointHelper: new fail_point_wait_channels_mutex, fail_point_val_mutex, and fail_point_val map; all accesses to wait-channel and per-point value maps are now guarded; channel notification and lifecycle are adjusted to avoid races.

Changes

Cohort / File(s) Summary
FailPoint core
dbms/src/Common/FailPoint.cpp, dbms/src/Common/FailPoint.h
Added std::mutex fail_point_wait_channels_mutex, std::shared_mutex fail_point_val_mutex, and std::unordered_map<String, std::any> fail_point_val. Replaced direct map accesses with mutex-protected operations (unique_lock for writes, shared_lock for reads). Adjusted enable/disable/wait flows to atomically extract and notify channels using local references and to erase corresponding value entries while holding locks.

Sequence Diagram(s)

(Skipped)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • CalvinNeo
  • xzhangxian1008

Poem

🐰 I hop where mutexes quietly hum,
Locks clasp the maps so threads don't come undone.
Channels wake gently, values held in tow,
No races now — just tidy ebb and flow.
A rabbit nods, and off I go!

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title mentions fixing concurrency of accessing fail_point_val, which aligns with the code changes that add synchronization primitives to guard this variable.
Description check ✅ Passed The description references the correct issue (#10683), explains the problem (TSAN data race), specifies the solution (std::shared_mutex with shared/unique locks), and confirms unit tests were added.
Linked Issues check ✅ Passed The PR directly addresses issue #10683 by adding thread-safe access to fail_point_val through std::shared_mutex, which resolves the TSAN-reported data race condition described in the linked issue.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing the concurrency issue in FailPointHelper by adding synchronization primitives and guarding access to fail_point_val, which is directly related to the linked issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@JaySon-Huang JaySon-Huang changed the title Fix concurrency of accessing fail_point_val tests: Fix concurrency of accessing fail_point_val Feb 4, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a TSAN-reported data race on FailPointHelper::fail_point_val where concurrent threads could access the map while one thread is modifying it (e.g., one thread reading via getFailPointVal while another thread erases via disableFailPoint).

Changes:

  • Added std::shared_mutex fail_point_val_mutex to protect concurrent access to fail_point_val
  • Protected read access in getFailPointVal with std::shared_lock
  • Protected write access in enableFailPoint and disableFailPoint with std::unique_lock

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
dbms/src/Common/FailPoint.h Added declaration of fail_point_val_mutex with appropriate #ifdef FIU_ENABLE guards and included <shared_mutex> header
dbms/src/Common/FailPoint.cpp Defined fail_point_val_mutex and added locking in getFailPointVal, enableFailPoint, and disableFailPoint methods

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 4, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@dbms/src/Common/FailPoint.cpp`:
- Around line 365-375: The throw in FailPointHelper::wait should include a
specific ErrorCodes value and use fmt::format instead of string concatenation;
replace the current throw Exception(...) with one that passes
ErrorCodes::FAIL_POINT_ERROR and a formatted message built via fmt::format
(referencing FailPointHelper::wait, fail_point_wait_channels, and the Exception
constructor that accepts an error code) so the exception becomes
Exception(ErrorCodes::FAIL_POINT_ERROR, fmt::format("Can not find channel for
fail point {}", fail_point_name)).
🧹 Nitpick comments (1)
dbms/src/Common/FailPoint.cpp (1)

214-218: Rename new static members to camelCase.
New variables use snake_case, which conflicts with the C++ naming guideline. Consider failPointWaitChannelsMutex, failPointValMutex, and failPointVal (or failPointValMap) and update the header usages accordingly.

As per coding guidelines: Method and variable names in C++ must use camelCase (e.g., readBlock, totalBytes).

Signed-off-by: JaySon-Huang <[email protected]>
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JinheLin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Feb 4, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 4, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-02-04 09:03:19.13396075 +0000 UTC m=+257670.235359469: ☑️ agreed by JinheLin.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@dbms/src/Common/FailPoint.cpp`:
- Around line 214-218: The static member names use snake_case; rename them to
camelCase and update all declarations/usages: change
FailPointHelper::fail_point_wait_channels_mutex ->
FailPointHelper::failPointWaitChannelsMutex,
FailPointHelper::fail_point_wait_channels ->
FailPointHelper::failPointWaitChannels, FailPointHelper::fail_point_val_mutex ->
FailPointHelper::failPointValMutex, and FailPointHelper::fail_point_val ->
FailPointHelper::failPointVal; ensure you update the corresponding declarations
in the FailPoint.h header and any references throughout the codebase to match
the new names so linkage and compilation remain correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fail TSAN Test: S3FileTest.PutDMFileLocalFilesWaitsForAllTasks

2 participants