c++ - What are the benefits and drawbacks of an on-demand, conditional std::atomic_thread_fence acquire?

July 5, 2023 · 657 views

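The code below shows two ways to acquire shared state via an atomic flag. The reader thread calls poll1() or poll2() to check whether the writer has signaled the flag.

Poll option #1: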

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

Poll option #2:

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

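Note that option #1 was presented in an earlier question, and option #2 is similar to example code at cppreference.com.

Assuming the reader agrees to examine the shared state only if the poll function returns true, are the two poll functions both correct and equivalent?

Is there a standard name for option #2?

What are the benefits and drawbacks of each option?

Is option #2 likely to be more efficient in practice? Could it be less efficient?

Here is a complete working example: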

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int x; // regular variable, could be a complex data structure

std::atomic<int> flag { 0 };

void writer_thread() {
    x = 42;
    // release value x to reader thread
    flag.store(1, std::memory_order_release);
}

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

int main() {
    x = 0;

    std::thread t(writer_thread);

    // "reader thread" ...  
    // sleep-wait is just for the test.
    // production code calls poll() at specific points

    while (!poll2()) // poll1() or poll2() here
      std::this_thread::sleep_for(std::chrono::milliseconds(50));

    std::cout << x << std::endl;

    t.join();
}

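Best answer: I think I can answer most of your questions.

Both options are certainly correct, but they are not exactly equivalent, because a standalone fence has slightly broader applicability. They are equivalent in terms of what you want to accomplish, but the standalone fence could technically apply to other things as well (imagine what happens if this code is inlined). How a standalone fence differs from a fence attached to a store or load operation is explained in this post by Jeff Preshing.

As far as I know, the check-then-fence pattern in option #2 has no name. It is not uncommon, though.

In terms of performance, with g++ 4.8.1 on x64 (Linux), the assembly generated for both options boils down to a single load instruction. This is not surprising, since x86(-64) loads and stores all have acquire and release semantics at the hardware level anyway (x86 is known for its quite strong memory model).

For ARM, however, where memory barriers compile down to actual individual instructions, the following output is generated (using gcc.godbolt.com with -O3 -DNDEBUG):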
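One way to picture the broader reach of a standalone fence is a reader that checks more than one flag. The sketch below is an illustration only, not code from the question or from the linked post; the `Shared` struct, `pollBoth` and `runDemo` are hypothetical names. It assumes two writers that each publish a plain payload with a release store, and shows a single acquire fence covering both relaxed loads that precede it.

```cpp
#include <atomic>
#include <thread>

// Hypothetical two-flag setup for illustration; not from the original post.
struct Shared {
    int payloadA = 0;            // plain data published by writer A
    int payloadB = 0;            // plain data published by writer B
    std::atomic<int> flagA{0};
    std::atomic<int> flagB{0};
};

// Check both flags with relaxed loads; a single acquire fence then
// synchronizes with both release stores whose values were observed.
bool pollBoth(Shared& s) {
    int a = s.flagA.load(std::memory_order_relaxed);
    int b = s.flagB.load(std::memory_order_relaxed);
    if (a == 1 && b == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

// Runs two writers and one polling reader; returns payloadA + payloadB.
int runDemo() {
    Shared s;
    std::thread ta([&s] { s.payloadA = 1; s.flagA.store(1, std::memory_order_release); });
    std::thread tb([&s] { s.payloadB = 2; s.flagB.store(1, std::memory_order_release); });
    while (!pollBoth(s))
        std::this_thread::yield();
    int sum = s.payloadA + s.payloadB;  // both payloads are visible after the fence
    ta.join();
    tb.join();
    return sum;
}
```

With load-acquire operations instead, each of the two loads would carry its own acquire ordering; here the ordering cost is paid once, and only on the path where both flags are set.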

For while (!poll1());:

.L25:
    ldr     r0, [r2]
    movw    r3, #:lower16:.LANCHOR0
    dmb     sy
    movt    r3, #:upper16:.LANCHOR0
    cmp     r0, #1
    bne     .L25

For while (!poll2());:

.L29:
    ldr     r0, [r2]
    movw    r3, #:lower16:.LANCHOR0
    movt    r3, #:upper16:.LANCHOR0
    cmp     r0, #1
    bne     .L29
    dmb     sy

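You can see that the only difference is where the synchronization instruction (dmb) is placed: inside the loop for poll1, and after the loop for poll2. So poll2 really is more efficient in this real-world case :-) (But read on for why this might not matter if the functions are called in a loop that blocks until the flag changes.)

For ARM64, the output is different, because there are special load/store instructions with barriers built in (ldar -> load-acquire).

For while (!poll1());: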

.L16:
    ldar    w0, [x1]
    cmp     w0, 1
    bne     .L16

For while (!poll2());:

.L24:
    ldr     w0, [x1]
    cmp     w0, 1
    bne     .L24
    dmb     ishld

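Again, poll2 leads to a loop with no barrier inside it and one barrier after it, whereas poll1 executes a barrier on every iteration.

Now, which one is actually faster requires running a benchmark, and unfortunately I don't have the setup for that. Counter-intuitively, poll1 and poll2 may end up equally efficient in this case: the extra time spent inside the loop waiting for memory effects to propagate may not actually be wasted if the flag variable is one of the effects that needs to propagate anyway. In other words, the total time until the loop exits may be the same even if individual (inlined) calls to poll1 take longer than calls to poll2. Of course, this assumes a loop that waits for the flag to change; an individual call to poll1 does require more work than an individual call to poll2.

So overall, I think it is fairly safe to say that poll2 should never be significantly less efficient than poll1 and may often be faster, as long as the compiler can eliminate the branch when the function is inlined (which seems to be the case for at least these three popular architectures).

My (slightly different) test code for reference: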
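For anyone who does want to measure this, a minimal timing harness might look like the sketch below. It is an assumption-laden starting point, not a rigorous benchmark: `timeUnsignaledPolls` is a hypothetical helper, it only exercises the single-threaded "flag not yet set" path, and it reports raw elapsed nanoseconds without warm-up or statistics. Any numbers it produces depend heavily on the target architecture and compiler flags.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

std::atomic<int> flag{0};

bool poll1() {
    return flag.load(std::memory_order_acquire) == 1;
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

// Hypothetical helper: times `iterations` calls of `poll` and returns the
// elapsed nanoseconds. `hits` counts how many calls returned true, which
// also keeps the result of each call observable.
template <typename Poll>
std::int64_t timeUnsignaledPolls(Poll poll, std::int64_t iterations, std::int64_t& hits) {
    auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < iterations; ++i) {
        if (poll())
            ++hits;
    }
    auto stop = std::chrono::steady_clock::now();
    return static_cast<std::int64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
}
```

Calling `timeUnsignaledPolls(poll1, n, hits)` and `timeUnsignaledPolls(poll2, n, hits)` with the same `n` on an ARM target would give a first impression of the per-call cost of the barrier; measuring the blocking-loop case discussed above would additionally need a writer thread.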

#include <atomic>
#include <thread>
#include <cstdio>

int sharedState;
std::atomic<int> flag(0);

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

void __attribute__((noinline)) threadFunc()
{
    while (!poll2());
    std::printf("%d\n", sharedState);
}

int main(int argc, char** argv)
{
    std::thread t(threadFunc);
    sharedState = argc;
    flag.store(1, std::memory_order_release);
    t.join();
    return 0;
}