Loop Splitting使代码变慢

2023年6月29日 221次阅读

所以我正在优化一个循环(作为家庭作业),增加10,000个元素600,000次.没有优化的时间是23.34s~我的目标是B小于7秒,A小于5秒.

所以我首先通过这样展开循环来开始我的优化.

int     j;

        for (j = 0; j < ARRAY_SIZE; j += 8) {
            sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] +  array[j+6] + array[j+7];

这将运行时间缩短到大约6.4秒(如果我进一步展开,我可以达到6左右).

所以我想我会尝试添加子和并在最后得到最后的总和,以节省读写依赖的时间,我想出了这样的代码.

int     j;

    for (j = 0; j < ARRAY_SIZE; j += 8) {
        sum0 += array[j] + array[j+1]; 
        sum1 += array[j+2] + array[j+3];
        sum2 += array[j+4] + array[j+5]; 
        sum3 += array[j+6] + array[j+7];

但是,这会将运行时间增加到大约6.8秒

我尝试使用指针的类似技术,我能做的最好的是大约15秒.

我只知道我正在运行的机器(因为它是学校购买的服务)是一台32位,远程,基于Intel的Linux虚拟服务器,我认为它运行的是Red Hat.

我已经尝试了所有我能想到的技术来加速代码,但它们似乎都有相反的效果.有人可以详细说明我做错了吗？还是我可以用来降低运行时间的另一种技术？老师能做的最好的事情是大约4.8秒.

作为一个附加条件,我在完成的项目中不能有超过50行代码,因此做一些复杂的事情可能是不可能的.

以下是两个来源的完整副本

    #include <stdio.h>
#include <stdlib.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...

//  double sum0 = 0;
//  double sum1 = 0;
//  double sum2 = 0;
//  double sum3 = 0;

    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - ACTUAL NAME\n");

    for (i = 0; i < N_TIMES; i++) {

        // You can change anything between this comment ...

        int     j;

        for (j = 0; j < ARRAY_SIZE; j += 8) {
            sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] +  array[j+6] + array[j+7];
        }

        // ... and this one. But your inner loop must do the same
        // number of additions as this one does.

        }

    // You can add some final code between this comment ...
//  sum = sum0 + sum1 + sum2 + sum3;
    // ... and this one.

    return 0;
}

破碎的代码

    #include <stdio.h>
#include <stdlib.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...

    double sum0 = 0;
    double sum1 = 0;
    double sum2 = 0;
    double sum3 = 0;

    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - ACTUAL NAME\n");

    for (i = 0; i < N_TIMES; i++) {

        // You can change anything between this comment ...

        int     j;

        for (j = 0; j < ARRAY_SIZE; j += 8) {
            sum0 += array[j] + array[j+1]; 
            sum1 += array[j+2] + array[j+3];
            sum2 += array[j+4] + array[j+5]; 
            sum3 += array[j+6] + array[j+7];
        }

        // ... and this one. But your inner loop must do the same
        // number of additions as this one does.

        }

    // You can add some final code between this comment ...
    sum = sum0 + sum1 + sum2 + sum3;
    // ... and this one.

    return 0;
}

回答

我们用来判断等级的“时间”应用程序有点偏差.我能做的最好的事情是4.9~通过展开循环50次并使用TomKarzes的基本格式对其进行分组.

int     j;
        for (j = 0; j < ARRAY_SIZE; j += 50) {
            sum +=(((((((array[j] + array[j+1]) + (array[j+2] + array[j+3])) +
                    ((array[j+4] + array[j+5]) + (array[j+6] + array[j+7]))) + 
                    (((array[j+8] + array[j+9]) + (array[j+10] + array[j+11])) +
                    ((array[j+12] + array[j+13]) + (array[j+14] + array[j+15])))) +
                    ((((array[j+16] + array[j+17]) + (array[j+18] + array[j+19]))))) +
                    (((((array[j+20] + array[j+21]) + (array[j+22] + array[j+23])) +
                    ((array[j+24] + array[j+25]) + (array[j+26] + array[j+27]))) + 
                    (((array[j+28] + array[j+29]) + (array[j+30] + array[j+31])) +
                    ((array[j+32] + array[j+33]) + (array[j+34] + array[j+35])))) +
                    ((((array[j+36] + array[j+37]) + (array[j+38] + array[j+39])))))) + 
                    ((((array[j+40] + array[j+41]) + (array[j+42] + array[j+43])) +
                    ((array[j+44] + array[j+45]) + (array[j+46] + array[j+47]))) + 
                    (array[j+48] + array[j+49])));
        }

最佳答案我试验了一下分组.在我的机器上,使用我的gcc,我发现以下效果最好：

    for (j = 0; j < ARRAY_SIZE; j += 16) {
        sum = sum +
              (array[j   ] + array[j+ 1]) +
              (array[j+ 2] + array[j+ 3]) +
              (array[j+ 4] + array[j+ 5]) +
              (array[j+ 6] + array[j+ 7]) +
              (array[j+ 8] + array[j+ 9]) +
              (array[j+10] + array[j+11]) +
              (array[j+12] + array[j+13]) +
              (array[j+14] + array[j+15]);
    }

换句话说,它展开了16次,它将总和分组成对,然后它线性地添加对.我还删除了=运算符,它影响何时首次使用sum.

我发现即使没有改变任何东西,测量的时间也会在一次运行中发生显着变化,因此我建议对每个版本进行多次计时,然后再对时间是否有所改善或变得更糟.

我有兴趣知道你使用这个版本的内循环在你的机器上获得了什么数字.

更新：这是我目前最快的版本(在我的机器上,使用我的编译器)：

    int     j1, j2;

    j1 = 0;
    do {
        j2 = j1 + 20;
        sum = sum +
              (array[j1   ] + array[j1+ 1]) +
              (array[j1+ 2] + array[j1+ 3]) +
              (array[j1+ 4] + array[j1+ 5]) +
              (array[j1+ 6] + array[j1+ 7]) +
              (array[j1+ 8] + array[j1+ 9]) +
              (array[j1+10] + array[j1+11]) +
              (array[j1+12] + array[j1+13]) +
              (array[j1+14] + array[j1+15]) +
              (array[j1+16] + array[j1+17]) +
              (array[j1+18] + array[j1+19]);
        j1 = j2 + 20;
        sum = sum +
              (array[j2   ] + array[j2+ 1]) +
              (array[j2+ 2] + array[j2+ 3]) +
              (array[j2+ 4] + array[j2+ 5]) +
              (array[j2+ 6] + array[j2+ 7]) +
              (array[j2+ 8] + array[j2+ 9]) +
              (array[j2+10] + array[j2+11]) +
              (array[j2+12] + array[j2+13]) +
              (array[j2+14] + array[j2+15]) +
              (array[j2+16] + array[j2+17]) +
              (array[j2+18] + array[j2+19]);
    }
    while (j1 < ARRAY_SIZE);

这使用了40的总展开量,分成两组,每组20个,具有交替的归纳变量,这些变量预先递增以破坏依赖性,以及后测试循环.同样,您可以尝试使用括号分组来为您的编译器和平台进行微调.