Optimizing a Python function with numpy arrays

July 7, 2023 · 241 views

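I have been trying to optimize a Python script I wrote over the past two days. Using several profiling tools (cProfile, line_profiler, etc.), I narrowed the problem down to the function below.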

df is a numpy array with 3 columns and 1,000,000 rows (dtype float). Using line_profiler, I found that the function spends most of its time wherever it has to access the numpy array.

full_length += head + df[rnd_truck,2]

and

full_weight += df[rnd_truck,1]

take most of the time, followed by the

full_length = df[rnd_truck,2]

full_weight = df[rnd_truck,1]

lines.

As far as I can tell, the bottleneck is the access time each time the function fetches a single number from the numpy array.
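One way to check this hypothesis is to time the two per-iteration operations separately. The sketch below is only illustrative: the array is a randomly generated stand-in for df, and the names t_index and t_rand are made up for this example. Each measurement also includes the overhead of calling a lambda, so the figures are upper bounds.

```python
import timeit

import numpy as np

# Hypothetical stand-in for the question's data: 3 float columns.
rng = np.random.default_rng(0)
df = rng.random((100_000, 3))

n_calls = 100_000

# Cost of reading one scalar out of the array from Python.
t_index = timeit.timeit(lambda: df[123, 2], number=n_calls)

# Cost of drawing one random row index, as the loop does on every pass.
t_rand = timeit.timeit(lambda: np.random.randint(0, len(df)), number=n_calls)

print(f"df[i, j]          : {t_index / n_calls * 1e6:.3f} us per call")
print(f"np.random.randint : {t_rand / n_calls * 1e6:.3f} us per call")
```

If both figures are of the same order, the cost is probably spread over every scalar NumPy call made from the interpreter rather than concentrated in the array lookup.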

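When I run the function as MonteCarlo(df,15.,1000.), calling it takes 37 seconds on an i7 3.40 GHz 64-bit Windows machine with 8 GB of RAM. In my application I need to run it 1,000,000,000 times to ensure convergence, which pushes the execution time past an hour. I tried doing the sums with operator.add, but that did not help at all. It looks like I have to come up with a faster way to access this numpy array.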

Any ideas would be welcome!

import numpy as np

def MonteCarlo(df, head, span):
    # Pick the initial truck (column 1 = weight, column 2 = length)
    rnd_truck = np.random.randint(0, len(df))
    full_length = df[rnd_truck, 2]
    full_weight = df[rnd_truck, 1]

    # Keep adding random trucks until the bridge is full
    while True:
        rnd_truck = np.random.randint(0, len(df))
        full_length += head + df[rnd_truck, 2]
        if full_length > span:
            break
        full_weight += df[rnd_truck, 1]

    # Return the average weight per foot on the bridge
    return full_weight / span

Here is part of the df numpy array I am using:

In [31]: df
Out[31]: 
array([[  12. ,  220.4,  108.4],
       [  11. ,  220.4,  106.2],
       [  11. ,  220.3,  113.6],
       ..., 
       [   4. ,   13.9,   36.8],
       [   3. ,   13.7,   33.9],
       [   3. ,   13.7,   10.7]])

Best answer

As others have pointed out, this is not vectorized at all, so the slowness really comes from the Python interpreter.
Cython can help here with minimal changes:

>>> %timeit MonteCarlo(df, 5, 1000)
10000 loops, best of 3: 48 us per loop

>>> %timeit MonteCarlo_cy(df, 5, 1000)
100000 loops, best of 3: 3.67 us per loop

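where MonteCarlo_cy is defined as follows (in an IPython notebook, after %load_ext cythonmagic; in recent Cython releases the extension is loaded with %load_ext Cython):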

%%cython
import numpy as np
cimport numpy as np

def MonteCarlo_cy(double[:, ::1] df, double head, double span):
    # Pick initial truck
    cdef long n = df.shape[0]
    cdef long rnd_truck = np.random.randint(0, n)
    cdef double full_weight = df[rnd_truck, 1]
    cdef double full_length = df[rnd_truck, 2]

    # Loop using other random truck until the bridge is full
    while True:
        rnd_truck = np.random.randint(0, n)
        full_length += head + df[rnd_truck, 2]
        if full_length > span:
            break
        else:
            full_weight += df[rnd_truck, 1]

    # Return the average weight per foot on the bridge
    return full_weight / span
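
The answer above notes that the original function is not vectorized. As an alternative to the Cython route, and not something the answerer proposed, here is a rough NumPy sketch that runs many simulations in one call. The names monte_carlo_batch, n_sims and rng are hypothetical. It assumes every truck length (column 2) is positive and head is non-negative, so the running length always increases. It draws from np.random.default_rng rather than np.random.randint, so results match the original in distribution but not draw for draw.

```python
import numpy as np


def monte_carlo_batch(df, head, span, n_sims, rng=None):
    """Run n_sims independent bridge-loading simulations at once.

    Hypothetical helper for illustration. Assumes df[:, 2] > 0 and head >= 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = df[:, 1]
    lengths = df[:, 2]

    # Enough draws per simulation that the span is always exceeded,
    # even if every draw is the shortest truck.
    max_trucks = int(span // (head + lengths.min())) + 2

    # One row of random truck indices per simulation.
    idx = rng.integers(0, len(df), size=(n_sims, max_trucks))

    # Length added by each draw; the first truck has no headway before it.
    added = lengths[idx] + head
    added[:, 0] -= head
    cum_length = np.cumsum(added, axis=1)

    # A truck counts while the running length stays within the span.
    # The first truck always counts, as in the original loop.
    on_bridge = cum_length <= span
    on_bridge[:, 0] = True

    total_weight = np.where(on_bridge, weights[idx], 0.0).sum(axis=1)
    return total_weight / span
```

A call such as monte_carlo_batch(df, 15.0, 1000.0, n_sims=100_000).mean() would then estimate the average load. Memory grows with n_sims times the number of draws per simulation, so a very large number of runs is better split into several batches.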