python - Cython: Why do NumPy arrays need to be type-cast to object?

January 30, 2023 · 240 views

I have seen something like this a few times in the Pandas source:

def nancorr(ndarray[float64_t, ndim=2] mat, bint cov=0, minp=None):
    # ...
    N, K = (<object> mat).shape

This implies that a NumPy ndarray named mat is type-cast to a Python object.*

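On closer inspection, the cast seems to be used because a compile error arises without it. My question is: why is this type cast required in the first place?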

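Here are some examples. An earlier answer says simply that tuple packing does not work in Cython the way it does in Python, but this does not seem to be a tuple unpacking issue. (It is a fine answer nonetheless, and I do not mean to pick on it.)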

Take the following script, shape.pyx. It fails at compile time with "Cannot convert 'npy_intp *' to Python object."

from cython cimport Py_ssize_t
import numpy as np
from numpy cimport ndarray, float64_t
cimport numpy as cnp
cnp.import_array()

def test_castobj(ndarray[float64_t, ndim=2] arr):

    cdef:
        Py_ssize_t b1, b2

    # Tuple unpacking - this will fail at compile
    b1, b2 = arr.shape
    return b1, b2

But again, the problem does not seem to be tuple unpacking per se. This fails with the same error:

def test_castobj(ndarray[float64_t, ndim=2] arr):

    cdef:
        # Py_ssize_t b1, b2
        ndarray[float64_t, ndim=2] zeros

    zeros = np.zeros(arr.shape, dtype=np.float64)
    return zeros

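Seemingly, no tuple unpacking is happening here; the tuple is just the first argument to np.zeros. Adding the cast makes both cases compile: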

def test_castobj(ndarray[float64_t, ndim=2] arr):
    """This works"""
    cdef:
        Py_ssize_t b1, b2
        ndarray[float64_t, ndim=2] zeros

    b1, b2 = (<object> arr).shape
    zeros = np.zeros((<object> arr).shape, dtype=np.float64)
    return b1, b2, zeros

This also works (perhaps the most confusing case of all):

def test_castobj(object[float64_t, ndim=2] arr):
    cdef:
        tuple shape = arr.shape
        ndarray[float64_t, ndim=2] zeros
    zeros = np.zeros(shape, dtype=np.float64)
    return zeros

Example:

>>> import numpy as np
>>> from shape import test_castobj
>>> arr = np.arange(6, dtype=np.float64).reshape(2, 3)

>>> test_castobj(arr)
(2, 3, array([[0., 0., 0.],
        [0., 0., 0.]]))

*Perhaps it has something to do with arr being a memoryview? But that is a shot in the dark.

Another example is in the Cython docs:

cpdef int sum3d(int[:, :, :] arr) nogil:
    cdef size_t i, j, k, I, J, K
    cdef int total = 0
    I = arr.shape[0]
    J = arr.shape[1]
    K = arr.shape[2]

In this case, simply indexing arr.shape[i] avoids the error, which I find strange.

This also works:

def test_castobj(object[float64_t, ndim=2] arr):
    cdef ndarray[float64_t, ndim=2] zeros
    zeros = np.zeros(arr.shape, dtype=np.float64)
    return zeros

Best answer

You are right, it has nothing to do with tuple unpacking under Cython.

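The reason is that cnp.ndarray is not a usual numpy array (that is, a numpy array with the interface known from Python), but a Cython wrapper around numpy's C implementation, PyArrayObject (which is exposed in Python as np.ndarray):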

ctypedef class numpy.ndarray [object PyArrayObject]:
    cdef __cythonbufferdefaults__ = {"mode": "strided"}

    cdef:
        # Only taking a few of the most commonly used and stable fields.
        # One should use PyArray_* macros instead to access the C fields.
        char *data
        int ndim "nd"
        npy_intp *shape "dimensions"
        npy_intp *strides
        dtype descr
        PyObject* base

This definition maps shape to the dimensions field of the underlying C struct (npy_intp *shape "dimensions" instead of simply npy_intp *dimensions). It is a trick so that one can write

mat.shape[0]

and have it look (and to some degree feel) like a call to numpy's Python property shape. In reality, it is a shortcut straight into the underlying C struct.
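To make the renaming trick more concrete, here is a rough sketch in plain Python using ctypes. FakeArrayStruct is a hypothetical stand-in for PyArrayObject that models only the nd and dimensions fields; it is not NumPy's real layout, and the helper names are made up for illustration. It tries to show why the C-level shape can be indexed cheaply yet is not itself a Python tuple.

```python
import ctypes

# Hypothetical stand-in for PyArrayObject: only two of its fields are modeled.
class FakeArrayStruct(ctypes.Structure):
    _fields_ = [
        ("nd", ctypes.c_int),                              # number of dimensions
        ("dimensions", ctypes.POINTER(ctypes.c_ssize_t)),  # C array of extents
    ]

# Backing storage for the extents of a 2 x 3 array.
extents = (ctypes.c_ssize_t * 2)(2, 3)
fake = FakeArrayStruct(
    nd=2,
    dimensions=ctypes.cast(extents, ctypes.POINTER(ctypes.c_ssize_t)),
)

def c_style_dim(arr, i):
    """Read one extent directly, like mat.shape[i] on a typed ndarray."""
    return arr.dimensions[i]

def python_style_shape(arr):
    """Build a fresh tuple of all extents, like the Python-level .shape."""
    return tuple(arr.dimensions[i] for i in range(arr.nd))

n_rows = c_style_dim(fake, 0)
shape = python_style_shape(fake)
is_tuple = isinstance(fake.dimensions, tuple)  # False: it is a pointer
```

Indexing the pointer yields a single integer, whereas anything that expects a tuple (unpacking, passing a shape to a function) needs the converted form. That conversion is what the Cython compiler refuses to do implicitly for npy_intp *.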

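By the way, calling the Python-level shape is quite costly: a tuple must be created and filled with the values from dimensions, and only then is element 0 accessed. The Cython way is much cheaper: it just reads the right element.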

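However, if you still want to access the array's Python properties, you have to cast it to a plain Python object (i.e. forget that it is an ndarray). shape is then resolved to the tuple property through the usual Python attribute lookup mechanism.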
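The difference between the two access paths can be modeled in plain Python. The sketch below is only an analogy: ArrayLike is a hypothetical class, not NumPy's implementation, and its shape property rebuilds a tuple on every access, as described for the Python-level property.

```python
class ArrayLike:
    """Hypothetical array-like object whose extents live in internal storage."""

    def __init__(self, *extents):
        self._dimensions = list(extents)  # plays the role of the C field
        self.tuples_built = 0             # counts how often .shape did work

    @property
    def shape(self):
        # Python-level access: allocate and fill a new tuple on every call.
        self.tuples_built += 1
        return tuple(self._dimensions)

    def dim(self, i):
        # Direct access: read a single extent, no tuple involved.
        return self._dimensions[i]


mat = ArrayLike(2, 3)

# Generic attribute lookup, comparable to (<object> mat).shape in Cython:
# the result is a real tuple, so unpacking works.
N, K = getattr(mat, "shape")

# Reading extents inside a loop through .shape builds one tuple per access.
total = 0
for _ in range(1000):
    total += mat.shape[0]

built_via_property = mat.tuples_built  # 1 (unpacking) + 1000 (loop)

# Reading the extents once by index avoids that work entirely.
rows, cols = mat.dim(0), mat.dim(1)
```

A single lookup outside a loop, as in the pandas snippet, costs one tuple; the overhead only matters when the property is hit repeatedly in hot code.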

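So basically, even though it is convenient, you do not want to access the dimensions of a numpy array in a tight loop the way the pandas code does. For performance you would use the more verbose variant instead: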

...
N=mat.shape[0]
K=mat.shape[1]
...

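Why one can write object[cnp.float64_t] or something similar in a function signature puzzles me. The argument is evidently interpreted as a plain object. Maybe this is just a bug.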