Python numpy: apply a function to every pair of columns in a 2-D numpy array?

April 15, 2023 · 285 views

I am trying to apply a function to each pair of columns in a numpy array (each column is an individual's genotype).

For example:

In [48]: g[0:10,0:10]

array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1, -1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
      [-1, -1,  0, -1, -1, -1, -1, -1, -1,  0],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1]], dtype=int8)

My goal is to produce a distance matrix d in which each element is the pairwise distance between two columns of g:

d[0,1] = func(g[:,0], g[:,1])

Any ideas would be great! Thanks!

Best answer

You can simply define the function as:

import itertools
import numpy as np

def count_snp_diffs(x, y):
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)

Then call it using the array generated by itertools.combinations as indices, so that every possible combination of two columns is covered:

combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
dist = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])
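As a quick sanity check, the two lines above can be run end to end on a tiny array. This is only a sketch: `g_small`, `pairs` and `dist_small` are made-up names and data for illustration, and negative entries are assumed to mean "missing", which is how the function above treats them.

```python
import itertools
import numpy as np

def count_snp_diffs(x, y):
    # Count rows where both entries are non-negative and differ.
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)

# Toy genotype matrix (illustrative data): 4 variants x 3 individuals.
g_small = np.array([[ 1,  1,  0],
                    [ 0,  1,  1],
                    [-1,  0,  1],
                    [ 1,  1,  1]], dtype=np.int8)

# One row per column pair: (0, 1), (0, 2), (1, 2).
pairs = np.array(list(itertools.combinations(range(g_small.shape[1]), 2)))
dist_small = count_snp_diffs(g_small[:, pairs[:, 0]], g_small[:, pairs[:, 1]])

print(pairs.tolist())       # [[0, 1], [0, 2], [1, 2]]
print(dist_small.tolist())  # [1, 2, 2]
```

The row containing -1 in column 0 is ignored for the pairs (0, 1) and (0, 2), which is why those counts are lower than a plain `x != y` comparison would give.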

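Additionally, if the output must be stored in a matrix (not recommended for large g, because only the upper triangle will be filled and the rest will be useless information), this can be achieved with the same trick: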

d = np.zeros((g.shape[1],g.shape[1]))
combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
d[combinations[:,0],combinations[:,1]] = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])

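Now d[i,j] returns the distance between columns i and j for i < j (while d[j,i] is zero). This approach relies on the fact that arrays can be indexed with lists or arrays that contain repeated indices: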

a = np.arange(3)+4
a[[0,1,1,1,0,2,1,1]]
# Out
# [4, 5, 5, 5, 4, 6, 5, 5]
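If a full symmetric matrix is wanted instead of only the upper triangle, one possible approach (not part of the original answer) is to add the transpose. The sketch below uses made-up toy data; `g_small`, `d_upper` and `d_full` are illustrative names.

```python
import itertools
import numpy as np

def count_snp_diffs(x, y):
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)

# Toy genotype matrix (illustrative data): 4 variants x 3 individuals.
g_small = np.array([[ 1,  1,  0],
                    [ 0,  1,  1],
                    [-1,  0,  1],
                    [ 1,  1,  1]], dtype=np.int8)

n_cols = g_small.shape[1]
pairs = np.array(list(itertools.combinations(range(n_cols), 2)))

# Fill only the upper triangle, as in the answer.
d_upper = np.zeros((n_cols, n_cols), dtype=int)
d_upper[pairs[:, 0], pairs[:, 1]] = count_snp_diffs(
    g_small[:, pairs[:, 0]], g_small[:, pairs[:, 1]])

# The diagonal and lower triangle are zero, so adding the transpose mirrors
# the upper triangle without double counting.
d_full = d_upper + d_upper.T
print(d_full.tolist())  # [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
```

This doubles the memory of an already mostly redundant matrix, so it only makes sense when downstream code really needs d[j,i] as well.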

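Here is a step-by-step explanation of what is happening.

Calling g[:,combinations[:,0]] selects the columns listed in the first column of combinations and produces a new array, which is compared column by column with the array produced by g[:,combinations[:,1]]. The result is the boolean array diff. If g had 3 columns, it would look like this, where the columns are the comparisons of column pairs (0,1), (0,2) and (1,2):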

[[ True False False]
 [False  True False]
 [ True  True False]
 [False False False]
 [False  True False]
 [False False False]]

Finally, the values in each column are added up:

np.count_nonzero(diff,axis=0)
# Out
# [2 3 0]
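The steps above can be traced on a small example and cross-checked against a plain loop over column pairs, which is what d[i,j] = func(g[:,i], g[:,j]) means literally. This is a sketch on made-up data; `left`, `right`, `vectorized` and `looped` are illustrative names.

```python
import itertools
import numpy as np

def count_snp_diffs(x, y):
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)

# Toy genotype matrix (illustrative data): 4 variants x 3 individuals.
g_small = np.array([[ 1,  1,  0],
                    [ 0,  1,  1],
                    [-1,  0,  1],
                    [ 1,  1,  1]], dtype=np.int8)

pairs = np.array(list(itertools.combinations(range(g_small.shape[1]), 2)))

left = g_small[:, pairs[:, 0]]    # columns 0, 0, 1 -> shape (4, 3)
right = g_small[:, pairs[:, 1]]   # columns 1, 2, 2 -> shape (4, 3)
diff = (left != right) & (left >= 0) & (right >= 0)
vectorized = np.count_nonzero(diff, axis=0)

# Reference: call the function once per pair of 1-D columns.
looped = np.array([count_snp_diffs(g_small[:, i], g_small[:, j])
                   for i, j in pairs])

print(diff.astype(int).tolist())  # [[0, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
print(vectorized.tolist())        # [1, 2, 2]
print(looped.tolist())            # [1, 2, 2]
```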

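Additionally, because the boolean class in Python inherits from the integer class (roughly, False == 0 and True == 1; see the answer to "Is False == 0 and True == 1 in Python an implementation detail or is it guaranteed by the language?" for more information), np.count_nonzero adds 1 for each True position, which gives the same result as np.sum: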

np.sum(diff,axis=0) 
# Out
# [2 3 0]
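The equivalence can be checked directly on the boolean array shown earlier. The snippet below is a minimal sketch that retypes that array by hand.

```python
import numpy as np

# The boolean array diff from the 3-column example above.
diff = np.array([[True,  False, False],
                 [False, True,  False],
                 [True,  True,  False],
                 [False, False, False],
                 [False, True,  False],
                 [False, False, False]])

print(np.count_nonzero(diff, axis=0).tolist())  # [2, 3, 0]
print(np.sum(diff, axis=0).tolist())            # [2, 3, 0]

# Plain Python booleans behave like the integers 0 and 1.
print(issubclass(bool, int))  # True
print(True + True)            # 2
```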

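A note on performance and memory

For large arrays, working on the whole array at once can require too much memory, and you may run into a memory error. For small or medium-sized arrays, however, it tends to be the fastest approach. In some cases it can be useful to work in chunks: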

combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
n = len(combinations)
dist = np.empty(n)
# B = np.zeros((g.shape[1],g.shape[1]))
chunk = 200
# Step through the pairs chunk by chunk; the last slice may be shorter,
# and the loop also works when n <= chunk.
for start in range(0, n, chunk):
    idx = combinations[start:start+chunk]
    dist[start:start+chunk] = count_snp_diffs(g[:,idx[:,0]], g[:,idx[:,1]])
    # B[idx[:,0],idx[:,1]] = dist[start:start+chunk]
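The chunking idea can also be wrapped in a small helper and checked against the all-at-once result. This is a sketch under assumptions: `pairwise_diffs_chunked` is a hypothetical name, and the random genotype data is generated only for the check.

```python
import itertools
import numpy as np

def count_snp_diffs(x, y):
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)

def pairwise_diffs_chunked(g, chunk=200):
    """Hypothetical helper: 1-D pair distances, `chunk` pairs at a time."""
    pairs = np.array(list(itertools.combinations(range(g.shape[1]), 2)))
    dist = np.empty(len(pairs), dtype=np.intp)
    for start in range(0, len(pairs), chunk):
        idx = pairs[start:start + chunk]   # last chunk may be shorter
        dist[start:start + chunk] = count_snp_diffs(g[:, idx[:, 0]],
                                                    g[:, idx[:, 1]])
    return pairs, dist

# Random genotypes in {-1, 0, 1} (illustrative data): 300 rows, 25 columns.
rng = np.random.RandomState(0)
g_demo = rng.randint(-1, 2, size=(300, 25)).astype(np.int8)

pairs, dist_chunked = pairwise_diffs_chunked(g_demo, chunk=64)
dist_full = count_snp_diffs(g_demo[:, pairs[:, 0]], g_demo[:, pairs[:, 1]])

print(len(pairs))                               # 300
print(np.array_equal(dist_chunked, dist_full))  # True
```

Only the intermediate arrays are bounded by the chunk size here; the full list of pairs is still built up front, so for very many columns that list itself may become the memory bottleneck.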

For g.shape = (300, N), the execution times reported by %%timeit on my computer, using python 2.7, numpy 1.14.2 and allel 1.1.10, are:

10 columns:
- numpy matrix storage: 107 µs
- numpy 1D storage: 101 µs
- allel: 247 µs

100 columns:
- numpy matrix storage: 15.7 ms
- numpy 1D storage: 16 ms
- allel: 22.6 ms

1000 columns:
- numpy matrix storage: 1.54 s
- numpy 1D storage: 1.53 s
- allel: 2.28 s

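With these array dimensions, pure numpy is faster than the allel module, but the computation time should be checked for the dimensions in the question.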
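To repeat this kind of measurement outside IPython, something like the following sketch with the standard timeit module could be used. The function names `pair_dist_1d` and `pair_dist_matrix` and the random data are assumptions for illustration. The allel comparison is left out because the answer does not say which allel function was timed, and the numbers will differ from those above depending on machine and library versions.

```python
import itertools
import timeit
import numpy as np

def count_snp_diffs(x, y):
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)

def pair_dist_1d(g):
    # "numpy 1D storage": one distance per column pair.
    pairs = np.array(list(itertools.combinations(range(g.shape[1]), 2)))
    return count_snp_diffs(g[:, pairs[:, 0]], g[:, pairs[:, 1]])

def pair_dist_matrix(g):
    # "numpy matrix storage": upper triangle of a square matrix.
    pairs = np.array(list(itertools.combinations(range(g.shape[1]), 2)))
    d = np.zeros((g.shape[1], g.shape[1]))
    d[pairs[:, 0], pairs[:, 1]] = count_snp_diffs(g[:, pairs[:, 0]],
                                                  g[:, pairs[:, 1]])
    return d

# Random genotypes in {-1, 0, 1} (illustrative data): 300 rows, 50 columns.
rng = np.random.RandomState(0)
g_bench = rng.randint(-1, 2, size=(300, 50)).astype(np.int8)

for fn in (pair_dist_matrix, pair_dist_1d):
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(lambda: fn(g_bench), number=5, repeat=3)) / 5
    print("%s: %.3f ms per call" % (fn.__name__, seconds * 1e3))
```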