Task
I am using the Python API for Spark (PySpark) to compute the size of the indices of a __SparseVector__.
Script
from pyspark.ml.feature import VectorAssembler

def score_clustering(dataframe):
    assembler = VectorAssembler(inputCols=dataframe.drop("documento").columns, outputCol="variables")
    data_transformed = assembler.transform(dataframe)
    data_transformed_rdd = data_transformed.select("documento", "variables").orderBy(data_transformed.documento.asc()).rdd
    count_variables = data_transformed_rdd.map(lambda row: [row[0], row[1].indices.size]).toDF(["id", "frequency"])
    return count_variables
Problem
When I call __.count()__ on the __count_variables__ DataFrame, I get this error:
AttributeError: 'numpy.ndarray' object has no attribute 'indices'
The main part to look at is:
data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])
I believe this block is related to the error, but I cannot understand why the exception mentions __numpy.ndarray__: the __lambda expression__ I map with should receive a __SparseVector__ (created by the __assembler__) as its argument.
Any suggestions? Does anyone know what I am doing wrong?
Best answer: There are two problems here. The first is the indices.size call: indices and size are two different attributes of the
SparseVector class. size is the full vector size, while indices holds the indices of the vector's nonzero values; size is not an attribute of indices. So, assuming all of your vectors are instances of the SparseVector class:
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.sparse(4, [], [])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])
df.show()
+---------+--------------------+
|documento| variables|
+---------+--------------------+
| 0|(4,[0,1],[11.0,2.0])|
| 1| (4,[],[])|
| 3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+
The solution is the len function:
df = df.rdd.map(lambda x: (x[0], x[1], len(x[1].indices)))\
       .toDF(["documento", "variables", "frecuencia"])
df.show()
+---------+--------------------+----------+
|documento| variables|frecuencia|
+---------+--------------------+----------+
| 0|(4,[0,1],[11.0,2.0])| 2|
| 1| (4,[],[])| 0|
| 3|(4,[0,1,2],[2.0,2...| 3|
+---------+--------------------+----------+
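To see why len of the indices attribute gives the nonzero count, here is a minimal sketch with plain numpy arrays standing in for the indices attribute of the example vectors above (the variable names are illustrative, not from the original code):

```python
import numpy as np

# The indices attribute of a SparseVector is a numpy array holding the
# positions of the nonzero values; its length is the nonzero count.
indices = np.array([0, 1])               # e.g. (4,[0,1],[11.0,2.0])
print(len(indices))                      # 2 nonzero entries

empty_indices = np.array([], dtype=int)  # e.g. (4,[],[])
print(len(empty_indices))                # 0 nonzero entries
```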
The second problem: VectorAssembler does not always produce SparseVectors; depending on which is more efficient, it can produce either SparseVectors or DenseVectors (based on the number of zeros in the original vector). For example, suppose the following DataFrame:
df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.dense([1.0, 1.0, 1.0, 1.0])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])
df.show()
+---------+--------------------+
|documento| variables|
+---------+--------------------+
| 0|(4,[0,1],[11.0,2.0])|
| 1| [1.0,1.0,1.0,1.0]|
| 3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+
Document 1 is a DenseVector, and the previous solution does not work because DenseVectors have no indices attribute; to handle a DataFrame containing both sparse and dense vectors you have to use a more general vector representation, for example numpy:
import numpy as np

df = df.rdd.map(lambda x: (x[0],
                           x[1],
                           np.nonzero(x[1])[0].size))\
       .toDF(["documento", "variables", "frecuencia"])
df.show()
+---------+--------------------+----------+
|documento| variables|frecuencia|
+---------+--------------------+----------+
| 0|(4,[0,1],[11.0,2.0])| 2|
| 1| [1.0,1.0,1.0,1.0]| 4|
| 3|(4,[0,1,2],[2.0,2...| 3|
+---------+--------------------+----------+
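The np.nonzero approach can be sanity-checked outside Spark. A minimal sketch using plain Python lists in place of the PySpark vectors (the dense lists below are my own expansions of the example rows, not from the original answer):

```python
import numpy as np

# Dense expansions of the three example rows above.
rows = [
    (0, [11.0, 2.0, 0.0, 0.0]),  # sparse vector (4,[0,1],[11.0,2.0])
    (1, [1.0, 1.0, 1.0, 1.0]),   # dense vector [1.0,1.0,1.0,1.0]
    (3, [2.0, 2.0, 2.0, 0.0]),   # sparse vector (4,[0,1,2],[2.0,2.0,2.0])
]

# np.nonzero returns the positions of the nonzero entries, so its size
# is the nonzero count, whatever the original vector representation was.
for doc, values in rows:
    print(doc, np.nonzero(values)[0].size)  # prints 2, 4, 3 respectively
```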