Task
I am using the Python API for Spark (PySpark) to compute the size of the indices of a __SparseVector__.
Script
from pyspark.ml.feature import VectorAssembler

def score_clustering(dataframe):
    assembler = VectorAssembler(inputCols=dataframe.drop("documento").columns, outputCol="variables")
    data_transformed = assembler.transform(dataframe)
    data_transformed_rdd = data_transformed.select("documento", "variables").orderBy(data_transformed.documento.asc()).rdd
    count_variables = data_transformed_rdd.map(lambda row: [row[0], row[1].indices.size]).toDF(["id", "frequency"])
    return count_variables
Problem
When I call __.count()__ on the __count_variables__ DataFrame, I get this error:
AttributeError: 'numpy.ndarray' object has no attribute 'indices'
The main part to look at is:
data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])
I believe this block is related to the error, but I cannot understand why the exception mentions __numpy.ndarray__: the __lambda expression__ I map with should receive a __SparseVector__ (created by the __assembler__) as its argument.
Any suggestions? Does anyone know what I am doing wrong?
Best answer: There are two problems here. The first is the indices.size call: indices and size are two different attributes of the
SparseVector class. size is the full vector size, while indices holds the indices of the vector's nonzero values; size is not an attribute of indices. So, assuming all of your vectors are instances of the SparseVector class:
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.sparse(4, [], [])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])
df.show()
+---------+--------------------+
|documento| variables|
+---------+--------------------+
| 0|(4,[0,1],[11.0,2.0])|
| 1| (4,[],[])|
| 3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+
The solution is the len function:
df = df.rdd.map(lambda x: (x[0], x[1], len(x[1].indices)))\
       .toDF(["documento", "variables", "frecuencia"])
df.show()
+---------+--------------------+----------+
|documento| variables|frecuencia|
+---------+--------------------+----------+
| 0|(4,[0,1],[11.0,2.0])| 2|
| 1| (4,[],[])| 0|
| 3|(4,[0,1,2],[2.0,2...| 3|
+---------+--------------------+----------+
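To see why len of the indices attribute gives the nonzero count, here is a minimal sketch with plain numpy arrays standing in for the indices attribute of the example vectors above (the variable names are illustrative, not from the original code):

```python
import numpy as np

# The indices attribute of a SparseVector is a numpy array holding the
# positions of the nonzero values; its length is the nonzero count.
indices = np.array([0, 1])               # e.g. (4,[0,1],[11.0,2.0])
print(len(indices))                      # 2 nonzero entries

empty_indices = np.array([], dtype=int)  # e.g. (4,[],[])
print(len(empty_indices))                # 0 nonzero entries
```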
The second problem: VectorAssembler does not always produce SparseVectors; depending on which is more efficient, it can produce either SparseVectors or DenseVectors (based on the number of zeros in the original vector). For example, suppose the following DataFrame:
df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.dense([1.0, 1.0, 1.0, 1.0])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])
df.show()
+---------+--------------------+
|documento| variables|
+---------+--------------------+
| 0|(4,[0,1],[11.0,2.0])|
| 1| [1.0,1.0,1.0,1.0]|
| 3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+
Document 1 is a DenseVector, and the previous solution does not work because DenseVectors have no indices attribute; to handle a DataFrame containing both sparse and dense vectors you have to use a more general vector representation, for example numpy:
import numpy as np

df = df.rdd.map(lambda x: (x[0],
                           x[1],
                           np.nonzero(x[1])[0].size))\
       .toDF(["documento", "variables", "frecuencia"])
df.show()
+---------+--------------------+----------+
|documento| variables|frecuencia|
+---------+--------------------+----------+
| 0|(4,[0,1],[11.0,2.0])| 2|
| 1| [1.0,1.0,1.0,1.0]| 4|
| 3|(4,[0,1,2],[2.0,2...| 3|
+---------+--------------------+----------+
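The np.nonzero approach can be sanity-checked outside Spark. A minimal sketch using plain Python lists in place of the PySpark vectors (the dense lists below are my own expansions of the example rows, not from the original answer):

```python
import numpy as np

# Dense expansions of the three example rows above.
rows = [
    (0, [11.0, 2.0, 0.0, 0.0]),  # sparse vector (4,[0,1],[11.0,2.0])
    (1, [1.0, 1.0, 1.0, 1.0]),   # dense vector [1.0,1.0,1.0,1.0]
    (3, [2.0, 2.0, 2.0, 0.0]),   # sparse vector (4,[0,1,2],[2.0,2.0,2.0])
]

# np.nonzero returns the positions of the nonzero entries, so its size
# is the nonzero count, whatever the original vector representation was.
for doc, values in rows:
    print(doc, np.nonzero(values)[0].size)  # prints 2, 4, 3 respectively
```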