[Spark MLlib] How to map a massive number of strings to numbers: StringIndexer & IndexToString

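[Foreword] While using the ALS collaborative filtering API in Spark MLlib, I found that of the three parameters of Rating (user id, product name, and product rating), the first two must be Int values. So the question is: when your user ids and product names are Strings, you need a way to map a massive number of Strings to numeric values. Fortunately, Spark MLlib has an answer for this.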

StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.

Examples

Assume that we have the following DataFrame with columns id and category:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | a
 4  | a
 5  | c

category is a string column with three labels: “a”, “b”, and “c”. Applying StringIndexer with category as the input column and categoryIndex as the output column, we should get the following:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0

“a” gets index 0 because it is the most frequent, followed by “c” with index 1 and “b” with index 2.
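The frequency ordering described above can be mimicked without Spark. What follows is a minimal sketch in plain Scala, not Spark's actual implementation: FrequencyIndexSketch and buildIndex are hypothetical names, and breaking ties alphabetically is an assumption made only to keep the result deterministic (the example data has no ties).

```scala
object FrequencyIndexSketch {
  // Hypothetical helper (not a Spark API): the most frequent label gets index 0.0,
  // the next most frequent gets 1.0, and so on.
  // Assumption: ties are broken alphabetically so the result is deterministic.
  def buildIndex(labels: Seq[String]): Map[String, Double] =
    labels
      .groupBy(identity)                                        // label -> all its occurrences
      .toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }        // descending frequency
      .zipWithIndex
      .map { case ((label, _), idx) => label -> idx.toDouble }
      .toMap

  def main(args: Array[String]): Unit = {
    // Same category column as in the table above.
    val categories = Seq("a", "b", "c", "a", "a", "c")
    val index = buildIndex(categories)
    // Print each row's label next to the index it was assigned.
    categories.zipWithIndex.foreach { case (label, id) =>
      println(s"$id | $label | ${index(label)}")
    }
  }
}
```

Running main on the six rows of the example assigns “a”, “c”, and “b” the indices 0.0, 1.0, and 2.0, matching the categoryIndex column in the table.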

Additionally, there are two strategies for how StringIndexer handles unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:

·       throw an exception (which is the default)

·       skip the row containing the unseen label entirely

Examples

Let’s go back to our previous example, but this time reuse our previously defined StringIndexer on the following dataset:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | d

If you have not set how StringIndexer handles unseen labels, or have set it to “error”, an exception will be thrown. However, if you had called setHandleInvalid("skip"), the following dataset will be generated:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0

Notice that the row containing “d” does not appear.

·       Scala code:

Refer to the StringIndexer Scala docs for more details on the API.

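import org.apache.spark.ml.feature.StringIndexer

// Build a DataFrame from the (id, category) pairs and name the columns "id" and "category"
val df = sqlContext.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

// Create a StringIndexer with "category" as the input column and "categoryIndex" as the output column
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// Fit the indexer on df to learn the label-to-index mapping, then transform df to append the categoryIndex column
val indexed = indexer.fit(df).transform(df)
indexed.show()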

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/StringIndexerExample.scala" in the Spark repo.
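The code above does not exercise the unseen-label handling described earlier. The following sketch shows one way the “skip” strategy might be tried out. It assumes Spark 2.0 or later (it uses SparkSession rather than the sqlContext used elsewhere in this article) and a local master; SkipUnseenLabelsSketch and transformSkippingUnseen are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

object SkipUnseenLabelsSketch {
  // Hypothetical helper: fits on the first dataset, then transforms the second one,
  // which contains the label "d" that was never seen during fitting.
  def transformSkippingUnseen(spark: SparkSession): DataFrame = {
    val training = spark.createDataFrame(
      Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
    ).toDF("id", "category")

    val withUnseen = spark.createDataFrame(
      Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"))
    ).toDF("id", "category")

    val model = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("skip") // drop rows whose label was not seen in fit()
      .fit(training)

    model.transform(withUnseen)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark session is acceptable for this experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SkipUnseenLabelsSketch")
      .getOrCreate()
    transformSkippingUnseen(spark).show()
    spark.stop()
  }
}
```

With the handleInvalid setting left at its default, the same transform would be expected to fail once the row containing “d” is evaluated, as described above.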

 

 

IndexToString

Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. The common use case is to produce indices from labels with StringIndexer, train a model with those indices, and retrieve the original labels from the column of predicted indices with IndexToString. However, you are free to supply your own labels.

Examples

Building on the StringIndexer example, let’s assume we have the following DataFrame with columns id and categoryIndex:

 id | categoryIndex
----|---------------
 0  | 0.0
 1  | 2.0
 2  | 1.0
 3  | 0.0
 4  | 0.0
 5  | 1.0

Applying IndexToString with categoryIndex as the input column and originalCategory as the output column, we are able to retrieve our original labels (they will be inferred from the columns’ metadata):

 id | categoryIndex | originalCategory
----|---------------|------------------
 0  | 0.0           | a
 1  | 2.0           | b
 2  | 1.0           | c
 3  | 0.0           | a
 4  | 0.0           | a
 5  | 1.0           | c

·       Scala code:

Refer to the IndexToString Scala docs for more details on the API.

import org.apache.spark.ml.feature.{StringIndexer, IndexToString}

// Create the input DataFrame
val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

// Fit a StringIndexer with "category" as the input column and "categoryIndex" as the output column
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

// Use IndexToString with "categoryIndex" as the input column and "originalCategory" as the output column
val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")

// The labels are read from the metadata that StringIndexer attached to categoryIndex
val converted = converter.transform(indexed)
converted.select("id", "originalCategory").show()

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/IndexToStringExample.scala" in the Spark repo.
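The remark that you are free to supply your own labels can be illustrated with setLabels. The sketch below assumes Spark 2.0 or later (SparkSession) and a local master, and starts from a DataFrame that carries only numeric indices with no StringIndexer metadata; CustomLabelsSketch and restoreLabels are hypothetical names, and the label array is simply the frequency order from the earlier example.

```scala
import org.apache.spark.ml.feature.IndexToString
import org.apache.spark.sql.{DataFrame, SparkSession}

object CustomLabelsSketch {
  // Hypothetical helper: maps indices back to strings using labels supplied by hand
  // instead of labels inferred from column metadata.
  def restoreLabels(spark: SparkSession): DataFrame = {
    // A plain numeric column; no StringIndexer was run on this DataFrame.
    val indexed = spark.createDataFrame(
      Seq((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0))
    ).toDF("id", "categoryIndex")

    val converter = new IndexToString()
      .setInputCol("categoryIndex")
      .setOutputCol("originalCategory")
      .setLabels(Array("a", "c", "b")) // position i holds the label for index i

    converter.transform(indexed)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark session is acceptable for this experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("CustomLabelsSketch")
      .getOrCreate()
    restoreLabels(spark).show()
    spark.stop()
  }
}
```

Supplying labels explicitly may be useful when the indices come from somewhere other than a StringIndexer in the same job, for example predictions loaded from storage, provided the array order matches the mapping that produced the indices.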

    Original author: 栗子ma
    Original URL: https://blog.csdn.net/sinat_40431164/article/details/80578848