我有一个元素列表,可能会启动一些RDD记录的字符串.如果我有和元素列表是和否,它们应该匹配yes23和no3但不匹配35yes或41no.使用pyspark,我如何使用列表或元组中的任何元素的开头.
DF的一个例子是:
+-----+------+
|index| label|
+-----+------+
| 1|yes342|
| 2| 45yes|
| 3| no123|
| 4| 75no|
+-----+------+
当我尝试:
Element_List = ['yes','no']
filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))
生成的df应该类似于:
+-----+------+
|index| label|
+-----+------+
| 1|yes342|
| 3| no123|
+-----+------+
相反,我得到错误:
Py4JError: An error occurred while calling o250.startsWith. Trace:
py4j.Py4JException: Method startsWith([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
提示,所以看起来startsWith不能用于任何类型的列表.有简单的工作吗?
最佳答案 像这样撰写表达式:
from pyspark.sql.functions import col, lit
from functools import reduce
element_list = ['yes','no']
df = spark.createDataFrame(
["yes23", "no3", "35yes", """41no["maybe"]"""],
"string"
).toDF("location")
starts_with = reduce(
lambda x, y: x | y,
[col("location").startswith(s) for s in element_list],
lit(False))
df.where(starts_with).show()
# +--------+
# |location|
# +--------+
# | yes23|
# | no3|
# +--------+