Basic Spark DataFrame Operations

The DataFrame concept comes from R and Pandas, but those implementations run on a single machine; Spark's DataFrame is distributed, with a simple and easy-to-use interface.

  • Lower threshold: the Spark RDD API vs. the MapReduce API
  • One machine: R/Pandas

From the official documentation,
http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#datasets-and-dataframes
excerpted below:

  • A Dataset is a distributed collection of data.
  • A DataFrame is a Dataset organized into named columns (an RDD with a schema):
    a distributed dataset organized as columns, each with a name, a type, and values.
  • An abstraction for selecting, filtering, aggregating and plotting structured data.
  • It is conceptually equivalent to a table in a relational database
    or a data frame in R/Python.
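The "Dataset organized into named columns" idea can be sketched with a tiny local example. This is a minimal sketch, not part of the original article; the object name `NamedColumnsSketch` and the sample rows are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object NamedColumnsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NamedColumnsSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // A DataFrame is a Dataset[Row] plus a schema: every column
    // carries a name and a type, unlike a plain RDD of tuples
    val df = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")
    df.printSchema()   // shows name: string, age: int

    spark.stop()
  }
}
```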

RDD vs. DataFrame:

  • An RDD's execution speed depends on the language it is written in:
java/scala ==> JVM
python ==> Python runtime
  • A DataFrame runs at the same speed regardless of language, because every API compiles to the same plan:
java/scala/python ==> logical plan
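The difference can be seen side by side: an RDD transformation takes an arbitrary closure the engine cannot inspect, while a DataFrame transformation is an expression that Catalyst compiles into a logical plan. A minimal sketch (the object name and sample data are assumptions, not from the article):

```scala
import org.apache.spark.sql.SparkSession

object PlanComparisonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PlanComparisonSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")

    // RDD style: the predicate is an opaque JVM closure; the Python
    // equivalent would run in a separate Python runtime, so speed
    // varies with the language
    val rddCount = df.rdd.filter(row => row.getAs[Int]("age") > 19).count()

    // DataFrame style: the predicate is a column expression compiled
    // into the same optimized logical plan from any language binding
    val dfCount = df.filter($"age" > 19).count()

    println(s"rdd=$rddCount, df=$dfCount")
    spark.stop()
  }
}
```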

The following walkthrough, based on the official example, covers the basic DataFrame operations:

import org.apache.spark.sql.SparkSession

/**
  * Basic DataFrame API operations
  */
object DataFrameApp {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("DataFrameApp")
      .master("local[2]")
      .getOrCreate()

    // Load a JSON file into a DataFrame
    val peopleDF = spark.read.json("C:\\Users\\Administrator\\IdeaProjects\\SparkSQLProject\\spark-warehouse\\people.json")

    // Print the schema to the console in a nice tree format
    peopleDF.printSchema()

    // Show the first 20 records of the dataset
    peopleDF.show()

    // Select all values of one column: select name from table
    peopleDF.select("name").show()

    // Select several columns and compute on one of them:
    // select name, age+10 as age2 from table
    peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 10).as("age2")).show()

    // Filter rows by a column value: select * from table where age > 19
    peopleDF.filter(peopleDF.col("age") > 19).show()

    // Group by a column, then aggregate:
    // select age, count(1) from table group by age
    peopleDF.groupBy("age").count().show()

    spark.stop()
  }
}
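The path above points to a local copy of `people.json`. Assuming it is the `people.json` that ships with the Spark examples (`examples/src/main/resources/people.json`), the file contains one JSON object per line; missing fields simply become nulls in the inferred schema:

```json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
```

With this data, `printSchema()` infers `age: long (nullable)` and `name: string (nullable)`, and the `filter(age > 19)` step returns only Andy's row.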
    Original author: sparkle123
    Original source: https://www.jianshu.com/p/42d3cccfb761
    This article is reposted from the web to share knowledge; if it infringes any rights, please contact the blogger for removal.