[Spark Java API] Transformation (12): zipPartitions, zip

July 14, 2023 | Source: 小飞_侠_kobe

zipPartitions

Official documentation:

Zip this RDD's partitions with one (or more) RDD(s) and return a new RDD by applying a function 
to the zipped partitions. Assumes that all the RDDs have the *same number of partitions*, 
but does *not* require them to have the same number of elements in each partition.

Function signature:

def zipPartitions[U, V](    
    other: JavaRDDLike[U, _], 
    f: FlatMapFunction2[java.util.Iterator[T], java.util.Iterator[U], V]): JavaRDD[V]

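This function combines two RDDs partition by partition: the iterators over the i-th partition of each RDD are passed to f together, and the values f returns form the corresponding partition of a new RDD.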

Source code analysis:

def zipPartitions[B: ClassTag, V: ClassTag]    
      (rdd2: RDD[B], preservesPartitioning: Boolean)    
      (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {  
    new ZippedPartitionsRDD2(sc, sc.clean(f), this, rdd2, preservesPartitioning)
}

private[spark] class ZippedPartitionsRDD2[A: ClassTag, B: ClassTag, V: ClassTag](    
    sc: SparkContext,    
    var f: (Iterator[A], Iterator[B]) => Iterator[V],    
    var rdd1: RDD[A],    
    var rdd2: RDD[B],    
    preservesPartitioning: Boolean = false)  
  extends ZippedPartitionsBaseRDD[V](sc, List(rdd1, rdd2), preservesPartitioning) {  

  override def compute(s: Partition, context: TaskContext): Iterator[V] = {    
      val partitions = s.asInstanceOf[ZippedPartitionsPartition].partitions    
      f(rdd1.iterator(partitions(0), context), rdd2.iterator(partitions(1), context))  
  }  

  override def clearDependencies() {    
      super.clearDependencies()    
      rdd1 = null    
      rdd2 = null    
      f = null  
  }
}

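As the source shows, zipPartitions creates a ZippedPartitionsRDD2, which extends ZippedPartitionsBaseRDD. The getPartitions method of ZippedPartitionsBaseRDD checks that the RDDs being combined have the same number of partitions. Nothing in the implementation, however, requires corresponding partitions to hold the same number of elements.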
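The behaviour described above can be modelled without Spark. The following is a minimal plain-Java sketch, not Spark code: an "RDD" is represented as a list of partitions, and the class ZipPartitionsSketch, the helper zipPartitions, the function sums and the wording of the error message are all hypothetical names chosen for illustration. It shows the two properties discussed: the partition counts must match, while the element counts per partition may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Plain-Java model of zipPartitions; no Spark classes are used.
// An "RDD" is modelled as a list of partitions, each partition a list of elements.
final class ZipPartitionsSketch {

    // Hypothetical helper: applies f to the i-th partitions of both inputs.
    static <A, B, V> List<List<V>> zipPartitions(
            List<List<A>> rdd1,
            List<List<B>> rdd2,
            BiFunction<Iterator<A>, Iterator<B>, Iterator<V>> f) {
        // Models the partition-count check done in getPartitions.
        if (rdd1.size() != rdd2.size()) {
            throw new IllegalArgumentException(
                "unequal numbers of partitions: " + rdd1.size() + " vs " + rdd2.size());
        }
        List<List<V>> result = new ArrayList<>();
        for (int i = 0; i < rdd1.size(); i++) {
            // Models compute(): f receives one iterator per parent partition.
            Iterator<V> out = f.apply(rdd1.get(i).iterator(), rdd2.get(i).iterator());
            List<V> partition = new ArrayList<>();
            out.forEachRemaining(partition::add);
            result.add(partition);
        }
        return result;
    }

    // Example f: sums each side separately and emits one string per partition pair.
    static Iterator<String> sums(Iterator<Integer> left, Iterator<Integer> right) {
        int l = 0;
        while (left.hasNext()) {
            l += left.next();
        }
        int r = 0;
        while (right.hasNext()) {
            r += right.next();
        }
        return Arrays.asList(l + "|" + r).iterator();
    }

    public static void main(String[] args) {
        // Two partitions on each side, but different element counts per partition.
        List<List<Integer>> a = Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4));
        List<List<Integer>> b = Arrays.asList(Arrays.asList(10), Arrays.asList(20, 30));
        List<List<String>> zipped = zipPartitions(a, b, ZipPartitionsSketch::sums);
        System.out.println(zipped); // prints [[6|10], [4|50]]
    }
}
```

Because f sees whole iterators rather than single elements, it is free to decide what to do when one side is longer than the other; in this sketch it simply consumes both sides independently.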

Example:

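import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction2;

// javaSparkContext is an existing JavaSparkContext. Both RDDs get 3 partitions,
// but the first holds 7 elements and the second only 6.
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
List<Integer> data1 = Arrays.asList(3, 2, 12, 5, 6, 1);
JavaRDD<Integer> javaRDD1 = javaSparkContext.parallelize(data1, 3);
JavaRDD<String> zipPartitionsRDD = javaRDD.zipPartitions(javaRDD1, new FlatMapFunction2<Iterator<Integer>, Iterator<Integer>, String>() {
    // Spark 1.x signature; from Spark 2.0 on, call() returns Iterator<String> instead of Iterable<String>.
    @Override
    public Iterable<String> call(Iterator<Integer> integerIterator, Iterator<Integer> integerIterator2) throws Exception {
        LinkedList<String> linkedList = new LinkedList<String>();
        // Pair elements until either partition runs out; surplus elements are dropped.
        while (integerIterator.hasNext() && integerIterator2.hasNext()) {
            linkedList.add(integerIterator.next().toString() + "_" + integerIterator2.next().toString());
        }
        return linkedList;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + zipPartitionsRDD.collect());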
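To see how the element pairs in this example come about, the sketch below replays it in plain Java without Spark. It rests on one assumption that the article does not state: that parallelize(data, n) cuts a local list into n contiguous ranges with boundaries at i * size / n. The class ZipPartitionsExampleSketch and the helpers slice, joinPairs and run are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java replay of the zipPartitions example; no Spark classes are used.
final class ZipPartitionsExampleSketch {

    // Hypothetical stand-in for parallelize(data, numSlices). ASSUMPTION: slice i
    // covers the index range [i * size / numSlices, (i + 1) * size / numSlices).
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * data.size() / numSlices);
            int end = (int) ((long) (i + 1) * data.size() / numSlices);
            partitions.add(new ArrayList<>(data.subList(start, end)));
        }
        return partitions;
    }

    // Same logic as call() in the example: pair elements until either side runs out.
    static List<String> joinPairs(Iterator<Integer> left, Iterator<Integer> right) {
        List<String> out = new ArrayList<>();
        while (left.hasNext() && right.hasNext()) {
            out.add(left.next() + "_" + right.next());
        }
        return out;
    }

    // Applies joinPairs to each pair of partitions and concatenates the results,
    // which is what collect() would return.
    static List<String> run() {
        List<List<Integer>> p1 = slice(Arrays.asList(5, 1, 1, 4, 4, 2, 2), 3);
        List<List<Integer>> p2 = slice(Arrays.asList(3, 2, 12, 5, 6, 1), 3);
        List<String> collected = new ArrayList<>();
        for (int i = 0; i < p1.size(); i++) {
            collected.addAll(joinPairs(p1.get(i).iterator(), p2.get(i).iterator()));
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [5_3, 1_2, 1_12, 4_5, 4_6, 2_1]
    }
}
```

Under that slicing assumption the first list splits into [5, 1], [1, 4], [4, 2, 2] and the second into [3, 2], [12, 5], [6, 1]. The last partition pair has three elements against two, so the trailing 2 is silently dropped by the while loop rather than causing an error.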

zip

Official documentation:

Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* 
and the *same number of elements in each partition* (e.g. one was made through a map on the other).

Function signature:

def zip[U](other: JavaRDDLike[U, _]): JavaPairRDD[T, U]

This function pairs up the elements of two RDDs by position and returns the result as a key/value RDD.

Source code analysis:

def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {  
  zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>    
    new Iterator[(T, U)] {      
      def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {        
        case (true, true) => true        
        case (false, false) => false        
        case _ => throw new SparkException("Can only zip RDDs with " +
          "same number of elements in each partition")
      }      
      def next(): (T, U) = (thisIter.next(), otherIter.next())    
    }  
  }
}

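As the source shows, zip is built on zipPartitions and passes preservesPartitioning = false; preservesPartitioning indicates whether the parent RDD's partitioner is retained. In addition, the two RDDs must have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.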
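The anonymous Iterator in the Scala source is the part that enforces equal element counts. As a rough Java counterpart, here is a minimal sketch outside of Spark: the class StrictZipSketch and the helpers strictZip and collect are hypothetical, Map.Entry stands in for Scala's tuple, and IllegalStateException stands in for SparkException.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Plain-Java counterpart of the iterator that zip passes to zipPartitions.
final class StrictZipSketch {

    static <T, U> Iterator<Map.Entry<T, U>> strictZip(Iterator<T> left, Iterator<U> right) {
        return new Iterator<Map.Entry<T, U>>() {
            @Override
            public boolean hasNext() {
                boolean l = left.hasNext();
                boolean r = right.hasNext();
                if (l != r) {
                    // One side is exhausted while the other still has elements.
                    // IllegalStateException is a stand-in for SparkException.
                    throw new IllegalStateException(
                        "Can only zip RDDs with same number of elements in each partition");
                }
                return l;
            }

            @Override
            public Map.Entry<T, U> next() {
                return new SimpleImmutableEntry<>(left.next(), right.next());
            }
        };
    }

    // Drains the zipped iterator into a list, similar to collecting one partition.
    static <T, U> List<Map.Entry<T, U>> collect(Iterator<T> left, Iterator<U> right) {
        List<Map.Entry<T, U>> out = new ArrayList<>();
        strictZip(left, right).forEachRemaining(out::add);
        return out;
    }

    public static void main(String[] args) {
        // Equal lengths: pairs come out in order.
        List<Map.Entry<Integer, Integer>> ok =
            collect(Arrays.asList(4, 2, 2).iterator(), Arrays.asList(6, 1, 7).iterator());
        System.out.println(ok); // prints [4=6, 2=1, 2=7]

        // Unequal lengths: the failure surfaces only once the shorter side is exhausted.
        try {
            collect(Arrays.asList(4, 2, 2).iterator(), Arrays.asList(6, 1).iterator());
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

Note that the check happens lazily inside hasNext, so a length mismatch is detected while the partition is being consumed, not when zip is called.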

Example:

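import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
List<Integer> data1 = Arrays.asList(3, 2, 12, 5, 6, 1, 7);
// Pass 3 explicitly: zip needs equal partition counts, and the default for parallelize(data1) may differ.
JavaRDD<Integer> javaRDD1 = javaSparkContext.parallelize(data1, 3);
JavaPairRDD<Integer, Integer> zipRDD = javaRDD.zip(javaRDD1);
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + zipRDD.collect());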

    Original author: 小飞_侠_kobe
    Original URL: https://www.jianshu.com/p/d19263471050