When you first start working with RDDs, an RDD feels like a shipping container (Container) that holds your data. Once you dig a little deeper, you find that an RDD is more like the vacuum-sealed foil bag that potato chips come in.
This is Spark's own description of the RDD class: "A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel." (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD)

RDD is the core abstraction of Spark and has many notable properties; immutability is the second key point. Immutability means that once data has been wrapped into an RDD object, that RDD is essentially sealed: if you then filter or map it, the transformed data is wrapped in yet another RDD. This is also reflected in [Scala] Type Erasure: once data goes into an RDD, we can only tell what is inside from the packaging (RDD[Int], RDD[Double]) and cannot inspect the element type directly. Beyond the element type, Spark also defines different RDD subclasses to represent the results of the various operations on an RDD:
(the full list of the RDD family:
https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/rdd)
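To make the sealed-bag idea concrete, here is a minimal spark-shell sketch (it assumes the usual sc SparkContext; the variable names are mine, not from the example further down) showing both points: a transformation returns a brand-new RDD and leaves the original untouched, and the element type only lives on the static packaging, not at runtime:

val nums = sc.parallelize(1 to 5)
val small = nums.filter(_ < 3)   // filter seals the result into a new RDD; nums is not modified
nums.count()                     // still 5: the original bag keeps all its elements
small.count()                    // 2

// Type erasure: RDD[Int] and RDD[String] differ only on the packaging;
// at runtime both wrappers are the same class.
val words = sc.parallelize(Seq("a", "b"))
nums.getClass == words.getClass  // true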
At every stage the result of an operation is wrapped in a corresponding RDD, rather than being put back into the original RDD (shipping container vs. sealed bag).
Here are the various RDDs that show up when you run a simple session:
val num = 1 to 100
//num: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, ..., 100)
val numRDD = sc.parallelize(num)
//numRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:14
val numFilter = numRDD.filter(_ < 10)
//numFilter: org.apache.spark.rdd.RDD[Int] = FilteredRDD[12] at filter at <console>:16
val numMap = numFilter.map(_ + 10)
//numMap: org.apache.spark.rdd.RDD[Int] = MappedRDD[13] at map at <console>:18
val numSum = numMap.reduce(_ + _)
//numSum: Int = 135
The comment under each line in the example is the object type the REPL echoes after execution. You can see that each filter or map call yields a different RDD type: every operation seals its result into a new bag instead of reopening the old one.
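To see the whole chain of bags at once, RDD.toDebugString prints an RDD's lineage; a minimal sketch continuing the session above (the exact class names and indentation in the output vary across Spark versions):

numMap.toDebugString
// prints the lineage, one RDD per transformation, along the lines of:
//   MappedRDD[13] at map at <console>:18
//    FilteredRDD[12] at filter at <console>:16
//     ParallelCollectionRDD[11] at parallelize at <console>:14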