RDD (Resilient Distributed Dataset) is the core concept and defining feature of Spark. Its two most important characteristics are:
1. RDDs store and compute data in memory
2. RDDs support parallel processing
There is a deep theoretical and algorithmic foundation behind RDDs; for a more thorough treatment, see the paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
This post takes a practical look at how to create an RDD object in Spark with Python and perform some basic operations on it.
After entering the pyspark environment (I'll write up how to set up pyspark in an IPython notebook another day), first start Spark:
from pyspark import SparkContext
sc = SparkContext("local", "test_app")  # master URL "local", application name "test_app"
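As a quick sanity check (my own addition, not from the original post), you can inspect the new SparkContext; when running locally the Spark UI is served at http://localhost:4040 by default, and the startup log also prints the exact address:

# A few read-only attributes of the SparkContext (illustrative only)
print(sc.version)   # the Spark version this context is running on
print(sc.master)    # "local" -- the master URL passed in above
print(sc.appName)   # "test_app" -- the application name shown in the Spark UI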
Paste the Spark UI URL into a browser and you will see the Spark UI. Since we have not yet created any RDD objects, there is nothing in it for now.
Creating an RDD object is straightforward, much like creating any other Python object:
In [3]:
raw_ratings = sc.textFile('/Users/bryanyang/Documents/Data/Movie Rating/ratings.dat', 10)
raw_ratings.setName("raw ratings")
type(raw_ratings)
Out[3]: pyspark.rdd.RDD
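To round out the "basic operations" mentioned at the start, here is a minimal sketch of a few common RDD actions (these calls are my own illustrative choices, not from the original post; the second argument to textFile() above, 10, is the minimum number of partitions):

# RDDs are lazy; actions such as count() and take() trigger the actual computation.
raw_ratings.cache()                    # keep the partitions in memory after the first action
print(raw_ratings.count())             # number of lines in ratings.dat
print(raw_ratings.take(3))             # the first three raw lines of the file
print(raw_ratings.getNumPartitions())  # at least 10, as requested via textFile()

Once an action has run and the RDD has been cached, it shows up on the Storage tab of the Spark UI under the name set by setName("raw ratings").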