Spark 1.2.0 was officially released yesterday! (https://spark.apache.org/news/spark-1-2-0-released.html) This release brings quite a few changes, mainly performance and stability improvements, some new APIs, and better Python support. Below is a translation of the important parts:
- Spark Core
Version 1.2 upgrades two core subsystems to improve the performance and stability of shuffles over very large data. First, the communication manager used for bulk transfers has been switched to a netty-based implementation. Second, the shuffle mechanism has been upgraded to a "sort based" shuffle. In addition, an elastic scaling mechanism has been added to improve cluster utilization for long-running ETL jobs. Finally, Spark 1.2.0 adds support for Scala 2.11.
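The elastic scaling part is driven purely by configuration. Here is a minimal sketch, assuming a YARN deployment; the property names are the dynamic allocation settings documented for Spark 1.2, and the min/max executor counts are placeholder values.

```python
from pyspark import SparkConf, SparkContext

# Enable dynamic executor allocation (elastic scaling) for a long-running
# ETL job. The external shuffle service must also be enabled so executors
# can be released without losing their shuffle output.
conf = (SparkConf()
        .setAppName("elastic-etl-job")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")    # placeholder
        .set("spark.dynamicAllocation.maxExecutors", "50")   # placeholder
        .set("spark.shuffle.service.enabled", "true"))

sc = SparkContext(conf=conf)
```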
- Spark Streaming
Streaming also gets two major updates. First, a Python API has been added: it now covers all DStream transformations and output operations, and on the input side it supports text files and text received over sockets; Kafka and Flume support will have to wait for the next release. Second, a write ahead log (WAL) mechanism has been added, which persists received data to a system such as HDFS to avoid message loss.
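To illustrate the new Python streaming API, here is a minimal word-count sketch over a socket text stream. The host and port are placeholders; enabling the WAL itself is a separate streaming configuration flag plus a checkpoint directory, which is not shown here.

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("streaming-wordcount")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Count words arriving as lines of text over a socket (placeholder host/port).
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```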
- MLlib
A new machine learning package, ML, has been added, along with support for learning pipelines, which let multiple algorithms be chained and run in sequence. The release also adds tree-based classification algorithms, random forests and GBDT (gradient-boosted trees), as well as more Python API coverage.
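For the tree ensembles, here is a minimal PySpark sketch of training a random forest classifier. The toy data and parameter values are made up for illustration, and the exact Python coverage of the new algorithms may differ between releases.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

sc = SparkContext(appName="random-forest-sketch")

# Toy training set: a binary label followed by two numeric features.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(0.0, [0.2, 0.8]),
    LabeledPoint(1.0, [0.9, 0.1]),
])

model = RandomForest.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={},
    numTrees=10, featureSubsetStrategy="auto",
    impurity="gini", maxDepth=4)

print(model.predict([0.1, 0.9]))  # classify a single feature vector
```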
- Spark SQL
This release adds a new API for external data sources. Spark's Parquet and JSON support has also been rewritten on top of it, with the goal of integrating better with more community projects. Hive integration has been improved as well, with support for the fixed-precision decimal type and Hive 0.13.
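A minimal PySpark sketch of the JSON and Parquet support is shown below; the file paths are hypothetical.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="sql-sketch")
sqlContext = SQLContext(sc)

# Load JSON-lines data, query it with SQL, and write the result as Parquet.
people = sqlContext.jsonFile("examples/people.json")       # hypothetical path
people.registerTempTable("people")
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.saveAsParquetFile("people_adults.parquet")           # hypothetical path
```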
- GraphX
This update makes GraphX more stable. The aggregateMessages method replaces the original mapReduceTriplets and offers better performance. Graph checkpointing and lineage truncation have also been added to support large iterative computations. Finally, PageRank computation and graph loading are now more efficient.
- Other Notes
- PySpark’s sort operator now supports external spilling for large datasets.
- PySpark now supports broadcast variables larger than 2GB and performs external spilling during sorts.
- Spark adds a job-level progress page in the Spark UI, a stable API for progress reporting, and dynamic updating of output metrics as jobs complete.
- Spark now has support for reading binary files for images and other binary formats.
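As a small sketch of the binary file support from Python: sc.binaryFiles returns (path, bytes) pairs. The input directory below is a placeholder, and depending on the exact release the Python wrapper for this call may have landed slightly later than the Scala one.

```python
from pyspark import SparkContext

sc = SparkContext(appName="binary-files-sketch")

# Each element is a (file path, raw bytes) pair; here we just report sizes.
images = sc.binaryFiles("hdfs:///data/images/")  # placeholder path
sizes = images.mapValues(len)
print(sizes.take(5))
```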
- Upgrading to Spark 1.2
Spark 1.2.0 is fully compatible with 1.1.0, so existing code does not need to be rewritten (hooray!). Some default values have changed, listed below:
- spark.shuffle.blockTransferService has been changed from nio to netty
- spark.shuffle.manager has been changed from hash to sort
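If the new shuffle code paths cause problems in a particular environment, the old behavior can be restored explicitly. A minimal sketch follows; the same property names can also be set in spark-defaults.conf or via --conf.

```python
from pyspark import SparkConf, SparkContext

# Revert to the pre-1.2 shuffle defaults.
conf = (SparkConf()
        .setAppName("legacy-shuffle-settings")
        .set("spark.shuffle.blockTransferService", "nio")
        .set("spark.shuffle.manager", "hash"))

sc = SparkContext(conf=conf)
```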
- In PySpark, the default batch size has been changed to 0, which means the batch size is chosen based on the size of the objects. Pre-1.2 behavior can be restored using SparkContext([... args... ], batchSize=1024).
- Spark SQL has changed the following defaults:
- spark.sql.parquet.cacheMetadata: false -> true
- spark.sql.parquet.compression.codec: snappy -> gzip
- spark.sql.hive.convertMetastoreParquet: false -> true
- spark.sql.inMemoryColumnarStorage.compressed: false -> true
- spark.sql.inMemoryColumnarStorage.batchSize: 1000 -> 10000
- spark.sql.autoBroadcastJoinThreshold: 10000 -> 10485760 (10 MB)
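For the Spark SQL settings, the previous values can be restored at runtime; below is a minimal sketch using the SQL SET command for two of the properties (the same pattern should apply to the rest).

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="sql-legacy-defaults")
sqlContext = SQLContext(sc)

# Revert two of the changed Spark SQL defaults to their pre-1.2 values.
sqlContext.sql("SET spark.sql.parquet.compression.codec=snappy")
sqlContext.sql("SET spark.sql.inMemoryColumnarStorage.batchSize=1000")
```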