Monday, November 23, 2015

[Spark] Why Use Apache Spark?


Today, I would like to switch the focus away from programming and technical details, and share some of my experience and opinions with you from a Spark user’s perspective.

As a data analyst and data engineer specializing in big data, I always find a Hadoop cluster handy for storing data. When it comes to ETL processes or analysis, besides Spark, we have the following choices:


1. Pig. As far as I know, it is popular at Yahoo.

2. Hive. SQL forever.

3. Map Reduce. Hadoop is built on MR, but MR code is ugly to write.

4. Python, R or other analytic applications. They are hard to distribute, so you may need one powerful machine, or you may have to sample from the raw data to conduct the analysis. Still, it is not a bad option.

So, why do I use Apache Spark, and why has Spark been getting so much attention recently?

In my opinion, the strongest advantage of Spark is that it simplifies Map Reduce in the two ways listed below.

Programming-wise: implementing Map Reduce in Spark is extremely easy thanks to lambda expressions. E.g., in PySpark: rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).collect()
You can count the words in an article in just one line.
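
To make the steps of that word count concrete without needing a cluster, here is a minimal plain-Python sketch of what Spark's flatMap, map and reduceByKey compute. It is not Spark itself: the helper names flat_map, reduce_by_key and word_count are made up for illustration, and the data lives in an ordinary list rather than an RDD.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Merge the values of each key with func (hypothetical helper, not Spark's API)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def word_count(lines):
    words = flat_map(lambda line: line.split(" "), lines)  # lines -> words
    pairs = [(word, 1) for word in words]                   # word -> (word, 1)
    return reduce_by_key(lambda a, b: a + b, pairs)         # sum the 1s per word


lines = ["spark is fast", "spark is simple"]
counts = word_count(lines)
print(counts)
```

In Spark the same three steps run in parallel across the partitions of an RDD, which is why the one-liner scales to data that does not fit on one machine.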

Architecture-wise: the RDD (Resilient Distributed Dataset) is the foundational data structure of Spark. It can load data from HDFS into memory and keep intermediate results there, instead of writing them to disk at the end of every operation as Map Reduce does.
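
The benefit of keeping data in memory can be illustrated with a toy model. The sketch below is plain Python, not Spark: ToyRDD and load_from_storage are hypothetical names, and the class only imitates two ideas that real RDDs have, namely lazy transformations and an optional cache(). It counts how often the "expensive" source is read.

```python
class ToyRDD:
    """A toy stand-in for an RDD: lazy, recomputed on demand unless cached."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the data
        self._cached = None       # in-memory copy, filled on first use
        self._use_cache = False

    def map(self, func):
        # Nothing runs here; we only record how to derive the new dataset.
        return ToyRDD(lambda: [func(x) for x in self.collect()])

    def cache(self):
        # Ask for the data to be kept in memory after it is first computed.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()


reads = {"count": 0}


def load_from_storage():
    # Hypothetical stand-in for an expensive read from HDFS.
    reads["count"] += 1
    return [1, 2, 3]


source = ToyRDD(load_from_storage).cache()
doubled = source.map(lambda x: x * 2).collect()
squared = source.map(lambda x: x * x).collect()
print(doubled, squared, reads["count"])
```

With the cache enabled, the two jobs share a single read of the source; without it, each job would go back to storage. This is roughly why iterative workloads tend to profit most from Spark's in-memory design.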

These two advantages make Spark fast on Hadoop and allow various applications and libraries to be built on top of it, such as Spark SQL and MLlib.

Similar to Hadoop’s ecosystem, Spark has developed its own ecosystem covering SQL, machine learning, streaming, and more to come. With that in mind, as a Spark user I can enjoy multiple applications from learning just one trick. There is little need to learn different kinds of components or languages to achieve what I want. Instead, here comes the unified world of Spark. Farewell, the tower of Hadoop.