I dropped by the library while waiting for friends the other day and picked up a book. This time it was Advanced Analytics with Spark – Patterns for Learning from Data at Scale, published by O’Reilly Media.
I must admit that in my two years in data analytics, I’ve never had to deal with cluster or distributed computing. As a relatively small company with a relatively new data team, we seldom exceed a TB of data in our analysis, and our tasks seldom go beyond a few layers of aggregation, ordering, and regression. Furthermore, with BigQuery, we have been able to leverage incredible computing scale without learning how to deal with distributed or cluster systems.
Apache Spark?
From a longer-term standpoint, it feels necessary to gain a basic understanding of cluster computing in data analytics. Apache Spark is an open-source cluster-computing framework. While not the only one in its field, it is probably one of the most popular today. Regardless of whether Apache Spark is the right solution for us, I hope to get a better idea of what cluster computing is and what can be achieved with it. I’m looking forward to seeing whether this is what it takes to bring our data analytics capabilities to the next level.
Advanced Analytics with Spark – Patterns for Learning from Data at Scale
At first glance, this book looks like it is not too code-heavy and focuses more on concepts and case studies. Come back soon for more details and a review when I’m done. That is, assuming I finish it; I often have problems finishing books.
Update – 9 July 2017
As mentioned earlier, this book is filled with case studies and full code examples on deploying analytics projects using Spark. I picked out a few chapters that were of interest to me: Introduction to Data Analysis with Scala and Spark, Anomaly Detection in Network Traffic with K-means Clustering, and Understanding Wikipedia with Latent Semantic Analysis. The book is not well suited to readers without some background in Spark or data analysis (as the title suggests). It also does not explain why you should use Spark for your analytics tasks, and it does not cover syntax.
However, if you’ve already decided to go with Spark, know a little about analytics, and want to get an idea of how these techniques can be implemented in Spark, this book provides an excellent introduction. It will give you a good sense of some of the areas where Spark can be deployed to make data analysis more efficient.
Finally, if you are already a Spark user and have completed a few projects with it, you may find the book a little too basic (despite its title). I enjoyed reading through the chapters (without going into the details), although I still haven’t quite found the answers I’m looking for: why and when we should move analytics tasks to Spark. Till then, I’ll just have to keep learning!
Any thoughts on the book or on cluster computing in data analytics? Let me know in the comments below!