PySpark $99: New Course Starting March 6

Course Overview

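Spark has become an essential power tool in the day-to-day work of Data Engineers and Data Scientists.
Where did Spark come from?
Spark was the first fast, general-purpose distributed computing paradigm to emerge from the shift to distributed computing, and it caught on quickly. Spark extends the MapReduce model with a functional programming paradigm to support more types of computation, covering a wide range of workflows that were previously implemented as specialized systems on top of Hadoop. Spark uses in-memory caching to improve performance, which makes it fast enough for interactive analysis (much like working with a cluster from a Python interpreter). Caching also speeds up iterative algorithms, which makes Spark a very good fit for data science tasks, especially machine learning.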
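To make the "functional extension of MapReduce" idea concrete, here is a minimal plain-Python sketch of a word count written as a chain of map, flat-map, and reduce-by-key steps. It does not use Spark at all: `flat_map` and `reduce_by_key` are hypothetical helpers written for this illustration, loosely mirroring what operations of the same name do on a Spark RDD, and the sample lines are made up.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the resulting iterables into one."""
    return chain.from_iterable(map(func, items))


def reduce_by_key(func, pairs):
    """Combine all values that share a key using a binary function."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Made-up input standing in for the lines of a text file.
lines = ["spark extends mapreduce", "spark caches data in memory"]

words = flat_map(str.split, lines)          # one line -> many words
pairs = map(lambda word: (word, 1), words)  # word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)

print(counts["spark"])  # 2
```

In Spark the same pipeline would run over partitions spread across a cluster; here everything runs in a single process, so this only shows the shape of the computation.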

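In this course, we first discuss how to set up Spark on a local machine or on an EC2 cluster for simple analysis. Next, we explore Spark at an introductory level, covering what Spark is and how it works (and hopefully motivating further exploration). In the last two sessions, we start interacting with Spark from the command line, then demonstrate how to write Spark applications in Python and submit them to a cluster as Spark jobs.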
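As a rough idea of what a Python Spark application submitted as a job can look like, here is a minimal sketch of a ratings histogram. It assumes the input is a MovieLens-style ratings file with whitespace-separated fields (user ID, movie ID, rating, timestamp). The helper names and the app name are illustrative choices, not taken from the course material. The Spark import sits inside `main` so the parsing helpers can be tried without Spark installed.

```python
import sys
from collections import Counter


def parse_rating(line):
    """Return the rating field from one ratings line.

    Assumes whitespace-separated fields: user ID, movie ID, rating, timestamp.
    """
    return line.split()[2]


def local_histogram(lines):
    """Plain-Python equivalent of map(parse_rating) followed by countByValue()."""
    return Counter(parse_rating(line) for line in lines)


def main(path):
    # Imported here so the helpers above work without a Spark installation.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("RatingsHistogram")
    sc = SparkContext(conf=conf)
    try:
        ratings = sc.textFile(path).map(parse_rating)
        # countByValue() brings the per-rating counts back to the driver.
        for rating, count in sorted(ratings.countByValue().items()):
            print(rating, count)
    finally:
        sc.stop()


# Only start Spark when an input path is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Submitting it might look like `spark-submit ratings_histogram.py ml-100k/u.data`, where both the script name and the data path are placeholders for whatever files you actually have.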

Start date: 3/6/2018 (every Tuesday night and some Sunday afternoons)

Best course pricing (guaranteed lowest price on the market):

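Special price $99, if any of the following applies to you:

1) You have taken any session of the Data Engineer course

2) You have taken any session of BD305 (Big Data Spark)

Special price $199, if any of the following applies to you:

1) You have taken any session of Python for Data Analysis

All other students

$330 before 2/24 (click to register)

Regular price $350 after that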

Course Outline

(Total hours: est. 24 h of class + 30–40 h of homework)

Getting Started with Spark
Getting set up – installing Python, a JDK, and Spark and its dependencies
Installing the MovieLens movie rating dataset
Run your first Spark program – the ratings histogram example
Summary

Spark Basics and Spark Examples
What is Spark?
The Resilient Distributed Dataset
What is the RDD?
Ratings histogram walk-through
Understanding the code
Looking at the ratings-counter script in Canopy
Key/value RDDs and the average friends by age example
Key/value concepts – RDDs can hold key/value pairs
The friends by age example
Running the average friends by age example
Examining the script
Running the code
Filtering RDDs and the minimum temperature by location example
What is filter?
Running the minimum temperature example and modifying it for maximums
Examining the min-temperatures script
Running the maximum temperature by location example
Counting word occurrences using flatMap
Map versus flatMap
Code sample – count the words in a book
Improving the word-count script with regular expressions
Text normalization
Examining the use of regular expressions in the word-count script
Running the code
Sorting the word count results
Step 1 – Implement countByValue
Step 2 – Sort the new RDD
Examining the script
Running the code
Find the total amount spent by customer
Introducing the problem
Strategy for solving the problem
Useful snippets of code
Check your results and sort them by the total amount spent
Check your sorted implementation and results against mine
Summary

Advanced Examples of Spark Programs
Finding the most popular movie
Examining the popular-movies script
Getting results
Using broadcast variables to display movie names instead of ID numbers
Introducing broadcast variables
Getting results
Finding the most popular superhero in a social graph
Superhero social networks
Strategy
Running the script – discover who the most popular superhero is
Mapping input data to
Adding up co-occurrence by hero ID
Flipping the
Using max
Getting results
Superhero degrees of separation – introducing the breadth-first search algorithm
Degrees of separation
Accumulators and implementing BFS in Spark
Convert the input file into structured data
Iteratively process the RDD – using a mapper and a reducer
Superhero degrees of separation – review the code and run it
Setting up an accumulator and using the convert to BFS function
Getting results
Item-based collaborative filtering in Spark, cache
How does item-based collaborative filtering work?
Making item-based collaborative filtering a Spark problem
It’s getting real – caching RDDs
Running the similar-movies script using Spark’s cluster manager
Examining the script
Getting results
Improving the quality of the similar movies example
Summary

Running Spark on a Cluster
Introducing Elastic MapReduce
Why use Elastic MapReduce?
Warning – Spark on EMR is not cheap
Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY
Partitioning
Using .partitionBy
Creating similar movies from one million ratings – part 1
Changes to the script
Creating similar movies from one million ratings – part 2
Our strategy
Setting up to run the movie-similarities-1m.py script on a cluster
Creating similar movies from one million ratings – part 3
Assessing the results
Terminating the cluster
Troubleshooting Spark on a cluster
More troubleshooting and managing dependencies
Troubleshooting
Summary

SparkSQL, DataFrames, and DataSets
Introducing SparkSQL
Using SparkSQL in Python
Differences between DataFrames and DataSets
Shell access in SparkSQL
User-defined functions
Executing SQL commands and SQL-style functions on a DataFrame
Using SQL-style functions instead of queries
Using DataFrames instead of RDDs
Summary

Other Spark Technologies and Libraries
Introducing MLlib
MLlib capabilities
Making movie recommendations
Using MLlib to produce movie recommendations
Examining the movie-recommendations-als.py script
Analyzing the ALS recommendations results
Why did we get bad results?
Using DataFrames with MLlib
Examining the spark-linear-regression.py script
Getting results
Spark Streaming and GraphX
What is Spark Streaming?
GraphX
Summary

Labs
1) Social networking: the average friends by age
2) Minimum temperature observed at weather stations since 1800
3) Counting word occurrences using flatMap
4) Using regular expressions to improve your word counting
5) Finding the most popular movies with advanced Spark programs