PySpark: all-new course from $99, starting March 6


Spark has become an indispensable power tool in the daily work of every Data Engineer and Data Scientist.


Start date: 3/6/2018 (every Tuesday night and some Sunday afternoons)


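Special price $99, if either of the following applies to you:

1) You have taken the Data Engineer course (any session)

2) You have taken the BD305 (Big Data Spark) course (any session)

Special price $199, if the following applies to you:

1) You have taken Python for Data Analysis (any session)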


$330 before 2/24 (click to register)

Then regular price $350


(Estimated total hours: class 24 h + homework 30–40 h)

Getting Started with Spark
Getting set up – installing Python, a JDK, and Spark and its dependencies
Installing the MovieLens movie rating dataset
Run your first Spark program – the ratings histogram example

Spark Basics and Spark Examples
What is Spark?
The Resilient Distributed Dataset
What is the RDD?
Ratings histogram walk-through
Understanding the code
Looking at the ratings-counter script in Canopy
Key/value RDDs and the average friends by age example
Key/value concepts – RDDs can hold key/value pairs
The friends by age example
Running the average friends by age example
Examining the script
Running the code
Filtering RDDs and the minimum temperature by location example
What is filter
Running the minimum temperature example and modifying it for maximums
Examining the min-temperatures script
Running the maximum temperature by location example
Counting word occurrences using flatMap
Map versus flatMap
Code sample – count the words in a book
Improving the word-count script with regular expressions
Text normalization
Examining the use of regular expressions in the word-count script
Running the code
Sorting the word count results
Step 1 – Implement countByValue
Step 2 – Sort the new RDD
Examining the script
Running the code
Find the total amount spent by customer
Introducing the problem
Strategy for solving the problem
Useful snippets of code
Check your results and sort them by the total amount spent
Check your sorted implementation and results against mine

Advanced Examples of Spark Programs
Finding the most popular movie
Examining the popular-movies script
Getting results
Using broadcast variables to display movie names instead of ID numbers
Introducing broadcast variables
Getting results
Finding the most popular superhero in a social graph
Superhero social networks
Running the script – discover who the most popular superhero is
Mapping input data to
Adding up co-occurrence by hero ID
Flipping the
Using max
Getting results
Superhero degrees of separation – introducing the breadth-first search algorithm
Degrees of separation
Accumulators and implementing BFS in Spark
Convert the input file into structured data
Iteratively process the RDD – using a mapper and a reducer
Superhero degrees of separation – review the code and run it
Setting up an accumulator and using the convert to BFS function
Getting results
Item-based collaborative filtering in Spark, cache
How does item-based collaborative filtering work?
Making item-based collaborative filtering a Spark problem
It’s getting real – caching RDDs
Running the similar-movies script using Spark’s cluster manager
Examining the script
Getting results
Improving the quality of the similar movies example

Running Spark on a Cluster
Introducing Elastic MapReduce
Why use Elastic MapReduce?
Warning – Spark on EMR is not cheap
Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY
Using .partitionBy
Creating similar movies from one million ratings – part 1
Changes to the script
Creating similar movies from one million ratings – part 2
Our strategy
Setting up to run the script on a cluster
Creating similar movies from one million ratings – part 3
Assessing the results
Terminating the cluster
Troubleshooting Spark on a cluster
More troubleshooting and managing dependencies

SparkSQL, DataFrames, and DataSets
Introducing SparkSQL
Using SparkSQL in Python
Differences between DataFrames and DataSets
Shell access in SparkSQL
User-defined functions
Executing SQL commands and SQL-style functions on a DataFrame
Using SQL-style functions instead of queries
Using DataFrames instead of RDDs

Other Spark Technologies and Libraries
Introducing MLlib
MLlib capabilities
Making movie recommendations
Using MLlib to produce movie recommendations
Examining the script
Analyzing the ALS recommendations results
Why did we get bad results?
Using DataFrames with MLlib
Examining the script
Getting results
Spark Streaming and GraphX
What is Spark Streaming?
1) Social networking: the average friends by age
2) Minimum temperature observed at weather stations since 1800
3) Counting word occurrences using flatMap
4) Using regular expressions to improve your word counting
5) Finding the most popular movies with advanced Spark programs