BD305: Spark and Hadoop for Data Engineer | Developer

Total: 12 sessions

Programming Language Preparation (Python, Scala) – 3 Sessions

Reading material is provided for both Python and Scala. Each student chooses one language, completes the weekly exercises in it, and submits the code on GitHub. The instructor reviews the submitted code.

Spark Core Course (8 Sessions)

Part 1: Basic Knowledge

(Targeting Apache Spark Certification)

Spark Core API and Best Practices

Shared Variables and Performance

Spark SQL

Spark MLlib

Spark GraphX

Spark Streaming

Deployment and Infrastructure

Part 2: Building Spark Applications (Data Engineer)

Introduction to the Spark APIs

PySpark: Loading and Importing Data

PySpark: Parsing and Transforming Data

PySpark: Analyzing Flight Delays

Spark SQL (Adding Structure to Your Data)

Spark SQL Integration into Existing Workflows

Part 3: Spark Applications (Data Science)

How Spark Fits into the Data Science Process

Data Quality Checks

How to Work with Text: NLP

Tokenization and Vectorization with Spark

Unsupervised Learning with Spark and Implementing K-Means

Hadoop Projects/Labs

  1. Sqoop: from MySQL table to HDFS files
  2. Hive Tables
     2.1 External tables from HDFS
     2.2 Internal tables from HDFS
  3. Impala Shell

Hadoop Ecosystem and Popular Open Source Projects – 1 Session

Hadoop Architecture and HDFS

Modeling and Managing Data with Impala and Hive

Big Data Streaming with Kafka

Big Data ETL with Cascading