
Hadoop with PySpark

Create real-time stream processing applications using Hadoop with PySpark. This online course is taught live by instructors who take you through every step, interacting with you and answering your questions; every doubt is clarified, making even tough topics easy to learn.

Live Course

Live Class: Friday, 08 Mar

Duration: 40 Hours

Enrolled: 0

Offered by: infyni

Price: $300 ($375, 20% off)

About Course

Hadoop has clearly democratized the way Big Data is used. Ever since its birth in 2006 as a framework that provided a way to store and process Big Data, it has grown into a huge and popular ecosystem with a variety of tools that help in developing applications that can ingest data from a variety of sources, process it in multiple ways, and persist the output in various locations. This hands-on training course delivers the key concepts and expertise developers need to use Apache Hadoop, Apache Spark, and their ecosystem tools to develop high-performance parallel applications.

After taking this course, participants will be prepared to face real-world challenges and build applications that enable faster and better decisions across a wide variety of use cases, architectures, and industries.

You will learn to code in Spark Scala and PySpark like a real-world developer, and understand coding best practices, logging, error handling, and configuration management using both Scala and Python; a minimal PySpark sketch of those practices follows below.
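
As a small taste of those practices, here is a minimal PySpark application skeleton showing configuration kept in one place, driver-side logging, and clean shutdown. The app name and input path are hypothetical; treat this as a sketch, not the course's reference solution.

    import logging
    import sys

    from pyspark.sql import SparkSession

    # Driver-side logging via Python's logging module (executors log
    # through log4j; this covers the driver script itself).
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("example-app")  # hypothetical app name

    def main(input_path):
        # Keep configuration in one place instead of scattering it around.
        spark = (SparkSession.builder
                 .appName("example-app")
                 .config("spark.sql.shuffle.partitions", "8")
                 .getOrCreate())
        try:
            df = spark.read.option("header", "true").csv(input_path)
            log.info("Loaded %d rows from %s", df.count(), input_path)
        except Exception:
            log.exception("Job failed")   # log the traceback, then re-raise
            raise
        finally:
            spark.stop()                  # always release cluster resources

    if __name__ == "__main__":
        main(sys.argv[1])  # e.g. spark-submit app.py /path/to/input.csv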

Skills You Will Gain

Spark, Scala, Sqoop, Pig, Apache Flume, Hive, HCatalog, Avro, Scala REPL, SBT/Eclipse, Apache Kafka, Spark Streaming, Impala

Course Offerings

  • Instructor-led interactive classes
  • Clarify your doubts during class
  • Access recordings of the class
  • Attend on mobile or tablet
  • Live projects to practice
  • Case studies to learn from
  • Lifetime mentorship support
  • Industry-specific curriculum
  • Certificate of completion
  • Employability opportunities

Topics

Introduction to Big Data and Hadoop

  • What is Big Data and Why is the world excited?
  • 3V / 5V / Wikipedia definition
  • What is Hadoop and Why is it so popular?
  • Introduction to the Hadoop Eco System
  • Installing Virtual Box
  • Importing the VM Image
  • Discussion on the components in the VM

HDFS

  • Why another file system?
  • Design goals/assumptions of HDFS
  • Discussion on NameNode and Data Node
  • Secondary NameNode – Back up or not?
  • Failure scenarios: what happens when the NameNode or a DataNode fails

MapReduce

  • Discussion on the different stages involved in a MapReduce program
  • Mapper, Combiner, Partitioner, and Reducer (a word-count sketch in Python follows this list)
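
As a flavour of those stages, Hadoop Streaming lets the mapper and reducer be written as plain Python scripts that read stdin and write tab-separated key/value pairs to stdout. The classic word count below is a minimal sketch, not the course's exact exercise; it would be run through the hadoop-streaming jar with these files passed as the mapper and reducer.

    #!/usr/bin/env python3
    # mapper.py -- emit one "word<TAB>1" pair per word of input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    #!/usr/bin/env python3
    # reducer.py -- the framework sorts mapper output by key, so equal words
    # arrive on consecutive lines; sum the counts for each run of keys.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))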

YARN

  • Problems with Hadoop Version 1.0
  • Why YARN?
  • HA, Scaling, Multitenancy and Performance with YARN
  • Scheduling with YARN

Hive

  • Compare and contrast Hive with RDBMS
  • Interact with Hive via both the terminal and Hue
  • Learn Hive QL and Execute queries
  • Deep Dive Into Metastore DB
  • Creating Hive Tables
  • Managed vs External Tables
  • Joins in Hive
  • Multiple Inserts, CTAS Statement
  • Partitioning, Bucketing
  • Complex Data Types: Struct, Map, Array
  • Working with different file formats: Avro, ORC, Parquet, CSV, XML, JSON
  • Write UDFs and use them in Hive (a PySpark analogue is sketched after this list)
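
Because the course also drives Hive from Spark, here is a hedged sketch of a Hive-enabled PySpark session: an external table over existing files, a partitioned CTAS, and a Python UDF registered for use in SQL (the PySpark analogue of a Hive UDF, which would normally be written in Java). Table, column, and path names are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
             .appName("hive-example")
             .enableHiveSupport()   # route table metadata through the Hive metastore
             .getOrCreate())

    # External table: only metadata is registered; dropping it keeps the files.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_sales
        (product STRING, amount DOUBLE, sale_date DATE)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/raw_sales'
    """)

    # CTAS into a managed, partitioned table.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_by_year
        USING parquet
        PARTITIONED BY (yr)
        AS SELECT product, amount, year(sale_date) AS yr FROM raw_sales
    """)

    # Python UDF registered for use in SQL queries.
    spark.udf.register("shout", lambda s: s.upper() if s else None, StringType())
    spark.sql("SELECT product, shout(product) AS product_uc FROM sales_by_year").show()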

Sqoop

  • Sqoop Architecture
  • Importing Tables using Sqoop
  • Into HDFS, Local File System, Hive
  • Into various formats
  • Using a “Direct Query”, Selecting specific tables, etc
  • With / Without passwords via the commands
  • Incremental Append
  • Exporting Tables using Sqoop
  • Into MySQL
  • Working with Sqoop Jobs (a PySpark JDBC analogue of import/export follows this list)
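
Sqoop itself is a command-line tool, but the same import/export round trip can be sketched in PySpark with the JDBC data source; this is an analogue for comparison, not Sqoop's own interface, and the connection details below are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

    # Placeholder connection details -- substitute real host, database, creds.
    url = "jdbc:mysql://localhost:3306/shop"
    props = {"user": "etl", "password": "secret",
             "driver": "com.mysql.cj.jdbc.Driver"}

    # "Import": read a MySQL table into a DataFrame and persist it to HDFS.
    orders = spark.read.jdbc(url, "orders", properties=props)
    orders.write.mode("overwrite").parquet("hdfs:///warehouse/orders")

    # "Export": push a processed DataFrame back into MySQL.
    summary = orders.groupBy("customer_id").count()
    summary.write.jdbc(url, "order_counts", mode="overwrite", properties=props)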

Spark

  • Need for Spark
  • Introduction to Spark Libraries (Spark SQL, Streaming and ML)
  • Spark’s place in the Hadoop-Spark ecosystem
  • Spark Architecture – An Introduction
  • Spark Cluster Managers – Standalone vs YARN
  • Connecting to Spark using Spark Shell – Local vs YARN
  • Local: spark-submit with local[*] / local / local[2]
  • YARN: client vs cluster mode, executor/driver memory
  • Dissecting a Spark Job by understanding terminology: Application, Driver, Executor, Stages and Tasks
  • Spark UI – An overview
  • Actions and Transformations – What are they?
  • Maps, filters, reduce functions
  • Repartition, Coalesce and Cache/Persist - concepts (illustrated in the RDD sketch after this list)
  • Working with Spark interactively using the Spark Shell
  • Working with Spark and Scala using Eclipse IDE
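
To make the transformation/action distinction concrete, here is a minimal RDD sketch in local mode: transformations only build a lineage, the first action triggers execution, and cache() lets the second action reuse the computed partitions. The data is invented.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")     # local mode with two cores
             .appName("rdd-example")
             .getOrCreate())
    sc = spark.sparkContext

    nums = sc.parallelize(range(1, 1000001), numSlices=8)

    # Transformations are lazy: this only records the lineage.
    even_squares = nums.filter(lambda n: n % 2 == 0).map(lambda n: n * n).cache()

    # Actions trigger execution; the second reuses the cached partitions.
    print(even_squares.count())                      # 500000
    print(even_squares.reduce(lambda a, b: a + b))   # sum of the squares

    # coalesce() narrows partitions without a shuffle; repartition() shuffles.
    print(even_squares.coalesce(2).getNumPartitions())   # 2

    spark.stop()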

Spark DataFrames

  • Working with DataFrames
  • DataFrames, SQL and Datasets
  • Manual vs Inferred Schemas
  • Executing queries using the DataFrame API and Spark SQL (see the sketch after this list)
  • Working with different file formats: ORC, Parquet, Avro, CSV
  • Working with data from different sources: MySQL, Hive, Local FS, HDFS
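
A short sketch of the inferred-versus-manual schema choice, and of running the same query through both the DataFrame API and Spark SQL; people.csv and its columns are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("df-example").getOrCreate()

    # Inferred schema: convenient, but costs an extra pass and can guess wrong.
    inferred = (spark.read.option("header", "true")
                .option("inferSchema", "true").csv("people.csv"))
    inferred.printSchema()

    # Manual schema: explicit types, no inference pass.
    schema = StructType([StructField("name", StringType()),
                         StructField("age", IntegerType())])
    people = spark.read.option("header", "true").schema(schema).csv("people.csv")

    # The same query via the DataFrame API and via Spark SQL.
    people.where(people.age > 30).select("name").show()
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    # Persist the same DataFrame in columnar formats.
    people.write.mode("overwrite").parquet("people_parquet")
    people.write.mode("overwrite").orc("people_orc")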

Spark Streaming

  • Introduction to Spark Streaming
  • Need for Streaming (use cases)
  • Streaming Architecture
  • Overview of DStreams (Spark 1.x)
  • Structured Streaming (Spark 2.x) – advantages
  • Using Spark Built-In sources: The Socket Source
  • The file source
  • Output modes (Complete, Update, Append)
  • Controlling processing times
  • Streaming and DataFrames
  • Creating streaming DataFrames
  • Transforming DataFrames
  • Executing Streaming Queries (a socket-source sketch follows this list)
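
Pulling the streaming topics together, here is a minimal Structured Streaming word count over the built-in socket source, using the Update output mode and a processing-time trigger to control batch cadence; localhost:9999 is the usual test setup, fed by something like nc -lk 9999.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("stream-example").getOrCreate()

    # A streaming DataFrame over the built-in socket source (testing only).
    lines = (spark.readStream.format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Ordinary DataFrame transformations also apply to streaming DataFrames.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Update mode emits only rows that changed in each batch; the trigger
    # controls how often micro-batches are processed.
    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .trigger(processingTime="10 seconds")
             .start())
    query.awaitTermination()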