
Hadoop with PySpark

Create real-time stream processing applications using Hadoop with PySpark. This online course is taught live by instructors who take you through every step, interacting with you and answering your questions; every doubt is clarified, making even tough topics easy to learn.

Live Course

Live Class: Friday, 08 Mar

Duration: 40 Hours

Enrolled: 0

Offered by: infyni

Price: $300 ($375, 20% off)

About Course

Hadoop has clearly democratized the way Big Data is used. Ever since its birth in 2006 as a framework that provided a way to store and process Big Data, it has grown into a huge and popular ecosystem with a variety of tools that help in developing applications that can ingest data from a variety of sources, process it in multiple ways, and persist the output in various locations. This hands-on training course delivers the key concepts and expertise developers need to use Apache Hadoop, Apache Spark, and their ecosystem tools to develop high-performance parallel applications.

After taking this course, participants will be prepared to face real-world challenges and build applications that enable faster and better decisions across a wide variety of use cases, architectures, and industries.

You will learn to code in Spark Scala and PySpark like a real-world developer, and understand coding best practices, logging, error handling, and configuration management using both Scala and Python; a minimal PySpark sketch of those practices follows below.
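
As a small taste of those practices, here is a minimal PySpark application skeleton showing configuration kept in one place, driver-side logging, and clean shutdown. The app name and input path are hypothetical; treat this as a sketch, not the course's reference solution.

    import logging
    import sys

    from pyspark.sql import SparkSession

    # Driver-side logging via Python's logging module (executors log
    # through log4j; this covers the driver script itself).
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("example-app")  # hypothetical app name

    def main(input_path):
        # Keep configuration in one place instead of scattering it around.
        spark = (SparkSession.builder
                 .appName("example-app")
                 .config("spark.sql.shuffle.partitions", "8")
                 .getOrCreate())
        try:
            df = spark.read.option("header", "true").csv(input_path)
            log.info("Loaded %d rows from %s", df.count(), input_path)
        except Exception:
            log.exception("Job failed")   # log the traceback, then re-raise
            raise
        finally:
            spark.stop()                  # always release cluster resources

    if __name__ == "__main__":
        main(sys.argv[1])  # e.g. spark-submit app.py /path/to/input.csv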

Skills You Will Gain

Spark, Scala, Sqoop, Pig, Apache Flume, Hive, HCatalog, Avro, Scala REPL, SBT/Eclipse, Apache Kafka, Spark Streaming, Impala

Course Offerings

  • Instructor-led interactive classes
  • Clarify your doubts during class
  • Access recordings of the class
  • Attend on mobile or tablet
  • Live projects to practice
  • Case studies to learn from
  • Lifetime mentorship support
  • Industry-specific curriculum
  • Certificate of completion
  • Employability opportunities

Topics

Introduction to Big Data and Hadoop

  • What is Big Data and Why is the world excited?
  • 3V / 5V / Wikipedia definition
  • What is Hadoop and Why is it so popular?
  • Introduction to the Hadoop Eco System
  • Installing Virtual Box
  • Importing the VM Image
  • Discussion on the components in the VM

HDFS

  • Why another file system?
  • Design goals/assumptions of HDFS
  • Discussion on NameNode and Data Node
  • Secondary NameNode – Back up or not?
  • Failure scenarios: what happens when the NameNode or a DataNode fails

MapReduce

  • Discussion on the different stages involved in a MapReduce program
  • Mapper, Combiner, Partitioner, and Reducer (a word-count sketch in Python follows this list)
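
As a flavour of those stages, Hadoop Streaming lets the mapper and reducer be written as plain Python scripts that read stdin and write tab-separated key/value pairs to stdout. The classic word count below is a minimal sketch, not the course's exact exercise; it would be run through the hadoop-streaming jar with these files passed as the mapper and reducer.

    #!/usr/bin/env python3
    # mapper.py -- emit one "word<TAB>1" pair per word of input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    #!/usr/bin/env python3
    # reducer.py -- the framework sorts mapper output by key, so equal words
    # arrive on consecutive lines; sum the counts for each run of keys.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))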

YARN

  • Problems with Hadoop Version 1.0
  • Why YARN?
  • HA, Scaling, Multitenancy and Performance with YARN
  • Scheduling with YARN

Hive

  • Compare and contrast Hive with RDBMS
  • Interact with Hive via both the terminal and Hue
  • Learn Hive QL and Execute queries
  • Deep Dive Into Metastore DB
  • Creating Hive Tables
  • Managed vs External Tables
  • Joins in Hive
  • Multiple Inserts, CTAS Statement
  • Partitioning, Bucketing
  • Complex Data Types: Struct, Map, Array
  • Working with different file formats: Avro, ORC, Parquet, CSV, XML, JSON
  • Write UDFs and use them in Hive (a PySpark analogue is sketched after this list)
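
Because the course also drives Hive from Spark, here is a hedged sketch of a Hive-enabled PySpark session: an external table over existing files, a partitioned CTAS, and a Python UDF registered for use in SQL (the PySpark analogue of a Hive UDF, which would normally be written in Java). Table, column, and path names are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
             .appName("hive-example")
             .enableHiveSupport()   # route table metadata through the Hive metastore
             .getOrCreate())

    # External table: only metadata is registered; dropping it keeps the files.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_sales
        (product STRING, amount DOUBLE, sale_date DATE)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/raw_sales'
    """)

    # CTAS into a managed, partitioned table.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_by_year
        USING parquet
        PARTITIONED BY (yr)
        AS SELECT product, amount, year(sale_date) AS yr FROM raw_sales
    """)

    # Python UDF registered for use in SQL queries.
    spark.udf.register("shout", lambda s: s.upper() if s else None, StringType())
    spark.sql("SELECT product, shout(product) AS product_uc FROM sales_by_year").show()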

Sqoop

  • Sqoop Architecture
  • Importing Tables using Sqoop
  • Into HDFS, Local File System, Hive
  • Into various formats
  • Using a “Direct Query”, Selecting specific tables, etc
  • With / Without passwords via the commands
  • Incremental Append
  • Exporting Tables using Sqoop
  • Into MySQL
  • Working with Sqoop Jobs (a PySpark JDBC analogue of import/export follows this list)
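
Sqoop itself is a command-line tool, but the same import/export round trip can be sketched in PySpark with the JDBC data source; this is an analogue for comparison, not Sqoop's own interface, and the connection details below are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

    # Placeholder connection details -- substitute real host, database, creds.
    url = "jdbc:mysql://localhost:3306/shop"
    props = {"user": "etl", "password": "secret",
             "driver": "com.mysql.cj.jdbc.Driver"}

    # "Import": read a MySQL table into a DataFrame and persist it to HDFS.
    orders = spark.read.jdbc(url, "orders", properties=props)
    orders.write.mode("overwrite").parquet("hdfs:///warehouse/orders")

    # "Export": push a processed DataFrame back into MySQL.
    summary = orders.groupBy("customer_id").count()
    summary.write.jdbc(url, "order_counts", mode="overwrite", properties=props)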

Spark

  • Need for Spark
  • Introduction to Spark Libraries (Spark SQL, Streaming and ML)
  • Spark’s place in the Hadoop-Spark ecosystem
  • Spark Architecture – An Introduction
  • Spark Cluster Managers – Standalone vs YARN
  • Connecting to Spark using Spark Shell – Local vs YARN
  • Local: spark-submit with local[*] / local / local[2]
  • YARN: client vs cluster mode, executor/driver memory
  • Dissecting a Spark Job by understanding terminology: Application, Driver, Executor, Stages and Tasks
  • Spark UI – An overview
  • Actions and Transformations – What are they?
  • Maps, filters, reduce functions
  • Repartition, Coalesce and Cache/Persist - concepts (illustrated in the RDD sketch after this list)
  • Working with Spark interactively using the Spark Shell
  • Working with Spark and Scala using Eclipse IDE
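
To make the transformation/action distinction concrete, here is a minimal RDD sketch in local mode: transformations only build a lineage, the first action triggers execution, and cache() lets the second action reuse the computed partitions. The data is invented.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")     # local mode with two cores
             .appName("rdd-example")
             .getOrCreate())
    sc = spark.sparkContext

    nums = sc.parallelize(range(1, 1000001), numSlices=8)

    # Transformations are lazy: this only records the lineage.
    even_squares = nums.filter(lambda n: n % 2 == 0).map(lambda n: n * n).cache()

    # Actions trigger execution; the second reuses the cached partitions.
    print(even_squares.count())                      # 500000
    print(even_squares.reduce(lambda a, b: a + b))   # sum of the squares

    # coalesce() narrows partitions without a shuffle; repartition() shuffles.
    print(even_squares.coalesce(2).getNumPartitions())   # 2

    spark.stop()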

Spark DataFrames

  • Working with DataFrames
  • DataFrames, SQL and Datasets
  • Manual vs Inferred Schemas
  • Executing queries using the DataFrame API and Spark SQL (see the sketch after this list)
  • Working with different file formats: ORC, Parquet, Avro, CSV
  • Working with data from different sources: MySQL, Hive, Local FS, HDFS
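
A short sketch of the inferred-versus-manual schema choice, and of running the same query through both the DataFrame API and Spark SQL; people.csv and its columns are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("df-example").getOrCreate()

    # Inferred schema: convenient, but costs an extra pass and can guess wrong.
    inferred = (spark.read.option("header", "true")
                .option("inferSchema", "true").csv("people.csv"))
    inferred.printSchema()

    # Manual schema: explicit types, no inference pass.
    schema = StructType([StructField("name", StringType()),
                         StructField("age", IntegerType())])
    people = spark.read.option("header", "true").schema(schema).csv("people.csv")

    # The same query via the DataFrame API and via Spark SQL.
    people.where(people.age > 30).select("name").show()
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    # Persist the same DataFrame in columnar formats.
    people.write.mode("overwrite").parquet("people_parquet")
    people.write.mode("overwrite").orc("people_orc")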

Spark Streaming

  • Introduction to Spark Streaming
  • Need for Streaming (use cases)
  • Streaming Architecture
  • Overview of DStreams (Spark 1.x)
  • Structured Streaming (Spark 2.x) – advantages
  • Using Spark Built-In sources: The Socket Source
  • The file source
  • Output modes (Complete, Update, Append)
  • Controlling processing times
  • Streaming and DataFrames
  • Creating streaming DataFrames
  • Transforming DataFrames
  • Executing Streaming Queries (a socket-source sketch follows this list)
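
Pulling the streaming topics together, here is a minimal Structured Streaming word count over the built-in socket source, using the Update output mode and a processing-time trigger to control batch cadence; localhost:9999 is the usual test setup, fed by something like nc -lk 9999.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("stream-example").getOrCreate()

    # A streaming DataFrame over the built-in socket source (testing only).
    lines = (spark.readStream.format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Ordinary DataFrame transformations also apply to streaming DataFrames.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Update mode emits only rows that changed in each batch; the trigger
    # controls how often micro-batches are processed.
    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .trigger(processingTime="10 seconds")
             .start())
    query.awaitTermination()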