Apache Spark Training: An Introduction

Workshop for developers

Official training

Duration: 2 days
Price (excl. VAT): 1600 €
APACHE-SPARK-02

Upcoming sessions

May 15, 2017
Paris

Description

This two-day course, created by Dean Wampler, Ph.D., is designed to teach developers how to implement data-processing pipelines and analytics using Apache Spark. Developers will use hands-on exercises to learn the Spark Core, SQL/DataFrame, Streaming, and MLlib (machine learning) APIs. Developers will also learn about Spark internals and tips for improving application performance. Additional coverage includes integration with Mesos, Hadoop, and Reactive frameworks such as Akka.

Objectives

  • Understand how to use the Spark Scala APIs to implement various data analytics algorithms for offline (batch-mode) and event-streaming applications
  • Understand Spark internals
  • Understand Spark performance considerations
  • Understand how to test and deploy Spark applications
  • Understand the basics of integrating Spark with Mesos, Hadoop, and Akka

Prerequisites:

  • Experience with Scala, such as completion of the Fast Track to Scala course
  • Experience with SQL, machine learning, and other Big Data tools will be helpful, but not required.

Audience:

  • Developers wishing to learn how to write data-centric applications using Spark.

Program

Introduction: Why Spark

  • How Spark improves on Hadoop MapReduce
  • The core abstractions in Spark
  • What happens during a Spark job?
  • The Spark ecosystem
  • Deployment options
  • References for more information

Spark's Core API

  • Resilient Distributed Datasets (RDDs) and how they implement your job
  • Using the Spark Shell (interpreter) vs submitting Spark batch jobs
  • Using the Spark web console
  • Reading and writing data files
  • Working with structured and unstructured data
  • Building data transformation pipelines (see the sketch after this list)
  • Spark under the hood: caching, checkpointing, partitioning, shuffling, etc.
  • Mastering the RDD API
  • Broadcast variables, accumulators
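
As a rough illustration of the RDD pipeline topics above, here is a minimal word-count sketch in Scala; the input path, application name, and local master setting are placeholders chosen for the example, not part of the course material.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local SparkContext for experimentation; a cluster master URL would be used in production
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Read a text file into an RDD (the path is a placeholder)
        val lines = sc.textFile("data/sample.txt")

        // Transformation pipeline: split lines into words, map to pairs, reduce by key
        val counts = lines
          .flatMap(_.split("""\W+"""))
          .filter(_.nonEmpty)
          .map(word => (word.toLowerCase, 1))
          .reduceByKey(_ + _)
          .cache()  // keep the result in memory for reuse across actions

        counts.take(10).foreach(println)  // an action triggers the actual computation
        sc.stop()
      }
    }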

Spark SQL and DataFrames

  • Working with the DataFrame API for structured data (see the sketch after this list)
  • Working with SQL
  • Performance optimizations
  • Support for JSON and Parquet formats
  • Integration with Hadoop Hive
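
A brief sketch of the DataFrame and SQL topics above, assuming the Spark 2.x SparkSession entry point; the JSON input path and the column names (service, status) are invented for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DataFrameSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DataFrameSketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Read JSON into a DataFrame; Spark infers the schema (path is a placeholder)
        val events = spark.read.json("data/events.json")

        // DataFrame API: filter, group, and aggregate structured data
        events.filter($"status" === "error")
          .groupBy($"service")
          .agg(count("*").as("errors"))
          .show()

        // The same query expressed in SQL against a temporary view
        events.createOrReplaceTempView("events")
        spark.sql(
          "SELECT service, COUNT(*) AS errors FROM events WHERE status = 'error' GROUP BY service"
        ).show()

        // Write the data back out in the columnar Parquet format
        events.write.mode("overwrite").parquet("data/events.parquet")

        spark.stop()
      }
    }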

Processing events with Spark Streaming:

  • Working with time slices, “mini-batches”, of events
  • Working with moving windows of mini-batches (see the sketch after this list)
  • Reuse of code in batch mode and streaming: the Lambda Architecture
  • Working with different streaming sources: sockets, file systems, Kafka, etc.
  • Resiliency and fault tolerance considerations
  • Stateful transformations (e.g., running statistics)
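
A minimal Spark Streaming (DStream) sketch of the mini-batch and windowing ideas above; the socket source on localhost:9999, the batch interval, the window sizes, and the checkpoint path are arbitrary choices for illustration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        // At least two local threads: one for the receiver, one for processing
        val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")

        // Each mini-batch covers a 5-second time slice of the stream
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("checkpoints")  // checkpoint directory (placeholder), required once stateful transformations are used

        // A text stream from a socket; Kafka or file-system sources follow the same pattern
        val lines = ssc.socketTextStream("localhost", 9999)

        // Word counts over a 30-second window, recomputed every 10 seconds
        val windowedCounts = lines
          .flatMap(_.split("""\W+"""))
          .filter(_.nonEmpty)
          .map(word => (word, 1))
          .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

        windowedCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }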

Other Spark-based Libraries:

  • MLlib for machine learning (see the sketch after this list)
  • Discussion of GraphX for graph algorithms, Tachyon for distributed caching, and BlinkDB for approximate queries
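
The machine-learning material could be previewed with a K-means sketch like the one below, using the DataFrame-based spark.ml API; the tiny in-memory dataset and the choice of k = 2 are made up for the example.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    object KMeansSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("KMeansSketch")
          .master("local[*]")
          .getOrCreate()

        // A tiny, made-up dataset of 2-dimensional feature vectors
        val data = spark.createDataFrame(Seq(
          Tuple1(Vectors.dense(0.0, 0.0)),
          Tuple1(Vectors.dense(0.1, 0.2)),
          Tuple1(Vectors.dense(9.0, 9.0)),
          Tuple1(Vectors.dense(9.2, 8.9))
        )).toDF("features")

        // Fit a K-means model with two clusters and print the learned centers
        val model = new KMeans().setK(2).setSeed(1L).fit(data)
        model.clusterCenters.foreach(println)

        spark.stop()
      }
    }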

Deploying to clusters:

  • Spark’s clustering abstractions: cluster vs. client deployments, coarse-grained and fine-grained process management (see the configuration sketch after this list)
  • Standalone mode
  • Mesos
  • Hadoop YARN
  • EC2
  • Cassandra rings
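
As a sketch of how the cluster-manager choice surfaces in application configuration, the snippet below builds SparkConf instances with the standard master URL schemes; the host names and ports are placeholders, and in practice the master is usually passed to spark-submit rather than hard-coded.

    import org.apache.spark.SparkConf

    object MasterUrlSketch {
      def main(args: Array[String]): Unit = {
        // The same application code runs against different cluster managers;
        // only the master URL changes (host names and ports below are placeholders).
        val confs = Seq(
          new SparkConf().setAppName("MyApp").setMaster("local[*]"),                 // local threads, for development
          new SparkConf().setAppName("MyApp").setMaster("spark://master-host:7077"), // Spark standalone cluster
          new SparkConf().setAppName("MyApp").setMaster("mesos://mesos-host:5050"),  // Apache Mesos
          new SparkConf().setAppName("MyApp").setMaster("yarn")                      // Hadoop YARN (Spark 2.x form)
        )
        confs.foreach(conf => println(conf.get("spark.master")))
      }
    }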

Using Spark with the Lightbend Reactive Platform:

  • Akka Streams and Spark Streaming

Conclusions