Big Data & Data Science Training Courses


Training Course: Lightbend Apache Spark for Scala - Professional

Workshop for Developers

Duration: 2 days • Price (excl. VAT): €1,600

Official • Certifying

Program

Introduction: Why Spark

  • How Spark improves on Hadoop MapReduce
  • The core abstractions in Spark
  • What happens during a Spark job?
  • The Spark ecosystem
  • Deployment options
  • References for more information

Spark's Core API

  • Resilient Distributed Datasets (RDDs) and how they implement your job
  • Using the Spark Shell (interpreter) vs. submitting Spark batch jobs
  • Using the Spark web console
  • Reading and writing data files
  • Working with structured and unstructured data
  • Building data transformation pipelines (see the sketch after this list)
  • Spark under the hood: caching, checkpointing, partitioning, shuffling, etc.
  • Mastering the RDD API
  • Broadcast variables and accumulators
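
A minimal sketch of such a transformation pipeline, assuming a local master and illustrative input/output paths (the WordCount name and the paths are not part of the course material):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // A small RDD pipeline: read, split, count, persist, write.
        val counts = sc.textFile("data/input.txt")   // illustrative path; RDD[String], one element per line
          .flatMap(_.split("""\W+"""))               // split lines into words
          .filter(_.nonEmpty)
          .map(word => (word.toLowerCase, 1))        // pair RDD for aggregation
          .reduceByKey(_ + _)                        // shuffle: combine counts per word
          .cache()                                   // keep the result in memory for reuse

        counts.saveAsTextFile("data/output")         // illustrative output path
        sc.stop()
      }
    }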

Spark SQL and DataFrames

  • Working with the DataFrame API for structured data
  • Working with SQL
  • Performance optimizations
  • Support for JSON and Parquet formats (see the sketch after this list)
  • Integration with Hadoop Hive
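
A minimal sketch of the DataFrame and SQL APIs, assuming a Spark 1.x-era SQLContext, an existing SparkContext sc, and an illustrative JSON file of people records:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

    // Infer a schema from JSON and query it through the DataFrame API.
    val people = sqlContext.read.json("data/people.json")   // illustrative path
    people.filter(people("age") > 21)
          .groupBy("city")
          .count()
          .show()

    // The same data can be queried with SQL...
    people.registerTempTable("people")
    sqlContext.sql("SELECT city, COUNT(*) FROM people GROUP BY city").show()

    // ...and written back out as Parquet.
    people.write.parquet("data/people.parquet")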

Processing events with Spark Streaming

  • Working with time slices (“mini-batches”) of events
  • Working with moving windows of mini-batches (see the sketch after this list)
  • Reusing code between batch mode and streaming: the Lambda Architecture
  • Working with different streaming sources: sockets, file systems, Kafka, etc.
  • Resiliency and fault-tolerance considerations
  • Stateful transformations (e.g., running statistics)
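
A minimal windowing sketch, assuming an existing SparkContext sc, a socket source on an illustrative host and port, and 10-second mini-batches:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))   // 10-second mini-batches
    ssc.checkpoint("data/checkpoints")                // illustrative path; enables stateful recovery

    val lines = ssc.socketTextStream("localhost", 9999)   // illustrative host/port

    // Count words over a 60-second moving window, sliding every 10 seconds.
    val windowedCounts = lines
      .flatMap(_.split("""\W+"""))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()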

Other Spark-based Libraries

  • MLlib for machine learning (see the sketch after this list)
  • Discussion of GraphX for graph algorithms, Tachyon for distributed caching, and BlinkDB for approximate queries
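
A minimal MLlib sketch using the RDD-based clustering API, assuming an existing SparkContext sc and an illustrative file of whitespace-separated numeric features:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.textFile("data/points.txt")   // illustrative path
      .map(line => Vectors.dense(line.trim.split("""\s+""").map(_.toDouble)))
      .cache()

    // Cluster the points into 3 groups with at most 20 iterations.
    val model = KMeans.train(points, k = 3, maxIterations = 20)
    model.clusterCenters.foreach(println)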

Deploying to clusters

  • Spark’s clustering abstractions: cluster vs. client deployments, coarse-grained and fine-grained process management (master URLs for these modes are sketched after this list)
  • Standalone mode
  • Mesos
  • Hadoop YARN
  • EC2
  • Cassandra rings
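
A minimal sketch of how the deployment target is selected: the application code stays the same and only the master URL changes (host names and ports are illustrative, with Spark 1.x-era values shown):

    import org.apache.spark.SparkConf

    // The same application runs on every cluster manager;
    // only the master URL differs.
    val conf = new SparkConf()
      .setAppName("MyApp")                     // illustrative application name
      .setMaster("spark://master-host:7077")   // standalone cluster
    // Alternatives:
    //   "mesos://master-host:5050"  -- Mesos (coarse- or fine-grained mode)
    //   "yarn-client"               -- Hadoop YARN, client deployment
    //   "local[*]"                  -- local development with all cores

In practice the master is usually supplied via spark-submit's --master flag rather than hard-coded in the application.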

Using Spark with the Lightbend Reactive Platform

  • Akka Streams and Spark Streaming (see the sketch below)
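
One integration point from the Spark 1.x era is the actor-receiver API, sketched below: an Akka actor (for example, one sitting at the end of an Akka Streams pipeline) feeds events into Spark Streaming. This sketch uses plain Akka actors rather than Akka Streams itself, and assumes an existing StreamingContext ssc:

    import akka.actor.{Actor, Props}
    import org.apache.spark.streaming.receiver.ActorHelper

    // An Akka actor that forwards whatever strings it receives
    // into Spark Streaming via the receiver machinery.
    class ForwardingActor extends Actor with ActorHelper {
      def receive = {
        case s: String => store(s)   // hand the event to Spark
      }
    }

    // Attach the actor as a streaming source (Spark 1.x actor-receiver API);
    // the "forwarder" name is illustrative.
    val events = ssc.actorStream[String](Props[ForwardingActor], "forwarder")
    events.print()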