Apache Spark Training: An Introduction
A workshop for developers
Price (excl. VAT): €1,600
This two-day course, created by Dean Wampler, Ph.D., teaches developers how to implement data processing pipelines and analytics using Apache Spark. Developers will use hands-on exercises to learn the Spark Core, SQL/DataFrame, Streaming, and MLlib (machine learning) APIs. They will also learn about Spark internals and tips for improving application performance. Additional coverage includes integration with Mesos, Hadoop, and Reactive frameworks like Akka.
Objectives:
- Understand how to use the Spark Scala APIs to implement data analytics algorithms for offline (batch-mode) and event-streaming applications
- Understand Spark internals
- Understand Spark performance considerations
- Understand how to test and deploy Spark applications
- Understand the basics of integrating Spark with Mesos, Hadoop, and Akka
Prerequisites:
- Experience with Scala, such as completion of the Fast Track to Scala course
- Experience with SQL, machine learning, and other Big Data tools is helpful but not required
Audience:
- Developers who want to learn how to write data-centric applications using Spark
Introduction: Why Spark?
- How Spark improves on Hadoop MapReduce
- The core abstractions in Spark
- What happens during a Spark job?
- The Spark ecosystem
- Deployment options
- References for more information
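To make the "core abstractions" bullet concrete, here is the canonical first Spark program in the Scala API, a word count; the app name and file paths are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local master for experimentation; a real deployment supplies the master at submit time.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    sc.textFile("data/input.txt")          // RDD[String], one element per line (hypothetical path)
      .flatMap(_.split("""\W+"""))         // split lines into words
      .map(word => (word.toLowerCase, 1))  // pair each word with a count of 1
      .reduceByKey(_ + _)                  // sum counts per word (triggers a shuffle)
      .saveAsTextFile("data/output")       // actions like this one launch the job

    sc.stop()
  }
}
```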
Spark's Core API:
- Resilient Distributed Datasets (RDD) and how they implement your job
- Using the Spark Shell (interpreter) vs. submitting Spark batch jobs
- Using the Spark web console
- Reading and writing data files
- Working with structured and unstructured data
- Building data transformation pipelines
- Spark under the hood: caching, checkpointing, partitioning, shuffling, etc.
- Mastering the RDD API
- Broadcast variables, accumulators
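A brief sketch combining several of these topics: a transformation pipeline with caching, a broadcast variable, and an accumulator. The input path and data are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("CoreApiDemo").setMaster("local[*]")
val sc   = new SparkContext(conf)

// A small lookup table shipped once to each executor instead of with every task.
val stopWords = sc.broadcast(Set("the", "a", "an"))

// An accumulator for counting bad records on the side.
val badLines = sc.accumulator(0, "bad lines")

val words = sc.textFile("data/input.txt")       // hypothetical input path
  .flatMap { line =>
    if (line.isEmpty) { badLines += 1; Nil }
    else line.split("""\W+""").toSeq
  }
  .filter(w => !stopWords.value.contains(w.toLowerCase))
  .cache()                                      // reuse the RDD across the two actions below

println(s"distinct words: ${words.distinct().count()}")
println(s"total words: ${words.count()}, bad lines: ${badLines.value}")
```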
Spark SQL and DataFrames:
- Working with the DataFrame API for structured data
- Working with SQL
- Performance optimizations
- Support for JSON and Parquet formats
- Integration with Hadoop Hive
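For a taste of the DataFrame and SQL APIs, a minimal sketch using the Spark 1.x-style SQLContext covered in this course; the file paths and column names are made up:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

// DataFrames infer a schema from self-describing formats such as JSON and Parquet.
val people = sqlContext.read.json("data/people.json")   // hypothetical file
people.printSchema()

// Query with the DataFrame API...
people.filter(people("age") > 21).select("name", "age").show()

// ...or register the DataFrame and use plain SQL.
people.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()

// Writing Parquet preserves the schema and enables columnar optimizations.
people.write.parquet("data/people.parquet")
```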
Processing events with Spark Streaming:
- Working with time slices ("mini-batches") of events
- Working with moving windows of mini-batches
- Reusing code in batch mode and streaming: the Lambda Architecture
- Working with different streaming sources: sockets, file systems, Kafka, etc.
- Resiliency and fault tolerance considerations
- Stateful transformations (e.g., running statistics)
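A sketch tying several of these ideas together: a socket source, one-second mini-batches, a sliding window, and a stateful running count. The host, port, and durations are placeholders:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))  // 1-second mini-batches; sc is an existing SparkContext
ssc.checkpoint("checkpoints")                   // required for stateful operations

val lines  = ssc.socketTextStream("localhost", 9999)   // hypothetical source
val counts = lines.flatMap(_.split("""\W+""")).map((_, 1))

// A 30-second window of mini-batches, recomputed every 10 seconds.
counts.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)).print()

// A running total across all batches, i.e., a stateful transformation.
val running = counts.updateStateByKey[Int] { (newCounts: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newCounts.sum)
}
running.print()

ssc.start()
ssc.awaitTermination()
```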
Other Spark-based Libraries:
- MLlib for machine learning
- Discussion of GraphX for graph algorithms, Tachyon for distributed caching, and BlinkDB for approximate queries
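For a flavor of MLlib, a minimal k-means clustering run over a file of numeric vectors; the path and parameters are illustrative:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Each input line holds space-separated numbers, e.g. "1.0 2.5 0.3" (hypothetical file).
val data = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, k = 3, maxIterations = 20)

println(s"cluster centers: ${model.clusterCenters.mkString(", ")}")
println(s"within-set sum of squared errors: ${model.computeCost(data)}")
```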
Deploying to clusters:
- Spark’s clustering abstractions: cluster vs. client deployments, coarse-grained and fine-grained process management
- Standalone mode
- Hadoop YARN
- Cassandra rings
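In practice, the choice of cluster manager largely reduces to the master URL configured in the application or passed to spark-submit. A sketch, with placeholder host names:

```scala
import org.apache.spark.SparkConf

// Standalone mode: point at the standalone master's URL.
val standalone = new SparkConf().setAppName("MyApp").setMaster("spark://master-host:7077")

// Local mode, useful for development and tests: one JVM, all cores.
val local = new SparkConf().setAppName("MyApp").setMaster("local[*]")

// On YARN or Mesos, the master is normally supplied externally, e.g.
//   spark-submit --master yarn-cluster ...        (YARN)
//   spark-submit --master mesos://host:5050 ...   (Mesos)
// so the application code leaves setMaster out entirely.
```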
Using Spark with the Lightbend Reactive Platform:
- Akka Streams and Spark Streaming
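One simple way to connect the two, as a hedged sketch: an Akka Streams TCP server emits events that Spark Streaming can consume through socketTextStream. The host, port, and event rate are made up:

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source, Tcp}
import akka.util.ByteString
import scala.concurrent.duration._

implicit val system = ActorSystem("event-producer")
implicit val materializer = ActorMaterializer()

// Serve one line per second to every client that connects on port 9999.
Tcp(system).bind("127.0.0.1", 9999).runForeach { connection =>
  val events = Source.tick(0.seconds, 1.second, ByteString("event\n"))
  connection.handleWith(Flow.fromSinkAndSource(Sink.ignore, events))
}

// On the Spark side, the stream from the earlier Streaming sketch attaches directly:
//   val lines = ssc.socketTextStream("127.0.0.1", 9999)
```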