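Cloudera Apache Spark Developer
Class type: Public course, in-house training
Course length: 3 days / 18 hours
Training dates: To be determined
Certification exam: None at present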
Training location: 博学国际教育培训中心
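Room requirements: Projector, whiteboard, flip-chart paper
Training format: Example-based instruction with live demonstrations, hands-on practice, and timely discussion
Training materials: Course textbook
Course Content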
Cloudera Developer Training for Apache Spark
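Course overview:
Use Apache Spark to build complete, unified big data applications that combine batch processing, streaming, and interactive analytics. Learn to write sophisticated parallel applications that enable faster, better decisions and real-time action across a wide range of use cases, architectures, and industries.
Intended audience:
This course is for developers and engineers who want to improve the speed, ease of use, and sophistication of their applications. Participants need a background in Python or Scala; basic Linux knowledge is a plus.
Training objectives: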
Using the Spark shell for interactive data analysis
The features of Spark’s Resilient Distributed Datasets
How Spark runs on a cluster
How Spark parallelizes task execution
Writing Spark applications
Processing streaming data with Spark
Course outline:
Introduction to Spark
What is Spark?
Review: From Hadoop MapReduce to Spark
Review: HDFS
Review: YARN
Spark Overview
Spark Basics
Using the Spark Shell
RDDs (Resilient Distributed Datasets)
Functional Programming in Spark
Working with RDDs in Spark
Creating RDDs
Other General RDD Operations
Aggregating Data with Pair RDDs
Key-Value Pair RDDs
Map-Reduce
Other Pair RDD Operations
Writing and Deploying Spark Applications
Spark Applications vs. Spark Shell
Creating the SparkContext
Building a Spark Application (Scala and Java)
Running a Spark Application
The Spark Application Web UI
Hands-On Exercise: Write and Run a Spark Application
Configuring Spark Properties
Logging
Parallel Processing
Review: Spark on a Cluster
RDD Partitions
Partitioning of File-based RDDs
HDFS and Data Locality
Executing Parallel Operations
Stages and Tasks
Spark RDD Persistence
RDD Lineage
RDD Persistence Overview
Distributed Persistence
Basic Spark Streaming
Spark Streaming Overview
Example: Streaming Request Count
DStreams
Developing Spark Streaming Applications
Advanced Spark Streaming
Multi-Batch Operations
State Operations
Sliding Window Operations
Advanced Data Sources
Common Patterns in Spark Data Processing
Common Spark Use Cases
Iterative Algorithms in Spark
Graph Processing and Analysis
Machine Learning
Example: k-means
Improving Spark Performance
Shared Variables: Broadcast Variables
Shared Variables: Accumulators
Common Performance Issues
Diagnosing Performance Problems
Spark SQL and DataFrames
Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames
Saving DataFrames
DataFrames and RDDs
Comparing Spark SQL, Impala and Hive-on-Spark