
Tuning Apache Spark: Powerful Big Data Processing Recipes

Posted By: ELK1nG
Last updated 2/2019
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 KHz
Language: English | Size: 5.11 GB | Duration: 12h 0m

Uncover the lesser-known secrets of powerful big data processing with Spark and Kafka

What you'll learn

How to attain a solid foundation in the most powerful and versatile technologies involved in data streaming: Apache Spark and Apache Kafka

Form a robust and clean architecture for a data streaming pipeline

Ways to implement the correct tools to bring your data streaming architecture to life

How to create robust processing pipelines by testing Apache Spark jobs

How to create highly concurrent Spark programs by leveraging immutability

How to solve repeated problems by leveraging the GraphX API

How to solve long-running computation problems by leveraging lazy evaluation in Spark

Tips to avoid memory leaks by understanding the internal memory management of Apache Spark

How to troubleshoot real-time pipelines written in Spark Streaming
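
The lazy-evaluation point above is the core idea behind many of these objectives, and can be sketched with a minimal word-count job in Scala (a hypothetical example, assuming a local SparkSession and a Spark dependency on the classpath):

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-eval-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("spark is fast", "spark is lazy"))

    // Transformations are lazy: nothing executes here.
    // Spark only records the lineage of operations.
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Only an action (collect, count, saveAsTextFile, ...) triggers
    // the actual distributed computation.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```

Because the whole chain is deferred until the action, Spark can optimize and pipeline the transformations before any work is done, which is what makes long-running computations tunable.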

Requirements

To pick up this course, you don't need to be an expert with Spark. You should, however, be familiar with Java or Scala.

Description

Video Learning Path Overview

A Learning Path is a specially tailored course that brings together two or more topics to help you achieve an end goal. Much thought goes into the selection of assets for a Learning Path, based on a complete understanding of what it takes to reach that goal.

Today, organizations have a difficult time working with large datasets. In addition, big data needs to be processed and analyzed in real time to gain valuable insights quickly. This is where data streaming and Spark come in.

In this well-thought-out Learning Path, you will not only learn how to work with Spark to solve the problem of analyzing massive amounts of data for your organization, but you'll also learn how to tune it for performance. Beginning with a step-by-step approach, you'll get comfortable using Spark and will learn practical, proven techniques to improve particular aspects of programming and administration in Apache Spark. You'll be able to perform tasks and get the best out of your databases much faster.

Moving further and accelerating the pace a bit, you'll learn some of the lesser-known techniques to squeeze the best out of Spark, and then how to overcome several problems you might come across when working with it, without having to break a sweat. The simple, practical solutions provided will get you back in action in no time at all.

By the end of the course, you will be well versed in using Spark in your day-to-day projects.

Key Features

From blueprint architecture to complete code solution, this course treats every important aspect involved in architecting and developing a data streaming pipeline.

Test Spark jobs using unit, integration, and end-to-end techniques to make your data pipeline robust and bulletproof.

Solve several painful issues, such as slow-running jobs, that affect the performance of your application.

Author Bios

Anghel Leonard is currently a Java chief architect. He is a member of the Java EE Guardians with 20+ years of experience. He has spent most of his career architecting distributed systems. He is also the author of several books, a speaker, and a big fan of working with data.

Tomasz Lelek is a software engineer, programming mostly in Java and Scala. He has been working with the Spark and ML APIs for the past five years, with production experience in processing petabytes of data. He is passionate about nearly everything associated with software development and believes that we should always consider different solutions and approaches before solving a problem. He has recently spoken at conferences in Poland, Confitura and JDD (Java Developers Day), and at the Krakow Scala User Group, and has conducted a live coding session at the Geecon Conference. He is a co-founder of initlearn, an e-learning platform built with Java. He has also written articles about everything related to the Java world.

Overview

Section 1: Data Stream Development with Apache Spark, Kafka, and Spring Boot

Lecture 1 The Course Overview

Lecture 2 Discovering the Data Streaming Pipeline Blueprint Architecture

Lecture 3 Analyzing Meetup RSVPs in Real-Time

Lecture 4 Running the Collection Tier (Part I – Collecting Data)

Lecture 5 Collecting Data Via the Stream Pattern and Spring WebSocketClient API

Lecture 6 Explaining the Message Queuing Tier Role

Lecture 7 Introducing Our Message Queuing Tier – Apache Kafka

Lecture 8 Running The Collection Tier (Part II – Sending Data)

Lecture 9 Dissecting the Data Access Tier

Lecture 10 Introducing Our Data Access Tier – MongoDB

Lecture 11 Exploring Spring Reactive

Lecture 12 Exposing the Data Access Tier in Browser

Lecture 13 Diving into the Analysis Tier

Lecture 14 Streaming Algorithms For Data Analysis

Lecture 15 Introducing Our Analysis Tier – Apache Spark

Lecture 16 Plug-in Spark Analysis Tier to Our Pipeline

Lecture 17 Brief Overview of Spark RDDs

Lecture 18 Spark Streaming

Lecture 19 DataFrames, Datasets and Spark SQL

Lecture 20 Spark Structured Streaming

Lecture 21 Machine Learning in 7 Steps

Lecture 22 MLlib (Spark ML)

Lecture 23 Spark ML and Structured Streaming

Lecture 24 Spark GraphX

Lecture 25 Fault Tolerance (HML)

Lecture 26 Kafka Connect

Lecture 27 Securing Communication between Tiers
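
As a taste of the Kafka-to-Spark hand-off this section builds (Lectures 7, 18–20, and 26), a minimal Structured Streaming read from Kafka might look like the sketch below. The topic name `meetup-rsvps` and the broker address are placeholders, and the `spark-sql-kafka` connector is assumed to be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-stream-sketch")
      .master("local[*]")
      .getOrCreate()

    // Subscribe to a Kafka topic; topic and broker are placeholders.
    val rsvps = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "meetup-rsvps")
      .load()

    // Kafka records arrive as binary key/value pairs;
    // cast the payload to a string for downstream parsing.
    val messages = rsvps.selectExpr("CAST(value AS STRING) AS json")

    // Print each micro-batch to the console until the query is stopped.
    val query = messages.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

The same `readStream`/`writeStream` shape carries through the rest of the pipeline: swap the console sink for MongoDB or another data access tier and the analysis logic slots in between.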

Section 2: Apache Spark: Tips, Tricks, & Techniques

Lecture 28 The Course Overview

Lecture 29 Using Spark Transformations to Defer Computations to a Later Time

Lecture 30 Avoiding Transformations

Lecture 31 Using reduce and reduceByKey to Calculate Results

Lecture 32 Performing Actions That Trigger Computations

Lecture 33 Reusing the Same RDD for Different Actions

Lecture 34 Delve into Spark RDDs Parent/Child Chain

Lecture 35 Using RDD in an Immutable Way

Lecture 36 Using DataFrame Operations to Transform It

Lecture 37 Immutability in the Highly Concurrent Environment

Lecture 38 Using Dataset API in an Immutable Way

Lecture 39 Detecting a Shuffle in a Processing

Lecture 40 Testing Operations That Cause Shuffle in Apache Spark

Lecture 41 Changing Design of Jobs with Wide Dependencies

Lecture 42 Using keyBy() Operations to Reduce Shuffle

Lecture 43 Using Custom Partitioner to Reduce Shuffle

Lecture 44 Saving Data in Plain Text

Lecture 45 Leveraging JSON as a Data Format

Lecture 46 Tabular Formats – CSV

Lecture 47 Using Avro with Spark

Lecture 48 Columnar Formats – Parquet

Lecture 49 Available Transformations on Key/Value Pairs

Lecture 50 Using aggregateByKey Instead of groupBy()

Lecture 51 Actions on Key/Value Pairs

Lecture 52 Available Partitioners on Key/Value Data

Lecture 53 Implementing Custom Partitioner

Lecture 54 Separating Logic from Spark Engine – Unit Testing

Lecture 55 Integration Testing Using SparkSession

Lecture 56 Mocking Data Sources Using Partial Functions

Lecture 57 Using ScalaCheck for Property-Based Testing

Lecture 58 Testing in Different Versions of Spark

Lecture 59 Creating Graph from Datasource

Lecture 60 Using Vertex API

Lecture 61 Using Edge API

Lecture 62 Calculate Degree of Vertex

Lecture 63 Calculate Page Rank
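
Several of the lectures above (notably 42, 43, and 50) revolve around reducing shuffle on key/value data. The central idea can be sketched as follows, contrasting `groupByKey` with `reduceByKey` (a hypothetical example, assuming a local SparkSession):

```scala
import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

    // groupByKey ships every individual value across the network
    // before any aggregation happens.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values on each partition first (a map-side
    // combine), so far less data crosses the shuffle boundary.
    val viaReduce = pairs.reduceByKey(_ + _)

    // Both produce the same sums per key; only the execution cost differs.
    viaReduce.collect().foreach(println)

    spark.stop()
  }
}
```

The same reasoning motivates `aggregateByKey` over `groupBy()` and custom partitioners: the less data that has to move between executors, the faster the job.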

Section 3: Troubleshooting Apache Spark

Lecture 64 The Course Overview

Lecture 65 Eager Computations: Lazy Evaluation

Lecture 66 Caching Values: In-Memory Persistence

Lecture 67 Unexpected API Behavior: Picking the Proper RDD API

Lecture 68 Wide Dependencies: Using Narrow Dependencies

Lecture 69 Making Computations Parallel: Using Partitions

Lecture 70 Defining Robust Custom Functions: Understanding User-Defined Functions

Lecture 71 Logical Plans Hiding the Truth: Examining the Physical Plans

Lecture 72 Slow Interpreted Lambdas: Code Generation Spark Optimization

Lecture 73 Avoid Wrong Join Strategies: Using a Join Type Based on Data Volume

Lecture 74 Slow Joins: Choosing an Execution Plan for Join

Lecture 75 Distributed Joins Problem: DataFrame API

Lecture 76 TypeSafe Joins Problem: The Newest DataSet API

Lecture 77 Minimizing Object Creation: Reusing Existing Objects

Lecture 78 Iterating Transformations – The mapPartitions() Method

Lecture 79 Slow Spark Application Start: Reducing Setup Overhead

Lecture 80 Performing Unnecessary Recomputation: Reusing RDDs

Lecture 81 Repeating the Same Code in Stream Pipeline: Using Sources and Sinks

Lecture 82 Long Latency of Jobs: Understanding Batch Internals

Lecture 83 Fault Tolerance: Using Data Checkpointing

Lecture 84 Maintaining Batch and Streaming: Using Structured Streaming Pros
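
The caching and recomputation topics above (Lectures 66 and 80) come down to explicit persistence of reused RDDs. A minimal sketch, with a placeholder input path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("caching-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val parsed = sc.textFile("data/events.log")  // placeholder path
      .map(_.toLowerCase)

    // Without persist(), every action below would re-read and
    // re-parse the file from scratch.
    parsed.persist(StorageLevel.MEMORY_ONLY)

    val total  = parsed.count()                               // computes and caches
    val errors = parsed.filter(_.contains("error")).count()   // served from cache

    println(s"total=$total errors=$errors")

    parsed.unpersist()
    spark.stop()
  }
}
```

Choosing the right `StorageLevel` (memory only, memory and disk, serialized, and so on) is itself a tuning decision, since over-caching is one source of the memory pressure this section troubleshoots.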

An Application Developer, Data Scientist, Analyst, Statistician, Big Data Engineer, or anyone with some experience with Spark will feel perfectly comfortable with the topics presented. They typically work with large amounts of data on a day-to-day basis. They may or may not have used Spark before, but some experience with the tool is an added advantage.