Spark SQL and PySpark 3 using Python 3 Hands-On with Labs
Last updated 8/2022
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 KHz
Language: English | Size: 10.03 GB | Duration: 32h 12m
A Comprehensive Course on Spark SQL as well as Data Frame APIs using Python 3 with complimentary lab access
What you'll learn
Set up a Single Node Hadoop and Spark Cluster using Docker locally or on AWS Cloud9
Review ITVersity Labs (exclusively for ITVersity Lab Customers)
All the HDFS Commands that are relevant to validate files and folders in HDFS.
Quick recap of the Python that is relevant to learning Spark
Ability to use Spark SQL to solve problems using SQL-style syntax.
Pyspark Data Frame APIs to solve problems using Data Frame-style APIs.
Relevance of Spark Metastore to convert Data Frames into Temporary Views so that one can process data in Data Frames using Spark SQL.
Apache Spark Application Development Life Cycle
Apache Spark Application Execution Life Cycle and Spark UI
Setup SSH Proxy to access Spark Application logs
Deployment Modes of Spark Applications (Cluster and Client)
Passing Application Properties Files and External Dependencies while running Spark Applications
Requirements
Basic programming skills using any programming language
Self-supported lab (instructions provided) or ITVersity lab at additional cost for an appropriate environment.
Minimum memory required depends on the environment you are using, with a 64-bit operating system
4 GB RAM with access to proper clusters or 16 GB RAM to set up the environment using Docker
Description
As part of this course, you will learn all the key skills to build Data Engineering Pipelines using Spark SQL and Spark Data Frame APIs with Python as the programming language. This course used to be a CCA 175 Spark and Hadoop Developer course for the preparation for the Certification Exam. As of 10/31/2021, the exam is sunset and we have renamed it to Apache Spark 2 and 3 using Python 3 as it covers industry-relevant topics beyond the scope of certification.
About Data Engineering
Data Engineering is nothing but processing the data depending upon our downstream needs. We need to build different pipelines such as Batch Pipelines, Streaming Pipelines, etc. as part of Data Engineering. All roles related to Data Processing are consolidated under Data Engineering. Conventionally, they are known as ETL Development, Data Warehouse Development, etc. Apache Spark has evolved as a leading technology to take care of Data Engineering at scale.
I have prepared this course for anyone who would like to transition into a Data Engineer role using Pyspark (Python + Spark). I am a Data Engineering Solution Architect with proven experience in designing solutions using Apache Spark.
Let us go through the details about what you will be learning in this course. Keep in mind that the course is created with a lot of hands-on tasks which will give you enough practice using the right tools. Also, there are tons of tasks and exercises to evaluate yourself. We will provide details about Resources or Environments to learn Spark SQL and PySpark 3 using Python 3 as well as Reference Material on GitHub to practice Spark SQL and PySpark 3 using Python 3.
Keep in mind that you can use the cluster at your workplace, set up the environment using the provided instructions, or use ITVersity Lab to take this course.
Setup of Single Node Big Data Cluster
Many of you would like to transition to Big Data from Conventional Technologies such as Mainframes, Oracle PL/SQL, etc. and you might not have access to Big Data Clusters. It is very important for you to set up the environment in the right manner. Don't worry if you do not have the cluster handy; we will guide you through support via Udemy Q&A.
Set up Ubuntu-based AWS Cloud9 Instance with the right configuration
Ensure Docker is set up
Set up Jupyter Lab and other key components
Set up and Validate Hadoop, Hive, YARN, and Spark
Are you feeling a bit overwhelmed about setting up the environment? Don't worry! We will provide complimentary lab access for up to 2 months. Here are the details.
Training using an interactive environment. You will get 2 weeks of lab access to begin with. If you like the environment and acknowledge it by providing a 5* rating and feedback, the lab access will be extended by an additional 6 weeks (2 months in total). Feel free to send an email to support@itversity.com to get complimentary lab access. Also, if your employer provides a multi-node environment, we will help you set up the material for the practice as part of the live session. On top of Q&A Support, we also provide required support via live sessions.
A quick recap of Python
This course requires a decent knowledge of Python. To make sure you understand Spark from a Data Engineering perspective, we added a module to quickly warm up with Python. If you are not familiar with Python, then we suggest you go through our other course Data Engineering Essentials - Python, SQL, and Spark.
Master required Hadoop Skills to build Data Engineering Applications
As part of this section, you will primarily focus on HDFS commands so that we can copy files into HDFS.
The data copied into HDFS will be used as part of building data engineering pipelines using Spark and Hadoop with Python as the Programming Language.
Overview of HDFS Commands
Copy Files into HDFS using the put or copyFromLocal command
Review whether the files are copied properly or not to HDFS using HDFS Commands.
Get the size of the files and the storage usage using HDFS commands such as du, df, etc.
Some fundamental concepts related to HDFS such as block size, replication factor, etc.
Data Engineering using Spark SQL
Let us deep-dive into Spark SQL to understand how it can be used to build Data Engineering Pipelines. Spark with SQL will provide us the ability to leverage distributed computing capabilities of Spark coupled with easy-to-use developer-friendly SQL-style syntax.
Getting Started with Spark SQL
Basic Transformations using Spark SQL
Managing Tables - Basic DDL and DML in Spark SQL
Managing Tables - DML and Create Partitioned Tables using Spark SQL
Overview of Spark SQL Functions to manipulate strings, dates, null values, etc.
Windowing Functions using Spark SQL for ranking, advanced aggregations, etc.
Data Engineering using Spark Data Frame APIs
Spark Data Frame APIs are an alternative way of building Data Engineering applications at scale leveraging distributed computing capabilities of Spark.
Data Engineers from application development backgrounds might prefer Data Frame APIs over Spark SQL to build Data Engineering applications.
Data Processing Overview using Spark or Pyspark Data Frame APIs.
Projecting or Selecting data from Spark Data Frames, renaming columns, providing aliases, dropping columns from Data Frames, etc. using Pyspark Data Frame APIs.
Processing Column Data using Spark or Pyspark Data Frame APIs - You will be learning functions to manipulate strings, dates, null values, etc.
Basic Transformations on Spark Data Frames using Pyspark Data Frame APIs such as Filtering, Aggregations, and Sorting using functions such as filter/where, groupBy with agg, sort or orderBy, etc.
Joining Data Sets on Spark Data Frames using Pyspark Data Frame APIs such as join. You will learn inner joins, outer joins, etc. using the right examples.
Windowing Functions on Spark Data Frames using Pyspark Data Frame APIs to perform advanced Aggregations, Ranking, and Analytic Functions
Spark Metastore Databases and Tables and integration between Spark SQL and Data Frame APIs
Apache Spark Application Development and Deployment Life Cycle
Once you go through the content related to Spark using a Jupyter-based environment, we will also walk you through the details about how Spark applications are typically developed using Python, deployed, and reviewed.
Set up Python Virtual Environment and Project for Spark Application Development using Pycharm
Understand the complete Spark Application Development Lifecycle using Pycharm and Python
Build a zip file for the Spark Application, copy it to the environment where it is supposed to run, and run it.
Understand how to review the Spark Application Execution Life Cycle.
All the demos are given on our state-of-the-art Big Data cluster. You can avail of one-month complimentary lab access by reaching out to support@itversity.com with a Udemy receipt.
Overview
Section 1: Introduction about Spark SQL and PySpark 3 using Python 3
Lecture 1 Introduction to Spark SQL and PySpark 3 using Python 3
Lecture 2 Curriculum for Spark SQL and Pyspark 3 using Python 3
Lecture 3 Purchasing the Spark SQL and PySpark using Python 3 Course
Lecture 4 Introduction to Udemy Course Landing Page
Lecture 5 Overview of Udemy Course or Video Player
Lecture 6 Adding Notes to Course Lectures
Lecture 7 Using Course Sidebar to move between lectures
Lecture 8 Overview of Support to ITVersity courses on Udemy
Lecture 9 Best Practices to get ITVersity Support using Udemy
Lecture 10 Resources for Spark SQL and Pyspark 3 using Python 3
Lecture 11 Material for Spark SQL and PySpark 3 using Python 3
Lecture 12 Become Part of ITVersity Data Engineering Community
Lecture 13 Rate and Leave Feedback - Spark SQL and PySpark 3 using Python 3
Lecture 14 Udemy for Business Customers - Important Information about labs for practice
Section 2: Using ITVersity Labs for hands-on practice (for ITVersity Lab Customers only)
Lecture 15 Setup Development Environment using VS Code Remote Development Extension Pack
Lecture 16 Review Data Sets Provided as part of Gateway Nodes of Hadoop and Spark Cluster
Lecture 17 Validate HDFS on Multi Node Hadoop and Spark Cluster from Gateway Node
Lecture 18 Validate Hive on Hadoop and Spark Multinode Cluster
Lecture 19 Review Hadoop HDFS and YARN Property Files on Hadoop and Spark Cluster
Lecture 20 Review Hadoop HDFS and YARN Property Files using Visual Studio Code Editor
Lecture 21 Review Hive Property Files on Multinode Hadoop and Spark Cluster
Lecture 22 Review Spark 2 Property Files and Important Properties
Lecture 23 Validate Spark Shell CLI using Spark 2
Lecture 24 Validate Pyspark CLI using Spark 2
Lecture 25 Validate Spark SQL CLI using Spark 2
Lecture 26 Review Spark 3 Property Files and Important Properties
Lecture 27 Validate Spark Shell CLI using Spark 3
Lecture 28 Validate Pyspark CLI using Spark 3
Lecture 29 Validate Spark SQL CLI using Spark 3
Section 3: Setup Hadoop and Spark Single Node Cluster on Windows 11 using Docker
Lecture 30 Prerequisites for Single Node Hadoop and Spark Cluster on Windows
Lecture 31 Overview of Windows System Configuration
Lecture 32 Setup Ubuntu on Windows 11 using wsl
Lecture 33 Setup and Validate Ubuntu VM on Windows using wsl
Lecture 34 Install Docker Desktop on Windows 11 using wsl2
Lecture 35 Overview of Docker Desktop on Windows 11
Lecture 36 Validate Docker Commands using Windows Powershell as well as wsl Ubuntu
Lecture 37 Setup Visual Studio Code IDE on Windows
Lecture 38 Install Visual Studio Code Extension for Remote Development
Lecture 39 Clone GitHub Repository for Pyspark Course using Visual Studio Code
Lecture 40 Launching Terminal using Visual Studio Code and WSL
Lecture 41 Review Docker Compose File to setup Hadoop and Spark Lab
Lecture 42 Start Hadoop and Spark Lab along with Jupyter Lab on Windows 11
Lecture 43 Review the resource utilization of Windows for Hadoop and Spark Lab
Lecture 44 Review Docker Desktop for Hadoop and Spark Lab using Docker
Lecture 45 Overview of Docker Compose Commands to manage Hadoop and Spark Lab
Lecture 46 Validate Hadoop and Spark setup using Docker on Windows
Section 4: Setup Hadoop and Spark Single Node Cluster on AWS Cloud9 using Docker
Lecture 47 Getting Started with AWS Cloud9
Lecture 48 Creating AWS Cloud9 Environment
Lecture 49 Warming up with AWS Cloud9 IDE
Lecture 50 Review Operating System Details on AWS Cloud9
Lecture 51 Overview of EC2 Instance related to AWS Cloud9
Lecture 52 Opening ports for AWS Cloud9 Instance
Lecture 53 Associating Elastic IPs to AWS Cloud9 Instance
Lecture 54 Increase EBS Volume Size of AWS Cloud9 Instance
Lecture 55 Setup Docker Compose on AWS Cloud9 Instance
Lecture 56 Clone GitHub Repository on AWS Cloud9 for the Course Material
Lecture 57 Review Docker Compose File to setup Hadoop and Spark Lab
Lecture 58 Start Hadoop and Spark Lab along with Jupyter Lab on Windows 11
Lecture 59 Overview of Docker Compose Commands to manage Hadoop and Spark Lab
Lecture 60 Validate Hadoop and Spark setup using Docker
Section 5: Python Fundamentals
Lecture 61 Introduction and Setting up Python
Lecture 62 Basic Programming Constructs
Lecture 63 Functions in Python
Lecture 64 Python Collections
Lecture 65 Map Reduce operations on Python Collections
Lecture 66 Setting up Data Sets for Basic I/O Operations
Lecture 67 Basic I/O operations and processing data using Collections
Section 6: Overview of Hadoop HDFS Commands
Lecture 68 Getting help or usage
Lecture 69 Listing HDFS Files
Lecture 70 Managing HDFS Directories
Lecture 71 Copying files from local to HDFS
Lecture 72 Copying files from HDFS to local
Lecture 73 Getting File Metadata
Lecture 74 Previewing Data in HDFS File
Lecture 75 HDFS Block Size
Lecture 76 HDFS Replication Factor
Lecture 77 Getting HDFS Storage Usage
Lecture 78 Using HDFS Stat Commands
Lecture 79 HDFS File Permissions
Lecture 80 Overriding Properties
Section 7: Apache Spark 2.x - Data processing - Getting Started
Lecture 81 Introduction
Lecture 82 Review of Setup Steps for Spark Environment
Lecture 83 Using ITVersity labs
Lecture 84 Apache Spark Official Documentation (Very Important)
Lecture 85 Quick Review of Spark APIs
Lecture 86 Spark Modules
Lecture 87 Spark Data Structures - RDDs and Data Frames
Lecture 88 Develop Simple Application
Lecture 89 Apache Spark - Framework
Lecture 90 Create Data Frames from Text Files
Lecture 91 Create Data Frames from Hive Tables
Section 8: Apache Spark using SQL - Getting Started
Lecture 92 Getting Started - Overview
Lecture 93 Overview of Spark Documentation
Lecture 94 Launching and using Spark SQL CLI
Lecture 95 Overview of Spark SQL Properties
Lecture 96 Running OS Commands using Spark SQL
Lecture 97 Understanding Spark Metastore Warehouse Directory
Lecture 98 Managing Spark Metastore Databases using Spark SQL
Lecture 99 Managing Spark Metastore Tables using Spark SQL
Lecture 100 Retrieve Metadata of Spark Metastore Tables using Spark SQL Describe Command
Lecture 101 Role of Spark Metastore or Hive Metastore
Lecture 102 Exercise - Getting Started with Spark SQL
Section 9: Apache Spark using SQL - Basic Transformations using Spark SQL
Lecture 103 Basic Transformations using Spark SQL - Introduction
Lecture 104 Spark SQL - Overview
Lecture 105 Define Problem Statement
Lecture 106 Prepare Spark Metastore Tables for Basic Transformations using Spark SQL
Lecture 107 Projecting Data using Spark SQL Select Clause
Lecture 108 Filtering Data using Spark SQL Where Clause
Lecture 109 Joining Tables using Spark SQL - Inner
Lecture 110 Joining Tables using Spark SQL - Outer
Lecture 111 Aggregating Data using Group By in Spark SQL
Lecture 112 Sorting Data using Order By in Spark SQL
Lecture 113 Conclusion - Final Solution for the problem statement using Spark SQL
Section 10: Apache Spark using SQL - Basic DDL and DML
Lecture 114 Introduction to Basic DDL and DML in Spark SQL
Lecture 115 Create Spark Metastore Tables using Spark SQL Create Statement
Lecture 116 Overview of Data Types used in Spark Metastore Tables
Lecture 117 Adding Comments to Spark Metastore Tables using Spark SQL
Lecture 118 Loading Data from Local File System Into Tables using Spark SQL Load Statement
Lecture 119 Loading Data from HDFS Folders Into Tables using Spark SQL Load Statement
Lecture 120 Difference between Load with Append and Overwrite using Spark SQL Load Statement
Lecture 121 Creating External Spark Metastore Tables using Spark SQL
Lecture 122 Difference between Managed and External Spark Metastore Tables
Lecture 123 Overview of File Formats used in Spark Metastore Tables
Lecture 124 Drop Spark Metastore Tables and Databases using Spark SQL
Lecture 125 Truncating Spark Metastore Tables
Lecture 126 Exercise - Managed Spark Metastore Tables
Section 11: Apache Spark using SQL - DML and Partitioning
Lecture 127 Introduction to DML and Partitioning using Spark SQL on Spark Metastore Tables
Lecture 128 Introduction to Partitioning of Spark Metastore Tables using Spark SQL
Lecture 129 Creating Spark Metastore Tables using Parquet File Format
Lecture 130 Difference between Load and Insert to get data into Spark Metastore Tables
Lecture 131 Inserting Data using Stage Table leveraging Spark SQL
Lecture 132 Creating Spark Metastore Partitioned Tables using Spark SQL
Lecture 133 Adding Partitions to Spark Metastore Tables using Spark SQL
Lecture 134 Loading Data into Spark Metastore Partitioned Tables using Spark SQL
Lecture 135 Inserting Data into Spark Metastore Partitions using Spark SQL Insert Statement
Lecture 136 Using Dynamic Partition Mode while inserting into Spark Partitioned Tables
Lecture 137 Exercise - Partitioned Tables using Spark SQL
Section 12: Apache Spark using SQL - Pre-defined Functions
Lecture 138 Introduction - Overview of Spark SQL Pre-defined Functions
Lecture 139 Overview of Spark SQL Pre-defined Functions
Lecture 140 Validating Spark SQL Functions
Lecture 141 String Manipulation using Spark SQL Functions
Lecture 142 Date Manipulation using Spark SQL Functions
Lecture 143 Overview of Numeric Functions in Spark SQL
Lecture 144 Data Type Conversion using Spark SQL
Lecture 145 Dealing with Nulls using Spark SQL
Lecture 146 Using CASE and WHEN in Spark SQL Queries
Lecture 147 Query Example - Word Count using Spark SQL
Section 13: Apache Spark SQL - Windowing Functions
Lecture 148 Introduction to Windowing Functions in Spark SQL
Lecture 149 Prepare HR Database for Windowing Functions in Spark SQL
Lecture 150 Overview of Windowing Functions using Spark SQL
Lecture 151 Aggregations using Spark SQL Windowing Functions
Lecture 152 Using LEAD or LAG in Spark SQL Windowing Functions
Lecture 153 Getting first and last values using Spark SQL Windowing Functions
Lecture 154 Ranking using Spark SQL Windowing Functions - rank, dense_rank and row_number
Lecture 155 Order of execution of Spark SQL Queries
Lecture 156 Overview of Subqueries in Spark SQL
Lecture 157 Filtering Window Function Results using Spark SQL
Section 14: Apache Spark using Python - Data Processing Overview
Lecture 158 Starting Spark Context - pyspark
Lecture 159 Overview of Spark Read APIs
Lecture 160 Understanding airlines data
Lecture 161 Inferring Schema using Spark Data Frame APIs
Lecture 162 Previewing Airlines Data using Spark Data Frame APIs
Lecture 163 Overview of Data Frame APIs
Lecture 164 Overview of Functions on Spark Data Frames
Lecture 165 Overview of Spark Write APIs
Section 15: Apache Spark using Python - Processing Column Data
Lecture 166 Overview of Predefined Functions on Spark Data Frame Columns
Lecture 167 Create Dummy Data Frame to explore Functions on Data Frame Columns
Lecture 168 Categories of Predefined Functions used on Spark Data Frame Columns
Lecture 169 Special Functions for Spark Data Frame Columns - col and lit
Lecture 170 Common String Manipulation Functions for Spark Data Frame Columns
Lecture 171 Extracting Strings using substring from Spark Data Frame Columns
Lecture 172 Extracting Strings using split from Spark Data Frame Columns
Lecture 173 Padding Characters around Strings in Spark Data Frame Columns
Lecture 174 Trimming Characters from Strings in Spark Data Frame Columns
Lecture 175 Date and Time Manipulation Functions for Spark Data Frame Columns
Lecture 176 Date and Time Arithmetic on Spark Data Frame Columns
Lecture 177 Using Date and Time Trunc Functions on Spark Data Frame Columns
Lecture 178 Date and Time Extract Functions for Spark Data Frame Columns
Lecture 179 Using to_date and to_timestamp on Spark Data Frame Columns
Lecture 180 Using date_format Function on Spark Data Frame Columns
Lecture 181 Dealing with Unix Timestamp in Spark Data Frame Columns
Lecture 182 Dealing with Nulls in Spark Data Frame Columns
Lecture 183 Using CASE and WHEN on Spark Data Frame Columns
Section 16: Apache Spark using Python - Basic Transformations
Lecture 184 Overview of Basic Transformations on Spark Data Frames
Lecture 185 Spark Data Frames for basic transformations
Lecture 186 Basic Filtering of Data or rows using where from Spark Data Frames
Lecture 187 Filtering Example using dates on Spark Data Frames
Lecture 188 Boolean Operators while filtering from Spark Data Frames
Lecture 189 Using IN Operator or isin Function while filtering from Spark Data Frames
Lecture 190 Using LIKE Operator or like Function while filtering from Spark Data Frames
Lecture 191 Using BETWEEN Operator while filtering from Spark Data Frames
Lecture 192 Dealing with Nulls while Filtering from Spark Data Frames
Lecture 193 Total Aggregations on Spark Data Frames
Lecture 194 Aggregate data using groupBy from Spark Data Frames
Lecture 195 Aggregate data using rollup on Spark Data Frames
Lecture 196 Aggregate data using cube on Spark Data Frames
Lecture 197 Overview of Sorting Spark Data Frames
Lecture 198 Solution - Problem 1 - Get Total Aggregations
Lecture 199 Solution - Problem 2 - Get Total Aggregations By FlightDate
Section 17: Apache Spark using Python - Joining Data Sets
Lecture 200 Prepare Datasets for Joining Spark Data Frames
Lecture 201 Analyze Datasets for Joining Spark Data Frames
Lecture 202 Problem Statements for Joining Spark Data Frames
Lecture 203 Overview of Joins on Spark Data Frames
Lecture 204 Using Inner Joins on Spark Data Frames
Lecture 205 Left or Right Outer Join on Spark Data Frames
Lecture 206 Solution - Get Flight Count Per US Airport using Spark Data Frame APIs
Lecture 207 Solution - Get Flight Count Per US State using Spark Data Frame APIs
Lecture 208 Solution - Get Dormant US Airports using Spark Data Frame APIs
Lecture 209 Solution - Get Origins without master data using Spark Data Frame APIs
Lecture 210 Solution - Get Count of Flights without master data using Spark Data Frame APIs
Lecture 211 Solution - Get Count of Flights per Airport without master data
Lecture 212 Solution - Get Daily Revenue using Spark Data Frame APIs
Lecture 213 Solution - Get Daily Revenue rolled up till Yearly using Spark Data Frame APIs
Section 18: Apache Spark using Python - Spark Metastore
Lecture 214 Overview of APIs to deal with Spark Metastore
Lecture 215 Exploring Spark Catalog
Lecture 216 Creating Spark Metastore Tables using catalog
Lecture 217 Inferring Schema while creating Spark Metastore Tables using Spark Catalog
Lecture 218 Define Schema for Spark Metastore Tables using StructType
Lecture 219 Inserting into Existing Spark Metastore Tables using Spark Data Frame APIs
Lecture 220 Read and Process data from Spark Metastore Tables using Data Frame APIs
Lecture 221 Create Spark Metastore Partitioned Tables using Data Frame APIs
Lecture 222 Saving as Spark Metastore Partitioned Table using Data Frame APIs
Lecture 223 Creating Temporary Views on top of Spark Data Frames
Lecture 224 Using Spark SQL against Temporary Views on Spark Data Frames
Section 19: Getting Started with Semi Structured Data using Spark
Lecture 225 Introduction to Getting Started with Semi Structured Data using Spark
Lecture 226 Create Spark Metastore Table with Special Data Types
Lecture 227 Overview of ARRAY Type in Spark Metastore Table
Lecture 228 Overview of MAP and STRUCT Type in Spark Metastore Table
Lecture 229 Insert Data into Spark Metastore Table with Special Type Columns
Lecture 230 Create Spark Data Frame with Special Data Types
Lecture 231 Create Spark Data Frame with Special Types using Python List
Lecture 232 Insert Spark Data Frame with Special Types into Spark Metastore Table
Lecture 233 Review Data in the JSON File with Special Data Types
Lecture 234 Setup JSON Data Set to explore Spark APIs on Special Data Type Columns
Lecture 235 Read JSON Data with Special Types into Spark Data Frame
Lecture 236 Flatten Array Fields in Spark Data Frames using explode and explode_outer
Lecture 237 Get Size or Length of Array Type Columns in Spark Data Frame
Lecture 238 Concatenate Array Values into Delimited String using Spark APIs
Lecture 239 Convert Delimited Strings from Spark Data Frame Columns to Arrays
Lecture 240 Setup Data Sets to Build Arrays using Spark
Lecture 241 Read JSON Data into Spark Data Frame and Review Aggregate Operations
Lecture 242 Build Arrays from Flattened Rows of Spark Data Frame
Lecture 243 Getting Started with Spark Data Frames with Struct Columns
Lecture 244 Concatenate Struct Column Values in Spark Data Frame
Lecture 245 Filter Data on Struct Column Attributes in Spark Data Frame
Lecture 246 Create Spark Data Frame using Map Type Column
Lecture 247 Project Map Values as Columns using Spark Data Frame APIs
Lecture 248 Conclusion of Getting Started with Semi Structured Data using Spark
Section 20: Process Semi Structured Data using Spark Data Frame APIs
Lecture 249 Introduction to Process Semi Structured Data using Spark Data Frame APIs
Lecture 250 Review the Data Sets to generate denormalized JSON Data using Spark
Lecture 251 Setup JSON Data Sets in HDFS using HDFS Command
Lecture 252 Create Spark Data Frames using Data Frame APIs
Lecture 253 Join Orders and Order Items using Spark Data Frame APIs
Lecture 254 Generate Struct Field for Order Details using Spark
Lecture 255 Generate Array of Struct Field for Order Details using Spark
Lecture 256 Join Data Sets to generate denormalized JSON Data using Spark
Lecture 257 Denormalize Join Results using Spark Data Frame APIs
Lecture 258 Write Denormalized Customer Details to JSON Files using Spark
Lecture 259 Publish JSON Files for downstream applications
Lecture 260 Read Denormalized Data into Spark Data Frame
Lecture 261 Filter Denormalized Data Frame using Spark APIs
Lecture 262 Perform Aggregations on Denormalized Data Frame using Spark
Lecture 263 Flatten Semi Structured Data or Denormalized Data using Spark
Lecture 264 Compute Monthly Customer Revenue using Spark on Denormalized Data
Lecture 265 Conclusion of Processing Semi Structured Data using Spark Data Frame APIs
Section 21: Apache Spark - Application Development Life Cycle
Lecture 266 Setup Virtual Environment and Install Pyspark
Lecture 267 Getting Started with Pycharm
Lecture 268 Passing Run Time Arguments
Lecture 269 Accessing OS Environment Variables
Lecture 270 Getting Started with Spark
Lecture 271 Create Function for Spark Session
Lecture 272 Setup Sample Data
Lecture 273 Read data from files
Lecture 274 Process data using Spark APIs
Lecture 275 Write data to files
Lecture 276 Validating Writing Data to Files
Lecture 277 Productionizing the Code
Lecture 278 Setting up Data for Production Validation
Lecture 279 Running the application using YARN
Lecture 280 Detailed Validation of the Application
Section 22: Spark Application Execution Life Cycle and Spark UI
Lecture 281 Deploying and Monitoring Spark Applications - Introduction
Lecture 282 Overview of Types of Spark Cluster Managers
Lecture 283 Setup EMR Cluster with Hadoop and Spark
Lecture 284 Overall Capacity of Big Data Cluster with Hadoop and Spark
Lecture 285 Understanding YARN Capacity of an Enterprise Cluster
Lecture 286 Overview of Hadoop HDFS and YARN Setup on Multi-node Cluster
Lecture 287 Overview of Spark Setup on top of Hadoop
Lecture 288 Setup Data Set for Word Count application
Lecture 289 Develop Word Count Application
Lecture 290 Review Deployment Process of Spark Application
Lecture 291 Overview of Spark Submit Command
Lecture 292 Switch between Python Versions to run Spark Applications or launch Pyspark CLI
Lecture 293 Switch between Pyspark Versions to run Spark Applications or launch Pyspark CLI
Lecture 294 Review Spark Configuration Properties at Run Time
Lecture 295 Develop Shell Script to run Spark Application
Lecture 296 Run Spark Application and review default executors
Lecture 297 Overview of Spark History Server UI
Section 23: Setup SSH Proxy to access Spark Application logs
Lecture 298 Setup SSH Proxy to access Spark Application logs - Introduction
Lecture 299 Overview of Private and Public ips of servers in the cluster
Lecture 300 Overview of SSH Proxy
Lecture 301 Setup sshuttle on Mac or Linux
Lecture 302 Proxy using sshuttle on Mac or Linux
Lecture 303 Accessing Spark Application logs via SSH Proxy using sshuttle on Mac or Linux
Lecture 304 Side effects of using SSH Proxy to access Spark Application Logs
Lecture 305 Steps to setup SSH Proxy on Windows to access Spark Application Logs
Lecture 306 Setup PuTTY and PuTTYgen on Windows
Lecture 307 Quick Tour of PuTTY on Windows
Lecture 308 Configure Passwordless Login using PuTTYGen Keys on Windows
Lecture 309 Run Spark Application on Gateway Node using PuTTY
Lecture 310 Configure Tunnel to Gateway Node using PuTTY on Windows for SSH Proxy
Lecture 311 Setup Proxy on Windows and validate using Microsoft Edge browser
Lecture 312 Understanding Proxying Network Traffic overcoming Windows Caveats
Lecture 313 Update Hosts file for worker nodes using private ips
Lecture 314 Access Spark Application logs using SSH Proxy
Lecture 315 Overview of performing tasks related to Spark Applications using Mac
Section 24: Deployment Modes of Spark Applications
Lecture 316 Deployment Modes of Spark Applications - Introduction
Lecture 317 Default Execution Master Type for Spark Applications
Lecture 318 Launch Pyspark using local mode
Lecture 319 Running Spark Applications using Local Mode
Lecture 320 Overview of Spark CLI Commands such as Pyspark
Lecture 321 Accessing Local Files using Spark CLI or Spark Applications
Lecture 322 Overview of submitting spark application using client deployment mode
Lecture 323 Overview of submitting spark application using cluster deployment mode
Lecture 324 Review the default logging while submitting Spark Applications
Lecture 325 Changing Spark Application Log Level using custom log4j properties
Lecture 326 Submit Spark Application using client mode with log level info
Lecture 327 Submit Spark Application using cluster mode with log level info
Lecture 328 Submit Spark Applications using SPARK_CONF_DIR with custom properties files
Lecture 329 Submit Spark Applications using Properties File
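As a rough illustration of the deployment modes covered in this section, the following sketch assembles a spark-submit command for client or cluster mode, optionally with a custom properties file. It is a hypothetical helper, not code from the course: the function name build_spark_submit, the file names app.py and conf/spark-custom.conf, and the argument prod are placeholders. Only the spark-submit options --master, --deploy-mode, and --properties-file are real. The command is built as a list and printed, not executed.

```python
from typing import List, Optional, Sequence


def build_spark_submit(
    app_path: str,
    deploy_mode: str = "client",
    master: str = "yarn",
    properties_file: Optional[str] = None,
    app_args: Sequence[str] = (),
) -> List[str]:
    """Assemble a spark-submit command as an argument list (nothing is run)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode}")
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if properties_file:
        # Spark properties are read from this file instead of the default conf file
        cmd += ["--properties-file", properties_file]
    # The application script and its own arguments always come last
    return cmd + [app_path, *app_args]


# Placeholder application and properties file names
client_cmd = build_spark_submit("app.py")
cluster_cmd = build_spark_submit(
    "app.py",
    deploy_mode="cluster",
    properties_file="conf/spark-custom.conf",
    app_args=["prod"],
)
print(" ".join(client_cmd))
print(" ".join(cluster_cmd))
```

In client mode the driver runs on the machine that submits the application, while in cluster mode it runs inside the cluster, which is why logs and local files behave differently between the two.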
Section 25: Passing Application Properties Files and External Dependencies
Lecture 330 Passing Application Properties Files and External Dependencies - Introduction
Lecture 331 Steps to pass application properties using JSON
Lecture 332 Setup Working Directory to pass application properties using JSON
Lecture 333 Build the JSON with Application Properties
Lecture 334 Explore APIs to process JSON Data using Pyspark
Lecture 335 Refactor the Spark Application Code to use properties from JSON
Lecture 336 Pass Application Properties to Spark Application using local files in client mode
Lecture 337 Pass Application Properties to Spark Application using local files in cluster mode
Lecture 338 Pass Application Properties to Spark Application using HDFS files
Lecture 339 Steps to pass external Python Libraries using pyfiles
Lecture 340 Create required YAML File to externalize application properties
Lecture 341 Install PyYAML into specific folder and build zip
Lecture 342 Explore APIs to process YAML Data using Pyspark
Lecture 343 Refactor the Spark Application Code to use properties from YAML
Lecture 344 Pass External Dependencies to Spark Application using local files in client mode
Lecture 345 Pass External Dependencies to Spark Apps using local files in cluster mode
Lecture 346 Pass External Dependencies to Spark Application using HDFS files
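The idea of externalizing application properties into a JSON file, as outlined in this section, might look like the sketch below. It is an assumption about the general pattern, not the course's actual code: the function load_app_properties, the file name application.json, and all keys and paths are hypothetical. The file is written to a temporary directory so that the example is self-contained.

```python
import json
import os
import tempfile


def load_app_properties(path: str, env: str) -> dict:
    """Read a JSON properties file and return the settings for one environment."""
    with open(path) as fp:
        all_props = json.load(fp)
    if env not in all_props:
        raise KeyError(f"no properties defined for environment: {env}")
    return all_props[env]


# Hypothetical properties; keys and paths are placeholders, not course values.
sample = {
    "dev": {"input.base.dir": "data/retail_db", "output.base.dir": "data/output"},
    "prod": {"input.base.dir": "/public/retail_db", "output.base.dir": "/user/demo/output"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    props_path = os.path.join(tmp_dir, "application.json")
    with open(props_path, "w") as fp:
        json.dump(sample, fp)
    # The environment name would typically arrive as a run time argument
    dev_props = load_app_properties(props_path, "dev")
    prod_props = load_app_properties(props_path, "prod")

print(dev_props["input.base.dir"])
print(prod_props["output.base.dir"])
```

When the application runs in cluster mode, a local file like this would typically have to be shipped with the --files option of spark-submit or read from HDFS, because the driver no longer runs on the machine that submitted the application.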
Who this course is for
Any IT aspirant/professional willing to learn Data Engineering using Apache Spark
Python Developers who want to learn Spark to add the key skill to be a Data Engineer
Scala based Data Engineers who would like to learn Spark using Python as Programming Language