Summary

We provide a comprehensive overview of big data processing using Apache Spark and Hadoop, along with some important tips. We highlight the significance of big data, its continuous and fast-paced generation, and its applications in various domains such as social media advertising, IoT systems, and personal assistants. The advantages of parallel processing in reducing infrastructure costs are discussed, along with the essential tools required for acquiring, managing, and deriving insights from big data. The introduction to Hadoop covers the ingestion stage, while Apache Spark is introduced as a framework for distributed data processing and iterative analysis. The components and functionalities of Spark, such as data storage, the compute interface, and cluster management, are explained. Finally, the content explores DataFrames, SparkSQL, and the various operations and formats Spark supports, including Hive, Parquet, and JSON files.

About Big Data

  • Big Data describes data generated in huge volumes that can be structured, semi-structured, or unstructured.
  • Big Data arrives continuously at enormous speed from multiple sources.
  • Social media advertising is driven by Big Data and complex, specialized algorithms that help organizations target their advertising spend to clients most likely to buy.
  • IoT systems make use of Big Data and its associated algorithms.
  • Personal assistants such as Google Assistant, Alexa, and Siri make use of Big Data and its associated algorithms.
  • Reducing infrastructure costs is a quantifiable advantage of parallel processing.
  • Data technologies, analytics and visualization, business intelligence, cloud service providers, NoSQL databases, and programming languages and their tools are the categories of tools required for acquiring, housing, managing, and benefiting from Big Data.

Introduction to Hadoop

  • Data ingestion is the first stage of Big Data processing in the Hadoop ecosystem.
  • The maximum data size Hive can handle is petabytes.
  • An HBase Region has two components: HFile and MemStore.

Introduction to Apache Spark

  • Apache Spark uses distributed data processing and iterative analysis.
  • Parallel and distributed computing are similar, but they have some key differences.
  • Parallel computing utilizes shared memory.
  • Distributed computing utilizes each processor's own memory.
  • Functional programming follows the mathematical function format, like the f(x) notation from algebra.
  • Functional programming emphasizes “what” to compute instead of “how to” compute it, as illustrated in the first sketch after this list.
  • The three Apache Spark components are: data storage, compute interface, and cluster management framework.
  • Data from a Hadoop file system flows into the compute interface or API, which then distributes it to different nodes to perform distributed/parallel tasks.
  • DataFrames are conceptually equivalent to a data frame in R/Python.
  • There are three ways to create an RDD in Spark: parallelizing an existing collection, referencing an external dataset, or applying a transformation to an existing RDD (see the RDD sketch after this list).
  • Creating an RDD from an external or local file involves a Hadoop-supported file system such as HDFS, Cassandra, or HBase.
  • A Spark application has two kinds of processes: the driver process and the executor processes.
  • The driver process can either be run on a cluster node or on another machine as a client to the cluster.
  • The Jobs, Stages, Storage, Environment, Executors, and SQL tabs are all available within the Spark Application UI.
  • The collect() action triggers the job creation and schedules the tasks, as the previous operations are all lazily computed.
  • The Cluster Manager communicates with a cluster to acquire resources for an application.
  • Spark Standalone is a type of Cluster Manager that comes with Spark and is best for setting up a simple cluster.
  • An advantage of using Spark on IBM Cloud is that running Spark in the cloud streamlines deployment with pre-existing default configurations.
  • IBM Cloud also provides enterprise-grade security.
  • Spark properties have precedence and are merged into a final configuration before running the application.
  • The precedence order, from highest to lowest, is: configurations set programmatically, then configurations passed to spark-submit, and lastly configurations set in the spark-defaults.conf file (see the configuration sketch after this list).
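
A minimal sketch of the “what” versus “how-to” distinction, in plain Python (the list of numbers is purely illustrative):

    numbers = [1, 2, 3, 4, 5]

    # Imperative style: spells out *how* to build the result, step by step.
    squares_imperative = []
    for n in numbers:
        squares_imperative.append(n * n)

    # Functional style: declares *what* the result is, as a mathematical
    # mapping f(x) = x * x applied over the collection.
    squares_functional = list(map(lambda x: x * x, numbers))

    assert squares_imperative == squares_functional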
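
A minimal PySpark sketch of the three ways to create an RDD and of lazy evaluation; it assumes a local Spark installation, and the HDFS path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
    sc = spark.sparkContext

    # 1. Parallelize an existing collection in the driver program.
    rdd_collection = sc.parallelize([1, 2, 3, 4, 5])

    # 2. Reference an external dataset on a Hadoop-supported file system
    #    (the path below is hypothetical).
    rdd_file = sc.textFile("hdfs:///tmp/input.txt")

    # 3. Apply a transformation to an existing RDD.
    rdd_doubled = rdd_collection.map(lambda x: x * 2)     # lazy: no job yet
    rdd_even = rdd_doubled.filter(lambda x: x % 4 == 0)   # still lazy

    # collect() is an action: only now is a job created and its tasks
    # scheduled, which you can inspect in the Jobs and Stages tabs of the
    # Spark application UI.
    print(rdd_even.collect())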
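
A minimal sketch of how the configuration precedence plays out; the property values and paths are illustrative only:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Lowest precedence: conf/spark-defaults.conf on the cluster, e.g.
    #   spark.executor.memory  1g
    # Middle precedence: flags passed to spark-submit, e.g.
    #   spark-submit --conf spark.executor.memory=2g app.py
    # Highest precedence: properties set programmatically on SparkConf.
    conf = SparkConf().set("spark.executor.memory", "4g")
    spark = (SparkSession.builder
             .config(conf=conf)
             .appName("config-sketch")
             .getOrCreate())

    # The merged, final configuration the application actually runs with:
    print(spark.sparkContext.getConf().get("spark.executor.memory"))  # 4g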

Introduction to DataFrames & SparkSQL

  • A Directed Acyclic Graph (DAG) is a data structure with vertices and edges.
  • In Apache Spark, RDDs are represented by the vertices of a DAG, while the transformations and actions are represented by directed edges (see the lineage sketch after this list).
  • In Scala, you can apply the toDS() function to a sequence to create a Dataset.
  • If a DataFrame is not cached, then different random feature values would be generated with each action on the DataFrame, because the function rand() is called each time (see the caching sketch after this list).
  • Tungsten places intermediate data in CPU registers.
  • Tungsten manages memory explicitly and does not rely on the JVM object model or garbage collection.
  • Read, Analyze, Transform, Load, and Write is the order in which Spark performs Extract, Transform, and Load (ETL) operations (see the read/transform/write sketch after this list).
  • Spark SQL also supports reading and writing data stored in Hive.
  • Spark SQL supports reading and writing data from Parquet files, and Spark SQL preserves the data schema.
  • Spark SQL can load and write to JSON files by inferring the schema.
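
A small PySpark sketch of the lineage (DAG) Spark records before any job runs; toDebugString() prints the chain of RDDs and the transformations connecting them:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dag-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100))                   # vertex: source RDD
    mapped = rdd.map(lambda x: (x % 10, x))            # edge: map
    reduced = mapped.reduceByKey(lambda a, b: a + b)   # edge: reduceByKey

    # No action has been called, so no job has run; toDebugString() shows
    # the DAG that will execute once an action is triggered.
    print(reduced.toDebugString().decode("utf-8"))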
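
A minimal sketch of pinning rand() values with caching; the column name "feature" is illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import rand

    spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

    df = spark.range(5).withColumn("feature", rand())

    # rand() is non-deterministic: if the plan is re-evaluated, the random
    # values can be regenerated rather than reused.
    df.show()

    # cache() marks the DataFrame for persistence; the first action after
    # it materializes the rows, and later actions reuse those exact values.
    df.cache()
    df.count()   # action that materializes the cache
    df.show()    # now served from the cached data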
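
A minimal read/transform/write sketch covering the JSON and Parquet support above; the file paths and the "value" column are placeholders:

    from pyspark.sql import SparkSession

    # For Hive tables, build the session with .enableHiveSupport() instead.
    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Read: Spark SQL infers the schema of JSON files automatically.
    raw = spark.read.json("/tmp/input/events.json")

    # Analyze / Transform: register a temp view and query it with SQL.
    raw.createOrReplaceTempView("events")
    transformed = spark.sql("SELECT * FROM events WHERE value > 0")

    # Load / Write: writing Parquet preserves the schema of the data.
    transformed.write.mode("overwrite").parquet("/tmp/output/events.parquet")

    # Reading the Parquet files back restores the same schema.
    spark.read.parquet("/tmp/output/events.parquet").printSchema()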