This tutorial describes a real-time analytics framework that uses Spark Streaming and window functions on Kinesis, the AWS real-time streaming service.

Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more.

High-Level Design:

Events are sent continuously to Kinesis. Spark subscribes to Kinesis, reads the events in real time, performs aggregations on the fly, and saves the results. This is quite a common scenario nowadays for a data engineer.
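To make "aggregation on the fly" a bit more concrete, here is a minimal sketch in plain Python of a tumbling (fixed, non-overlapping) window count. It is not Spark code and not the code from this tutorial; the event tuples, the 10-second window, and the function name are assumptions chosen only to illustrate the idea that Spark's window functions apply at scale.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) for fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Align the event time down to the start of the window it falls in.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical click events: (epoch seconds, page name).
events = [(0, "home"), (5, "home"), (9, "cart"), (12, "home"), (19, "cart")]

for (window_start, page), count in sorted(tumbling_window_counts(events, 10).items()):
    print(window_start, page, count)
```

Each event lands in exactly one window, so the counts for the window starting at 0 and the window starting at 10 are independent of each other.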

Steps

  1. Define Kinesis Data Stream
  2. Write Streaming data to Kinesis
  3. Create Spark Streaming Cluster
  4. Apply analytics on Spark streaming data using window functions
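As a rough illustration of step 2, the sketch below writes one JSON event to a stream through a client object that exposes `put_record`, which is the call the boto3 Kinesis client provides. The stream name, the event fields, and the `RecordingClient` stand-in are all assumptions so that the example runs offline; with real AWS credentials you would pass `boto3.client("kinesis")` instead.

```python
import json


def send_event(kinesis_client, stream_name, event, key_field="user_id"):
    """Serialize one event as JSON and write it to a Kinesis data stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records with the same partition key go to the same shard.
        PartitionKey=str(event[key_field]),
    )


class RecordingClient:
    """Offline stand-in for boto3.client("kinesis"); it only records calls."""

    def __init__(self):
        self.records = []

    def put_record(self, **kwargs):
        self.records.append(kwargs)
        # Simplified response; the real service returns more fields.
        return {"SequenceNumber": str(len(self.records))}


client = RecordingClient()
send_event(client, "clickstream-demo", {"user_id": 42, "page": "home"})
print(client.records[0]["PartitionKey"])
```

Keeping the client as a parameter makes the producer easy to exercise without touching AWS.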

Define Kinesis Data Stream

With just a few clicks we can easily create a data stream in AWS. …


This is a continuation of my series on tuning Spark joins. In this article, I would like to demonstrate every Spark data engineer’s nightmare, ‘shuffling’, along with tuning tips, and then Spark’s underrated AQE (Adaptive Query Execution) and how it helps in tuning Spark joins. I always prefer to explain things with examples and code snippets, which make more of a difference to learning than theory alone. Let’s jump in …

Shuffling

During any multi-row operation such as joining, grouping, or aggregating, the nodes in the Spark cluster must exchange data so that each node gets the piece of data it needs.

Shuffling is…


Parallelization is Spark’s bread and butter. The backbone of the Spark architecture is that data is split into pieces (partitions) and each piece is allocated to an executor in the cluster, so multiple executors can work on different pieces of data in parallel.
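The idea can be pictured with a small plain-Python sketch. This is only an analogy, not Spark itself: worker threads stand in for executors, and the round-robin split, the partition count, and the doubling function are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Distribute rows round-robin into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def process_partition(partition):
    """A row-level operation (a map): every row is handled independently."""
    return [row * 2 for row in partition]


rows = list(range(10))
partitions = split_into_partitions(rows, num_partitions=3)

# Each worker thread stands in for an executor working on one partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_partition, partitions))

print(results)
```

Because the operation only looks at one row at a time, no partition needs data from any other partition.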

Single-row operations like mapping and filtering make Spark’s job easy, but for multi-row operations like joining and grouping, data must be shuffled before the actual operation. Shuffling is a very costly operation that hits all the resources in your cluster.

Shuffling: all the nodes and executors exchange data across the network and rearrange partitions in such a way that each node/executor receives the data for specific keys.
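Here is a minimal sketch of that rearrangement in plain Python, assuming hash partitioning on the key. It is a conceptual model rather than Spark's actual implementation; the sample records, the partition count, and the use of `zlib.crc32` as the hash are assumptions.

```python
import zlib


def target_partition(key, num_partitions):
    """Pick the output partition for a key with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(partitions, num_partitions):
    """Move (key, value) records so that each key ends up in exactly one partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            shuffled[target_partition(key, num_partitions)].append((key, value))
    return shuffled


# Before the shuffle, records for the same key are scattered across partitions.
before = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]
after = shuffle(before, num_partitions=2)

for index, partition in enumerate(after):
    print(index, partition)
```

In a real cluster every record that changes partition may also cross the network, which is where the cost comes from.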

In this article, I would like to discuss the most common…


Spark is a unified analytics engine for big data processing, with built-in modules for ETL, streaming, SQL, machine learning, and graph processing.

Why Spark?

Rather than having a zoo of products, one for each module, why can’t we just have a single core product to serve all modules? I like the picture below from Databricks, which demonstrates how Spark can replace a wide variety of tools for various streams and businesses.

Spark provides in-memory parallel processing across multiple nodes in a cluster. Parallel processing in Spark is controlled by a few factors.

  1. Hardware
  2. Logical/Application

Hardware:

One simple statement: ‘more nodes and more cores give greater parallelization’. …


We are in the big data era: organizations continuously produce data, and we run batch ETL processes and place the results in a decision-making system (e.g., a warehouse) in a nicely structured format.
If we want to obtain key insights and capitalize on opportunities as they occur, we can’t wait for batch processing to complete. Real-time processing enables us to process an event/transaction as it occurs.
Classic examples of real-time processing are ‘credit card fraud detection’ and ‘credit loan decision making’.

Kinesis

AWS, GCP, and Azure all offer real-time message broker systems to publish and subscribe to events in real time. We can consider Kinesis as an AWS-managed alternative…


Writing ML algorithms is a tedious job that requires you to know many things, including strong programming skills in Python, R, Ruby, etc.
But what if we could implement complex ML algorithms using everyone's favourite, simple and easy SQL statements? Some of the new-age cloud data warehousing applications offer ML through SQL.
BigQuery is certainly at the top of this list.

BigQuery Machine Learning (BQML) enables users to create and execute machine learning models in BigQuery using SQL queries. The benefits of BQML are:

  1. Train and deploy ML models with simple SQL statements
  2. No need to move data from BigQuery
  3. Automate ML tasks
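To give a feel for the first point, the sketch below builds a BQML `CREATE MODEL` statement as a string in Python. The dataset, table, and column names are hypothetical, and this is only one possible shape of such a statement; to run it you would submit the string to BigQuery, for example from the console or a client library.

```python
def build_create_model_sql(model_name, source_table, label_column,
                           model_type="logistic_reg"):
    """Return a BigQuery ML CREATE MODEL statement as a string."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}'])\n"
        f"AS SELECT * FROM `{source_table}`"
    )


sql = build_create_model_sql(
    model_name="my_dataset.churn_model",   # hypothetical model name
    source_table="my_dataset.customers",   # hypothetical training table
    label_column="churned",                # hypothetical label column
)
print(sql)
```

The training data never leaves BigQuery: the `SELECT` inside the statement is what feeds the model.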

BigQuery…

Sivaprasad Mandapati

Data Engineer, Big Data and Machine Learning Developer
