In the big-data era, organizations continuously produce data. Traditionally we run batch ETL jobs and load the results into a decision-making system (e.g., a data warehouse) in a nicely structured format.
If we want to obtain key insights and capitalize on opportunities as they occur, we can't wait for batch processing to complete. Real-time processing enables us to process each event or transaction as it occurs.
Classic examples of real-time processing are credit card fraud detection and credit loan decision making.
AWS, GCP, and Azure all offer managed message broker systems for publishing and subscribing to events in real time. We can consider Kinesis the AWS-managed alternative to Kafka. It lets you ingest high volumes of data (IoT events, system logs, clickstreams) per second and scales automatically based on demand.
Managing a Kafka cluster yourself is a complex task. With Kinesis, AWS manages the infrastructure for us, and we simply develop client applications to ingest data into and read data from the stream.
DynamoDB
DynamoDB is a fully managed, highly available NoSQL distributed database. It scales to massive workloads, handling millions of requests per second, trillions of rows, and petabytes of storage, with fast, consistent, very-low-latency retrieval. It is widely used in the big data world for use cases such as
- E-commerce shopping carts
In this post I'd like to demonstrate a real-time processing framework using the AWS big data toolkit: Kinesis, Lambda, and DynamoDB.
An AWS EC2 server produces transaction logs containing customer-order transactions and ingests them into a Kinesis data stream. On the other side, whenever data arrives in the Kinesis data stream, a Lambda function picks it up, pre-processes/transforms it, and writes it into a DynamoDB table.
- Create Kinesis Stream
- Create DynamoDB Table
- Ingest Data into Kinesis Data Stream
- Process data stream records using Lambda
- Insert processed records into DynamoDB table
Create Kinesis Data Stream
Let's log on to the EC2 server and create a data stream using the AWS CLI.
Let's verify in the AWS console that the stream was created.
The stream is ready. Now, whenever an event occurs (say, a customer makes a transaction), the EC2 server captures it and writes it to the Kinesis data stream.
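A minimal sketch of the CLI call, assuming a stream named `customer-orders` with a single shard (both are assumptions for this walkthrough, not fixed by the setup):

```shell
# Create a Kinesis data stream with one shard
# (stream name "customer-orders" is an assumed example)
aws kinesis create-stream \
  --stream-name customer-orders \
  --shard-count 1

# Block until the stream becomes ACTIVE, then print a summary
aws kinesis wait stream-exists --stream-name customer-orders
aws kinesis describe-stream-summary --stream-name customer-orders
```

One shard is enough for a demo; production streams size shard count (or use on-demand mode) based on expected throughput.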
Let's assume the following customer-orders transaction format.
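For example, a record could look like the sketch below. The field names (`CustomerID`, `OrderID`, etc.) are illustrative assumptions, not a fixed schema; Kinesis payloads are raw bytes, so the producer serializes each record as JSON before sending it.

```python
import json

# A hypothetical customer-order transaction; field names are illustrative
transaction = {
    "CustomerID": "CUST1001",          # partition key candidate
    "OrderID": "ORD-2001",             # sort key candidate
    "ProductName": "Echo Dot",
    "Quantity": 2,
    "Price": 49.98,
    "OrderDate": "2021-05-01T10:15:00Z",
}

# Serialize to bytes for the Kinesis put-record call
payload = json.dumps(transaction).encode("utf-8")
print(payload)
```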
Create DynamoDB table
A DynamoDB table must have a primary key, which can be defined in two ways:
Primary Key = Partition Key
Primary Key = Partition Key + Sort Key
Partition Key: a unique value. Always choose a unique / high-cardinality column as the partition key (e.g., UserId, DeviceId, CustomerID).
Sort Key: if we want to order items within a partition key, we define a sort key (e.g., a customer can make many order transactions, so OrderID can be defined as the sort key; all of a customer's transactions are then grouped together and sorted by OrderID).
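Following that key design, the table can be created from the same EC2 server with the AWS CLI. A sketch, assuming the table is named `customer-orders` with `CustomerID` as partition key and `OrderID` as sort key (all assumed names):

```shell
# Create a DynamoDB table keyed by CustomerID (partition) + OrderID (sort);
# table and attribute names are assumptions for this walkthrough
aws dynamodb create-table \
  --table-name customer-orders \
  --attribute-definitions \
      AttributeName=CustomerID,AttributeType=S \
      AttributeName=OrderID,AttributeType=S \
  --key-schema \
      AttributeName=CustomerID,KeyType=HASH \
      AttributeName=OrderID,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST
```

`PAY_PER_REQUEST` avoids having to estimate read/write capacity up front, which suits a demo with unpredictable traffic.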
Define Lambda Function
Every new event in the Kinesis data stream triggers the AWS Lambda function. We can write any custom code that suits our requirement; in this case, it's a basic ETL job that loads Kinesis data stream records into the DynamoDB table.
The below code snippet is written in Python. You can choose your favourite from Node.js, Python, Ruby… Mine is always Python :)
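A minimal sketch of such a handler is shown below. The table name `customer-orders` and the record fields are assumptions carried over from earlier steps. Two details matter: Kinesis delivers each record's payload base64-encoded, so it must be decoded first, and DynamoDB rejects Python floats, so numeric fields are parsed as `Decimal`.

```python
import base64
import json
from decimal import Decimal

TABLE_NAME = "customer-orders"  # assumed table name

def parse_record(record):
    """Decode one Kinesis event record: payloads arrive base64-encoded JSON."""
    payload = base64.b64decode(record["kinesis"]["data"])
    # DynamoDB does not accept floats, so parse numbers like Price as Decimal
    return json.loads(payload, parse_float=Decimal)

def lambda_handler(event, context):
    """Triggered by the Kinesis event source mapping; writes items to DynamoDB."""
    import boto3  # available in the Lambda runtime by default
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    for record in event["Records"]:
        table.put_item(Item=parse_record(record))
    return {"processed": len(event["Records"])}
```

The Kinesis trigger batches records, so the handler loops over `event["Records"]` rather than assuming a single item per invocation.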
Ingest Data into Kinesis
There are many ways to ingest data into a Kinesis data stream: the Kinesis Producer Library (KPL), the Kinesis Agent, or third-party applications like Apache Spark.
Here, I am ingesting records using the AWS CLI.
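A sketch of a single put, reusing the assumed stream name and record fields from earlier (the partition key determines which shard the record lands on, so the customer ID is a natural choice):

```shell
# Put one record onto the stream; stream name, partition key, and
# payload fields are assumptions for this walkthrough
aws kinesis put-record \
  --stream-name customer-orders \
  --partition-key CUST1001 \
  --data '{"CustomerID":"CUST1001","OrderID":"ORD-2001","Quantity":2,"Price":49.98}' \
  --cli-binary-format raw-in-base64-out
```

With AWS CLI v2, `--cli-binary-format raw-in-base64-out` lets you pass the payload as plain text instead of pre-encoding it in base64 yourself.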
Verify DynamoDB table
The Kinesis data stream records were successfully pre-processed by the Lambda function and ingested into DynamoDB.
I hope this gives you an idea of how to build a real-time framework on AWS!