As my internship progresses, I have started to learn some new concepts such as Distributed Systems and Event Processing. These are heavily discussed and researched areas in modern computer science, so I think it is worth researching and writing about the whats and hows of these areas.
Back in 2000, when we were little and our fathers connected to the internet over dial-up (I vaguely remember the modem making a huge noise while connecting over the telephone line), the amount of data transmitted across the internet was nothing compared to the vast volumes transmitted worldwide today. Back then, data requirements were simpler because few consumer devices could access that data in the first place; most of us did not even have a dial-up connection. With the passage of time, however, plenty of consumer devices appeared, starting with mobile phones, and the arrival of WiFi and home routers ushered in the era of Big Data. Servers no longer handle the small data batches they did in the 2000s: bare-metal servers have been replaced by containers, with Kubernetes present for orchestration, and centralized services have been modularized into services running in different containers (the Microservice Architecture). In fact, distribution transparency has become so complete that (with EC2 services, for example) nobody knows where your code runs, and with serverless technologies even the developers do not know whether their logic is running at a given moment (serverless platforms run your logic on demand).
As mentioned above, we have witnessed an expeditious change of technologies in terms of both software and hardware. The ever-growing smartphone market suggests that processing data purely in batches is no longer sufficient. We'll get back to this later.
These rises motivated other data-driven consumer products such as Google Assistant and similar offerings. Smart fridges and other smart devices also came into play, and they do not sit well with the overheads of batch processing systems, i.e. systems that store data for a while and then process it as a batch. But is that still the case now?
Think of an IoT device, let's say a temperature sensor. There is a concept called BAN (Body Area Network), a network of sensors that measures the biological metrics of a human being. These sensor outputs are streamed to a central source, and that source is supposed to perform some sort of real-time analytics on the sensor data it receives. It is worth mentioning that, thanks to advances in technology, part of the aggregation can be done within the sensors themselves, given that the sensors run some kind of an OS. Now think of a company that provides this BAN as a service. If the scale of the business is large, there would be millions of data points flowing towards the controller to be processed. Therefore, an effective solution is needed to process this data with minimum latency. Such processing systems are called IFPS: Information Flow Processing Systems.
More on IFPS…
I'll leave here some of the facts I gathered from a research paper (well, I can't remember which paper it was, so I cannot hyperlink it here).
There are two types of IFPS:
- Data Streaming Systems
- Complex Event Processing Systems
Data Streaming Systems
Think of a traditional DBMS. What we do there is basically store and retrieve information using some sort of query language. There are client APIs for different programming languages that we can use to interact with the state of the database, but at its core, a database is just a store plus an engine to manage the content, along with a query language to interact with that content.
Is a conventional DBMS applicable in a real-time data processing context, one might wonder? Traditional DBMSs originated as an alternative to early file storage systems, where each piece of information (a record or document) was stored inside a file itself. File storage has plenty of drawbacks, including difficulty updating records, concurrency, and security, and hence began the era of databases. But there is a significant aspect of a DBMS that we must attend to: querying. Each query has to be executed explicitly through some means of user intervention (or some sort of application-context intervention). This directly opposes our real-time data processing paradigm, where queries need to be executed as the data arrives, with minimum latency. Thus, the concept of the DBMS needed to be modified, and it was.
The modified principle is now called a DSMS: a Data Stream Management System. Here, the query you specify is executed continuously until you uninstall the query itself. A good example is TelegraphCQ.
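To make the contrast with a one-shot SQL SELECT concrete, here is a minimal sketch of a continuous query written in SiddhiQL syntax (a stream processing language that comes up later in this post); the stream names and the threshold are hypothetical. The query is installed once and then fires for every event that arrives, rather than being re-run by a user:

```sql
-- Hypothetical stream of temperature readings
define stream TempStream (deviceID long, temp double);

-- Installed once; evaluated continuously as each event arrives
from TempStream[temp > 30.0]
select deviceID, temp
insert into HighTempStream;
```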
However good DSMSs were, their core principle differed from the emerging data stream paradigms. More and more data began to be treated as events rather than plain data. Events are data items produced by some occurrence: sensor readings such as fire detection, log lines printed by application servers, click streams, key-press streams, and so on. Click streams and key-press streams are a primary way of doing extensive user-engagement analysis at sites like Netflix and Google. All of these data are visualized as events.
Thus, the principle of Event Processing came into play.
Complex Event Processing
The job of a DSMS is to answer queries (i.e. produce outputs), but it cannot recognize patterns or orderings in the input data stream. Pattern recognition is vital for stream processing applications like fraud detection, where a deviation from an expected pattern can signal a critical fraud.
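As a sketch of what pattern detection looks like, here is a SiddhiQL pattern query in the style of the Siddhi documentation; the stream and attribute names are hypothetical. It matches a pair of readings from the same room where the temperature rises by 5 degrees or more within 10 minutes, something a plain DSMS filter cannot express:

```sql
define stream TempStream (roomNo int, temp double);

-- Match a temperature rise of >= 5 degrees in the same room within 10 minutes
from every e1=TempStream
    -> e2=TempStream[e1.roomNo == roomNo and e1.temp + 5 <= temp]
    within 10 min
select e1.roomNo, e1.temp as initialTemp, e2.temp as finalTemp
insert into AlertStream;
```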
CEP engines usually impose a schema on the data being processed. For example, in Siddhi you define a schema for the input stream in SiddhiQL; in Kafka, you do this in KSQL. Shown below is a complete SiddhiQL program for processing an input stream (taken from https://wso2.github.io/siddhi/documentation/siddhi-4.0/ but modified a little).
@app:name('Temperature-Analytics')

-- Input stream schema
define stream TempStream (deviceID long, roomNo int, temp double);

-- Output stream schema
define stream OutputStream (roomNo int, avgTemp double);

-- Processing query
from TempStream#window.time(5 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert into OutputStream;
Another important construct is the @sink annotation, which specifies the output sink that the processed data must go to. There are plenty of output sinks for Siddhi, such as Elasticsearch, Kafka, and HTTP. Likewise, the input stream can be fed in through multiple protocols, such as HTTP, via the @source annotation:
@source(type='http', receiver.url='http://0.0.0.0:8080/foo',
        basic.auth.enabled='true',
        @map(type='json'))
define stream InputStream (name string, age int, country string);
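A sink is declared with the mirror-image annotation. Here is a hedged sketch (the URL is hypothetical, and the parameter names follow the siddhi-io-http extension as I understand it):

```sql
-- Push query results to a (hypothetical) downstream HTTP endpoint as JSON
@sink(type='http', publisher.url='http://localhost:9090/results',
      @map(type='json'))
define stream OutputStream (roomNo int, avgTemp double);
```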
A quite useful input source is Kafka. In that setup, from Kafka's point of view, there could be plenty of Kafka producers, such as application servers (in a real-time log processing context), with Siddhi acting as a Kafka consumer. Elasticsearch can be attached as a sink to the Siddhi platform, and Kibana can be used as a visualization layer over the Elasticsearch sink. The main reason to put Kafka in front is fault tolerance: no logs are lost if the Siddhi processing falls behind, because Kafka buffers them.
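A hedged sketch of what such a Kafka source might look like in SiddhiQL (the topic, consumer group, and broker address are hypothetical, and the parameter names follow the siddhi-io-kafka extension as I understand it):

```sql
-- Consume JSON log events from a (hypothetical) Kafka topic
@source(type='kafka', topic.list='app-logs',
        group.id='siddhi-log-consumer',
        bootstrap.servers='localhost:9092',
        threading.option='single.thread',
        @map(type='json'))
define stream LogStream (host string, level string, message string);
```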
There are plenty of other tools as well; a good one is KSQL for the Kafka Streams API. There, you can use Kafka itself to process the stream, using the KSQL query language.
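As a rough sketch, the same averaging idea from the Siddhi example above could be expressed in KSQL roughly as follows (the stream, topic, and column names are hypothetical, and the syntax follows the KSQL documentation as I understand it):

```sql
-- Declare a stream over a (hypothetical) Kafka topic of JSON readings
CREATE STREAM temp_stream (room_no INT, temp DOUBLE)
  WITH (KAFKA_TOPIC='temps', VALUE_FORMAT='JSON');

-- Continuously maintain per-room averages over 5-minute windows
CREATE TABLE avg_temps AS
  SELECT room_no, AVG(temp) AS avg_temp
  FROM temp_stream
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY room_no;
```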
So what do we process here? Well, it differs according to the application context. If you're dealing with fraud detection, it might be pattern analysis. If it is log analysis, it might be some Grok filters. If it is healthcare analytics like BAN, it might be health metrics, and so on.
More on stream processing and event processing can be read in Srinath Perera's blog post. It is a nice one, and I recommend it to anyone who is interested.
As it turns out, deployment is said to be a huge bottleneck for stream processing applications. At WSO2 Con, one speaker said that a proper real-time processing application requires a minimum of a 5-node cluster. Another drawback could be SQL itself. Siddhi uses an SQL-like language because of its architectural style. To explain a little more (I hope I don't go beyond the scope here): Siddhi has input processors where each input is flattened into a tuple, a separate ID is given to each input, and the data are categorized inside the tuple itself. This tuple representation is what makes relational-algebra operations, as in SQL, possible. This was specifically stated in the Siddhi research paper written at the University of Moratuwa. I think this is the case in Kafka as well, but don't quote me on that. So the bottleneck of SQL would be writing complex processing logic, as most developers tend to stay away from raw SQL by using alternatives like Hibernate, Eloquent, Sequelize, and Bookshelf.js.
This is just an overview of what I have accumulated from what I have studied and from conference talks on YouTube. I have to say that WSO2 does a good job with their documentation (including the core architectures and design decisions) and with WSO2 Con, from which I have learned a lot.
I'm currently working on a log processing project (well, it is a solo research project) where Siddhi and Kafka are used along with Elasticsearch for indexing purposes. I hope to write more on it as the project goes on.