Multimodality, Fusion of Sensory Data, and Computing at the Edge


How many computers do you have with you as you read this article? At least one, surely; in most cases, more than one. A better phrasing of the question would be: how many computers do you have at your home? In 2018, it should definitely be more than two. Now think back twenty years, when things were much simpler. You had only one computer, and your whole family had to share it. If you come from a database background, you can relate the latter to a many-to-one relationship between humans and computers. Nowadays, however, it is a one-to-many relationship that exists between the aforementioned two categories. This was predicted by Mark Weiser in his famous article, The Computer for the 21st Century, in Scientific American. If you are a computer science major interested in how technology propagates, or simply curious about the rapid evolution of the world, [1] is a recommended read. The most interesting fact about this article is that it is almost 28 years old!

The shift between those two relationships paved the way for an entirely new discipline called Ubiquitous Computing. The term ubiquitous, as Merriam-Webster states, means widespread. If you have ever been to a Google I/O conference, or at least watched Google's I/O playlist, you might have encountered this term. As a software engineer, you need to be keenly aware of this concept. All software that you engineer is consumed by the general public (except in cases where you expose APIs and thus provide services to other developers), and as Alan Dix puts it in [2], you have to assume the general user knows nothing about your system. Therefore, you need to think about all the platforms your software has to run on: laptops, tablets, mobile phones, smartwatches, and so on. If your software cannot run on mobile platforms, you are likely to lose revenue. Have you ever wondered why Microsoft developed Office 365 for mobile phones? And why do you think Microsoft is shifting its Office packages onto the web? (WebAssembly puts on quite a show there.)

This ubiquity of computational devices has increased the complexity of integrated circuits as well. With all the sensors incorporated into these devices, think of the mammoth task of processing the data they acquire! Harari argues in [3] that the upcoming religion, displacing the legacy religions of the world, is Dataism, i.e. the yearning for data. The COCO dataset [4] includes 330k images spanning 80 object categories, making data available for the public to consume. Government data [5] are increasingly demanded to be open to the public. 90% of the data that exists today was created within the last two years [6]! This exponential increase in data, driven by the increase in devices, has led engineers to figure out ways to process data efficiently.

In the early 90s, when data volumes were much smaller, the prevailing model was dumb clients and intelligent servers. But as the number of devices grew, a centralized server could no longer handle all the data coming from edge devices. Thus, a novel concept called Edge Computing was introduced. The idea is to push processing towards the edge of the network, reducing both the workload on the centralized server and the bandwidth demands on the network. However hyped this might be, [7] states that the processing power of edge devices is still constrained by fundamentals such as battery capacity, heat dissipation, and memory. Therefore, [7] introduces a concept called Cloudlets: essentially a mini data center to which edge devices can offload their processing within a single network hop. This introduces additional complexities such as VM (Virtual Machine) synthesis and VM migration, but it is considered a good architecture until edge devices catch up in processing power.
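The offload-or-not trade-off can be sketched in a few lines. This is a toy model under stated assumptions, not an API from [7]: the `Task` and `DeviceState` classes, the thresholds, and the throughput figures are all hypothetical illustrations of weighing local compute time against one-hop cloudlet time.

```python
# Hypothetical sketch: should an edge device process a task locally, or
# offload it to a nearby cloudlet? The classes, thresholds, and rates below
# are illustrative assumptions, not a real framework's API.

from dataclasses import dataclass

@dataclass
class Task:
    cpu_cycles: float      # estimated compute cost of the task
    payload_bytes: int     # data that must be shipped if we offload

@dataclass
class DeviceState:
    battery_pct: float     # remaining battery, 0-100
    free_mem_bytes: int    # available memory on the device

def should_offload(task: Task, device: DeviceState,
                   uplink_bps: float = 10e6,          # assumed uplink speed
                   local_cycles_per_sec: float = 1e9, # assumed device CPU
                   cloudlet_cycles_per_sec: float = 20e9) -> bool:
    """Offload when the device is constrained (battery, memory) or when
    the cloudlet finishes sooner despite the one-hop transfer cost."""
    if device.battery_pct < 20 or task.cpu_cycles > device.free_mem_bytes * 100:
        return True  # constrained device: prefer the cloudlet
    local_time = task.cpu_cycles / local_cycles_per_sec
    offload_time = (task.payload_bytes * 8 / uplink_bps
                    + task.cpu_cycles / cloudlet_cycles_per_sec)
    return offload_time < local_time

# A compute-heavy task with a small payload: the 20x-faster cloudlet wins.
heavy = Task(cpu_cycles=5e9, payload_bytes=100_000)
print(should_offload(heavy, DeviceState(battery_pct=80, free_mem_bytes=10**9)))
```

The interesting property is that the decision flips for data-heavy, compute-light tasks, where shipping the payload costs more than crunching it locally; that is exactly the bandwidth pressure edge computing tries to relieve.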

So what do we mean by processing? Apart from pre-processing data for consumption, a good example is Fusion. Fusion is one of the most actively researched areas nowadays, thanks to the advancements in edge devices’ sensory peripherals. Take Audio-Visual Speech Recognition, as described in [8]. The accents of some people make it a little difficult to recognize what they are uttering. So, knowingly or unknowingly, we watch their lip movements to understand what they are saying. This is a textbook case of real-world, natural, multimodal sensor fusion. The brain, below our conscious awareness, fuses the auditory signal (one modality; a modality is a data acquisition framework for sensors) with the visual signal (the other modality). To quote [8],

Decision Fusion has been shown to be effective in which the final decision is made by fusing the statistically independent decisions from different modalities with the emphasis on uncorrelated characteristics between different modalities.
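A minimal sketch of what the quote describes: each modality produces its own class posterior independently, and the final decision fuses them. Treating the modalities as statistically independent lets us multiply their posteriors (a naive-Bayes-style product rule) and renormalize. The probabilities below are made-up illustrative numbers, not outputs of any real recognizer.

```python
# Decision-level fusion sketch: multiply per-class posteriors from
# statistically independent modalities, then renormalize. The numbers
# are illustrative assumptions.

def fuse_decisions(posteriors_per_modality):
    """Product-rule fusion of independent per-modality class posteriors."""
    n_classes = len(posteriors_per_modality[0])
    fused = [1.0] * n_classes
    for posterior in posteriors_per_modality:
        for i, p in enumerate(posterior):
            fused[i] *= p
    total = sum(fused)
    return [f / total for f in fused]

# Audio alone slightly favors class 0; lip movement strongly favors
# class 1; the fused decision follows the stronger visual evidence.
audio  = [0.55, 0.45]
visual = [0.10, 0.90]
fused = fuse_decisions([audio, visual])
print(fused.index(max(fused)))  # → 1 (the visually favored class)
```

Note that each modality only ships a small posterior vector, not its raw signal, which is part of why decision fusion pairs well with the edge architecture above: the intelligence sits next to the sensor.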

The term Decision Fusion comes from the famous Dasarathy model introduced in [9], which groups the fusion of sensory data into three levels based on the input.

  1. Low-level fusion — sometimes called Direct Fusion; it fuses raw data.
  2. Intermediate-level fusion — sometimes called Feature Fusion; it fuses features extracted by each sensor.
  3. High-level fusion — sometimes called Decision Fusion; it fuses decisions made at each sensor, which requires placing an intelligent decision-making algorithm near each sensor.
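The three levels above can be contrasted with a toy sketch. The combiners here (averaging, concatenation, majority vote) are just the simplest representatives of each level, chosen for illustration; real systems use far more sophisticated versions of each.

```python
# Illustrative sketch of the three fusion levels; data and combiners are
# toy assumptions for demonstration only.

from collections import Counter

# Low-level (direct) fusion: combine raw samples from redundant sensors,
# e.g. by averaging two noisy readings of the same quantity.
def low_level_fuse(raw_a, raw_b):
    return [(a + b) / 2 for a, b in zip(raw_a, raw_b)]

# Intermediate-level (feature) fusion: extract a feature vector per
# sensor, then concatenate the vectors for one downstream model.
def feature_fuse(features_per_sensor):
    fused = []
    for features in features_per_sensor:
        fused.extend(features)
    return fused

# High-level (decision) fusion: each sensor's local algorithm emits a
# label; fuse by majority vote.
def decision_fuse(labels):
    return Counter(labels).most_common(1)[0][0]

print(low_level_fuse([1.0, 2.0], [3.0, 4.0]))   # → [2.0, 3.0]
print(feature_fuse([[0.1, 0.2], [9.0]]))        # → [0.1, 0.2, 9.0]
print(decision_fuse(["walk", "run", "walk"]))   # → walk
```

The trade-off moves in one direction: the lower the level, the more information is preserved but the more bandwidth and central compute are needed; the higher the level, the less is transmitted but the more intelligence each sensor must carry.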

However, this was later expanded into a five-level model because of ambiguities in the input and output natures of sensory data. Sensor fusion is demonstrated nicely by the McGurk effect [10], which shows how two different sounds, “vha” and “bha”, are distinguished by the brain by means of fusion.

However interesting fusion seems, it is still very much a research area in academia. A famous example people discuss when it comes to sensor fusion is the Boeing F/A-18 Super Hornet [11]. But most of those techniques are presumably confidential, and a proper framework for fusing multiple modalities rarely exists. Plenty of models exist for fusing sensory data, but implementation details are rarely found. Most of these, such as the JDL (Joint Directors of Laboratories) model, are functional models that explain the functionality of each stage. Some models, such as the Omnibus model, provide a procedural idea of how fusion should be done. But both lack implementation details, as mentioned above.

Implementing a fusion framework for three modalities, namely visual, auditory, and gestural, has become my final-year project at university.


[1] M. Weiser, “The Computer for the 21st Century,” Scientific American, vol. 265, no. 3, pp. 94–105, Sep-1991.

[2] A. Dix, J. Finlay, G. D. Abowd, and R. Beale, Human-Computer Interaction, 3rd ed. Pearson Education, 2004.

[3] Y. N. Harari, Homo Deus. Harvill Secker, 2016.

[4] “COCO — Common Objects in Context,” COCO — Common Objects in Context. [Online]. Available: [Accessed: 23-Dec-2018].

[5] “The home of the U.S. Government’s open data,” [Online]. Available: [Accessed: 23-Dec-2018].

[6] B. Marr, “How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read,” Forbes, 09-Jul-2018. [Online]. Available: [Accessed: 23-Dec-2018].

[7] Y. Ai and K. Zhang, “Edge computing technologies for Internet of Things,” Digital Communications and Networks, vol. 4, no. 2, pp. 77–86, Apr. 2018.

[8] A. Torfi, S. M. Iranmanesh, N. Nasrabadi, and J. Dawson, “3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition,” IEEE Access, vol. 5, pp. 22081–22091, Sep. 2017.

[9] B. V. Dasarathy, “More the merrier … or is it? — sensor suite augmentation benefits assessment,” in Proceedings of the 3rd International Conference on Information Fusion, vol. 2, Paris, France, Jul. 2000, pp. 20–25.

[10] BBC, “Try The McGurk Effect! — Horizon: Is Seeing Believing? — BBC Two,” YouTube, 10-Nov-2010. [Online]. Available: [Accessed: 23-Dec-2018].

[11] U. S. M. Action, “Boeing F/A-18E/F Super Hornet demonstrates sensor fusion,” YouTube, 27-May-2018. [Online]. Available: [Accessed: 23-Dec-2018].


