GNOSIS employs Multimodal Event Knowledge Graphs (MEKGs) to represent the semantic, spatial, and temporal content of a multimodal event stream. A more formal definition follows:
For any multimodal stream event, the resulting Multimodal Event Knowledge Graph is a labelled graph represented as MEKG = (E, R, Ep, G) where:
- E is a set of nodes Oi, each representing an entity in the domain
- R is a set of edge labels representing spatial, semantic, and temporal relation types
- Ep is a set of properties mapped to each entity node such that Oi = (id, attributes, label, confidence, features)
- G ⊆ Ep × R × Ep
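The definition above can be read as a small data structure. The following is a minimal sketch in Python, not the GNOSIS implementation: the class names `EntityNode` and `MEKG`, the method names, and the choice to key nodes by `id` are all assumptions made for illustration. Only the node fields (id, attributes, label, confidence, features) and the three relation types come from the definition.

```python
from dataclasses import dataclass, field
from typing import Any

# Relation types R, taken from the definition above.
RELATION_TYPES = frozenset({"spatial", "semantic", "temporal"})


@dataclass
class EntityNode:
    """One node Oi = (id, attributes, label, confidence, features)."""
    id: str
    label: str
    confidence: float
    attributes: dict[str, Any] = field(default_factory=dict)
    features: list[float] = field(default_factory=list)


@dataclass
class MEKG:
    """Hypothetical container for MEKG = (E, R, Ep, G)."""
    # E together with Ep: entity nodes and their properties, keyed by node id.
    nodes: dict[str, EntityNode] = field(default_factory=dict)
    # G: labelled edges stored as (source id, relation type, target id).
    edges: set[tuple[str, str, str]] = field(default_factory=set)

    def add_node(self, node: EntityNode) -> None:
        self.nodes[node.id] = node

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        # Reject labels outside R and endpoints that are not in the graph.
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((src, relation, dst))


graph = MEKG()
graph.add_node(EntityNode(id="v1", label="car", confidence=0.92))
graph.add_node(EntityNode(id="v2", label="person", confidence=0.81))
graph.add_edge("v1", "spatial", "v2")
```

Storing edges as id triples keeps G a plain set, so membership checks and unions of graphs stay simple; a graph library could be substituted without changing the shape of the data.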
Multimodal Event Knowledge Graph schema
MEKG provides a top-level schema for a shared representation across modalities. Because an MEKG is a knowledge graph, it can support modality-specific schemas designed to capture the features appropriate to the format of each modality.
The Video Event Knowledge Graph (VEKG) and the Audio Event Knowledge Graph (AEKG) are two such specialisations, for video and audio data respectively. The figure below shows how an MEKG can represent the content of a video stream and an audio stream at different time instances. The content of the video stream is represented using a VEKG, while the content of the audio stream is represented using an AEKG. These single-modality graphs are then merged into an MEKG that represents both modalities together. The entity nodes in an MEKG are connected by spatial, semantic, and temporal edges. Semantic relation edges are created by identifying the same entity across modalities (i.e., entity linking), while temporal relation edges are created by identifying the same entity at different times using appropriate techniques (e.g., object tracking). The figure shows an MEKG construction example over a video and audio stream at four time points, with the construction process applying intra-modal relations (t2), inter-modal relations (t3), and temporal relations over time (t1-t4). The resulting MEKG can be used to gain insight into the different facets (modalities) of an event.
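The merge step described above might be sketched as follows. This is an illustrative toy, not the GNOSIS pipeline: the function name `build_mekg` and the input layout are hypothetical, entity linking is replaced by a simple label match, and object tracking is replaced by the assumption that an entity keeps the same key from one time point to the next. Intra-modal spatial relations are omitted for brevity.

```python
def build_mekg(video_frames, audio_frames):
    """Merge per-time video and audio entities into one graph.

    Each argument maps a time point to a list of (entity_key, label) pairs.
    Returns (nodes, edges): nodes maps (modality, time, key) to a label,
    and edges is a set of (source, relation, target) triples.
    """
    streams = (("video", video_frames), ("audio", audio_frames))
    nodes = {}
    edges = set()

    # Copy every single-modality entity into the merged graph.
    for modality, frames in streams:
        for t, entities in frames.items():
            for key, label in entities:
                nodes[(modality, t, key)] = label

    times = sorted(set(video_frames) | set(audio_frames))

    # Inter-modal semantic edges: a label match at the same time point
    # stands in for a real entity-linking step (assumption).
    for t in times:
        for vkey, vlabel in video_frames.get(t, []):
            for akey, alabel in audio_frames.get(t, []):
                if vlabel == alabel:
                    edges.add((("video", t, vkey), "semantic", ("audio", t, akey)))

    # Temporal edges: an unchanged key across consecutive time points
    # stands in for a real tracker (assumption).
    for modality, frames in streams:
        for t_prev, t_next in zip(times, times[1:]):
            prev_keys = {key for key, _ in frames.get(t_prev, [])}
            for key, _ in frames.get(t_next, []):
                if key in prev_keys:
                    edges.add(((modality, t_prev, key), "temporal", (modality, t_next, key)))

    return nodes, edges


video = {1: [("car1", "car")], 2: [("car1", "car")]}
audio = {2: [("snd1", "car")]}
nodes, edges = build_mekg(video, audio)
```

In this toy input the car seen at t1 and t2 is joined by a temporal edge, and the car sound heard at t2 is joined to the visible car by a semantic edge, mirroring the temporal and inter-modal relations described in the text.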