.. Links

.. _GstVideoRegionOfInterestMeta: https://gstreamer.freedesktop.org/documentation/video/gstvideometa.html?gi-language=c#GstVideoRegionOfInterestMeta
.. _GstGVATensorMeta: https://github.com/dlstreamer/dlstreamer/blob/master/include/dlstreamer/gst/metadata/gva_tensor_meta.h
.. _GstGVAJSONMeta: https://github.com/dlstreamer/dlstreamer/blob/master/include/dlstreamer/gst/metadata/gva_json_meta.h

Metadata
========

The inference plugins use the standard GStreamer metadata `GstVideoRegionOfInterestMeta`_ for object detection and classification use cases (elements **gvadetect** and **gvaclassify**) and define two custom metadata types:

- `GstGVATensorMeta`_ for the output of the **gvainference** element, which performs generic inference on any model with an image-compatible input layer and any format of output layer(s)
- `GstGVAJSONMeta`_ for the output of the **gvametaconvert** element, which converts **GstVideoRegionOfInterestMeta** into JSON format

The **gvadetect** element supports only object detection models and checks that the model output layer has a known format convertible into a list of bounding boxes. It creates and attaches to the output GstBuffer as many instances of GstVideoRegionOfInterestMeta as there are objects detected on the frame. The object bounding-box position and object label are stored directly in the GstVideoRegionOfInterestMeta fields ``x``, ``y``, ``w``, ``h``, ``roi_type``, while additional detection information such as confidence (in the range [0,1]), model name, and output layer name is stored as a GstStructure object added to the ``GList *params`` list of the same GstVideoRegionOfInterestMeta.

The **gvaclassify** element is typically inserted into the pipeline after gvadetect and runs inference on every object detected by gvadetect (i.e., as many times as there are GstVideoRegionOfInterestMeta attached to the input buffer), with the input cropped to the area specified by each GstVideoRegionOfInterestMeta. The inference output is converted into as many GstStructure objects as there are output layers in the model, and these are added to the ``GList *params`` list of the GstVideoRegionOfInterestMeta. Each GstStructure contains the full inference results, such as tensor data and dimensions, model and layer names, and a label in string format (if post-processing rules are specified).
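On the producer side, this layout can be reproduced with a few GStreamer calls. The snippet below is a minimal sketch of how a detection element attaches such metadata; the coordinates, labels, structure names, and field values are hypothetical placeholders, while the GStreamer functions themselves are standard API.

.. code:: C

    #include <gst/video/gstvideometa.h>

    /* Sketch: attach one detected object plus a classification attribute to a buffer,
       mirroring the layout produced by gvadetect and gvaclassify.
       All concrete values are hypothetical placeholders. */
    void attach_example_meta(GstBuffer *buffer) {
        /* Bounding box and label go directly into the meta fields (x, y, w, h, roi_type) */
        GstVideoRegionOfInterestMeta *roi_meta =
            gst_buffer_add_video_region_of_interest_meta(buffer, "face", 220, 140, 80, 96);

        /* Additional detection information is stored as a GstStructure in the params list */
        GstStructure *detection = gst_structure_new("detection",
                                                    "confidence", G_TYPE_DOUBLE, 0.92,
                                                    "model_name", G_TYPE_STRING, "face-detection-adas-0001",
                                                    NULL);
        gst_video_region_of_interest_meta_add_param(roi_meta, detection);

        /* A classification element appends one more GstStructure per model output layer */
        GstStructure *emotion = gst_structure_new("emotion",
                                                  "label", G_TYPE_STRING, "happy",
                                                  "confidence", G_TYPE_DOUBLE, 0.85,
                                                  NULL);
        gst_video_region_of_interest_meta_add_param(roi_meta, emotion);
    }

Note that ``gst_video_region_of_interest_meta_add_param`` takes ownership of the GstStructure, so the caller does not free it.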
The **gvainference** element generates and attaches to the frame the custom metadata **GstGVATensorMeta** (as many instances as there are output layers in the model), containing the raw tensor data and additional information such as tensor dimensions, data precision, etc.

Consider the following pipeline as an example (more examples can be found in the ``gst_launch`` samples folder):

.. code:: shell

    MODEL1=face-detection-adas-0001
    MODEL2=age-gender-recognition-retail-0013
    MODEL3=emotions-recognition-retail-0003

    gst-launch-1.0 --gst-plugin-path ${GST_PLUGIN_PATH} \
        filesrc location=${INPUT} ! decodebin ! video/x-raw ! videoconvert ! \
        gvadetect model=$(MODEL_PATH $MODEL1) ! queue ! \
        gvaclassify model=$(MODEL_PATH $MODEL2) model-proc=$(PROC_PATH $MODEL2) ! queue ! \
        gvaclassify model=$(MODEL_PATH $MODEL3) model-proc=$(PROC_PATH $MODEL3) ! queue ! \
        gvawatermark ! videoconvert ! fpsdisplaysink sync=false

If gvadetect detects three faces, it attaches three metadata objects, each containing one GstStructure with detection results. The first gvaclassify then adds two more GstStructure objects to each meta (its model has two output layers: age and gender), and the second gvaclassify adds one more GstStructure (emotion). The result is three metadata objects, each containing four GstStructure objects in the ``GList *params`` field: detection, age, gender, and emotions.

A C application can iterate over the objects and inference results using the GStreamer API, as in the code snippet below:

.. code:: C

    #include <stdio.h>
    #include <gst/video/gstvideometa.h>

    void print_meta(GstBuffer *buffer) {
        gpointer state = NULL;
        GstMeta *meta = NULL;
        while ((meta = gst_buffer_iterate_meta(buffer, &state)) != NULL) {
            if (meta->info->api != GST_VIDEO_REGION_OF_INTEREST_META_API_TYPE)
                continue;
            GstVideoRegionOfInterestMeta *roi_meta = (GstVideoRegionOfInterestMeta *)meta;
            printf("Object bounding box %d,%d,%d,%d\n", roi_meta->x, roi_meta->y, roi_meta->w, roi_meta->h);
            for (GList *l = roi_meta->params; l; l = g_list_next(l)) {
                GstStructure *structure = (GstStructure *)l->data;
                printf("  Attribute %s\n", gst_structure_get_name(structure));
                if (gst_structure_has_field(structure, "label")) {
                    printf("    label=%s\n", gst_structure_get_string(structure, "label"));
                }
                if (gst_structure_has_field(structure, "confidence")) {
                    double confidence;
                    gst_structure_get_double(structure, "confidence", &confidence);
                    printf("    confidence=%.2f\n", confidence);
                }
            }
        }
    }

A C++ application can access the metadata more simply using the C++ interface:

.. code:: C++

    #include "gst/videoanalytics/video_frame.h"

    #include <iostream>

    void PrintMeta(GstBuffer *buffer) {
        GVA::VideoFrame video_frame(buffer);
        for (GVA::RegionOfInterest &roi : video_frame.regions()) {
            auto rect = roi.rect();
            std::cout << "Object bounding box " << rect.x << "," << rect.y << ","
                      << rect.w << "," << rect.h << std::endl;
            for (GVA::Tensor &tensor : roi.tensors()) {
                std::cout << "  Attribute " << tensor.name() << std::endl;
                std::cout << "    label=" << tensor.label() << std::endl;
                std::cout << "    model=" << tensor.model_name() << std::endl;
            }
        }
    }

The following table summarizes the input and output of the various elements:

.. list-table::
    :header-rows: 1
    :widths: auto

    * - GStreamer element
      - Description
      - INPUT
      - OUTPUT
    * - gvainference
      - Generic inference
      - | GstBuffer
        | *or*
        | GstBuffer + GstVideoRegionOfInterestMeta
      - | INPUT + GstGVATensorMeta
        | *or*
        | INPUT + extended GstVideoRegionOfInterestMeta
    * - gvadetect
      - Object detection
      - | GstBuffer
        | *or*
        | GstBuffer + GstVideoRegionOfInterestMeta
      - INPUT + GstVideoRegionOfInterestMeta
    * - gvaclassify
      - Object classification
      - | GstBuffer
        | *or*
        | GstBuffer + GstVideoRegionOfInterestMeta
      - | INPUT + GstGVATensorMeta
        | *or*
        | INPUT + extended GstVideoRegionOfInterestMeta
    * - gvatrack
      - Object tracking
      - | GstBuffer
        | [+ GstVideoRegionOfInterestMeta]
      - INPUT + GstVideoRegionOfInterestMeta
    * - gvaaudiodetect
      - Audio event detection
      - GstBuffer
      - INPUT + GstGVAAudioEventMeta
    * - gvametaconvert
      - Metadata conversion
      - GstBuffer + GstVideoRegionOfInterestMeta, GstGVATensorMeta
      - INPUT + GstGVAJSONMeta
    * - gvametapublish
      - Metadata publishing to Kafka or MQTT
      - GstBuffer + GstGVAJSONMeta
      - INPUT
    * - gvametaaggregate
      - Metadata aggregation
      - [GstBuffer + GstVideoRegionOfInterestMeta]
      - INPUT + extended GstVideoRegionOfInterestMeta
    * - gvawatermark
      - Overlay
      - GstBuffer + GstVideoRegionOfInterestMeta, GstGVATensorMeta
      - GstBuffer with modified image
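The ``print_meta`` function shown earlier operates on a single GstBuffer; in a running pipeline it is typically invoked from a buffer probe installed on an element pad. The sketch below shows one way to wire this up with standard GStreamer API; the element instance name ``gvadetect0`` is an assumption (GStreamer auto-names the first gvadetect instance this way unless an explicit name is set in the pipeline description).

.. code:: C

    #include <gst/gst.h>

    void print_meta(GstBuffer *buffer); /* the C example above */

    /* Buffer probe callback: runs for every buffer passing through the pad */
    static GstPadProbeReturn meta_probe_cb(GstPad *pad, GstPadProbeInfo *info, gpointer user_data) {
        GstBuffer *buffer = GST_PAD_PROBE_INFO_BUFFER(info);
        if (buffer != NULL)
            print_meta(buffer);
        return GST_PAD_PROBE_OK; /* pass the buffer downstream unchanged */
    }

    /* Install the probe on the detector's source pad so that metadata is
       inspected right after gvadetect attaches it (assumed name "gvadetect0") */
    void attach_meta_probe(GstElement *pipeline) {
        GstElement *detector = gst_bin_get_by_name(GST_BIN(pipeline), "gvadetect0");
        GstPad *src_pad = gst_element_get_static_pad(detector, "src");
        gst_pad_add_probe(src_pad, GST_PAD_PROBE_TYPE_BUFFER, meta_probe_cb, NULL, NULL);
        gst_object_unref(src_pad);
        gst_object_unref(detector);
    }

Probing after a downstream element (for example, the last gvaclassify) would instead show the fully extended metadata, with all classification results attached.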