Event processing
As our event-based technology produces data that differs fundamentally from the frame images produced by frame-based cameras, a new processing paradigm had to be developed. Events are transmitted from the sensor as a stream of encoded data that needs to be decoded. In practice, this stream is decoded sequentially and provided to users in small buffers, allowing early and efficient processing of events. All these functions are already implemented in the Metavision SDK, but anyone is free to build their own plugin to work with their own event-based camera or sensor.
Basic C++ and Python examples based on the Metavision SDK can be found in the event-based-get-started GitHub repository.
Language
The choice of the programming language is similar to classical processing: Python is more relevant for fast prototyping, as well as for ML inference from events, while C/C++ is preferred when efficiency and low latency are required, for instance when porting to embedded platforms. C++ also allows more controlled and varied access to particular data types, as well as to memory management.
Classical processing pipeline in C++/Python
In our experience, users will first try to process events into frames and then process them in a typical frame-based manner. This is generally not the most efficient way to process event data, as it does not take advantage of the temporal precision useful for high-speed movements, nor of the data sparsity that enables low-power processing. This part focuses on how to process events directly using Prophesee’s Metavision SDK.
Keep in mind that event-based technology offers a new paradigm to vision engineers, which means they need to think differently! In particular, the additional temporal dimension allows a 3D visualization of the observed scene, which in turn gives a better sense of what can be expected from the event data and extracted from it. For instance, plane fitting in the XYT space can be a good way to extract speed information from the event stream (as is done in the Metavision SDK PlaneFittingFlowEstimator).
Metavision API v4.5.2 will be used to provide examples.
C++ SDK
Let’s start with the C++ API. The driver module provides a Metavision::Camera class, which allows reading events from either a record (.raw, .hdf5, .dat) or a Prophesee event-based camera.
Metavision::Camera cam; // create the camera
if (argc >= 2) {
// if we passed a file path, open it
cam = Metavision::Camera::from_file(argv[1]);
} else {
// open the first available camera
cam = Metavision::Camera::from_first_available();
}
Event retrieval
This class can be provided with a CD callback (for Contrast Detection). This callback is a user-defined function which receives a pointer to the first event of the decoded batch as well as a pointer to the past-the-end one; together, they allow iterating through the batch of events. The function is called every time a new batch of events has been decoded.
Note: Depending on the source of the data (live camera, HDF5 record, RAW record, etc.), batches can contain from a few hundred to around a thousand events.
// to analyze the events, we add a callback that will be called each time a batch of events is decoded, giving access to the latest events
cam.cd().add_callback([](const Metavision::EventCD *begin, const Metavision::EventCD *end){
// DO SOME PROCESSING
});
Note: Where classical frame-based processing involves iterating over the rows and columns of images, event-based processing mainly relies on iterating over events, with the help of iterators over storage structures such as std::vector.
Then, the stream can be started, the processing loop can be entered, and the camera stopped when the stream is over.
// start the camera
cam.start();
// keep running while the camera is on or the recording is not finished
while (cam.is_running()) {}
// the recording/stream is finished, stop the camera.
// Note: we will never get here with a live camera
cam.stop();
The actual event processing can be done in several steps, hosted in several processing class instances with distinct purposes. Typically (but not always), each processing class defines a public templated process_events method, whose template parameter is the type of the events iterator. This way, users can provide various iterators such as Metavision::EventCD*, const Metavision::EventCD*, std::vector<Metavision::EventCD>::iterator, std::vector<Metavision::EventCD>::const_iterator, etc., and are not constrained to a single input type, even though the event type is known in most cases (EventCD).
Depending on the algorithm’s objective, it can update the internal state of the class instance, call a callback function
because a condition has been reached, or return a value at each batch. Processing chains can then be created by associating
several algorithms. An example is detailed below.
std::vector<Metavision::EventCD> output_A, output_B;
cam.cd().add_callback([&](const Metavision::EventCD *begin, const Metavision::EventCD *end) {
    // reuse the intermediate buffers, clearing them at each batch
    output_A.clear();
    output_B.clear();
    algo_A.process_events(begin, end, std::back_inserter(output_A));
    algo_B.process_events(output_A.cbegin(), output_A.cend(), std::back_inserter(output_B));
    const bool result = algo_C.process_events(output_B.cbegin(), output_B.cend());
});
Event filtering
A typical event-based pipeline first applies a filtering algorithm to the event stream. In particular, the STC filter provided in the Metavision SDK is often used to filter out noise when this is not already done by the ESP block.
Keep in mind that hardware ESP preprocessing functions are much more efficient than software post-processing ones, while the software versions might be more precise as they work with complete timestamps and not partial ones. Consequently, software filtering can be very useful for development and testing purposes, while ESP functions should be preferred for the final product.
Tip: If you think your application would benefit from STC filtering, do a record without filtering, then try your processing with and without a software STC, and finally validate by running your algorithms with the hardware STC enabled instead of the software one. Enabling the hardware STC can be done as described below: the camera configuration file should contain the filter parameters, and the "filtering_type" and "threshold" should of course be adapted to the use case.
#include <metavision/sdk/cv/algorithms/spatio_temporal_contrast_algorithm.h>

std::vector<Metavision::EventCD> filtered_events;
Metavision::SpatioTemporalContrastAlgorithm stc_algo(width, height, threshold);
cam.cd().add_callback([&](const Metavision::EventCD *begin, const Metavision::EventCD *end) {
    filtered_events.clear();
    stc_algo.process_events(begin, end, std::back_inserter(filtered_events));
    // DO SOME PROCESSING with filtered_events.cbegin() and filtered_events.cend()
});
Event slicing
In practice, the event stream comes out of the sensor as encoded data. It is decoded by the camera plugin as a sequence of small decoded event buffers of variable size.
In many cases, users are interested in working with fixed-duration time slices (for instance events from 0 to 999 ms, then from 1000 to 1999 ms, etc.) or fixed-number-of-events slices (N events, then N new events, etc.). This makes it possible to accumulate enough events to compute information from, and then potentially track that information across slices. In practice, the slicing mode needs to be chosen wisely:
- Fixed-duration slices are useful for synchronization, as time windows are explicit and regular, but event density may vary depending on the observed movement, particularly its speed.
- Fixed-number-of-events slices provide constant event density, which makes it easier to allocate computational resources efficiently. They also naturally adapt to the dynamics of the scene and take advantage of the low latency of the sensor, at the cost of losing the temporal regularity of the previous mode. Decoding and processing time also becomes roughly constant with respect to the provided data.
To slice events, the Metavision SDK offers a very useful utility, the EventBufferReslicerAlgorithm class. It is simply fed with events and calls a callback method when the desired condition has been fulfilled (N events have been received, or Δt microseconds have passed).
This mechanism can be used, for instance, to select the events used to generate a timesurface or a histogram, as described below.
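To illustrate the principle, here is a minimal hand-rolled sketch of fixed-number-of-events slicing, independent of the SDK class; the MinimalEvent struct and FixedCountSlicer class are hypothetical stand-ins written for this document, whereas the EventBufferReslicerAlgorithm also provides fixed-duration and mixed conditions out of the box.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in mirroring the main fields of Metavision::EventCD
struct MinimalEvent {
    std::uint16_t x, y; // pixel coordinates
    std::int16_t p;     // polarity
    std::int64_t t;     // timestamp in microseconds
};

// Accumulates events and fires a callback each time n_events have been gathered,
// mimicking the fixed-number-of-events slicing condition
class FixedCountSlicer {
public:
    using SliceCallback = std::function<void(const std::vector<MinimalEvent> &)>;

    FixedCountSlicer(std::size_t n_events, SliceCallback cb) : n_events_(n_events), cb_(std::move(cb)) {}

    template <typename InputIt>
    void process_events(InputIt begin, InputIt end) {
        for (auto it = begin; it != end; ++it) {
            slice_.push_back(*it);
            if (slice_.size() >= n_events_) {
                cb_(slice_);    // slicing condition met: hand the slice over to the user
                slice_.clear(); // start accumulating the next slice
            }
        }
    }

private:
    std::size_t n_events_;
    SliceCallback cb_;
    std::vector<MinimalEvent> slice_;
};
A fixed-duration condition would instead compare each event timestamp with the end of the current time window before inserting it.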
Note: Some algorithms already slice the events internally to produce their output.
Python API
In Python, the same principles apply. Events are retrieved through an iterator.
from metavision_core.event_io import EventsIterator
events_iterator = EventsIterator(input_path=args.event_file_path, delta_t=1000)
height, width = events_iterator.get_size()  # Camera Geometry
Iterating over this iterator yields batches of events in the form of NumPy structured arrays. If events need to be stored in a temporary structure, the Metavision SDK algorithms provide methods to retrieve one, as is done for the Sparse Optical Flow algorithm below. Such a buffer is basically a binding to a std::vector containing the event type associated with the algorithm. It can be copied to a NumPy array for further processing, but be aware that, as always, copying comes with a computational cost.
from metavision_sdk_cv import SparseOpticalFlowAlgorithm, SparseOpticalFlowConfigPreset, SparseFlowFrameGeneratorAlgorithm
from metavision_sdk_ui import EventLoop

# Optical flow algorithm
flow_algo = SparseOpticalFlowAlgorithm(
    width, height, SparseOpticalFlowConfigPreset.FastObjects)
flow_buffer = flow_algo.get_empty_output_buffer()

processing_ts = 0
for evs in events_iterator:
    processing_ts += events_iterator.delta_t
    # Dispatch system events to the window
    EventLoop.poll_and_dispatch()
    # For instance, retrieve event information
    if evs.size == 0:
        print("The current event buffer is empty.")
    else:
        min_t = evs['t'][0]   # Timestamp of the first event of this batch
        max_t = evs['t'][-1]  # Timestamp of the last event of this batch
    # Provide events to the flow algorithm
    flow_algo.process_events(evs, flow_buffer)
Note: The poll_and_dispatch function is used to poll system events and push them to an internal queue, preventing the program from overloading the CPU.
Intermediate representation
Event-based cameras produce an almost continuous flow of events, processed in batches. Those events are mostly described by:
- Position X and Y in the sensor
- Polarity of the event (increase or decrease in intensity)
- Timestamp of the event generation
In many cases, events cannot be processed directly to compute the desired data: a temporary data structure is necessary to store and/or preprocess them. The following sections present several widely used event data representations.
Timesurface
This data structure is probably the most natural event representation, as it builds a history of the events, keeping in memory the most recent ones. It consists in a matrix of the sensor dimensions with one or two channels, depending on whether the polarity of the events is stored. Each pixel of the matrix stores the timestamp of the last event received at this location. It is commonly used as it preserves most of the recent information provided by the events, without too much redundancy.
Tip: The OpenCV cv::COLORMAP_JET colormap can be useful to visualize the timesurface, as it associates vivid red with recent events and colder blue with old ones, which is very human readable.
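As an illustration, a minimal single-channel timesurface could be maintained as in the sketch below; the MinimalEvent and TimeSurface types are hypothetical stand-ins written for this document, not SDK classes.
#include <cstdint>
#include <vector>

// Hypothetical stand-in mirroring the main fields of Metavision::EventCD
struct MinimalEvent { std::uint16_t x, y; std::int16_t p; std::int64_t t; };

// Single-channel timesurface: one timestamp per pixel, overwritten by the most recent event
struct TimeSurface {
    int width, height;
    std::vector<std::int64_t> last_t; // row-major, width * height cells, -1 means "no event yet"

    TimeSurface(int w, int h) : width(w), height(h), last_t(static_cast<std::size_t>(w) * h, -1) {}

    void process_events(const MinimalEvent *begin, const MinimalEvent *end) {
        for (auto it = begin; it != end; ++it)
            last_t[static_cast<std::size_t>(it->y) * width + it->x] = it->t; // keep the latest timestamp
    }
};
A two-channel variant would simply keep one such matrix per polarity.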
Histogram
Another commonly used event representation is the histogram. A histogram is a 2-channel matrix of the sensor dimensions, counting in separate channels the negative and positive events triggered at each pixel location. This can be sufficient to extract useful information from the event data; in particular, a histogram tracks the pixelwise amount of activity over a given duration.
Figure 3. Negative channel
Figure 4. Positive channel
Tip: In some cases, positive events alone provide enough, or even better, information for the computation.
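A minimal sketch of such a histogram is shown below; the MinimalEvent and EventHistogram types are hypothetical stand-ins written for this document, not SDK classes.
#include <cstdint>
#include <vector>

// Hypothetical stand-in mirroring the main fields of Metavision::EventCD
struct MinimalEvent { std::uint16_t x, y; std::int16_t p; std::int64_t t; };

// Two-channel histogram: per-pixel count of negative (channel 0) and positive (channel 1) events
struct EventHistogram {
    int width, height;
    std::vector<std::uint32_t> counts; // channel-major, 2 * width * height cells

    EventHistogram(int w, int h) : width(w), height(h), counts(static_cast<std::size_t>(2) * w * h, 0) {}

    void process_events(const MinimalEvent *begin, const MinimalEvent *end) {
        for (auto it = begin; it != end; ++it) {
            const std::size_t channel = (it->p > 0) ? 1 : 0; // polarity selects the channel
            ++counts[(channel * height + it->y) * width + it->x];
        }
    }

    void reset() { counts.assign(counts.size(), 0); } // typically called when starting a new slice
};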
Differential Frame
A differential frame counts events similarly to a histogram, but in a single channel, with positive events incrementing the pixel value and negative events decrementing it. Basically, it stores the pixelwise net contrast change over a certain duration.
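A minimal sketch follows, again with hypothetical MinimalEvent and DifferentialFrame types standing in for SDK classes.
#include <cstdint>
#include <vector>

// Hypothetical stand-in mirroring the main fields of Metavision::EventCD
struct MinimalEvent { std::uint16_t x, y; std::int16_t p; std::int64_t t; };

// Single-channel signed accumulation: +1 for positive events, -1 for negative ones
struct DifferentialFrame {
    int width, height;
    std::vector<std::int32_t> values; // net contrast change per pixel

    DifferentialFrame(int w, int h) : width(w), height(h), values(static_cast<std::size_t>(w) * h, 0) {}

    void process_events(const MinimalEvent *begin, const MinimalEvent *end) {
        for (auto it = begin; it != end; ++it)
            values[static_cast<std::size_t>(it->y) * width + it->x] += (it->p > 0) ? 1 : -1;
    }
};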
Event cube
The previous representations were mainly 2D; the event cube adds a third dimension, time. It consists in building a 4D tensor of shape (polarity, nbins, height, width), where nbins is the number of temporal bins. Typically, the most recent ΔT of data is divided into temporal bins of Δt milliseconds, which are filled with the events: an event with timestamp t falls into a specific tensor bin, which is incremented. In fact, an event cube can be viewed as a stack of histograms.
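The sketch below accumulates the events of a given time window into such a tensor; the MinimalEvent and EventCube types are hypothetical stand-ins written for this document, and the binning convention (microsecond timestamps, out-of-window events clamped to the border bins) is an assumption.
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in mirroring the main fields of Metavision::EventCD
struct MinimalEvent { std::uint16_t x, y; std::int16_t p; std::int64_t t; };

// (polarity, nbins, height, width) tensor accumulating events of a [t_start, t_start + delta_T) window
struct EventCube {
    int width, height, nbins;
    std::int64_t t_start, delta_T; // window start and total duration, in microseconds
    std::vector<std::uint32_t> bins;

    EventCube(int w, int h, int n, std::int64_t start, std::int64_t duration) :
        width(w), height(h), nbins(n), t_start(start), delta_T(duration),
        bins(static_cast<std::size_t>(2) * n * w * h, 0) {}

    void process_events(const MinimalEvent *begin, const MinimalEvent *end) {
        for (auto it = begin; it != end; ++it) {
            // map the event timestamp to one of the nbins temporal bins of the window
            const std::int64_t raw_bin = (it->t - t_start) * nbins / delta_T;
            const std::size_t bin      = static_cast<std::size_t>(std::clamp<std::int64_t>(raw_bin, 0, nbins - 1));
            const std::size_t polarity = (it->p > 0) ? 1 : 0;
            ++bins[((polarity * nbins + bin) * height + it->y) * width + it->x];
        }
    }
};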
Binary Frame
Another, much simpler way to work with events is to store only the polarity of the last event triggered at each pixel location. Polarities are continuously added, overwriting previous values in the frame. The temporal information is lost, but this is sufficient for some applications; for instance, this principle is used in the detection part of the Aruco marker detection and tracking.
Note: The image is not really "binary" but rather ternary, as it stores the binary polarity information plus the initial value for pixels where no event has been triggered yet.
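A minimal sketch, again with hypothetical stand-in types, could look as follows.
#include <cstdint>
#include <vector>

// Hypothetical stand-in mirroring the main fields of Metavision::EventCD
struct MinimalEvent { std::uint16_t x, y; std::int16_t p; std::int64_t t; };

// "Binary" frame: each pixel keeps the polarity of the last event it received
struct BinaryFrame {
    int width, height;
    std::vector<std::int8_t> polarities; // +1 = positive, -1 = negative, 0 = no event yet (hence "ternary")

    BinaryFrame(int w, int h) : width(w), height(h), polarities(static_cast<std::size_t>(w) * h, 0) {}

    void process_events(const MinimalEvent *begin, const MinimalEvent *end) {
        for (auto it = begin; it != end; ++it)
            polarities[static_cast<std::size_t>(it->y) * width + it->x] = (it->p > 0) ? 1 : -1;
    }
};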
Event Creation
In general, contrast variation events are produced by the sensor (EventCD for Prophesee’s Metavision SDK). In that case, an event describes a light intensity variation.
It is important to note that sensor events come with a high temporal granularity, which is worth preserving in event processing pipelines. For this, software event structures can be defined. For instance, the Metavision SDK defines bounding box events (containing the width and height of the bounding box on top of the (x,y) location and timestamp), optical flow events containing the vertical and horizontal speed of an object, SourceId events identifying blinking objects (LEDs for instance), etc. It is up to the user to determine whether an event structure is relevant for a specific use case and which data it should contain, but in many cases it is worth preserving the temporal aspect of the data, as opposed to traditional frame-based processing where one input provides one output.
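As an illustration, a user-defined software event could look like the sketch below; the EventBoundingBox name and field layout are hypothetical and do not match any particular Metavision SDK definition.
#include <cstdint>

// Hypothetical user-defined event carrying its own timestamp, so that detections
// keep the temporal granularity of the underlying event data
struct EventBoundingBox {
    std::int64_t t;      // detection timestamp, in microseconds
    float x, y;          // top-left corner of the box, in pixels
    float width, height; // box dimensions, in pixels
    int class_id;        // detected object class
    float confidence;    // detection score
};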
Good practices
There are some good practices when it comes to reaching high performance for event-based algorithms:
- User-focused features, such as display or logging, are often very useful during development but are not always necessary. In the Metavision SDK, callback functions are called very frequently, which implies a large number of potential passes through the logging or display operations. These drastically impact the runtime performance of a program, so it is important not to assess a program's performance while it is displaying or logging information.
- Often, especially before ESP filters are applied, many events can be produced. It is important NOT to copy them when unnecessary, as memory allocation introduces major delays in the processing.
- Don't hesitate to use the usual profiling tools (GNU Profiler, Intel VTune Profiler, perf, etc.) to analyze your program and detect bottlenecks. The SDK offers additional profiling tools such as the TimingProfiler, which tracks runtime and call counts between instance creation and deletion.
- Power management is also relevant to ensure optimal performance, in particular during the development phase:
  - Laptops often have power modes (Energy saving, Balanced, Performance) which have a direct impact on the resources made available to your program. Ensure you are using the Performance mode for better resource usage.
  - Resources are also limited when the laptop is running on battery. Make sure to plug your device into the mains when looking for improved performance.
AI with event-based technology
Current vision AI approaches use images or video feeds to infer object position, movement, depth, etc. As discussed previously, event-based sensors don't produce height x width images. However, events can be preprocessed into various data structures with different formats. These formats correspond to various shapes of tensors (including images), which can easily be fed to a classical vision model. The Metavision SDK (version 4.5.2) offers several examples of machine learning applications with simple (not optimized) models doing object detection, optical flow, classification, etc. GPUs can be used similarly to classical ML pipelines.
The previous approach is the frame-like way to apply existing AI models to event-based technology. However, new architectures tailored to neuromorphic processing now allow directly feeding models with events, avoiding preprocessing latency. These are called spiking neural networks. They require a different approach to processing event data and are not yet explored in the Metavision SDK.
Processing platform
The choice of the host platform for the processing is a major one, as it directly impacts the type of processing you can expect; conversely, the algorithms you want to run might dictate specific platform choices.
Location of the processing
Even though, in controlled conditions, data transfer might be less critical than with classical frame-based cameras, it is still a point of attention for event-based products. Indeed, if some data packets are dropped, only a few events will be missing, not a whole frame, and the received events can still be processed normally, without skipping a whole time period.
In that regard, having the processing unit as close as possible to the sensor can be interesting. In particular, very long USB cables might endanger data integrity; if one is really needed, prefer an externally powered USB 3.0 extension cable.
Also, as mentioned above, processing events into frames and then sending a video stream for further processing is not ideal in a lot of cases, and sending a stream of encoded or decoded events would also be challenging. Instead, local processing can often reduce latency and bandwidth usage. Take the example of surveillance cameras in which an event-based camera detecting a person triggers the recording of a frame-based video stream: you don't want to send a continuous video stream to a server which will run inference for people detection and then send back a record signal to the camera system. Instead, running person detection on an inference platform next to the cameras might be more relevant. In practice, it highly depends on the system constraints.
CPU performance
Similarly to classical algorithmic processing, the performance will depend, among other things, on the number of available CPU cores, their clock frequency and quality, etc. But in our particular case, it will also heavily depend on the camera settings (in particular the biases), the event filtering (at the sensor level or the algorithmic level), as well as the quality of the algorithms themselves. Thus, some algorithms might run on a laptop but not on an embedded platform, or at least not "as is". In that case, the platform choice, processing location, or bias tuning might need to be reevaluated to reach the desired performance.