Wednesday, November 9, 2022

MLPerf Tiny 1.0 confirms: Plumerai’s inference engine is again the world’s fastest

Earlier this year in April we presented our MLPerf Tiny 0.7 benchmark scores, showing that our inference engine runs your AI models faster than any other tool. Today, MLPerf released the Tiny 1.0 scores and Plumerai has done it again: we still have the world’s fastest inference engine for Arm Cortex-M architectures. Faster inferencing means you can run larger and more accurate AI models, go into sleep mode earlier to save power, and run them on smaller and lower cost MCUs. Our inference engine executes the AI model as-is and does no additional quantization, no binarization, no pruning, and no model compression. There is no accuracy loss. It simply runs faster and in a smaller memory footprint.

Here’s the overview of the MLPerf Tiny results for an Arm Cortex-M4 MCU:

Vendor Visual Wake Words Image Classification Keyword Spotting Anomaly Detection
Plumerai 208.6 ms 173.2 ms 71.7 ms 5.6 ms
STMicroelectronics 230.5 ms 226.9 ms 75.1 ms 7.6 ms
OctoML (microTVM, CMSIS-NN) 301.2 ms 389.5 ms 99.8 ms 8.6 ms
OctoML (microTVM, native codegen) 336.5 ms 389.2 ms 144.0 ms 11.7 ms
Official MLPerf Tiny 1.0 inference results for an Arm Cortex-M4 at 120MHz (STM32L4R5ZIT6U).

Compared to TensorFlow Lite for Microcontrollers with Arm’s CMSIS-NN optimized kernels, we run 1.74x faster:

Speed Visual Wake Words Image Classification Keyword Spotting Anomaly Detection Speedup
TFLM latency 335.97 ms 376.08 ms 100.72 ms 8.45 ms
Plumerai latency 194.36 ms 170.42 ms 66.32 ms 5.59 ms
Speedup 1.73x 2.21x 1.52x 1.51x 1.74x
MLPerf Tiny inference latency for the Arm Cortex-M4 at 120MHz inside the STM32L4R9 MCU.

But not only latency is important. Since MCUs often have very limited memory on board it’s important that the inference engine uses minimal memory while executing the neural network. MLPerf Tiny does not report numbers for memory usage, but here are the memory savings we can achieve on the benchmarks compared to TensorFlow Lite for Microcontrollers. On average we use less than half the memory:

Memory Visual Wake Words Image Classification Keyword Spotting Anomaly Detection Reduction
TFLM memory (KiB) 98.80 54.10 23.90 2.60
Plumerai memory (KiB) 36.50 37.80 17.00 1.00
Reduction 2.71x 1.43x 1.41x 2.60x 2.04x
MLPerf Tiny memory usage is lower by a factor of 2.04x.

MLPerf doesn’t report code size, so again we compare against TensorFlow Lite for Microcontrollers. The table below shows that we reduce code size on average by a factor of 2.18x. Using our inference engine means you can use MCUs that have significantly smaller flash size.

Code size Visual Wake Words Image Classification Keyword Spotting Anomaly Detection Reduction
TFLM code (KiB) 209.60 116.40 126.20 67.20
Plumerai code (KiB) 89.20 48.30 46.10 54.20
Reduction 2.35x 2.41x 2.74x 1.24x 2.18x
MLPerf Tiny code size is lower by a factor of 2.18x.

Want to see how fast your models can run? You can submit them for free on our Plumerai Benchmark service. We email you the results in minutes.

Are you deploying AI on microcontrollers? Let’s talk.


Wednesday, November 2, 2022

World’s fastest inference engine now supports LSTM-based recurrent neural networks

At Plumerai we enable our customers to perform increasingly complex AI tasks on tiny embedded hardware. We recently observed that more and more of such tasks are using recurrent neural networks (RNNs), in particular RNNs using the long-short-term-memory (LSTM) cell architecture. Example uses of LSTMs are analyzing time-series data coming from sensors like IMUs or microphones, human activity recognition for fitness and health monitoring, detecting if a machine will break down, and speech recognition. This led to us optimizing and extending our support for LSTMs and today we are proud to announce that Plumerai’s deep learning inference software greatly outperforms existing solutions for LSTMs on microcontrollers for all metrics: speed, accuracy, RAM usage, and code size.

To demonstrate this, we selected four common LSTM-based recurrent neural networks and measured latency and memory usage. In the table below, we compare our inference software against the September 2022 version of TensorFlow Lite for Microcontrollers (TFLM for short) with CMSIS-NN enabled. We choose TFLM because it is freely available and widely used. Note that ST’s X-CUBE-AI also supports LSTMs, but that only runs on ST chips and works with 32-bit floating-point, making it much slower.

These are the networks used for testing:


TFLM latency Plumerai latency TFLM RAM Plumerai RAM
Simple LSTM 941.4 ms 189.0 ms
(5.0x faster)
19.3 KiB 14.3 KiB
(1.4x lower)
Weather prediction 27.5 ms 9.2 ms
(3.0x faster)
4.1 KiB 1.8 KiB
(2.3x lower)
Text generation 7366.0 ms 1350.5 ms
(5.5x faster)
61.1 KiB 51.6 KiB
(1.2x lower)
Bi-directional LSTM 61.5 ms 15.1 ms
(4.1x faster)
12.8 KiB 2.5 KiB
(5.1x lower)

Board: STM32L4R9AI with an Arm Cortex-M4 at 120 MHz with 640 KiB RAM and 2 MiB flash. Similar results were obtained using an Arm Cortex-M7 board.

In the above table we report latency and RAM, which are the most important metrics for most users. The faster you can execute a model, the faster the system can go to sleep, saving power. Microcontrollers are also very memory constrained, making thrifty memory usage crucial. In many cases code size (ROM usage) is also important, and again there we outperform TFLM by a large margin. For example, Plumerai’s implementation of the weather prediction model uses 48 KiB including weights and support code, whereas TFLM uses 120 KiB.

The table above does not report accuracy, because accuracy is not changed by our inference engine. It performs the same computations as TFLM without extra quantization or pruning. Just like TFLM, our inference engine does internal LSTM computations in 16 bits instead of 8 bits to maintain accuracy.

In this blog post we highlight the LSTM feature of Plumerai’s inference software. However, other neural networks are also well supported, very fast, and low on RAM and ROM consumption, without losing accuracy. See our MobileNet blog post or the MLPerf blog post for examples and try out our inference software with your own model.

Besides Arm Cortex-M0/M0+/M4/M7/M33, we also optimize our inference software for Arm Cortex-A, ARC EM, and RISC-V architectures. Get in touch if you want to use the world’s fastest inference software with LSTM support on your embedded device.


Friday, September 30, 2022

Plumerai’s people detection powers Quividi’s audience measurement platform

We’re happy to report that Quividi adopted our people detection solution for its audience measurement platform. Quividi is a world leader in the domain of measuring audiences for digital displays and retail media. They have over 600 customers analyzing billions of shoppers every month, across tens of thousands of screens.

Quividi’s platform measures consumer engagement in all types of venues, outside, inside, and in-store. Shopping malls, vending machines, bus stops, kiosks, digital merchandising displays and retail media screens measure audience impressions, enabling monetization of the screens and leveraging shopper engagement data to drive sales up. The camera-based people detection is fully anonymous and compliant with privacy laws, since no images are recorded or transmitted. Plumerai powers Quividi audience measurement platform With the integration of Plumerai’s people detection, Quividi expands the range of its platform capabilities, since our tiny and fast AI software runs seamlessly on any Arm Cortex-A processor, instead of on costly and power-hungry hardware. Building a tiny and accurate people detection solution takes time: we collect and curate our own data, design and train our own model architectures with over 30 million images, and then run them on off-the-shelf Arm CPUs using our world’s fastest inference engine software.

Besides measuring audiences, people detection can really enhance the user experience in many other markets: tiny smart cameras that alert you when there is someone in your backyard, webcams that ensure you’re always optimally in view, or air conditioners and smart lights that turn off automatically when you leave the room are just a few examples.

More information on our people detection can be found here.

Contact us to evaluate our solution and learn how to integrate it in your product.


Friday, May 13, 2022

tinyML Summit 2022: ‘Tiny models with big appetites: cultivating the perfect data diet'

Although lots of research effort goes into developing small model architectures for computer vision, real gains cannot be made without focusing on the data pipeline. We already mentioned the importance of quality data and some pitfalls of public datasets in an earlier blog post, and have been further improving our data tooling a lot since then to make our data processes even more powerful. We have incorporated several machine learning techniques that enable us to curate our datasets at deep learning scale, for example by identifying images that contributed strongly to a specific false prediction during training.

In this recording of his talk at tinyML Summit 2022, Jelmer demonstrates how our fully in-house data tooling leverages these techniques to curate millions of images, and describes several other essential parts of our model and data pipelines. Together, these allow us to build accurate person detection models that run on the tiniest edge devices.


Friday, May 6, 2022

MLPerf Tiny benchmark shows Plumerai's inference engine on top for Cortex-M

We recently announced Plumerai’s participation in MLPerf Tiny, the best-known public benchmark suite for evaluation of machine learning inference tools and methods. In the latest v0.7 of the MLPerf Tiny results, we participated along with 7 other companies. The published results confirm the claims that we made earlier on our blog: our inference engine is indeed the world’s fastest on Arm Cortex-M microcontrollers. This has now been validated and tested using standardized methods and reviewed by third parties. And what’s more, everything was also externally certified for correctness by evaluating model accuracy on four representative neural networks and applications from the domain: anomaly detection, image classification, keyword spotting, and visual wake words. In addition, our inference engine is also very memory efficient and works well on Cortex-M devices from all major vendors.

Visual Wake Words Image Classification Keyword Spotting Anomaly Detection
STM32 L4R5 220 ms 185 ms 73 ms 5.9 ms
STM32 F746 59 ms 65 ms 19 ms 2.4 ms
Cypress PSoC-62 200 ms 203 ms 64 ms 6.8 ms

Official MLPerf Tiny 0.7 inference results for Plumerai’s inference engine on 3 example devices

You can read more about Plumerai’s inference engine in another blog post, or try it out with your own models using our public benchmarking service. Do contact us if you are interested in the world’s fastest inference software and the most advanced AI on your embedded device, or if you want to know more about the MLPerf Tiny results.


Tuesday, January 4, 2022

tinyML Talk: Demoing the world’s fastest inference engine for Arm Cortex-M

We recently announced Plumerai’s inference engine for 8-bit deep learning models on Arm Cortex-M microcontrollers. We showed that it is the world’s most efficient on MobileNetV2, beating TensorFlow Lite for Microcontrollers with CMSIS-NN kernels by 40% in terms of latency and 49% in terms of RAM usage with no loss in accuracy. However, that was just on a single network and it might have been cherry-picked. This presentation shows a live demonstration of our new service that you can use to test your own 8-bit deep learning models with Plumerai’s inference engine. In this talk Cedric explains what we did to get these speedups and memory improvements and shows benchmarks for the most important publicly available neural network models.


Monday, January 3, 2022

People detection for building automation with Texas Instruments

We’ve partnered with Texas Instruments. Together we’ll enable many more AI applications on tiny microcontrollers. Plumerai’s highly accurate people detection runs on TI’s tiny SimpleLink Wi-Fi CC3220SF MCU. Resource utilization is minimal: peak RAM usage is 170 kB, program binary size is 154 kB, and the detection runs in real-time on the low power Arm Cortex-M4 CPU at 80MHz. We’re controlling the lights here, resulting in a much better user experience. The product was demonstrated at CES 2022.


Thursday, December 9, 2021

Squeezing data center AI into a tiny microcontroller

How we replaced a $10,000 GPU with a $3 MCU, while reaching EfficientDet-level accuracy.

EfficientDet-level accuracy on an MCU

What’s the point of making deep learning tiny if the resulting system becomes unreliable and misses key detections or generates false positives? We like our systems to be tiny, but don’t want to compromise on detection accuracy. Something’s got to give, right? Not necessarily. Our person detection model fits on a tiny Arm Cortex-M7 microcontroller and is as accurate as Google’s much larger EfficientDet-D4 model running on a 250 Watt NVIDIA GPU.

Person detection

Whether it’s a doorbell that detects someone walking up to your door, a thermostat that stops heating when you go to bed, or a home audio system where music follows you from room to room, there are many applications where person detection greatly enhances the user experience. When expanded beyond the home, this technology can be applied in smart cities, stores, offices, and industry. To bring such people-aware applications to market, we need systems that are small, inexpensive, battery-powered, and keep the images at the edge for better privacy and security. Microcontrollers are an ideal platform, but they don’t have powerful compute capabilities, nor much memory. Plumerai takes a full system-level approach to bring AI to such a constrained platform: we collect our own data, use our own data pipeline, and develop our own models and inference engine. The end result is real-time person detection on a tiny Arm Cortex-M7 microcontroller. Quite an achievement to get deep learning running on such a small device.

Plumerai’s person detection recorded live from the MCU’s display
Plumerai’s person detection recorded live from the MCU’s display

But what about accuracy?

Our customers keep asking this question too. Unfortunately, measuring accuracy is not that simple. A common approach is to measure accuracy on a public dataset, such as COCO. We have found this to be flawed, as many datasets are not accurate representations of real-world applications. For example, a model that performs better on COCO does not perform better in the field. This is because the dataset includes many images that are not relevant for a real-world use-case. In addition, there’s bias in COCO, as it was crowd-sourced from the internet. These images are mostly of situations that people found interesting to capture while always-on intelligent cameras mostly take pictures of uninteresting office hallways or living rooms. For example, the dark images in COCO include many from concerts with many people in view where a dark office or home is usually empty. The only right dataset to measure accuracy on is a relevant and clean dataset that is specifically collected for the target application. We do extensive, proprietary data collection and curation and decide to use our own test dataset to measure accuracy on. It goes without saying that this test dataset is completely independent and separate from the training dataset — collected in different locations with different backgrounds, situations, and subjects.

After selecting the dataset to measure accuracy on, we need to choose a model to compare to. We found Google’s state-of-the-art EfficientDet the most appropriate model for evaluation. EfficientDet is a large and highly accurate model that is scalable, enabling a range of performance points. EfficientDet starts with D0, which has a model size of 15.6MB, and goes up to D7x, which requires 308MB.

The surprising result? Our model is just as accurate as the pre-trained EfficientDet-D4. Both achieve an mAP of 88.2 on our test dataset. D4 has a model size of 79.7MB and typically runs on a $10,000 NVIDIA Tesla V100 GPU in a data center. Our model size is just 737KB, making it 112 times smaller. Using our optimized inference engine, our model runs in real-time on a tiny MCU with less than 256KB of on-chip RAM.

Model Plumerai Person Detection Google EfficientDet‑D4
Model size 737 KB 79.7 MB
 
Hardware Arm
Cortex-M7 MCU
NVIDIA Tesla V100 GPU
 
Price <$3 $10,000
 
Real-world person detection mAP 88.2 88.2

Full stack optimization

Such a result can only be achieved with a relentless focus on the full AI stack.

First, choose your data wisely. A model is only as good as the data you train it with. We do extensive data collection and use our intelligent data pipeline for curation, augmentation, oversampling strategies, model debugging, data unit tests, and more.

Second, design and train the right model. Our model supports a single class, where EfficientDet can detect many. But such a multi-class model is not needed for our target application. In addition, we include our Binarized Neural Networks technology, significantly shrinking the network while also speeding it up.

Third, use a great inference engine. We developed a new inference engine from scratch, which greatly outperforms TensorFlow Lite for Microcontrollers. This inference engine works for both standard 8-bit as well as Binarized Neural Networks and does not affect accuracy.

Famous mathematician Blaise Pascal once said, “If I had more time, I would have written a shorter letter.” In deep learning, this translates to “If I had more time, I would have made a smaller model.” We did take that time, and it wasn’t easy, but we did build that smaller model. We’re excited to bring powerful deep learning to tiny devices that lack the storage capacity and the computational power for long letters.

Let us know if you’d like to continue the conversation. High accuracy numbers are nice, but nothing goes beyond running the model in the field. We’re always happy to jump on a call and show you our live demonstrations and discuss how we can bring AI to your products.


Tuesday, November 9, 2021

Person presence detection on the Lattice CrossLink-NX FPGA

At Plumerai we believe in building vertically integrated solutions to enable the most advanced AI on embedded devices. We do extensive data collection, build our own intelligent data pipeline, design our own inference software and tiny AI models, which we train using our own training algorithms. We recently showed that our inference engine for Arm Cortex-M is the fastest and smallest in the world. This way we bring powerful AI to small microcontrollers that previously could not run such complex deep learning tasks.

One of our unique technologies is our Binarized Neural Networks (BNNs), which consists of simple single-bit operations instead of 8-bit multiplications. BNNs save significant memory and power, enabling either more weights and activations in the same silicon footprint to reach higher accuracy, or the same networks to run on smaller and cheaper chips that are powered by smaller batteries. Until now, we have been deploying our BNNs on Arm Cortex-M and Arm Cortex-A processors with great results. However, we felt there was more room for improvement, since these CPUs are built to run typical 8-bit and 32-bit workloads and don’t provide native support for the single-bit operations that our BNNs rely on.

Some of our customers asked if our AI solutions also support FPGAs, since these provide incredible flexibility, cost efficiency, and tighter integration. FPGAs turn out to be an ideal platform to implement our models and inference engine on, as they enable us to unlock the full potential of our BNNs. In FPGAs we can natively implement the binary arithmetic that we need to run our models. We therefore decided to develop our own AI accelerator IP core named Ikva, which we introduce for the first time in this blog post. The Ikva accelerator runs our own BNNs and also efficiently supports 8-bit models. Of course, Ikva is fully supported by our extensive tool flow and ultra-fast and memory-efficient inference engine that’s integrated with TensorFlow Lite. A 32-bit RISC-V processor controls Ikva, captures the data from the camera and provides a programmer-friendly runtime environment. During the development of Ikva, we aimed to design a new hardware architecture for our optimized AI models while keeping it highly flexible and suitable for unknown future models. In contrast to other AI companies that seem to either develop models, or training software, or AI processors, we focus on the full AI stack and the Ikva core completes our offering. With Ikva, we now support the full AI stack starting from data collection, to training and model development, to very efficient inference engines, and now all the way down to providing the most optimized hardware implementations.

As you know, we like AI that is tiny, and Ikva fits in small and low-power FPGAs like the Lattice CrossLink-NX. The architecture is scalable, both in memory and in compute power. This means we can target a wide variety of FPGAs, ensure we fit next to other IP blocks, and extract maximum performance out of the resources that are available in the target FPGA device.

The video above showcases one of our proprietary person presence detection models together with our inference software running on the Ikva IP core in a Lattice CrossLink-NX LIFCL-40 FPGA. This is a low-power and low-cost 6x6mm FPGA that is available off-the-shelf and includes a native MIPI camera interface, further reducing the number of components in the system.

Ikva runs our robust and highly accurate person presence detection model 10x faster on the CrossLink-NX FPGA than on a typical Arm Cortex-M microcontroller. Alternatively, the frame rate can be scaled down to 1 or 2 FPS for those applications where low energy consumption is key.

The Lattice CrossLink-NX Voice & Vision Machine Learning Board with the CrossLink-NX LIFCL-40 FPGA The Lattice CrossLink-NX Voice & Vision Machine Learning Board with the CrossLink-NX LIFCL-40 FPGA

There are many target applications for person presence detection. For instance in your home, to automatically turn off your TV, your lights, or heating when there’s no one in the room. Outside your home, your doorbell can send you a signal when there’s someone walking up to your front door or a small camera can detect when there’s an unexpected visitor in your backyard. In the office, your PC can automatically lock the screen when you leave. Elderly care can be improved when you know how much time they spend in bed, their living room, or outside. The possibilities are endless, whether it’s in the home, on the road, in the city, at the office or on the factory floor. Accurate, inexpensive and battery-powered person detection will enhance our lives.

Of course, besides running Plumerai’s optimized BNN models, you can also run your own model on the Ikva core, or integrate Ikva into your FPGA-based device. We’re excited to enable extremely powerful AI to go to places it couldn’t go before.

The Ikva IP core, the supporting tool flow, and optimized person detection models are available today. Contact us to receive more information or schedule a video call to see our live demonstration. We’re eager to discuss how we can enable your products with Ikva.


Monday, October 4, 2021

The world’s fastest deep learning inference software for Arm Cortex-M

New: Official MLPerf results are in!

New: Try out our inference engine with your own model!

At Plumerai we enable our customers to perform increasingly complex AI tasks on tiny embedded hardware. We’re proud to announce that our inference software for Arm Cortex-M microcontrollers is the fastest and most memory-efficient in the world, for both Binarized Neural Networks and for 8-bit deep learning models. Our inference software is an essential component of our solution, since it directs resource management akin to an operating system. It has 40% lower latency and requires 49% less RAM than TensorFlow Lite for Microcontrollers with Arm’s CMSIS-NN kernels while retaining the same accuracy. It also outperforms any other deep learning inference software for Arm Cortex-M:

Inference time RAM usage
TensorFlow Lite for Microcontrollers 2.5 (with CMSIS-NN) 129 ms 155 KiB
Edge Impulse’s EON 120 ms 153 KiB
MIT’s TinyEngine 1 124 ms 98 KiB
STMicroelectronics’ X-CUBE-AI 103 ms 109 KiB
Plumerai’s inference software 77 ms 80 KiB

Model: MobileNetV2 2 3 (alpha=0.30, resolution=80x80, classes=1000)
Board: STM32F746G-Discovery at 216 MHz with 320 KiB RAM and 1 MiB flash

Our inference software builds upon TensorFlow Lite for Microcontrollers, such that it supports all of the same operations and more. But since resources are scarce on a microcontroller, we do not rely on TensorFlow or Arm’s kernels for the most performance-critical layers. Instead, for those layer types we developed custom kernel code, optimized for lowest latency and memory usage. This includes optimized code for regular convolutions, depthwise convolutions, fully-connected layers, various pooling layers and more. To become faster than the already heavily optimized Arm Cortex-M specific CMSIS-NN kernels, we had to go deep inside the inner-loops and also rethink the higher-level algorithms. This includes optimizations such as hand-written assembly blocks, improved register usage, pre-processing of weights and input activations and template-based loop-unrolling.

Although these generic per-layer-type optimizations resulted in great speed-ups, we went further and squeezed out every last bit of performance from the Arm Cortex-M microcontroller. To do that, we perform specific optimizations for each layer in a neural network. For instance, rather than only optimizing convolutions in general, our inference software makes specific improvements based on all actual values of layer parameters such as kernel sizes, strides, padding, etc. Since we do not know upfront which neural networks our inference software might run, we make these optimizations together with the compiler. This is achieved by generating code in an automated pre-processing step using the neural network as input. We then guide the compiler to do all the necessary constant propagation, function inlining and loop unrolling to achieve the lowest possible latency.

Memory usage is an important constraint on embedded devices; however fast or slow the software is, it has to fit in memory to run at all. TensorFlow Lite for Microcontrollers already comes with a memory planner that ensures a tensor only takes up space while there is a layer using it. We further optimized memory usage with a smart offline memory planner that analyzes the memory access patterns of each layer of the network. Depending on properties such as filter size, the memory planner allows the input and output of a layer to partially or even completely overlap, effectively computing the layer in-place.

Besides Arm Cortex-M, we also optimize our inference software for Arm Cortex-A and RISC-V architectures. And if the above results are still not fast enough for your application, we go even further. We make our AI tiny and radically more efficient by using Binarized Neural Networks (BNNs) - deep learning models that use only a single bit to encode each weight and activation. We are building improved deep learning model architectures and training algorithms for BNNs, we are designing a custom IP-core for customers with FPGAs and we are composing optimized training datasets. All these improvements mean that we can process more frames per second, save more energy, run larger and more accurate AI models and deploy on cheaper hardware.

Get in touch if you want to use the world’s fastest inference software and the most advanced AI on your embedded device.

Wednesday, October 13, 2021: Updated memory usage to match newest version of the inference software.


  1. Results copied from https://github.com/mit-han-lab/tinyml/tree/master/mcunet, all other results were measured by us. ↩︎

  2. The explicit padding layers in MobileNetV2 were fused, and to be able to compare with the TinyEngine results the number of filters in the final convolution layer (1280) was scaled by alpha to 384. The exact model can be downloaded here↩︎

  3. microTVM ran out of memory, other benchmarks show that microTVM is generally a bit slower than CMSIS-NN↩︎


Tuesday, August 17, 2021

Great TinyML needs high-quality data

So far we have mostly written about how we enable AI applications on tiny hardware by using Binarized Neural Networks (BNNs). The use of BNNs helps us to reduce the required memory, the inference latency and energy consumption of our AI models, but there is something that we have been less vocal about that is at least as important for AI in the real world: high-quality training data.

To train tiny models, choose your data wisely

Deep learning models are famously hungry for training data, as more training data is usually the most effective way to improve accuracy. But once we started to train deep learning models that are truly tiny — with model sizes of a few hundred KB or less for computer vision tasks like person detection — we discovered that it is not so much the quantity but the quality of the training data that matters.

The key insight is that these tiny deep learning models have limited information capacity, so you cannot afford to waste precious KBs on learning irrelevant features. You have to be very strict in telling the model what you do and do not find important. We can communicate this to the model by carefully selecting its training data. Furthermore, the balance in your dataset is important because a compressed model tends to limit itself to perform well only on the concepts for which there are many samples in the training dataset. So we curate our datasets to ensure that all the important use-cases are included, and in the right balance.

Know your model

As tiny AI models become a part of our world, it is crucial that we know these models well. We have to understand in what situations they are reliable and where their pitfalls are.

High-level metrics like accuracy, precision and recall don’t provide anything close to the level of detail required here. Instead — taking inspiration from Andrej Karpathy’s Software 2.0 essay — we test our models in very specific situations. And rather than writing code we express these tests with data. Every model that we ship needs to surpass an accuracy threshold for very specific subsets of our dataset. For example, for person detection we have implemented tests for people standing far away or for people who are only showing their back, and for scenes containing coats and other fabrics that might confuse the model.

Samples from our person_standing_back data unit test Several samples from our person_standing_back data unit test.

These data unit tests guarantee that the model will work reliably for our customers’ use-cases and also enable us to ensure our models are not making mistakes due to fairness-related attributes, such as skin color or gender. As we experiment with new training algorithms, model architectures and training datasets, our data unit tests allow us to track the progress these inventions make on the trained models that we ship to customers.

Results of some of our data unit tests The outcome of some of our data unit tests — a standard MobileNet person detection model (left) versus Plumerai’s person detection model (right). The person_hallway_beyond_7m test result shows us that this model can not yet be reliably used to detect people from large distances.

Public datasets: handle with care

Public datasets are usually composed of photos taken by people for the enjoyment of people, such as photos of concerts, food or art. These photos are typically very different from the scenes in which TinyML/low-power AI products are deployed. For example, the dark photos in public datasets are mostly from concerts and almost always have people in them. So if this dataset is used to train a tiny deep learning model, the model will try to take a shortcut and associate dark images with the presence of people. This hack works fine as long as the model is evaluated on the same distribution of images, as is the case in the validation sets of public datasets, but will cause problems once the model is deployed in a smart doorbell camera. In addition, public datasets often contain misclassified images, images encoding harmful correlations and images that although technically labelled correctly are too difficult to classify correctly.

A selection of dark photos from Open Images Most dark photos in the public Open Images dataset are from concerts and contain people. A tiny deep learning model trained on this dataset will try to take a problematic shortcut and associate dark images with the presence of people.

To circumvent the many problems of public datasets, we at Plumerai collect our own data — straight from the cameras used in TinyML products in the situations that our products are intended for. But that does not mean that we do not use any public datasets to train our models. We want to benefit from the scale and diversity of these datasets, while mitigating their quality issues. So we use our own data and analysis methods to automatically identify specific issues stemming from sampling biases in the public datasets and solve those problematic correlations that our models are sensitive to.

Public data versus real world data The images in public person detection datasets (left) are largely irrelevant to the scenes encountered by TinyML devices in the real-world (right).

Example saliency map One of the tools we can use to debug our tiny AI models are saliency maps. They allow us to see what our models are correctly (left) or incorrectly (right) triggered by.

Building the tiny AI model factory

Although new training optimizers and new model architectures get most of the attention, there are many other components required to build great AI applications on tiny hardware. Data unit tests and tools for dataset curation are some of the components that are mentioned above, but there are many more. We build tools to identify what data needs to be labelled, use large models in the cloud for auto labeling our datasets and combine these with human labelling. We build the whole infrastructure that is necessary to go as fast as possible through our model development cycle: train a tiny AI model, test it, identify the failure cases, collect and label data for those failure cases and then go through this cycle again and again. All these components together allow us to build tiny but highly accurate AI models for the real world.

Example saliency map ML code is just one small component of the large and complex tiny AI model factory that we are building. From Sculley et al. (2015).


Thursday, July 1, 2021

BNNs for TinyML: performance beyond accuracy — CVPR 2021 Workshop on Binarized Neural Networks

Tim de Bruin, a Deep Learning Scientist at Plumerai, was one of the invited speakers at last week’s CVPR 2021 Workshop on Binarized Neural Networks for Computer Vision. Tim presented some of Plumerai’s work on solving the remaining challenges with BNNs and explains why optimizing for accuracy is not enough.


Tuesday, April 20, 2021

tinyML Summit 2021: Person Detection under Extreme Constraints — Lessons from the Field

At this year’s tinyML Summit, we presented our new solution for person detection with bounding boxes. We have developed a person detection model that runs in real time (895 ms latency) on an STM32H7B3 board (Arm Cortex-M7), a popular off-the-shelf available microcontroller. To the best of our knowledge, this is the first time anyone runs person detection in real-time on Arm Cortex-M based microcontrollers, and we are very excited to be bringing this new capability to customers!

Watch the video below to see the live demo and learn how Binarized Neural Networks are an integral part of our solution.


Wednesday, April 7, 2021

MLSys 2021: Design, Benchmark, and Deploy Binarized Neural Networks with Larq Compute Engine

We are very excited to present our paper Larq Compute Engine: Design, Benchmark, and Deploy State-of-the-Art Binarized Neural Networks at the MLSys 2021 conference this week!

Larq Compute Engine (LCE) is a state-of-the-art inference engine for Binarized Neural Networks (BNNs). LCE makes it possible for researchers to easily benchmark BNNs on mobile devices. Real latency benchmarks are essential for developing BNN architectures that actually run fast on device, and we hope LCE will help people to build even better BNNs.

In the paper, we discuss the design of LCE and go into the technical details behind the framework. LCE was designed for usability and performance. It extends TensorFlow Lite, which makes it possible to integrate binarized layers into any model supported by TFLite and run it with LCE. Integration with Larq makes it very easy to move from training to deployment. And thanks to our hand-optimized inference kernels and sophisticated MLIR-based graph optimizations, inference speed is phenomenal.

12x - 17x speedup from binarization The impact of binarization on the latency of different convolutional blocks in ResNets - binary is up to 17x faster than 32-bit floating point and 12x faster than 8-bit integers on a Pixel 1 phone.

BNNs are all about efficient inference: their tiny memory footprint and compact bitwise computations make them perfect for edge applications and small, battery powered devices. However, this requires the people building BNNs to have access to real, empirical measurements of their model as it is deployed. Lacking the right tools, researchers too often refrain to proxy metrics such as the number of FLOPs. LCE aims to change this.

In the paper, we demonstrate the value of measuring latency directly by analyzing the execution of some of the best existing BNN designs, such as R2B and BinaryDenseNets. We identify suboptimal points in these models and present QuickNet, a new model family which outperforms all existing BNNs on ImageNet in terms of latency and accuracy.

Breakdown of QuickNet latency Breakdown of the latency of QuickNet-Large (QNL), our new state-of-the-art BNN. Note how compared to Real-to-Binary Net (R2B) and BinaryDenseNet, we win mostly on the high-precision first layer and ‘glue’ layers.

We will be presenting the work at the MLSys conference on Wednesday the 7th of April. The published paper is publicly available, and registered attendees will be able to view our oral talk. We hope you will all attend, and if you want to learn more about running BNNs on all sorts of devices, contact us at [email protected]!


Friday, January 22, 2021

tinyML Talks: Binarized Neural Networks on microcontrollers

For the past few months we have been working very hard on something new: Binarized Neural Networks on microcontrollers. By bringing deep learning to cheap, low-power microcontrollers we remove price and energy barriers and make it possible to embed AI into basically any device, even for relatively complex tasks such as person detection.

This week, we gave a presentation as part of the tinyML Talks webcast series where we explained what we had to build to make this work. We demonstrated how combining our custom training algorithms, inference software stack and datasets results in a highly accurate and efficient solution - in this case for person presence detection on the STM32L4R9, an ARM Cortex-M4 microcontroller from STMicroelectronics. This technology can be implemented to trigger push notifications for smart home cameras, wake up devices when a person is detected, detect occupancy of meeting rooms and for many more applications. This is a step towards our goal of making deep learning ultra low-power and a future where battery-powered peel-and-stick sensors can perform complex AI tasks everywhere.

Our BNN models with our proprietary inference software are faster and more accurate on ARM Cortex-M4 microcontrollers than the best publicly available 8-bit deep learning models with TensorFlow Lite for Microcontrollers.

We quickly found out that great training algorithms and inference software are not enough when building a solution for the real world. Collecting and labeling our own data turned out to be crucial in dealing with a wide variety of difficulties.

Testing our models thoroughly is equally important. Instead of relying only on simplistic metrics such as accuracy, we developed a suite of unit tests to ensure reliable performance in many settings.

Unit tests for Deep Learning Applications

Bringing all of this together results in a highly robust and efficient solution for person presence detection, as we showed in our live demo.

Live Demo of Person Detection running on Cortex-M4 microcontroller

We’re just scratching the surface with person detection and we have a lot more in store for the coming months - more applications on Arm Cortex-M microcontrollers, and even better performance with our own IP-core for BNN inference on low-powered FPGAs and with the xcore.ai platform from XMOS.

What&rsquo;s Next?

We are very happy with the high attendance during the live webcast and a ton of questions were submitted during the Q&A. There was no time to answer all of them live, so we shared our answers on the tinyML forum and we’ll be sharing more of our progress at the tinyML Summit in March.

If you’re thinking about using Binarized Neural Networks to enable highly accurate deep learning on microcontrollers, get in touch!


Wednesday, April 1, 2020

XMOS and Plumerai partner to accelerate commercialisation of binarized neural networks

XMOS and Plumerai bring together their deep learning expertise across chip design and algorithms in a Binarized Neural Network capability, advancing the deployment of intelligence at the edge.

Bristol & London UK, 1 April 2020 – British technology companies XMOS and Plumerai have agreed a new strategic partnership that will support the development of binarized neural network (BNN) capabilities that enable AI to be embedded in a wide range of everyday devices efficiently at low-power and at low-cost.

The partnership will combine Plumerai’s Larq software library for training BNNs and the xcore.ai crossover processor from XMOS which provides native support for inference of BNNs. The combination of the two technologies will deliver a BNN capability that’s 2 to 4x more efficient than existing edge AI solutions.

This solution will enable a new generation of devices to run tasks that make our lives simpler and safer. This could include everything from identifying that a shopping package has been delivered to a safe place to managing traffic flows more efficiently, supporting remote healthcare applications or keeping shelves in stores stocked more efficiently. While BNNs are an emerging technology, the future potential is enormous.

The deep learning revolution is all around us today. But a typical application uses deep learning models with tens of millions of parameters – and despite the move to 16-bit and 8-bit encoding there is still an insatiable demand to increase the speed and efficiency of deep learning and AI systems. That’s where BNNs come in.

BNNs are the most efficient form of deep learning, offering to transform the economics and efficiency of edge intelligence by going all the way down to just a single bit. However, there are significant challenges involved in making BNNs commercially viable – for example, they demand specific attention in chip design for efficient inference and new software algorithms for training.

XMOS and Plumerai have combined their respective expertise in embedded chip design and deep learning algorithms to enable this breakthrough technology and bring AI to the devices all around us.

Mark Lippett, XMOS CEO says “BNNs gained prominence in the news recently with Apple’s purchase of Xnor.ai for a reported $200m. It’s little surprise that Apple is exploring AI capabilities at the edge, with advanced machine learning algorithms that can run efficiently in low-power, offline environments.

“Regardless of other moves in the market, our partnership with Plumerai is exciting for AI developers around the world. The combination of Larq and xcore.ai offers the first consolidated path to commercially deploying BNNs, which will be highly disruptive in intelligent embedded systems.”

Roeland Nusselder, Plumerai CEO adds “We are thrilled to join forces with the experienced team from XMOS to bring BNNs to the edge, and we share their excitement about the emerging era of intelligent connectivity. Binarized deep learning has tremendous potential for enabling a new generation of energy-efficient, AI-powered applications. Our two companies are perfectly positioned to turn this potential into reality.”

About XMOS

XMOS is a deep tech company at the leading edge of the AIoT. Since its inception in 2005, XMOS has had its finger on the pulse recognising and addressing the evolving market need. The company’s processors put intelligence, connectivity and enhanced computation at the core of smart products.

About Plumerai

Plumerai is making deep learning tiny and computationally radically more efficient to enable real-time inference on the edge – for automated warehouses, retail, smart cameras, micromobility and many more. The team is based in London, Amsterdam and Warsaw and is backed by world-class investors.


Tuesday, March 24, 2020

The Larq Ecosystem

State-of-the-art binarized neural networks and even faster inference

In our previous blog post, we announced Larq Compute Engine (LCE), our deployment solution for Binarized Neural Networks (BNNs). The combination of LCE, our training library Larq and the models in Larq Zoo forms the first end-to-end solution for anyone building applications using BNNs. In this post, we take a step back and look at this integrated ecosystem as a whole.

Continue reading on the Larq blog…


Monday, February 17, 2020

Announcing Larq Compute Engine v0.1

Optimized BNN inference for edge devices

We believe BNNs are the future of efficient inference, which is why we’ve developed tools to make it easier to train and research these models. Our open-source library Larq enables developers to build and train BNNs and integrates seamlessly with TensorFlow Keras. Larq Zoo provides implementations of major BNNs from the literature together with pretrained weights for state-of-the-art models.

But the ultimate goal of BNNs is to solve real-world problems on the edge. So once you’ve built and trained a BNN with Larq, how do you get it ready for efficient inference? Today, we’re introducing Larq Compute Engine to tackle that problem.

Continue reading on the Larq blog…