
Tuesday, November 9, 2021

Person presence detection on the Lattice CrossLink-NX FPGA

At Plumerai we believe in building vertically integrated solutions to enable the most advanced AI on embedded devices. We do extensive data collection, build our own intelligent data pipeline, design our own inference software and tiny AI models, which we train using our own training algorithms. We recently showed that our inference engine for Arm Cortex-M is the fastest and smallest in the world. This way we bring powerful AI to small microcontrollers that previously could not run such complex deep learning tasks.

One of our unique technologies is Binarized Neural Networks (BNNs), which replace 8-bit multiplications with simple single-bit operations. BNNs save significant memory and power, enabling either more weights and activations in the same silicon footprint to reach higher accuracy, or the same networks to run on smaller and cheaper chips that are powered by smaller batteries. Until now, we have been deploying our BNNs on Arm Cortex-M and Arm Cortex-A processors with great results. However, we felt there was more room for improvement, since these CPUs are built to run typical 8-bit and 32-bit workloads and don’t provide native support for the single-bit operations that our BNNs rely on.
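To make this concrete, here is a minimal sketch of the core BNN primitive: a dot product over {-1, +1} values computed with XNOR and popcount instead of multiply-accumulate. It is an illustrative bit-level model, not Plumerai’s kernel code.

```python
# Minimal sketch of a binarized dot product: weights and activations are
# restricted to {-1, +1}, encoded one bit per value (1 -> +1, 0 -> -1).
# The usual multiply-accumulate then reduces to XNOR followed by a popcount.

def binarize(values):
    """Pack a list of +/-1 values into an integer bit mask (1 bit per value)."""
    mask = 0
    for i, v in enumerate(values):
        if v > 0:
            mask |= 1 << i
    return mask

def binary_dot(x_bits, w_bits, n):
    """Dot product of two length-n {-1, +1} vectors from their bit encodings."""
    # XNOR marks the positions where x and w agree (their product is +1).
    agreements = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    # agree -> +1, disagree -> -1, so the sum is 2 * agreements - n.
    return 2 * agreements - n

x = [+1, -1, -1, +1, +1]
w = [+1, +1, -1, -1, +1]
assert binary_dot(binarize(x), binarize(w), len(x)) == sum(a * b for a, b in zip(x, w))
```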

Some of our customers asked whether our AI solutions also support FPGAs, since these provide incredible flexibility, cost efficiency, and tighter integration. FPGAs turn out to be an ideal platform for our models and inference engine, as they let us unlock the full potential of our BNNs: in an FPGA we can natively implement the binary arithmetic that our models need. We therefore decided to develop our own AI accelerator IP core named Ikva, which we introduce for the first time in this blog post. The Ikva accelerator runs our own BNNs and also efficiently supports 8-bit models. Of course, Ikva is fully supported by our extensive tool flow and by our ultra-fast, memory-efficient inference engine that’s integrated with TensorFlow Lite. A 32-bit RISC-V processor controls Ikva, captures the data from the camera and provides a programmer-friendly runtime environment. During the development of Ikva, we aimed to design a new hardware architecture for our optimized AI models while keeping it highly flexible and suitable for unknown future models. In contrast to other AI companies that develop only models, or only training software, or only AI processors, we focus on the full AI stack, and Ikva completes our offering: from data collection, to training and model development, to very efficient inference engines, and now all the way down to the most optimized hardware implementations.

As you know, we like AI that is tiny, and Ikva fits in small and low-power FPGAs like the Lattice CrossLink-NX. The architecture is scalable, both in memory and in compute power. This means we can target a wide variety of FPGAs, ensure we fit next to other IP blocks, and extract maximum performance out of the resources that are available in the target FPGA device.

The video above showcases one of our proprietary person presence detection models together with our inference software running on the Ikva IP core in a Lattice CrossLink-NX LIFCL-40 FPGA. This is a low-power and low-cost 6x6mm FPGA that is available off-the-shelf and includes a native MIPI camera interface, further reducing the number of components in the system.

Ikva runs our robust and highly accurate person presence detection model 10x faster on the CrossLink-NX FPGA than on a typical Arm Cortex-M microcontroller. Alternatively, the frame rate can be scaled down to 1 or 2 FPS for those applications where low energy consumption is key.

The Lattice CrossLink-NX Voice & Vision Machine Learning Board with the CrossLink-NX LIFCL-40 FPGA.

There are many target applications for person presence detection. For instance, in your home, to automatically turn off your TV, lights, or heating when there’s no one in the room. Outside your home, your doorbell can send you a signal when someone is walking up to your front door, or a small camera can detect an unexpected visitor in your backyard. In the office, your PC can automatically lock the screen when you leave. Elderly care can be improved by knowing how much time a person spends in bed, in the living room, or outside. The possibilities are endless, whether it’s in the home, on the road, in the city, at the office or on the factory floor. Accurate, inexpensive and battery-powered person detection will enhance our lives.

Of course, besides running Plumerai’s optimized BNN models, you can also run your own model on the Ikva core, or integrate Ikva into your FPGA-based device. We’re excited to enable extremely powerful AI to go to places it couldn’t go before.

The Ikva IP core, the supporting tool flow, and optimized person detection models are available today. Contact us to receive more information or schedule a video call to see our live demonstration. We’re eager to discuss how we can enable your products with Ikva.


Monday, October 4, 2021

The world’s fastest deep learning inference software for Arm Cortex-M

At Plumerai we enable our customers to perform increasingly complex AI tasks on tiny embedded hardware. We’re proud to announce that our inference software for Arm Cortex-M microcontrollers is the fastest and most memory-efficient in the world, for both Binarized Neural Networks and for 8-bit deep learning models. Our inference software is an essential component of our solution, since it directs resource management akin to an operating system. It has 40% lower latency and requires 49% less RAM than TensorFlow Lite for Microcontrollers with Arm’s CMSIS-NN kernels while retaining the same accuracy. It also outperforms any other deep learning inference software for Arm Cortex-M:

|                                                           | Inference time | RAM usage |
|-----------------------------------------------------------|----------------|-----------|
| TensorFlow Lite for Microcontrollers 2.5 (with CMSIS-NN)   | 129 ms         | 155 KiB   |
| Edge Impulse’s EON                                         | 120 ms         | 153 KiB   |
| MIT’s TinyEngine [1]                                       | 124 ms         | 98 KiB    |
| STMicroelectronics’ X-CUBE-AI                              | 103 ms         | 109 KiB   |
| Plumerai’s inference software                              | 77 ms          | 80 KiB    |

Model: MobileNetV2 [2][3] (alpha=0.30, resolution=80x80, classes=1000)
Board: STM32F746G-Discovery at 216 MHz with 320 KiB RAM and 1 MiB flash
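For reference, a benchmark model along these lines can be built and quantized with the standard TensorFlow/TFLite APIs. The sketch below is only an approximation of the setup described above: the final-convolution filter scaling from footnote [2] is not reproduced, and the representative-dataset generator is a random placeholder.

```python
import numpy as np
import tensorflow as tf

# Approximate the benchmark model: MobileNetV2 with alpha=0.30 on 80x80 inputs,
# 1000 classes, randomly initialized (no pretrained weights exist for this alpha).
model = tf.keras.applications.MobileNetV2(
    input_shape=(80, 80, 3), alpha=0.30, classes=1000, weights=None
)

def representative_data():
    # Placeholder calibration data; a real setup would use samples
    # drawn from the target camera and deployment domain.
    for _ in range(100):
        yield [np.random.uniform(-1, 1, (1, 80, 80, 3)).astype(np.float32)]

# Convert to a fully int8-quantized TFLite model, as consumed by
# TensorFlow Lite for Microcontrollers.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("mobilenetv2_030_80.tflite", "wb") as f:
    f.write(converter.convert())
```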

Our inference software builds upon TensorFlow Lite for Microcontrollers, so it supports all of the same operations and more. But since resources are scarce on a microcontroller, we do not rely on TensorFlow’s or Arm’s kernels for the most performance-critical layers. Instead, for those layer types we developed custom kernel code, optimized for the lowest latency and memory usage. This includes optimized code for regular convolutions, depthwise convolutions, fully-connected layers, various pooling layers and more. To become faster than the already heavily optimized Arm Cortex-M specific CMSIS-NN kernels, we had to go deep into the inner loops and also rethink the higher-level algorithms. This includes optimizations such as hand-written assembly blocks, improved register usage, pre-processing of weights and input activations, and template-based loop unrolling.

Although these generic per-layer-type optimizations resulted in great speed-ups, we went further and squeezed out every last bit of performance from the Arm Cortex-M microcontroller. To do that, we perform specific optimizations for each layer in a neural network. For instance, rather than only optimizing convolutions in general, our inference software makes specific improvements based on the actual values of layer parameters such as kernel size, stride and padding. Since we do not know upfront which neural networks our inference software might run, we make these optimizations together with the compiler. This is achieved by generating code in an automated pre-processing step that takes the neural network as input. We then guide the compiler to do all the necessary constant propagation, function inlining and loop unrolling to achieve the lowest possible latency.
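A minimal sketch of what such a pre-processing step could look like: a generator walks the layers of a known network and emits per-layer C functions in which the layer parameters are baked in as compile-time constants, so the compiler can propagate them, inline, and unroll. The layer descriptions and template below are hypothetical, not Plumerai’s actual generator.

```python
# Hypothetical code generator: specialize a convolution kernel for each layer
# of a known network so the C compiler sees all shape parameters as constants.

CONV_TEMPLATE = """\
// Auto-generated: convolution specialized for layer "{name}".
static inline void conv_{name}(const int8_t *in, const int8_t *w, int8_t *out) {{
    enum {{ KH = {kh}, KW = {kw}, STRIDE = {stride}, PAD = {pad},
           IN_CH = {in_ch}, OUT_CH = {out_ch} }};
    // The generic inner loops go here; with the parameters above visible as
    // compile-time constants, the compiler can constant-propagate, fully
    // unroll the KH/KW loops and keep the hot accumulators in registers.
}}
"""

def generate_kernels(layers):
    """layers: list of dicts describing each convolution layer of the network."""
    return "\n".join(CONV_TEMPLATE.format(**layer) for layer in layers)

network = [
    {"name": "conv0", "kh": 3, "kw": 3, "stride": 2, "pad": 1, "in_ch": 3, "out_ch": 8},
    {"name": "conv1", "kh": 1, "kw": 1, "stride": 1, "pad": 0, "in_ch": 8, "out_ch": 16},
]

with open("generated_kernels.h", "w") as f:
    f.write(generate_kernels(network))
```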

Memory usage is an important constraint on embedded devices; however fast or slow the software is, it has to fit in memory to run at all. TensorFlow Lite for Microcontrollers already comes with a memory planner that ensures a tensor only takes up space while there is a layer using it. We further optimized memory usage with a smart offline memory planner that analyzes the memory access patterns of each layer of the network. Depending on properties such as filter size, the memory planner allows the input and output of a layer to partially or even completely overlap, effectively computing the layer in-place.
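The sketch below illustrates the general idea of such an offline planner under simplified assumptions: each tensor has a size and a [first_use, last_use] lifetime, and a layer that is known to be in-place-safe may fully reuse its input buffer. It is a toy greedy planner, not the planner shipped in the inference software; a production planner would additionally deal with alignment, scratch buffers and partial overlaps.

```python
# Toy offline memory planner: place tensors in one RAM arena so that tensors
# with overlapping lifetimes never overlap in memory, and let layers that are
# known to be safe compute their output in place over their input.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Tensor:
    name: str
    size: int          # bytes
    first_use: int     # index of the layer that produces it
    last_use: int      # index of the last layer that reads it
    in_place_with: Optional[str] = None  # name of an input it may fully overlap

def plan(tensors):
    offsets = {}
    for t in sorted(tensors, key=lambda t: -t.size):   # place big tensors first
        if t.in_place_with in offsets:                 # in-place: reuse the input buffer
            offsets[t.name] = offsets[t.in_place_with]
            continue
        # First-fit: walk the already-placed tensors in offset order and bump
        # our offset past any tensor that overlaps both in time and in space.
        offset = 0
        placed = sorted((o for o in tensors if o.name in offsets),
                        key=lambda o: offsets[o.name])
        for other in placed:
            time_overlap = not (t.last_use < other.first_use or
                                other.last_use < t.first_use)
            space_overlap = (offset < offsets[other.name] + other.size and
                             offsets[other.name] < offset + t.size)
            if time_overlap and space_overlap:
                offset = offsets[other.name] + other.size
        offsets[t.name] = offset
    arena_size = max(offsets[t.name] + t.size for t in tensors)
    return offsets, arena_size

tensors = [
    Tensor("input",  19200, 0, 1),
    Tensor("act1",    9600, 1, 2, in_place_with="input"),  # an in-place-safe layer
    Tensor("act2",    4800, 2, 3),
    Tensor("logits",  1000, 3, 3),
]
offsets, total = plan(tensors)
print(offsets, "arena size:", total, "bytes")
```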

Besides Arm Cortex-M, we also optimize our inference software for Arm Cortex-A and RISC-V architectures. And if the above results are still not fast enough for your application, we go even further. We make our AI tiny and radically more efficient by using Binarized Neural Networks (BNNs) - deep learning models that use only a single bit to encode each weight and activation. We are building improved deep learning model architectures and training algorithms for BNNs, we are designing a custom IP-core for customers with FPGAs and we are composing optimized training datasets. All these improvements mean that we can process more frames per second, save more energy, run larger and more accurate AI models and deploy on cheaper hardware.

Get in touch if you want to use the world’s fastest inference software and the most advanced AI on your embedded device.

Wednesday, October 13, 2021: Updated memory usage to match newest version of the inference software.


  1. Results copied from https://github.com/mit-han-lab/tinyml/tree/master/mcunet; all other results were measured by us.

  2. The explicit padding layers in MobileNetV2 were fused, and to be able to compare with the TinyEngine results the number of filters in the final convolution layer (1280) was scaled by alpha to 384. The exact model can be downloaded here.

  3. microTVM ran out of memory; other benchmarks show that microTVM is generally a bit slower than CMSIS-NN.


Tuesday, August 17, 2021

Great TinyML needs high-quality data

So far we have mostly written about how we enable AI applications on tiny hardware by using Binarized Neural Networks (BNNs). BNNs help us reduce the memory footprint, inference latency and energy consumption of our AI models, but there is something we have been less vocal about that is at least as important for AI in the real world: high-quality training data.

To train tiny models, choose your data wisely

Deep learning models are famously hungry for training data, as more training data is usually the most effective way to improve accuracy. But once we started to train deep learning models that are truly tiny — with model sizes of a few hundred KB or less for computer vision tasks like person detection — we discovered that it is not so much the quantity but the quality of the training data that matters.

The key insight is that these tiny deep learning models have limited information capacity, so you cannot afford to waste precious KBs on learning irrelevant features. You have to be very strict in telling the model what you do and do not find important. We can communicate this to the model by carefully selecting its training data. Furthermore, the balance in your dataset is important because a compressed model tends to limit itself to perform well only on the concepts for which there are many samples in the training dataset. So we curate our datasets to ensure that all the important use-cases are included, and in the right balance.
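As a toy illustration of this kind of curation, the sketch below resamples a dataset by scenario tags so that under-represented but important use-cases are not drowned out. The scenario names and target fractions are hypothetical, not our actual curation pipeline.

```python
import random
from collections import defaultdict

# Hypothetical curation step: resample a labelled dataset so that each scenario
# (e.g. lighting condition, camera placement, person pose) makes up a chosen
# fraction of the training set, instead of whatever fraction it happened to
# have when the data was collected.

def rebalance(samples, target_fractions, total_size, seed=0):
    """samples: list of dicts with a 'scenario' key; returns a resampled list."""
    rng = random.Random(seed)
    by_scenario = defaultdict(list)
    for s in samples:
        by_scenario[s["scenario"]].append(s)

    curated = []
    for scenario, fraction in target_fractions.items():
        pool = by_scenario[scenario]
        n = round(fraction * total_size)
        # Sample with replacement if a scenario is under-represented in the pool.
        picks = rng.choices(pool, k=n) if n > len(pool) else rng.sample(pool, n)
        curated.extend(picks)
    rng.shuffle(curated)
    return curated

target_fractions = {"person_standing_back": 0.25, "dark_room_no_person": 0.25,
                    "person_far_away": 0.25, "empty_room_with_coats": 0.25}
```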

Know your model

As tiny AI models become a part of our world, it is crucial that we know these models well. We have to understand in what situations they are reliable and where their pitfalls are.

High-level metrics like accuracy, precision and recall don’t provide anything close to the level of detail required here. Instead — taking inspiration from Andrej Karpathy’s Software 2.0 essay — we test our models in very specific situations. And rather than writing code we express these tests with data. Every model that we ship needs to surpass an accuracy threshold for very specific subsets of our dataset. For example, for person detection we have implemented tests for people standing far away or for people who are only showing their back, and for scenes containing coats and other fabrics that might confuse the model.
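A data unit test of this kind can be expressed very compactly. The sketch below shows the general shape with hypothetical subset names, thresholds and fixtures, not Plumerai’s actual test suite.

```python
# Hypothetical data unit tests: every model we ship must exceed a per-subset
# accuracy threshold on curated slices of the evaluation data.

import numpy as np
import pytest

# (subset name, minimum accuracy) -- names and thresholds are illustrative.
DATA_UNIT_TESTS = [
    ("person_standing_back", 0.95),
    ("person_hallway_beyond_7m", 0.90),
    ("empty_room_with_coats", 0.97),
]

def accuracy(model, images, labels):
    predictions = np.argmax(model.predict(images), axis=-1)
    return float(np.mean(predictions == labels))

@pytest.mark.parametrize("subset,threshold", DATA_UNIT_TESTS)
def test_model_on_data_subset(model, load_subset, subset, threshold):
    # `model` and `load_subset` are assumed to be provided as pytest fixtures.
    images, labels = load_subset(subset)
    acc = accuracy(model, images, labels)
    assert acc >= threshold, f"{subset}: accuracy {acc:.3f} < {threshold:.2f}"
```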

Several samples from our person_standing_back data unit test.

These data unit tests guarantee that the model will work reliably for our customers’ use-cases and also enable us to ensure our models are not making mistakes due to fairness-related attributes, such as skin color or gender. As we experiment with new training algorithms, model architectures and training datasets, our data unit tests allow us to track the progress these inventions make on the trained models that we ship to customers.

The outcome of some of our data unit tests — a standard MobileNet person detection model (left) versus Plumerai’s person detection model (right). The person_hallway_beyond_7m test result shows that this model cannot yet be reliably used to detect people from large distances.

Public datasets: handle with care

Public datasets are usually composed of photos taken by people for the enjoyment of people, such as photos of concerts, food or art. These photos are typically very different from the scenes in which TinyML/low-power AI products are deployed. For example, the dark photos in public datasets are mostly from concerts and almost always have people in them. So if this dataset is used to train a tiny deep learning model, the model will try to take a shortcut and associate dark images with the presence of people. This shortcut works fine as long as the model is evaluated on the same distribution of images, as is the case for the validation sets of public datasets, but it will cause problems once the model is deployed in a smart doorbell camera. In addition, public datasets often contain misclassified images, images encoding harmful correlations, and images that, although technically labelled correctly, are simply too difficult to classify.

Most dark photos in the public Open Images dataset are from concerts and contain people. A tiny deep learning model trained on this dataset will try to take a problematic shortcut and associate dark images with the presence of people.

To circumvent the many problems of public datasets, we at Plumerai collect our own data — straight from the cameras used in TinyML products in the situations that our products are intended for. But that does not mean that we do not use any public datasets to train our models. We want to benefit from the scale and diversity of these datasets, while mitigating their quality issues. So we use our own data and analysis methods to automatically identify specific issues stemming from sampling biases in the public datasets and solve those problematic correlations that our models are sensitive to.
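One simple form of such an analysis is sketched below: compare an image-level statistic (here, mean brightness) between positive and negative samples to surface the dark-image-implies-person shortcut described above. The dataset loader is a placeholder, and this is only an illustration of the general approach.

```python
import numpy as np

# Toy bias check: if "person" images are systematically darker than
# "no person" images, a tiny model can learn brightness as a shortcut
# instead of learning what a person looks like.

def mean_brightness(image):
    """image: uint8 array of shape (H, W, 3); returns mean intensity in [0, 255]."""
    return float(np.mean(image))

def brightness_by_label(dataset):
    """dataset: iterable of (image, label) pairs with label in {0, 1}."""
    buckets = {0: [], 1: []}
    for image, label in dataset:
        buckets[label].append(mean_brightness(image))
    return {label: (np.mean(vals), np.std(vals)) for label, vals in buckets.items()}

# A mean for label 1 ("person") far below the mean for label 0 would indicate
# that "person" correlates with dark images in this dataset -- a sampling bias
# worth correcting before training.
# stats = brightness_by_label(load_dataset("person_detection_subset"))  # placeholder loader
```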

The images in public person detection datasets (left) are largely irrelevant to the scenes encountered by TinyML devices in the real world (right).

One of the tools we use to debug our tiny AI models is the saliency map, which allows us to see what our models are correctly (left) or incorrectly (right) triggered by.
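For a floating-point version of a model, a basic gradient saliency map can be computed in a few lines of TensorFlow. The sketch below is a generic gradient-based approach, not necessarily the exact method behind the figures above.

```python
import tensorflow as tf

def gradient_saliency(model, image, class_index=0):
    """Gradient of the predicted score w.r.t. the input pixels.

    model: a tf.keras model; image: float32 array of shape (H, W, 3),
    preprocessed the same way as the training data.
    """
    x = tf.convert_to_tensor(image[None, ...])
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x, training=False)[0, class_index]
    grads = tape.gradient(score, x)
    # Per-pixel saliency: the largest absolute gradient across the channels.
    return tf.reduce_max(tf.abs(grads), axis=-1)[0].numpy()
```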

Building the tiny AI model factory

Although new training optimizers and new model architectures get most of the attention, there are many other components required to build great AI applications on tiny hardware. Data unit tests and tools for dataset curation are some of the components mentioned above, but there are many more. We build tools to identify what data needs to be labelled, use large models in the cloud to auto-label our datasets, and combine these with human labelling. We build the whole infrastructure that is necessary to move as fast as possible through our model development cycle: train a tiny AI model, test it, identify the failure cases, collect and label data for those failure cases, and then go through this cycle again and again. All these components together allow us to build tiny but highly accurate AI models for the real world.
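One piece of that cycle, deciding which unlabelled frames are worth sending to labelling, can be sketched as a simple uncertainty-based selection step. The helper names below are hypothetical and the real tooling is considerably more involved.

```python
import numpy as np

# Hypothetical labelling-selection step: run the current tiny model over an
# unlabelled pool and pick the frames it is least certain about, since those
# are the most valuable ones to label and add to the next training round.

def prediction_entropy(probabilities):
    """Entropy of a per-frame class distribution; high means uncertain."""
    p = np.clip(probabilities, 1e-9, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def select_for_labelling(model, unlabelled_frames, budget=500):
    probs = model.predict(np.stack(unlabelled_frames))   # shape (N, num_classes)
    scores = prediction_entropy(probs)
    most_uncertain = np.argsort(scores)[::-1][:budget]
    return [unlabelled_frames[i] for i in most_uncertain]
```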

ML code is just one small component of the large and complex tiny AI model factory that we are building. From Sculley et al. (2015).


Thursday, July 1, 2021

BNNs for TinyML: performance beyond accuracy — CVPR 2021 Workshop on Binarized Neural Networks

Tim de Bruin, a Deep Learning Scientist at Plumerai, was one of the invited speakers at last week’s CVPR 2021 Workshop on Binarized Neural Networks for Computer Vision. Tim presented some of Plumerai’s work on solving the remaining challenges with BNNs and explained why optimizing for accuracy alone is not enough.


Tuesday, April 20, 2021

tinyML Summit 2021: Person Detection under Extreme Constraints — Lessons from the Field

At this year’s tinyML Summit, we presented our new solution for person detection with bounding boxes. We have developed a person detection model that runs in real time (895 ms latency) on an STM32H7B3 board (Arm Cortex-M7), a popular off-the-shelf microcontroller. To the best of our knowledge, this is the first time anyone has run person detection in real time on an Arm Cortex-M based microcontroller, and we are very excited to bring this new capability to customers!

Watch the video below to see the live demo and learn how Binarized Neural Networks are an integral part of our solution.


Wednesday, April 7, 2021

MLSys 2021: Design, Benchmark, and Deploy Binarized Neural Networks with Larq Compute Engine

We are very excited to present our paper Larq Compute Engine: Design, Benchmark, and Deploy State-of-the-Art Binarized Neural Networks at the MLSys 2021 conference this week!

Larq Compute Engine (LCE) is a state-of-the-art inference engine for Binarized Neural Networks (BNNs). LCE makes it possible for researchers to easily benchmark BNNs on mobile devices. Real latency benchmarks are essential for developing BNN architectures that actually run fast on device, and we hope LCE will help people to build even better BNNs.

In the paper, we discuss the design of LCE and go into the technical details behind the framework. LCE was designed for usability and performance. It extends TensorFlow Lite, which makes it possible to integrate binarized layers into any model supported by TFLite and run it with LCE. Integration with Larq makes it very easy to move from training to deployment. And thanks to our hand-optimized inference kernels and sophisticated MLIR-based graph optimizations, inference speed is phenomenal.
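To give a feel for the workflow, the sketch below loads a pretrained BNN from Larq Zoo and converts it with Larq Compute Engine into a TFLite flatbuffer. It assumes the larq-zoo and larq-compute-engine packages and their documented conversion entry point.

```python
import larq_zoo as lqz
import larq_compute_engine as lce

# Load a pretrained binarized model from Larq Zoo (ImageNet weights).
model = lqz.sota.QuickNet(weights="imagenet")

# Convert the Keras model with LCE's converter, which applies the MLIR-based
# graph optimizations and emits a TFLite flatbuffer containing binary ops.
tflite_model = lce.convert_keras_model(model)

with open("quicknet.tflite", "wb") as f:
    f.write(tflite_model)
```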

The impact of binarization on the latency of different convolutional blocks in ResNets - binary is up to 17x faster than 32-bit floating point and 12x faster than 8-bit integers on a Pixel 1 phone.

BNNs are all about efficient inference: their tiny memory footprint and compact bitwise computations make them perfect for edge applications and small, battery-powered devices. However, this requires the people building BNNs to have access to real, empirical measurements of their models as they are deployed. Lacking the right tools, researchers too often fall back on proxy metrics such as the number of FLOPs. LCE aims to change this.

In the paper, we demonstrate the value of measuring latency directly by analyzing the execution of some of the best existing BNN designs, such as R2B and BinaryDenseNets. We identify suboptimal points in these models and present QuickNet, a new model family which outperforms all existing BNNs on ImageNet in terms of latency and accuracy.

Breakdown of the latency of QuickNet-Large (QNL), our new state-of-the-art BNN. Note how compared to Real-to-Binary Net (R2B) and BinaryDenseNet, we win mostly on the high-precision first layer and ‘glue’ layers.

We will be presenting the work at the MLSys conference on Wednesday the 7th of April. The published paper is publicly available, and registered attendees will be able to view our oral talk. We hope you will all attend, and if you want to learn more about running BNNs on all sorts of devices, contact us at [email protected]!


Friday, January 22, 2021

tinyML Talks: Binarized Neural Networks on microcontrollers

For the past few months we have been working very hard on something new: Binarized Neural Networks on microcontrollers. By bringing deep learning to cheap, low-power microcontrollers we remove price and energy barriers and make it possible to embed AI into basically any device, even for relatively complex tasks such as person detection.

This week, we gave a presentation as part of the tinyML Talks webcast series where we explained what we had to build to make this work. We demonstrated how combining our custom training algorithms, inference software stack and datasets results in a highly accurate and efficient solution - in this case for person presence detection on the STM32L4R9, an Arm Cortex-M4 microcontroller from STMicroelectronics. This technology can be used to trigger push notifications for smart home cameras, wake up devices when a person is detected, detect occupancy of meeting rooms, and much more. This is a step towards our goal of making deep learning ultra low-power and a future where battery-powered peel-and-stick sensors can perform complex AI tasks everywhere.

Our BNN models with our proprietary inference software are faster and more accurate on Arm Cortex-M4 microcontrollers than the best publicly available 8-bit deep learning models with TensorFlow Lite for Microcontrollers.

We quickly found out that great training algorithms and inference software are not enough when building a solution for the real world. Collecting and labeling our own data turned out to be crucial in dealing with a wide variety of difficulties.

Testing our models thoroughly is equally important. Instead of relying only on simplistic metrics such as accuracy, we developed a suite of unit tests to ensure reliable performance in many settings.

Unit tests for Deep Learning Applications

Bringing all of this together results in a highly robust and efficient solution for person presence detection, as we showed in our live demo.

Live Demo of Person Detection running on Cortex-M4 microcontroller

We’re just scratching the surface with person detection and we have a lot more in store for the coming months - more applications on Arm Cortex-M microcontrollers, and even better performance with our own IP core for BNN inference on low-power FPGAs and with the xcore.ai platform from XMOS.

What’s Next?

We are very happy with the high attendance during the live webcast and a ton of questions were submitted during the Q&A. There was no time to answer all of them live, so we shared our answers on the tinyML forum and we’ll be sharing more of our progress at the tinyML Summit in March.

If you’re thinking about using Binarized Neural Networks to enable highly accurate deep learning on microcontrollers, get in touch!


Wednesday, April 1, 2020

XMOS and Plumerai partner to accelerate commercialisation of binarized neural networks

XMOS and Plumerai bring together their deep learning expertise across chip design and algorithms in a Binarized Neural Network capability, advancing the deployment of intelligence at the edge.

Bristol & London UK, 1 April 2020 – British technology companies XMOS and Plumerai have agreed a new strategic partnership that will support the development of binarized neural network (BNN) capabilities that enable AI to be embedded in a wide range of everyday devices efficiently at low-power and at low-cost.

The partnership will combine Plumerai’s Larq software library for training BNNs with the xcore.ai crossover processor from XMOS, which provides native support for BNN inference. The combination of the two technologies will deliver a BNN capability that’s 2 to 4x more efficient than existing edge AI solutions.

This solution will enable a new generation of devices to run tasks that make our lives simpler and safer. This could include everything from identifying that a shopping package has been delivered to a safe place to managing traffic flows more efficiently, supporting remote healthcare applications or keeping shelves in stores stocked more efficiently. While BNNs are an emerging technology, the future potential is enormous.

The deep learning revolution is all around us today. But a typical application uses deep learning models with tens of millions of parameters – and despite the move to 16-bit and 8-bit encoding there is still an insatiable demand to increase the speed and efficiency of deep learning and AI systems. That’s where BNNs come in.

BNNs are the most efficient form of deep learning, offering to transform the economics and efficiency of edge intelligence by going all the way down to just a single bit. However, there are significant challenges involved in making BNNs commercially viable – for example, they demand specific attention in chip design for efficient inference and new software algorithms for training.

XMOS and Plumerai have combined their respective expertise in embedded chip design and deep learning algorithms to enable this breakthrough technology and bring AI to the devices all around us.

Mark Lippett, XMOS CEO says “BNNs gained prominence in the news recently with Apple’s purchase of Xnor.ai for a reported $200m. It’s little surprise that Apple is exploring AI capabilities at the edge, with advanced machine learning algorithms that can run efficiently in low-power, offline environments.

“Regardless of other moves in the market, our partnership with Plumerai is exciting for AI developers around the world. The combination of Larq and xcore.ai offers the first consolidated path to commercially deploying BNNs, which will be highly disruptive in intelligent embedded systems.”

Roeland Nusselder, Plumerai CEO adds “We are thrilled to join forces with the experienced team from XMOS to bring BNNs to the edge, and we share their excitement about the emerging era of intelligent connectivity. Binarized deep learning has tremendous potential for enabling a new generation of energy-efficient, AI-powered applications. Our two companies are perfectly positioned to turn this potential into reality.”

About XMOS

XMOS is a deep tech company at the leading edge of the AIoT. Since its inception in 2005, XMOS has had its finger on the pulse, recognising and addressing evolving market needs. The company’s processors put intelligence, connectivity and enhanced computation at the core of smart products.

About Plumerai

Plumerai is making deep learning tiny and computationally radically more efficient to enable real-time inference on the edge – for automated warehouses, retail, smart cameras, micromobility and many more. The team is based in London, Amsterdam and Warsaw and is backed by world-class investors.


Tuesday, March 24, 2020

The Larq Ecosystem

State-of-the-art binarized neural networks and even faster inference

In our previous blog post, we announced Larq Compute Engine (LCE), our deployment solution for Binarized Neural Networks (BNNs). The combination of LCE, our training library Larq and the models in Larq Zoo forms the first end-to-end solution for anyone building applications using BNNs. In this post, we take a step back and look at this integrated ecosystem as a whole.

Continue reading on the Larq blog…


Monday, February 17, 2020

Announcing Larq Compute Engine v0.1

Optimized BNN inference for edge devices

We believe BNNs are the future of efficient inference, which is why we’ve developed tools to make it easier to train and research these models. Our open-source library Larq enables developers to build and train BNNs and integrates seamlessly with TensorFlow Keras. Larq Zoo provides implementations of major BNNs from the literature together with pretrained weights for state-of-the-art models.
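As a flavour of the API, a small binarized model can be defined with Larq’s quantized layers as below. The architecture here is an arbitrary toy for illustration, not one of the Larq Zoo models.

```python
import tensorflow as tf
import larq as lq

# Toy BNN: binarized convolutions with the straight-through-estimator sign
# quantizer; the first layer is kept in higher precision, as is common for BNNs.
kwargs = dict(input_quantizer="ste_sign",
              kernel_quantizer="ste_sign",
              kernel_constraint="weight_clip",
              use_bias=False)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", use_bias=False,
                           input_shape=(96, 96, 3)),
    tf.keras.layers.BatchNormalization(),
    lq.layers.QuantConv2D(64, 3, padding="same", **kwargs),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.GlobalAveragePooling2D(),
    lq.layers.QuantDense(2, **kwargs),   # e.g. person / no person
    tf.keras.layers.Activation("softmax"),
])

lq.models.summary(model)  # reports binarized vs full-precision parameters
```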

But the ultimate goal of BNNs is to solve real-world problems on the edge. So once you’ve built and trained a BNN with Larq, how do you get it ready for efficient inference? Today, we’re introducing Larq Compute Engine to tackle that problem.

Continue reading on the Larq blog…