Monday, October 4, 2021
At Plumerai we enable our customers to perform increasingly complex AI tasks on tiny embedded hardware. We’re proud to announce that our inference software for Arm Cortex-M microcontrollers is the fastest and most memory-efficient in the world, for both Binarized Neural Networks and for 8-bit deep learning models. Our inference software is an essential component of our solution, since it directs resource management akin to an operating system. It has 40% lower latency and requires 49% less RAM than TensorFlow Lite for Microcontrollers with Arm’s CMSIS-NN kernels while retaining the same accuracy. It also outperforms any other deep learning inference software for Arm Cortex-M:
|Inference time||RAM usage|
|TensorFlow Lite for Microcontrollers 2.5 (with CMSIS-NN)||129 ms||155 KiB|
|Edge Impulse’s EON||120 ms||153 KiB|
|MIT’s TinyEngine 1||124 ms||98 KiB|
|STMicroelectronics’ X-CUBE-AI||103 ms||109 KiB|
|Plumerai’s inference software||77 ms||80 KiB|
Our inference software builds upon TensorFlow Lite for Microcontrollers, such that it supports all of the same operations and more. But since resources are scarce on a microcontroller, we do not rely on TensorFlow or Arm’s kernels for the most performance-critical layers. Instead, for those layer types we developed custom kernel code, optimized for lowest latency and memory usage. This includes optimized code for regular convolutions, depthwise convolutions, fully-connected layers, various pooling layers and more. To become faster than the already heavily optimized Arm Cortex-M specific CMSIS-NN kernels, we had to go deep inside the inner-loops and also rethink the higher-level algorithms. This includes optimizations such as hand-written assembly blocks, improved register usage, pre-processing of weights and input activations and template-based loop-unrolling.
Although these generic per-layer-type optimizations resulted in great speed-ups, we went further and squeezed out every last bit of performance from the Arm Cortex-M microcontroller. To do that, we perform specific optimizations for each layer in a neural network. For instance, rather than only optimizing convolutions in general, our inference software makes specific improvements based on all actual values of layer parameters such as kernel sizes, strides, padding, etc. Since we do not know upfront which neural networks our inference software might run, we make these optimizations together with the compiler. This is achieved by generating code in an automated pre-processing step using the neural network as input. We then guide the compiler to do all the necessary constant propagation, function inlining and loop unrolling to achieve the lowest possible latency.
Memory usage is an important constraint on embedded devices; however fast or slow the software is, it has to fit in memory to run at all. TensorFlow Lite for Microcontrollers already comes with a memory planner that ensures a tensor only takes up space while there is a layer using it. We further optimized memory usage with a smart offline memory planner that analyzes the memory access patterns of each layer of the network. Depending on properties such as filter size, the memory planner allows the input and output of a layer to partially or even completely overlap, effectively computing the layer in-place.
Besides Arm Cortex-M, we also optimize our inference software for Arm Cortex-A and RISC-V architectures. And if the above results are still not fast enough for your application, we go even further. We make our AI tiny and radically more efficient by using Binarized Neural Networks (BNNs) - deep learning models that use only a single bit to encode each weight and activation. We are building improved deep learning model architectures and training algorithms for BNNs, we are designing a custom IP-core for customers with FPGAs and we are composing optimized training datasets. All these improvements mean that we can process more frames per second, save more energy, run larger and more accurate AI models and deploy on cheaper hardware.
Get in touch if you want to use the world’s fastest inference software and the most advanced AI on your embedded device.
Wednesday, October 13, 2021: Updated memory usage to match newest version of the inference software.
Results copied from https://github.com/mit-han-lab/tinyml/tree/master/mcunet, all other results were measured by us. ↩︎
The explicit padding layers in MobileNetV2 were fused, and to be able to compare with the TinyEngine results the number of filters in the final convolution layer (1280) was scaled by alpha to 384. The exact model can be downloaded here. ↩︎
Tuesday, August 17, 2021
So far we have mostly written about how we enable AI applications on tiny hardware by using Binarized Neural Networks (BNNs). The use of BNNs helps us to reduce the required memory, the inference latency and energy consumption of our AI models, but there is something that we have been less vocal about that is at least as important for AI in the real world: high-quality training data.
To train tiny models, choose your data wisely
Deep learning models are famously hungry for training data, as more training data is usually the most effective way to improve accuracy. But once we started to train deep learning models that are truly tiny — with model sizes of a few hundred KB or less for computer vision tasks like person detection — we discovered that it is not so much the quantity but the quality of the training data that matters.
The key insight is that these tiny deep learning models have limited information capacity, so you cannot afford to waste precious KBs on learning irrelevant features. You have to be very strict in telling the model what you do and do not find important. We can communicate this to the model by carefully selecting its training data. Furthermore, the balance in your dataset is important because a compressed model tends to limit itself to perform well only on the concepts for which there are many samples in the training dataset. So we curate our datasets to ensure that all the important use-cases are included, and in the right balance.
Know your model
As tiny AI models become a part of our world, it is crucial that we know these models well. We have to understand in what situations they are reliable and where their pitfalls are.
High-level metrics like accuracy, precision and recall don’t provide anything close to the level of detail required here. Instead — taking inspiration from Andrej Karpathy’s Software 2.0 essay — we test our models in very specific situations. And rather than writing code we express these tests with data. Every model that we ship needs to surpass an accuracy threshold for very specific subsets of our dataset. For example, for person detection we have implemented tests for people standing far away or for people who are only showing their back, and for scenes containing coats and other fabrics that might confuse the model.
Several samples from our
person_standing_back data unit test.
These data unit tests guarantee that the model will work reliably for our customers’ use-cases and also enable us to ensure our models are not making mistakes due to fairness-related attributes, such as skin color or gender. As we experiment with new training algorithms, model architectures and training datasets, our data unit tests allow us to track the progress these inventions make on the trained models that we ship to customers.
The outcome of some of our data unit tests — a standard MobileNet person detection model (left) versus Plumerai’s person detection model (right). The
person_hallway_beyond_7m test result shows us that this model can not yet be reliably used to detect people from large distances.
Public datasets: handle with care
Public datasets are usually composed of photos taken by people for the enjoyment of people, such as photos of concerts, food or art. These photos are typically very different from the scenes in which TinyML/low-power AI products are deployed. For example, the dark photos in public datasets are mostly from concerts and almost always have people in them. So if this dataset is used to train a tiny deep learning model, the model will try to take a shortcut and associate dark images with the presence of people. This hack works fine as long as the model is evaluated on the same distribution of images, as is the case in the validation sets of public datasets, but will cause problems once the model is deployed in a smart doorbell camera. In addition, public datasets often contain misclassified images, images encoding harmful correlations and images that although technically labelled correctly are too difficult to classify correctly.
Most dark photos in the public Open Images dataset are from concerts and contain people. A tiny deep learning model trained on this dataset will try to take a problematic shortcut and associate dark images with the presence of people.
To circumvent the many problems of public datasets, we at Plumerai collect our own data — straight from the cameras used in TinyML products in the situations that our products are intended for. But that does not mean that we do not use any public datasets to train our models. We want to benefit from the scale and diversity of these datasets, while mitigating their quality issues. So we use our own data and analysis methods to automatically identify specific issues stemming from sampling biases in the public datasets and solve those problematic correlations that our models are sensitive to.
The images in public person detection datasets (left) are largely irrelevant to the scenes encountered by TinyML devices in the real-world (right).
One of the tools we can use to debug our tiny AI models are saliency maps. They allow us to see what our models are correctly (left) or incorrectly (right) triggered by.
Building the tiny AI model factory
Although new training optimizers and new model architectures get most of the attention, there are many other components required to build great AI applications on tiny hardware. Data unit tests and tools for dataset curation are some of the components that are mentioned above, but there are many more. We build tools to identify what data needs to be labelled, use large models in the cloud for auto labeling our datasets and combine these with human labelling. We build the whole infrastructure that is necessary to go as fast as possible through our model development cycle: train a tiny AI model, test it, identify the failure cases, collect and label data for those failure cases and then go through this cycle again and again. All these components together allow us to build tiny but highly accurate AI models for the real world.
ML code is just one small component of the large and complex tiny AI model factory that we are building. From Sculley et al. (2015).
Thursday, July 1, 2021
Tim de Bruin, a Deep Learning Scientist at Plumerai, was one of the invited speakers at last week’s CVPR 2021 Workshop on Binarized Neural Networks for Computer Vision. Tim presented some of Plumerai’s work on solving the remaining challenges with BNNs and explains why optimizing for accuracy is not enough.
Tuesday, April 20, 2021
At this year’s tinyML Summit, we presented our new solution for person detection with bounding boxes. We have developed a person detection model that runs in real time (895 ms latency) on an STM32H7B3 board (Arm Cortex-M7), a popular off-the-shelf available microcontroller. To the best of our knowledge, this is the first time anyone runs person detection in real-time on Arm Cortex-M based microcontrollers, and we are very excited to be bringing this new capability to customers!
Watch the video below to see the live demo and learn how Binarized Neural Networks are an integral part of our solution.
Wednesday, April 7, 2021
We are very excited to present our paper Larq Compute Engine: Design, Benchmark, and Deploy State-of-the-Art Binarized Neural Networks at the MLSys 2021 conference this week!
Larq Compute Engine (LCE) is a state-of-the-art inference engine for Binarized Neural Networks (BNNs). LCE makes it possible for researchers to easily benchmark BNNs on mobile devices. Real latency benchmarks are essential for developing BNN architectures that actually run fast on device, and we hope LCE will help people to build even better BNNs.
In the paper, we discuss the design of LCE and go into the technical details behind the framework. LCE was designed for usability and performance. It extends TensorFlow Lite, which makes it possible to integrate binarized layers into any model supported by TFLite and run it with LCE. Integration with Larq makes it very easy to move from training to deployment. And thanks to our hand-optimized inference kernels and sophisticated MLIR-based graph optimizations, inference speed is phenomenal.
The impact of binarization on the latency of different convolutional blocks in ResNets - binary is up to 17x faster than 32-bit floating point and 12x faster than 8-bit integers on a Pixel 1 phone.
BNNs are all about efficient inference: their tiny memory footprint and compact bitwise computations make them perfect for edge applications and small, battery powered devices. However, this requires the people building BNNs to have access to real, empirical measurements of their model as it is deployed. Lacking the right tools, researchers too often refrain to proxy metrics such as the number of FLOPs. LCE aims to change this.
In the paper, we demonstrate the value of measuring latency directly by analyzing the execution of some of the best existing BNN designs, such as R2B and BinaryDenseNets. We identify suboptimal points in these models and present QuickNet, a new model family which outperforms all existing BNNs on ImageNet in terms of latency and accuracy.
Breakdown of the latency of QuickNet-Large (QNL), our new state-of-the-art BNN. Note how compared to Real-to-Binary Net (R2B) and BinaryDenseNet, we win mostly on the high-precision first layer and ‘glue’ layers.
We will be presenting the work at the MLSys conference on Wednesday the 7th of April. The published paper is publicly available, and registered attendees will be able to view our oral talk. We hope you will all attend, and if you want to learn more about running BNNs on all sorts of devices, contact us at [email protected]!
Friday, January 22, 2021
For the past few months we have been working very hard on something new: Binarized Neural Networks on microcontrollers. By bringing deep learning to cheap, low-power microcontrollers we remove price and energy barriers and make it possible to embed AI into basically any device, even for relatively complex tasks such as person detection.
This week, we gave a presentation as part of the tinyML Talks webcast series where we explained what we had to build to make this work. We demonstrated how combining our custom training algorithms, inference software stack and datasets results in a highly accurate and efficient solution - in this case for person presence detection on the STM32L4R9, an ARM Cortex-M4 microcontroller from STMicroelectronics. This technology can be implemented to trigger push notifications for smart home cameras, wake up devices when a person is detected, detect occupancy of meeting rooms and for many more applications. This is a step towards our goal of making deep learning ultra low-power and a future where battery-powered peel-and-stick sensors can perform complex AI tasks everywhere.
Our BNN models with our proprietary inference software are faster and more accurate on ARM Cortex-M4 microcontrollers than the best publicly available 8-bit deep learning models with TensorFlow Lite for Microcontrollers.
We quickly found out that great training algorithms and inference software are not enough when building a solution for the real world. Collecting and labeling our own data turned out to be crucial in dealing with a wide variety of difficulties.
Testing our models thoroughly is equally important. Instead of relying only on simplistic metrics such as accuracy, we developed a suite of unit tests to ensure reliable performance in many settings.
Bringing all of this together results in a highly robust and efficient solution for person presence detection, as we showed in our live demo.
We’re just scratching the surface with person detection and we have a lot more in store for the coming months - more applications on Arm Cortex-M microcontrollers, and even better performance with our own IP-core for BNN inference on low-powered FPGAs and with the xcore.ai platform from XMOS.
We are very happy with the high attendance during the live webcast and a ton of questions were submitted during the Q&A. There was no time to answer all of them live, so we shared our answers on the tinyML forum and we’ll be sharing more of our progress at the tinyML Summit in March.
If you’re thinking about using Binarized Neural Networks to enable highly accurate deep learning on microcontrollers, get in touch!
Wednesday, April 1, 2020
XMOS and Plumerai bring together their deep learning expertise across chip design and algorithms in a Binarized Neural Network capability, advancing the deployment of intelligence at the edge.
Bristol & London UK, 1 April 2020 – British technology companies XMOS and Plumerai have agreed a new strategic partnership that will support the development of binarized neural network (BNN) capabilities that enable AI to be embedded in a wide range of everyday devices efficiently at low-power and at low-cost.
The partnership will combine Plumerai’s Larq software library for training BNNs and the xcore.ai crossover processor from XMOS which provides native support for inference of BNNs. The combination of the two technologies will deliver a BNN capability that’s 2 to 4x more efficient than existing edge AI solutions.
This solution will enable a new generation of devices to run tasks that make our lives simpler and safer. This could include everything from identifying that a shopping package has been delivered to a safe place to managing traffic flows more efficiently, supporting remote healthcare applications or keeping shelves in stores stocked more efficiently. While BNNs are an emerging technology, the future potential is enormous.
The deep learning revolution is all around us today. But a typical application uses deep learning models with tens of millions of parameters – and despite the move to 16-bit and 8-bit encoding there is still an insatiable demand to increase the speed and efficiency of deep learning and AI systems. That’s where BNNs come in.
BNNs are the most efficient form of deep learning, offering to transform the economics and efficiency of edge intelligence by going all the way down to just a single bit. However, there are significant challenges involved in making BNNs commercially viable – for example, they demand specific attention in chip design for efficient inference and new software algorithms for training.
XMOS and Plumerai have combined their respective expertise in embedded chip design and deep learning algorithms to enable this breakthrough technology and bring AI to the devices all around us.
Mark Lippett, XMOS CEO says “BNNs gained prominence in the news recently with Apple’s purchase of Xnor.ai for a reported $200m. It’s little surprise that Apple is exploring AI capabilities at the edge, with advanced machine learning algorithms that can run efficiently in low-power, offline environments.
“Regardless of other moves in the market, our partnership with Plumerai is exciting for AI developers around the world. The combination of Larq and xcore.ai offers the first consolidated path to commercially deploying BNNs, which will be highly disruptive in intelligent embedded systems.”
Roeland Nusselder, Plumerai CEO adds “We are thrilled to join forces with the experienced team from XMOS to bring BNNs to the edge, and we share their excitement about the emerging era of intelligent connectivity. Binarized deep learning has tremendous potential for enabling a new generation of energy-efficient, AI-powered applications. Our two companies are perfectly positioned to turn this potential into reality.”
XMOS is a deep tech company at the leading edge of the AIoT. Since its inception in 2005, XMOS has had its finger on the pulse recognising and addressing the evolving market need. The company’s processors put intelligence, connectivity and enhanced computation at the core of smart products.
Plumerai is making deep learning tiny and computationally radically more efficient to enable real-time inference on the edge – for automated warehouses, retail, smart cameras, micromobility and many more. The team is based in London, Amsterdam and Warsaw and is backed by world-class investors.
Tuesday, March 24, 2020
State-of-the-art binarized neural networks and even faster inference
In our previous blog post, we announced Larq Compute Engine (LCE), our deployment solution for Binarized Neural Networks (BNNs). The combination of LCE, our training library Larq and the models in Larq Zoo forms the first end-to-end solution for anyone building applications using BNNs. In this post, we take a step back and look at this integrated ecosystem as a whole.
Monday, February 17, 2020
Optimized BNN inference for edge devices
We believe BNNs are the future of efficient inference, which is why we’ve developed tools to make it easier to train and research these models. Our open-source library Larq enables developers to build and train BNNs and integrates seamlessly with TensorFlow Keras. Larq Zoo provides implementations of major BNNs from the literature together with pretrained weights for state-of-the-art models.
But the ultimate goal of BNNs is to solve real-world problems on the edge. So once you’ve built and trained a BNN with Larq, how do you get it ready for efficient inference? Today, we’re introducing Larq Compute Engine to tackle that problem.