Thursday, May 21, 2026

Making Multi-Camera ReID work when mistakes cascade

Multi-Camera person ReID looks like an extension of face ID, but the signal is weaker, so every mistake compounds. Here's how we still achieved our accuracy goals.

Multi-Camera Re-Identification (ReID) recognizes the same person across different cameras and viewpoints without relying on face recognition. In smart home security cameras, it solves notification overload: instead of getting a separate alert every time the same person is detected again, the system groups those events into one. In enterprise security, it enables tracking individuals across a camera network and supports tasks like loitering detection, where you need to know that the person who has been lingering near a side entrance is the same one who was seen at the parking lot fifteen minutes ago.

In the video below, a gardener mowing the lawn triggers 34 push notifications in less than an hour on a multi-camera setup without ReID. With ReID, those repeated sightings of the same person collapse into a single event.

The setup

At Plumerai we previously built the automatic enrollment mode of FFID, our Familiar Face Identification system. FFID uses an end-to-end deep learning pipeline of three neural networks for detection, face representation, and face matching. We covered the technical details in our tinyML talk, and later expanded it with automatic enrollment. The core task is the same in both automatic FFID and Multi-Camera person ReID: automatically enroll previously unseen individuals and reliably match them when they reappear. Both systems do this by mapping image crops through an embedding network into a high-dimensional space, where crops of the same person should cluster together and crops of different people should sit far apart.

Faces are exceptional identity inputs: geometrically normalized and information-dense. Person appearance (clothing, build) is neither. Outfits change, and the same clothing can look entirely different from the front, the side, or the back. Identity clusters in appearance embedding space are broader, fuzzier, and more likely to overlap. This gap doesn’t fully close even as the embedder improves, because it is an inherent property of the input signal. To build an accurate ReID system, we had to significantly expand our processing pipeline and improve every component in it.

Comparison of face ID and person ReID embedding clusters
Face embeddings (left) form tight, well-separated identity clusters, making each person easy to distinguish. Person appearance embeddings (right) produce broader, overlapping clusters, making every downstream decision harder.

Why this matters: errors cascade

Mistakes in a ReID pipeline are not independent. A tracking error that bundles embeddings from two different people corrupts the representation, which causes a wrong enrollment decision, which pollutes the gallery (the database of stored identity representations that the system matches new observations against) and causes future matching errors. Each mistake seeds more. In face ID the margin is generous enough that the system usually self-corrects. In ReID, the weaker signal means the same cascade runs further before it’s caught.

In face ID, a mistake is mostly an isolated setback. In Multi-Camera person ReID, a mistake is the beginning of a chain of mistakes. That is what set the bar for everything we built.

Here is how we approached the problem, and what we had to get right. Eight components, one at a time.

Building an identity representation

A single embedding from a single frame is too noisy to act on. The goal is to aggregate observations over time into a compact, representative description of a person, one robust to any single bad crop. Getting that aggregation right required solving five things.

Component 1: Detecting people

Before we can embed crops of people, we need to find them in the video stream. This is where our object detectors and our advanced motion detection come in. A person who was never detected in the first place can’t be re-identified. A crop that includes multiple people, or only part of a person, produces degraded embeddings. Our people detection is both accurate and efficient, setting the rest of the pipeline up for success. Our advanced motion detection allows our detector to spend its compute where it matters, ensuring that even an efficient detector won’t miss people walking by in the distance.

Component 2: The embedder

The next component is the embedding network: the model that maps a person crop into a point in high-dimensional space. A better embedder produces tighter clusters, which results in easier decisions everywhere downstream.

To create an embedder that was both highly efficient and highly accurate, we optimized every stage of the training process.

We invested heavily in assembling relevant training data rather than relying on off-the-shelf datasets. We also built specialized ReID labeling tools to make this job both efficient and accurate, since the volume of data required and the subtlety of appearance-based identity made general-purpose annotation workflows insufficient.

We trained with a mix of several losses, one of which we carefully designed to encourage the embedder to encode input quality into the embedding (which is helpful downstream, see Component 4). We optimized the architecture (through neural architecture search) to be accurate even on a tiny compute budget. We optimized our knowledge distillation process to preserve the input-quality information of the embeddings (in addition to preserving the embedding space structure of the teacher with a fraction of the compute). Our embedder training pipeline consists of several stages, each with its own datasets and objectives, which together produce a model that is both accurate and efficient.

The challenge for ReID is that person appearance offers none of the structure that makes face embedders easy to train: viewpoints vary wildly, scale changes, and clothing shifts dramatically under different lighting. There are no clean geometric constraints to exploit. As a result, even a very good ReID embedder produces appearance clusters that remain inherently broader than face clusters. A better ReID embedder narrows the gap but doesn’t close it. The remaining ambiguity between overlapping clusters still had to be absorbed by every downstream component. That realization is what shaped the rest of the system.

Component 3: Short-term tracking

Before you can aggregate embeddings, you need to know which ones belong together. Short-term tracking does this by running a multi-object tracker across consecutive frames and associating detections into per-person tracklets. When the tracker is correct, every embedding in a track comes from the same person. When it makes a mistake (swapping boxes during an occlusion, or continuing a track past an entry or exit), the aggregated representation silently starts describing a mix of two different people.

In face ID, these errors surface quickly: face embeddings from different people are distinctive enough that the mismatch is hard to miss. In ReID, similar clothing or build can make two people’s embeddings close enough that a tracking error goes unnoticed much longer. To make ReID work we had to roughly halve our tracker’s identity-switch rate compared to what FFID needed, which meant combining location, appearance, detection metadata, and more into a single association decision.

The hardest part of tracking is data association: deciding, frame by frame, which detection belongs to which track. Classical approaches rely on hand-tuned heuristics or fixed distance thresholds. We trained our data association policy with reinforcement learning, letting the system learn from experience which assignment decisions lead to the best long-term tracking accuracy rather than optimizing each frame in isolation. The result was not just fewer stolen tracklets, but also much longer and more consistent tracking.
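
For contrast, the classical baseline mentioned above can be sketched in a few lines. This is not the learned policy described in this post. It is a minimal greedy association using box overlap (IoU) and a fixed threshold, and the function names, box format, and `min_iou` value are all illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Greedy frame-by-frame association with a fixed IoU threshold.

    tracks: dict mapping track id -> last known box (hypothetical format).
    detections: list of boxes found in the current frame.
    Returns (assignments, unmatched), where assignments maps a track id to a
    detection index and unmatched lists detection indices without a track.
    """
    # Score every track/detection pair and visit the best overlaps first.
    pairs = [
        (iou(box, det), track_id, det_idx)
        for track_id, box in tracks.items()
        for det_idx, det in enumerate(detections)
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)

    assignments, used = {}, set()
    for score, track_id, det_idx in pairs:
        if score < min_iou:
            break  # every remaining pair overlaps even less
        if track_id in assignments or det_idx in used:
            continue  # each track and each detection is used at most once
        assignments[track_id] = det_idx
        used.add(det_idx)

    unmatched = [i for i in range(len(detections)) if i not in used]
    return assignments, unmatched
```

A rule like this decides each frame in isolation and only looks at geometry, which is exactly where occlusions and crossing paths cause identity switches.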

Component 4: Embedding quality estimation

Even within a clean track, not all embeddings are worth keeping. A blurry thumbnail, a person half-obscured by a doorframe, a crop taken mid-motion at the edge of the frame: these can land far from where they should be in the identity space, and admitting them degrades the aggregated representation. A single bad embedding that slips through can tip a borderline aggregation toward the wrong cluster, so we had to work hard to keep them out.

Detecting poor crops required real research. We used the learned quality estimation we trained into the embedder together with additional signals from our pipeline to emphasize the best crops per track. This helps us get high-quality aggregated embeddings, even when many of the individual crops include multiple people, partial people, or degraded image quality. It made the difference between a system that stayed accurate and one that drifted.
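
One simple way to emphasize the best crops is a quality-weighted average of the embeddings in a track. The sketch below assumes each embedding comes with a scalar quality score in [0, 1]; the function name and the weighting scheme are illustrative assumptions, not a description of the production aggregation.

```python
import math


def aggregate_track(embeddings, qualities):
    """Quality-weighted mean of a track's embeddings, scaled to unit length.

    embeddings: list of equal-length float vectors from one track.
    qualities: one non-negative weight per embedding (assumed to be given).
    """
    if len(embeddings) != len(qualities) or not embeddings:
        raise ValueError("need one quality score per embedding")
    total = sum(qualities)
    if total <= 0:
        raise ValueError("at least one crop must have positive quality")

    dim = len(embeddings[0])
    # Low-quality crops contribute proportionally less to the mean.
    mean = [
        sum(q * e[i] for e, q in zip(embeddings, qualities)) / total
        for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean
```

With weights like these, a blurry or partially occluded crop can stay in the track without pulling the aggregated representation far from the clean crops.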

Component 5: Embedding sub-sampling

Even after quality filtering, there’s a subtler problem: the remaining embeddings are not independent and identically distributed (IID). Consecutive frames share the same lighting, the same angle, the same partial occlusion. When samples are correlated like this, averaging them doesn’t reduce variance the way it would for independent observations. It just encodes whatever systematic bias happens to be present right now.
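
To put a number on this, take the textbook case of n samples with equal variance and the same correlation rho between every pair. That is a simplifying assumption, since real frames are not equally correlated. The variance of their mean is then sigma^2 * (1 + (n - 1) * rho) / n, which the sketch below evaluates. The function names are illustrative.

```python
def variance_of_mean(n, sigma2=1.0, rho=0.0):
    """Variance of the mean of n equally correlated samples.

    Assumes every sample has variance sigma2 and every pair of samples has
    the same correlation rho (a simplification of real video frames).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return sigma2 * (1 + (n - 1) * rho) / n


def effective_sample_size(n, rho):
    """Number of independent samples that would give the same variance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (1 + (n - 1) * rho)
```

Under this model, 100 independent frames cut the variance to 1% of a single frame's. At a pairwise correlation of 0.9, the same 100 frames leave about 90% of it, which is worth roughly 1.1 independent samples.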

The fix is to select for diversity rather than take everything. An embedding earns a place in the stored set only if it is both high-quality and meaningfully different from what’s already there. This builds a spread-out sample of the person’s appearance distribution instead of a dense clump of near-identical frames from one viewpoint, and diversity-aware aggregation turned out to be one of the more important things we got right.
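
The admission rule described above can be sketched as a greedy filter. The thresholds, the cap on stored embeddings, and the use of cosine distance are illustrative assumptions rather than details of the production system.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_diverse(candidates, min_quality=0.5, min_distance=0.1, max_stored=8):
    """Keep embeddings that are both high-quality and mutually different.

    candidates: list of (embedding, quality) pairs from one track.
    Returns the stored embeddings, best quality first.
    """
    stored = []
    # Visit the best crops first so they claim their region of the space.
    for embedding, quality in sorted(candidates, key=lambda c: c[1], reverse=True):
        if quality < min_quality or len(stored) >= max_stored:
            break  # everything after this point is lower quality still
        # Admit only if far enough from every embedding already stored.
        if all(1.0 - cosine_similarity(embedding, s) >= min_distance for s in stored):
            stored.append(embedding)
    return stored
```

A near-duplicate of an already stored view is rejected even when its quality is high, so the stored set spends its limited slots on different viewpoints.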

Once a robust identity representation has been built, it feeds into two further decisions that determine the overall accuracy of the system.

Component 6: Matching vs new enrollment

For each new observation of sufficient quality, the system faces a binary choice: known person, or someone new? Both directions of error matter. A false match corrupts the wrong person’s stored identity. A false enrollment fragments a known person across multiple gallery entries. Either way, future observations are matched against a gallery that’s less accurate than it was before, setting us up for more errors and an even further degraded gallery.

Calibrating this choice correctly is crucial for ReID, as mistakes on either side of the decision will start to snowball.
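
In its simplest form, the decision is a similarity threshold against the best gallery entry. The sketch below assumes one stored embedding per identity and a single cosine-similarity threshold; the `Gallery` class, its method names, and the 0.8 default are hypothetical and only illustrate the shape of the choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class Gallery:
    """Toy gallery: one stored embedding per enrolled identity."""

    def __init__(self, match_threshold=0.8):
        self.match_threshold = match_threshold
        self.entries = {}   # identity id -> stored embedding
        self._next_id = 0   # monotonically increasing, never reused

    def observe(self, embedding):
        """Return ("match", id) for a known person, else ("enroll", new_id)."""
        best_id, best_sim = None, -1.0
        for identity_id, stored in self.entries.items():
            sim = cosine_similarity(embedding, stored)
            if sim > best_sim:
                best_id, best_sim = identity_id, sim

        if best_id is not None and best_sim >= self.match_threshold:
            return "match", best_id

        # Nothing in the gallery is close enough: enroll a new identity.
        new_id = self._next_id
        self._next_id += 1
        self.entries[new_id] = list(embedding)
        return "enroll", new_id
```

Raising the threshold trades false matches for false enrollments, and lowering it does the reverse, which is why the calibration matters so much when clusters overlap.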

Component 7: Gallery maintenance

Even a well-tuned system makes mistakes. Some are obvious and self-correcting. Others are subtle: a match that just cleared the threshold, a near-duplicate enrollment that slipped past the filter. These accumulate quietly. Left unaddressed, a gallery that starts clean gradually becomes one full of fragmented identities, silently merged entries, and stale records.

Keeping a gallery healthy means actively monitoring for corruption and correcting it: consolidating fragments and pruning entries. In ReID, the higher error rate upstream means gallery maintenance isn’t optional housekeeping. It’s a core part of the system.
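
As an illustration of consolidating and pruning, the sketch below drops entries that have not been seen recently and folds near-duplicate entries into one. The entry layout, the age-based pruning rule, and both thresholds are assumptions made for this example, not the maintenance logic used in production.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def maintain_gallery(gallery, now, merge_threshold=0.9, max_age_s=30 * 24 * 3600):
    """Return a cleaned copy of the gallery.

    gallery: dict id -> {"embedding": [...], "last_seen": seconds} (assumed).
    Entries older than max_age_s are pruned; entries whose embeddings are at
    least merge_threshold similar are consolidated under the lower id.
    """
    # Prune: drop identities that have not been observed recently.
    fresh = {
        identity_id: entry
        for identity_id, entry in gallery.items()
        if now - entry["last_seen"] <= max_age_s
    }

    # Consolidate: fold each entry into the first kept entry it resembles.
    cleaned = {}
    for identity_id in sorted(fresh):
        entry = fresh[identity_id]
        for target in cleaned.values():
            if cosine_similarity(entry["embedding"], target["embedding"]) >= merge_threshold:
                target["embedding"] = [
                    (a + b) / 2 for a, b in zip(target["embedding"], entry["embedding"])
                ]
                target["last_seen"] = max(target["last_seen"], entry["last_seen"])
                break
        else:
            # No similar entry yet: keep this identity as its own entry.
            cleaned[identity_id] = {
                "embedding": list(entry["embedding"]),
                "last_seen": entry["last_seen"],
            }
    return cleaned
```

This only repairs fragmentation and staleness. Undoing a silent merge of two different people is harder, because the stored representation no longer separates them.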

Putting it all together

This post primarily covers the ML and algorithmic side of Multi-Camera ReID, but making it work also required substantial system-level software engineering.

Component 8: System-level integration

To operate reliably across multiple cameras and deployment environments, the system had to keep galleries synchronized across cameras, integrate ReID with FFID, expose a clean API for on-device, on-prem, cloud, and hybrid deployments, and keep compute and memory usage low enough for consumer-priced cameras and cost-sensitive cloud infrastructure. Those are stories for another time.

Every component has to earn it

Person appearance embeddings are inherently harder to cluster than face embeddings: broader, fuzzier, more likely to overlap. Because errors cascade, there is very little slack for any component to be merely good enough. A weak tracker corrupts the aggregation. A leaky quality filter corrupts the representation. A poorly calibrated enrollment decision corrupts the gallery. And a gallery that isn’t actively maintained degrades over time, taking accuracy with it.

The only way to build a ReID system that stays accurate, even over longer periods of time, is to make each component genuinely good. That took serious research and engineering, and it’s what made all the difference.