Why YOLO-E is cool
YOLO-E: the latest open-vocab model for segmentation and detection, running on MX3
If you haven't heard about YOLO-E, it is the latest open-vocabulary object detection and segmentation model from researchers at Tsinghua University (paper), and it's the most impressive model of its kind yet!
Open-vocab, or "zero-shot", models are a new type of neural network that combines image and text embeddings, learned with a CLIP-style contrastive training technique, to draw bounding boxes or segmentation masks around objects that never appeared in the training dataset.
Previous models in this category include YOLO-World (which, by the way, we showcased at CES '25 and ISC West) and one of the earliest entries, OWL-ViT.
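To make that idea concrete, here is a toy sketch of the matching step these CLIP-style models share: text prompts and image regions are embedded into the same space, and each region is scored against each prompt by similarity. The shapes, names, and random data below are purely illustrative and are not taken from the YOLO-E implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Illustrative shapes: 3 candidate regions, 2 text prompts, 512-dim embeddings.
region_embeddings = np.random.randn(3, 512)   # from the image branch
prompt_embeddings = np.random.randn(2, 512)   # from the text encoder, e.g.
                                              # ["red shirt", "dog walker"]

# Each region is matched to the prompt it scores highest against --
# the prompts can be anything, with no retraining involved.
scores = cosine_sim(region_embeddings, prompt_embeddings)  # shape (3, 2)
best_class = scores.argmax(axis=1)
```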
Why is this "cool"?
With open-vocabulary object detection and instance segmentation models like YOLO-E, you can detect object classes without ever having trained on them.
Imagine you have a classic YOLO model trained on a "person" class, but you want to narrow it down to people wearing red shirts, or people walking dogs. You would normally have to:
- Collect hundreds of images of red shirts and dog walkers
- Label all of this data
- Retrain your YOLO model
- Re-deploy
But with YOLO-E, you simply add a class called "red shirt" or "dog walker", click Apply, and it's running within seconds.
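Here's roughly what that looks like in code, using the Ultralytics Python API rather than our example app; the checkpoint name, image path, and exact method signatures are assumptions, so check the Ultralytics YOLOE docs for the current interface.

```python
# Hedged sketch: prompting YOLO-E with brand-new classes at runtime.
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.pt")  # pretrained open-vocab checkpoint (name assumed)

# No data collection, labeling, or retraining -- just describe what you want.
names = ["red shirt", "dog walker"]
model.set_classes(names, model.get_text_pe(names))  # text prompts -> class embeddings

results = model.predict("street.jpg")  # boxes + masks for the prompted classes
results[0].show()
```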
The accuracy will of course be higher with a custom-trained YOLO, but for quick turnaround, nothing beats open-vocab models like YOLO-E.
Running on MemryX
We've made a full end-to-end app with a GUI in our Examples repo on GitHub! It's simple to run and includes options to show or hide boxes, masks, and labels, blur the background, and more.

MemryX's unique strength
Thanks to the MX3's use of floating-point math (more specifically, Group-BFloat16) for feature maps, none of the model quantization or accuracy-tuning steps typically required by INT8-based accelerators are needed.
In models like YOLO-World and YOLO-E, the weights of the final layers change whenever the prompted classes change, so on an INT8 accelerator the model would have to be re-quantized (or, even worse, fully retrained) every single time the classes are modified. That largely negates the advantage of open-vocab models: fast updates with no need to gather data or images of your target classes.
With the MX3, the unmodified floating-point model is compiled to a DFP and runs right away, without any of these prohibitively slow and expensive steps!
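For a feel of that workflow, here is a rough sketch using the MemryX SDK's Python interface; the model file name, input shape, and exact class and method signatures are assumptions based on the SDK's documented NeuralCompiler and SyncAccl APIs, so treat it as an outline rather than copy-paste code.

```python
import numpy as np
from memryx import NeuralCompiler, SyncAccl

# 1. Compile the exported floating-point YOLO-E model (ONNX here) straight
#    to a DFP -- no quantization pass or calibration dataset required.
dfp = NeuralCompiler(models="yoloe_seg.onnx").run()

# 2. Run it on the MX3 right away. Changing the prompted classes later only
#    means re-exporting and recompiling, never re-quantizing or retraining.
accl = SyncAccl(dfp)
frame = np.zeros((1, 3, 640, 640), dtype=np.float32)  # placeholder; match your model's input layout
outputs = accl.run(frame)
```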