Why YOLO-E is cool
YOLO-E: the latest open-vocab model for segmentation and detection, running on MX3
If you haven't heard about YOLO-E, it is the latest open-vocabulary object detection and segmentation model from researchers at Tsinghua University (paper), and it's the most impressive model of its kind yet!
Open-vocab, or "zero-shot", models are a new type of neural network that combines image and text embeddings, learned with a CLIP-style contrastive training technique, to draw bounding boxes or segmentation masks around objects that never appeared in the training dataset.
Previous models in this category include YOLO-World (which, by the way, we showcased at CES '25 and ISC West) and one of the earliest entries, OWL-ViT.
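To make that idea concrete, here is a toy sketch of the matching step these CLIP-style models share: text prompts and image regions are embedded into the same space, and each region is scored against each prompt by similarity. The shapes, names, and random data below are purely illustrative and are not taken from the YOLO-E implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Illustrative shapes: 3 candidate regions, 2 text prompts, 512-dim embeddings.
region_embeddings = np.random.randn(3, 512)   # from the image branch
prompt_embeddings = np.random.randn(2, 512)   # from the text encoder, e.g.
                                              # ["red shirt", "dog walker"]

# Each region is matched to the prompt it scores highest against --
# the prompts can be anything, with no retraining involved.
scores = cosine_sim(region_embeddings, prompt_embeddings)  # shape (3, 2)
best_class = scores.argmax(axis=1)
```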
Why is this "cool"?
With open-vocabulary object detection and instance segmentation models like YOLO-E, you can detect object classes without ever having trained on them.
Imagine you have a classic YOLO model trained on a "person" class, but you want to narrow it down to people wearing red shirts, or people walking dogs. You would normally have to:
- Collect hundreds of images of red shirts and dog walkers
- Label all of this data
- Retrain your YOLO model
- Re-deploy
But with YOLO-E, you simply add a class called "red shirt" or "dog walker", click Apply, and it's running within seconds.
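Here's roughly what that looks like in code, using the Ultralytics Python API rather than our example app; the checkpoint name, image path, and exact method signatures are assumptions, so check the Ultralytics YOLOE docs for the current interface.

```python
# Hedged sketch: prompting YOLO-E with brand-new classes at runtime.
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.pt")  # pretrained open-vocab checkpoint (name assumed)

# No data collection, labeling, or retraining -- just describe what you want.
names = ["red shirt", "dog walker"]
model.set_classes(names, model.get_text_pe(names))  # text prompts -> class embeddings

results = model.predict("street.jpg")  # boxes + masks for the prompted classes
results[0].show()
```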
The accuracy will of course be higher with a custom-trained YOLO, but for quick turnaround, nothing beats open-vocab models like YOLO-E.
Running on MemryX
We've made a full end-to-end app with a GUI in our Examples repo on GitHub! It's simple to run and includes options to show or hide boxes, masks, and labels, blur the background, and more.

MemryX's unique strength
Thanks to the MX3's use of floating-point math (more specifically, Group-BFloat16) for feature maps, none of the model quantization or accuracy-tuning steps typically required by INT8-based accelerators are needed.
In models like YOLO-World and YOLO-E, the weights of the final layers change whenever the prompted classes change, so on an INT8 accelerator the model would have to be re-quantized (or, even worse, fully retrained) every single time the classes are modified. That largely negates the advantage of open-vocab models: fast updates with no need to gather data or images of your target classes.
With the MX3, the unmodified floating-point model is compiled to a DFP and runs right away, without any of these prohibitively slow and expensive steps!
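For a feel of that workflow, here is a rough sketch using the MemryX SDK's Python interface; the model file name, input shape, and exact class and method signatures are assumptions based on the SDK's documented NeuralCompiler and SyncAccl APIs, so treat it as an outline rather than copy-paste code.

```python
import numpy as np
from memryx import NeuralCompiler, SyncAccl

# 1. Compile the exported floating-point YOLO-E model (ONNX here) straight
#    to a DFP -- no quantization pass or calibration dataset required.
dfp = NeuralCompiler(models="yoloe_seg.onnx").run()

# 2. Run it on the MX3 right away. Changing the prompted classes later only
#    means re-exporting and recompiling, never re-quantizing or retraining.
accl = SyncAccl(dfp)
frame = np.zeros((1, 3, 640, 640), dtype=np.float32)  # placeholder; match your model's input layout
outputs = accl.run(frame)
```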