With OpenMaskXR, we brought OpenMask3D to Extended Reality. OpenMask3D is an approach to 3D instance segmentation with open-vocabulary querying. This means you’re no longer stuck with the typical fixed set of classes like tables, chairs, or windows – you can now ask for:
- Uncommon objects, such as a “footrest” or “can of Pepsi”
- Affordances like “sit” or “clean”
- Properties, for example, “yellow pillow” or “rectangular table”
- States, like distinguishing a “full garbage bin” from an “empty garbage bin”
In OpenMaskXR, we bring this advanced scene understanding to XR: we implemented software components that cover everything from scanning the environment with everyday hardware to open-vocabulary object querying, displaying the matching instances aligned to the user’s real-world surroundings. Watch our video below or read our report to learn more!
## High-Level Overview
While our report goes into much more detail, here’s a quick, high-level rundown of our method. As input, OpenMask3D requires posed RGB-D frames together with a colored reconstruction mesh. If an XR headset provides any camera access at all (not-so-fun fact: as of December 2024, Meta is camera-shy and Apple locks camera access behind an enterprise entitlement), you’re in luck – sort of: you typically only get posed RGB images paired with an uncolored reconstruction mesh. To bridge this gap, we synthesize depth images by rendering the reconstruction mesh from each camera pose, and we color the mesh by projecting the RGB images onto it.
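To illustrate the depth-synthesis step, here is a minimal sketch that renders a depth image from the reconstruction mesh for one posed frame. It uses Open3D’s ray-casting API as an example backend (not necessarily what our pipeline uses), and the function name, parameters, and file handling are placeholders; it assumes the headset provides per-frame camera intrinsics and a world-to-camera pose.

```python
import numpy as np
import open3d as o3d


def synthesize_depth(mesh_path: str, intrinsics: np.ndarray, extrinsics: np.ndarray,
                     width: int, height: int) -> np.ndarray:
    """Render a depth image for one posed RGB frame from the reconstruction mesh.

    intrinsics: 3x3 pinhole matrix, extrinsics: 4x4 world-to-camera transform
    (both assumed to accompany the headset's posed RGB frames).
    """
    mesh = o3d.io.read_triangle_mesh(mesh_path)
    scene = o3d.t.geometry.RaycastingScene()
    scene.add_triangles(o3d.t.geometry.TriangleMesh.from_legacy(mesh))

    # One ray per pixel, defined by the camera intrinsics and pose.
    rays = o3d.t.geometry.RaycastingScene.create_rays_pinhole(
        intrinsic_matrix=o3d.core.Tensor(intrinsics, dtype=o3d.core.Dtype.Float64),
        extrinsic_matrix=o3d.core.Tensor(extrinsics, dtype=o3d.core.Dtype.Float64),
        width_px=width,
        height_px=height,
    )
    hits = scene.cast_rays(rays)

    # Turn hit distances into world-space hit points, then into z-depth in the
    # camera frame so the output resembles a regular depth-sensor image.
    rays_np = rays.numpy()
    t_hit = hits["t_hit"].numpy()
    points_world = rays_np[..., :3] + rays_np[..., 3:] * t_hit[..., None]
    points_h = np.concatenate([points_world, np.ones((height, width, 1))], axis=-1)
    depth = (points_h @ extrinsics.T)[..., 2]
    depth[~np.isfinite(t_hit)] = 0.0  # rays that missed the mesh
    return depth
```

Coloring the mesh works in the opposite direction: each visible vertex is projected into a frame with the same intrinsics and pose, and its color is sampled from the RGB image at that pixel.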
This data can then be fed to OpenMask3D to obtain segmented 3D instances and a CLIP vector for each instance. Querying is straightforward: embed the textual query into CLIP space and retrieve the 3D instances with the highest cosine similarity. We let the user query by voice or keyboard and provide a simple interface to tweak how closely objects need to match the query in order to be highlighted.
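To make the querying step concrete, here is a minimal sketch using OpenAI’s `clip` package. It assumes the per-instance CLIP features from OpenMask3D are stacked into a single tensor; the backbone name, file name, and threshold are illustrative, not values from our implementation.

```python
import clip
import torch


def rank_instances(instance_features: torch.Tensor, query: str,
                   device: str = "cpu") -> torch.Tensor:
    """Score each segmented 3D instance against a text query.

    instance_features: (num_instances, feature_dim) tensor holding one CLIP
    vector per instance mask, as computed by OpenMask3D.
    """
    # Example CLIP backbone; substitute whichever model produced the features.
    model, _ = clip.load("ViT-L/14@336px", device=device)
    with torch.no_grad():
        tokens = clip.tokenize([query]).to(device)
        text_feat = model.encode_text(tokens).float()

    # Cosine similarity = dot product of L2-normalized vectors.
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    inst_feat = instance_features.float()
    inst_feat = inst_feat / inst_feat.norm(dim=-1, keepdim=True)
    return (inst_feat @ text_feat.T).squeeze(-1)  # (num_instances,) similarities


# Usage (hypothetical feature file produced earlier in the pipeline):
#   features = torch.load("instance_clip_features.pt")          # (N, 768)
#   scores = rank_instances(features, "can of Pepsi")
#   highlighted = (scores > 0.25).nonzero(as_tuple=True)[0]     # tunable threshold
```

The threshold in the usage note corresponds to the user-facing slider mentioned above: raising it highlights only close matches, lowering it highlights more candidates.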
For display, the user may choose between a diorama mode (resembling a dollhouse) and a life-size view of the scene. They can also toggle whether their actual environment stays visible – though the real magic happens when you query your surroundings and see life-size instances highlighted!
Our code is published on GitHub.