Why AI Glasses for Autonomous Kitchen Robots?

At Circus SE, our CA-1 autonomous kitchen robots are entering high-volume deployment worldwide — from REWE supermarkets in Düsseldorf to Mercedes-Benz canteens, the German Bundeswehr, and educational institutions in Beijing. Each CA-1 unit is a self-contained, glass-enclosed cooking system with dual robotic arms, 36 ingredient silos, AI-driven computer vision, and induction heating that produces up to 2,000 meals per day without human intervention. But behind every autonomous robot, there is still a critical human role: the operator.

Human operators are responsible for loading ingredients, performing assembly procedures, handling incidents, and ensuring quality standards across all CA-1 locations. As we scale globally, the challenge of onboarding and continuously training these operators has become one of the most pressing bottlenecks. Flying every new hire to Munich for multi-day sessions simply does not scale. We needed a solution that brings expert-level guidance directly to operators on-site, in their language, personalized to their skill level — and available in real time.

That solution arrived when Meta opened early developer access to the Wearables Device Access Toolkit for the Ray-Ban Meta Gen 2 glasses. For Circus, this was a natural expansion of an existing technology partnership: we already integrate Meta’s Llama large language models into the CA-1’s voice-based customer ordering interface, where Llama powers intelligent menu consultation and personalized dietary recommendations. Moving from LLMs to hardware and vision models was the logical next step.

The Operator AI: A Wearable Sparring Partner

Our vision is straightforward: equip every operator with an “Operator AI” — a wearable, conversational AI assistant that acts as a knowledgeable sparring partner during day-to-day work. The system is trained on the complete set of operator training materials and continuously updated. It has access to operational data across all CA-1 locations in parallel, including live camera feeds inside each unit, system logs, real-time incident cases, and maintenance histories.

Through the Ray-Ban Meta glasses, an operator can simply ask a question (“What’s the next step for silo assembly?”), request a visual check (“Capture — does this look correct?”), or receive proactive alerts when the AI detects an anomaly through the glasses’ camera. The guidance is delivered via natural voice conversation and, with the upcoming Meta Ray-Ban Display glasses featuring an in-lens HUD and the Meta Neural Band for gesture control, operators will soon see step-by-step instructions directly in their field of view — hands-free.

Technical Architecture: From Glasses to Cloud and Back

The Meta Wearables SDK as a Low-Level I/O Platform

A key architectural decision was building on Meta's Wearables Device Access Toolkit, which gives us raw access to the glasses' camera feed, microphone, and speakers — no proprietary middleware, no vendor lock-in on the software side. We choose our own models, our own backend, our own processing logic. The SDK simply provides the I/O bridge between the glasses hardware and our native mobile application, which we built as a React Native wrapper. This level of freedom is critical for an industrial use case like ours, where we need full control over data routing, model selection, and latency optimization.

The SDK supports camera streaming at up to 720p and 30 FPS over Bluetooth, on-demand photo capture, and full bidirectional audio. This low-level access is a significant advantage: it means we can pipe visual data to any model of our choosing and route audio through any conversational AI backend.
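To illustrate this separation of concerns, the bridge between glasses and backends can be modeled as a thin routing layer with interchangeable components. The interfaces and method names below are hypothetical sketches, not the toolkit's actual API:

```typescript
// Hypothetical I/O surface of the glasses. The real Wearables Device
// Access Toolkit API differs; this only illustrates the bridge pattern.
interface GlassesIO {
  capturePhoto(opts: { width: number; height: number }): Promise<Uint8Array>;
  onAudioIn(handler: (chunk: Uint8Array) => void): void;
  playAudio(chunk: Uint8Array): void;
}

// Backends are pluggable: any vision model, any conversational AI.
interface VisionBackend {
  analyze(image: Uint8Array): Promise<string>;
}
interface VoiceBackend {
  sendAudio(chunk: Uint8Array): void;
  onAudioOut(handler: (chunk: Uint8Array) => void): void;
}

// Wire raw glasses I/O to whichever backends we select. The SDK stays a
// plain pipe, and all routing decisions live in our application code.
function createBridge(glasses: GlassesIO, vision: VisionBackend, voice: VoiceBackend) {
  glasses.onAudioIn((chunk) => voice.sendAudio(chunk));   // mic to cloud AI
  voice.onAudioOut((chunk) => glasses.playAudio(chunk));  // cloud AI to speakers
  return {
    // On-demand visual check: capture one frame, hand it to the vision backend.
    visualCheck: async (): Promise<string> => {
      const photo = await glasses.capturePhoto({ width: 1280, height: 720 });
      return vision.analyze(photo);
    },
  };
}
```

Swapping the vision or voice backend is then a one-line change, which is exactly the vendor-independence the raw I/O access buys us.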

Application Architecture

Our sample application serves as both a preparation tool and an on-device proxy. It extracts single photos from the live camera feed at configurable resolutions, manages LLM tool-use orchestration (triggering guided workflows, accepting/rejecting intermediate steps, launching application actions), and forwards microphone input and audio output bidirectionally between the cloud-hosted conversational AI and the operator.
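The tool-use orchestration can be sketched as a small dispatcher that maps tool calls emitted by the conversational AI onto device-side actions. Tool names and payload shapes here are illustrative placeholders, not our production schema:

```typescript
// Minimal tool-use dispatcher for the on-device proxy.
type ToolHandler = (args: Record<string, unknown>) => Promise<string> | string;

class ToolRouter {
  private handlers = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler): void {
    this.handlers.set(name, handler);
  }

  // Called whenever the conversational AI emits a tool call; the string
  // result is fed back to the model as the tool's output.
  async dispatch(name: string, args: Record<string, unknown> = {}): Promise<string> {
    const handler = this.handlers.get(name);
    if (!handler) return `unknown tool: ${name}`;
    return handler(args);
  }
}

// Register the actions the assistant can trigger on the device
// (hypothetical tool names for illustration).
const router = new ToolRouter();
router.register("start_workflow", (a) => `workflow ${a.id} started`);
router.register("accept_step", () => "step accepted, advancing");
router.register("reject_step", () => "step rejected, repeating instructions");
```

Because the router is just application code, adding a new assistant capability means registering one more handler rather than touching the transport layer.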

For the conversational layer, we use the OpenAI Realtime API. The latency is remarkably low, and its interruption handling is impressive — operators can cut in mid-sentence to redirect the assistant, which is essential in fast-paced kitchen environments. The glasses’ directional microphones already do a solid job isolating the wearer’s voice, and the Realtime API’s voice isolation handles the remaining ambient noise. The only limitation we have observed: when a second person is within approximately one meter, the glasses occasionally pick up their speech.

Visual Intelligence: Our Proprietary Operator AI Pipeline

The core IP behind our Operator AI is not any single off-the-shelf model. It is the proprietary intelligence pipeline we built to understand what happens inside and around a CA-1 in real time. We use open-source vision foundation models such as Meta's DINOv3 and SAM 3 as building blocks, but the value layer sits entirely on our side: a domain-specific visual reasoning system trained on thousands of hours of real CA-1 operational footage. It knows what a correctly seated silo looks like versus one that is 2 mm off. It knows the difference between a clean induction pot ready for the next cycle and one that needs intervention. No foundation model ships with that knowledge. We built it, and we continuously train it with every operator interaction across every location worldwide.

Where foundation models like SAM 3 give us promptable segmentation — the ability to say "find every silo lid in this frame" — our system adds the operational logic on top: Does the lid orientation match the assembly spec? Is the seal fully engaged? Should the operator proceed or redo the step? This closed-loop visual verification is what turns a generic AI capability into a production-grade quality assurance system that works autonomously at every CA-1 site. We use DINOv3's dense feature extraction and SAM 3's object tracking as infrastructure, the same way a car manufacturer uses steel — the engineering that makes it drive is ours.

JSON-Defined Guided Workflows with Visual Confirmation

With this application, we can deploy any kind of step-by-step guide to operators worldwide — in their preferred language, at the right depth of detail, and in a tone personalized for each individual. We use a simple JSON schema to define guided workflows. Take our Silo Assembly Guide as an example: five steps, each with voice-guided instructions and fixed acceptance criteria based on visual confirmation through the glasses’ camera. The AI evaluates the camera feed against predefined visual benchmarks before advancing to the next step. This means the system is always contextually aware of the operator’s situation without requiring manual explanations.
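A sketch of what such a workflow definition might look like, expressed as a TypeScript type plus a sample document. Only the five-step shape, the voice-guided instructions, and the per-step visual acceptance criteria follow the description above; the schema details and step contents are invented placeholders:

```typescript
// Illustrative workflow schema; the production schema differs in detail.
interface WorkflowStep {
  id: string;
  instruction: string;             // spoken to the operator, localized per user
  visualCheck: {
    prompt: string;                // what the vision model is asked to verify
    acceptanceCriteria: string[];  // all must hold before advancing
  };
}

interface GuidedWorkflow {
  id: string;
  title: string;
  steps: WorkflowStep[];
}

// Hypothetical condensed version of the Silo Assembly Guide (five steps).
const siloAssembly: GuidedWorkflow = {
  id: "silo-assembly-v1",
  title: "Silo Assembly Guide",
  steps: [
    { id: "1", instruction: "Place the silo body on the mount.",
      visualCheck: { prompt: "Is the silo body centered on the mount?", acceptanceCriteria: ["body centered"] } },
    { id: "2", instruction: "Align the lid with the orientation marking.",
      visualCheck: { prompt: "Does the lid orientation match the marking?", acceptanceCriteria: ["orientation matches"] } },
    { id: "3", instruction: "Press the lid down until the seal engages.",
      visualCheck: { prompt: "Is the seal fully engaged?", acceptanceCriteria: ["seal engaged", "no visible gap"] } },
    { id: "4", instruction: "Connect the silo to the dosing port.",
      visualCheck: { prompt: "Is the connector fully inserted?", acceptanceCriteria: ["connector seated"] } },
    { id: "5", instruction: "Confirm the silo sits flush in its bay.",
      visualCheck: { prompt: "Is the silo flush with neighboring silos?", acceptanceCriteria: ["flush seating"] } },
  ],
};
```

Because the workflow is plain data, rolling out a new or revised guide to every site is a content update, not an app release.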

This architecture turns the Operator AI into something resembling a personal education system: adaptive, multilingual, visual, and always patient. Whether an operator in Hamburg needs a refresher on the ingredient loading sequence or a new hire in Beijing is assembling their first CA-1 silo, the experience is consistent, high-quality, and scalable.

Practical Lessons and Developer Experience

Speed of Development

It took us approximately one month to build a working MVP, including model setup, SDK integration, and the React Native wrapper. AI-assisted coding was instrumental here, especially because the Wearables Device Access Toolkit was brand new when we started. Documentation and sample applications were still minimal, which is precisely why we decided to open-source both our NPM package and our sample React Native application, demonstrated in our accompanying video.

Designing for Real-World Constraints

Wearable hardware comes with real constraints — most notably, continuous camera streaming drains the battery in roughly one to two hours. Rather than treating this as a limitation, we designed our application architecture around intelligent resource usage from the start. The camera activates only on demand, when the operator triggers a visual check by saying "Capture," which invokes the LLM's tool-use function. Between captures, the system operates in audio-only mode, which is sufficient for the vast majority of guided interactions. This event-driven approach significantly extends usable session time and reflects a broader design principle: our Operator AI is built to work within the realities of industrial environments, not around idealized lab conditions.
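The capture policy itself reduces to a small piece of state management: the camera is active only for the duration of a single visual check and is released immediately afterwards. A minimal sketch, with hypothetical names:

```typescript
// Event-driven capture policy: audio-only by default, camera powered
// only while one frame is being taken. Illustrative, not production code.
class CaptureController {
  private cameraActive = false;

  get isCameraActive(): boolean {
    return this.cameraActive;
  }

  // Invoked by the "Capture" tool call. The finally block guarantees the
  // camera is released even if the capture or analysis fails.
  async visualCheck<T>(captureFrame: () => Promise<T>): Promise<T> {
    this.cameraActive = true;
    try {
      return await captureFrame();
    } finally {
      this.cameraActive = false;  // back to audio-only mode
    }
  }
}
```

Centralizing the on/off decision in one place also makes the battery behavior auditable: any code path that wants the camera has to go through the controller.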

Platform Confidence and Ecosystem Momentum

Building a production system on a brand-new platform always carries risk. What gave us confidence to move fast was the direct engagement with Meta's product team — they reached out for feedback, incorporated developer requests, and are actively evolving the toolkit based on real-world use cases like ours. We submitted feature requests, including native physical button access and on-device transcription capabilities, both of which are being explored. Combined with Meta's strategic investment of approximately €3 billion for a ~3% stake in EssilorLuxottica (the parent company of Ray-Ban), with a potential expansion to 5%, the signal is clear: this hardware platform is here to stay and will scale globally. For us, that means we can invest deeply in our Operator AI knowing the underlying device ecosystem will keep advancing.


Edge Computing with NVIDIA Jetson: Solving the Connectivity Challenge

Our specific use case surfaced another challenge: many CA-1 deployment locations have unreliable internet connectivity. The cloud-dependent architecture — streaming audio to conversational AI, sending images to vision models, querying operational databases — requires a stable connection by design. Our solution: we are integrating NVIDIA Jetson hardware directly into each CA-1 unit, creating an on-premises edge computing node that processes all information locally without external internet dependency. The Jetson serves as the “offline brain” of the CA-1, powering visual intelligence, conversational AI inference, and data access entirely at the edge. A dedicated technical deep-dive on this architecture is forthcoming.
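One way to soften the cloud dependency while the edge node is available is a simple routing pattern: try the cloud backend, and hand the request to the local Jetson node when connectivity fails. This sketch assumes a generic inference interface and is not our actual routing logic:

```typescript
// Illustrative backend selection: prefer the cloud when reachable, fall
// back to the on-prem Jetson edge node when connectivity is unavailable.
interface InferenceBackend {
  name: string;
  infer(prompt: string): Promise<string>;
}

async function withFallback(
  cloud: InferenceBackend,
  edge: InferenceBackend,
  prompt: string,
): Promise<{ backend: string; answer: string }> {
  try {
    return { backend: cloud.name, answer: await cloud.infer(prompt) };
  } catch {
    // Connectivity lost: route the request to the local edge node.
    return { backend: edge.name, answer: await edge.infer(prompt) };
  }
}
```

In a fully offline deployment the same interface lets the edge node be the only registered backend, so application code never needs to know where inference runs.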

The Bigger Picture: Camera Data as Robotic Training Infrastructure

One of the most strategically significant aspects of this project is the data it generates. Every minute that an operator anywhere in the world wears the glasses while working on a CA-1 produces valuable first-person visual data — showing how humans interact with autonomous systems in real operational environments. From day one, we are building the pipeline to capture, store, and structure this visual data for continuous training of our existing operator AI models. But the long-term vision goes further: this data becomes training material for future robotic models that will eventually take over parts of the operator’s responsibilities, progressively increasing the autonomy of each CA-1 unit.

This mirrors the data flywheel strategies of leading AI companies: the more operators use the glasses, the smarter the system becomes, the less manual intervention is needed, and the better the next generation of models will perform. Distributed globally across all CA-1 locations, this becomes a uniquely valuable, proprietary dataset for embodied AI training.

What’s Next

In the coming weeks, we will continue maintaining our open-source repositories while building our specialized, production-grade application on top. Key priorities include always-on operator support with optimized energy efficiency, deeper integration with the NVIDIA Jetson edge node for fully offline operation, expanded use of Meta’s vision models for real-time quality assurance, and preparation for the upcoming Meta Ray-Ban Display glasses with Neural Band — which will enable visual step-by-step overlays and gesture-based interaction for an even more immersive operator experience.

We are also closely tracking SDK updates from Meta and will incorporate new capabilities — such as native button access and local AI processing — as they become available.