# 👾 ElatoAI: Realtime Voice AI Models on Arduino ESP32 Realtime AI Speech powered by SoTA AI voice models on ESP32, with Secure WebSockets, and Deno Edge Functions for >15-minute uninterrupted conversations globally. We currently support OpenAI Realtime API, Gemini Live API, xAI Grok Voice Agents API, Eleven Labs Conversational AI Agents, and Hume AI EVI-4. - [🚀 Quick Start](https://www.elatoai.com/docs/quickstart) - [Build with PlatformIO](https://www.elatoai.com/docs/platformio) - [Build on Arduino IDE](https://www.elatoai.com/docs/arduino) - [Deploy globally](https://www.elatoai.com/docs/blog/deploying-globally) - [🤖🤖🤖 Deploy multiple devices](https://www.elatoai.com/docs/blog/multiple-devices) ## 📽️ Demo Video

Video links: [OpenAI Demo](https://youtu.be/o1eIAwVll5I) | [Gemini Demo](https://youtu.be/_zUBue3pfVI) | [Eleven Labs Demo](https://youtu.be/7LKTIuEW-hg) | [Hume AI EVI-4 Demo](https://youtu.be/Gtann5pdV0I) ## 👷‍♀️ DIY Hardware Design Hardware Setup

## 📱 App Design Control your ESP32 AI device from your phone with the ElatoAI webapp. App Screenshots

## ⭐️ Key Voice AI Features App Screenshots

## 🌟 Full feature list 1. **Realtime Speech-to-Speech**: Instant speech conversion powered by OpenAI's Realtime API, Gemini's Live API, xAI's Grok Voice Agent API, Eleven Labs Conversational AI Agents and Hume AI EVI4. 2. **Create Custom AI Agents**: Create custom agents with different personalities and voices. 3. **Customizable Voices**: Choose from a variety of voices and personalities. 4. **Secure WebSockets**: Reliable, encrypted WebSocket communication. 5. **Server VAD Turn Detection**: Intelligent conversation flow handling for smooth interactions. 6. **Opus Audio Compression**: High-quality audio streaming with minimal bandwidth. 7. **Global Edge Performance**: Low latency Deno Edge Functions ensuring seamless global conversations. 8. **ESP32 Arduino Framework**: Optimized and easy-to-use hardware integration. 9. **Conversation History**: View your conversation history. 10. **Device Management and Authentication**: Register and manage your devices. 11. **User Authentication**: Secure user authentication and authorization. 12. **Conversations with WebRTC and Websockets**: Talk to your AI with WebRTC on the NextJS webapp and with websockets on the ESP32. 13. **Volume Control**: Control the volume of the ESP32 speaker from the NextJS webapp. 14. **Realtime Transcripts**: The realtime transcripts of your conversations are stored in the Supabase DB. 15. **OTA Updates**: Over the Air Updates for the ESP32 firmware. 16. **Wifi Management with captive portal**: Connect to your Wifi network or Hotspot from the ESP32 device. 17. **Factory Reset**: Factory reset the ESP32 device from the NextJS webapp. 18. **Button and Touch Support**: Use the button OR touch sensor to control the ESP32 device. 19. **No PSRAM Required**: The ESP32 device does not require PSRAM to run the speech to speech AI. 20. **OAuth for Web client**: OAuth for your users to manage their AI characters and devices. 21. **Pitch Factor**: Control the pitch of the AI's voice from the NextJS webapp to create cartoon-like voices. 22. **Tool calling**: Call tools and functions from the ESP32 device to the Deno Edge Functions for a complete voice AI agent. 23. **Tap to turn on**: Tap the touchpad to turn it on from sleep. ## Project Architecture ElatoAI consists of three main components: 1. **Frontend Client** (`Next.js` hosted on Vercel) - to create and talk to your AI agents and 'send' it to your ESP32 device 2. **Edge Server Functions** (`Deno` running on Deno/Supabase Edge) - to handle the websocket connections from the ESP32 device and the LLM Provider API calls 3. **ESP32 IoT Client** (`PlatformIO/Arduino`) - to receive the websocket connections from the Edge Server Functions and send audio to the LLM Provider via the Deno edge server. ## 🛠 Tech Stack | Component | Technology Used | |-----------------|------------------------------------------| | Frontend | Next.js, Vercel | | Backend | Supabase DB | | Edge Functions | Deno Edge Functions on Deno/Supabase | | IoT Client | PlatformIO, Arduino Framework, ESP32-S3 | | Audio Codec | Opus | | Communication | Secure WebSockets | | Libraries | [ArduinoJson](https://github.com/bblanchon/ArduinoJson), [WebSockets](https://github.com/Links2004/arduinoWebSockets), [AsyncWebServer](https://github.com/ESP32Async/ESPAsyncWebServer), [ESP32_Button](https://github.com/esp-arduino-libs/ESP32_Button), [Arduino Audio Tools](https://github.com/pschatzmann/arduino-audio-tools), [ArduinoLibOpus](https://github.com/pschatzmann/arduino-libopus) | ## High-Level Flowchart ```mermaid flowchart TD subgraph UserLayer UserInput[User Speech Input] UserOutput[AI Generated Speech Output] end UserInput --> ESP32 ESP32[ESP32 Device] -->|WebSocket| Edge[Deno Edge Function] Edge -->|OpenAI API| OpenAI[OpenAI Realtime API] Edge -->|Gemini API| Gemini[Gemini Live API] Edge -->|xAI API| xAI[xAI Grok Voice Agent API] Edge -->|ElevenLabs API| ElevenLabs[ElevenLabs AI Agents] Edge -->|Hume API| Hume[Hume AI EVI4] OpenAI --> Edge Gemini --> Edge xAI --> Edge ElevenLabs --> Edge Hume --> Edge Edge -->|WebSocket| ESP32 ESP32 --> UserOutput ``` ## Project Structure ```mermaid graph TD repo[ElatoAI] repo --> frontend[Frontend Vercel NextJS] repo --> deno[Deno Edge Function] repo --> esp32[ESP32 Arduino Client] deno --> supabase[Supabase DB] frontend --> supabase esp32 --> websockets[Secure WebSockets] esp32 --> opus[Opus Codec] esp32 --> audio_tools[arduino-audio-tools] esp32 --> libopus[arduino-libopus] esp32 --> ESPAsyncWebServer[ESPAsyncWebServer] ``` ## 📊 Important Stats - **Latency**: <2s round-trip globally - **Audio Quality**: Opus codec at 12kbps (high clarity) 24kHz sampling rate - **Uninterrupted Conversations**: Up to 15 minutes continuous conversations globally - **Global Availability**: Optimized with edge computing ## 🛡 Security - Secure WebSockets (WSS) for encrypted data transfers - Optional: API Key encryption with 256-bit AES - Supabase DB for secure authentication - Postgres RLS for all tables ## 🚫 Limitations - 3-4s Cold start time while connecting to edge server - Tested with up to 17 minutes of uninterrupted conversations - Edge server stops when wall clock time is exceeded - No speech interruption detection on ESP32 ## 🙌 Contributing We value your contributions! Here are some ideas: 1. Speech Interruption on ESP32 2. ~~Adding Arduino IDE support~~ 3. ~~Add Hume API client for emotion detection~~ 4. Add MCP support on Deno Edge 5. ~~Plug in Eleven Labs API for voice generation~~ 6. Add Azure OpenAI Support (easy pickens) - in review 7. Add Cartesia Support 8. Add Amazon Nova Support 9. Add Deepgram ## License This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. **Check out our hardware offerings at [ElatoAI Products](https://www.elatoai.com/). If you find this project interesting or useful, support us by starring this project on GitHub. ⭐**