211 lines
11 KiB
Markdown
211 lines
11 KiB
Markdown
English | [中文](README.zh.md)
|
|
|
|
<div align="center">
|
|
|
|
<a href="https://elatoai.com"><picture>
|
|
<source media="(prefers-color-scheme: dark)" srcset="assets/darkelato.png">
|
|
<source media="(prefers-color-scheme: light)" srcset="assets/lightelato.png">
|
|
<img alt="elato logo" src="assets/lightelato.png" height="70" style="max-width: 100%;">
|
|
</picture></a>
|
|
|
|
<div style="display:flex; flex-direction:row; align-items:center; flex-wrap:wrap; justify-content:center;">
|
|
<a style="display:inline-flex;" href="https://cookbook.openai.com/examples/voice_solutions/running_realtime_api_speech_on_esp32_arduino_edge_runtime_elatoai"><img src="assets/oai.png" height="42" style="width: auto;"></a>
|
|
<a style="display:inline-flex;" href="https://www.elatoai.com/docs"><img src="assets/docs.png" height="42" style="width: auto;"></a>
|
|
<a style="display:inline-flex;" href="https://discord.gg/KJWxDPBRUj"><img src="assets/discord.png" height="42" style="width: auto;"></a>
|
|
<!-- <a style="display:inline-flex;" href="https://elatoai.com/home"><img src="assets/try.png" height="42" style="width: auto;"></a> -->
|
|
<a style="display:inline-flex;" href="https://www.kickstarter.com/projects/elatoai/elato-make-toys-talk-with-ai-voices"><img src="assets/ks.png" height="42" style="width: auto;"></a>
|
|
<!-- <a style="display:inline-flex;" href="https://www.elatoai.com/products/ai-devkit"><img src="assets/diy.png" height="42" style="width: auto;"></a> -->
|
|
</div>
|
|
<a href="https://www.kickstarter.com/projects/elatoai/elato-make-toys-talk-with-ai-voices" target="_blank">
|
|
<img src="assets/cover.png" alt="Elato Logo" width="100%">
|
|
</a>
|
|
|
|
<br />
|
|
</div>
|
|
|
|
## News
|
|
- **Apr 17 2026:** Create a Global Devices/Toys network with Cloudflare Voice Agents and Durable Objects. Cloudflare's Workers AI provides Deepgram STT/TTS natively so all you need to bring is an LLM API Key to create a scalable, low-latency voice AI pipeline.
|
|
- **Apr 15 2026:** You can launch over 100+ STT, LLM, TTS voice pipeline systems with a FastAPI server with Pipecat!
|
|
- **Mar 14 2026:** Elato just launched Local AI Toys on Pi Day.🎉🎉 Your ESP32 devices can now support local AI models and voice generation with frontier Local LLMs and TTS models like Qwen, Mistral, and more with MLX. Check it out [here](https://www.github.com/akdeb/local-ai-toys).
|
|
|
|
# 👾 ElatoAI: Realtime Voice AI Models on Arduino ESP32
|
|
|
|
Realtime AI Speech powered by 100+ Voice AI models on ESP32, with Secure WebSockets & Edge Functions for >20-minute uninterrupted conversations globally.
|
|
|
|
- [🚀 Quick Start](https://www.elatoai.com/docs/quickstart)
|
|
- [Build with PlatformIO](https://www.elatoai.com/docs/platformio)
|
|
- [Build on Arduino IDE](https://www.elatoai.com/docs/arduino)
|
|
- [Deploy globally](https://www.elatoai.com/docs/blog/deploying-globally)
|
|
- [🤖🤖🤖 Deploy multiple devices](https://www.elatoai.com/docs/blog/multiple-devices)
|
|
|
|
## 📽️ Demo Video
|
|
|
|
<div align="center">
|
|
<a href="https://www.youtube.com/watch?v=o1eIAwVll5I" target="_blank">
|
|
<img src="https://raw.githubusercontent.com/akdeb/ElatoAI/refs/heads/main/assets/thumbnail.png" alt="ElatoAI Demo Video" width="100%" style="border-radius:10px" />
|
|
</a>
|
|
</div>
|
|
|
|
Video links: [OpenAI Demo](https://youtu.be/o1eIAwVll5I) | [Gemini Demo](https://youtu.be/_zUBue3pfVI) | [Eleven Labs Demo](https://youtu.be/7LKTIuEW-hg) | [Hume AI EVI-4 Demo](https://youtu.be/Gtann5pdV0I)
|
|
|
|
## 🧠 Models
|
|
|
|
### Deno Edge
|
|
1. [OpenAI Realtime API](https://github.com/akdeb/ElatoAI/tree/main/server/deno/models/openai)
|
|
2. [Gemini Live API](https://github.com/akdeb/ElatoAI/tree/main/server/deno/models/gemini)
|
|
3. [xAI Grok Voice Agent API](https://github.com/akdeb/ElatoAI/tree/main/server/deno/models/grok)
|
|
4. [Eleven Labs Conversational AI Agents](https://github.com/akdeb/ElatoAI/tree/main/server/deno/models/elevenlabs)
|
|
5. [Hume AI EVI-4](https://github.com/akdeb/ElatoAI/tree/main/server/deno/models/hume)
|
|
|
|
### Cloudflare Workers
|
|
1. LLM - [80+ Models](https://developers.cloudflare.com/workers-ai/models/?tasks=Text+Generation) OpenAI, Gemini, xAI, and more.
|
|
2. TTS - [10+ Models](https://developers.cloudflare.com/workers-ai/models/?tasks=Text-to-Speech) Deepgram, MeloTTS and more.
|
|
3. STT - [5 Models](https://developers.cloudflare.com/workers-ai/models/?tasks=Automatic+Speech+Recognition) Whisper, Deepgram and more.
|
|
|
|
|
|
## 👷♀️ DIY Hardware Design
|
|
|
|
<img src="assets/pcb-design.png" alt="Hardware Setup" width="100%">
|
|
|
|
## 📱 App Design
|
|
|
|
Control your ESP32 AI device from your phone with the ElatoAI webapp.
|
|
|
|
<img src="assets/mockups.png" alt="App Screenshots" width="100%">
|
|
|
|
|
|
<!-- ## ⭐️ Key Voice AI Features
|
|
<img src="assets/features.png" alt="App Screenshots" width="100%"> -->
|
|
|
|
## 🌟 Full feature list
|
|
|
|
1. **Realtime Speech-to-Speech**: Instant speech conversion powered by OpenAI's Realtime API, Gemini's Live API, xAI's Grok Voice Agent API, Eleven Labs Conversational AI Agents and Hume AI EVI4.
|
|
2. **Create Custom AI Agents**: Create custom agents with different personalities and voices.
|
|
3. **Customizable Voices**: Choose from a variety of voices and personalities.
|
|
4. **Secure WebSockets**: Reliable, encrypted WebSocket communication.
|
|
5. **Server VAD Turn Detection**: Intelligent conversation flow handling for smooth interactions.
|
|
6. **Opus Audio Compression**: High-quality audio streaming with minimal bandwidth.
|
|
7. **Global Edge Performance**: Low latency Deno Edge Functions ensuring seamless global conversations.
|
|
8. **ESP32 Arduino Framework**: Optimized and easy-to-use hardware integration.
|
|
9. **Conversation History**: View your conversation history.
|
|
10. **Device Management and Authentication**: Register and manage your devices.
|
|
11. **User Authentication**: Secure user authentication and authorization.
|
|
12. **Conversations with WebRTC and Websockets**: Talk to your AI with WebRTC on the NextJS webapp and with websockets on the ESP32.
|
|
13. **Volume Control**: Control the volume of the ESP32 speaker from the NextJS webapp.
|
|
14. **Realtime Transcripts**: The realtime transcripts of your conversations are stored in the Supabase DB.
|
|
15. **OTA Updates**: Over the Air Updates for the ESP32 firmware.
|
|
16. **Wifi Management with captive portal**: Connect to your Wifi network or Hotspot from the ESP32 device.
|
|
17. **Factory Reset**: Factory reset the ESP32 device from the NextJS webapp.
|
|
18. **Button and Touch Support**: Use the button OR touch sensor to control the ESP32 device.
|
|
19. **No PSRAM Required**: The ESP32 device does not require PSRAM to run the speech to speech AI.
|
|
20. **OAuth for Web client**: OAuth for your users to manage their AI characters and devices.
|
|
21. **Pitch Factor**: Control the pitch of the AI's voice from the NextJS webapp to create cartoon-like voices.
|
|
22. **Tool calling**: Call tools and functions from the ESP32 device to the edge Functions for a complete voice AI agent.
|
|
23. **Tap to turn on**: Tap the touchpad to turn it on from sleep.
|
|
24. **Deploy on Cloudflare**: Connect to any LLM, TTS, STT service with Cloudflare Voice Agents and Durable Objects
|
|
|
|
## Project Architecture
|
|
|
|
ElatoAI consists of three main components:
|
|
|
|
1. **Frontend Client** (`Next.js` hosted on Vercel) - to create and talk to your AI agents and 'send' it to your ESP32 device
|
|
2. **Edge Server Functions** (`Deno Edge` or `Cloudflare Workers`) - to handle the websocket connections from the ESP32 device and the LLM Provider API calls
|
|
3. **ESP32 IoT Client** (`PlatformIO/Arduino`) - to receive the websocket connections from the Edge Server Functions and send audio to the LLM Provider via the Deno edge server or Cloudflare Durable Objects.
|
|
|
|
|
|
## 🛠 Tech Stack
|
|
|
|
| Component | Technology Used |
|
|
|-----------------|------------------------------------------|
|
|
| Frontend | Next.js, Vercel |
|
|
| Backend | Supabase DB |
|
|
| Edge Functions | Deno Edge or Cloudflare Workers |
|
|
| IoT Client | PlatformIO, Arduino Framework, ESP32-S3 |
|
|
| Audio Codec | Opus |
|
|
| Communication | Secure WebSockets |
|
|
| Libraries | [ArduinoJson](https://github.com/bblanchon/ArduinoJson), [WebSockets](https://github.com/Links2004/arduinoWebSockets), [AsyncWebServer](https://github.com/ESP32Async/ESPAsyncWebServer), [ESP32_Button](https://github.com/esp-arduino-libs/ESP32_Button), [Arduino Audio Tools](https://github.com/pschatzmann/arduino-audio-tools), [ArduinoLibOpus](https://github.com/pschatzmann/arduino-libopus) |
|
|
|
|
## High-Level Flowchart
|
|
|
|
```mermaid
|
|
flowchart TD
|
|
subgraph UserLayer
|
|
UserInput[User Speech Input]
|
|
UserOutput[AI Generated Speech Output]
|
|
end
|
|
|
|
UserInput --> ESP32
|
|
ESP32[ESP32 Device] -->|WebSocket| Edge[Deno Edge / Cloudflare Workers]
|
|
Edge -->|OpenAI API| OpenAI[OpenAI Realtime API]
|
|
Edge -->|Gemini API| Gemini[Gemini Live API]
|
|
Edge -->|xAI API| xAI[xAI Grok Voice Agent API]
|
|
Edge -->|ElevenLabs API| ElevenLabs[ElevenLabs AI Agents]
|
|
Edge -->|Hume API| Hume[Hume AI EVI4]
|
|
OpenAI --> Edge
|
|
Gemini --> Edge
|
|
xAI --> Edge
|
|
ElevenLabs --> Edge
|
|
Hume --> Edge
|
|
Edge -->|WebSocket| ESP32
|
|
ESP32 --> UserOutput
|
|
```
|
|
|
|
|
|
## Project Structure
|
|
|
|
```mermaid
|
|
graph TD
|
|
repo[ElatoAI]
|
|
repo --> frontend[Frontend Vercel NextJS]
|
|
repo --> server[Deno Edge Function / Cloudflare Workers]
|
|
repo --> esp32[ESP32 Arduino Client]
|
|
server --> supabase[Supabase DB]
|
|
|
|
frontend --> supabase
|
|
esp32 --> websockets[Secure WebSockets]
|
|
esp32 --> opus[Opus Codec]
|
|
esp32 --> audio_tools[arduino-audio-tools]
|
|
esp32 --> libopus[arduino-libopus]
|
|
esp32 --> ESPAsyncWebServer[ESPAsyncWebServer]
|
|
```
|
|
|
|
## 📊 Important Stats
|
|
|
|
- **Latency**: <2s round-trip globally
|
|
- **Audio Quality**: Opus codec at 12kbps (high clarity) 24kHz sampling rate
|
|
- **Uninterrupted Conversations**: Up to 20 minutes continuous conversations globally
|
|
- **Global Availability**: Optimized with edge computing
|
|
|
|
## 🛡 Security
|
|
|
|
- Secure WebSockets (WSS) for encrypted data transfers
|
|
- Optional: API Key encryption with 256-bit AES
|
|
- Supabase DB for secure authentication
|
|
- Postgres RLS for all tables
|
|
|
|
## 🚫 Limitations
|
|
- 3-4s cold start time while connecting to edge server
|
|
- Tested with up to 17 minutes of uninterrupted conversations
|
|
- Edge server stops when wall clock time is exceeded
|
|
- No speech interruption detection on ESP32
|
|
|
|
## 🙌 Contributing
|
|
|
|
We value your contributions! Here are some ideas to get you started:
|
|
1. Speech Interruption on ESP32 (works with OpenAI)
|
|
2. ~~Adding Arduino IDE support~~
|
|
3. ~~Add Hume API client for emotion detection~~
|
|
4. Add MCP support on Deno Edge
|
|
5. ~~Plug in Eleven Labs API for voice generation~~
|
|
6. Add Azure OpenAI Support (easy pickens) - in review
|
|
7. Add Cartesia Support
|
|
8. Add Amazon Nova Support
|
|
9. Add Deepgram
|
|
10. ~~Add Cloudflare Workers support~~
|
|
|
|
|
|
## License
|
|
|
|
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
|
|
|
|
**Check out our hardware offerings at [ElatoAI Products](https://www.elatoai.com/). If you find this project interesting or useful, support us by starring this project on GitHub. ⭐**
|