diff --git a/api/CLAUDE.md b/api/CLAUDE.md
index 9c14ca9..8a69458 100644
--- a/api/CLAUDE.md
+++ b/api/CLAUDE.md
@@ -54,7 +54,7 @@ FastAPI application serving three architectural layers: routes (HTTP endpoints),
 ### Routers
 - **routers/chat.py**: POST /chat
 - **routers/source_chat.py**: POST /source/{source_id}/chat
-- **routers/podcasts.py**: POST /podcasts, GET /podcasts/{id}, etc.
+- **routers/podcasts.py**: POST /podcasts, GET /podcasts/{id}, POST /podcasts/episodes/{id}/retry, etc.
 - **routers/notes.py**: POST /notes, GET /notes/{id}
 - **routers/sources.py**: POST /sources, GET /sources/{id}, DELETE /sources/{id}
 - **routers/models.py**: GET /models, POST /models/config
@@ -70,7 +70,7 @@ FastAPI application serving three architectural layers: routes (HTTP endpoints),
 - **Async/await throughout**: All DB queries, graph invocations, AI calls are async
 - **SurrealDB transactions**: Services use repo_query, repo_create, repo_upsert from database layer
 - **Config override pattern**: Models/config override via models_service passed to graph.ainvoke(config=...)
-- **Error handling**: Services catch exceptions and return HTTP status codes (400 Bad Request, 404 Not Found, 500 Internal Server Error)
+- **Error handling**: Custom exception hierarchy (`open_notebook.exceptions`) with global FastAPI exception handlers mapping to HTTP status codes (see Error Handling section below). LangGraph nodes use `classify_error()` to convert raw LLM provider errors into typed exceptions with user-friendly messages.
 - **Logging**: loguru logger in main.py; services expected to log key operations
 - **Response normalization**: All responses follow standard schema (data + metadata structure)
 
@@ -101,6 +101,35 @@ FastAPI application serving three architectural layers: routes (HTTP endpoints),
 - **No OpenAPI security scheme**: API docs available without auth (disable before production)
 - **Services don't validate user permission**: All endpoints trust authentication layer; no per-notebook permission checks
 
+## Error Handling
+
+### Global Exception Handlers (`main.py`)
+
+FastAPI exception handlers map custom exception types from `open_notebook.exceptions` to HTTP status codes. All error responses include CORS headers.
+
+| Exception Class | HTTP Status | Use Case |
+|----------------|-------------|----------|
+| `NotFoundError` | 404 | Resource not found |
+| `InvalidInputError` | 400 | Bad request data |
+| `AuthenticationError` | 401 | Invalid/missing API key |
+| `RateLimitError` | 429 | Provider rate limit exceeded |
+| `ConfigurationError` | 422 | Wrong model name, missing config |
+| `NetworkError` | 502 | Cannot reach AI provider |
+| `ExternalServiceError` | 502 | Provider returned error (500/503, context length) |
+| `OpenNotebookError` (base) | 500 | Any other application error |
+
+### Error Classification (`open_notebook.utils.error_classifier`)
+
+The `classify_error()` function maps raw exceptions from LLM providers, Esperanto, and LangChain to the typed exceptions above, each paired with a user-friendly message. It is used in all LangGraph nodes and SSE streaming handlers.
+
+**Flow**: Raw exception → keyword matching → `(ExceptionClass, user_message)` → raised → caught by global handler → HTTP response with descriptive message.
+
+### Frontend Integration
+
+The frontend `getApiErrorMessage()` helper (`lib/utils/error-handler.ts`) tries i18n mapping first, then falls back to displaying the backend's descriptive error message directly.
+
+---
+
 ## How to Add New Endpoint
 
 1. Create router file in `routers/` (e.g., `routers/new_feature.py`)
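
Review note on the `api/CLAUDE.md` hunk above: the exception-to-status table can be read as a simple lookup. The following is a minimal, framework-free sketch of that mapping, not the project's actual handler code. The class names and status codes are taken from the table. The flat class hierarchy, the `STATUS_BY_EXCEPTION` dict, the `status_for()` helper, and the 500 fallback for unrelated exceptions are assumptions made for illustration.

```python
class OpenNotebookError(Exception):
    """Base application error; the table maps it to HTTP 500."""


# Assumption: every typed error derives directly from the base class.
class NotFoundError(OpenNotebookError): ...
class InvalidInputError(OpenNotebookError): ...
class AuthenticationError(OpenNotebookError): ...
class RateLimitError(OpenNotebookError): ...
class ConfigurationError(OpenNotebookError): ...
class NetworkError(OpenNotebookError): ...
class ExternalServiceError(OpenNotebookError): ...


# Status codes copied from the table in the diff.
STATUS_BY_EXCEPTION = {
    NotFoundError: 404,
    InvalidInputError: 400,
    AuthenticationError: 401,
    RateLimitError: 429,
    ConfigurationError: 422,
    NetworkError: 502,
    ExternalServiceError: 502,
    OpenNotebookError: 500,
}


def status_for(exc: Exception) -> int:
    """Hypothetical helper: return the status of the closest mapped ancestor."""
    for cls in type(exc).__mro__:
        if cls in STATUS_BY_EXCEPTION:
            return STATUS_BY_EXCEPTION[cls]
    # Assumption: anything outside the hierarchy is treated as a server error.
    return 500
```

Walking the MRO means a more specific subclass added later inherits its parent's status code unless it gets its own entry.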
diff --git a/docs/2-CORE-CONCEPTS/podcasts-explained.md b/docs/2-CORE-CONCEPTS/podcasts-explained.md
index 2a95d31..449551e 100644
--- a/docs/2-CORE-CONCEPTS/podcasts-explained.md
+++ b/docs/2-CORE-CONCEPTS/podcasts-explained.md
@@ -156,6 +156,37 @@ Audio files → Mix together → Final podcast MP3
 
 ---
 
+## When Things Go Wrong: Failures & Retry
+
+Podcast generation involves multiple steps (outline, transcript, TTS) and depends on external AI providers. Any of these steps can fail.
+
+### What Happens on Failure
+
+When podcast generation fails (e.g., wrong model configured, API key expired, provider outage):
+
+- The episode is marked as **Failed** with a red badge
+- The **error message** from the AI provider is displayed so you can understand what went wrong
+- No duplicate episodes are created: automatic retries are disabled, so a failed job does not leave extra episode records behind
+
+### How to Retry a Failed Episode
+
+1. Go to the podcast's **Episodes** tab
+2. Find the failed episode — it shows a red "FAILED" badge and an error details box
+3. Click the **Retry** button
+4. The failed episode is deleted and a new generation job is submitted
+5. The new episode appears with "pending" status
+
+### Common Failure Causes
+
+| Error | What to Do |
+|-------|-----------|
+| Invalid API key | Check Settings → Credentials for the TTS and language model providers |
+| Model not found | Verify the model name in your episode profile exists and is correctly configured |
+| Rate limit exceeded | Wait a few minutes and retry |
+| Provider unavailable | Check provider status page; retry later |
+
+---
+
 ## Key Architecture Decisions
 
 ### 1. Asynchronous Processing
diff --git a/docs/6-TROUBLESHOOTING/ai-chat-issues.md b/docs/6-TROUBLESHOOTING/ai-chat-issues.md
index 1f8376b..24ba567 100644
--- a/docs/6-TROUBLESHOOTING/ai-chat-issues.md
+++ b/docs/6-TROUBLESHOOTING/ai-chat-issues.md
@@ -2,6 +2,8 @@
 
 Problems with AI models, chat, and response quality.
 
+> **Note:** Open Notebook now shows descriptive error messages for AI provider failures. Instead of a generic "An unexpected error occurred", you'll see specific messages like "Authentication failed. Please check your API key" or "Rate limit exceeded. Please wait a moment and try again." These messages help you diagnose and fix issues faster.
+
 ---
 
 ## "Failed to send message" Error
diff --git a/docs/6-TROUBLESHOOTING/index.md b/docs/6-TROUBLESHOOTING/index.md
index 0ab2471..539668b 100644
--- a/docs/6-TROUBLESHOOTING/index.md
+++ b/docs/6-TROUBLESHOOTING/index.md
@@ -61,6 +61,7 @@ Having issues? Use this guide to diagnose and fix problems.
 ### Podcasts
 
 - **Can't generate podcast** → [Quick Fixes](quick-fixes.md#8-podcast-generation-failed)
+- **Podcast shows "FAILED" badge** → Check the error message displayed on the episode, then use the **Retry** button. See [Podcasts Explained](../2-CORE-CONCEPTS/podcasts-explained.md#when-things-go-wrong-failures--retry)
 - **Podcast audio is robotic** → [Quick Fixes](quick-fixes.md#8-podcast-generation-failed)
 - **Podcast generation times out** → [Quick Fixes](quick-fixes.md#8-podcast-generation-failed)
 
diff --git a/frontend/src/CLAUDE.md b/frontend/src/CLAUDE.md
index 8a18d5e..5959f83 100644
--- a/frontend/src/CLAUDE.md
+++ b/frontend/src/CLAUDE.md
@@ -113,6 +113,7 @@ User interactions trigger mutations/queries via hooks, which communicate with th
 
 ### Error Handling
 - **API errors**: All request failures propagate to consuming code; components show toast notifications
+- **Error message resolution** (`lib/utils/error-handler.ts`): `getApiErrorMessage()` tries i18n mapping first via `ERROR_MAP`, then falls back to displaying the backend's descriptive error message directly. This ensures user-friendly error messages from the error classification system are shown as-is.
 - **Toast feedback**: Mutations show success/error toasts (from `sonner` library)
 - **Error boundary**: App-level error boundary catches React render errors; shows fallback UI
 
diff --git a/frontend/src/lib/api/CLAUDE.md b/frontend/src/lib/api/CLAUDE.md
index 3bf3e43..46e8011 100644
--- a/frontend/src/lib/api/CLAUDE.md
+++ b/frontend/src/lib/api/CLAUDE.md
@@ -5,7 +5,7 @@ Axios-based client and resource-specific API modules for backend communication w
 ## Key Components
 
 - **`client.ts`**: Central Axios instance with request/response interceptors, auth headers, base URL resolution
-- **Resource modules** (`sources.ts`, `notebooks.ts`, `chat.ts`, `search.ts`, etc.): Endpoint-specific functions returning typed responses
+- **Resource modules** (`sources.ts`, `notebooks.ts`, `chat.ts`, `search.ts`, `podcasts.ts`, etc.): Endpoint-specific functions returning typed responses
 - **`query-client.ts`**: TanStack Query client configuration with default options
 - **`models.ts`, `notes.ts`, `embeddings.ts`, `settings.ts`**: Additional resource APIs
 
@@ -44,7 +44,7 @@ Axios-based client and resource-specific API modules for backend communication w
 - **Timeout for streaming**: 10-minute timeout may not cover very long-running LLM operations; consider extending if needed
 - **Auth token management**: Token stored in localStorage `auth-storage` key; uses Zustand persist middleware
 - **Headers mutation in interceptor**: Mutating `config.headers` directly; be careful with middleware order
-- **No retry logic**: Failed requests not automatically retried; must be handled in consuming code
+- **No automatic retry logic**: Failed requests are not automatically retried and must be handled in consuming code. Podcast episodes are the exception: they have an explicit retry via `retryEpisode()` in `podcasts.ts` and the `useRetryPodcastEpisode()` hook
 - **Content-Type header precedence**: FormData interceptor deletes Content-Type after checking; subsequent interceptors won't re-add it
 
 ## Usage Example
diff --git a/open_notebook/CLAUDE.md b/open_notebook/CLAUDE.md
index 4ddbf70..fe9a722 100644
--- a/open_notebook/CLAUDE.md
+++ b/open_notebook/CLAUDE.md
@@ -109,7 +109,12 @@ User documentation is at @docs/
 - **Vector search**: Built-in semantic search across all content
 - **Transactions**: Repo functions handle ACID operations
 
-### 5. Authentication
+### 5. Error Handling
+- **Custom exceptions** (`exceptions.py`): Hierarchy rooted at `OpenNotebookError` with typed subclasses (`AuthenticationError`, `ConfigurationError`, `RateLimitError`, `ExternalServiceError`, `NetworkError`, etc.)
+- **Error classification** (`utils/error_classifier.py`): `classify_error()` maps raw LLM provider exceptions to typed exceptions with user-friendly messages via keyword matching
+- **Global handlers**: FastAPI exception handlers in `api/main.py` convert typed exceptions to appropriate HTTP status codes (401, 422, 429, 502, etc.)
+
+### 6. Authentication
 - **Current**: Simple password middleware (insecure, dev-only)
 - **Production**: Replace with OAuth/JWT (see CONFIGURATION.md)
 
@@ -135,6 +140,8 @@ User documentation is at @docs/
 ### Podcast Generation
 - **Async job queue**: `podcast_service.py` submits jobs but doesn't wait
 - **Track status**: Use `/commands/{command_id}` endpoint to poll status
+- **Failure handling**: Failed jobs are marked as "failed" with error messages; retry via `POST /podcasts/episodes/{id}/retry`
+- **No automatic retries**: Podcast jobs use `max_attempts: 1` to prevent duplicate episode records
 - **TTS failures**: Fall back to silent audio if speech synthesis fails
 
 ### Content Processing
@@ -217,5 +224,5 @@ See dedicated CLAUDE.md files for detailed guidance:
 
 ---
 
-**Last Updated**: January 2026 | **Project Version**: 1.2.4+
+**Last Updated**: February 2026 | **Project Version**: 1.7.3
 
diff --git a/open_notebook/ai/CLAUDE.md b/open_notebook/ai/CLAUDE.md
index b6e3cc0..5e8de94 100644
--- a/open_notebook/ai/CLAUDE.md
+++ b/open_notebook/ai/CLAUDE.md
@@ -93,7 +93,7 @@ All models use Esperanto library as provider abstraction (OpenAI, Anthropic, Goo
 - **Large context threshold hard-coded**: 105,000 token threshold for large_context_model upgrade (not configurable)
 - **DefaultModels.get_instance() fresh fetch**: Intentionally bypasses parent singleton cache to pick up live config changes; creates new instance each call
 - **Type-specific getters use assertions**: get_speech_to_text() asserts isinstance (catches misconfiguration early)
-- **No validation of model existence**: ModelManager.get_model() raises ValueError if model not found (not caught upstream)
+- **ConfigurationError on missing model**: ModelManager.get_model() and provision_langchain_model() raise `ConfigurationError` (not ValueError) when a model is not found or not configured, so the global exception handler returns HTTP 422 with a descriptive message
 - **Esperanto caching**: Actual model instances cached by Esperanto (not by ModelManager); ModelManager stateless
 - **Fallback chain specificity**: "transformation" type falls back to default_chat_model if not explicitly set (convention-based)
 - **kwargs passed through**: provision_langchain_model() passes kwargs to AIFactory but doesn't validate what's accepted
diff --git a/open_notebook/graphs/CLAUDE.md b/open_notebook/graphs/CLAUDE.md
index 3406576..1b4660a 100644
--- a/open_notebook/graphs/CLAUDE.md
+++ b/open_notebook/graphs/CLAUDE.md
@@ -21,6 +21,23 @@ LangGraph-based workflow orchestration for content processing, chat interactions
 - **Checkpointing**: `chat.py` and `source_chat.py` use SqliteSaver for message history (LangGraph's built-in persistence)
 - **Content extraction**: `source.py` uses content-core library with provider/model from DefaultModels; URLs and files both supported
 
+## Error Handling in Graphs
+
+All graph nodes catch raw LLM provider exceptions, pass them to `classify_error()` from `open_notebook.utils.error_classifier`, and re-raise them as typed `OpenNotebookError` subclasses with user-friendly messages. This ensures that errors from any AI provider (authentication failures, rate limits, model not found, network issues) are surfaced to the user with actionable messages instead of opaque stack traces.
+
+**Pattern in nodes**:
+```python
+from open_notebook.utils.error_classifier import classify_error
+
+try:
+    result = await model.ainvoke(...)
+except Exception as e:
+    exc_class, message = classify_error(e)
+    raise exc_class(message) from e
+```
+
+---
+
 ## Quirks & Edge Cases
 
 - **Async loop gymnastics**: ThreadPoolExecutor workaround needed because LangGraph invokes sync nodes but we call async functions; fragile if event loop state changes
@@ -38,6 +55,7 @@ LangGraph-based workflow orchestration for content processing, chat interactions
 - `ai_prompter`: Prompter for Jinja2 template rendering
 - `content_core`: `extract_content()` for file/URL processing
 - `open_notebook.ai.provision`: `provision_langchain_model()` (async factory with fallback logic)
+- `open_notebook.utils.error_classifier`: `classify_error()` for user-friendly LLM error messages
 - `open_notebook.domain.notebook`: Domain models (Source, Note, SourceInsight, vector_search)
 - `loguru`: Logging
 
diff --git a/open_notebook/podcasts/CLAUDE.md b/open_notebook/podcasts/CLAUDE.md
index 95a8f6c..7e22ebb 100644
--- a/open_notebook/podcasts/CLAUDE.md
+++ b/open_notebook/podcasts/CLAUDE.md
@@ -34,6 +34,7 @@ All inherit from `ObjectModel` (SurrealDB base class with table_name and save/lo
 - Optional audio_file path, transcript/outline dicts
 - **Job tracking**: command field links to surreal-commands RecordID
 - `get_job_status()` fetches async job status via surreal-commands library
+- `get_job_detail()` returns both status and error_message from the job (used for retry validation and UI error display)
 - `_prepare_save_data()` ensures command field is always RecordID format for database
 
 ## Common Patterns
@@ -56,6 +57,7 @@ All inherit from `ObjectModel` (SurrealDB base class with table_name and save/lo
 
 - **Snapshot approach**: Episode/speaker profiles stored as dicts (not references), so profile updates don't retroactively affect past episodes
 - **Job status resilience**: get_job_status() catches all exceptions and returns "unknown" (no error propagation)
+- **No automatic retries**: Podcast generation commands use `retry={"max_attempts": 1}` to prevent duplicate episode records on failure; retry is user-initiated via `POST /podcasts/episodes/{id}/retry`
 - **validate_speakers executes late**: Validators run at instantiation; bulk inserts may not trigger full validation
 - **RecordID coercion**: ensure_record_id() handles both string and RecordID inputs; command field parsed during deserialization
 - **No cascade delete**: Removing a profile doesn't cascade to episodes using it
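
Review note on the `open_notebook/podcasts/CLAUDE.md` hunks above: the user-initiated retry described in this change (validate that the job failed, delete the failed episode, submit a new job that starts as "pending") can be sketched with an in-memory store. Everything here is hypothetical scaffolding: the `episodes` dict, `submit_job()`, `retry_episode()`, and the use of `KeyError`/`ValueError` stand in for the real SurrealDB records, command queue, and typed exceptions, which this diff does not show.

```python
import itertools

_ids = itertools.count(1)

# Hypothetical in-memory stand-in for episode records.
episodes = {}


def submit_job(name: str) -> str:
    """Create a pending episode; max_attempts=1 mirrors the no-auto-retry setting."""
    episode_id = f"episode:{next(_ids)}"
    episodes[episode_id] = {
        "name": name,
        "status": "pending",
        "error_message": None,
        "max_attempts": 1,
    }
    return episode_id


def retry_episode(episode_id: str) -> str:
    """Delete a failed episode and submit a fresh job for the same content."""
    episode = episodes.get(episode_id)
    if episode is None:
        raise KeyError(f"Episode {episode_id} not found")
    if episode["status"] != "failed":
        # Only failed episodes may be retried in this sketch.
        raise ValueError(f"Episode {episode_id} is not in a failed state")
    del episodes[episode_id]
    return submit_job(episode["name"])
```

Deleting before resubmitting is what keeps a retry from leaving two records for the same episode.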
diff --git a/uv.lock b/uv.lock
index ef92e6b..78b844f 100644
--- a/uv.lock
+++ b/uv.lock
@@ -2095,7 +2095,7 @@ wheels = [
 
 [[package]]
 name = "open-notebook"
-version = "1.7.2"
+version = "1.7.3"
 source = { editable = "." }
 dependencies = [
     { name = "ai-prompter" },