feat: use token-based sizing for embedding chunking (#749)

* feat: make chunk sizing token-based with 512-token default

* fix: defer embedding debug token metrics

* chore: lower default chunk size to 400 tokens and document rationale

The previous 512-token default matched exactly the context window of
BERT-family embedders like mxbai-embed-large, leaving no margin for:
- tokenizer mismatch between our o200k_base measurement and the
  embedder's own WordPiece tokenizer
- occasional splitter overshoot (RecursiveCharacterTextSplitter can
  emit chunks slightly above chunk_size when separators are sparse)
- special tokens ([CLS], [SEP]) that consume context-window budget

400 tokens keeps ~20% headroom below 512 while still being a large
improvement over the old character-based default for most content.
Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE
via env var. Also adds a CHANGELOG entry for the full PR behavior
change.
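
The headroom figure above can be checked with quick arithmetic. This is a minimal sketch; `EMBEDDER_LIMIT` and `headroom_fraction` are illustrative names, not identifiers from the codebase.

```python
EMBEDDER_LIMIT = 512      # context window of BERT-family embedders, per the message above
DEFAULT_CHUNK_SIZE = 400  # new default, measured in o200k_base tokens


def headroom_fraction(chunk_size: int, limit: int = EMBEDDER_LIMIT) -> float:
    """Fraction of the embedder's window left unused by a full-size chunk."""
    return (limit - chunk_size) / limit


print(f"{headroom_fraction(DEFAULT_CHUNK_SIZE):.1%}")  # 21.9%
print(f"{headroom_fraction(512):.1%}")                 # 0.0%
```

The 112 spare tokens are what absorb tokenizer mismatch, splitter overshoot, and special tokens.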

* chore: move chunking changelog entry under 1.8.5

Target release is 1.8.5 — moving the Changed section out of Unreleased.

---------

Co-authored-by: Luis Novo <lfnovo@gmail.com>
unendless314 2026-04-20 00:49:09 +08:00 committed by GitHub
parent 648f6d7808
commit 6aabacfca6
6 changed files with 130 additions and 56 deletions

@@ -9,6 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [1.8.5] - 2026-04-14
+### Changed
+- Embedding chunking is now token-based instead of character-based, improving chunk sizing consistency for CJK and mixed-language content (#542, #749)
+- `OPEN_NOTEBOOK_CHUNK_SIZE` and `OPEN_NOTEBOOK_CHUNK_OVERLAP` semantics changed from characters to tokens; default reduced from 1200 characters to 400 tokens to stay safely below the 512-token ceiling of BERT-family embedders (e.g. mxbai-embed-large) after accounting for tokenizer mismatch and splitter overshoot. Existing stored embeddings are unaffected; only new ingestions use the new chunking.
 ### Fixed
 - Credentials endpoint no longer crashes (500) when encryption key doesn't match stored credentials (#740)
 - Broken credentials are now shown with a decryption warning and can still be deleted

@@ -24,20 +24,20 @@ Each utility is stateless and can be imported independently.
 The chunking behavior can be configured via environment variables:
-- **OPEN_NOTEBOOK_CHUNK_SIZE**: Maximum chunk size in characters (default: 1200)
-  - Minimum: 100 characters
-  - Warnings: Values > 8192 characters or invalid values
-  - Use case: Smaller models (e.g., mxbai-embed-large with limited context window)
+- **OPEN_NOTEBOOK_CHUNK_SIZE**: Maximum chunk size in tokens (default: 400)
+  - Minimum: 100 tokens
+  - Warnings: Values > 8192 tokens or invalid values
+  - Use case: Conservative baseline that leaves headroom below 512-token embedders (e.g. mxbai-embed-large). Buffer accounts for tokenizer mismatch between our `o200k_base` measurement and the embedder's own tokenizer, plus occasional splitter overshoot and special tokens.
-- **OPEN_NOTEBOOK_CHUNK_OVERLAP**: Overlap between chunks in characters (default: 15% of CHUNK_SIZE)
+- **OPEN_NOTEBOOK_CHUNK_OVERLAP**: Overlap between chunks in tokens (default: 15% of CHUNK_SIZE)
   - Must be: >= 0 and < CHUNK_SIZE
   - Warnings: Invalid values or values >= CHUNK_SIZE
   - Use case: Control how much context is shared between adjacent chunks
-Example for models with small context windows:
+Example for embedders with larger context windows (e.g. OpenAI text-embedding-3 family, 8191 tokens):
 ```bash
-export OPEN_NOTEBOOK_CHUNK_SIZE=512
-export OPEN_NOTEBOOK_CHUNK_OVERLAP=50
+export OPEN_NOTEBOOK_CHUNK_SIZE=1500
+export OPEN_NOTEBOOK_CHUNK_OVERLAP=150
 ```
 Note: Changes require restart of the application.
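
The rules documented in this hunk (token units, a 400-token default, 15% default overlap, overlap constrained to `0 <= overlap < CHUNK_SIZE`) can be sketched roughly as follows. This is an illustration, not the project's `_get_chunk_size()` or `_get_chunk_overlap()`: `resolve_chunk_config`, `_parse_int`, and the constants are hypothetical names, and falling back to the default for sizes below the 100-token minimum is an assumption, since the diff does not show that branch.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_CHUNK_SIZE = 400  # tokens
MIN_CHUNK_SIZE = 100      # tokens
OVERLAP_RATIO = 0.15      # default overlap is 15% of the chunk size


def _parse_int(raw: Optional[str]) -> Optional[int]:
    """Return the integer value of raw, or None if unset or not an integer."""
    try:
        return int(raw) if raw is not None else None
    except ValueError:
        return None


def resolve_chunk_config(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Return (chunk_size, chunk_overlap), both in tokens."""
    size = _parse_int(env.get("OPEN_NOTEBOOK_CHUNK_SIZE"))
    if size is None or size < MIN_CHUNK_SIZE:
        # Assumption: too-small or invalid sizes fall back to the default.
        size = DEFAULT_CHUNK_SIZE

    default_overlap = int(size * OVERLAP_RATIO)
    overlap = _parse_int(env.get("OPEN_NOTEBOOK_CHUNK_OVERLAP"))
    if overlap is None or not 0 <= overlap < size:
        overlap = default_overlap
    return size, overlap


print(resolve_chunk_config({}))  # (400, 60)
print(resolve_chunk_config({
    "OPEN_NOTEBOOK_CHUNK_SIZE": "1500",
    "OPEN_NOTEBOOK_CHUNK_OVERLAP": "150",
}))  # (1500, 150)
```

With no variables set, the effective overlap is 60 tokens, which is 15% of the 400-token default.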
@@ -63,7 +63,7 @@ Note: Changes require restart of the application.
 ### chunking.py
 - **ContentType**: Enum (HTML, MARKDOWN, PLAIN)
-- **CHUNK_SIZE**: Configurable via `OPEN_NOTEBOOK_CHUNK_SIZE` env var (default: 1200)
+- **CHUNK_SIZE**: Configurable via `OPEN_NOTEBOOK_CHUNK_SIZE` env var (default: 400)
 - **CHUNK_OVERLAP**: Configurable via `OPEN_NOTEBOOK_CHUNK_OVERLAP` env var (default: 15% of CHUNK_SIZE)
 - **detect_content_type_from_extension(file_path)**: Detect type from file extension
 - **detect_content_type_from_heuristics(text)**: Detect type from content patterns (returns type + confidence)
@@ -74,7 +74,7 @@ Note: Changes require restart of the application.
 - Uses LangChain splitters: HTMLHeaderTextSplitter, MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
 - Extension-based detection is primary; heuristics can override PLAIN extensions with 0.8+ confidence
 - Secondary chunking applied when HTML/Markdown splitters produce oversized chunks
-- Returns list of strings, each ≤ CHUNK_SIZE characters
+- Returns list of strings, each approximately ≤ CHUNK_SIZE tokens
 ### embedding.py
 - **mean_pool_embeddings(embeddings)**: Combine multiple embeddings via normalized mean pooling
@@ -83,7 +83,7 @@ Note: Changes require restart of the application.
 **Key behavior**:
 - Uses model_manager.get_model("embedding") for embedding model
-- Short text (≤ CHUNK_SIZE): direct embedding
+- Short text (≤ CHUNK_SIZE tokens): direct embedding
 - Long text: chunk → embed each → mean pool results
 - Mean pooling: normalize each → mean → normalize result (using numpy)
 - Raises ValueError for empty/whitespace-only text
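
The "normalize each → mean → normalize result" step listed above can be sketched in pure Python. The documented implementation uses numpy; the stdlib version below is a stand-in to keep the example self-contained. `mean_pool` and `_l2_normalize` are hypothetical names, and the handling of empty input and all-zero vectors is an assumption.

```python
import math
from typing import List, Sequence


def _l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale vec to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Normalize each vector, average them, then normalize the result."""
    if not embeddings:
        raise ValueError("Cannot mean-pool an empty list of embeddings")
    unit_vectors = [_l2_normalize(e) for e in embeddings]
    dim = len(unit_vectors[0])
    mean = [sum(v[i] for v in unit_vectors) / len(unit_vectors) for i in range(dim)]
    return _l2_normalize(mean)


# The second vector is much longer, but normalizing first gives both chunks
# equal weight, so the pooled vector points halfway between them.
pooled = mean_pool([[3.0, 0.0], [0.0, 10.0]])
print(pooled)
```

Normalizing before averaging keeps a chunk with a large-magnitude embedding from dominating the pooled vector.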
@@ -103,7 +103,7 @@ Note: Changes require restart of the application.
 - **token_count(text)**: Returns estimated token count for string (via tiktoken)
 - **token_cost(text, model)**: Calculate cost estimate for text with given model
-**Key behavior**: Uses cl100k_base encoding; may differ slightly from actual model tokenization
+**Key behavior**: Uses `o200k_base` encoding; may differ slightly from actual model tokenization. If `tiktoken` is unavailable, `token_count()` falls back to a coarse estimate; this refactor keeps that existing contract.
 ### version_utils.py
 - **compare_versions(v1, v2)**: Returns -1 (v1 < v2), 0 (equal), 1 (v1 > v2)
@@ -135,8 +135,9 @@ Note: Changes require restart of the application.
 ## Important Quirks & Gotchas
-- **Token count estimation**: Uses cl100k_base encoding; may differ 5-10% from actual model tokens
-- **Chunk size for Ollama**: 1500 chars chosen to fit within Ollama embedding model context limits
+- **Token count estimation**: Uses `o200k_base` encoding; may differ slightly from actual model tokens
+- **Chunk size semantics changed**: `OPEN_NOTEBOOK_CHUNK_SIZE` and `OPEN_NOTEBOOK_CHUNK_OVERLAP` are token-based, not character-based
+- **Default chunk size**: The token-based default is 400 — leaves ~20% margin below the 512-token ceiling of BERT-family embedders (e.g. mxbai-embed-large) to absorb tokenizer mismatch (we measure with `o200k_base`, they tokenize with WordPiece), splitter overshoot, and special tokens
 - **Content type detection order**: Extension checked first, then heuristics; high-confidence heuristics (≥0.8) can override PLAIN extensions
 - **Mean pooling normalization**: Each embedding normalized before mean, result normalized after
 - **Priority weights default**: If not specified, ContextConfig uses default weights (source=1, note=0.8, insight=1.2)

@@ -9,8 +9,8 @@ Key functions:
 - chunk_text(): Splits text into chunks using appropriate splitter for content type
 Environment Variables:
-OPEN_NOTEBOOK_CHUNK_SIZE: Maximum chunk size in characters (default: 1200)
-OPEN_NOTEBOOK_CHUNK_OVERLAP: Overlap between chunks in characters (default: 15% of CHUNK_SIZE)
+OPEN_NOTEBOOK_CHUNK_SIZE: Maximum chunk size in tokens (default: 400)
+OPEN_NOTEBOOK_CHUNK_OVERLAP: Overlap between chunks in tokens (default: 15% of CHUNK_SIZE)
 """
 import os
@@ -26,6 +26,8 @@ from langchain_text_splitters import (
 )
 from loguru import logger
+from .token_utils import token_count
 def _get_chunk_size() -> int:
     """Get chunk size from environment variable or use default."""
@@ -44,14 +46,14 @@ def _get_chunk_size() -> int:
 f"OPEN_NOTEBOOK_CHUNK_SIZE ({chunk_size}) is very large. "
 f"This may cause issues with some embedding models."
 )
-logger.info(f"Using custom chunk size: {chunk_size} characters")
+logger.info(f"Using custom chunk size: {chunk_size} tokens")
 return chunk_size
 except ValueError:
 logger.warning(
 f"Invalid OPEN_NOTEBOOK_CHUNK_SIZE value: '{chunk_size_str}'. "
-f"Using default: 1200"
+f"Using default: 400"
 )
-return 1200
+return 400
 def _get_chunk_overlap(chunk_size: int) -> int:
@@ -72,7 +74,7 @@ def _get_chunk_overlap(chunk_size: int) -> int:
 f"Using 15% of chunk size: {int(chunk_size * 0.15)}"
 )
 return int(chunk_size * 0.15)
-logger.info(f"Using custom chunk overlap: {overlap} characters")
+logger.info(f"Using custom chunk overlap: {overlap} tokens")
 return overlap
 except ValueError:
 logger.warning(
@@ -358,14 +360,14 @@ def _get_plain_splitter() -> RecursiveCharacterTextSplitter:
     return RecursiveCharacterTextSplitter(
         chunk_size=CHUNK_SIZE,
         chunk_overlap=CHUNK_OVERLAP,
-        length_function=len,
+        length_function=token_count,
         separators=["\n\n", "\n", ". ", ", ", " ", ""],
     )
 def _apply_secondary_chunking(chunks: List[str]) -> List[str]:
     """
-    Apply secondary chunking to ensure no chunk exceeds CHUNK_SIZE.
+    Apply secondary chunking to ensure no chunk exceeds CHUNK_SIZE tokens.
     Used when primary splitters (HTML/Markdown) produce oversized chunks.
     """
@@ -373,7 +375,7 @@ def _apply_secondary_chunking(chunks: List[str]) -> List[str]:
     secondary_splitter = _get_plain_splitter()
     for chunk in chunks:
-        if len(chunk) > CHUNK_SIZE:
+        if token_count(chunk) > CHUNK_SIZE:
             # Split oversized chunk
             sub_chunks = secondary_splitter.split_text(chunk)
             result.extend(sub_chunks)
@@ -397,13 +399,14 @@ def chunk_text(
         file_path: Optional file path for content type detection
     Returns:
-        List of text chunks, each <= CHUNK_SIZE characters
+        List of text chunks, each approximately <= CHUNK_SIZE tokens
     """
     if not text or not text.strip():
         return []
     # Short text doesn't need chunking
-    if len(text) <= CHUNK_SIZE:
+    text_tokens = token_count(text)
+    if text_tokens <= CHUNK_SIZE:
         return [text]
     # Detect content type if not provided
@@ -441,5 +444,5 @@ def chunk_text(
     # Filter out empty chunks
     chunks = [c.strip() for c in chunks if c and c.strip()]
-    logger.debug(f"Created {len(chunks)} chunks from {len(text)} characters")
+    logger.debug(f"Created {len(chunks)} chunks from {text_tokens} tokens")
     return chunks
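
The effect of swapping `length_function=len` for a token counter can be illustrated with a toy splitter. This is a simplified sketch and not LangChain's `RecursiveCharacterTextSplitter` algorithm: `greedy_split` is a hypothetical line-packing splitter, and `rough_token_count` is a crude stand-in for `token_count()` (one token per CJK character, one per whitespace-separated word), not the `o200k_base` encoding.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in tokenizer: CJK characters and other words count as one token each."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    others = "".join(" " if "\u4e00" <= ch <= "\u9fff" else ch for ch in text)
    return cjk + len(others.split())


def greedy_split(text: str, budget: int, length_function: Callable[[str], int]) -> List[str]:
    """Pack lines into chunks whose measured length stays within budget.

    A single line longer than the budget is emitted as-is (overshoot).
    """
    chunks: List[str] = []
    current = ""
    for line in text.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if current and length_function(candidate) > budget:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


english = "\n".join(["This is a sentence."] * 6)
by_chars = greedy_split(english, 20, len)                  # budget of 20 characters
by_tokens = greedy_split(english, 20, rough_token_count)   # budget of 20 tokens

print(len(by_chars), len(by_tokens))  # 6 2
print(len("這是一段中文內容"), rough_token_count("這是一段中文內容"))  # 8 8
```

The same numeric budget yields very different chunk counts depending on the unit, which is why the default had to change along with the measurement.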

@@ -17,6 +17,7 @@ import numpy as np
 from loguru import logger
 from .chunking import CHUNK_SIZE, ContentType, chunk_text
+from .token_utils import token_count
 EMBEDDING_BATCH_SIZE = 50
 EMBEDDING_MAX_RETRIES = 3
@@ -120,11 +121,28 @@ async def generate_embeddings(
     model_name = getattr(embedding_model, "model_name", "unknown")
     # Log text sizes for debugging
-    text_sizes = [len(t) for t in texts]
-    logger.debug(
-        f"Generating embeddings for {len(texts)} texts "
-        f"(sizes: min={min(text_sizes)}, max={max(text_sizes)}, "
-        f"total={sum(text_sizes)} chars)"
+    metrics: tuple[int, int, int, int] | None = None
+    def _get_size_metrics() -> tuple[int, int, int, int]:
+        nonlocal metrics
+        if metrics is None:
+            token_sizes = [token_count(t) for t in texts]
+            metrics = (
+                min(token_sizes),
+                max(token_sizes),
+                sum(token_sizes),
+                sum(len(t) for t in texts),
+            )
+        return metrics
+    logger.opt(lazy=True).debug(
+        "Generating embeddings for {} texts "
+        "(tokens: min={}, max={}, total={}; chars: total={})",
+        lambda: len(texts),
+        lambda: _get_size_metrics()[0],
+        lambda: _get_size_metrics()[1],
+        lambda: _get_size_metrics()[2],
+        lambda: _get_size_metrics()[3],
     )
     all_embeddings: List[List[float]] = []
@@ -174,10 +192,10 @@ async def generate_embedding(
     """
     Generate a single embedding for text, handling large content via chunking and mean pooling.
-    For short text (<= CHUNK_SIZE):
+    For short text (<= CHUNK_SIZE tokens):
     - Embeds directly and returns the embedding
-    For long text (> CHUNK_SIZE):
+    For long text (> CHUNK_SIZE tokens):
     - Chunks the text using appropriate splitter for content type
     - Embeds all chunks in batches
     - Combines embeddings via mean pooling
@@ -199,16 +217,17 @@ async def generate_embedding(
         raise ValueError("Cannot generate embedding for empty text")
     text = text.strip()
+    text_tokens = token_count(text)
     # Check if chunking is needed
-    if len(text) <= CHUNK_SIZE:
+    if text_tokens <= CHUNK_SIZE:
         # Short text - embed directly
-        logger.debug(f"Embedding short text ({len(text)} chars) directly")
+        logger.debug(f"Embedding short text ({text_tokens} tokens) directly")
         embeddings = await generate_embeddings([text], command_id=command_id)
         return embeddings[0]
     # Long text - chunk and mean pool
-    logger.debug(f"Text exceeds chunk size ({len(text)} chars), chunking...")
+    logger.debug(f"Text exceeds chunk size ({text_tokens} tokens), chunking...")
     chunks = chunk_text(text, content_type=content_type, file_path=file_path)
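
The "defer embedding debug token metrics" fix in this file follows a compute-once, only-if-logged pattern. The diff does this with loguru's `logger.opt(lazy=True)` and lambdas; the sketch below shows the same idea with the standard-library `logging` module so that it runs without third-party packages. `expensive_token_count`, `log_sizes`, and the `calls` counter are hypothetical names used only for illustration.

```python
import logging
from typing import List, Optional, Tuple

calls = {"count": 0}  # tracks how often the expensive counter runs


def expensive_token_count(text: str) -> int:
    """Hypothetical stand-in for token_count()."""
    calls["count"] += 1
    return len(text.split())


def log_sizes(log: logging.Logger, texts: List[str]) -> None:
    cached: Optional[Tuple[int, int, int]] = None

    def metrics() -> Tuple[int, int, int]:
        nonlocal cached
        if cached is None:  # computed on first access, then reused
            sizes = [expensive_token_count(t) for t in texts]
            cached = (min(sizes), max(sizes), sum(sizes))
        return cached

    # Guard so that no tokenization happens unless DEBUG is enabled.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("tokens: min=%d, max=%d, total=%d",
                  metrics()[0], metrics()[1], metrics()[2])


log = logging.getLogger("embedding_sketch")
log.setLevel(logging.INFO)
log_sizes(log, ["a b", "c d e"])
calls_at_info = calls["count"]   # 0: metrics were never computed

log.setLevel(logging.DEBUG)
log_sizes(log, ["a b", "c d e"])
calls_at_debug = calls["count"]  # 2: one count per text, despite three metrics() calls
```

Tokenizing every text is noticeably more expensive than `len()`, so skipping it when debug logging is off avoids adding cost to every embedding call.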

@@ -14,6 +14,32 @@ from open_notebook.utils.chunking import (
     detect_content_type_from_extension,
     detect_content_type_from_heuristics,
 )
+from open_notebook.utils.token_utils import token_count
+def _build_text_with_max_tokens(fragment: str, max_tokens: int) -> str:
+    """Build text that stays within a token budget."""
+    text = ""
+    while True:
+        candidate = text + fragment
+        if token_count(candidate) > max_tokens:
+            return text
+        text = candidate
+def _build_text_exceeding_tokens(fragment: str, threshold_tokens: int) -> str:
+    """Build text that exceeds a token threshold."""
+    text = fragment
+    while token_count(text) <= threshold_tokens:
+        text += fragment
+    return text
+def _assert_chunks_within_token_limit(chunks: list[str]) -> None:
+    """Assert chunks stay within the configured token window."""
+    assert chunks
+    for chunk in chunks:
+        assert token_count(chunk) <= CHUNK_SIZE
 # ============================================================================
 # TEST SUITE 1: Content Type Detection from Extension
@@ -222,20 +248,33 @@ class TestChunkText:
         assert chunks[0] == text
     def test_text_at_chunk_limit(self):
-        """Test text at exactly chunk size limit."""
-        text = "x" * CHUNK_SIZE
+        """Test text within the token chunk size limit."""
+        text = _build_text_with_max_tokens("This is a sentence. ", CHUNK_SIZE)
+        assert token_count(text) <= CHUNK_SIZE
         chunks = chunk_text(text)
         assert len(chunks) == 1
     def test_long_text_is_chunked(self):
-        """Test that long text is chunked."""
-        # Create text longer than chunk size
-        text = "This is a sentence. " * 200 # ~4000 chars
+        """Test that long English text is chunked by token budget."""
+        text = _build_text_exceeding_tokens("This is a sentence. ", CHUNK_SIZE)
         chunks = chunk_text(text)
         assert len(chunks) > 1
-        # Each chunk should be <= CHUNK_SIZE
-        for chunk in chunks:
-            assert len(chunk) <= CHUNK_SIZE + 100 # Allow some flexibility for overlap
+        _assert_chunks_within_token_limit(chunks)
+    def test_cjk_text_is_chunked_by_tokens(self):
+        """Test that long CJK text is chunked using token measurement."""
+        text = _build_text_exceeding_tokens("這是一段中文內容,用來驗證分塊邏輯。", CHUNK_SIZE)
+        chunks = chunk_text(text, content_type=ContentType.PLAIN)
+        assert len(chunks) > 1
+        _assert_chunks_within_token_limit(chunks)
+    def test_mixed_language_text_is_chunked_by_tokens(self):
+        """Test that mixed-language text is chunked using token measurement."""
+        fragment = "This paragraph mixes English and 中文內容 to verify token-based chunking. "
+        text = _build_text_exceeding_tokens(fragment, CHUNK_SIZE)
+        chunks = chunk_text(text, content_type=ContentType.PLAIN)
+        assert len(chunks) > 1
+        _assert_chunks_within_token_limit(chunks)
     def test_explicit_content_type_html(self):
         """Test chunking with explicit HTML content type."""
@@ -269,9 +308,10 @@ Content for section 2.
     def test_explicit_content_type_plain(self):
         """Test chunking with explicit plain content type."""
-        plain_text = "Word " * 500 # ~2500 chars
+        plain_text = _build_text_exceeding_tokens("Word ", CHUNK_SIZE)
         chunks = chunk_text(plain_text, content_type=ContentType.PLAIN)
-        assert len(chunks) >= 1
+        assert len(chunks) > 1
+        _assert_chunks_within_token_limit(chunks)
     def test_file_path_detection(self):
         """Test chunking with file path for content type detection."""
@@ -280,16 +320,14 @@ Content for section 2.
         assert len(chunks) == 1
     def test_secondary_chunking_for_large_sections(self):
-        """Test that large sections from HTML/MD splitters are further chunked."""
-        # Create text that would produce a single large section
-        large_section = "x" * 3000 # Larger than CHUNK_SIZE
+        """Test that large Markdown sections are further chunked by tokens."""
+        large_section = _build_text_exceeding_tokens(
+            "這是一段很長的章節內容,用來測試次級分塊。", CHUNK_SIZE
+        )
         md_text = f"# Title\n\n{large_section}"
         chunks = chunk_text(md_text, content_type=ContentType.MARKDOWN)
-        # Should have multiple chunks due to secondary chunking
-        assert len(chunks) >= 1
-        for chunk in chunks:
-            # Allow some flexibility but chunks should be reasonable size
-            assert len(chunk) <= CHUNK_SIZE + 300
+        assert len(chunks) > 1
+        _assert_chunks_within_token_limit(chunks)
 if __name__ == "__main__":

@@ -6,11 +6,21 @@ Tests embedding generation and mean pooling functionality.
 import pytest
+from open_notebook.utils.chunking import CHUNK_SIZE
 from open_notebook.utils.embedding import (
     generate_embedding,
     generate_embeddings,
     mean_pool_embeddings,
 )
+from open_notebook.utils.token_utils import token_count
+def _build_text_exceeding_tokens(fragment: str, threshold_tokens: int) -> str:
+    """Build text that exceeds a token threshold."""
+    text = fragment
+    while token_count(text) <= threshold_tokens:
+        text += fragment
+    return text
 # ============================================================================
 # TEST SUITE 1: Mean Pooling
@@ -184,8 +194,7 @@ class TestGenerateEmbedding:
         """Test that long text is chunked and mean pooled."""
         from unittest.mock import AsyncMock, MagicMock, patch
-        # Create text longer than chunk size
-        long_text = "This is a sentence. " * 200 # ~4000 chars
+        long_text = _build_text_exceeding_tokens("This is a sentence. ", CHUNK_SIZE)
         mock_model = MagicMock()
         # Return multiple embeddings (one per chunk)