feat: use token-based sizing for embedding chunking (#749)
* feat: make chunk sizing token-based with 512-token default * fix: defer embedding debug token metrics * chore: lower default chunk size to 400 tokens and document rationale The previous 512-token default matched exactly the context window of BERT-family embedders like mxbai-embed-large, leaving no margin for: - tokenizer mismatch between our o200k_base measurement and the embedder's own WordPiece tokenizer - occasional splitter overshoot (RecursiveCharacterTextSplitter can emit chunks slightly above chunk_size when separators are sparse) - special tokens ([CLS], [SEP]) that consume context-window budget 400 tokens keeps ~20% headroom below 512 while still being a large improvement over the old character-based default for most content. Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE via env var. Also adds a CHANGELOG entry for the full PR behavior change. * chore: move chunking changelog entry under 1.8.5 Target release is 1.8.5 — moving the Changed section out of Unreleased. --------- Co-authored-by: Luis Novo <lfnovo@gmail.com>
This commit is contained in:
parent
648f6d7808
commit
6aabacfca6
6 changed files with 130 additions and 56 deletions
|
|
@ -9,6 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|||
|
||||
## [1.8.5] - 2026-04-14
|
||||
|
||||
### Changed
|
||||
- Embedding chunking is now token-based instead of character-based, improving chunk sizing consistency for CJK and mixed-language content (#542, #749)
|
||||
- `OPEN_NOTEBOOK_CHUNK_SIZE` and `OPEN_NOTEBOOK_CHUNK_OVERLAP` semantics changed from characters to tokens; default reduced from 1200 characters to 400 tokens to stay safely below the 512-token ceiling of BERT-family embedders (e.g. mxbai-embed-large) after accounting for tokenizer mismatch and splitter overshoot. Existing stored embeddings are unaffected; only new ingestions use the new chunking.
|
||||
|
||||
### Fixed
|
||||
- Credentials endpoint no longer crashes (500) when encryption key doesn't match stored credentials (#740)
|
||||
- Broken credentials are now shown with a decryption warning and can still be deleted
|
||||
|
|
|
|||
|
|
@ -24,20 +24,20 @@ Each utility is stateless and can be imported independently.
|
|||
|
||||
The chunking behavior can be configured via environment variables:
|
||||
|
||||
- **OPEN_NOTEBOOK_CHUNK_SIZE**: Maximum chunk size in characters (default: 1200)
|
||||
- Minimum: 100 characters
|
||||
- Warnings: Values > 8192 characters or invalid values
|
||||
- Use case: Smaller models (e.g., mxbai-embed-large with limited context window)
|
||||
- **OPEN_NOTEBOOK_CHUNK_SIZE**: Maximum chunk size in tokens (default: 400)
|
||||
- Minimum: 100 tokens
|
||||
- Warnings: Values > 8192 tokens or invalid values
|
||||
- Use case: Conservative baseline that leaves headroom below 512-token embedders (e.g. mxbai-embed-large). Buffer accounts for tokenizer mismatch between our `o200k_base` measurement and the embedder's own tokenizer, plus occasional splitter overshoot and special tokens.
|
||||
|
||||
- **OPEN_NOTEBOOK_CHUNK_OVERLAP**: Overlap between chunks in characters (default: 15% of CHUNK_SIZE)
|
||||
- **OPEN_NOTEBOOK_CHUNK_OVERLAP**: Overlap between chunks in tokens (default: 15% of CHUNK_SIZE)
|
||||
- Must be: >= 0 and < CHUNK_SIZE
|
||||
- Warnings: Invalid values or values >= CHUNK_SIZE
|
||||
- Use case: Control how much context is shared between adjacent chunks
|
||||
|
||||
Example for models with small context windows:
|
||||
Example for embedders with larger context windows (e.g. OpenAI text-embedding-3 family, 8191 tokens):
|
||||
```bash
|
||||
export OPEN_NOTEBOOK_CHUNK_SIZE=512
|
||||
export OPEN_NOTEBOOK_CHUNK_OVERLAP=50
|
||||
export OPEN_NOTEBOOK_CHUNK_SIZE=1500
|
||||
export OPEN_NOTEBOOK_CHUNK_OVERLAP=150
|
||||
```
|
||||
|
||||
Note: Changes require restart of the application.
|
||||
|
|
@ -63,7 +63,7 @@ Note: Changes require restart of the application.
|
|||
|
||||
### chunking.py
|
||||
- **ContentType**: Enum (HTML, MARKDOWN, PLAIN)
|
||||
- **CHUNK_SIZE**: Configurable via `OPEN_NOTEBOOK_CHUNK_SIZE` env var (default: 1200)
|
||||
- **CHUNK_SIZE**: Configurable via `OPEN_NOTEBOOK_CHUNK_SIZE` env var (default: 400)
|
||||
- **CHUNK_OVERLAP**: Configurable via `OPEN_NOTEBOOK_CHUNK_OVERLAP` env var (default: 15% of CHUNK_SIZE)
|
||||
- **detect_content_type_from_extension(file_path)**: Detect type from file extension
|
||||
- **detect_content_type_from_heuristics(text)**: Detect type from content patterns (returns type + confidence)
|
||||
|
|
@ -74,7 +74,7 @@ Note: Changes require restart of the application.
|
|||
- Uses LangChain splitters: HTMLHeaderTextSplitter, MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
|
||||
- Extension-based detection is primary; heuristics can override PLAIN extensions with 0.8+ confidence
|
||||
- Secondary chunking applied when HTML/Markdown splitters produce oversized chunks
|
||||
- Returns list of strings, each ≤ CHUNK_SIZE characters
|
||||
- Returns list of strings, each approximately ≤ CHUNK_SIZE tokens
|
||||
|
||||
### embedding.py
|
||||
- **mean_pool_embeddings(embeddings)**: Combine multiple embeddings via normalized mean pooling
|
||||
|
|
@ -83,7 +83,7 @@ Note: Changes require restart of the application.
|
|||
|
||||
**Key behavior**:
|
||||
- Uses model_manager.get_model("embedding") for embedding model
|
||||
- Short text (≤ CHUNK_SIZE): direct embedding
|
||||
- Short text (≤ CHUNK_SIZE tokens): direct embedding
|
||||
- Long text: chunk → embed each → mean pool results
|
||||
- Mean pooling: normalize each → mean → normalize result (using numpy)
|
||||
- Raises ValueError for empty/whitespace-only text
|
||||
|
|
@ -103,7 +103,7 @@ Note: Changes require restart of the application.
|
|||
- **token_count(text)**: Returns estimated token count for string (via tiktoken)
|
||||
- **token_cost(text, model)**: Calculate cost estimate for text with given model
|
||||
|
||||
**Key behavior**: Uses cl100k_base encoding; may differ slightly from actual model tokenization
|
||||
**Key behavior**: Uses `o200k_base` encoding; may differ slightly from actual model tokenization. If `tiktoken` is unavailable, `token_count()` falls back to a coarse estimate; this refactor keeps that existing contract.
|
||||
|
||||
### version_utils.py
|
||||
- **compare_versions(v1, v2)**: Returns -1 (v1 < v2), 0 (equal), 1 (v1 > v2)
|
||||
|
|
@ -135,8 +135,9 @@ Note: Changes require restart of the application.
|
|||
|
||||
## Important Quirks & Gotchas
|
||||
|
||||
- **Token count estimation**: Uses cl100k_base encoding; may differ 5-10% from actual model tokens
|
||||
- **Chunk size for Ollama**: 1500 chars chosen to fit within Ollama embedding model context limits
|
||||
- **Token count estimation**: Uses `o200k_base` encoding; may differ slightly from actual model tokens
|
||||
- **Chunk size semantics changed**: `OPEN_NOTEBOOK_CHUNK_SIZE` and `OPEN_NOTEBOOK_CHUNK_OVERLAP` are token-based, not character-based
|
||||
- **Default chunk size**: The token-based default is 400 — leaves ~20% margin below the 512-token ceiling of BERT-family embedders (e.g. mxbai-embed-large) to absorb tokenizer mismatch (we measure with `o200k_base`, they tokenize with WordPiece), splitter overshoot, and special tokens
|
||||
- **Content type detection order**: Extension checked first, then heuristics; high-confidence heuristics (≥0.8) can override PLAIN extensions
|
||||
- **Mean pooling normalization**: Each embedding normalized before mean, result normalized after
|
||||
- **Priority weights default**: If not specified, ContextConfig uses default weights (source=1, note=0.8, insight=1.2)
|
||||
|
|
|
|||
|
|
@ -9,8 +9,8 @@ Key functions:
|
|||
- chunk_text(): Splits text into chunks using appropriate splitter for content type
|
||||
|
||||
Environment Variables:
|
||||
OPEN_NOTEBOOK_CHUNK_SIZE: Maximum chunk size in characters (default: 1200)
|
||||
OPEN_NOTEBOOK_CHUNK_OVERLAP: Overlap between chunks in characters (default: 15% of CHUNK_SIZE)
|
||||
OPEN_NOTEBOOK_CHUNK_SIZE: Maximum chunk size in tokens (default: 400)
|
||||
OPEN_NOTEBOOK_CHUNK_OVERLAP: Overlap between chunks in tokens (default: 15% of CHUNK_SIZE)
|
||||
"""
|
||||
|
||||
import os
|
||||
|
|
@ -26,6 +26,8 @@ from langchain_text_splitters import (
|
|||
)
|
||||
from loguru import logger
|
||||
|
||||
from .token_utils import token_count
|
||||
|
||||
|
||||
def _get_chunk_size() -> int:
|
||||
"""Get chunk size from environment variable or use default."""
|
||||
|
|
@ -44,14 +46,14 @@ def _get_chunk_size() -> int:
|
|||
f"OPEN_NOTEBOOK_CHUNK_SIZE ({chunk_size}) is very large. "
|
||||
f"This may cause issues with some embedding models."
|
||||
)
|
||||
logger.info(f"Using custom chunk size: {chunk_size} characters")
|
||||
logger.info(f"Using custom chunk size: {chunk_size} tokens")
|
||||
return chunk_size
|
||||
except ValueError:
|
||||
logger.warning(
|
||||
f"Invalid OPEN_NOTEBOOK_CHUNK_SIZE value: '{chunk_size_str}'. "
|
||||
f"Using default: 1200"
|
||||
f"Using default: 400"
|
||||
)
|
||||
return 1200
|
||||
return 400
|
||||
|
||||
|
||||
def _get_chunk_overlap(chunk_size: int) -> int:
|
||||
|
|
@ -72,7 +74,7 @@ def _get_chunk_overlap(chunk_size: int) -> int:
|
|||
f"Using 15% of chunk size: {int(chunk_size * 0.15)}"
|
||||
)
|
||||
return int(chunk_size * 0.15)
|
||||
logger.info(f"Using custom chunk overlap: {overlap} characters")
|
||||
logger.info(f"Using custom chunk overlap: {overlap} tokens")
|
||||
return overlap
|
||||
except ValueError:
|
||||
logger.warning(
|
||||
|
|
@ -358,14 +360,14 @@ def _get_plain_splitter() -> RecursiveCharacterTextSplitter:
|
|||
return RecursiveCharacterTextSplitter(
|
||||
chunk_size=CHUNK_SIZE,
|
||||
chunk_overlap=CHUNK_OVERLAP,
|
||||
length_function=len,
|
||||
length_function=token_count,
|
||||
separators=["\n\n", "\n", ". ", ", ", " ", ""],
|
||||
)
|
||||
|
||||
|
||||
def _apply_secondary_chunking(chunks: List[str]) -> List[str]:
|
||||
"""
|
||||
Apply secondary chunking to ensure no chunk exceeds CHUNK_SIZE.
|
||||
Apply secondary chunking to ensure no chunk exceeds CHUNK_SIZE tokens.
|
||||
|
||||
Used when primary splitters (HTML/Markdown) produce oversized chunks.
|
||||
"""
|
||||
|
|
@ -373,7 +375,7 @@ def _apply_secondary_chunking(chunks: List[str]) -> List[str]:
|
|||
secondary_splitter = _get_plain_splitter()
|
||||
|
||||
for chunk in chunks:
|
||||
if len(chunk) > CHUNK_SIZE:
|
||||
if token_count(chunk) > CHUNK_SIZE:
|
||||
# Split oversized chunk
|
||||
sub_chunks = secondary_splitter.split_text(chunk)
|
||||
result.extend(sub_chunks)
|
||||
|
|
@ -397,13 +399,14 @@ def chunk_text(
|
|||
file_path: Optional file path for content type detection
|
||||
|
||||
Returns:
|
||||
List of text chunks, each <= CHUNK_SIZE characters
|
||||
List of text chunks, each approximately <= CHUNK_SIZE tokens
|
||||
"""
|
||||
if not text or not text.strip():
|
||||
return []
|
||||
|
||||
# Short text doesn't need chunking
|
||||
if len(text) <= CHUNK_SIZE:
|
||||
text_tokens = token_count(text)
|
||||
if text_tokens <= CHUNK_SIZE:
|
||||
return [text]
|
||||
|
||||
# Detect content type if not provided
|
||||
|
|
@ -441,5 +444,5 @@ def chunk_text(
|
|||
# Filter out empty chunks
|
||||
chunks = [c.strip() for c in chunks if c and c.strip()]
|
||||
|
||||
logger.debug(f"Created {len(chunks)} chunks from {len(text)} characters")
|
||||
logger.debug(f"Created {len(chunks)} chunks from {text_tokens} tokens")
|
||||
return chunks
|
||||
|
|
|
|||
|
|
@ -17,6 +17,7 @@ import numpy as np
|
|||
from loguru import logger
|
||||
|
||||
from .chunking import CHUNK_SIZE, ContentType, chunk_text
|
||||
from .token_utils import token_count
|
||||
|
||||
EMBEDDING_BATCH_SIZE = 50
|
||||
EMBEDDING_MAX_RETRIES = 3
|
||||
|
|
@ -120,11 +121,28 @@ async def generate_embeddings(
|
|||
model_name = getattr(embedding_model, "model_name", "unknown")
|
||||
|
||||
# Log text sizes for debugging
|
||||
text_sizes = [len(t) for t in texts]
|
||||
logger.debug(
|
||||
f"Generating embeddings for {len(texts)} texts "
|
||||
f"(sizes: min={min(text_sizes)}, max={max(text_sizes)}, "
|
||||
f"total={sum(text_sizes)} chars)"
|
||||
metrics: tuple[int, int, int, int] | None = None
|
||||
|
||||
def _get_size_metrics() -> tuple[int, int, int, int]:
|
||||
nonlocal metrics
|
||||
if metrics is None:
|
||||
token_sizes = [token_count(t) for t in texts]
|
||||
metrics = (
|
||||
min(token_sizes),
|
||||
max(token_sizes),
|
||||
sum(token_sizes),
|
||||
sum(len(t) for t in texts),
|
||||
)
|
||||
return metrics
|
||||
|
||||
logger.opt(lazy=True).debug(
|
||||
"Generating embeddings for {} texts "
|
||||
"(tokens: min={}, max={}, total={}; chars: total={})",
|
||||
lambda: len(texts),
|
||||
lambda: _get_size_metrics()[0],
|
||||
lambda: _get_size_metrics()[1],
|
||||
lambda: _get_size_metrics()[2],
|
||||
lambda: _get_size_metrics()[3],
|
||||
)
|
||||
|
||||
all_embeddings: List[List[float]] = []
|
||||
|
|
@ -174,10 +192,10 @@ async def generate_embedding(
|
|||
"""
|
||||
Generate a single embedding for text, handling large content via chunking and mean pooling.
|
||||
|
||||
For short text (<= CHUNK_SIZE):
|
||||
For short text (<= CHUNK_SIZE tokens):
|
||||
- Embeds directly and returns the embedding
|
||||
|
||||
For long text (> CHUNK_SIZE):
|
||||
For long text (> CHUNK_SIZE tokens):
|
||||
- Chunks the text using appropriate splitter for content type
|
||||
- Embeds all chunks in batches
|
||||
- Combines embeddings via mean pooling
|
||||
|
|
@ -199,16 +217,17 @@ async def generate_embedding(
|
|||
raise ValueError("Cannot generate embedding for empty text")
|
||||
|
||||
text = text.strip()
|
||||
text_tokens = token_count(text)
|
||||
|
||||
# Check if chunking is needed
|
||||
if len(text) <= CHUNK_SIZE:
|
||||
if text_tokens <= CHUNK_SIZE:
|
||||
# Short text - embed directly
|
||||
logger.debug(f"Embedding short text ({len(text)} chars) directly")
|
||||
logger.debug(f"Embedding short text ({text_tokens} tokens) directly")
|
||||
embeddings = await generate_embeddings([text], command_id=command_id)
|
||||
return embeddings[0]
|
||||
|
||||
# Long text - chunk and mean pool
|
||||
logger.debug(f"Text exceeds chunk size ({len(text)} chars), chunking...")
|
||||
logger.debug(f"Text exceeds chunk size ({text_tokens} tokens), chunking...")
|
||||
|
||||
chunks = chunk_text(text, content_type=content_type, file_path=file_path)
|
||||
|
||||
|
|
|
|||
|
|
@ -14,6 +14,32 @@ from open_notebook.utils.chunking import (
|
|||
detect_content_type_from_extension,
|
||||
detect_content_type_from_heuristics,
|
||||
)
|
||||
from open_notebook.utils.token_utils import token_count
|
||||
|
||||
|
||||
def _build_text_with_max_tokens(fragment: str, max_tokens: int) -> str:
|
||||
"""Build text that stays within a token budget."""
|
||||
text = ""
|
||||
while True:
|
||||
candidate = text + fragment
|
||||
if token_count(candidate) > max_tokens:
|
||||
return text
|
||||
text = candidate
|
||||
|
||||
|
||||
def _build_text_exceeding_tokens(fragment: str, threshold_tokens: int) -> str:
|
||||
"""Build text that exceeds a token threshold."""
|
||||
text = fragment
|
||||
while token_count(text) <= threshold_tokens:
|
||||
text += fragment
|
||||
return text
|
||||
|
||||
|
||||
def _assert_chunks_within_token_limit(chunks: list[str]) -> None:
|
||||
"""Assert chunks stay within the configured token window."""
|
||||
assert chunks
|
||||
for chunk in chunks:
|
||||
assert token_count(chunk) <= CHUNK_SIZE
|
||||
|
||||
# ============================================================================
|
||||
# TEST SUITE 1: Content Type Detection from Extension
|
||||
|
|
@ -222,20 +248,33 @@ class TestChunkText:
|
|||
assert chunks[0] == text
|
||||
|
||||
def test_text_at_chunk_limit(self):
|
||||
"""Test text at exactly chunk size limit."""
|
||||
text = "x" * CHUNK_SIZE
|
||||
"""Test text within the token chunk size limit."""
|
||||
text = _build_text_with_max_tokens("This is a sentence. ", CHUNK_SIZE)
|
||||
assert token_count(text) <= CHUNK_SIZE
|
||||
chunks = chunk_text(text)
|
||||
assert len(chunks) == 1
|
||||
|
||||
def test_long_text_is_chunked(self):
|
||||
"""Test that long text is chunked."""
|
||||
# Create text longer than chunk size
|
||||
text = "This is a sentence. " * 200 # ~4000 chars
|
||||
"""Test that long English text is chunked by token budget."""
|
||||
text = _build_text_exceeding_tokens("This is a sentence. ", CHUNK_SIZE)
|
||||
chunks = chunk_text(text)
|
||||
assert len(chunks) > 1
|
||||
# Each chunk should be <= CHUNK_SIZE
|
||||
for chunk in chunks:
|
||||
assert len(chunk) <= CHUNK_SIZE + 100 # Allow some flexibility for overlap
|
||||
_assert_chunks_within_token_limit(chunks)
|
||||
|
||||
def test_cjk_text_is_chunked_by_tokens(self):
|
||||
"""Test that long CJK text is chunked using token measurement."""
|
||||
text = _build_text_exceeding_tokens("這是一段中文內容,用來驗證分塊邏輯。", CHUNK_SIZE)
|
||||
chunks = chunk_text(text, content_type=ContentType.PLAIN)
|
||||
assert len(chunks) > 1
|
||||
_assert_chunks_within_token_limit(chunks)
|
||||
|
||||
def test_mixed_language_text_is_chunked_by_tokens(self):
|
||||
"""Test that mixed-language text is chunked using token measurement."""
|
||||
fragment = "This paragraph mixes English and 中文內容 to verify token-based chunking. "
|
||||
text = _build_text_exceeding_tokens(fragment, CHUNK_SIZE)
|
||||
chunks = chunk_text(text, content_type=ContentType.PLAIN)
|
||||
assert len(chunks) > 1
|
||||
_assert_chunks_within_token_limit(chunks)
|
||||
|
||||
def test_explicit_content_type_html(self):
|
||||
"""Test chunking with explicit HTML content type."""
|
||||
|
|
@ -269,9 +308,10 @@ Content for section 2.
|
|||
|
||||
def test_explicit_content_type_plain(self):
|
||||
"""Test chunking with explicit plain content type."""
|
||||
plain_text = "Word " * 500 # ~2500 chars
|
||||
plain_text = _build_text_exceeding_tokens("Word ", CHUNK_SIZE)
|
||||
chunks = chunk_text(plain_text, content_type=ContentType.PLAIN)
|
||||
assert len(chunks) >= 1
|
||||
assert len(chunks) > 1
|
||||
_assert_chunks_within_token_limit(chunks)
|
||||
|
||||
def test_file_path_detection(self):
|
||||
"""Test chunking with file path for content type detection."""
|
||||
|
|
@ -280,16 +320,14 @@ Content for section 2.
|
|||
assert len(chunks) == 1
|
||||
|
||||
def test_secondary_chunking_for_large_sections(self):
|
||||
"""Test that large sections from HTML/MD splitters are further chunked."""
|
||||
# Create text that would produce a single large section
|
||||
large_section = "x" * 3000 # Larger than CHUNK_SIZE
|
||||
"""Test that large Markdown sections are further chunked by tokens."""
|
||||
large_section = _build_text_exceeding_tokens(
|
||||
"這是一段很長的章節內容,用來測試次級分塊。", CHUNK_SIZE
|
||||
)
|
||||
md_text = f"# Title\n\n{large_section}"
|
||||
chunks = chunk_text(md_text, content_type=ContentType.MARKDOWN)
|
||||
# Should have multiple chunks due to secondary chunking
|
||||
assert len(chunks) >= 1
|
||||
for chunk in chunks:
|
||||
# Allow some flexibility but chunks should be reasonable size
|
||||
assert len(chunk) <= CHUNK_SIZE + 300
|
||||
assert len(chunks) > 1
|
||||
_assert_chunks_within_token_limit(chunks)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
|
|
|||
|
|
@ -6,11 +6,21 @@ Tests embedding generation and mean pooling functionality.
|
|||
|
||||
import pytest
|
||||
|
||||
from open_notebook.utils.chunking import CHUNK_SIZE
|
||||
from open_notebook.utils.embedding import (
|
||||
generate_embedding,
|
||||
generate_embeddings,
|
||||
mean_pool_embeddings,
|
||||
)
|
||||
from open_notebook.utils.token_utils import token_count
|
||||
|
||||
|
||||
def _build_text_exceeding_tokens(fragment: str, threshold_tokens: int) -> str:
|
||||
"""Build text that exceeds a token threshold."""
|
||||
text = fragment
|
||||
while token_count(text) <= threshold_tokens:
|
||||
text += fragment
|
||||
return text
|
||||
|
||||
# ============================================================================
|
||||
# TEST SUITE 1: Mean Pooling
|
||||
|
|
@ -184,8 +194,7 @@ class TestGenerateEmbedding:
|
|||
"""Test that long text is chunked and mean pooled."""
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
|
||||
# Create text longer than chunk size
|
||||
long_text = "This is a sentence. " * 200 # ~4000 chars
|
||||
long_text = _build_text_exceeding_tokens("This is a sentence. ", CHUNK_SIZE)
|
||||
|
||||
mock_model = MagicMock()
|
||||
# Return multiple embeddings (one per chunk)
|
||||
|
|
|
|||
Loading…
Reference in a new issue