__init__.py	feat: content-type aware chunking and unified embedding (#444)	2026-01-21 23:49:08 -03:00
chunking.py	feat: content-type aware chunking and unified embedding (#444)	2026-01-21 23:49:08 -03:00
CLAUDE.md	feat: content-type aware chunking and unified embedding (#444)	2026-01-21 23:49:08 -03:00
context_builder.py	feat: content-type aware chunking and unified embedding (#444)	2026-01-21 23:49:08 -03:00
embedding.py	feat: content-type aware chunking and unified embedding (#444)	2026-01-21 23:49:08 -03:00
README.md	Version 1 (#160)	2025-10-18 12:46:22 -03:00
text_utils.py	feat: content-type aware chunking and unified embedding (#444)	2026-01-21 23:49:08 -03:00
token_utils.py	Feat/localization tests docker (#371)	2026-01-15 13:51:05 -03:00
version_utils.py	Feat/localization tests docker (#371)	2026-01-15 13:51:05 -03:00

README.md

ContextBuilder

ContextBuilder is a generic class in the Open Notebook project that builds context from sources, notebooks, insights, and notes. It accepts arbitrary keyword parameters so that new options can be added later.

Features

Flexible Parameters: Accepts any parameters via **kwargs for future extensibility
Priority-based Management: Automatic prioritization and sorting of context items
Token Counting: Built-in token counting and truncation to fit limits
Deduplication: Automatic removal of duplicate items based on ID
Type-based Grouping: Separates sources, notes, and insights in output
Async Support: Fully async for database operations
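
The deduplication, prioritization, and truncation features listed above can be pictured with a small standalone sketch. It does not use the real ContextBuilder: `Item`, `dedupe`, `prioritize`, `truncate_to_fit`, and the whitespace-based `count_tokens` are hypothetical stand-ins, and the actual implementation may differ (for example, it probably relies on a real tokenizer rather than word counts).

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a context item."""
    id: str
    type: str
    text: str
    priority: int = 0


def count_tokens(text: str) -> int:
    # Assumption: a crude whitespace count; a real tokenizer would differ.
    return len(text.split())


def dedupe(items: list[Item]) -> list[Item]:
    # Keep the first occurrence of each ID, preserving input order.
    seen: set[str] = set()
    unique = []
    for item in items:
        if item.id not in seen:
            seen.add(item.id)
            unique.append(item)
    return unique


def prioritize(items: list[Item]) -> list[Item]:
    # Highest priority first; sorted() is stable, so ties keep input order.
    return sorted(items, key=lambda item: item.priority, reverse=True)


def truncate_to_fit(items: list[Item], max_tokens: int) -> list[Item]:
    # Walk items in order and keep each one that still fits in the budget.
    kept, used = [], 0
    for item in items:
        cost = count_tokens(item.text)
        if used + cost <= max_tokens:
            kept.append(item)
            used += cost
    return kept


items = [
    Item("note:1", "note", "a short note", priority=80),
    Item("source:1", "source", "a much longer source document body", priority=120),
    Item("note:1", "note", "a short note", priority=80),  # duplicate ID
]
selected = truncate_to_fit(prioritize(dedupe(items)), max_tokens=7)
print([item.id for item in selected])
```

Running the steps in this order (deduplicate, sort, then truncate) means the token budget is spent on the highest-priority unique items first.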

Basic Usage

from open_notebook.utils.context_builder import ContextBuilder

# Simple notebook context
builder = ContextBuilder(notebook_id="notebook:123")
context = await builder.build()

# Single source with insights
builder = ContextBuilder(
    source_id="source:456",
    include_insights=True,
    max_tokens=2000
)
context = await builder.build()

Convenience Functions

from open_notebook.utils.context_builder import (
    build_notebook_context,
    build_source_context,
    build_mixed_context
)

# Build notebook context
context = await build_notebook_context(
    notebook_id="notebook:123",
    max_tokens=5000
)

# Build single source context
context = await build_source_context(
    source_id="source:456",
    include_insights=True
)

# Build mixed context
context = await build_mixed_context(
    source_ids=["source:1", "source:2"],
    note_ids=["note:1", "note:2"],
    max_tokens=3000
)

Advanced Configuration

from open_notebook.utils.context_builder import ContextBuilder, ContextConfig

# Custom configuration
config = ContextConfig(
    sources={
        "source:doc1": "insights",
        "source:doc2": "full content", 
        "source:doc3": "not in"  # Exclude
    },
    notes={
        "note:summary": "full content",
        "note:draft": "not in"  # Exclude
    },
    include_insights=True,
    max_tokens=3000,
    priority_weights={
        "source": 120,  # Higher priority
        "note": 80,     # Medium priority  
        "insight": 100  # High priority
    }
)

builder = ContextBuilder(
    notebook_id="notebook:project",
    context_config=config
)
context = await builder.build()
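
One plausible reading of the per-item status strings in the configuration above is sketched below. This is an assumption, not the library's code: `select_items` is a hypothetical helper, and it treats "not in" as an exclusion while any other status names the level of detail to include.

```python
def select_items(config_map: dict[str, str]) -> dict[str, list[str]]:
    """Group item IDs by requested inclusion level, dropping excluded ones.

    Assumption: "not in" means exclude; any other status names the level of
    detail to include. The real ContextBuilder may interpret these differently.
    """
    selected: dict[str, list[str]] = {}
    for item_id, status in config_map.items():
        if status == "not in":
            continue  # explicitly excluded
        selected.setdefault(status, []).append(item_id)
    return selected


sources = {
    "source:doc1": "insights",
    "source:doc2": "full content",
    "source:doc3": "not in",
}
print(select_items(sources))
```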

Programmatic Item Management

from open_notebook.utils.context_builder import ContextBuilder, ContextItem

builder = ContextBuilder()

# Add custom items
item = ContextItem(
    id="source:important",
    type="source",
    content={"title": "Key Document", "summary": "..."},
    priority=150  # Very high priority
)
builder.add_item(item)

# Apply management operations
builder.remove_duplicates()
builder.prioritize()
builder.truncate_to_fit(1000)

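# _format_response() is private by naming convention; it is called directly
# here because the items were added by hand rather than loaded by build().
context = builder._format_response()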

Flexible Parameters

The ContextBuilder accepts any parameters via **kwargs, making it extensible for future features:

builder = ContextBuilder(
    notebook_id="notebook:123",
    include_insights=True,
    max_tokens=2000,
    
    # Custom parameters for future extensions
    user_id="user:456",
    custom_filter="advanced",
    experimental_feature=True
)

# Access custom parameters
user_id = builder.params.get('user_id')
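
The `**kwargs` pattern described above can be illustrated with a minimal standalone class. `ParamHolder` is hypothetical and exists only for this sketch; in particular, whether the real ContextBuilder also stores its named parameters in `params` is not shown here.

```python
from typing import Any, Optional


class ParamHolder:
    """Hypothetical illustration of capturing arbitrary keyword arguments."""

    def __init__(
        self,
        notebook_id: Optional[str] = None,
        max_tokens: Optional[int] = None,
        **kwargs: Any,
    ) -> None:
        self.notebook_id = notebook_id
        self.max_tokens = max_tokens
        # Everything not named explicitly is kept for later lookup.
        self.params: dict[str, Any] = dict(kwargs)


holder = ParamHolder(notebook_id="notebook:123", max_tokens=2000, user_id="user:456")
print(holder.params.get("user_id"))         # user:456
print(holder.params.get("missing", "n/a"))  # n/a
```

Reading custom values with `dict.get` and a default keeps callers working when a parameter was never supplied.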

Output Format

The ContextBuilder returns a structured response:

{
    "sources": [...],           # List of source contexts
    "notes": [...],             # List of note contexts
    "insights": [...],          # List of insight contexts
    "total_tokens": 1234,       # Total token count
    "total_items": 10,          # Total number of items
    "notebook_id": "notebook:123",  # If provided
    "metadata": {
        "source_count": 5,
        "note_count": 3,
        "insight_count": 2,
        "config": {
            "include_insights": True,
            "include_notes": True,
            "max_tokens": 2000
        }
    }
}
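
As a rough illustration of how a response with this shape could be assembled, the sketch below groups items by type and computes the totals. It is not the library's `_format_response()`: `format_response` and the per-item `tokens` field are hypothetical, and the `config` entry under `metadata` is omitted for brevity.

```python
from typing import Any, Optional


def format_response(
    items: list[dict[str, Any]], notebook_id: Optional[str] = None
) -> dict[str, Any]:
    # Map each item type to the plural key used in the response.
    keys = {"source": "sources", "note": "notes", "insight": "insights"}
    response: dict[str, Any] = {"sources": [], "notes": [], "insights": []}
    for item in items:
        response[keys[item["type"]]].append(item["content"])
    response["total_tokens"] = sum(item["tokens"] for item in items)
    response["total_items"] = len(items)
    if notebook_id is not None:
        response["notebook_id"] = notebook_id  # only present when provided
    response["metadata"] = {
        "source_count": len(response["sources"]),
        "note_count": len(response["notes"]),
        "insight_count": len(response["insights"]),
    }
    return response


items = [
    {"type": "source", "content": {"title": "Doc"}, "tokens": 40},
    {"type": "note", "content": {"title": "Memo"}, "tokens": 10},
]
result = format_response(items, notebook_id="notebook:123")
print(result["total_tokens"], result["metadata"])
```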

Architecture

The ContextBuilder follows these design principles:

Separation of Concerns: Context building, item management, and formatting are separate
Extensibility: Uses **kwargs and flexible configuration for future features
Performance: Token-aware truncation and efficient deduplication
Type Safety: Proper type hints and data classes for structure
Error Handling: Graceful handling of missing items and database errors

Integration

The ContextBuilder builds on the existing Open Notebook architecture:

Uses existing domain models (Source, Notebook, Note)
Leverages the repository pattern for database access
Follows the same async patterns as other services
Integrates with the token counting utilities

Error Handling

The ContextBuilder handles errors gracefully:

Missing notebooks/sources/notes are logged but don't stop execution
Database errors are wrapped in DatabaseOperationError
Invalid parameters raise InvalidInputError
All errors include detailed context information
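
The behavior described above (skip and log missing items, wrap database failures, reject invalid input) might look roughly like the following. Everything here is an assumption made for illustration: the two exception classes are local stand-ins for the project's own `DatabaseOperationError` and `InvalidInputError`, and `load_items` and its `fetch` callback are hypothetical.

```python
import logging

logger = logging.getLogger("context_sketch")


class DatabaseOperationError(Exception):
    """Local stand-in; the project defines its own exception class."""


class InvalidInputError(Exception):
    """Local stand-in; the project defines its own exception class."""


def load_items(ids, fetch):
    """Fetch each ID, skipping missing records and wrapping backend errors."""
    if not isinstance(ids, (list, tuple)):
        raise InvalidInputError(
            f"ids must be a list or tuple, got {type(ids).__name__}"
        )
    found = []
    for item_id in ids:
        try:
            record = fetch(item_id)
        except ConnectionError as exc:
            # Wrap the low-level failure and keep the original as the cause.
            raise DatabaseOperationError(f"failed to load {item_id}") from exc
        if record is None:
            # Missing items are logged but do not stop execution.
            logger.warning("Skipping missing item %s", item_id)
            continue
        found.append(record)
    return found


store = {"source:1": {"title": "Doc"}}
print(load_items(["source:1", "source:404"], store.get))
```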