Austin Godber 5e95136aa4
2025-08-13 20:54:15 -07:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

EmbeddingBuddy is a modular Python Dash web application for interactive exploration and visualization of embedding vectors through dimensionality reduction techniques (PCA, t-SNE, UMAP). The app provides a drag-and-drop interface for uploading NDJSON files containing embeddings and visualizes them in 2D/3D plots. The codebase follows a clean, modular architecture that prioritizes testability and maintainability.

Development Commands

Install dependencies:

uv sync

Run the application:

uv run python main.py

The app will be available at http://127.0.0.1:8050

Run tests:

uv sync --extra test
uv run pytest tests/ -v

Development tools:

# Install all dev dependencies
uv sync --extra dev

# Linting and formatting
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Type checking
uv run mypy src/embeddingbuddy/

# Security scanning
uv run bandit -r src/
uv run safety check

Test with sample data: Upload the included sample_data.ndjson and sample_prompts.ndjson files to exercise the application.

Architecture

Project Structure

The application follows a modular architecture with clear separation of concerns:

src/embeddingbuddy/
├── app.py              # Main application entry point and factory
├── main.py             # Application runner
├── config/
│   └── settings.py     # Centralized configuration management
├── data/
│   ├── parser.py       # NDJSON parsing logic
│   └── processor.py    # Data transformation and processing
├── models/
│   ├── schemas.py      # Data models and validation schemas
│   └── reducers.py     # Dimensionality reduction algorithms
├── visualization/
│   ├── plots.py        # Plot creation and factory classes
│   └── colors.py       # Color mapping and management
├── ui/
│   ├── layout.py       # Main application layout
│   ├── components/     # Reusable UI components
│   │   ├── sidebar.py  # Sidebar component
│   │   └── upload.py   # Upload components
│   └── callbacks/      # Organized callback functions
│       ├── data_processing.py  # Data upload/processing callbacks
│       ├── visualization.py    # Plot update callbacks
│       └── interactions.py     # User interaction callbacks
└── utils/              # Utility functions and helpers

Key Components

Data Layer:

  • data/parser.py - NDJSON parsing with error handling
  • data/processor.py - Data transformation and combination logic
  • models/schemas.py - Dataclasses for type safety and validation

Algorithm Layer:

  • models/reducers.py - Modular dimensionality reduction with factory pattern
  • Supports PCA, t-SNE (openTSNE), and UMAP algorithms
  • Abstract base class for easy extension
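
As an illustration of the factory pattern and abstract base class described above, here is a minimal sketch using only the standard library. The names Reducer, TruncateReducer, and create_reducer are hypothetical and may not match models/reducers.py. TruncateReducer is a stand-in that only centers the data and keeps the leading dimensions; it is not PCA, t-SNE, or UMAP.

```python
from abc import ABC, abstractmethod


class Reducer(ABC):
    """Hypothetical base class; the real one in models/reducers.py may differ."""

    def __init__(self, n_components: int = 2) -> None:
        self.n_components = n_components

    @abstractmethod
    def fit_transform(self, vectors: list[list[float]]) -> list[list[float]]:
        """Map each input vector to n_components coordinates."""


class TruncateReducer(Reducer):
    """Placeholder algorithm: center each dimension, keep the leading ones."""

    def fit_transform(self, vectors: list[list[float]]) -> list[list[float]]:
        dims = len(vectors[0])
        means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
        return [[v[d] - means[d] for d in range(self.n_components)] for v in vectors]


# Registry consulted by the factory; a new algorithm is added by registering it.
_REGISTRY: dict[str, type[Reducer]] = {"truncate": TruncateReducer}


def create_reducer(method: str, n_components: int = 2) -> Reducer:
    """Return a reducer instance for the given method name."""
    try:
        cls = _REGISTRY[method.lower()]
    except KeyError:
        raise ValueError(f"Unknown reduction method: {method!r}") from None
    return cls(n_components=n_components)


reducer = create_reducer("truncate", n_components=2)
coords = reducer.fit_transform([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```

With this shape, callers depend only on fit_transform, so adding an algorithm means writing one subclass and registering it.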

Visualization Layer:

  • visualization/plots.py - Plot factory with single and dual plot support
  • visualization/colors.py - Color mapping and grayscale conversion utilities
  • Plotly-based 2D/3D scatter plots with interactive features
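
To make the color mapping and grayscale conversion concrete, here is a small standard-library sketch. The function names and palette are illustrative assumptions, not the actual contents of visualization/colors.py, and the grayscale conversion uses the common Rec. 601 luma weights, which may differ from what the project does.

```python
def build_color_map(values, palette):
    """Assign each distinct value a palette color, cycling if the palette is short."""
    distinct = sorted(set(values))
    return {value: palette[i % len(palette)] for i, value in enumerate(distinct)}


def to_grayscale(hex_color: str) -> str:
    """Convert '#rrggbb' to a gray of similar brightness (Rec. 601 luma weights)."""
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    gray = round(0.299 * r + 0.587 * g + 0.114 * b)
    return f"#{gray:02x}{gray:02x}{gray:02x}"


PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c"]  # illustrative colors only
color_map = build_color_map(["news", "sports", "news"], PALETTE)
gray_map = {category: to_grayscale(color) for category, color in color_map.items()}
```

Sorting the distinct values keeps the category-to-color assignment stable across re-renders.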

UI Layer:

  • ui/layout.py - Main application layout composition
  • ui/components/ - Reusable, testable UI components
  • ui/callbacks/ - Organized callbacks grouped by functionality
  • Bootstrap-styled sidebar with controls and large visualization area

Configuration:

  • config/settings.py - Centralized settings with environment variable support
  • Plot styling, marker configurations, and app-wide constants
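
One way centralized settings with environment variable support could look is sketched below. The Settings class and the EMBEDDINGBUDDY_* variable names are assumptions made for illustration; check config/settings.py for the names actually in use. The default host and port mirror the address given under Development Commands.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names are illustrative."""

    host: str = "127.0.0.1"
    port: int = 8050
    debug: bool = False

    @classmethod
    def from_env(cls, env=None):
        """Build settings from a mapping (os.environ by default), falling back to defaults."""
        env = os.environ if env is None else env
        return cls(
            host=env.get("EMBEDDINGBUDDY_HOST", cls.host),
            port=int(env.get("EMBEDDINGBUDDY_PORT", cls.port)),
            debug=env.get("EMBEDDINGBUDDY_DEBUG", "false").lower() in {"1", "true", "yes"},
        )


defaults = Settings.from_env({})
overridden = Settings.from_env({"EMBEDDINGBUDDY_PORT": "9000", "EMBEDDINGBUDDY_DEBUG": "1"})
```

Accepting the environment as a parameter keeps the settings easy to test without patching os.environ.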

Data Format

The application expects NDJSON files where each line contains:

{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, ...], "text": "Sample text", "category": "news", "subcategory": "politics", "tags": ["election"]}

Required fields: embedding (array), text (string)

Optional fields: id, category, subcategory, tags
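
The format above can be read with the standard library alone. The sketch below is an assumption about how such a parser might be shaped, not a copy of data/parser.py; the Document dataclass and parse_ndjson function are hypothetical names, and how the real parser treats a missing id is not covered here.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """Hypothetical record type mirroring the required and optional fields."""

    embedding: list[float]
    text: str
    id: Optional[str] = None
    category: Optional[str] = None
    subcategory: Optional[str] = None
    tags: list[str] = field(default_factory=list)


def parse_ndjson(content: str) -> list[Document]:
    """Parse NDJSON text, skipping blank lines and rejecting incomplete records."""
    docs = []
    for line_no, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)  # raises json.JSONDecodeError on malformed input
        for name in ("embedding", "text"):
            if name not in record:
                raise ValueError(f"line {line_no}: missing required field {name!r}")
        docs.append(
            Document(
                embedding=[float(x) for x in record["embedding"]],
                text=str(record["text"]),
                id=record.get("id"),
                category=record.get("category"),
                subcategory=record.get("subcategory"),
                tags=list(record.get("tags") or []),
            )
        )
    return docs


sample = (
    '{"id": "doc_001", "embedding": [0.1, -0.3, 0.7], "text": "Sample text", "category": "news"}\n'
    "\n"
    '{"embedding": [1, 2, 3], "text": "No id"}\n'
)
docs = parse_ndjson(sample)
```

Reporting the line number in the error makes it easier to locate a bad record in a large file.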

Callback Architecture

Callbacks are organized by functionality:

Data Processing (ui/callbacks/data_processing.py):

  • File upload handling
  • NDJSON parsing and validation
  • Data storage in dcc.Store components
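
For the upload step, Dash's dcc.Upload component passes file contents to the callback as a base64-encoded data URL string. The helper below is a minimal sketch of decoding that string into text before parsing; decode_upload is a hypothetical name, and the real callback in data_processing.py may structure this differently.

```python
import base64


def decode_upload(contents: str) -> str:
    """Decode a dcc.Upload 'contents' value into UTF-8 text."""
    header, _, payload = contents.partition(",")
    if not header.startswith("data:") or not payload:
        raise ValueError("expected a 'data:<type>;base64,<payload>' string")
    return base64.b64decode(payload).decode("utf-8")


# Simulate what the browser would send for a one-line NDJSON file.
raw = b'{"embedding": [0.1, 0.2], "text": "hi"}\n'
contents = "data:application/octet-stream;base64," + base64.b64encode(raw).decode("ascii")
text = decode_upload(contents)
```

Keeping the decoding in a plain function, separate from the callback, lets it be unit tested without running the app.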

Visualization (ui/callbacks/visualization.py):

  • Dimensionality reduction pipeline
  • Plot generation and updates
  • Method/parameter change handling

Interactions (ui/callbacks/interactions.py):

  • Point click handling and detail display
  • Reset functionality
  • User interaction management
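
For point click handling, a dcc.Graph reports clicks through its clickData property, a dictionary with a "points" list. The sketch below extracts the trace and point indices from that payload; clicked_point is a hypothetical helper, and how interactions.py maps a click back to a document (for example via customdata) is an assumption not confirmed here.

```python
def clicked_point(click_data):
    """Return (curveNumber, pointNumber) for the first clicked point, or None."""
    if not click_data or not click_data.get("points"):
        return None
    point = click_data["points"][0]
    # pointNumber is relative to the trace identified by curveNumber.
    return point.get("curveNumber"), point.get("pointNumber")


example = {"points": [{"curveNumber": 1, "pointNumber": 4, "x": 0.2, "y": -1.3}]}
selection = clicked_point(example)
```

When a plot uses one trace per category, both numbers are needed to identify the point.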

Testing Architecture

The modular design enables comprehensive testing:

Unit Tests:

  • tests/test_data_processing.py - Parser and processor logic
  • tests/test_reducers.py - Dimensionality reduction algorithms
  • tests/test_visualization.py - Plot creation and color mapping

Integration Tests:

  • End-to-end data pipeline testing
  • Component integration verification

Key Testing Benefits:

  • Fast test execution (unit tests run in milliseconds rather than seconds)
  • Isolated component testing
  • Easy mocking and fixture creation
  • High code coverage achievable

Dependencies

The project uses a modern Python stack, with uv for dependency management:

  • Core Framework: Dash + Plotly for web interface and visualization
  • Algorithms: scikit-learn (PCA), openTSNE, umap-learn for dimensionality reduction
  • Data: pandas/numpy for data manipulation
  • UI: dash-bootstrap-components for styling
  • Testing: pytest for test framework
  • Dev Tools: uv for package management; ruff, mypy, bandit, and safety for linting, type checking, and security scanning

Development Guidelines

When adding new features:

  1. Data Models - Add/update schemas in models/schemas.py
  2. Algorithms - Extend models/reducers.py using the abstract base class
  3. UI Components - Create reusable components in ui/components/
  4. Configuration - Add settings to config/settings.py
  5. Tests - Write tests for all new functionality

Code Organization Principles:

  • Single responsibility principle
  • Clear module boundaries
  • Testable, isolated components
  • Configuration over hardcoding
  • Error handling at appropriate layers

Testing Requirements:

  • Unit tests for all core logic
  • Integration tests for data flow
  • Component tests for UI elements
  • Maintain high test coverage