Files
EmbeddingBuddy/CLAUDE.md
Austin Godber dfcfe4fd7c
All checks were successful
Test Suite / lint (push) Successful in 30s
Security Scan / security (push) Successful in 36s
Security Scan / dependency-check (push) Successful in 29s
Test Suite / test (3.11) (push) Successful in 1m37s
Test Suite / build (push) Successful in 36s
update release process and README
2025-10-01 07:38:56 -07:00

8.1 KiB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

EmbeddingBuddy is a modular Python Dash web application for interactive exploration and visualization of embedding vectors through dimensionality reduction techniques (PCA, t-SNE, UMAP). The app provides a drag-and-drop interface for uploading NDJSON files containing embeddings and visualizes them in 2D/3D plots. The codebase follows a clean, modular architecture that prioritizes testability and maintainability.

Development Commands

Install dependencies:

uv sync

Run the application:

Development mode (with auto-reload):

uv run run_dev.py

Production mode (with Gunicorn WSGI server):

# First install production dependencies
uv sync --extra prod

# Then run in production mode
uv run run_prod.py

Legacy mode (basic Dash server):

uv run main.py

The app will be available at http://127.0.0.1:8050

Run tests:

uv sync --extra test
uv run pytest tests/ -v

Development tools:

# Install all dev dependencies
uv sync --extra dev

# Linting and formatting
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Type checking
uv run mypy src/embeddingbuddy/

# Security scanning
uv run bandit -r src/
uv run safety check

Test with sample data: Use the included sample_data.ndjson and sample_prompts.ndjson files for testing the application functionality.

Architecture

Project Structure

The application follows a modular architecture with clear separation of concerns:

src/embeddingbuddy/
├── app.py              # Main application entry point and factory
├── main.py             # Application runner
├── config/
│   └── settings.py     # Centralized configuration management
├── data/
│   ├── parser.py       # NDJSON parsing logic
│   └── processor.py    # Data transformation and processing
├── models/
│   ├── schemas.py      # Data models and validation schemas
│   └── reducers.py     # Dimensionality reduction algorithms
├── visualization/
│   ├── plots.py        # Plot creation and factory classes
│   └── colors.py       # Color mapping and management
├── ui/
│   ├── layout.py       # Main application layout
│   ├── components/     # Reusable UI components
│   │   ├── sidebar.py  # Sidebar component
│   │   └── upload.py   # Upload components
│   └── callbacks/      # Organized callback functions
│       ├── data_processing.py  # Data upload/processing callbacks
│       ├── visualization.py    # Plot update callbacks
│       └── interactions.py     # User interaction callbacks
└── utils/              # Utility functions and helpers

Key Components

Data Layer:

  • data/parser.py - NDJSON parsing with error handling
  • data/processor.py - Data transformation and combination logic
  • models/schemas.py - Dataclasses for type safety and validation

Algorithm Layer:

  • models/reducers.py - Modular dimensionality reduction with factory pattern
  • Supports PCA, t-SNE (openTSNE), and UMAP algorithms
  • Abstract base class for easy extension

Visualization Layer:

  • visualization/plots.py - Plot factory with single and dual plot support
  • visualization/colors.py - Color mapping and grayscale conversion utilities
  • Plotly-based 2D/3D scatter plots with interactive features

UI Layer:

  • ui/layout.py - Main application layout composition
  • ui/components/ - Reusable, testable UI components
  • ui/callbacks/ - Organized callbacks grouped by functionality
  • Bootstrap-styled sidebar with controls and large visualization area

Configuration:

  • config/settings.py - Centralized settings with environment variable support
  • Plot styling, marker configurations, and app-wide constants

Data Format

The application expects NDJSON files where each line contains:

{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, ...], "text": "Sample text", "category": "news", "subcategory": "politics", "tags": ["election"]}

Required fields: embedding (array), text (string) Optional fields: id, category, subcategory, tags

Callback Architecture

The refactored callback system is organized by functionality:

Data Processing (ui/callbacks/data_processing.py):

  • File upload handling
  • NDJSON parsing and validation
  • Data storage in dcc.Store components

Visualization (ui/callbacks/visualization.py):

  • Dimensionality reduction pipeline
  • Plot generation and updates
  • Method/parameter change handling

Interactions (ui/callbacks/interactions.py):

  • Point click handling and detail display
  • Reset functionality
  • User interaction management

Testing Architecture

The modular design enables comprehensive testing:

Unit Tests:

  • tests/test_data_processing.py - Parser and processor logic
  • tests/test_reducers.py - Dimensionality reduction algorithms
  • tests/test_visualization.py - Plot creation and color mapping

Integration Tests:

  • End-to-end data pipeline testing
  • Component integration verification

Key Testing Benefits:

  • Fast test execution (milliseconds vs seconds)
  • Isolated component testing
  • Easy mocking and fixture creation
  • High code coverage achievable

Dependencies

Uses modern Python stack with uv for dependency management:

  • Core Framework: Dash + Plotly for web interface and visualization
  • Algorithms: scikit-learn (PCA), openTSNE, umap-learn for dimensionality reduction
  • Data: pandas/numpy for data manipulation
  • UI: dash-bootstrap-components for styling
  • Testing: pytest for test framework
  • Dev Tools: uv for package management

CI/CD and Release Management

Repository Setup

This project uses a dual-repository workflow:

  • Primary repository: Gitea instance at git.hawt.cloud (read-write)
  • Mirror repository: GitHub (read-only mirror)

Workflow Organization

Gitea Workflows (.gitea/workflows/):

  • bump-and-release.yml - Manual version bumping workflow
    • Runs bump_version.py to update version in pyproject.toml
    • Commits changes and creates git tag
    • Pushes to Gitea (main branch + tag)
    • Triggered manually via workflow_dispatch with choice of patch/minor/major bump
  • release.yml - Automated release creation
    • Triggered when version tags are pushed
    • Runs tests, builds packages
    • Creates Gitea release with artifacts
  • test.yml - Test suite execution
  • security.yml - Security scanning

GitHub Workflows (.github/workflows/):

  • docker-release.yml - Builds and publishes Docker images
  • pypi-release.yml - Publishes packages to PyPI
  • These workflows are read-only (no git commits/pushes) and create artifacts only

Release Process

  1. Run manual bump workflow on Gitea: Actions → Bump Version and Release
  2. Select version bump type (patch/minor/major)
  3. Workflow commits version change and pushes tag to Gitea
  4. Tag push triggers release.yml on Gitea (creates release)
  5. GitHub mirror receives tag and triggers artifact builds (Docker, PyPI)

Version Management

Use bump_version.py for version updates:

python bump_version.py patch    # 0.3.0 -> 0.3.1
python bump_version.py minor    # 0.3.0 -> 0.4.0
python bump_version.py major    # 0.3.0 -> 1.0.0

Development Guidelines

When adding new features:

  1. Data Models - Add/update schemas in models/schemas.py
  2. Algorithms - Extend models/reducers.py using the abstract base class
  3. UI Components - Create reusable components in ui/components/
  4. Configuration - Add settings to config/settings.py
  5. Tests - Write tests for all new functionality

Code Organization Principles:

  • Single responsibility principle
  • Clear module boundaries
  • Testable, isolated components
  • Configuration over hardcoding
  • Error handling at appropriate layers

Testing Requirements:

  • Unit tests for all core logic
  • Integration tests for data flow
  • Component tests for UI elements
  • Maintain high test coverage