godber/EmbeddingBuddy

Austin Godber dfcfe4fd7c

update release process and README

2025-10-01 07:38:56 -07:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

EmbeddingBuddy is a modular Python Dash web application for interactive exploration and visualization of embedding vectors through dimensionality reduction techniques (PCA, t-SNE, UMAP). The app provides a drag-and-drop interface for uploading NDJSON files containing embeddings and visualizes them in 2D/3D plots. The codebase follows a clean, modular architecture that prioritizes testability and maintainability.

Development Commands

Install dependencies:

uv sync

Run the application:

Development mode (with auto-reload):

uv run run_dev.py

Production mode (with Gunicorn WSGI server):

# First install production dependencies
uv sync --extra prod

# Then run in production mode
uv run run_prod.py

Legacy mode (basic Dash server):

uv run main.py

The app will be available at http://127.0.0.1:8050

Run tests:

uv sync --extra test
uv run pytest tests/ -v

Development tools:

# Install all dev dependencies
uv sync --extra dev

# Linting and formatting
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Type checking
uv run mypy src/embeddingbuddy/

# Security scanning
uv run bandit -r src/
uv run safety check

Test with sample data: Use the included sample_data.ndjson and sample_prompts.ndjson files for testing the application functionality.

Architecture

Project Structure

The application follows a modular architecture with clear separation of concerns:

src/embeddingbuddy/
├── app.py              # Main application entry point and factory
├── main.py             # Application runner
├── config/
│   └── settings.py     # Centralized configuration management
├── data/
│   ├── parser.py       # NDJSON parsing logic
│   └── processor.py    # Data transformation and processing
├── models/
│   ├── schemas.py      # Data models and validation schemas
│   └── reducers.py     # Dimensionality reduction algorithms
├── visualization/
│   ├── plots.py        # Plot creation and factory classes
│   └── colors.py       # Color mapping and management
├── ui/
│   ├── layout.py       # Main application layout
│   ├── components/     # Reusable UI components
│   │   ├── sidebar.py  # Sidebar component
│   │   └── upload.py   # Upload components
│   └── callbacks/      # Organized callback functions
│       ├── data_processing.py  # Data upload/processing callbacks
│       ├── visualization.py    # Plot update callbacks
│       └── interactions.py     # User interaction callbacks
└── utils/              # Utility functions and helpers

Key Components

Data Layer:

data/parser.py - NDJSON parsing with error handling
data/processor.py - Data transformation and combination logic
models/schemas.py - Dataclasses for type safety and validation
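
As a rough illustration of dataclass-based validation, a record type might look like the sketch below. The class name `EmbeddingRecord` and its validation rules are assumptions made for this example, not the actual contents of models/schemas.py; only the field names are taken from the Data Format section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EmbeddingRecord:
    """Hypothetical record type; the real schemas may differ."""

    embedding: List[float]
    text: str
    id: Optional[str] = None
    category: Optional[str] = None
    subcategory: Optional[str] = None
    tags: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Validate at construction time so bad records fail early.
        if not isinstance(self.embedding, list) or not self.embedding:
            raise ValueError("embedding must be a non-empty list")
        for value in self.embedding:
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError("embedding must contain only numbers")
        if not isinstance(self.text, str):
            raise ValueError("text must be a string")
```

Validating in `__post_init__` keeps the checks next to the field definitions, so every code path that builds a record gets them.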

Algorithm Layer:

models/reducers.py - Modular dimensionality reduction with factory pattern
Supports PCA, t-SNE (openTSNE), and UMAP algorithms
Abstract base class for easy extension
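
The factory-plus-abstract-base-class arrangement can be sketched roughly as follows. All names here (`Reducer`, `TruncateReducer`, `ReducerFactory`) are hypothetical and are not taken from models/reducers.py. The toy reducer only keeps the first coordinates of each vector so the example has no dependencies; it is a stand-in, not PCA, t-SNE, or UMAP.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type

Matrix = List[List[float]]


class Reducer(ABC):
    """Hypothetical base class; the real interface may differ."""

    def __init__(self, n_components: int = 2) -> None:
        self.n_components = n_components

    @abstractmethod
    def fit_transform(self, embeddings: Matrix) -> Matrix:
        """Return one row of length n_components per input row."""


class TruncateReducer(Reducer):
    """Toy stand-in: keeps the first n_components coordinates of each row."""

    def fit_transform(self, embeddings: Matrix) -> Matrix:
        return [row[: self.n_components] for row in embeddings]


class ReducerFactory:
    """Maps method names to reducer classes."""

    _registry: Dict[str, Type[Reducer]] = {"truncate": TruncateReducer}

    @classmethod
    def register(cls, name: str, reducer_cls: Type[Reducer]) -> None:
        # New algorithms plug in here without touching the callers.
        cls._registry[name.lower()] = reducer_cls

    @classmethod
    def create(cls, name: str, n_components: int = 2) -> Reducer:
        try:
            reducer_cls = cls._registry[name.lower()]
        except KeyError:
            raise ValueError(f"unknown reduction method: {name}") from None
        return reducer_cls(n_components=n_components)
```

With this shape, adding an algorithm means subclassing `Reducer` and registering it; code that calls `ReducerFactory.create(...)` stays unchanged.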

Visualization Layer:

visualization/plots.py - Plot factory with single and dual plot support
visualization/colors.py - Color mapping and grayscale conversion utilities
Plotly-based 2D/3D scatter plots with interactive features
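
To make the color mapping and grayscale conversion concrete, here is a minimal sketch. The function names and the palette are assumptions for illustration, not the utilities in visualization/colors.py, and the luma weights (ITU-R BT.601) are one common choice that the real code may not use.

```python
from typing import Dict, List, Sequence

# Hypothetical palette; the app's actual colors are defined elsewhere.
PALETTE: List[str] = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def map_categories(categories: Sequence[str]) -> Dict[str, str]:
    """Assign each distinct category a palette color, cycling if needed."""
    unique = sorted(set(categories))
    return {cat: PALETTE[i % len(PALETTE)] for i, cat in enumerate(unique)}


def to_grayscale(hex_color: str) -> str:
    """Convert '#rrggbb' to a gray of similar perceived brightness."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError("expected a color of the form #rrggbb")
    r, g, b = (int(value[i : i + 2], 16) for i in (0, 2, 4))
    # Weighted sum of channels using ITU-R BT.601 luma coefficients.
    gray = round(0.299 * r + 0.587 * g + 0.114 * b)
    return f"#{gray:02x}{gray:02x}{gray:02x}"
```

Sorting the categories before assigning colors keeps the mapping stable across reloads of the same data.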

UI Layer:

ui/layout.py - Main application layout composition
ui/components/ - Reusable, testable UI components
ui/callbacks/ - Organized callbacks grouped by functionality
Bootstrap-styled sidebar with controls and large visualization area

Configuration:

config/settings.py - Centralized settings with environment variable support
Plot styling, marker configurations, and app-wide constants
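
Centralized settings with environment variable overrides could be sketched like this. The class name `AppSettings` and the `EMBEDDINGBUDDY_*` variable names are hypothetical and may not match config/settings.py; the default host and port mirror the address given under Development Commands.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Read a boolean flag from the environment."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AppSettings:
    """Hypothetical settings object; field names are assumptions."""

    host: str
    port: int
    debug: bool

    @classmethod
    def from_env(cls) -> "AppSettings":
        # Variable names below are illustrative, not the app's real ones.
        return cls(
            host=os.environ.get("EMBEDDINGBUDDY_HOST", "127.0.0.1"),
            port=int(os.environ.get("EMBEDDINGBUDDY_PORT", "8050")),
            debug=_env_bool("EMBEDDINGBUDDY_DEBUG", False),
        )
```

Reading the environment in one place means the rest of the code receives plain typed values and never touches `os.environ` directly.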

Data Format

The application expects NDJSON files where each line contains:

{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, ...], "text": "Sample text", "category": "news", "subcategory": "politics", "tags": ["election"]}

Required fields: embedding (array), text (string)
Optional fields: id, category, subcategory, tags
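
A line-by-line parser that enforces the required fields might look like the sketch below. The function name `parse_ndjson` and its error-reporting style are assumptions for illustration; the parser in data/parser.py may behave differently.

```python
import json
from typing import Any, Dict, List, Tuple

REQUIRED_FIELDS = ("embedding", "text")


def parse_ndjson(content: str) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Parse NDJSON text into (records, errors). Hypothetical helper."""
    records: List[Dict[str, Any]] = []
    errors: List[str] = []
    for line_no, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        try:
            doc = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(doc, dict):
            errors.append(f"line {line_no}: expected a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in doc]
        if missing:
            errors.append(f"line {line_no}: missing {', '.join(missing)}")
            continue
        records.append(doc)
    return records, errors
```

Collecting errors instead of raising on the first bad line lets the caller load the valid records and still report which lines were skipped.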

Callback Architecture

The refactored callback system is organized by functionality:

Data Processing (ui/callbacks/data_processing.py):

File upload handling
NDJSON parsing and validation
Data storage in dcc.Store components
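
Dash's dcc.Upload component passes file contents to callbacks as a data URL string with a base64 payload, so the first step of upload handling is usually a decode. The helper below is a minimal sketch of that step only; the name `decode_upload` is hypothetical, and the callback wiring and dcc.Store handling are omitted.

```python
import base64


def decode_upload(contents: str) -> str:
    """Decode a dcc.Upload `contents` data URL into UTF-8 text."""
    # The string looks like "data:<mime type>;base64,<payload>".
    header, sep, payload = contents.partition(",")
    if not sep or "base64" not in header:
        raise ValueError("expected a base64-encoded data URL")
    return base64.b64decode(payload).decode("utf-8")
```

Keeping this as a plain function, separate from the callback itself, makes it testable without starting a Dash app.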

Visualization (ui/callbacks/visualization.py):

Dimensionality reduction pipeline
Plot generation and updates
Method/parameter change handling

Interactions (ui/callbacks/interactions.py):

Point click handling and detail display
Reset functionality
User interaction management

Testing Architecture

The modular design enables comprehensive testing:

Unit Tests:

tests/test_data_processing.py - Parser and processor logic
tests/test_reducers.py - Dimensionality reduction algorithms
tests/test_visualization.py - Plot creation and color mapping

Integration Tests:

End-to-end data pipeline testing
Component integration verification

Key Testing Benefits:

Fast test execution (milliseconds vs seconds)
Isolated component testing
Easy mocking and fixture creation
High code coverage achievable

Dependencies

Uses a modern Python stack with uv for dependency management:

Core Framework: Dash + Plotly for web interface and visualization
Algorithms: scikit-learn (PCA), openTSNE, umap-learn for dimensionality reduction
Data: pandas/numpy for data manipulation
UI: dash-bootstrap-components for styling
Testing: pytest for test framework
Dev Tools: uv for package management; ruff, mypy, bandit, and safety for linting, type checking, and security scanning

CI/CD and Release Management

Repository Setup

This project uses a dual-repository workflow:

Primary repository: Gitea instance at git.hawt.cloud (read-write)
Mirror repository: GitHub (read-only mirror)

Workflow Organization

Gitea Workflows (.gitea/workflows/):

bump-and-release.yml - Manual version bumping workflow
  - Runs bump_version.py to update version in pyproject.toml
  - Commits changes and creates git tag
  - Pushes to Gitea (main branch + tag)
  - Triggered manually via workflow_dispatch with choice of patch/minor/major bump
release.yml - Automated release creation
  - Triggered when version tags are pushed
  - Runs tests, builds packages
  - Creates Gitea release with artifacts
test.yml - Test suite execution
security.yml - Security scanning

GitHub Workflows (.github/workflows/):

docker-release.yml - Builds and publishes Docker images
pypi-release.yml - Publishes packages to PyPI
These workflows are read-only (no git commits/pushes) and create artifacts only

Release Process

Run manual bump workflow on Gitea: Actions → Bump Version and Release
Select version bump type (patch/minor/major)
Workflow commits version change and pushes tag to Gitea
Tag push triggers release.yml on Gitea (creates release)
GitHub mirror receives tag and triggers artifact builds (Docker, PyPI)

Version Management

Use bump_version.py for version updates:

python bump_version.py patch    # 0.3.0 -> 0.3.1
python bump_version.py minor    # 0.3.0 -> 0.4.0
python bump_version.py major    # 0.3.0 -> 1.0.0
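
The arithmetic behind the three bump types can be sketched as follows. This is an illustration of the version rule shown in the comments above, not the contents of bump_version.py; the function name `bump` is hypothetical, and reading or writing pyproject.toml is left out.

```python
import re


def bump(version: str, part: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for the given bump type."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if part == "major":
        return f"{major + 1}.0.0"  # lower components reset to zero
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {part!r}")
```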

Development Guidelines

When adding new features:

Data Models - Add/update schemas in models/schemas.py
Algorithms - Extend models/reducers.py using the abstract base class
UI Components - Create reusable components in ui/components/
Configuration - Add settings to config/settings.py
Tests - Write tests for all new functionality

Code Organization Principles:

Single responsibility principle
Clear module boundaries
Testable, isolated components
Configuration over hardcoding
Error handling at appropriate layers

Testing Requirements:

Unit tests for all core logic
Integration tests for data flow
Component tests for UI elements
Maintain high test coverage
