EmbeddingBuddy/CLAUDE.md
Austin Godber d66a20ddda
rework server startup and cli
This changes the dockerfile as well.
2025-10-01 19:04:27 -07:00

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with
code in this repository.
## Project Overview
EmbeddingBuddy is a modular Python Dash web application for interactive exploration and
visualization of embedding vectors through dimensionality reduction techniques
(PCA, t-SNE, UMAP). The app provides a drag-and-drop interface for uploading
NDJSON files containing embeddings and visualizes them in 2D/3D plots. The codebase
follows a clean, modular architecture that prioritizes testability and maintainability.
## Development Commands
**Install dependencies:**
```bash
uv sync
```
**Run the application:**
Using the CLI (recommended):
```bash
# Production mode (no debug, no auto-reload)
embeddingbuddy serve
# Development mode (debug + auto-reload on code changes)
embeddingbuddy serve --dev
# Debug logging only (no auto-reload)
embeddingbuddy serve --debug
# With custom host/port
embeddingbuddy serve --host 0.0.0.0 --port 8080
```
The app will be available at <http://127.0.0.1:8050> by default.
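To make the relationship between the flags concrete, here is a minimal sketch of a parser that accepts the options listed above. It uses `argparse` purely for illustration; the real CLI may be built with a different library, and `build_parser` and `resolve_modes` are hypothetical names that do not come from this repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the documented `serve` options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="embeddingbuddy")
    subparsers = parser.add_subparsers(dest="command", required=True)

    serve = subparsers.add_parser("serve", help="Start the web server")
    # Defaults follow the documented address http://127.0.0.1:8050.
    serve.add_argument("--host", default="127.0.0.1", help="Interface to bind")
    serve.add_argument("--port", type=int, default=8050, help="Port to listen on")
    serve.add_argument("--dev", action="store_true", help="Debug plus auto-reload")
    serve.add_argument("--debug", action="store_true", help="Debug logging only")
    return parser


def resolve_modes(args: argparse.Namespace) -> dict:
    """Translate flags into runtime switches: --dev implies both debug and reload."""
    return {"debug": args.dev or args.debug, "reload": args.dev}
```

For example, `resolve_modes(build_parser().parse_args(["serve", "--debug"]))` enables debug logging but leaves auto-reload off, which matches the behavior described for `--debug`.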
**Run tests:**
```bash
uv sync --extra test
uv run pytest tests/ -v
```
**Development tools:**
```bash
# Install all dev dependencies
uv sync --extra dev
# Linting and formatting
uv run ruff check src/ tests/
uv run ruff format src/ tests/
# Type checking
uv run mypy src/embeddingbuddy/
# Security scanning
uv run bandit -r src/
uv run safety check
```
**Test with sample data:**
Use the included `sample_data.ndjson` and `sample_prompts.ndjson` files for testing the application functionality.
## Architecture
### Project Structure
The application follows a modular architecture with clear separation of concerns:
```text
src/embeddingbuddy/
├── app.py                      # Main application entry point and factory
├── main.py                     # Application runner
├── config/
│   └── settings.py             # Centralized configuration management
├── data/
│   ├── parser.py               # NDJSON parsing logic
│   └── processor.py            # Data transformation and processing
├── models/
│   ├── schemas.py              # Data models and validation schemas
│   └── reducers.py             # Dimensionality reduction algorithms
├── visualization/
│   ├── plots.py                # Plot creation and factory classes
│   └── colors.py               # Color mapping and management
├── ui/
│   ├── layout.py               # Main application layout
│   ├── components/             # Reusable UI components
│   │   ├── sidebar.py          # Sidebar component
│   │   └── upload.py           # Upload components
│   └── callbacks/              # Organized callback functions
│       ├── data_processing.py  # Data upload/processing callbacks
│       ├── visualization.py    # Plot update callbacks
│       └── interactions.py     # User interaction callbacks
└── utils/                      # Utility functions and helpers
```
### Key Components
**Data Layer:**
- `data/parser.py` - NDJSON parsing with error handling
- `data/processor.py` - Data transformation and combination logic
- `models/schemas.py` - Dataclasses for type safety and validation
**Algorithm Layer:**
- `models/reducers.py` - Modular dimensionality reduction with factory pattern
- Supports PCA, t-SNE (openTSNE), and UMAP algorithms
- Abstract base class for easy extension
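
As a rough illustration of the factory pattern and abstract base class mentioned above, the sketch below registers reducers by name and builds them on request. All names (`Reducer`, `TruncateReducer`, `create_reducer`) are hypothetical and may not match `models/reducers.py`. The one concrete reducer is a placeholder that only truncates vectors, so the example runs without scikit-learn, openTSNE, or umap-learn.

```python
from abc import ABC, abstractmethod


class Reducer(ABC):
    """Hypothetical base class: a reducer maps vectors to n_components dimensions."""

    def __init__(self, n_components: int = 2) -> None:
        self.n_components = n_components

    @abstractmethod
    def fit_transform(self, vectors: list[list[float]]) -> list[list[float]]:
        """Return one reduced vector per input vector."""


class TruncateReducer(Reducer):
    """Placeholder, not a real algorithm: keeps the first n_components values."""

    def fit_transform(self, vectors: list[list[float]]) -> list[list[float]]:
        return [row[: self.n_components] for row in vectors]


# Registry mapping a method name to its class; real reducers would be added here.
_REGISTRY: dict[str, type[Reducer]] = {"truncate": TruncateReducer}


def create_reducer(method: str, n_components: int = 2) -> Reducer:
    """Factory: look up the reducer class by name and instantiate it."""
    try:
        reducer_cls = _REGISTRY[method.lower()]
    except KeyError:
        raise ValueError(f"Unknown reduction method: {method!r}") from None
    return reducer_cls(n_components=n_components)
```

Under this layout, adding an algorithm means writing one subclass and one registry entry. Callers keep using `create_reducer(name).fit_transform(vectors)` unchanged.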
**Visualization Layer:**
- `visualization/plots.py` - Plot factory with single and dual plot support
- `visualization/colors.py` - Color mapping and grayscale conversion utilities
- Plotly-based 2D/3D scatter plots with interactive features
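
The color utilities could look roughly like the sketch below. It is only an assumption about their behavior: the function names are hypothetical, and the grayscale conversion uses the common BT.601 luma weights, which may not be what `visualization/colors.py` does.

```python
def map_categories_to_colors(categories: list[str], palette: list[str]) -> list[str]:
    """Assign each distinct category a palette color, cycling when the palette is short."""
    unique = sorted(set(categories))
    lookup = {cat: palette[i % len(palette)] for i, cat in enumerate(unique)}
    return [lookup[cat] for cat in categories]


def hex_to_grayscale(hex_color: str) -> str:
    """Convert '#rrggbb' to a gray using BT.601 luma weights (an assumed choice)."""
    value = hex_color.lstrip("#")
    r, g, b = (int(value[i : i + 2], 16) for i in (0, 2, 4))
    luma = round(0.299 * r + 0.587 * g + 0.114 * b)
    return f"#{luma:02x}{luma:02x}{luma:02x}"
```

Sorting the distinct categories before assigning colors keeps the mapping stable across reloads of the same data.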
**UI Layer:**
- `ui/layout.py` - Main application layout composition
- `ui/components/` - Reusable, testable UI components
- `ui/callbacks/` - Organized callbacks grouped by functionality
- Bootstrap-styled sidebar with controls and large visualization area
**Configuration:**
- `config/settings.py` - Centralized settings with environment variable support
- Plot styling, marker configurations, and app-wide constants
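
A minimal sketch of centralized settings with environment overrides follows. The class name `AppSettings` and the `EMBEDDINGBUDDY_*` variable names are assumptions made for this example; check `config/settings.py` for the real ones. The defaults mirror the documented address `127.0.0.1:8050`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AppSettings:
    """Hypothetical settings object; field names are illustrative."""

    host: str = "127.0.0.1"
    port: int = 8050
    debug: bool = False

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "AppSettings":
        """Read overrides from a mapping (os.environ by default), else use defaults."""
        source = os.environ if env is None else env
        return cls(
            host=source.get("EMBEDDINGBUDDY_HOST", "127.0.0.1"),  # assumed variable name
            port=int(source.get("EMBEDDINGBUDDY_PORT", "8050")),  # assumed variable name
            debug=source.get("EMBEDDINGBUDDY_DEBUG", "").strip().lower() in _TRUTHY,
        )
```

Accepting an optional mapping lets tests pass a plain dict without touching the real process environment.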
### Data Format
The application expects NDJSON files where each line contains:
```json
{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, ...], "text": "Sample text", "category": "news", "subcategory": "politics", "tags": ["election"]}
```
- Required fields: `embedding` (array), `text` (string)
- Optional fields: `id`, `category`, `subcategory`, `tags`
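
The format rules above can be sketched as a small validator. This is an illustration only, not the logic in `data/parser.py`: the name `parse_ndjson` and the choice to fill missing optional fields with `None` are assumptions.

```python
import json

OPTIONAL_FIELDS = ("id", "category", "subcategory", "tags")


def parse_ndjson(text: str) -> list[dict]:
    """Parse NDJSON text, requiring `embedding` (array) and `text` (string) per line."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        if not isinstance(record.get("embedding"), list):
            raise ValueError(f"line {line_no}: 'embedding' must be an array")
        if not isinstance(record.get("text"), str):
            raise ValueError(f"line {line_no}: 'text' must be a string")
        for key in OPTIONAL_FIELDS:
            record.setdefault(key, None)  # assumed default for absent optional fields
        records.append(record)
    return records
```

Including the line number in each error makes it easier to locate a malformed record in a large upload.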
### Callback Architecture
The refactored callback system is organized by functionality:
**Data Processing (`ui/callbacks/data_processing.py`):**
- File upload handling
- NDJSON parsing and validation
- Data storage in dcc.Store components
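
For the upload step, Dash's `dcc.Upload` hands the callback a `contents` string in the form `data:<type>;base64,<payload>`. The sketch below shows how that string could be decoded into text before NDJSON parsing. The helper name `decode_upload` is hypothetical, and the sketch assumes UTF-8 files; the callbacks in this repository may handle this differently.

```python
import base64


def decode_upload(contents: str) -> str:
    """Decode a dcc.Upload `contents` data URL into UTF-8 text."""
    header, _, payload = contents.partition(",")
    if not header.startswith("data:") or not payload:
        raise ValueError("expected a 'data:<type>;base64,<payload>' string")
    return base64.b64decode(payload).decode("utf-8")
```

The decoded text can then be parsed line by line, and the resulting records stored in a `dcc.Store` component as JSON-serializable data.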
**Visualization (`ui/callbacks/visualization.py`):**
- Dimensionality reduction pipeline
- Plot generation and updates
- Method/parameter change handling
**Interactions (`ui/callbacks/interactions.py`):**
- Point click handling and detail display
- Reset functionality
- User interaction management
### Testing Architecture
The modular design enables comprehensive testing:
**Unit Tests:**
- `tests/test_data_processing.py` - Parser and processor logic
- `tests/test_reducers.py` - Dimensionality reduction algorithms
- `tests/test_visualization.py` - Plot creation and color mapping
**Integration Tests:**
- End-to-end data pipeline testing
- Component integration verification
**Key Testing Benefits:**
- Fast test execution (milliseconds vs seconds)
- Isolated component testing
- Easy mocking and fixture creation
- High code coverage achievable
## Dependencies
The project uses a modern Python stack, with uv for dependency management:
- **Core Framework:** Dash + Plotly for web interface and visualization
- **Algorithms:** scikit-learn (PCA), openTSNE, umap-learn for dimensionality reduction
- **Data:** pandas/numpy for data manipulation
- **UI:** dash-bootstrap-components for styling
- **Testing:** pytest for test framework
- **Dev Tools:** uv for package management
## CI/CD and Release Management
### Repository Setup
This project uses a **dual-repository workflow**:
- **Primary repository:** Gitea instance at `git.hawt.cloud` (read-write)
- **Mirror repository:** GitHub (read-only mirror)
### Workflow Organization
**Gitea Workflows (`.gitea/workflows/`):**
- **`bump-and-release.yml`** - Manual version bumping workflow
  - Runs `bump_version.py` to update version in `pyproject.toml`
  - Commits changes and creates git tag
  - Pushes to Gitea (main branch + tag)
  - Triggered manually via workflow_dispatch with choice of patch/minor/major bump
- **`release.yml`** - Automated release creation
  - Triggered when version tags are pushed
  - Runs tests, builds packages
  - Creates Gitea release with artifacts
- **`test.yml`** - Test suite execution
- **`security.yml`** - Security scanning
**GitHub Workflows (`.github/workflows/`):**
- **`docker-release.yml`** - Builds and publishes Docker images
- **`pypi-release.yml`** - Publishes packages to PyPI
- These workflows are read-only (no git commits/pushes) and create artifacts only
### Release Process
1. Run manual bump workflow on Gitea: **Actions → Bump Version and Release**
2. Select version bump type (patch/minor/major)
3. Workflow commits version change and pushes tag to Gitea
4. Tag push triggers `release.yml` on Gitea (creates release)
5. GitHub mirror receives tag and triggers artifact builds (Docker, PyPI)
### Version Management
Use `bump_version.py` for version updates:
```bash
python bump_version.py patch # 0.3.0 -> 0.3.1
python bump_version.py minor # 0.3.0 -> 0.4.0
python bump_version.py major # 0.3.0 -> 1.0.0
```
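
The bump rules shown in the comments above follow standard semantic versioning arithmetic. The sketch below illustrates that arithmetic only. It is not the contents of `bump_version.py`, which also updates `pyproject.toml`, and the function name `bump` is hypothetical.

```python
def bump(version: str, part: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' bump."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"  # reset minor and patch
    if part == "minor":
        return f"{major}.{minor + 1}.0"  # reset patch
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown bump type: {part!r}")
```

A bump resets every component to its right, which is why a major bump of `0.3.0` yields `1.0.0`.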
## Development Guidelines
**When adding new features:**
1. **Data Models** - Add/update schemas in `models/schemas.py`
2. **Algorithms** - Extend `models/reducers.py` using the abstract base class
3. **UI Components** - Create reusable components in `ui/components/`
4. **Configuration** - Add settings to `config/settings.py`
5. **Tests** - Write tests for all new functionality
**Code Organization Principles:**
- Single responsibility principle
- Clear module boundaries
- Testable, isolated components
- Configuration over hardcoding
- Error handling at appropriate layers
**Testing Requirements:**
- Unit tests for all core logic
- Integration tests for data flow
- Component tests for UI elements
- Maintain high test coverage