8 Commits

Author SHA1 Message Date
1b6845774b fix formatting and bump version to v0.3.0
All checks were successful
Security Scan / dependency-check (pull_request) Successful in 44s
Test Suite / lint (pull_request) Successful in 34s
Test Suite / build (pull_request) Successful in 38s
Security Scan / security (pull_request) Successful in 49s
Test Suite / test (3.11) (pull_request) Successful in 1m32s
2025-08-14 19:02:17 -07:00
09e3c86f0a opensearch load improvements
Some checks failed
Security Scan / dependency-check (pull_request) Successful in 44s
Test Suite / lint (pull_request) Failing after 32s
Security Scan / security (pull_request) Successful in 45s
Test Suite / test (3.11) (pull_request) Successful in 1m31s
Test Suite / build (pull_request) Has been skipped
2025-08-14 14:30:52 -07:00
9cf2f0e6fa this will load data from Opensearch.
it doesn't have prompts as well
2025-08-14 13:49:46 -07:00
a2adc8b958 Merge pull request 'fixed refactored code and validated inputs' (#2) from validate-inputs into main
Some checks failed
Security Scan / dependency-check (push) Successful in 34s
Security Scan / security (push) Successful in 40s
Test Suite / lint (push) Successful in 27s
Test Suite / test (3.11) (push) Successful in 1m30s
Release / test (push) Successful in 59s
Release / build-and-release (push) Failing after 36s
Test Suite / build (push) Successful in 46s
Fixed the refactored version, removed app.py, added error feedback on bad input files.

Reviewed-on: #2
2025-08-14 08:11:28 -07:00
4867614474 reformat
All checks were successful
Security Scan / dependency-check (pull_request) Successful in 35s
Security Scan / security (pull_request) Successful in 39s
Test Suite / lint (pull_request) Successful in 30s
Test Suite / test (3.11) (pull_request) Successful in 1m26s
Test Suite / build (pull_request) Successful in 37s
2025-08-14 08:07:50 -07:00
6a995635ac remove upload success alert
Some checks failed
Security Scan / security (pull_request) Successful in 40s
Test Suite / test (3.11) (pull_request) Successful in 1m25s
Test Suite / build (pull_request) Has been skipped
Security Scan / dependency-check (pull_request) Successful in 35s
Test Suite / lint (pull_request) Failing after 26s
2025-08-14 08:00:47 -07:00
7b81c20a26 fixed refactored code
Some checks failed
Security Scan / dependency-check (pull_request) Successful in 38s
Security Scan / security (pull_request) Successful in 41s
Test Suite / lint (pull_request) Failing after 28s
Test Suite / test (3.11) (pull_request) Successful in 1m27s
Test Suite / build (pull_request) Has been skipped
2025-08-14 07:55:40 -07:00
1ec7e2c38c add ci workflows (#1)
All checks were successful
Security Scan / security (push) Successful in 30s
Security Scan / dependency-check (push) Successful in 25s
Test Suite / test (3.11) (push) Successful in 1m16s
Test Suite / lint (push) Successful in 20s
Test Suite / build (push) Successful in 35s
Reviewed-on: #1
2025-08-13 21:03:42 -07:00
47 changed files with 4567 additions and 1059 deletions


@@ -6,6 +6,7 @@
"Bash(uv add:*)" "Bash(uv add:*)"
], ],
"deny": [], "deny": [],
"ask": [] "ask": [],
"defaultMode": "acceptEdits"
} }
} }


@@ -0,0 +1,92 @@
name: Release

on:
  push:
    tags:
      - 'v*'
  workflow_dispatch:
    inputs:
      version:
        description: 'Release version (e.g., v1.0.0)'
        required: true

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
        with:
          version: "latest"
      - name: Set up Python
        run: uv python install 3.11
      - name: Install dependencies
        run: uv sync --extra test
      - name: Run full test suite
        run: uv run pytest tests/ -v --cov=src/embeddingbuddy --cov-report=term-missing

  build-and-release:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
        with:
          version: "latest"
      - name: Set up Python
        run: uv python install 3.11
      - name: Install dependencies
        run: uv sync
      - name: Build package
        run: uv build
      - name: Create release notes
        run: |
          echo "# Release Notes" > release-notes.md
          echo "" >> release-notes.md
          echo "## What's New" >> release-notes.md
          echo "" >> release-notes.md
          echo "- Modular architecture with improved testability" >> release-notes.md
          echo "- Comprehensive test suite" >> release-notes.md
          echo "- Enhanced documentation" >> release-notes.md
          echo "- Security scanning and dependency management" >> release-notes.md
          echo "" >> release-notes.md
          echo "## Installation" >> release-notes.md
          echo "" >> release-notes.md
          echo '```bash' >> release-notes.md
          echo 'uv sync' >> release-notes.md
          echo 'uv run python main.py' >> release-notes.md
          echo '```' >> release-notes.md
      - name: Create Release
        # id is required so the upload step below can reference this step's outputs
        id: create_release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITEA_TOKEN }}
        with:
          tag_name: ${{ github.ref_name || github.event.inputs.version }}
          release_name: Release ${{ github.ref_name || github.event.inputs.version }}
          body_path: release-notes.md
          draft: false
          prerelease: false
      - name: Upload Release Assets
        uses: actions/upload-release-asset@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITEA_TOKEN }}
        with:
          upload_url: ${{ steps.create_release.outputs.upload_url }}
          asset_path: dist/
          asset_name: embeddingbuddy-dist
          asset_content_type: application/zip


@@ -0,0 +1,70 @@
name: Security Scan

on:
  push:
    branches: ["main", "master", "develop"]
  pull_request:
    branches: ["main", "master"]
  schedule:
    # Run security scan weekly on Sundays at 2 AM UTC
    - cron: '0 2 * * 0'

jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
        with:
          version: "latest"
      - name: Set up Python
        run: uv python install 3.11
      - name: Install dependencies
        run: uv sync --extra security
      - name: Run bandit security linter
        run: uv run bandit -r src/ -f json -o bandit-report.json
        continue-on-error: true
      - name: Run safety vulnerability check
        run: uv run safety check --json --save-json safety-report.json
        continue-on-error: true
      - name: Upload security reports
        uses: actions/upload-artifact@v3
        with:
          name: security-reports
          path: |
            bandit-report.json
            safety-report.json

  dependency-check:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
        with:
          version: "latest"
      - name: Set up Python
        run: uv python install 3.11
      - name: Check for dependency vulnerabilities
        run: |
          uv sync --extra security
          uv run pip-audit --format=json --output=pip-audit-report.json
        continue-on-error: true
      - name: Upload dependency audit report
        uses: actions/upload-artifact@v3
        with:
          name: dependency-audit
          path: pip-audit-report.json

.gitea/workflows/test.yml

@@ -0,0 +1,104 @@
name: Test Suite

on:
  push:
    branches:
      - "main"
      - "develop"
  pull_request:
    branches:
      - "main"
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.11"]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
        with:
          version: "latest"
      - name: Set up Python ${{ matrix.python-version }}
        run: uv python install ${{ matrix.python-version }}
      - name: Install dependencies
        run: uv sync --extra test
      - name: Run tests with pytest
        run: uv run pytest tests/ -v --tb=short
      - name: Run tests with coverage
        run: uv run pytest tests/ --cov=src/embeddingbuddy --cov-report=term-missing --cov-report=xml
      - name: Upload coverage reports
        uses: codecov/codecov-action@v4
        if: matrix.python-version == '3.11'
        with:
          file: ./coverage.xml
          fail_ci_if_error: false

  lint:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
        with:
          version: "latest"
      - name: Set up Python
        run: uv python install 3.11
      - name: Install dependencies
        run: uv sync --extra lint
      - name: Run ruff linter
        run: uv run ruff check src/ tests/
      - name: Run ruff formatter check
        run: uv run ruff format --check src/ tests/
      # TODO fix this it throws errors
      # - name: Run mypy type checker
      #   run: uv run mypy src/embeddingbuddy/ --ignore-missing-imports

  build:
    runs-on: ubuntu-latest
    needs: [test, lint]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
        with:
          version: "latest"
      - name: Set up Python
        run: uv python install 3.11
      - name: Install dependencies
        run: uv sync
      - name: Build package
        run: uv build
      - name: Test installation
        run: |
          uv run python -c "from src.embeddingbuddy.app import create_app; app = create_app(); print('✅ Package builds and imports successfully')"
      - name: Upload build artifacts
        uses: actions/upload-artifact@v3
        with:
          name: dist-files
          path: dist/

.gitignore

@@ -1,12 +1,84 @@
# Python-generated files
__pycache__/
*.py[oc]
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
*.manifest
*.spec

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Virtual environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# IDEs
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Project specific
*.log
.mypy_cache/
.dmypy.json
dmypy.json
temp/
todo/

# Security reports
bandit-report.json
safety-report.json
pip-audit-report.json

# Temporary files
*.tmp


@@ -30,9 +30,28 @@ The app will be available at http://127.0.0.1:8050
**Run tests:**

```bash
uv sync --extra test
uv run pytest tests/ -v
```

**Development tools:**

```bash
# Install all dev dependencies
uv sync --extra dev

# Linting and formatting
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Type checking
uv run mypy src/embeddingbuddy/

# Security scanning
uv run bandit -r src/
uv run safety check
```

**Test with sample data:**

Use the included `sample_data.ndjson` and `sample_prompts.ndjson` files for testing the application functionality.

@@ -42,7 +61,7 @@ Use the included `sample_data.ndjson` and `sample_prompts.ndjson` files for test

The application follows a modular architecture with clear separation of concerns:

```text
src/embeddingbuddy/
├── app.py          # Main application entry point and factory
├── main.py         # Application runner

@@ -72,27 +91,32 @@ src/embeddingbuddy/

### Key Components

**Data Layer:**
- `data/parser.py` - NDJSON parsing with error handling
- `data/processor.py` - Data transformation and combination logic
- `models/schemas.py` - Dataclasses for type safety and validation

**Algorithm Layer:**
- `models/reducers.py` - Modular dimensionality reduction with factory pattern
- Supports PCA, t-SNE (openTSNE), and UMAP algorithms
- Abstract base class for easy extension

**Visualization Layer:**
- `visualization/plots.py` - Plot factory with single and dual plot support
- `visualization/colors.py` - Color mapping and grayscale conversion utilities
- Plotly-based 2D/3D scatter plots with interactive features

**UI Layer:**
- `ui/layout.py` - Main application layout composition
- `ui/components/` - Reusable, testable UI components
- `ui/callbacks/` - Organized callbacks grouped by functionality
- Bootstrap-styled sidebar with controls and large visualization area

**Configuration:**
- `config/settings.py` - Centralized settings with environment variable support
- Plot styling, marker configurations, and app-wide constants

@@ -112,16 +136,19 @@ Optional fields: `id`, `category`, `subcategory`, `tags`

The refactored callback system is organized by functionality:

**Data Processing (`ui/callbacks/data_processing.py`):**
- File upload handling
- NDJSON parsing and validation
- Data storage in dcc.Store components

**Visualization (`ui/callbacks/visualization.py`):**
- Dimensionality reduction pipeline
- Plot generation and updates
- Method/parameter change handling

**Interactions (`ui/callbacks/interactions.py`):**
- Point click handling and detail display
- Reset functionality
- User interaction management

@@ -131,15 +158,18 @@ The refactored callback system is organized by functionality:

The modular design enables comprehensive testing:

**Unit Tests:**
- `tests/test_data_processing.py` - Parser and processor logic
- `tests/test_reducers.py` - Dimensionality reduction algorithms
- `tests/test_visualization.py` - Plot creation and color mapping

**Integration Tests:**
- End-to-end data pipeline testing
- Component integration verification

**Key Testing Benefits:**
- Fast test execution (milliseconds vs seconds)
- Isolated component testing
- Easy mocking and fixture creation

@@ -167,6 +197,7 @@ Uses modern Python stack with uv for dependency management:

5. **Tests** - Write tests for all new functionality

**Code Organization Principles:**
- Single responsibility principle
- Clear module boundaries
- Testable, isolated components

@@ -174,7 +205,8 @@ Uses modern Python stack with uv for dependency management:

- Error handling at appropriate layers

**Testing Requirements:**
- Unit tests for all core logic
- Integration tests for data flow
- Component tests for UI elements
- Maintain high test coverage


@@ -90,7 +90,7 @@ uv run python main.py
The application follows a modular architecture for improved maintainability and testability:

```text
src/embeddingbuddy/
├── config/              # Configuration management
│   └── settings.py      # Centralized app settings

@@ -115,8 +115,8 @@ src/embeddingbuddy/

Run the test suite to verify functionality:

```bash
# Install test dependencies
uv sync --extra test

# Run all tests
uv run pytest tests/ -v

@@ -128,6 +128,31 @@ uv run pytest tests/test_data_processing.py -v

uv run pytest tests/ --cov=src/embeddingbuddy
```

### Development Tools

Install development dependencies for linting, type checking, and security:

```bash
# Install all dev dependencies
uv sync --extra dev

# Or install specific groups
uv sync --extra test      # Testing tools
uv sync --extra lint      # Linting and formatting
uv sync --extra security  # Security scanning tools

# Run linting
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Run type checking
uv run mypy src/embeddingbuddy/

# Run security scans
uv run bandit -r src/
uv run safety check
```

### Adding New Features

The modular architecture makes it easy to extend functionality:
app.py

@@ -1,515 +0,0 @@
import json
import uuid
from io import StringIO
import base64

import dash
from dash import dcc, html, Input, Output, State, callback
import dash_bootstrap_components as dbc
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import umap
from openTSNE import TSNE

app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])


def parse_ndjson(contents):
    """Parse NDJSON content and return list of documents."""
    content_type, content_string = contents.split(',')
    decoded = base64.b64decode(content_string)
    text_content = decoded.decode('utf-8')

    documents = []
    for line in text_content.strip().split('\n'):
        if line.strip():
            doc = json.loads(line)
            if 'id' not in doc:
                doc['id'] = str(uuid.uuid4())
            documents.append(doc)
    return documents


def apply_dimensionality_reduction(embeddings, method='pca', n_components=3):
    """Apply dimensionality reduction to embeddings."""
    if method == 'pca':
        reducer = PCA(n_components=n_components)
        reduced = reducer.fit_transform(embeddings)
        variance_explained = reducer.explained_variance_ratio_
        return reduced, variance_explained
    elif method == 'tsne':
        reducer = TSNE(n_components=n_components, random_state=42)
        reduced = reducer.fit(embeddings)
        return reduced, None
    elif method == 'umap':
        reducer = umap.UMAP(n_components=n_components, random_state=42)
        reduced = reducer.fit_transform(embeddings)
        return reduced, None
    else:
        raise ValueError(f"Unknown method: {method}")


def create_color_mapping(documents, color_by):
    """Create color mapping for documents based on specified field."""
    if color_by == 'category':
        values = [doc.get('category', 'Unknown') for doc in documents]
    elif color_by == 'subcategory':
        values = [doc.get('subcategory', 'Unknown') for doc in documents]
    elif color_by == 'tags':
        values = [', '.join(doc.get('tags', [])) if doc.get('tags') else 'No tags' for doc in documents]
    else:
        values = ['All'] * len(documents)
    return values


def create_plot(df, dimensions='3d', color_by='category', method='PCA'):
    """Create plotly scatter plot."""
    color_values = create_color_mapping(df.to_dict('records'), color_by)

    # Truncate text for hover display
    df_display = df.copy()
    df_display['text_preview'] = df_display['text'].apply(lambda x: x[:100] + "..." if len(x) > 100 else x)

    # Include all metadata fields in hover
    hover_fields = ['id', 'text_preview', 'category', 'subcategory']

    # Add tags as a string for hover
    df_display['tags_str'] = df_display['tags'].apply(lambda x: ', '.join(x) if x else 'None')
    hover_fields.append('tags_str')

    if dimensions == '3d':
        fig = px.scatter_3d(
            df_display, x='dim_1', y='dim_2', z='dim_3',
            color=color_values,
            hover_data=hover_fields,
            title=f'3D Embedding Visualization - {method} (colored by {color_by})'
        )
        fig.update_traces(marker=dict(size=5))
    else:
        fig = px.scatter(
            df_display, x='dim_1', y='dim_2',
            color=color_values,
            hover_data=hover_fields,
            title=f'2D Embedding Visualization - {method} (colored by {color_by})'
        )
        fig.update_traces(marker=dict(size=8))

    fig.update_layout(
        height=None,  # Let CSS height control this
        autosize=True,
        margin=dict(l=0, r=0, t=50, b=0)
    )
    return fig


def create_dual_plot(doc_df, prompt_df, dimensions='3d', color_by='category', method='PCA', show_prompts=None):
    """Create plotly scatter plot with separate traces for documents and prompts."""
    # Create the base figure
    fig = go.Figure()

    # Helper function to convert colors to grayscale
    def to_grayscale_hex(color_str):
        """Convert a color to grayscale while maintaining some distinction."""
        import plotly.colors as pc

        # Try to get RGB values from the color
        try:
            if color_str.startswith('#'):
                # Hex color
                rgb = tuple(int(color_str[i:i+2], 16) for i in (1, 3, 5))
            else:
                # Named color or other format - convert through plotly
                rgb = pc.hex_to_rgb(pc.convert_colors_to_same_type([color_str], colortype='hex')[0][0])
            # Convert to grayscale using luminance formula, but keep some color
            gray_value = int(0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2])
            # Make it a bit more gray but not completely
            gray_rgb = (gray_value * 0.7 + rgb[0] * 0.3,
                        gray_value * 0.7 + rgb[1] * 0.3,
                        gray_value * 0.7 + rgb[2] * 0.3)
            return f'rgb({int(gray_rgb[0])},{int(gray_rgb[1])},{int(gray_rgb[2])})'
        except:
            return 'rgb(128,128,128)'  # fallback gray

    # Create document plot using plotly express for consistent colors
    doc_color_values = create_color_mapping(doc_df.to_dict('records'), color_by)
    doc_df_display = doc_df.copy()
    doc_df_display['text_preview'] = doc_df_display['text'].apply(lambda x: x[:100] + "..." if len(x) > 100 else x)
    doc_df_display['tags_str'] = doc_df_display['tags'].apply(lambda x: ', '.join(x) if x else 'None')
    hover_fields = ['id', 'text_preview', 'category', 'subcategory', 'tags_str']

    # Create documents plot to get the color mapping
    if dimensions == '3d':
        doc_fig = px.scatter_3d(
            doc_df_display, x='dim_1', y='dim_2', z='dim_3',
            color=doc_color_values,
            hover_data=hover_fields
        )
    else:
        doc_fig = px.scatter(
            doc_df_display, x='dim_1', y='dim_2',
            color=doc_color_values,
            hover_data=hover_fields
        )

    # Add document traces to main figure
    for trace in doc_fig.data:
        trace.name = f'Documents - {trace.name}'
        if dimensions == '3d':
            trace.marker.size = 5
            trace.marker.symbol = 'circle'
        else:
            trace.marker.size = 8
            trace.marker.symbol = 'circle'
        trace.marker.opacity = 1.0
        fig.add_trace(trace)

    # Add prompt traces if they exist
    if prompt_df is not None and show_prompts and 'show' in show_prompts:
        prompt_color_values = create_color_mapping(prompt_df.to_dict('records'), color_by)
        prompt_df_display = prompt_df.copy()
        prompt_df_display['text_preview'] = prompt_df_display['text'].apply(lambda x: x[:100] + "..." if len(x) > 100 else x)
        prompt_df_display['tags_str'] = prompt_df_display['tags'].apply(lambda x: ', '.join(x) if x else 'None')

        # Create prompts plot to get consistent color grouping
        if dimensions == '3d':
            prompt_fig = px.scatter_3d(
                prompt_df_display, x='dim_1', y='dim_2', z='dim_3',
                color=prompt_color_values,
                hover_data=hover_fields
            )
        else:
            prompt_fig = px.scatter(
                prompt_df_display, x='dim_1', y='dim_2',
                color=prompt_color_values,
                hover_data=hover_fields
            )

        # Add prompt traces with grayed colors
        for trace in prompt_fig.data:
            # Convert the color to grayscale
            original_color = trace.marker.color
            if hasattr(trace.marker, 'color') and isinstance(trace.marker.color, str):
                trace.marker.color = to_grayscale_hex(trace.marker.color)
            trace.name = f'Prompts - {trace.name}'
            if dimensions == '3d':
                trace.marker.size = 6
                trace.marker.symbol = 'diamond'
            else:
                trace.marker.size = 10
                trace.marker.symbol = 'diamond'
            trace.marker.opacity = 0.8
            fig.add_trace(trace)

    title = f'{dimensions.upper()} Embedding Visualization - {method} (colored by {color_by})'
    fig.update_layout(
        title=title,
        height=None,
        autosize=True,
        margin=dict(l=0, r=0, t=50, b=0)
    )
    return fig


# Layout
app.layout = dbc.Container([
    dbc.Row([
        dbc.Col([
            html.H1("EmbeddingBuddy", className="text-center mb-4"),
        ], width=12)
    ]),
    dbc.Row([
        # Left sidebar with controls
        dbc.Col([
            html.H5("Upload Data", className="mb-3"),
            dcc.Upload(
                id='upload-data',
                children=html.Div([
                    'Drag and Drop or ',
                    html.A('Select Files')
                ]),
                style={
                    'width': '100%',
                    'height': '60px',
                    'lineHeight': '60px',
                    'borderWidth': '1px',
                    'borderStyle': 'dashed',
                    'borderRadius': '5px',
                    'textAlign': 'center',
                    'margin-bottom': '20px'
                },
                multiple=False
            ),
            dcc.Upload(
                id='upload-prompts',
                children=html.Div([
                    'Drag and Drop Prompts or ',
                    html.A('Select Files')
                ]),
                style={
                    'width': '100%',
                    'height': '60px',
                    'lineHeight': '60px',
                    'borderWidth': '1px',
                    'borderStyle': 'dashed',
                    'borderRadius': '5px',
                    'textAlign': 'center',
                    'margin-bottom': '20px',
                    'borderColor': '#28a745'
                },
                multiple=False
            ),
            dbc.Button(
                "Reset All Data",
                id='reset-button',
                color='danger',
                outline=True,
                size='sm',
                className='mb-3',
                style={'width': '100%'}
            ),
            html.H5("Visualization Controls", className="mb-3"),
            dbc.Label("Method:"),
            dcc.Dropdown(
                id='method-dropdown',
                options=[
                    {'label': 'PCA', 'value': 'pca'},
                    {'label': 't-SNE', 'value': 'tsne'},
                    {'label': 'UMAP', 'value': 'umap'}
                ],
                value='pca',
                style={'margin-bottom': '15px'}
            ),
            dbc.Label("Color by:"),
            dcc.Dropdown(
                id='color-dropdown',
                options=[
                    {'label': 'Category', 'value': 'category'},
                    {'label': 'Subcategory', 'value': 'subcategory'},
                    {'label': 'Tags', 'value': 'tags'}
                ],
                value='category',
                style={'margin-bottom': '15px'}
            ),
            dbc.Label("Dimensions:"),
            dcc.RadioItems(
                id='dimension-toggle',
                options=[
                    {'label': '2D', 'value': '2d'},
                    {'label': '3D', 'value': '3d'}
                ],
                value='3d',
                style={'margin-bottom': '20px'}
            ),
            dbc.Label("Show Prompts:"),
            dcc.Checklist(
                id='show-prompts-toggle',
                options=[{'label': 'Show prompts on plot', 'value': 'show'}],
                value=['show'],
                style={'margin-bottom': '20px'}
            ),
            html.H5("Point Details", className="mb-3"),
            html.Div(id='point-details', children="Click on a point to see details")
        ], width=3, style={'padding-right': '20px'}),
        # Main visualization area
        dbc.Col([
            dcc.Graph(
                id='embedding-plot',
                style={'height': '85vh', 'width': '100%'},
                config={'responsive': True, 'displayModeBar': True}
            )
        ], width=9)
    ]),
    dcc.Store(id='processed-data'),
    dcc.Store(id='processed-prompts')
], fluid=True)


@callback(
    Output('processed-data', 'data'),
    Input('upload-data', 'contents'),
    State('upload-data', 'filename')
)
def process_uploaded_file(contents, filename):
    if contents is None:
        return None
    try:
        documents = parse_ndjson(contents)
        embeddings = np.array([doc['embedding'] for doc in documents])
        # Store original embeddings and documents
        return {
            'documents': documents,
            'embeddings': embeddings.tolist()
        }
    except Exception as e:
        return {'error': str(e)}


@callback(
    Output('processed-prompts', 'data'),
    Input('upload-prompts', 'contents'),
    State('upload-prompts', 'filename')
)
def process_uploaded_prompts(contents, filename):
    if contents is None:
        return None
    try:
        prompts = parse_ndjson(contents)
        embeddings = np.array([prompt['embedding'] for prompt in prompts])
        # Store original embeddings and prompts
        return {
            'prompts': prompts,
            'embeddings': embeddings.tolist()
        }
    except Exception as e:
        return {'error': str(e)}


@callback(
    Output('embedding-plot', 'figure'),
    [Input('processed-data', 'data'),
     Input('processed-prompts', 'data'),
     Input('method-dropdown', 'value'),
     Input('color-dropdown', 'value'),
     Input('dimension-toggle', 'value'),
     Input('show-prompts-toggle', 'value')]
)
def update_plot(data, prompts_data, method, color_by, dimensions, show_prompts):
    if not data or 'error' in data:
        return go.Figure().add_annotation(
            text="Upload a valid NDJSON file to see visualization",
            xref="paper", yref="paper",
            x=0.5, y=0.5, xanchor='center', yanchor='middle',
            showarrow=False, font=dict(size=16)
        )

    # Prepare embeddings for dimensionality reduction
    doc_embeddings = np.array(data['embeddings'])
    all_embeddings = doc_embeddings
    has_prompts = prompts_data and 'error' not in prompts_data and prompts_data.get('prompts')
    if has_prompts:
        prompt_embeddings = np.array(prompts_data['embeddings'])
        all_embeddings = np.vstack([doc_embeddings, prompt_embeddings])

    n_components = 3 if dimensions == '3d' else 2

    # Apply dimensionality reduction to combined data
    reduced, variance_explained = apply_dimensionality_reduction(
        all_embeddings, method=method, n_components=n_components
    )

    # Split reduced embeddings back
    doc_reduced = reduced[:len(doc_embeddings)]
    prompt_reduced = reduced[len(doc_embeddings):] if has_prompts else None

    # Create dataframes
    doc_df_data = []
    for i, doc in enumerate(data['documents']):
        row = {
            'id': doc['id'],
            'text': doc['text'],
            'category': doc.get('category', 'Unknown'),
            'subcategory': doc.get('subcategory', 'Unknown'),
            'tags': doc.get('tags', []),
            'dim_1': doc_reduced[i, 0],
            'dim_2': doc_reduced[i, 1],
            'type': 'document'
        }
        if dimensions == '3d':
            row['dim_3'] = doc_reduced[i, 2]
        doc_df_data.append(row)
    doc_df = pd.DataFrame(doc_df_data)

    prompt_df = None
    if has_prompts and prompt_reduced is not None:
        prompt_df_data = []
        for i, prompt in enumerate(prompts_data['prompts']):
            row = {
                'id': prompt['id'],
                'text': prompt['text'],
                'category': prompt.get('category', 'Unknown'),
                'subcategory': prompt.get('subcategory', 'Unknown'),
                'tags': prompt.get('tags', []),
                'dim_1': prompt_reduced[i, 0],
                'dim_2': prompt_reduced[i, 1],
                'type': 'prompt'
            }
            if dimensions == '3d':
                row['dim_3'] = prompt_reduced[i, 2]
            prompt_df_data.append(row)
        prompt_df = pd.DataFrame(prompt_df_data)

    return create_dual_plot(doc_df, prompt_df, dimensions, color_by, method.upper(), show_prompts)


@callback(
    Output('point-details', 'children'),
    Input('embedding-plot', 'clickData'),
    [State('processed-data', 'data'),
     State('processed-prompts', 'data')]
)
def display_click_data(clickData, data, prompts_data):
    if not clickData or not data:
        return "Click on a point to see details"

    # Get point info from click
    point_data = clickData['points'][0]
    trace_name = point_data.get('fullData', {}).get('name', 'Documents')

    if 'pointIndex' in point_data:
        point_index = point_data['pointIndex']
    elif 'pointNumber' in point_data:
        point_index = point_data['pointNumber']
    else:
        return "Could not identify clicked point"

    # Determine which dataset this point belongs to
    if trace_name == 'Prompts' and prompts_data and 'prompts' in prompts_data:
        item = prompts_data['prompts'][point_index]
        item_type = 'Prompt'
    else:
        item = data['documents'][point_index]
        item_type = 'Document'

    return dbc.Card([
        dbc.CardBody([
            html.H5(f"{item_type}: {item['id']}", className="card-title"),
            html.P(f"Text: {item['text']}", className="card-text"),
            html.P(f"Category: {item.get('category', 'Unknown')}", className="card-text"),
            html.P(f"Subcategory: {item.get('subcategory', 'Unknown')}", className="card-text"),
            html.P(f"Tags: {', '.join(item.get('tags', [])) if item.get('tags') else 'None'}", className="card-text"),
            html.P(f"Type: {item_type}", className="card-text text-muted")
        ])
    ])


@callback(
    [Output('processed-data', 'data', allow_duplicate=True),
     Output('processed-prompts', 'data', allow_duplicate=True),
     Output('point-details', 'children', allow_duplicate=True)],
    Input('reset-button', 'n_clicks'),
    prevent_initial_call=True
)
def reset_data(n_clicks):
    if n_clicks is None or n_clicks == 0:
        return dash.no_update, dash.no_update, dash.no_update
    return None, None, "Click on a point to see details"


if __name__ == '__main__':
    app.run(debug=True)

View File

@@ -0,0 +1,157 @@
# Elasticsearch/OpenSearch Sample Data
This directory contains sample data files in EmbeddingBuddy's NDJSON format and in Elasticsearch/OpenSearch bulk index format for testing the OpenSearch integration in EmbeddingBuddy.
## Files
### Original NDJSON Files
- `sample_data.ndjson` - Original sample documents in EmbeddingBuddy format
- `sample_prompts.ndjson` - Original sample prompts in EmbeddingBuddy format
### Elasticsearch Bulk Files
- `sample_data_es_bulk.ndjson` - Documents in ES bulk format (index: "embeddings")
- `sample_prompts_es_bulk.ndjson` - Prompts in ES bulk format (index: "prompts")
## Usage
### 1. Index the data using curl
```bash
# Index main documents
curl -X POST "localhost:9200/_bulk" \
-H "Content-Type: application/x-ndjson" \
--data-binary @sample_data_es_bulk.ndjson
# Index prompts
curl -X POST "localhost:9200/_bulk" \
-H "Content-Type: application/x-ndjson" \
--data-binary @sample_prompts_es_bulk.ndjson
```
### 2. Create proper mappings (recommended)
Before bulk indexing, create each index with an explicit `knn_vector` mapping so the vectors are stored for similarity search:
```bash
# Create embeddings index with dense_vector mapping
curl -X PUT "localhost:9200/embeddings" \
-H "Content-Type: application/json" \
-d '{
"settings": {
"index.knn": true
},
"mappings": {
"properties": {
"id": {"type": "keyword"},
"embedding": {
"type": "knn_vector",
"dimension": 8,
"method": {
"engine": "lucene",
"space_type": "cosinesimil",
"name": "hnsw",
"parameters": {}
}
},
"text": {"type": "text"},
"category": {"type": "keyword"},
"subcategory": {"type": "keyword"},
"tags": {"type": "keyword"}
}
}
}'
# Create prompts index with the same mapping
curl -X PUT "localhost:9200/prompts" \
-H "Content-Type: application/json" \
-d '{
"settings": {
"index.knn": true
},
"mappings": {
"properties": {
"id": {"type": "keyword"},
"embedding": {
"type": "knn_vector",
"dimension": 8,
"method": {
"engine": "lucene",
"space_type": "cosinesimil",
"name": "hnsw",
"parameters": {}
}
},
"text": {"type": "text"},
"category": {"type": "keyword"},
"subcategory": {"type": "keyword"},
"tags": {"type": "keyword"}
}
}
}'
```
Then index the data using the bulk files above.
### 3. Test in EmbeddingBuddy
#### For "embeddings" index
- **OpenSearch URL**: `http://localhost:9200`
- **Index Name**: `embeddings`
- **Field Mapping**:
- Embedding Field: `embedding`
- Text Field: `text`
- ID Field: `id`
- Category Field: `category`
- Subcategory Field: `subcategory`
- Tags Field: `tags`
#### For "embeddings-dense" index (alternative field names)
- **OpenSearch URL**: `http://localhost:9200`
- **Index Name**: `embeddings-dense`
- **Field Mapping**:
- Embedding Field: `vector`
- Text Field: `content`
- ID Field: `doc_id`
- Category Field: `type`
- Subcategory Field: `subtopic`
- Tags Field: `keywords`
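Under the hood, this field mapping amounts to a rename step from the index's field names to EmbeddingBuddy's canonical schema. A minimal sketch, where `FIELD_MAP` and `remap_hit` are illustrative names rather than EmbeddingBuddy's actual API:

```python
# Map alternative OpenSearch field names onto EmbeddingBuddy's canonical schema.
# FIELD_MAP and remap_hit are illustrative helpers, not EmbeddingBuddy internals.
FIELD_MAP = {
    "vector": "embedding",
    "content": "text",
    "doc_id": "id",
    "type": "category",
    "subtopic": "subcategory",
    "keywords": "tags",
}


def remap_hit(source: dict, field_map: dict) -> dict:
    """Rename fields from an OpenSearch _source dict to canonical names."""
    return {field_map.get(key, key): value for key, value in source.items()}


hit = {"doc_id": "doc_001", "vector": [0.1, 0.2], "content": "hello", "type": "news"}
print(remap_hit(hit, FIELD_MAP))
```

Fields not listed in the map pass through unchanged, so extra metadata in the index survives the transformation.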
## Data Structure
### Original Format (from NDJSON files)
```json
{
"id": "doc_001",
"embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
"text": "Machine learning algorithms are transforming healthcare...",
"category": "technology",
"subcategory": "healthcare",
"tags": ["ai", "medicine", "prediction"]
}
```
### ES Bulk Format
```json
{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [...], "text": "...", "category": "...", ...}
```
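Converting the original NDJSON into this bulk format is mechanical: each document line gains a preceding action line. A minimal sketch, assuming every document carries an `id` field (the `to_bulk_lines` helper is illustrative, not part of EmbeddingBuddy):

```python
import json


def to_bulk_lines(ndjson_text: str, index_name: str) -> str:
    """Interleave {"index": ...} action lines with the original documents."""
    lines = []
    for line in ndjson_text.strip().splitlines():
        if not line.strip():
            continue
        doc = json.loads(line)
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    # The _bulk API requires a trailing newline after the last line
    return "\n".join(lines) + "\n"


sample = '{"id": "doc_001", "embedding": [0.1, 0.2], "text": "hello"}'
print(to_bulk_lines(sample, "embeddings"))
```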
### Alternative Field Names (dense vector format)
```json
{"index": {"_index": "embeddings-dense", "_id": "doc_001"}}
{"doc_id": "doc_001", "vector": [...], "content": "...", "type": "...", ...}
```
## Notes
- All embedding vectors are 8-dimensional for these sample files
- The alternative format demonstrates how EmbeddingBuddy's field mapping handles different field names
- For production use, you may want larger embedding dimensions (e.g., 384, 768, 1536)
- The `knn_vector` field type in OpenSearch (called `dense_vector` in Elasticsearch) enables vector similarity search; the mappings above use the OpenSearch variant
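For reference, the `cosinesimil` space type used in the mappings above ranks neighbors by cosine similarity, which can be computed directly:

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity, the metric behind the 'cosinesimil' space type."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
```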

View File

@@ -0,0 +1,2 @@
<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, 0.2], "text": "Binary junk at start"}
{"id": "doc_002", "embedding": [0.5, 0.1, -0.2, 0.8], "text": "Normal line"}<7D><><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>

View File

@@ -0,0 +1,6 @@
{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, 0.2], "text": "First line"}
{"id": "doc_002", "embedding": [0.5, 0.1, -0.2, 0.8], "text": "After empty line"}
{"id": "doc_003", "embedding": [0.3, 0.4, 0.1, -0.1], "text": "After multiple empty lines"}

View File

@@ -0,0 +1,4 @@
{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, 0.2], "text": "4D embedding"}
{"id": "doc_002", "embedding": [0.5, 0.1, -0.2], "text": "3D embedding"}
{"id": "doc_003", "embedding": [0.3, 0.4, 0.1, -0.1, 0.8], "text": "5D embedding"}
{"id": "doc_004", "embedding": [0.2, 0.1], "text": "2D embedding"}

View File

@@ -0,0 +1,8 @@
{"id": "doc_001", "embedding": "not_an_array", "text": "Embedding as string"}
{"id": "doc_002", "embedding": [0.1, "text", 0.7, 0.2], "text": "Mixed types in embedding"}
{"id": "doc_003", "embedding": [], "text": "Empty embedding array"}
{"id": "doc_004", "embedding": [0.1], "text": "Single dimension embedding"}
{"id": "doc_005", "embedding": null, "text": "Null embedding"}
{"id": "doc_006", "embedding": [0.1, 0.2, null, 0.4], "text": "Null value in embedding"}
{"id": "doc_007", "embedding": [0.1, 0.2, "NaN", 0.4], "text": "String NaN in embedding"}
{"id": "doc_008", "embedding": [0.1, 0.2, Infinity, 0.4], "text": "Infinity in embedding"}

View File

@@ -0,0 +1,5 @@
{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, "text": "Valid line"}
{"id": "doc_002", "embedding": [0.5, 0.1, -0.2, 0.8], "text": "Missing closing brace"
{"id": "doc_003" "embedding": [0.3, 0.4, 0.1, -0.1], "text": "Missing colon after id"}
{id: "doc_004", "embedding": [0.2, 0.1, 0.3, 0.4], "text": "Unquoted key"}
{"id": "doc_005", "embedding": [0.1, 0.2, 0.3, 0.4], "text": "Valid line again"}

View File

@@ -0,0 +1,3 @@
{"id": "doc_001", "text": "Sample text without embedding field", "category": "test"}
{"id": "doc_002", "text": "Another text without embedding", "category": "test"}
{"id": "doc_003", "text": "Third text missing embedding", "category": "test"}

View File

@@ -0,0 +1,3 @@
{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, 0.2], "category": "test"}
{"id": "doc_002", "embedding": [0.5, 0.1, -0.2, 0.8], "category": "test"}
{"id": "doc_003", "embedding": [0.3, 0.4, 0.1, -0.1], "category": "test"}

View File

@@ -0,0 +1,4 @@
[
{"id": "doc_001", "embedding": [0.1, -0.3, 0.7, 0.2], "text": "Regular JSON array"},
{"id": "doc_002", "embedding": [0.5, 0.1, -0.2, 0.8], "text": "Instead of NDJSON"}
]

View File

@@ -0,0 +1,40 @@
{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3], "text": "Machine learning algorithms are transforming healthcare by enabling predictive analytics and personalized medicine.", "category": "technology", "subcategory": "healthcare", "tags": ["ai", "medicine", "prediction"]}
{"index": {"_index": "embeddings", "_id": "doc_002"}}
{"id": "doc_002", "embedding": [0.1, 0.4, -0.2, 0.6, 0.3, -0.4, 0.8, 0.2], "text": "Climate change poses significant challenges to global food security and agricultural sustainability.", "category": "environment", "subcategory": "agriculture", "tags": ["climate", "food", "sustainability"]}
{"index": {"_index": "embeddings", "_id": "doc_003"}}
{"id": "doc_003", "embedding": [-0.3, 0.7, 0.1, -0.2, 0.9, 0.4, -0.1, 0.5], "text": "The rise of electric vehicles is reshaping the automotive industry and urban transportation systems.", "category": "technology", "subcategory": "automotive", "tags": ["electric", "transport", "urban"]}
{"index": {"_index": "embeddings", "_id": "doc_004"}}
{"id": "doc_004", "embedding": [0.5, -0.6, 0.3, 0.8, -0.2, 0.1, 0.7, -0.4], "text": "Renewable energy sources like solar and wind are becoming increasingly cost-competitive with fossil fuels.", "category": "environment", "subcategory": "energy", "tags": ["renewable", "solar", "wind"]}
{"index": {"_index": "embeddings", "_id": "doc_005"}}
{"id": "doc_005", "embedding": [0.8, 0.2, -0.5, 0.1, 0.6, -0.3, 0.4, 0.9], "text": "Financial markets are experiencing volatility due to geopolitical tensions and inflation concerns.", "category": "finance", "subcategory": "markets", "tags": ["volatility", "inflation", "geopolitics"]}
{"index": {"_index": "embeddings", "_id": "doc_006"}}
{"id": "doc_006", "embedding": [-0.1, 0.5, 0.7, -0.4, 0.2, 0.8, -0.6, 0.3], "text": "Quantum computing research is advancing rapidly with potential applications in cryptography and drug discovery.", "category": "technology", "subcategory": "research", "tags": ["quantum", "cryptography", "research"]}
{"index": {"_index": "embeddings", "_id": "doc_007"}}
{"id": "doc_007", "embedding": [0.4, -0.3, 0.6, 0.7, -0.8, 0.2, 0.5, -0.1], "text": "Ocean pollution from plastic waste is threatening marine ecosystems and biodiversity worldwide.", "category": "environment", "subcategory": "marine", "tags": ["pollution", "plastic", "marine"]}
{"index": {"_index": "embeddings", "_id": "doc_008"}}
{"id": "doc_008", "embedding": [0.3, 0.8, -0.2, 0.5, 0.1, -0.7, 0.6, 0.4], "text": "Artificial intelligence is revolutionizing customer service through chatbots and automated support systems.", "category": "technology", "subcategory": "customer_service", "tags": ["ai", "chatbots", "automation"]}
{"index": {"_index": "embeddings", "_id": "doc_009"}}
{"id": "doc_009", "embedding": [-0.5, 0.3, 0.9, -0.1, 0.7, 0.4, -0.2, 0.8], "text": "Global supply chains are being redesigned for resilience after pandemic-related disruptions.", "category": "business", "subcategory": "logistics", "tags": ["supply_chain", "pandemic", "resilience"]}
{"index": {"_index": "embeddings", "_id": "doc_010"}}
{"id": "doc_010", "embedding": [0.7, -0.4, 0.2, 0.9, -0.3, 0.6, 0.1, -0.8], "text": "Space exploration missions are expanding our understanding of the solar system and potential for life.", "category": "science", "subcategory": "space", "tags": ["space", "exploration", "life"]}
{"index": {"_index": "embeddings", "_id": "doc_011"}}
{"id": "doc_011", "embedding": [-0.2, 0.6, 0.4, -0.7, 0.8, 0.3, -0.5, 0.1], "text": "Cryptocurrency adoption is growing among institutional investors despite regulatory uncertainties.", "category": "finance", "subcategory": "crypto", "tags": ["cryptocurrency", "institutional", "regulation"]}
{"index": {"_index": "embeddings", "_id": "doc_012"}}
{"id": "doc_012", "embedding": [0.6, 0.1, -0.8, 0.4, 0.5, -0.2, 0.9, -0.3], "text": "Remote work technologies are transforming traditional office environments and work-life balance.", "category": "technology", "subcategory": "workplace", "tags": ["remote", "work", "balance"]}
{"index": {"_index": "embeddings", "_id": "doc_013"}}
{"id": "doc_013", "embedding": [0.1, -0.7, 0.5, 0.8, -0.4, 0.3, 0.2, 0.6], "text": "Gene therapy breakthroughs are offering new hope for treating previously incurable genetic diseases.", "category": "science", "subcategory": "medicine", "tags": ["gene_therapy", "genetics", "medicine"]}
{"index": {"_index": "embeddings", "_id": "doc_014"}}
{"id": "doc_014", "embedding": [-0.4, 0.2, 0.7, -0.1, 0.9, -0.6, 0.3, 0.5], "text": "Urban planning is evolving to create more sustainable and livable cities for growing populations.", "category": "environment", "subcategory": "urban", "tags": ["urban_planning", "sustainability", "cities"]}
{"index": {"_index": "embeddings", "_id": "doc_015"}}
{"id": "doc_015", "embedding": [0.9, -0.1, 0.3, 0.6, -0.5, 0.8, -0.2, 0.4], "text": "Social media platforms are implementing new policies to combat misinformation and protect user privacy.", "category": "technology", "subcategory": "social_media", "tags": ["social_media", "misinformation", "privacy"]}
{"index": {"_index": "embeddings", "_id": "doc_016"}}
{"id": "doc_016", "embedding": [-0.3, 0.8, -0.1, 0.4, 0.7, -0.5, 0.6, -0.9], "text": "Educational technology is personalizing learning experiences and improving student outcomes.", "category": "education", "subcategory": "technology", "tags": ["education", "personalization", "technology"]}
{"index": {"_index": "embeddings", "_id": "doc_017"}}
{"id": "doc_017", "embedding": [0.5, 0.3, -0.6, 0.2, 0.8, 0.1, -0.4, 0.7], "text": "Biodiversity conservation efforts are critical for maintaining ecosystem balance and preventing species extinction.", "category": "environment", "subcategory": "conservation", "tags": ["biodiversity", "conservation", "extinction"]}
{"index": {"_index": "embeddings", "_id": "doc_018"}}
{"id": "doc_018", "embedding": [0.2, -0.8, 0.4, 0.7, -0.1, 0.5, 0.9, -0.3], "text": "Healthcare systems are adopting telemedicine to improve access and reduce costs for patients.", "category": "technology", "subcategory": "healthcare", "tags": ["telemedicine", "healthcare", "access"]}
{"index": {"_index": "embeddings", "_id": "doc_019"}}
{"id": "doc_019", "embedding": [-0.7, 0.4, 0.8, -0.2, 0.3, 0.6, -0.1, 0.9], "text": "Autonomous vehicles are being tested extensively with promises of safer and more efficient transportation.", "category": "technology", "subcategory": "automotive", "tags": ["autonomous", "safety", "efficiency"]}
{"index": {"_index": "embeddings", "_id": "doc_020"}}
{"id": "doc_020", "embedding": [0.4, 0.7, -0.3, 0.9, -0.6, 0.2, 0.5, -0.1], "text": "Mental health awareness is increasing with new approaches to therapy and workplace wellness programs.", "category": "health", "subcategory": "mental", "tags": ["mental_health", "therapy", "wellness"]}

View File

@@ -0,0 +1,20 @@
{"index": {"_index": "prompts", "_id": "prompt_001"}}
{"id": "prompt_001", "embedding": [0.15, -0.28, 0.65, 0.42, -0.11, 0.33, 0.78, -0.52], "text": "Find articles about machine learning applications", "category": "search", "subcategory": "technology", "tags": ["AI", "research"]}
{"index": {"_index": "prompts", "_id": "prompt_002"}}
{"id": "prompt_002", "embedding": [0.72, 0.18, -0.35, 0.51, 0.09, -0.44, 0.27, 0.63], "text": "Show me product reviews for smartphones", "category": "search", "subcategory": "product", "tags": ["mobile", "reviews"]}
{"index": {"_index": "prompts", "_id": "prompt_003"}}
{"id": "prompt_003", "embedding": [-0.21, 0.59, 0.34, -0.67, 0.45, 0.12, -0.38, 0.76], "text": "What are the latest political developments?", "category": "search", "subcategory": "news", "tags": ["politics", "current events"]}
{"index": {"_index": "prompts", "_id": "prompt_004"}}
{"id": "prompt_004", "embedding": [0.48, -0.15, 0.72, 0.31, -0.58, 0.24, 0.67, -0.39], "text": "Summarize recent tech industry trends", "category": "analysis", "subcategory": "technology", "tags": ["tech", "trends", "summary"]}
{"index": {"_index": "prompts", "_id": "prompt_005"}}
{"id": "prompt_005", "embedding": [-0.33, 0.47, -0.62, 0.28, 0.71, -0.18, 0.54, 0.35], "text": "Compare different smartphone models", "category": "analysis", "subcategory": "product", "tags": ["comparison", "mobile", "evaluation"]}
{"index": {"_index": "prompts", "_id": "prompt_006"}}
{"id": "prompt_006", "embedding": [0.64, 0.21, 0.39, -0.45, 0.13, 0.58, -0.27, 0.74], "text": "Analyze voter sentiment on recent policies", "category": "analysis", "subcategory": "politics", "tags": ["sentiment", "politics", "analysis"]}
{"index": {"_index": "prompts", "_id": "prompt_007"}}
{"id": "prompt_007", "embedding": [0.29, -0.43, 0.56, 0.68, -0.22, 0.37, 0.14, -0.61], "text": "Generate a summary of machine learning research", "category": "generation", "subcategory": "technology", "tags": ["AI", "research", "summary"]}
{"index": {"_index": "prompts", "_id": "prompt_008"}}
{"id": "prompt_008", "embedding": [-0.17, 0.52, -0.48, 0.36, 0.74, -0.29, 0.61, 0.18], "text": "Create a product recommendation report", "category": "generation", "subcategory": "product", "tags": ["recommendation", "report", "analysis"]}
{"index": {"_index": "prompts", "_id": "prompt_009"}}
{"id": "prompt_009", "embedding": [0.55, 0.08, 0.41, -0.37, 0.26, 0.69, -0.14, 0.58], "text": "Write a news brief on election updates", "category": "generation", "subcategory": "news", "tags": ["election", "news", "brief"]}
{"index": {"_index": "prompts", "_id": "prompt_010"}}
{"id": "prompt_010", "embedding": [0.23, -0.59, 0.47, 0.61, -0.35, 0.18, 0.72, -0.26], "text": "Explain how neural networks work", "category": "explanation", "subcategory": "technology", "tags": ["AI", "education", "neural networks"]}

View File

@@ -1,6 +1,6 @@
[project]
name = "embeddingbuddy"
version = "0.3.0"
description = "A Python Dash application for interactive exploration and visualization of embedding vectors through dimensionality reduction techniques."
readme = "README.md"
requires-python = ">=3.11"
@@ -14,7 +14,29 @@ dependencies = [
    "umap-learn>=0.5.8",
    "numba>=0.56.4",
    "openTSNE>=1.0.0",
    "mypy>=1.17.1",
    "opensearch-py>=3.0.0",
]

[project.optional-dependencies]
test = [
    "pytest>=8.4.1",
    "pytest-cov>=4.1.0",
]
lint = [
    "ruff>=0.1.0",
    "mypy>=1.5.0",
]
security = [
    "bandit[toml]>=1.7.5",
    "safety>=2.3.0",
    "pip-audit>=2.6.0",
]
dev = [
    "embeddingbuddy[test,lint,security]",
]
all = [
    "embeddingbuddy[test,lint,security]",
]

[build-system]

View File

@@ -1,3 +1,3 @@
"""EmbeddingBuddy - Interactive exploration and visualization of embedding vectors."""

__version__ = "0.1.0"

View File

@@ -8,32 +8,32 @@ from .ui.callbacks.interactions import InteractionCallbacks
def create_app():
    app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])

    # Allow callbacks to components that are dynamically created in tabs
    app.config.suppress_callback_exceptions = True

    layout_manager = AppLayout()
    app.layout = layout_manager.create_layout()

    DataProcessingCallbacks()
    VisualizationCallbacks()
    InteractionCallbacks()

    return app


def run_app(app=None, debug=None, host=None, port=None):
    if app is None:
        app = create_app()

    app.run(
        debug=debug if debug is not None else AppSettings.DEBUG,
        host=host if host is not None else AppSettings.HOST,
        port=port if port is not None else AppSettings.PORT,
    )


if __name__ == "__main__":
    app = create_app()
    run_app(app)

View File

@@ -3,105 +3,106 @@ import os
class AppSettings:
    # UI Configuration
    UPLOAD_STYLE = {
        "width": "100%",
        "height": "60px",
        "lineHeight": "60px",
        "borderWidth": "1px",
        "borderStyle": "dashed",
        "borderRadius": "5px",
        "textAlign": "center",
        "margin-bottom": "20px",
    }

    PROMPTS_UPLOAD_STYLE = {**UPLOAD_STYLE, "borderColor": "#28a745"}

    PLOT_CONFIG = {"responsive": True, "displayModeBar": True}

    PLOT_STYLE = {"height": "85vh", "width": "100%"}

    PLOT_LAYOUT_CONFIG = {
        "height": None,
        "autosize": True,
        "margin": dict(l=0, r=0, t=50, b=0),
    }

    # Dimensionality Reduction Settings
    DEFAULT_N_COMPONENTS_3D = 3
    DEFAULT_N_COMPONENTS_2D = 2
    DEFAULT_RANDOM_STATE = 42

    # Available Methods
    REDUCTION_METHODS = [
        {"label": "PCA", "value": "pca"},
        {"label": "t-SNE", "value": "tsne"},
        {"label": "UMAP", "value": "umap"},
    ]

    COLOR_OPTIONS = [
        {"label": "Category", "value": "category"},
        {"label": "Subcategory", "value": "subcategory"},
        {"label": "Tags", "value": "tags"},
    ]

    DIMENSION_OPTIONS = [{"label": "2D", "value": "2d"}, {"label": "3D", "value": "3d"}]

    # Default Values
    DEFAULT_METHOD = "pca"
    DEFAULT_COLOR_BY = "category"
    DEFAULT_DIMENSIONS = "3d"
    DEFAULT_SHOW_PROMPTS = ["show"]

    # Plot Marker Settings
    DOCUMENT_MARKER_SIZE_2D = 8
    DOCUMENT_MARKER_SIZE_3D = 5
    PROMPT_MARKER_SIZE_2D = 10
    PROMPT_MARKER_SIZE_3D = 6
    DOCUMENT_MARKER_SYMBOL = "circle"
    PROMPT_MARKER_SYMBOL = "diamond"
    DOCUMENT_OPACITY = 1.0
    PROMPT_OPACITY = 0.8

    # Text Processing
    TEXT_PREVIEW_LENGTH = 100

    # App Configuration
    DEBUG = os.getenv("EMBEDDINGBUDDY_DEBUG", "True").lower() == "true"
    HOST = os.getenv("EMBEDDINGBUDDY_HOST", "127.0.0.1")
    PORT = int(os.getenv("EMBEDDINGBUDDY_PORT", "8050"))

    # OpenSearch Configuration
    OPENSEARCH_DEFAULT_SIZE = 100
    OPENSEARCH_SAMPLE_SIZE = 5
    OPENSEARCH_CONNECTION_TIMEOUT = 30
    OPENSEARCH_VERIFY_CERTS = True

    # Bootstrap Theme
    EXTERNAL_STYLESHEETS = [
        "https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css"
    ]

    @classmethod
    def get_plot_marker_config(
        cls, dimensions: str, is_prompt: bool = False
    ) -> Dict[str, Any]:
        if is_prompt:
            size = (
                cls.PROMPT_MARKER_SIZE_3D
                if dimensions == "3d"
                else cls.PROMPT_MARKER_SIZE_2D
            )
            symbol = cls.PROMPT_MARKER_SYMBOL
            opacity = cls.PROMPT_OPACITY
        else:
            size = (
                cls.DOCUMENT_MARKER_SIZE_3D
                if dimensions == "3d"
                else cls.DOCUMENT_MARKER_SIZE_2D
            )
            symbol = cls.DOCUMENT_MARKER_SYMBOL
            opacity = cls.DOCUMENT_OPACITY

        return {"size": size, "symbol": symbol, "opacity": opacity}

View File

@@ -1,39 +1,72 @@
import json
import uuid
import base64
from typing import List

from ..models.schemas import Document


class NDJSONParser:
    @staticmethod
    def parse_upload_contents(contents: str) -> List[Document]:
        content_type, content_string = contents.split(",")
        decoded = base64.b64decode(content_string)
        text_content = decoded.decode("utf-8")
        return NDJSONParser.parse_text(text_content)

    @staticmethod
    def parse_text(text_content: str) -> List[Document]:
        documents = []
        for line_num, line in enumerate(text_content.strip().split("\n"), 1):
            if line.strip():
                try:
                    doc_dict = json.loads(line)
                    doc = NDJSONParser._dict_to_document(doc_dict)
                    documents.append(doc)
                except json.JSONDecodeError as e:
                    raise json.JSONDecodeError(
                        f"Invalid JSON on line {line_num}: {e.msg}", e.doc, e.pos
                    )
                except KeyError as e:
                    raise KeyError(f"Missing required field {e} on line {line_num}")
                except (TypeError, ValueError) as e:
                    raise ValueError(
                        f"Invalid data format on line {line_num}: {str(e)}"
                    )
        return documents

    @staticmethod
    def _dict_to_document(doc_dict: dict) -> Document:
        if "id" not in doc_dict:
            doc_dict["id"] = str(uuid.uuid4())

        # Validate required fields
        if "text" not in doc_dict:
            raise KeyError("'text'")
        if "embedding" not in doc_dict:
            raise KeyError("'embedding'")

        # Validate embedding format
        embedding = doc_dict["embedding"]
        if not isinstance(embedding, list):
            raise ValueError(
                f"Embedding must be a list, got {type(embedding).__name__}"
            )
        if not embedding:
            raise ValueError("Embedding cannot be empty")

        # Check that all embedding values are numbers
        for i, val in enumerate(embedding):
            if not isinstance(val, (int, float)) or val != val:  # NaN check
                raise ValueError(
                    f"Embedding contains invalid value at index {i}: {val}"
                )

        return Document(
            id=doc_dict["id"],
            text=doc_dict["text"],
            embedding=embedding,
            category=doc_dict.get("category"),
            subcategory=doc_dict.get("subcategory"),
            tags=doc_dict.get("tags"),
        )
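The embedding checks above can be exercised in isolation; a standalone sketch that mirrors (rather than imports) the parser's `_dict_to_document` validation:

```python
import json


def validate_embedding(embedding) -> None:
    """Mirror of the parser's checks: list, non-empty, numeric, no NaN."""
    if not isinstance(embedding, list):
        raise ValueError(f"Embedding must be a list, got {type(embedding).__name__}")
    if not embedding:
        raise ValueError("Embedding cannot be empty")
    for i, val in enumerate(embedding):
        if not isinstance(val, (int, float)) or val != val:  # NaN != NaN
            raise ValueError(f"Embedding contains invalid value at index {i}: {val}")


for raw in ['{"embedding": [0.1, 0.2]}', '{"embedding": "not_an_array"}']:
    try:
        validate_embedding(json.loads(raw)["embedding"])
        print("ok")
    except ValueError as e:
        print(f"rejected: {e}")
```

The `val != val` comparison is the standard NaN test, which catches `float("nan")` values that `isinstance` alone would accept.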

View File

@@ -1,22 +1,24 @@
import numpy as np
from typing import List, Optional, Tuple

from ..models.schemas import Document, ProcessedData
from ..models.field_mapper import FieldMapper
from .parser import NDJSONParser


class DataProcessor:
    def __init__(self):
        self.parser = NDJSONParser()

    def process_upload(
        self, contents: str, filename: Optional[str] = None
    ) -> ProcessedData:
        try:
            documents = self.parser.parse_upload_contents(contents)
            embeddings = self._extract_embeddings(documents)
            return ProcessedData(documents=documents, embeddings=embeddings)
        except Exception as e:
            return ProcessedData(documents=[], embeddings=np.array([]), error=str(e))

    def process_text(self, text_content: str) -> ProcessedData:
        try:
            documents = self.parser.parse_text(text_content)
@@ -24,31 +26,71 @@ class DataProcessor:
            return ProcessedData(documents=documents, embeddings=embeddings)
        except Exception as e:
            return ProcessedData(documents=[], embeddings=np.array([]), error=str(e))

    def process_opensearch_data(
        self, raw_documents: List[dict], field_mapping
    ) -> ProcessedData:
        """Process raw OpenSearch documents using field mapping."""
        try:
            # Transform documents using field mapping
            transformed_docs = FieldMapper.transform_documents(
                raw_documents, field_mapping
            )

            # Parse transformed documents
            documents = []
            for doc_dict in transformed_docs:
                try:
                    # Ensure required fields are present with defaults if needed
                    if "id" not in doc_dict or not doc_dict["id"]:
                        doc_dict["id"] = f"doc_{len(documents)}"

                    doc = Document(**doc_dict)
                    documents.append(doc)
                except Exception:
                    continue  # Skip invalid documents

            if not documents:
                return ProcessedData(
                    documents=[],
                    embeddings=np.array([]),
                    error="No valid documents after transformation",
                )

            embeddings = self._extract_embeddings(documents)
            return ProcessedData(documents=documents, embeddings=embeddings)

        except Exception as e:
            return ProcessedData(documents=[], embeddings=np.array([]), error=str(e))

    def _extract_embeddings(self, documents: List[Document]) -> np.ndarray:
        if not documents:
            return np.array([])
        return np.array([doc.embedding for doc in documents])

    def combine_data(
        self, doc_data: ProcessedData, prompt_data: Optional[ProcessedData] = None
    ) -> Tuple[np.ndarray, List[Document], Optional[List[Document]]]:
        if not doc_data or doc_data.error:
            raise ValueError("Invalid document data")

        all_embeddings = doc_data.embeddings
        documents = doc_data.documents
        prompts = None

        if prompt_data and not prompt_data.error and prompt_data.documents:
            all_embeddings = np.vstack([doc_data.embeddings, prompt_data.embeddings])
            prompts = prompt_data.documents

        return all_embeddings, documents, prompts

    def split_reduced_data(
        self, reduced_embeddings: np.ndarray, n_documents: int, n_prompts: int = 0
    ) -> Tuple[np.ndarray, Optional[np.ndarray]]:
        doc_reduced = reduced_embeddings[:n_documents]
        prompt_reduced = None

        if n_prompts > 0:
            prompt_reduced = reduced_embeddings[n_documents : n_documents + n_prompts]

        return doc_reduced, prompt_reduced
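The combine/split bookkeeping above reduces to a `vstack` followed by row slicing; a minimal numpy sketch of the same round trip:

```python
import numpy as np

docs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # 3 document embeddings
prompts = np.array([[0.7, 0.8]])                        # 1 prompt embedding

# combine_data: stack both sets so they go through reduction together
combined = np.vstack([docs, prompts])

# split_reduced_data: slice the (reduced) matrix back apart by row count
n_docs, n_prompts = len(docs), len(prompts)
doc_part = combined[:n_docs]
prompt_part = combined[n_docs : n_docs + n_prompts]

print(doc_part.shape, prompt_part.shape)  # (3, 2) (1, 2)
```

Reducing documents and prompts in one call keeps them in the same projected space, which is what makes plotting them on a shared set of axes meaningful.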

View File

@@ -0,0 +1,189 @@
from typing import Dict, List, Optional, Any, Tuple
import logging

from opensearchpy import OpenSearch
from opensearchpy.exceptions import OpenSearchException

logger = logging.getLogger(__name__)


class OpenSearchClient:
    def __init__(self):
        self.client: Optional[OpenSearch] = None
        self.connection_info: Optional[Dict[str, Any]] = None

    def connect(
        self,
        url: str,
        username: Optional[str] = None,
        password: Optional[str] = None,
        api_key: Optional[str] = None,
        verify_certs: bool = True,
    ) -> Tuple[bool, str]:
        """
        Connect to an OpenSearch instance.

        Returns:
            Tuple of (success: bool, message: str)
        """
        try:
            # Ensure the URL carries a scheme; default to HTTPS
            if url.startswith("http://") or url.startswith("https://"):
                host = url
            else:
                host = f"https://{url}"
            # Build auth configuration
            auth_config = {}
            if username and password:
                auth_config["http_auth"] = (username, password)
            elif api_key:
                auth_config["api_key"] = api_key
            # Create client
            self.client = OpenSearch([host], verify_certs=verify_certs, **auth_config)
            # Test connection
            info = self.client.info()
            self.connection_info = {
                "url": host,
                "cluster_name": info.get("cluster_name", "Unknown"),
                "version": info.get("version", {}).get("number", "Unknown"),
            }
            return (
                True,
                f"Connected to {info.get('cluster_name', 'OpenSearch cluster')}",
            )
        except OpenSearchException as e:
            logger.error(f"OpenSearch connection error: {e}")
            return False, f"Connection failed: {str(e)}"
        except Exception as e:
            logger.error(f"Unexpected error connecting to OpenSearch: {e}")
            return False, f"Unexpected error: {str(e)}"

    def get_index_mapping(self, index_name: str) -> Tuple[bool, Optional[Dict], str]:
        """
        Get the mapping for a specific index.

        Returns:
            Tuple of (success: bool, mapping: Dict or None, message: str)
        """
        if not self.client:
            return False, None, "Not connected to OpenSearch"
        try:
            mapping = self.client.indices.get_mapping(index=index_name)
            return True, mapping, "Mapping retrieved successfully"
        except OpenSearchException as e:
            logger.error(f"Error getting mapping for index {index_name}: {e}")
            return False, None, f"Failed to get mapping: {str(e)}"

    def analyze_fields(self, index_name: str) -> Tuple[bool, Optional[Dict], str]:
        """
        Analyze index fields to detect potential embedding and text fields.

        Returns:
            Tuple of (success: bool, analysis: Dict or None, message: str)
        """
        success, mapping, message = self.get_index_mapping(index_name)
        if not success:
            return False, None, message
        try:
            # Extract field information from mapping
            index_mapping = mapping[index_name]["mappings"]["properties"]
            analysis = {
                "vector_fields": [],
                "text_fields": [],
                "keyword_fields": [],
                "numeric_fields": [],
                "all_fields": [],
            }
            for field_name, field_info in index_mapping.items():
                field_type = field_info.get("type", "unknown")
                analysis["all_fields"].append(field_name)
                if field_type == "dense_vector":
                    analysis["vector_fields"].append(
                        {
                            "name": field_name,
                            "dimension": field_info.get("dimension", "unknown"),
                        }
                    )
                elif field_type == "text":
                    analysis["text_fields"].append(field_name)
                elif field_type == "keyword":
                    analysis["keyword_fields"].append(field_name)
                elif field_type in ["integer", "long", "float", "double"]:
                    analysis["numeric_fields"].append(field_name)
            return True, analysis, "Field analysis completed"
        except Exception as e:
            logger.error(f"Error analyzing fields: {e}")
            return False, None, f"Field analysis failed: {str(e)}"

    def fetch_sample_data(
        self, index_name: str, size: int = 5
    ) -> Tuple[bool, List[Dict], str]:
        """
        Fetch sample documents from the index.

        Returns:
            Tuple of (success: bool, documents: List[Dict], message: str)
        """
        if not self.client:
            return False, [], "Not connected to OpenSearch"
        try:
            response = self.client.search(
                index=index_name, body={"query": {"match_all": {}}, "size": size}
            )
            documents = [hit["_source"] for hit in response["hits"]["hits"]]
            return True, documents, f"Retrieved {len(documents)} sample documents"
        except OpenSearchException as e:
            logger.error(f"Error fetching sample data: {e}")
            return False, [], f"Failed to fetch sample data: {str(e)}"

    def fetch_data(
        self, index_name: str, size: int = 100
    ) -> Tuple[bool, List[Dict], str]:
        """
        Fetch documents from the index.

        Returns:
            Tuple of (success: bool, documents: List[Dict], message: str)
        """
        if not self.client:
            return False, [], "Not connected to OpenSearch"
        try:
            response = self.client.search(
                index=index_name, body={"query": {"match_all": {}}, "size": size}
            )
            documents = [hit["_source"] for hit in response["hits"]["hits"]]
            total_hits = response["hits"]["total"]["value"]
            message = f"Retrieved {len(documents)} documents from {total_hits} total"
            return True, documents, message
        except OpenSearchException as e:
            logger.error(f"Error fetching data: {e}")
            return False, [], f"Failed to fetch data: {str(e)}"

    def disconnect(self):
        """Disconnect from OpenSearch."""
        if self.client:
            self.client = None
            self.connection_info = None

    def is_connected(self) -> bool:
        """Check if connected to OpenSearch."""
        return self.client is not None
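The only URL handling `connect()` performs is defaulting to HTTPS when the caller passes a bare host; host and port are left untouched inside the string. That logic, isolated as a standalone helper (the name `normalize_host` is mine, not part of the module):

```python
def normalize_host(url: str) -> str:
    """Return the URL unchanged if it has a scheme, else assume HTTPS."""
    if url.startswith("http://") or url.startswith("https://"):
        return url
    return f"https://{url}"


print(normalize_host("localhost:9200"))  # https://localhost:9200
```

Note that an explicit `http://` is respected, so plaintext local clusters still work; only scheme-less input is upgraded.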


@@ -0,0 +1,254 @@
from dataclasses import dataclass
from typing import Dict, List, Optional, Any
import logging

logger = logging.getLogger(__name__)


@dataclass
class FieldMapping:
    """Configuration for mapping OpenSearch fields to standard format."""

    embedding_field: str
    text_field: str
    id_field: Optional[str] = None
    category_field: Optional[str] = None
    subcategory_field: Optional[str] = None
    tags_field: Optional[str] = None


class FieldMapper:
    """Handles field mapping and data transformation from OpenSearch to standard format."""

    @staticmethod
    def suggest_mappings(field_analysis: Dict) -> Dict[str, List[str]]:
        """
        Suggest field mappings based on field analysis.

        Each dropdown will show ALL available fields, but ordered by relevance
        with the most likely candidates first.

        Args:
            field_analysis: Analysis results from OpenSearchClient.analyze_fields

        Returns:
            Dictionary with suggested fields for each mapping (ordered by relevance)
        """
        all_fields = field_analysis.get("all_fields", [])
        vector_fields = [vf["name"] for vf in field_analysis.get("vector_fields", [])]
        text_fields = field_analysis.get("text_fields", [])
        keyword_fields = field_analysis.get("keyword_fields", [])

        # Helper function to create ordered suggestions
        def create_ordered_suggestions(primary_candidates, all_available_fields):
            # Start with primary candidates, then add all other fields
            ordered = []
            # Add primary candidates first
            for field in primary_candidates:
                if field in all_available_fields and field not in ordered:
                    ordered.append(field)
            # Add remaining fields
            for field in all_available_fields:
                if field not in ordered:
                    ordered.append(field)
            return ordered

        suggestions = {}
        # Embedding field suggestions (vector fields first, then name-based candidates, then all fields)
        embedding_candidates = vector_fields.copy()
        # Add fields that likely contain embeddings based on name
        embedding_name_candidates = [
            f
            for f in all_fields
            if any(
                keyword in f.lower()
                for keyword in ["embedding", "embeddings", "vector", "vectors", "embed"]
            )
        ]
        # Add name-based candidates that aren't already in vector_fields
        for candidate in embedding_name_candidates:
            if candidate not in embedding_candidates:
                embedding_candidates.append(candidate)
        suggestions["embedding"] = create_ordered_suggestions(
            embedding_candidates, all_fields
        )
        # Text field suggestions (text fields first, then all fields)
        text_candidates = text_fields.copy()
        suggestions["text"] = create_ordered_suggestions(text_candidates, all_fields)
        # ID field suggestions (ID-like fields first, then all fields)
        id_candidates = [
            f
            for f in keyword_fields
            if any(keyword in f.lower() for keyword in ["id", "_id", "doc", "document"])
        ]
        id_candidates.append("_id")  # _id is always available
        suggestions["id"] = create_ordered_suggestions(id_candidates, all_fields)
        # Category field suggestions (category-like fields first, then all fields)
        category_candidates = [
            f
            for f in keyword_fields
            if any(
                keyword in f.lower()
                for keyword in ["category", "class", "type", "label"]
            )
        ]
        suggestions["category"] = create_ordered_suggestions(
            category_candidates, all_fields
        )
        # Subcategory field suggestions (subcategory-like fields first, then all fields)
        subcategory_candidates = [
            f
            for f in keyword_fields
            if any(
                keyword in f.lower()
                for keyword in ["subcategory", "subclass", "subtype", "subtopic"]
            )
        ]
        suggestions["subcategory"] = create_ordered_suggestions(
            subcategory_candidates, all_fields
        )
        # Tags field suggestions (tag-like fields first, then all fields)
        tags_candidates = [
            f
            for f in keyword_fields
            if any(
                keyword in f.lower()
                for keyword in ["tag", "tags", "keyword", "keywords"]
            )
        ]
        suggestions["tags"] = create_ordered_suggestions(tags_candidates, all_fields)
        return suggestions

    @staticmethod
    def validate_mapping(
        mapping: FieldMapping, available_fields: List[str]
    ) -> List[str]:
        """
        Validate that the field mapping is correct.

        Returns:
            List of validation errors (empty if valid)
        """
        errors = []
        # Required fields validation
        if not mapping.embedding_field:
            errors.append("Embedding field is required")
        elif mapping.embedding_field not in available_fields:
            errors.append(
                f"Embedding field '{mapping.embedding_field}' not found in index"
            )
        if not mapping.text_field:
            errors.append("Text field is required")
        elif mapping.text_field not in available_fields:
            errors.append(f"Text field '{mapping.text_field}' not found in index")
        # Optional fields validation
        optional_fields = {
            "id_field": mapping.id_field,
            "category_field": mapping.category_field,
            "subcategory_field": mapping.subcategory_field,
            "tags_field": mapping.tags_field,
        }
        for field_name, field_value in optional_fields.items():
            if field_value and field_value not in available_fields:
                errors.append(
                    f"Field '{field_value}' for {field_name} not found in index"
                )
        return errors

    @staticmethod
    def transform_documents(
        documents: List[Dict[str, Any]], mapping: FieldMapping
    ) -> List[Dict[str, Any]]:
        """
        Transform OpenSearch documents to standard format using field mapping.

        Args:
            documents: Raw documents from OpenSearch
            mapping: Field mapping configuration

        Returns:
            List of transformed documents in standard format
        """
        transformed = []
        for doc in documents:
            try:
                # Build standard format document
                standard_doc = {}
                # Required fields
                if mapping.embedding_field in doc:
                    standard_doc["embedding"] = doc[mapping.embedding_field]
                else:
                    logger.warning(
                        f"Missing embedding field '{mapping.embedding_field}' in document"
                    )
                    continue
                if mapping.text_field in doc:
                    standard_doc["text"] = str(doc[mapping.text_field])
                else:
                    logger.warning(
                        f"Missing text field '{mapping.text_field}' in document"
                    )
                    continue
                # Optional fields
                if mapping.id_field and mapping.id_field in doc:
                    standard_doc["id"] = str(doc[mapping.id_field])
                if mapping.category_field and mapping.category_field in doc:
                    standard_doc["category"] = str(doc[mapping.category_field])
                if mapping.subcategory_field and mapping.subcategory_field in doc:
                    standard_doc["subcategory"] = str(doc[mapping.subcategory_field])
                if mapping.tags_field and mapping.tags_field in doc:
                    tags = doc[mapping.tags_field]
                    # Handle both string and list tags
                    if isinstance(tags, list):
                        standard_doc["tags"] = [str(tag) for tag in tags]
                    else:
                        standard_doc["tags"] = [str(tags)]
                transformed.append(standard_doc)
            except Exception as e:
                logger.error(f"Error transforming document: {e}")
                continue
        logger.info(f"Transformed {len(transformed)} documents out of {len(documents)}")
        return transformed

    @staticmethod
    def create_mapping_from_dict(mapping_dict: Dict[str, str]) -> FieldMapping:
        """
        Create a FieldMapping from a dictionary.

        Args:
            mapping_dict: Dictionary with field mappings

        Returns:
            FieldMapping instance
        """
        return FieldMapping(
            embedding_field=mapping_dict.get("embedding", ""),
            text_field=mapping_dict.get("text", ""),
            id_field=mapping_dict.get("id") or None,
            category_field=mapping_dict.get("category") or None,
            subcategory_field=mapping_dict.get("subcategory") or None,
            tags_field=mapping_dict.get("tags") or None,
        )
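At its core, `transform_documents` is a keyed rename with required-field filtering: copy the mapped source fields under standard names, and skip any document missing a required one. A self-contained sketch of that idea (hypothetical field names, no `FieldMapping` dataclass, optional fields omitted):

```python
def transform(docs, embedding_field, text_field):
    """Rename source fields to the standard keys; drop incomplete documents."""
    out = []
    for doc in docs:
        if embedding_field not in doc or text_field not in doc:
            continue  # mirrors the warn-and-skip behaviour above
        out.append({"embedding": doc[embedding_field], "text": str(doc[text_field])})
    return out


docs = [
    {"vec": [0.1, 0.2], "body": "hello"},
    {"body": "no embedding, dropped"},
]
result = transform(docs, "vec", "body")
```

Skipping rather than raising means one malformed document in an index never blocks the rest of the load.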


@@ -1,6 +1,5 @@
from abc import ABC, abstractmethod from abc import ABC, abstractmethod
import numpy as np import numpy as np
from typing import Optional, Tuple
from sklearn.decomposition import PCA from sklearn.decomposition import PCA
import umap import umap
from openTSNE import TSNE from openTSNE import TSNE
@@ -8,88 +7,89 @@ from .schemas import ReducedData
class DimensionalityReducer(ABC): class DimensionalityReducer(ABC):
def __init__(self, n_components: int = 3, random_state: int = 42): def __init__(self, n_components: int = 3, random_state: int = 42):
self.n_components = n_components self.n_components = n_components
self.random_state = random_state self.random_state = random_state
self._reducer = None self._reducer = None
@abstractmethod @abstractmethod
def fit_transform(self, embeddings: np.ndarray) -> ReducedData: def fit_transform(self, embeddings: np.ndarray) -> ReducedData:
pass pass
@abstractmethod @abstractmethod
def get_method_name(self) -> str: def get_method_name(self) -> str:
pass pass
class PCAReducer(DimensionalityReducer): class PCAReducer(DimensionalityReducer):
def fit_transform(self, embeddings: np.ndarray) -> ReducedData: def fit_transform(self, embeddings: np.ndarray) -> ReducedData:
self._reducer = PCA(n_components=self.n_components) self._reducer = PCA(n_components=self.n_components)
reduced = self._reducer.fit_transform(embeddings) reduced = self._reducer.fit_transform(embeddings)
variance_explained = self._reducer.explained_variance_ratio_ variance_explained = self._reducer.explained_variance_ratio_
return ReducedData( return ReducedData(
reduced_embeddings=reduced, reduced_embeddings=reduced,
variance_explained=variance_explained, variance_explained=variance_explained,
method=self.get_method_name(), method=self.get_method_name(),
n_components=self.n_components n_components=self.n_components,
) )
def get_method_name(self) -> str: def get_method_name(self) -> str:
return "PCA" return "PCA"
class TSNEReducer(DimensionalityReducer): class TSNEReducer(DimensionalityReducer):
def fit_transform(self, embeddings: np.ndarray) -> ReducedData: def fit_transform(self, embeddings: np.ndarray) -> ReducedData:
self._reducer = TSNE(n_components=self.n_components, random_state=self.random_state) self._reducer = TSNE(
n_components=self.n_components, random_state=self.random_state
)
reduced = self._reducer.fit(embeddings) reduced = self._reducer.fit(embeddings)
return ReducedData( return ReducedData(
reduced_embeddings=reduced, reduced_embeddings=reduced,
variance_explained=None, variance_explained=None,
method=self.get_method_name(), method=self.get_method_name(),
n_components=self.n_components n_components=self.n_components,
) )
def get_method_name(self) -> str: def get_method_name(self) -> str:
return "t-SNE" return "t-SNE"
class UMAPReducer(DimensionalityReducer): class UMAPReducer(DimensionalityReducer):
def fit_transform(self, embeddings: np.ndarray) -> ReducedData: def fit_transform(self, embeddings: np.ndarray) -> ReducedData:
self._reducer = umap.UMAP(n_components=self.n_components, random_state=self.random_state) self._reducer = umap.UMAP(
n_components=self.n_components, random_state=self.random_state
)
reduced = self._reducer.fit_transform(embeddings) reduced = self._reducer.fit_transform(embeddings)
return ReducedData( return ReducedData(
reduced_embeddings=reduced, reduced_embeddings=reduced,
variance_explained=None, variance_explained=None,
method=self.get_method_name(), method=self.get_method_name(),
n_components=self.n_components n_components=self.n_components,
) )
def get_method_name(self) -> str: def get_method_name(self) -> str:
return "UMAP" return "UMAP"
class ReducerFactory: class ReducerFactory:
@staticmethod @staticmethod
def create_reducer(method: str, n_components: int = 3, random_state: int = 42) -> DimensionalityReducer: def create_reducer(
method: str, n_components: int = 3, random_state: int = 42
) -> DimensionalityReducer:
method_lower = method.lower() method_lower = method.lower()
if method_lower == 'pca': if method_lower == "pca":
return PCAReducer(n_components=n_components, random_state=random_state) return PCAReducer(n_components=n_components, random_state=random_state)
elif method_lower == 'tsne': elif method_lower == "tsne":
return TSNEReducer(n_components=n_components, random_state=random_state) return TSNEReducer(n_components=n_components, random_state=random_state)
elif method_lower == 'umap': elif method_lower == "umap":
return UMAPReducer(n_components=n_components, random_state=random_state) return UMAPReducer(n_components=n_components, random_state=random_state)
else: else:
raise ValueError(f"Unknown reduction method: {method}") raise ValueError(f"Unknown reduction method: {method}")
@staticmethod @staticmethod
def get_available_methods() -> list: def get_available_methods() -> list:
return ['pca', 'tsne', 'umap'] return ["pca", "tsne", "umap"]
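Of the three reducers, only `PCAReducer` populates `variance_explained`; t-SNE and UMAP return `None` because they do not decompose variance. The ratio PCA exposes can be sketched directly with NumPy's SVD on centered data (an illustration of the quantity, not of scikit-learn's internals):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # 100 samples, 5-dim "embeddings"

Xc = X - X.mean(axis=0)                       # PCA operates on centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
variance_ratio = S**2 / np.sum(S**2)          # analogous to explained_variance_ratio_
reduced = Xc @ Vt[:3].T                       # project onto the top 3 components
```

The ratios sum to one and are sorted in decreasing order, which is why the first few components summarize how much structure a 3-D plot can actually preserve.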


@@ -1,4 +1,4 @@
from typing import List, Optional
from dataclasses import dataclass
import numpy as np
@@ -50,9 +50,11 @@ class PlotData:
    coordinates: np.ndarray
    prompts: Optional[List[Document]] = None
    prompt_coordinates: Optional[np.ndarray] = None

    def __post_init__(self):
        if not isinstance(self.coordinates, np.ndarray):
            self.coordinates = np.array(self.coordinates)
        if self.prompt_coordinates is not None and not isinstance(
            self.prompt_coordinates, np.ndarray
        ):
            self.prompt_coordinates = np.array(self.prompt_coordinates)
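The `__post_init__` coercion means callers can hand `PlotData` plain Python lists (for example, coordinates deserialized from a Dash store) and still get `ndarray` attributes. A minimal standalone equivalent (`Coords` is a stand-in name, not the project's class):

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Coords:
    coordinates: np.ndarray
    prompt_coordinates: Optional[np.ndarray] = None

    def __post_init__(self):
        # Coerce plain lists to arrays, as PlotData does
        if not isinstance(self.coordinates, np.ndarray):
            self.coordinates = np.array(self.coordinates)
        if self.prompt_coordinates is not None and not isinstance(
            self.prompt_coordinates, np.ndarray
        ):
            self.prompt_coordinates = np.array(self.prompt_coordinates)


p = Coords(coordinates=[[0.0, 1.0], [2.0, 3.0]])
```

The `None` default is left alone, so "no prompts loaded" stays distinguishable from "zero prompt coordinates".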


@@ -1,61 +1,523 @@
from dash import callback, Input, Output, State, no_update

from ...data.processor import DataProcessor
from ...data.sources.opensearch import OpenSearchClient
from ...models.field_mapper import FieldMapper
from ...config.settings import AppSettings


class DataProcessingCallbacks:
    def __init__(self):
        self.processor = DataProcessor()
        self.opensearch_client_data = OpenSearchClient()  # For data/documents
        self.opensearch_client_prompts = OpenSearchClient()  # For prompts
        self._register_callbacks()

    def _register_callbacks(self):
        @callback(
            [
                Output("processed-data", "data", allow_duplicate=True),
                Output("upload-error-alert", "children", allow_duplicate=True),
                Output("upload-error-alert", "is_open", allow_duplicate=True),
            ],
            Input("upload-data", "contents"),
            State("upload-data", "filename"),
            prevent_initial_call=True,
        )
        def process_uploaded_file(contents, filename):
            if contents is None:
                return None, "", False
            processed_data = self.processor.process_upload(contents, filename)
            if processed_data.error:
                error_message = self._format_error_message(
                    processed_data.error, filename
                )
                return (
                    {"error": processed_data.error},
                    error_message,
                    True,  # Show error alert
                )
            return (
                {
                    "documents": [
                        self._document_to_dict(doc) for doc in processed_data.documents
                    ],
                    "embeddings": processed_data.embeddings.tolist(),
                },
                "",
                False,  # Hide error alert
            )

        @callback(
            Output("processed-prompts", "data", allow_duplicate=True),
            Input("upload-prompts", "contents"),
            State("upload-prompts", "filename"),
            prevent_initial_call=True,
        )
        def process_uploaded_prompts(contents, filename):
            if contents is None:
                return None
            processed_data = self.processor.process_upload(contents, filename)
            if processed_data.error:
                return {"error": processed_data.error}
            return {
                "prompts": [
                    self._document_to_dict(doc) for doc in processed_data.documents
                ],
                "embeddings": processed_data.embeddings.tolist(),
            }
        # OpenSearch callbacks
        @callback(
            [
                Output("tab-content", "children"),
            ],
            [Input("data-source-tabs", "active_tab")],
            prevent_initial_call=False,
        )
        def render_tab_content(active_tab):
            from ...ui.components.datasource import DataSourceComponent

            datasource = DataSourceComponent()
            if active_tab == "opensearch-tab":
                return [datasource.create_opensearch_tab()]
            else:
                return [datasource.create_file_upload_tab()]

        # Register callbacks for both data and prompts sections
        self._register_opensearch_callbacks("data", self.opensearch_client_data)
        self._register_opensearch_callbacks("prompts", self.opensearch_client_prompts)
        # Register collapsible section callbacks
        self._register_collapse_callbacks()

    def _register_opensearch_callbacks(self, section_type, opensearch_client):
        """Register callbacks for a specific section (data or prompts)."""

        @callback(
            Output(f"{section_type}-auth-collapse", "is_open"),
            [Input(f"{section_type}-auth-toggle", "n_clicks")],
            [State(f"{section_type}-auth-collapse", "is_open")],
            prevent_initial_call=True,
        )
        def toggle_auth(n_clicks, is_open):
            if n_clicks:
                return not is_open
            return is_open

        @callback(
            Output(f"{section_type}-auth-toggle", "children"),
            [Input(f"{section_type}-auth-collapse", "is_open")],
            prevent_initial_call=False,
        )
        def update_auth_button_text(is_open):
            return "Hide Authentication" if is_open else "Show Authentication"

        @callback(
            [
                Output(f"{section_type}-connection-status", "children"),
                Output(f"{section_type}-field-mapping-section", "children"),
                Output(f"{section_type}-field-mapping-section", "style"),
                Output(f"{section_type}-load-data-section", "style"),
                Output(f"{section_type}-load-opensearch-data-btn", "disabled"),
                Output(f"{section_type}-embedding-field-dropdown", "options"),
                Output(f"{section_type}-text-field-dropdown", "options"),
                Output(f"{section_type}-id-field-dropdown", "options"),
                Output(f"{section_type}-category-field-dropdown", "options"),
                Output(f"{section_type}-subcategory-field-dropdown", "options"),
                Output(f"{section_type}-tags-field-dropdown", "options"),
            ],
            [Input(f"{section_type}-test-connection-btn", "n_clicks")],
            [
                State(f"{section_type}-opensearch-url", "value"),
                State(f"{section_type}-opensearch-index", "value"),
                State(f"{section_type}-opensearch-username", "value"),
                State(f"{section_type}-opensearch-password", "value"),
                State(f"{section_type}-opensearch-api-key", "value"),
            ],
            prevent_initial_call=True,
        )
        def test_opensearch_connection(
            n_clicks, url, index_name, username, password, api_key
        ):
            if not n_clicks or not url or not index_name:
                return (
                    no_update,
                    no_update,
                    no_update,
                    no_update,
                    no_update,
                    no_update,
                    no_update,
                    no_update,
                    no_update,
                    no_update,
                    no_update,
                )
            # Test connection
            success, message = opensearch_client.connect(
                url=url,
                username=username,
                password=password,
                api_key=api_key,
                verify_certs=AppSettings.OPENSEARCH_VERIFY_CERTS,
            )
            if not success:
                return (
                    self._create_status_alert(f"{message}", "danger"),
                    [],
                    {"display": "none"},
                    {"display": "none"},
                    True,
                    [],  # empty options for hidden dropdowns
                    [],
                    [],
                    [],
                    [],
                    [],
                )
            # Analyze fields
            success, field_analysis, analysis_message = (
                opensearch_client.analyze_fields(index_name)
            )
            if not success:
                return (
                    self._create_status_alert(f"{analysis_message}", "danger"),
                    [],
                    {"display": "none"},
                    {"display": "none"},
                    True,
                    [],  # empty options for hidden dropdowns
                    [],
                    [],
                    [],
                    [],
                    [],
                )
            # Generate field suggestions
            field_suggestions = FieldMapper.suggest_mappings(field_analysis)
            from ...ui.components.datasource import DataSourceComponent

            datasource = DataSourceComponent()
            field_mapping_ui = datasource.create_field_mapping_interface(
                field_suggestions, section_type
            )
            return (
                self._create_status_alert(f"{message}", "success"),
                field_mapping_ui,
                {"display": "block"},
                {"display": "block"},
                False,
                [
                    {"label": field, "value": field}
                    for field in field_suggestions.get("embedding", [])
                ],
                [
                    {"label": field, "value": field}
                    for field in field_suggestions.get("text", [])
                ],
                [
                    {"label": field, "value": field}
                    for field in field_suggestions.get("id", [])
                ],
                [
                    {"label": field, "value": field}
                    for field in field_suggestions.get("category", [])
                ],
                [
                    {"label": field, "value": field}
                    for field in field_suggestions.get("subcategory", [])
                ],
                [
                    {"label": field, "value": field}
                    for field in field_suggestions.get("tags", [])
                ],
            )

        # Determine output target based on section type
        output_target = (
            "processed-data" if section_type == "data" else "processed-prompts"
        )

        @callback(
            [
                Output(output_target, "data", allow_duplicate=True),
                Output("opensearch-success-alert", "children", allow_duplicate=True),
                Output("opensearch-success-alert", "is_open", allow_duplicate=True),
                Output("opensearch-error-alert", "children", allow_duplicate=True),
                Output("opensearch-error-alert", "is_open", allow_duplicate=True),
            ],
            [Input(f"{section_type}-load-opensearch-data-btn", "n_clicks")],
            [
                State(f"{section_type}-opensearch-index", "value"),
                State(f"{section_type}-opensearch-query-size", "value"),
                State(f"{section_type}-embedding-field-dropdown-ui", "value"),
                State(f"{section_type}-text-field-dropdown-ui", "value"),
                State(f"{section_type}-id-field-dropdown-ui", "value"),
                State(f"{section_type}-category-field-dropdown-ui", "value"),
                State(f"{section_type}-subcategory-field-dropdown-ui", "value"),
                State(f"{section_type}-tags-field-dropdown-ui", "value"),
            ],
            prevent_initial_call=True,
        )
        def load_opensearch_data(
            n_clicks,
            index_name,
            query_size,
            embedding_field,
            text_field,
            id_field,
            category_field,
            subcategory_field,
            tags_field,
        ):
            if not n_clicks or not index_name or not embedding_field or not text_field:
                return no_update, no_update, no_update, no_update, no_update
            try:
                # Validate and set query size
                if not query_size or query_size < 1:
                    query_size = AppSettings.OPENSEARCH_DEFAULT_SIZE
                elif query_size > 1000:
                    query_size = 1000  # Cap at reasonable maximum
                # Create field mapping
                field_mapping = FieldMapper.create_mapping_from_dict(
                    {
                        "embedding": embedding_field,
                        "text": text_field,
                        "id": id_field,
                        "category": category_field,
                        "subcategory": subcategory_field,
                        "tags": tags_field,
                    }
                )
                # Fetch data from OpenSearch
                success, raw_documents, message = opensearch_client.fetch_data(
                    index_name, size=query_size
                )
                if not success:
                    return (
                        no_update,
                        "",
                        False,
                        f"❌ Failed to fetch {section_type}: {message}",
                        True,
                    )
                # Process the data
                processed_data = self.processor.process_opensearch_data(
                    raw_documents, field_mapping
                )
                if processed_data.error:
                    return (
                        {"error": processed_data.error},
                        "",
                        False,
                        f"{section_type.title()} processing error: {processed_data.error}",
                        True,
                    )
                success_message = f"✅ Successfully loaded {len(processed_data.documents)} {section_type} from OpenSearch"
                # Format for appropriate target (data vs prompts)
                if section_type == "data":
                    return (
                        {
                            "documents": [
                                self._document_to_dict(doc)
                                for doc in processed_data.documents
                            ],
                            "embeddings": processed_data.embeddings.tolist(),
                        },
                        success_message,
                        True,
                        "",
                        False,
                    )
                else:  # prompts
                    return (
                        {
                            "prompts": [
                                self._document_to_dict(doc)
                                for doc in processed_data.documents
                            ],
                            "embeddings": processed_data.embeddings.tolist(),
                        },
                        success_message,
                        True,
                        "",
                        False,
                    )
            except Exception as e:
                return (no_update, "", False, f"❌ Unexpected error: {str(e)}", True)

        # Sync callbacks to update hidden dropdowns from UI dropdowns
        @callback(
            Output(f"{section_type}-embedding-field-dropdown", "value"),
            Input(f"{section_type}-embedding-field-dropdown-ui", "value"),
            prevent_initial_call=True,
        )
        def sync_embedding_dropdown(value):
            return value

        @callback(
            Output(f"{section_type}-text-field-dropdown", "value"),
            Input(f"{section_type}-text-field-dropdown-ui", "value"),
            prevent_initial_call=True,
        )
        def sync_text_dropdown(value):
            return value

        @callback(
            Output(f"{section_type}-id-field-dropdown", "value"),
            Input(f"{section_type}-id-field-dropdown-ui", "value"),
            prevent_initial_call=True,
        )
        def sync_id_dropdown(value):
            return value

        @callback(
            Output(f"{section_type}-category-field-dropdown", "value"),
            Input(f"{section_type}-category-field-dropdown-ui", "value"),
            prevent_initial_call=True,
        )
        def sync_category_dropdown(value):
            return value

        @callback(
            Output(f"{section_type}-subcategory-field-dropdown", "value"),
            Input(f"{section_type}-subcategory-field-dropdown-ui", "value"),
            prevent_initial_call=True,
        )
        def sync_subcategory_dropdown(value):
            return value

        @callback(
            Output(f"{section_type}-tags-field-dropdown", "value"),
            Input(f"{section_type}-tags-field-dropdown-ui", "value"),
            prevent_initial_call=True,
        )
        def sync_tags_dropdown(value):
            return value

    def _register_collapse_callbacks(self):
        """Register callbacks for collapsible sections."""

        # Data section collapse callback
        @callback(
            [
                Output("data-collapse", "is_open"),
                Output("data-collapse-icon", "className"),
            ],
            [Input("data-collapse-toggle", "n_clicks")],
            [State("data-collapse", "is_open")],
            prevent_initial_call=True,
        )
        def toggle_data_collapse(n_clicks, is_open):
            if n_clicks:
                new_state = not is_open
                icon_class = (
                    "fas fa-chevron-down me-2"
                    if new_state
                    else "fas fa-chevron-right me-2"
                )
                return new_state, icon_class
            return is_open, "fas fa-chevron-down me-2"

        # Prompts section collapse callback
        @callback(
            [
                Output("prompts-collapse", "is_open"),
                Output("prompts-collapse-icon", "className"),
            ],
            [Input("prompts-collapse-toggle", "n_clicks")],
            [State("prompts-collapse", "is_open")],
            prevent_initial_call=True,
        )
        def toggle_prompts_collapse(n_clicks, is_open):
            if n_clicks:
                new_state = not is_open
                icon_class = (
                    "fas fa-chevron-down me-2"
                    if new_state
                    else "fas fa-chevron-right me-2"
                )
                return new_state, icon_class
            return is_open, "fas fa-chevron-down me-2"
    @staticmethod
    def _document_to_dict(doc):
        return {
            "id": doc.id,
            "text": doc.text,
            "embedding": doc.embedding,
            "category": doc.category,
            "subcategory": doc.subcategory,
            "tags": doc.tags,
        }
    @staticmethod
    def _format_error_message(error: str, filename: str | None = None) -> str:
        """Format error message with helpful guidance for users."""
        file_part = f" in file '{filename}'" if filename else ""
        # Check for common error patterns and provide helpful messages
        if "embedding" in error.lower() and (
            "key" in error.lower() or "required field" in error.lower()
        ):
            return (
                f"❌ Missing 'embedding' field{file_part}. "
                "Each line must contain an 'embedding' field with a list of numbers."
            )
        elif "text" in error.lower() and (
            "key" in error.lower() or "required field" in error.lower()
        ):
            return (
                f"❌ Missing 'text' field{file_part}. "
                "Each line must contain a 'text' field with the document content."
            )
        elif "json" in error.lower() and "decode" in error.lower():
            return (
                f"❌ Invalid JSON format{file_part}. "
                "Please check that each line is valid JSON with proper syntax (quotes, braces, etc.)."
            )
        elif "unicode" in error.lower() or "decode" in error.lower():
            return (
                f"❌ File encoding issue{file_part}. "
                "Please ensure the file is saved in UTF-8 format and contains no binary data."
            )
        elif "array" in error.lower() or "list" in error.lower():
            return (
                f"❌ Invalid embedding format{file_part}. "
                "Embeddings must be arrays/lists of numbers, not strings or other types."
            )
        else:
            return (
                f"❌ Error processing file{file_part}: {error}. "
                "Please check that your file is valid NDJSON with required 'text' and 'embedding' fields."
            )

    @staticmethod
    def _create_status_alert(message: str, color: str):
        """Create a status alert component."""
        import dash_bootstrap_components as dbc

        return dbc.Alert(message, color=color, className="mb-2")
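`_format_error_message` is a first-match dispatch over substrings of the raw error, with the raw text as fallback. The pattern is easy to extend and can be sketched as a table of keyword sets (names and wording here are simplified, not the module's):

```python
def friendly_error(error: str) -> str:
    """First rule whose keywords all appear in the error wins."""
    e = error.lower()
    rules = [
        (("embedding", "key"), "Missing 'embedding' field."),
        (("text", "key"), "Missing 'text' field."),
        (("json", "decode"), "Invalid JSON format."),
    ]
    for keywords, message in rules:
        if all(k in e for k in keywords):
            return message
    return f"Error processing file: {error}"
```

Because rules are checked in order, more specific patterns should sit above broader ones, exactly as the `elif` chain above orders "json decode" before the bare "decode" encoding check.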


@@ -4,63 +4,79 @@ import dash_bootstrap_components as dbc
class InteractionCallbacks:
    def __init__(self):
        self._register_callbacks()

    def _register_callbacks(self):
        @callback(
            Output("point-details", "children"),
            Input("embedding-plot", "clickData"),
            [State("processed-data", "data"), State("processed-prompts", "data")],
        )
        def display_click_data(clickData, data, prompts_data):
            if not clickData or not data:
                return "Click on a point to see details"

            point_data = clickData["points"][0]
            trace_name = point_data.get("fullData", {}).get("name", "Documents")

            if "pointIndex" in point_data:
                point_index = point_data["pointIndex"]
            elif "pointNumber" in point_data:
                point_index = point_data["pointNumber"]
            else:
                return "Could not identify clicked point"

            if (
                trace_name.startswith("Prompts")
                and prompts_data
                and "prompts" in prompts_data
            ):
                item = prompts_data["prompts"][point_index]
                item_type = "Prompt"
            else:
                item = data["documents"][point_index]
                item_type = "Document"

            return self._create_detail_card(item, item_type)

        @callback(
            [
                Output("processed-data", "data", allow_duplicate=True),
                Output("processed-prompts", "data", allow_duplicate=True),
                Output("point-details", "children", allow_duplicate=True),
            ],
            Input("reset-button", "n_clicks"),
            prevent_initial_call=True,
        )
        def reset_data(n_clicks):
            if n_clicks is None or n_clicks == 0:
                return dash.no_update, dash.no_update, dash.no_update
            return None, None, "Click on a point to see details"

    @staticmethod
    def _create_detail_card(item, item_type):
        return dbc.Card(
            [
                dbc.CardBody(
                    [
                        html.H5(f"{item_type}: {item['id']}", className="card-title"),
                        html.P(f"Text: {item['text']}", className="card-text"),
                        html.P(
                            f"Category: {item.get('category', 'Unknown')}",
                            className="card-text",
                        ),
                        html.P(
                            f"Subcategory: {item.get('subcategory', 'Unknown')}",
                            className="card-text",
                        ),
                        html.P(
                            f"Tags: {', '.join(item.get('tags', [])) if item.get('tags') else 'None'}",
                            className="card-text",
                        ),
                        html.P(f"Type: {item_type}", className="card-text text-muted"),
                    ]
                )
            ]
        )

View File

@@ -7,81 +7,102 @@ from ...visualization.plots import PlotFactory
class VisualizationCallbacks:
    def __init__(self):
        self.plot_factory = PlotFactory()
        self._register_callbacks()

    def _register_callbacks(self):
        @callback(
            Output("embedding-plot", "figure"),
            [
                Input("processed-data", "data"),
                Input("processed-prompts", "data"),
                Input("method-dropdown", "value"),
                Input("color-dropdown", "value"),
                Input("dimension-toggle", "value"),
                Input("show-prompts-toggle", "value"),
            ],
        )
        def update_plot(data, prompts_data, method, color_by, dimensions, show_prompts):
            if not data or "error" in data:
                return go.Figure().add_annotation(
                    text="Upload a valid NDJSON file to see visualization",
                    xref="paper",
                    yref="paper",
                    x=0.5,
                    y=0.5,
                    xanchor="center",
                    yanchor="middle",
                    showarrow=False,
                    font=dict(size=16),
                )

            try:
                doc_embeddings = np.array(data["embeddings"])
                all_embeddings = doc_embeddings
                has_prompts = (
                    prompts_data
                    and "error" not in prompts_data
                    and prompts_data.get("prompts")
                )

                if has_prompts:
                    prompt_embeddings = np.array(prompts_data["embeddings"])
                    all_embeddings = np.vstack([doc_embeddings, prompt_embeddings])

                n_components = 3 if dimensions == "3d" else 2
                reducer = ReducerFactory.create_reducer(
                    method, n_components=n_components
                )
                reduced_data = reducer.fit_transform(all_embeddings)

                doc_reduced = reduced_data.reduced_embeddings[: len(doc_embeddings)]
                prompt_reduced = None
                if has_prompts:
                    prompt_reduced = reduced_data.reduced_embeddings[
                        len(doc_embeddings) :
                    ]

                documents = [self._dict_to_document(doc) for doc in data["documents"]]
                prompts = None
                if has_prompts:
                    prompts = [
                        self._dict_to_document(prompt)
                        for prompt in prompts_data["prompts"]
                    ]

                plot_data = PlotData(
                    documents=documents,
                    coordinates=doc_reduced,
                    prompts=prompts,
                    prompt_coordinates=prompt_reduced,
                )

                return self.plot_factory.create_plot(
                    plot_data, dimensions, color_by, reduced_data.method, show_prompts
                )
            except Exception as e:
                return go.Figure().add_annotation(
                    text=f"Error creating visualization: {str(e)}",
                    xref="paper",
                    yref="paper",
                    x=0.5,
                    y=0.5,
                    xanchor="center",
                    yanchor="middle",
                    showarrow=False,
                    font=dict(size=16),
                )

    @staticmethod
    def _dict_to_document(doc_dict):
        return Document(
            id=doc_dict["id"],
            text=doc_dict["text"],
            embedding=doc_dict["embedding"],
            category=doc_dict.get("category"),
            subcategory=doc_dict.get("subcategory"),
            tags=doc_dict.get("tags", []),
        )
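The key move in `update_plot` is stacking document and prompt embeddings before reduction, so both are projected into the same space, then slicing the result back apart by row count. A self-contained sketch of that pattern (with a trivial column-slice standing in for the actual `ReducerFactory` reducer):

```python
import numpy as np

# Hypothetical sizes: 10 documents and 3 prompts, embedding dim 8.
doc_embeddings = np.random.rand(10, 8)
prompt_embeddings = np.random.rand(3, 8)

# Stack so one fit_transform places docs and prompts in a shared space.
all_embeddings = np.vstack([doc_embeddings, prompt_embeddings])

# Placeholder "reduction" to 2D; the app would call reducer.fit_transform here.
reduced = all_embeddings[:, :2]

# Slice back apart by the original document count.
doc_reduced = reduced[: len(doc_embeddings)]
prompt_reduced = reduced[len(doc_embeddings) :]

print(doc_reduced.shape, prompt_reduced.shape)  # prints: (10, 2) (3, 2)
```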

View File

@@ -0,0 +1,519 @@
from dash import dcc, html
import dash_bootstrap_components as dbc
from .upload import UploadComponent
class DataSourceComponent:
def __init__(self):
self.upload_component = UploadComponent()
def create_tabbed_interface(self):
"""Create tabbed interface for different data sources."""
return dbc.Card(
[
dbc.CardHeader(
[
dbc.Tabs(
[
dbc.Tab(label="File Upload", tab_id="file-tab"),
dbc.Tab(label="OpenSearch", tab_id="opensearch-tab"),
],
id="data-source-tabs",
active_tab="file-tab",
)
]
),
dbc.CardBody([html.Div(id="tab-content")]),
]
)
def create_file_upload_tab(self):
"""Create file upload tab content."""
return html.Div(
[
self.upload_component.create_error_alert(),
self.upload_component.create_data_upload(),
self.upload_component.create_prompts_upload(),
self.upload_component.create_reset_button(),
]
)
def create_opensearch_tab(self):
"""Create OpenSearch tab content with separate Data and Prompts sections."""
return html.Div(
[
# Data Section
dbc.Card(
[
dbc.CardHeader(
[
dbc.Button(
[
html.I(
className="fas fa-chevron-down me-2",
id="data-collapse-icon",
),
"📄 Documents/Data",
],
id="data-collapse-toggle",
color="link",
className="text-start p-0 w-100 text-decoration-none",
style={
"border": "none",
"font-size": "1.25rem",
"font-weight": "500",
},
),
]
),
dbc.Collapse(
[dbc.CardBody([self._create_opensearch_section("data")])],
id="data-collapse",
is_open=True,
),
],
className="mb-4",
),
# Prompts Section
dbc.Card(
[
dbc.CardHeader(
[
dbc.Button(
[
html.I(
className="fas fa-chevron-down me-2",
id="prompts-collapse-icon",
),
"💬 Prompts",
],
id="prompts-collapse-toggle",
color="link",
className="text-start p-0 w-100 text-decoration-none",
style={
"border": "none",
"font-size": "1.25rem",
"font-weight": "500",
},
),
]
),
dbc.Collapse(
[
dbc.CardBody(
[self._create_opensearch_section("prompts")]
)
],
id="prompts-collapse",
is_open=True,
),
],
className="mb-4",
),
# Hidden dropdowns to prevent callback errors (for both sections)
html.Div(
[
# Data dropdowns (hidden sync targets)
dcc.Dropdown(
id="data-embedding-field-dropdown",
style={"display": "none"},
),
dcc.Dropdown(
id="data-text-field-dropdown", style={"display": "none"}
),
dcc.Dropdown(
id="data-id-field-dropdown", style={"display": "none"}
),
dcc.Dropdown(
id="data-category-field-dropdown", style={"display": "none"}
),
dcc.Dropdown(
id="data-subcategory-field-dropdown",
style={"display": "none"},
),
dcc.Dropdown(
id="data-tags-field-dropdown", style={"display": "none"}
),
# Data UI dropdowns (hidden placeholders)
dcc.Dropdown(
id="data-embedding-field-dropdown-ui",
style={"display": "none"},
),
dcc.Dropdown(
id="data-text-field-dropdown-ui", style={"display": "none"}
),
dcc.Dropdown(
id="data-id-field-dropdown-ui", style={"display": "none"}
),
dcc.Dropdown(
id="data-category-field-dropdown-ui",
style={"display": "none"},
),
dcc.Dropdown(
id="data-subcategory-field-dropdown-ui",
style={"display": "none"},
),
dcc.Dropdown(
id="data-tags-field-dropdown-ui", style={"display": "none"}
),
# Prompts dropdowns (hidden sync targets)
dcc.Dropdown(
id="prompts-embedding-field-dropdown",
style={"display": "none"},
),
dcc.Dropdown(
id="prompts-text-field-dropdown", style={"display": "none"}
),
dcc.Dropdown(
id="prompts-id-field-dropdown", style={"display": "none"}
),
dcc.Dropdown(
id="prompts-category-field-dropdown",
style={"display": "none"},
),
dcc.Dropdown(
id="prompts-subcategory-field-dropdown",
style={"display": "none"},
),
dcc.Dropdown(
id="prompts-tags-field-dropdown", style={"display": "none"}
),
# Prompts UI dropdowns (hidden placeholders)
dcc.Dropdown(
id="prompts-embedding-field-dropdown-ui",
style={"display": "none"},
),
dcc.Dropdown(
id="prompts-text-field-dropdown-ui",
style={"display": "none"},
),
dcc.Dropdown(
id="prompts-id-field-dropdown-ui", style={"display": "none"}
),
dcc.Dropdown(
id="prompts-category-field-dropdown-ui",
style={"display": "none"},
),
dcc.Dropdown(
id="prompts-subcategory-field-dropdown-ui",
style={"display": "none"},
),
dcc.Dropdown(
id="prompts-tags-field-dropdown-ui",
style={"display": "none"},
),
],
style={"display": "none"},
),
]
)
def _create_opensearch_section(self, section_type):
"""Create a complete OpenSearch section for either 'data' or 'prompts'."""
section_id = section_type # 'data' or 'prompts'
return html.Div(
[
# Connection section
html.H6("Connection", className="mb-2"),
dbc.Row(
[
dbc.Col(
[
dbc.Label("OpenSearch URL:"),
dbc.Input(
id=f"{section_id}-opensearch-url",
type="text",
placeholder="https://opensearch.example.com:9200",
className="mb-2",
),
],
width=12,
),
]
),
dbc.Row(
[
dbc.Col(
[
dbc.Label("Index Name:"),
dbc.Input(
id=f"{section_id}-opensearch-index",
type="text",
placeholder="my-embeddings-index",
className="mb-2",
),
],
width=6,
),
dbc.Col(
[
dbc.Label("Query Size:"),
dbc.Input(
id=f"{section_id}-opensearch-query-size",
type="number",
value=100,
min=1,
max=1000,
placeholder="100",
className="mb-2",
),
],
width=6,
),
]
),
dbc.Row(
[
dbc.Col(
[
dbc.Button(
"Test Connection",
id=f"{section_id}-test-connection-btn",
color="primary",
className="mb-3",
),
],
width=12,
),
]
),
# Authentication section (collapsible)
dbc.Collapse(
[
html.Hr(),
html.H6("Authentication (Optional)", className="mb-2"),
dbc.Row(
[
dbc.Col(
[
dbc.Label("Username:"),
dbc.Input(
id=f"{section_id}-opensearch-username",
type="text",
className="mb-2",
),
],
width=6,
),
dbc.Col(
[
dbc.Label("Password:"),
dbc.Input(
id=f"{section_id}-opensearch-password",
type="password",
className="mb-2",
),
],
width=6,
),
]
),
dbc.Label("OR"),
dbc.Input(
id=f"{section_id}-opensearch-api-key",
type="text",
placeholder="API Key",
className="mb-2",
),
],
id=f"{section_id}-auth-collapse",
is_open=False,
),
dbc.Button(
"Show Authentication",
id=f"{section_id}-auth-toggle",
color="link",
size="sm",
className="p-0 mb-3",
),
# Connection status
html.Div(id=f"{section_id}-connection-status", className="mb-3"),
# Field mapping section (hidden initially)
html.Div(
id=f"{section_id}-field-mapping-section", style={"display": "none"}
),
# Load data button (hidden initially)
html.Div(
[
dbc.Button(
f"Load {section_type.title()}",
id=f"{section_id}-load-opensearch-data-btn",
color="success",
className="mb-2",
disabled=True,
),
],
id=f"{section_id}-load-data-section",
style={"display": "none"},
),
# OpenSearch status/results
html.Div(id=f"{section_id}-opensearch-status", className="mb-3"),
]
)
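Once a connection succeeds, the field-mapping dropdowns below decide how each OpenSearch hit's `_source` is translated into the app's document shape. A hypothetical helper (the name `map_hit_to_document` and its exact signature are illustrative, not code from this PR) showing that translation:

```python
def map_hit_to_document(source: dict, mapping: dict) -> dict:
    """Apply the dropdown field mapping to one OpenSearch hit's _source.

    `mapping` maps logical names ("text", "embedding", ...) to the field
    names the user selected; required fields raise KeyError if unmapped.
    """
    return {
        "id": source.get(mapping.get("id", ""), None),
        "text": source[mapping["text"]],            # required
        "embedding": source[mapping["embedding"]],  # required
        "category": source.get(mapping.get("category", ""), None),
        "subcategory": source.get(mapping.get("subcategory", ""), None),
        "tags": source.get(mapping.get("tags", ""), []) or [],
    }


# Example hit with index-specific field names mapped to the app's schema.
hit_source = {"doc_id": "42", "body": "some text", "vector": [0.1, 0.2]}
mapping = {"id": "doc_id", "text": "body", "embedding": "vector"}
print(map_hit_to_document(hit_source, mapping)["id"])  # prints: 42
```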
def create_field_mapping_interface(self, field_suggestions, section_type="data"):
"""Create field mapping interface based on detected fields."""
return html.Div(
[
html.Hr(),
html.H6("Field Mapping", className="mb-2"),
html.P(
"Map your OpenSearch fields to the required format:",
className="text-muted small",
),
# Required fields
dbc.Row(
[
dbc.Col(
[
dbc.Label(
"Embedding Field (required):", className="fw-bold"
),
dcc.Dropdown(
id=f"{section_type}-embedding-field-dropdown-ui",
options=[
{"label": field, "value": field}
for field in field_suggestions.get(
"embedding", []
)
],
value=field_suggestions.get("embedding", [None])[
0
], # Default to first suggestion
placeholder="Select embedding field...",
className="mb-2",
),
],
width=6,
),
dbc.Col(
[
dbc.Label(
"Text Field (required):", className="fw-bold"
),
dcc.Dropdown(
id=f"{section_type}-text-field-dropdown-ui",
options=[
{"label": field, "value": field}
for field in field_suggestions.get("text", [])
],
value=field_suggestions.get("text", [None])[
0
], # Default to first suggestion
placeholder="Select text field...",
className="mb-2",
),
],
width=6,
),
]
),
# Optional fields
html.H6("Optional Fields", className="mb-2 mt-3"),
dbc.Row(
[
dbc.Col(
[
dbc.Label("ID Field:"),
dcc.Dropdown(
id=f"{section_type}-id-field-dropdown-ui",
options=[
{"label": field, "value": field}
for field in field_suggestions.get("id", [])
],
value=field_suggestions.get("id", [None])[
0
], # Default to first suggestion
placeholder="Select ID field...",
className="mb-2",
),
],
width=6,
),
dbc.Col(
[
dbc.Label("Category Field:"),
dcc.Dropdown(
id=f"{section_type}-category-field-dropdown-ui",
options=[
{"label": field, "value": field}
for field in field_suggestions.get(
"category", []
)
],
value=field_suggestions.get("category", [None])[
0
], # Default to first suggestion
placeholder="Select category field...",
className="mb-2",
),
],
width=6,
),
]
),
dbc.Row(
[
dbc.Col(
[
dbc.Label("Subcategory Field:"),
dcc.Dropdown(
id=f"{section_type}-subcategory-field-dropdown-ui",
options=[
{"label": field, "value": field}
for field in field_suggestions.get(
"subcategory", []
)
],
value=field_suggestions.get("subcategory", [None])[
0
], # Default to first suggestion
placeholder="Select subcategory field...",
className="mb-2",
),
],
width=6,
),
dbc.Col(
[
dbc.Label("Tags Field:"),
dcc.Dropdown(
id=f"{section_type}-tags-field-dropdown-ui",
options=[
{"label": field, "value": field}
for field in field_suggestions.get("tags", [])
],
value=field_suggestions.get("tags", [None])[
0
], # Default to first suggestion
placeholder="Select tags field...",
className="mb-2",
),
],
width=6,
),
]
),
]
)
def create_error_alert(self):
"""Create error alert component for OpenSearch issues."""
return dbc.Alert(
id="opensearch-error-alert",
dismissable=True,
is_open=False,
color="danger",
className="mb-3",
)
def create_success_alert(self):
"""Create success alert component for OpenSearch operations."""
return dbc.Alert(
id="opensearch-success-alert",
dismissable=True,
is_open=False,
color="success",
className="mb-3",
)

View File

@@ -1,82 +1,88 @@
from dash import dcc, html
import dash_bootstrap_components as dbc

from .upload import UploadComponent
from .datasource import DataSourceComponent


class SidebarComponent:
    def __init__(self):
        self.upload_component = UploadComponent()
        self.datasource_component = DataSourceComponent()

    def create_layout(self):
        return dbc.Col(
            [
                html.H5("Data Sources", className="mb-3"),
                self.datasource_component.create_error_alert(),
                self.datasource_component.create_success_alert(),
                self.datasource_component.create_tabbed_interface(),
                html.H5("Visualization Controls", className="mb-3 mt-4"),
            ]
            + self._create_method_dropdown()
            + self._create_color_dropdown()
            + self._create_dimension_toggle()
            + self._create_prompts_toggle()
            + [
                html.H5("Point Details", className="mb-3"),
                html.Div(
                    id="point-details", children="Click on a point to see details"
                ),
            ],
            width=3,
            style={"padding-right": "20px"},
        )

    def _create_method_dropdown(self):
        return [
            dbc.Label("Method:"),
            dcc.Dropdown(
                id="method-dropdown",
                options=[
                    {"label": "PCA", "value": "pca"},
                    {"label": "t-SNE", "value": "tsne"},
                    {"label": "UMAP", "value": "umap"},
                ],
                value="pca",
                style={"margin-bottom": "15px"},
            ),
        ]

    def _create_color_dropdown(self):
        return [
            dbc.Label("Color by:"),
            dcc.Dropdown(
                id="color-dropdown",
                options=[
                    {"label": "Category", "value": "category"},
                    {"label": "Subcategory", "value": "subcategory"},
                    {"label": "Tags", "value": "tags"},
                ],
                value="category",
                style={"margin-bottom": "15px"},
            ),
        ]

    def _create_dimension_toggle(self):
        return [
            dbc.Label("Dimensions:"),
            dcc.RadioItems(
                id="dimension-toggle",
                options=[
                    {"label": "2D", "value": "2d"},
                    {"label": "3D", "value": "3d"},
                ],
                value="3d",
                style={"margin-bottom": "20px"},
            ),
        ]

    def _create_prompts_toggle(self):
        return [
            dbc.Label("Show Prompts:"),
            dcc.Checklist(
                id="show-prompts-toggle",
                options=[{"label": "Show prompts on plot", "value": "show"}],
                value=["show"],
                style={"margin-bottom": "20px"},
            ),
        ]

View File

@@ -3,58 +3,62 @@ import dash_bootstrap_components as dbc
class UploadComponent:
    @staticmethod
    def create_data_upload():
        return dcc.Upload(
            id="upload-data",
            children=html.Div(["Drag and Drop or ", html.A("Select Files")]),
            style={
                "width": "100%",
                "height": "60px",
                "lineHeight": "60px",
                "borderWidth": "1px",
                "borderStyle": "dashed",
                "borderRadius": "5px",
                "textAlign": "center",
                "margin-bottom": "20px",
            },
            multiple=False,
        )

    @staticmethod
    def create_prompts_upload():
        return dcc.Upload(
            id="upload-prompts",
            children=html.Div(["Drag and Drop Prompts or ", html.A("Select Files")]),
            style={
                "width": "100%",
                "height": "60px",
                "lineHeight": "60px",
                "borderWidth": "1px",
                "borderStyle": "dashed",
                "borderRadius": "5px",
                "textAlign": "center",
                "margin-bottom": "20px",
                "borderColor": "#28a745",
            },
            multiple=False,
        )

    @staticmethod
    def create_reset_button():
        return dbc.Button(
            "Reset All Data",
            id="reset-button",
            color="danger",
            outline=True,
            size="sm",
            className="mb-3",
            style={"width": "100%"},
        )

    @staticmethod
    def create_error_alert():
        """Create error alert component for data upload issues."""
        return dbc.Alert(
            id="upload-error-alert",
            dismissable=True,
            is_open=False,
            color="danger",
            className="mb-3",
        )

View File

@@ -4,41 +4,44 @@ from .components.sidebar import SidebarComponent
class AppLayout:
    def __init__(self):
        self.sidebar = SidebarComponent()

    def create_layout(self):
        return dbc.Container(
            [self._create_header(), self._create_main_content()]
            + self._create_stores(),
            fluid=True,
        )

    def _create_header(self):
        return dbc.Row(
            [
                dbc.Col(
                    [
                        html.H1("EmbeddingBuddy", className="text-center mb-4"),
                    ],
                    width=12,
                )
            ]
        )

    def _create_main_content(self):
        return dbc.Row(
            [self.sidebar.create_layout(), self._create_visualization_area()]
        )

    def _create_visualization_area(self):
        return dbc.Col(
            [
                dcc.Graph(
                    id="embedding-plot",
                    style={"height": "85vh", "width": "100%"},
                    config={"responsive": True, "displayModeBar": True},
                )
            ],
            width=9,
        )

    def _create_stores(self):
        return [dcc.Store(id="processed-data"), dcc.Store(id="processed-prompts")]

View File

@@ -1,33 +1,36 @@
from typing import List

import plotly.colors as pc

from ..models.schemas import Document


class ColorMapper:
    @staticmethod
    def create_color_mapping(documents: List[Document], color_by: str) -> List[str]:
        if color_by == "category":
            return [doc.category for doc in documents]
        elif color_by == "subcategory":
            return [doc.subcategory for doc in documents]
        elif color_by == "tags":
            return [", ".join(doc.tags) if doc.tags else "No tags" for doc in documents]
        else:
            return ["All"] * len(documents)

    @staticmethod
    def to_grayscale_hex(color_str: str) -> str:
        try:
            if color_str.startswith("#"):
                rgb = tuple(int(color_str[i : i + 2], 16) for i in (1, 3, 5))
            else:
                rgb = pc.hex_to_rgb(
                    pc.convert_colors_to_same_type([color_str], colortype="hex")[0][0]
                )
            gray_value = int(0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2])
            gray_rgb = (
                gray_value * 0.7 + rgb[0] * 0.3,
                gray_value * 0.7 + rgb[1] * 0.3,
                gray_value * 0.7 + rgb[2] * 0.3,
            )
            return f"rgb({int(gray_rgb[0])},{int(gray_rgb[1])},{int(gray_rgb[2])})"
        except:  # noqa: E722
            return "rgb(128,128,128)"
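The muting in `to_grayscale_hex` is a BT.601 luma conversion followed by a 70/30 blend of the gray value back with each original channel, which desaturates prompt colors without flattening them entirely. The same arithmetic on bare RGB tuples, as a standalone sketch:

```python
def to_muted_gray(rgb):
    # BT.601 luma weights, as in ColorMapper.to_grayscale_hex above.
    gray = int(0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2])
    # 70% gray, 30% original channel: desaturated but not fully gray.
    return tuple(int(gray * 0.7 + c * 0.3) for c in rgb)


print(to_muted_gray((255, 0, 0)))  # prints: (129, 53, 53)
```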

View File

@@ -7,139 +7,172 @@ from .colors import ColorMapper
class PlotFactory:
    def __init__(self):
        self.color_mapper = ColorMapper()

    def create_plot(
        self,
        plot_data: PlotData,
        dimensions: str = "3d",
        color_by: str = "category",
        method: str = "PCA",
        show_prompts: Optional[List[str]] = None,
    ) -> go.Figure:
        if plot_data.prompts and show_prompts and "show" in show_prompts:
            return self._create_dual_plot(plot_data, dimensions, color_by, method)
        else:
            return self._create_single_plot(plot_data, dimensions, color_by, method)

    def _create_single_plot(
        self, plot_data: PlotData, dimensions: str, color_by: str, method: str
    ) -> go.Figure:
        df = self._prepare_dataframe(
            plot_data.documents, plot_data.coordinates, dimensions
        )
        color_values = self.color_mapper.create_color_mapping(
            plot_data.documents, color_by
        )

        hover_fields = ["id", "text_preview", "category", "subcategory", "tags_str"]

        if dimensions == "3d":
            fig = px.scatter_3d(
                df,
                x="dim_1",
                y="dim_2",
                z="dim_3",
                color=color_values,
                hover_data=hover_fields,
                title=f"3D Embedding Visualization - {method} (colored by {color_by})",
            )
            fig.update_traces(marker=dict(size=5))
        else:
            fig = px.scatter(
                df,
                x="dim_1",
                y="dim_2",
                color=color_values,
                hover_data=hover_fields,
                title=f"2D Embedding Visualization - {method} (colored by {color_by})",
            )
            fig.update_traces(marker=dict(size=8))

        fig.update_layout(height=None, autosize=True, margin=dict(l=0, r=0, t=50, b=0))

        return fig

    def _create_dual_plot(
        self, plot_data: PlotData, dimensions: str, color_by: str, method: str
    ) -> go.Figure:
        fig = go.Figure()

        doc_df = self._prepare_dataframe(
            plot_data.documents, plot_data.coordinates, dimensions
        )
        doc_color_values = self.color_mapper.create_color_mapping(
            plot_data.documents, color_by
        )

        hover_fields = ["id", "text_preview", "category", "subcategory", "tags_str"]

        if dimensions == "3d":
            doc_fig = px.scatter_3d(
                doc_df,
                x="dim_1",
                y="dim_2",
                z="dim_3",
                color=doc_color_values,
                hover_data=hover_fields,
            )
        else:
            doc_fig = px.scatter(
                doc_df,
                x="dim_1",
                y="dim_2",
                color=doc_color_values,
                hover_data=hover_fields,
            )

        for trace in doc_fig.data:
            trace.name = f"Documents - {trace.name}"
            if dimensions == "3d":
                trace.marker.size = 5
                trace.marker.symbol = "circle"
            else:
                trace.marker.size = 8
                trace.marker.symbol = "circle"
            trace.marker.opacity = 1.0
            fig.add_trace(trace)

        if plot_data.prompts and plot_data.prompt_coordinates is not None:
            prompt_df = self._prepare_dataframe(
                plot_data.prompts, plot_data.prompt_coordinates, dimensions
            )
            prompt_color_values = self.color_mapper.create_color_mapping(
                plot_data.prompts, color_by
            )

            if dimensions == "3d":
                prompt_fig = px.scatter_3d(
                    prompt_df,
                    x="dim_1",
                    y="dim_2",
                    z="dim_3",
                    color=prompt_color_values,
                    hover_data=hover_fields,
                )
            else:
                prompt_fig = px.scatter(
                    prompt_df,
                    x="dim_1",
                    y="dim_2",
                    color=prompt_color_values,
                    hover_data=hover_fields,
                )

            for trace in prompt_fig.data:
                if hasattr(trace.marker, "color") and isinstance(
                    trace.marker.color, str
                ):
                    trace.marker.color = self.color_mapper.to_grayscale_hex(
                        trace.marker.color
                    )
                trace.name = f"Prompts - {trace.name}"
                if dimensions == "3d":
                    trace.marker.size = 6
                    trace.marker.symbol = "diamond"
                else:
                    trace.marker.size = 10
                    trace.marker.symbol = "diamond"
                trace.marker.opacity = 0.8
                fig.add_trace(trace)

        title = f"{dimensions.upper()} Embedding Visualization - {method} (colored by {color_by})"
fig.update_layout( fig.update_layout(
title=title, title=title, height=None, autosize=True, margin=dict(l=0, r=0, t=50, b=0)
height=None,
autosize=True,
margin=dict(l=0, r=0, t=50, b=0)
) )
return fig return fig
def _prepare_dataframe(self, documents: List[Document], coordinates, dimensions: str) -> pd.DataFrame: def _prepare_dataframe(
self, documents: List[Document], coordinates, dimensions: str
) -> pd.DataFrame:
df_data = [] df_data = []
for i, doc in enumerate(documents): for i, doc in enumerate(documents):
row = { row = {
'id': doc.id, "id": doc.id,
'text': doc.text, "text": doc.text,
'text_preview': doc.text[:100] + "..." if len(doc.text) > 100 else doc.text, "text_preview": doc.text[:100] + "..."
'category': doc.category, if len(doc.text) > 100
'subcategory': doc.subcategory, else doc.text,
'tags_str': ', '.join(doc.tags) if doc.tags else 'None', "category": doc.category,
'dim_1': coordinates[i, 0], "subcategory": doc.subcategory,
'dim_2': coordinates[i, 1], "tags_str": ", ".join(doc.tags) if doc.tags else "None",
"dim_1": coordinates[i, 0],
"dim_2": coordinates[i, 1],
} }
if dimensions == '3d': if dimensions == "3d":
row['dim_3'] = coordinates[i, 2] row["dim_3"] = coordinates[i, 2]
df_data.append(row) df_data.append(row)
return pd.DataFrame(df_data) return pd.DataFrame(df_data)
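`_create_dual_plot` mutes prompt traces by routing their colors through `self.color_mapper.to_grayscale_hex`. A minimal sketch of such a helper, assuming `#rrggbb` input and standard luminance weights (the project's actual implementation may differ):

```python
def to_grayscale_hex(hex_color: str) -> str:
    # Parse the "#rrggbb" channels, compute perceptual luminance with
    # ITU-R BT.601 weights, and emit a uniform gray of that brightness.
    r = int(hex_color[1:3], 16)
    g = int(hex_color[3:5], 16)
    b = int(hex_color[5:7], 16)
    gray = int(0.299 * r + 0.587 * g + 0.114 * b)
    return f"#{gray:02x}{gray:02x}{gray:02x}"

print(to_grayscale_hex("#ff0000"))  # "#4c4c4c"
```

This keeps prompts distinguishable by brightness while ceding hue to the document traces.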

tests/test_bad_data.py

@@ -0,0 +1,197 @@
"""Tests for handling bad/invalid data files."""
import pytest
import json
import base64
from src.embeddingbuddy.data.parser import NDJSONParser
from src.embeddingbuddy.data.processor import DataProcessor
class TestBadDataHandling:
"""Test suite for various types of invalid input data."""
def setup_method(self):
"""Set up test fixtures."""
self.parser = NDJSONParser()
self.processor = DataProcessor()
def _create_upload_contents(self, text_content: str) -> str:
"""Helper to create upload contents format."""
encoded = base64.b64encode(text_content.encode("utf-8")).decode("utf-8")
return f"data:application/json;base64,{encoded}"
def test_missing_embedding_field(self):
"""Test files missing required embedding field."""
bad_content = '{"id": "doc_001", "text": "Sample text", "category": "test"}'
with pytest.raises(KeyError, match="embedding"):
self.parser.parse_text(bad_content)
# Test processor error handling
upload_contents = self._create_upload_contents(bad_content)
result = self.processor.process_upload(upload_contents)
assert result.error is not None
assert "embedding" in result.error
def test_missing_text_field(self):
"""Test files missing required text field."""
bad_content = (
'{"id": "doc_001", "embedding": [0.1, 0.2, 0.3], "category": "test"}'
)
with pytest.raises(KeyError, match="text"):
self.parser.parse_text(bad_content)
# Test processor error handling
upload_contents = self._create_upload_contents(bad_content)
result = self.processor.process_upload(upload_contents)
assert result.error is not None
assert "text" in result.error
def test_malformed_json_lines(self):
"""Test files with malformed JSON syntax."""
# Missing closing brace
bad_content = '{"id": "doc_001", "embedding": [0.1, 0.2], "text": "test"'
with pytest.raises(json.JSONDecodeError):
self.parser.parse_text(bad_content)
# Test processor error handling
upload_contents = self._create_upload_contents(bad_content)
result = self.processor.process_upload(upload_contents)
assert result.error is not None
def test_invalid_embedding_types(self):
"""Test files with invalid embedding data types."""
test_cases = [
# String instead of array
'{"id": "doc_001", "embedding": "not_an_array", "text": "test"}',
# Mixed types in array
'{"id": "doc_002", "embedding": [0.1, "text", 0.3], "text": "test"}',
# Empty array
'{"id": "doc_003", "embedding": [], "text": "test"}',
# Null embedding
'{"id": "doc_004", "embedding": null, "text": "test"}',
]
for bad_content in test_cases:
upload_contents = self._create_upload_contents(bad_content)
result = self.processor.process_upload(upload_contents)
assert result.error is not None, f"Should fail for: {bad_content}"
def test_inconsistent_embedding_dimensions(self):
"""Test files with embeddings of different dimensions."""
bad_content = """{"id": "doc_001", "embedding": [0.1, 0.2, 0.3, 0.4], "text": "4D embedding"}
{"id": "doc_002", "embedding": [0.1, 0.2, 0.3], "text": "3D embedding"}"""
upload_contents = self._create_upload_contents(bad_content)
result = self.processor.process_upload(upload_contents)
# This might succeed parsing but fail in processing
# The error depends on where dimension validation occurs
if result.error is None:
# If parsing succeeds, check that embeddings have inconsistent shapes
assert len(result.documents) == 2
assert len(result.documents[0].embedding) != len(
result.documents[1].embedding
)
def test_empty_lines_in_ndjson(self):
"""Test files with empty lines mixed in."""
content_with_empty_lines = """{"id": "doc_001", "embedding": [0.1, 0.2], "text": "First line"}
{"id": "doc_002", "embedding": [0.3, 0.4], "text": "After empty line"}"""
# This should work - empty lines should be skipped
documents = self.parser.parse_text(content_with_empty_lines)
assert len(documents) == 2
assert documents[0].id == "doc_001"
assert documents[1].id == "doc_002"
def test_not_ndjson_format(self):
"""Test regular JSON array instead of NDJSON."""
json_array = """[
{"id": "doc_001", "embedding": [0.1, 0.2], "text": "First"},
{"id": "doc_002", "embedding": [0.3, 0.4], "text": "Second"}
]"""
with pytest.raises(json.JSONDecodeError):
self.parser.parse_text(json_array)
def test_binary_content_in_file(self):
"""Test files with binary content mixed in."""
# Simulate binary content that can't be decoded
binary_content = (
b'\x00\x01\x02{"id": "doc_001", "embedding": [0.1], "text": "test"}'
)
# This should result in an error when processing
encoded = base64.b64encode(binary_content).decode("utf-8")
upload_contents = f"data:application/json;base64,{encoded}"
result = self.processor.process_upload(upload_contents)
# Should either fail with UnicodeDecodeError or JSON parsing error
assert result.error is not None
def test_extremely_large_embeddings(self):
"""Test embeddings with very large dimensions."""
large_embedding = [0.1] * 10000 # 10k dimensions
content = json.dumps(
{
"id": "doc_001",
"embedding": large_embedding,
"text": "Large embedding test",
}
)
# This should work but might be slow
upload_contents = self._create_upload_contents(content)
result = self.processor.process_upload(upload_contents)
if result.error is None:
assert len(result.documents) == 1
assert len(result.documents[0].embedding) == 10000
def test_special_characters_in_text(self):
"""Test handling of special characters and unicode."""
special_content = json.dumps(
{
"id": "doc_001",
"embedding": [0.1, 0.2],
"text": 'Special chars: 🚀 ñoñó 中文 \n\t"',
},
ensure_ascii=False,
)
upload_contents = self._create_upload_contents(special_content)
result = self.processor.process_upload(upload_contents)
assert result.error is None
assert len(result.documents) == 1
assert "🚀" in result.documents[0].text
def test_processor_error_structure(self):
"""Test that processor returns proper error structure."""
bad_content = '{"invalid": "json"' # Missing closing brace
upload_contents = self._create_upload_contents(bad_content)
result = self.processor.process_upload(upload_contents)
# Check error structure
assert result.error is not None
assert isinstance(result.error, str)
assert len(result.documents) == 0
assert result.embeddings.size == 0
def test_multiple_errors_in_file(self):
"""Test file with multiple different types of errors."""
multi_error_content = """{"id": "doc_001", "text": "Missing embedding"}
{"id": "doc_002", "embedding": "wrong_type", "text": "Wrong embedding type"}
{"id": "doc_003", "embedding": [0.1, 0.2], "text": "Valid line"}
{"id": "doc_004", "embedding": [0.3, 0.4]""" # Missing text and closing brace
upload_contents = self._create_upload_contents(multi_error_content)
result = self.processor.process_upload(upload_contents)
# Should fail on first error encountered
assert result.error is not None
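The `_create_upload_contents` helper above mirrors the payload format Dash upload components produce. A standalone sketch of the round trip, stdlib only (`make_upload_contents` and `decode_upload_contents` are illustrative names, not part of the codebase):

```python
import base64
import json


def make_upload_contents(records):
    # One JSON object per line (NDJSON), wrapped as a base64 data URI --
    # the shape an upload callback hands to the processor.
    ndjson = "\n".join(json.dumps(r) for r in records)
    encoded = base64.b64encode(ndjson.encode("utf-8")).decode("utf-8")
    return f"data:application/json;base64,{encoded}"


def decode_upload_contents(contents):
    # Split the data-URI header from the payload and reverse the encoding,
    # skipping blank lines the way a lenient NDJSON parser would.
    _, payload = contents.split(",", 1)
    text = base64.b64decode(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


docs = [{"id": "doc_001", "embedding": [0.1, 0.2], "text": "hello"}]
assert decode_upload_contents(make_upload_contents(docs)) == docs
```

Corrupting either the base64 payload or a JSON line in this format reproduces the failure modes the tests above exercise.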

@@ -6,62 +6,64 @@ from src.embeddingbuddy.models.schemas import Document
class TestNDJSONParser:
    def test_parse_text_basic(self):
        text_content = (
            '{"id": "test1", "text": "Hello world", "embedding": [0.1, 0.2, 0.3]}'
        )
        documents = NDJSONParser.parse_text(text_content)

        assert len(documents) == 1
        assert documents[0].id == "test1"
        assert documents[0].text == "Hello world"
        assert documents[0].embedding == [0.1, 0.2, 0.3]

    def test_parse_text_with_metadata(self):
        text_content = '{"id": "test1", "text": "Hello", "embedding": [0.1, 0.2], "category": "greeting", "tags": ["test"]}'
        documents = NDJSONParser.parse_text(text_content)

        assert documents[0].category == "greeting"
        assert documents[0].tags == ["test"]

    def test_parse_text_missing_id(self):
        text_content = '{"text": "Hello", "embedding": [0.1, 0.2]}'
        documents = NDJSONParser.parse_text(text_content)

        assert len(documents) == 1
        assert documents[0].id is not None  # Should be auto-generated


class TestDataProcessor:
    def test_extract_embeddings(self):
        documents = [
            Document(id="1", text="test1", embedding=[0.1, 0.2]),
            Document(id="2", text="test2", embedding=[0.3, 0.4]),
        ]

        processor = DataProcessor()
        embeddings = processor._extract_embeddings(documents)

        assert embeddings.shape == (2, 2)
        assert np.allclose(embeddings[0], [0.1, 0.2])
        assert np.allclose(embeddings[1], [0.3, 0.4])

    def test_combine_data(self):
        from src.embeddingbuddy.models.schemas import ProcessedData

        doc_data = ProcessedData(
            documents=[Document(id="1", text="doc", embedding=[0.1, 0.2])],
            embeddings=np.array([[0.1, 0.2]]),
        )
        prompt_data = ProcessedData(
            documents=[Document(id="p1", text="prompt", embedding=[0.3, 0.4])],
            embeddings=np.array([[0.3, 0.4]]),
        )

        processor = DataProcessor()
        all_embeddings, documents, prompts = processor.combine_data(
            doc_data, prompt_data
        )

        assert all_embeddings.shape == (2, 2)
        assert len(documents) == 1
        assert len(prompts) == 1

@@ -70,4 +72,4 @@ class TestDataProcessor:
if __name__ == "__main__":
    pytest.main([__file__])

@@ -0,0 +1,155 @@
from unittest.mock import patch

from src.embeddingbuddy.data.processor import DataProcessor
from src.embeddingbuddy.models.field_mapper import FieldMapping


class TestDataProcessorOpenSearch:
    def test_process_opensearch_data_success(self):
        processor = DataProcessor()

        # Mock raw OpenSearch documents
        raw_documents = [
            {
                "vector": [0.1, 0.2, 0.3],
                "content": "Test document 1",
                "doc_id": "doc1",
                "type": "news",
            },
            {
                "vector": [0.4, 0.5, 0.6],
                "content": "Test document 2",
                "doc_id": "doc2",
                "type": "blog",
            },
        ]

        # Create field mapping
        field_mapping = FieldMapping(
            embedding_field="vector",
            text_field="content",
            id_field="doc_id",
            category_field="type",
        )

        # Process the data
        processed_data = processor.process_opensearch_data(raw_documents, field_mapping)

        # Assertions
        assert processed_data.error is None
        assert len(processed_data.documents) == 2
        assert processed_data.embeddings.shape == (2, 3)

        # Check first document
        doc1 = processed_data.documents[0]
        assert doc1.text == "Test document 1"
        assert doc1.embedding == [0.1, 0.2, 0.3]
        assert doc1.id == "doc1"
        assert doc1.category == "news"

        # Check second document
        doc2 = processed_data.documents[1]
        assert doc2.text == "Test document 2"
        assert doc2.embedding == [0.4, 0.5, 0.6]
        assert doc2.id == "doc2"
        assert doc2.category == "blog"

    def test_process_opensearch_data_with_tags(self):
        processor = DataProcessor()

        # Mock raw OpenSearch documents with tags
        raw_documents = [
            {
                "vector": [0.1, 0.2, 0.3],
                "content": "Test document with tags",
                "keywords": ["tag1", "tag2"],
            }
        ]

        # Create field mapping
        field_mapping = FieldMapping(
            embedding_field="vector", text_field="content", tags_field="keywords"
        )

        processed_data = processor.process_opensearch_data(raw_documents, field_mapping)

        assert processed_data.error is None
        assert len(processed_data.documents) == 1
        doc = processed_data.documents[0]
        assert doc.tags == ["tag1", "tag2"]

    def test_process_opensearch_data_invalid_documents(self):
        processor = DataProcessor()

        # Mock raw documents with missing required fields
        raw_documents = [
            {
                "vector": [0.1, 0.2, 0.3],
                # Missing text field
            }
        ]

        field_mapping = FieldMapping(embedding_field="vector", text_field="content")

        processed_data = processor.process_opensearch_data(raw_documents, field_mapping)

        # Should return error since no valid documents
        assert processed_data.error is not None
        assert "No valid documents" in processed_data.error
        assert len(processed_data.documents) == 0

    def test_process_opensearch_data_partial_success(self):
        processor = DataProcessor()

        # Mix of valid and invalid documents
        raw_documents = [
            {
                "vector": [0.1, 0.2, 0.3],
                "content": "Valid document",
            },
            {
                "vector": [0.4, 0.5, 0.6],
                # Missing content field - should be skipped
            },
            {
                "vector": [0.7, 0.8, 0.9],
                "content": "Another valid document",
            },
        ]

        field_mapping = FieldMapping(embedding_field="vector", text_field="content")

        processed_data = processor.process_opensearch_data(raw_documents, field_mapping)

        # Should process valid documents only
        assert processed_data.error is None
        assert len(processed_data.documents) == 2
        assert processed_data.documents[0].text == "Valid document"
        assert processed_data.documents[1].text == "Another valid document"

    @patch("src.embeddingbuddy.models.field_mapper.FieldMapper.transform_documents")
    def test_process_opensearch_data_transformation_error(self, mock_transform):
        processor = DataProcessor()

        # Mock transformation error
        mock_transform.side_effect = Exception("Transformation failed")

        raw_documents = [{"vector": [0.1], "content": "test"}]
        field_mapping = FieldMapping(embedding_field="vector", text_field="content")

        processed_data = processor.process_opensearch_data(raw_documents, field_mapping)

        assert processed_data.error is not None
        assert "Transformation failed" in processed_data.error
        assert len(processed_data.documents) == 0

    def test_process_opensearch_data_empty_input(self):
        processor = DataProcessor()

        raw_documents = []
        field_mapping = FieldMapping(embedding_field="vector", text_field="content")

        processed_data = processor.process_opensearch_data(raw_documents, field_mapping)

        assert processed_data.error is not None
        assert "No valid documents" in processed_data.error
        assert len(processed_data.documents) == 0
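The skip-on-missing-required-field behavior exercised above can be sketched with plain dicts (an illustrative re-implementation, not the project's `FieldMapper`):

```python
def transform_documents(raw_docs, mapping):
    # mapping: canonical field name -> source field name; "embedding" and
    # "text" are required, every other mapping entry is optional.
    transformed = []
    for raw in raw_docs:
        if mapping["embedding"] not in raw or mapping["text"] not in raw:
            continue  # skip documents missing a required field
        doc = {
            "embedding": raw[mapping["embedding"]],
            "text": raw[mapping["text"]],
        }
        for key in ("id", "category", "subcategory", "tags"):
            source = mapping.get(key)
            if source and source in raw:
                doc[key] = raw[source]
        transformed.append(doc)
    return transformed


mapping = {"embedding": "vector", "text": "content", "id": "doc_id"}
raw = [
    {"vector": [0.1, 0.2], "content": "valid", "doc_id": "doc1"},
    {"vector": [0.3, 0.4]},  # no content field -> skipped
]
assert transform_documents(raw, mapping) == [
    {"embedding": [0.1, 0.2], "text": "valid", "id": "doc1"}
]
```

Skipping rather than raising is what lets the partial-success test accept two of three documents while the all-invalid case surfaces a "No valid documents" error.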

tests/test_opensearch.py

@@ -0,0 +1,310 @@
from unittest.mock import Mock, patch

from src.embeddingbuddy.data.sources.opensearch import OpenSearchClient
from src.embeddingbuddy.models.field_mapper import FieldMapper, FieldMapping


class TestOpenSearchClient:
    def test_init(self):
        client = OpenSearchClient()
        assert client.client is None
        assert client.connection_info is None

    @patch("src.embeddingbuddy.data.sources.opensearch.OpenSearch")
    def test_connect_success(self, mock_opensearch):
        # Mock the OpenSearch client
        mock_client_instance = Mock()
        mock_client_instance.info.return_value = {
            "cluster_name": "test-cluster",
            "version": {"number": "2.0.0"},
        }
        mock_opensearch.return_value = mock_client_instance

        client = OpenSearchClient()
        success, message = client.connect("https://localhost:9200")

        assert success is True
        assert "test-cluster" in message
        assert client.client is not None
        assert client.connection_info["cluster_name"] == "test-cluster"

    @patch("src.embeddingbuddy.data.sources.opensearch.OpenSearch")
    def test_connect_failure(self, mock_opensearch):
        # Mock connection failure
        mock_opensearch.side_effect = Exception("Connection failed")

        client = OpenSearchClient()
        success, message = client.connect("https://localhost:9200")

        assert success is False
        assert "Connection failed" in message
        assert client.client is None

    def test_analyze_fields(self):
        client = OpenSearchClient()
        client.client = Mock()

        # Mock mapping response
        mock_mapping = {
            "test-index": {
                "mappings": {
                    "properties": {
                        "embedding": {"type": "dense_vector", "dimension": 768},
                        "text": {"type": "text"},
                        "category": {"type": "keyword"},
                        "id": {"type": "keyword"},
                        "count": {"type": "integer"},
                    }
                }
            }
        }
        client.client.indices.get_mapping.return_value = mock_mapping

        success, analysis, message = client.analyze_fields("test-index")

        assert success is True
        assert len(analysis["vector_fields"]) == 1
        assert analysis["vector_fields"][0]["name"] == "embedding"
        assert analysis["vector_fields"][0]["dimension"] == 768
        assert "text" in analysis["text_fields"]
        assert "category" in analysis["keyword_fields"]
        assert "count" in analysis["numeric_fields"]

    def test_fetch_sample_data(self):
        client = OpenSearchClient()
        client.client = Mock()

        # Mock search response
        mock_response = {
            "hits": {
                "hits": [
                    {"_source": {"text": "doc1", "embedding": [0.1, 0.2]}},
                    {"_source": {"text": "doc2", "embedding": [0.3, 0.4]}},
                ]
            }
        }
        client.client.search.return_value = mock_response

        success, documents, message = client.fetch_sample_data("test-index", size=2)

        assert success is True
        assert len(documents) == 2
        assert documents[0]["text"] == "doc1"
        assert documents[1]["text"] == "doc2"


class TestFieldMapper:
    def test_suggest_mappings(self):
        field_analysis = {
            "vector_fields": [{"name": "embedding", "dimension": 768}],
            "text_fields": ["content", "description"],
            "keyword_fields": ["doc_id", "category", "type", "tags"],
            "numeric_fields": ["count"],
            "all_fields": [
                "embedding",
                "content",
                "description",
                "doc_id",
                "category",
                "type",
                "tags",
                "count",
            ],
        }

        suggestions = FieldMapper.suggest_mappings(field_analysis)

        # Check that all dropdowns contain all fields
        all_fields = [
            "embedding",
            "content",
            "description",
            "doc_id",
            "category",
            "type",
            "tags",
            "count",
        ]
        for field_type in [
            "embedding",
            "text",
            "id",
            "category",
            "subcategory",
            "tags",
        ]:
            for field in all_fields:
                assert field in suggestions[field_type], (
                    f"Field '{field}' missing from {field_type} suggestions"
                )

        # Check that best candidates are first
        assert (
            suggestions["embedding"][0] == "embedding"
        )  # vector field should be first
        assert suggestions["text"][0] in [
            "content",
            "description",
        ]  # text fields should be first
        assert suggestions["id"][0] == "doc_id"  # ID-like field should be first
        assert suggestions["category"][0] in [
            "category",
            "type",
        ]  # category-like field should be first
        assert suggestions["tags"][0] == "tags"  # tags field should be first

    def test_suggest_mappings_name_based_embedding(self):
        """Test that fields named 'embedding' are prioritized even without vector type."""
        field_analysis = {
            "vector_fields": [],  # No explicit vector fields detected
            "text_fields": ["content", "description"],
            "keyword_fields": ["doc_id", "category", "type", "tags"],
            "numeric_fields": ["count"],
            "all_fields": [
                "content",
                "description",
                "doc_id",
                "category",
                "embedding",
                "type",
                "tags",
                "count",
            ],
        }

        suggestions = FieldMapper.suggest_mappings(field_analysis)

        # Check that 'embedding' field is prioritized despite not being
        # detected as vector type
        assert suggestions["embedding"][0] == "embedding", (
            "Field named 'embedding' should be first priority"
        )

        # Check that all fields are still available
        all_fields = [
            "content",
            "description",
            "doc_id",
            "category",
            "embedding",
            "type",
            "tags",
            "count",
        ]
        for field_type in [
            "embedding",
            "text",
            "id",
            "category",
            "subcategory",
            "tags",
        ]:
            for field in all_fields:
                assert field in suggestions[field_type], (
                    f"Field '{field}' missing from {field_type} suggestions"
                )

    def test_validate_mapping_success(self):
        mapping = FieldMapping(
            embedding_field="embedding", text_field="text", id_field="doc_id"
        )
        available_fields = ["embedding", "text", "doc_id", "category"]

        errors = FieldMapper.validate_mapping(mapping, available_fields)
        assert len(errors) == 0

    def test_validate_mapping_missing_required(self):
        mapping = FieldMapping(embedding_field="missing_field", text_field="text")
        available_fields = ["text", "category"]

        errors = FieldMapper.validate_mapping(mapping, available_fields)
        assert len(errors) == 1
        assert "missing_field" in errors[0]
        assert "not found" in errors[0]

    def test_validate_mapping_missing_optional(self):
        mapping = FieldMapping(
            embedding_field="embedding",
            text_field="text",
            category_field="missing_category",
        )
        available_fields = ["embedding", "text"]

        errors = FieldMapper.validate_mapping(mapping, available_fields)
        assert len(errors) == 1
        assert "missing_category" in errors[0]

    def test_transform_documents(self):
        mapping = FieldMapping(
            embedding_field="vector",
            text_field="content",
            id_field="doc_id",
            category_field="type",
        )

        raw_documents = [
            {
                "vector": [0.1, 0.2, 0.3],
                "content": "Test document 1",
                "doc_id": "doc1",
                "type": "news",
            },
            {
                "vector": [0.4, 0.5, 0.6],
                "content": "Test document 2",
                "doc_id": "doc2",
                "type": "blog",
            },
        ]

        transformed = FieldMapper.transform_documents(raw_documents, mapping)

        assert len(transformed) == 2
        assert transformed[0]["embedding"] == [0.1, 0.2, 0.3]
        assert transformed[0]["text"] == "Test document 1"
        assert transformed[0]["id"] == "doc1"
        assert transformed[0]["category"] == "news"

    def test_transform_documents_missing_required(self):
        mapping = FieldMapping(embedding_field="vector", text_field="content")

        raw_documents = [
            {
                "vector": [0.1, 0.2, 0.3],
                # Missing content field
            }
        ]

        transformed = FieldMapper.transform_documents(raw_documents, mapping)
        assert len(transformed) == 0  # Document should be skipped

    def test_create_mapping_from_dict(self):
        mapping_dict = {
            "embedding": "vector_field",
            "text": "text_field",
            "id": "doc_id",
            "category": "cat_field",
            "subcategory": "subcat_field",
            "tags": "tags_field",
        }

        mapping = FieldMapper.create_mapping_from_dict(mapping_dict)

        assert mapping.embedding_field == "vector_field"
        assert mapping.text_field == "text_field"
        assert mapping.id_field == "doc_id"
        assert mapping.category_field == "cat_field"
        assert mapping.subcategory_field == "subcat_field"
        assert mapping.tags_field == "tags_field"

    def test_create_mapping_from_dict_minimal(self):
        mapping_dict = {"embedding": "vector_field", "text": "text_field"}

        mapping = FieldMapper.create_mapping_from_dict(mapping_dict)

        assert mapping.embedding_field == "vector_field"
        assert mapping.text_field == "text_field"
        assert mapping.id_field is None
        assert mapping.category_field is None


@@ -1,89 +1,90 @@
import pytest
import numpy as np
from src.embeddingbuddy.models.reducers import (
    ReducerFactory,
    PCAReducer,
    TSNEReducer,
    UMAPReducer,
)


class TestReducerFactory:
    def test_create_pca_reducer(self):
        reducer = ReducerFactory.create_reducer("pca", n_components=2)
        assert isinstance(reducer, PCAReducer)
        assert reducer.n_components == 2

    def test_create_tsne_reducer(self):
        reducer = ReducerFactory.create_reducer("tsne", n_components=3)
        assert isinstance(reducer, TSNEReducer)
        assert reducer.n_components == 3

    def test_create_umap_reducer(self):
        reducer = ReducerFactory.create_reducer("umap", n_components=2)
        assert isinstance(reducer, UMAPReducer)
        assert reducer.n_components == 2

    def test_invalid_method(self):
        with pytest.raises(ValueError, match="Unknown reduction method"):
            ReducerFactory.create_reducer("invalid_method")

    def test_available_methods(self):
        methods = ReducerFactory.get_available_methods()
        assert "pca" in methods
        assert "tsne" in methods
        assert "umap" in methods


class TestPCAReducer:
    def test_fit_transform(self):
        embeddings = np.random.rand(100, 512)
        reducer = PCAReducer(n_components=2)
        result = reducer.fit_transform(embeddings)

        assert result.reduced_embeddings.shape == (100, 2)
        assert result.variance_explained is not None
        assert result.method == "PCA"
        assert result.n_components == 2

    def test_method_name(self):
        reducer = PCAReducer()
        assert reducer.get_method_name() == "PCA"


class TestTSNEReducer:
    def test_fit_transform_small_dataset(self):
        embeddings = np.random.rand(30, 10)  # Small dataset for faster testing
        reducer = TSNEReducer(n_components=2)
        result = reducer.fit_transform(embeddings)

        assert result.reduced_embeddings.shape == (30, 2)
        assert result.variance_explained is None  # t-SNE doesn't provide this
        assert result.method == "t-SNE"
        assert result.n_components == 2

    def test_method_name(self):
        reducer = TSNEReducer()
        assert reducer.get_method_name() == "t-SNE"


class TestUMAPReducer:
    def test_fit_transform(self):
        embeddings = np.random.rand(50, 10)
        reducer = UMAPReducer(n_components=2)
        result = reducer.fit_transform(embeddings)

        assert result.reduced_embeddings.shape == (50, 2)
        assert result.variance_explained is None  # UMAP doesn't provide this
        assert result.method == "UMAP"
        assert result.n_components == 2

    def test_method_name(self):
        reducer = UMAPReducer()
        assert reducer.get_method_name() == "UMAP"


if __name__ == "__main__":
    pytest.main([__file__])

uv.lock (generated)

File diff suppressed because it is too large