
# Elasticsearch/OpenSearch Sample Data

This directory contains sample data files in Elasticsearch bulk API format, which OpenSearch also accepts, for testing the OpenSearch integration in EmbeddingBuddy.

## Files

### Original NDJSON Files

- `sample_data.ndjson` - Original sample documents in EmbeddingBuddy format
- `sample_prompts.ndjson` - Original sample prompts in EmbeddingBuddy format

### Elasticsearch Bulk Files

- `sample_data_es_bulk.ndjson` - Documents in ES bulk format (index: "embeddings")
- `sample_prompts_es_bulk.ndjson` - Prompts in ES bulk format (index: "prompts")

## Usage

### 1. Index the data using curl

```bash
# Index main documents
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_data_es_bulk.ndjson

# Index prompts
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_prompts_es_bulk.ndjson
```
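The bulk endpoint returns HTTP 200 even when individual documents are rejected, so it is worth inspecting the response body. The following is a minimal Python sketch, assuming the curl output has been parsed into a dict. `failed_items` is a hypothetical helper name, not part of EmbeddingBuddy, and the sample response is hand-written for illustration:

```python
import json


def failed_items(bulk_response: dict) -> list[dict]:
    """Return one entry per rejected document in a _bulk response."""
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is keyed by its action type, e.g. {"index": {...}}.
        for action, result in item.items():
            if "error" in result:
                failures.append({
                    "action": action,
                    "id": result.get("_id"),
                    "status": result.get("status"),
                    "reason": result["error"].get("reason"),
                })
    return failures


# Illustrative response, trimmed to the fields the helper reads.
sample_response = json.loads("""
{
  "took": 5,
  "errors": true,
  "items": [
    {"index": {"_index": "embeddings", "_id": "doc_001", "status": 201}},
    {"index": {"_index": "embeddings", "_id": "doc_002", "status": 400,
               "error": {"type": "mapper_parsing_exception",
                         "reason": "failed to parse"}}}
  ]
}
""")

for failure in failed_items(sample_response):
    print(failure)
```

An empty result means every document in the request was accepted.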

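### 2. Create proper mappings (recommended)

Run this step before the bulk commands in step 1. If the indices do not exist when data arrives, they are created with dynamic mappings, the `embedding` arrays are stored as plain `float` fields, and the `PUT` requests below fail because the index already exists. Create each index with an explicit `knn_vector` mapping first (this field type and the `index.knn` setting are specific to OpenSearch):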

```bash
# Create embeddings index with knn_vector mapping
curl -X PUT "localhost:9200/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "index.knn": true
    },
    "mappings": {
      "properties": {
        "id": {"type": "keyword"},
        "embedding": {
          "type": "knn_vector",
          "dimension": 8,
          "method": {
            "engine": "lucene",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "text": {"type": "text"},
        "category": {"type": "keyword"},
        "subcategory": {"type": "keyword"},
        "tags": {"type": "keyword"}
      }
    }
  }'

# Create prompts index with the same knn_vector mapping
curl -X PUT "localhost:9200/prompts" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "index.knn": true
    },
    "mappings": {
      "properties": {
        "id": {"type": "keyword"},
        "embedding": {
          "type": "knn_vector",
          "dimension": 8,
          "method": {
            "engine": "lucene",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "text": {"type": "text"},
        "category": {"type": "keyword"},
        "subcategory": {"type": "keyword"},
        "tags": {"type": "keyword"}
      }
    }
  }'
```

Then index the data with the bulk commands from step 1.

### 3. Test in EmbeddingBuddy

#### For "embeddings" index

- **OpenSearch URL**: `http://localhost:9200`
- **Index Name**: `embeddings`
- **Field Mapping**:
  - Embedding Field: `embedding`
  - Text Field: `text`
  - ID Field: `id`
  - Category Field: `category`
  - Subcategory Field: `subcategory`
  - Tags Field: `tags`

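#### For "embeddings-dense" index (alternative field names)

The commands in step 2 do not create this index, and neither bulk file listed under Files targets it. Index documents that use these field names into `embeddings-dense` before testing this configuration.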

- **OpenSearch URL**: `http://localhost:9200`
- **Index Name**: `embeddings-dense`
- **Field Mapping**:
  - Embedding Field: `vector`
  - Text Field: `content`
  - ID Field: `doc_id`
  - Category Field: `type`
  - Subcategory Field: `subtopic`
  - Tags Field: `keywords`
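
One way to picture what the field mapping does: each configured field name is looked up in the stored document and copied to a common internal name. The sketch below illustrates that idea only and is not EmbeddingBuddy's actual implementation. `apply_field_mapping` and the dictionary keys are hypothetical names.

```python
def apply_field_mapping(source: dict, field_mapping: dict) -> dict:
    """Rename index-specific fields to the canonical names used above.

    field_mapping maps canonical name -> field name in the index.
    Fields missing from the document are skipped.
    """
    return {
        canonical: source[index_field]
        for canonical, index_field in field_mapping.items()
        if index_field in source
    }


# Mapping for the alternative field names listed above.
ALTERNATIVE_MAPPING = {
    "id": "doc_id",
    "embedding": "vector",
    "text": "content",
    "category": "type",
    "subcategory": "subtopic",
    "tags": "keywords",
}

# The sample document from this README, using the alternative names.
doc = {
    "doc_id": "doc_001",
    "vector": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
    "content": "Machine learning algorithms are transforming healthcare...",
    "type": "technology",
    "subtopic": "healthcare",
    "keywords": ["ai", "medicine", "prediction"],
}

normalized = apply_field_mapping(doc, ALTERNATIVE_MAPPING)
print(normalized["id"], normalized["category"])
```

With the default field names, the mapping is the identity and the document passes through unchanged.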

## Data Structure

### Original Format (from NDJSON files)

```json
{
  "id": "doc_001",
  "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
  "text": "Machine learning algorithms are transforming healthcare...",
  "category": "technology",
  "subcategory": "healthcare",
  "tags": ["ai", "medicine", "prediction"]
}
```

### ES Bulk Format

```json
{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [...], "text": "...", "category": "...", ...}
```
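
The bulk format pairs each document with an action line that names the target index and document ID. The following is a minimal Python sketch of the conversion from the original format, assuming the document `id` is reused as `_id` as in the example above. `to_bulk_lines` is a hypothetical helper and not necessarily how the bundled bulk files were generated:

```python
import json
from typing import Iterable, Iterator


def to_bulk_lines(records: Iterable[dict], index_name: str) -> Iterator[str]:
    """Yield an action line followed by a source line for each record."""
    for record in records:
        action = {"index": {"_index": index_name, "_id": record["id"]}}
        yield json.dumps(action)
        yield json.dumps(record)


records = [
    {
        "id": "doc_001",
        "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
        "text": "Machine learning algorithms are transforming healthcare...",
        "category": "technology",
        "subcategory": "healthcare",
        "tags": ["ai", "medicine", "prediction"],
    }
]

# The bulk API requires the request body to end with a newline.
bulk_body = "\n".join(to_bulk_lines(records, "embeddings")) + "\n"
print(bulk_body, end="")
```

Each record must stay on a single line, which `json.dumps` guarantees when no `indent` is passed.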

### Alternative Field Names (dense vector format)

```json
{"index": {"_index": "embeddings-dense", "_id": "doc_001"}}
{"doc_id": "doc_001", "vector": [...], "content": "...", "type": "...", ...}
```

## Notes

- All embedding vectors are 8-dimensional for these sample files
- The alternative format demonstrates how EmbeddingBuddy's field mapping handles different field names
- For production use, you may want larger embedding dimensions (e.g., 384, 768, 1536)
- Vector similarity search relies on a vector field type: `knn_vector` in OpenSearch (used in the mappings above) or `dense_vector` in Elasticsearch
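
Because the `dimension` in the mapping must match the length of every indexed vector, a pre-flight check can catch mismatches before a bulk load. The following is a minimal sketch, assuming documents in the original format. `knn_mapping` and `check_dimensions` are hypothetical helpers, the mapping body mirrors the one in the Usage section, and `doc_bad` is an invented record used only to show a failure:

```python
def knn_mapping(dimension: int) -> dict:
    """Build an index body like the one in Usage for a given vector size."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                        "name": "hnsw",
                        "parameters": {},
                    },
                },
                "text": {"type": "text"},
                "category": {"type": "keyword"},
                "subcategory": {"type": "keyword"},
                "tags": {"type": "keyword"},
            }
        },
    }


def check_dimensions(docs: list[dict], dimension: int) -> list[str]:
    """Return the ids of documents whose embedding length is wrong."""
    return [d["id"] for d in docs if len(d.get("embedding", [])) != dimension]


docs = [
    {"id": "doc_001", "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3]},
    {"id": "doc_bad", "embedding": [0.1, 0.2, 0.3]},  # hypothetical bad record
]

body = knn_mapping(8)
declared = body["mappings"]["properties"]["embedding"]["dimension"]
print(check_dimensions(docs, declared))
```

Switching to a larger model then only requires changing the argument to `knn_mapping`, for example `knn_mapping(384)`.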