embedding-buddy/example/README_elasticsearch.md
Austin Godber 9cf2f0e6fa this will load data from Opensearch.
it doesn't have prompts as well
2025-08-14 13:49:46 -07:00

Elasticsearch/OpenSearch Sample Data

This directory contains sample data files in Elasticsearch bulk index format for testing the OpenSearch integration in EmbeddingBuddy.

Files

Original NDJSON Files

  • sample_data.ndjson - Original sample documents in EmbeddingBuddy format
  • sample_prompts.ndjson - Original sample prompts in EmbeddingBuddy format

Elasticsearch Bulk Files

  • sample_data_es_bulk.ndjson - Documents in ES bulk format (index: "embeddings")
  • sample_prompts_es_bulk.ndjson - Prompts in ES bulk format (index: "prompts")

Usage

1. Index the data using curl

# Index main documents
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_data_es_bulk.ndjson

# Index prompts
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_prompts_es_bulk.ndjson
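The _bulk endpoint can return HTTP 200 even when individual documents are rejected, so it helps to inspect the response body. The following is a minimal sketch, assuming the curl output has been parsed into a Python dict; the function name summarize_bulk_response and the sample response are hypothetical and only illustrate the response shape ("errors" flag plus per-item results).

```python
from typing import Any


def summarize_bulk_response(response: dict[str, Any]) -> list[str]:
    """Return one message per failed item in a _bulk response body."""
    failures: list[str] = []
    # The top-level "errors" flag is true if at least one item failed.
    if not response.get("errors", False):
        return failures
    for item in response.get("items", []):
        # Each item is keyed by its action name, e.g. "index" or "create".
        for action, result in item.items():
            if "error" in result:
                reason = result["error"].get("reason", "unknown reason")
                failures.append(f"{action} {result.get('_id')}: {reason}")
    return failures


# Hypothetical response shaped like a _bulk reply with one rejected document.
sample_response = {
    "took": 5,
    "errors": True,
    "items": [
        {"index": {"_index": "embeddings", "_id": "doc_001", "status": 201}},
        {
            "index": {
                "_index": "embeddings",
                "_id": "doc_002",
                "status": 400,
                "error": {
                    "type": "mapper_parsing_exception",
                    "reason": "failed to parse",
                },
            }
        },
    ],
}

print(summarize_bulk_response(sample_response))
```

An empty list means every document in the request was accepted.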

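2. Create the indices with an explicit knn_vector mapping

Run these commands before the bulk requests in step 1, so that the embedding field is mapped as a knn_vector instead of being dynamically mapped as plain floats: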

# Create the embeddings index with a knn_vector mapping
curl -X PUT "localhost:9200/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "index.knn": true
    },
    "mappings": {
      "properties": {
        "id": {"type": "keyword"},
        "embedding": {
          "type": "knn_vector",
          "dimension": 8,
          "method": {
            "engine": "lucene",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "text": {"type": "text"},
        "category": {"type": "keyword"},
        "subcategory": {"type": "keyword"},
        "tags": {"type": "keyword"}
      }
    }
  }'

# Create the prompts index with the same knn_vector mapping
curl -X PUT "localhost:9200/prompts" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "index.knn": true
    },
    "mappings": {
      "properties": {
        "id": {"type": "keyword"},
        "embedding": {
          "type": "knn_vector",
          "dimension": 8,
          "method": {
            "engine": "lucene",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "text": {"type": "text"},
        "category": {"type": "keyword"},
        "subcategory": {"type": "keyword"},
        "tags": {"type": "keyword"}
      }
    }
  }'

Then index the data with the bulk commands from step 1.
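Both index definitions above differ only in the index name, and the dimension must match the length of the vectors being indexed. As a rough sketch, the request body can be generated for any dimension; build_knn_index_body is a hypothetical helper, not part of EmbeddingBuddy, and it simply reproduces the mapping shown above.

```python
import json


def build_knn_index_body(dimension: int) -> dict:
    """Build the index-creation body used above for a given vector dimension."""
    if dimension < 1:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                        "name": "hnsw",
                        "parameters": {},
                    },
                },
                "text": {"type": "text"},
                "category": {"type": "keyword"},
                "subcategory": {"type": "keyword"},
                "tags": {"type": "keyword"},
            }
        },
    }


# The sample files use 8-dimensional vectors.
body = build_knn_index_body(8)
print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and passed to curl with -d @file in place of the inline body.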

3. Test in EmbeddingBuddy

For "embeddings" index

  • OpenSearch URL: http://localhost:9200
  • Index Name: embeddings
  • Field Mapping:
    • Embedding Field: embedding
    • Text Field: text
    • ID Field: id
    • Category Field: category
    • Subcategory Field: subcategory
    • Tags Field: tags

For "embeddings-dense" index (alternative field names)

  • OpenSearch URL: http://localhost:9200
  • Index Name: embeddings-dense
  • Field Mapping:
    • Embedding Field: vector
    • Text Field: content
    • ID Field: doc_id
    • Category Field: type
    • Subcategory Field: subtopic
    • Tags Field: keywords
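The two field-mapping tables above describe the same logical fields under different names. The sketch below illustrates the idea of such a mapping; it is not EmbeddingBuddy's actual implementation, and normalize_source and ALT_FIELD_MAPPING are hypothetical names used only for illustration.

```python
# Logical field name -> field name in the "embeddings-dense" index,
# taken from the field mapping listed above.
ALT_FIELD_MAPPING = {
    "embedding": "vector",
    "text": "content",
    "id": "doc_id",
    "category": "type",
    "subcategory": "subtopic",
    "tags": "keywords",
}


def normalize_source(source: dict, field_mapping: dict) -> dict:
    """Rename a document's fields to the logical names; missing fields become None."""
    return {
        logical: source.get(actual)
        for logical, actual in field_mapping.items()
    }


# A document as it would be stored with the alternative field names.
alt_source = {
    "doc_id": "doc_001",
    "vector": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
    "content": "Machine learning algorithms are transforming healthcare...",
    "type": "technology",
    "subtopic": "healthcare",
    "keywords": ["ai", "medicine", "prediction"],
}

normalized = normalize_source(alt_source, ALT_FIELD_MAPPING)
print(normalized["id"], normalized["category"], len(normalized["embedding"]))
```

Using an identity mapping (each logical name mapped to itself) gives the behavior for the "embeddings" index.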

Data Structure

Original Format (from NDJSON files)

{
  "id": "doc_001",
  "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
  "text": "Machine learning algorithms are transforming healthcare...",
  "category": "technology",
  "subcategory": "healthcare",
  "tags": ["ai", "medicine", "prediction"]
}

ES Bulk Format

{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [...], "text": "...", "category": "...", ...}
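To make the relationship between the two formats concrete, here is a minimal sketch that turns one record in the original format into the action line and source line of the bulk format. The helper name to_bulk_lines is hypothetical; the record is the example from the section above.

```python
import json


def to_bulk_lines(record: dict, index_name: str) -> list[str]:
    """Return the action line and the source line for one document."""
    action = {"index": {"_index": index_name, "_id": record["id"]}}
    return [json.dumps(action), json.dumps(record)]


record = {
    "id": "doc_001",
    "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
    "text": "Machine learning algorithms are transforming healthcare...",
    "category": "technology",
    "subcategory": "healthcare",
    "tags": ["ai", "medicine", "prediction"],
}

lines = to_bulk_lines(record, "embeddings")
# A bulk request body must end with a newline after the last line.
bulk_body = "\n".join(lines) + "\n"
print(bulk_body, end="")
```

Applying this to every line of an NDJSON file and concatenating the results yields a file suitable for --data-binary.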

Alternative Field Names (dense vector format)

{"index": {"_index": "embeddings-dense", "_id": "doc_001"}}
{"doc_id": "doc_001", "vector": [...], "content": "...", "type": "...", ...}

Notes

  • All embedding vectors are 8-dimensional for these sample files
  • The alternative format demonstrates how EmbeddingBuddy's field mapping handles different field names
  • For production use, you may want larger embedding dimensions (e.g., 384, 768, 1536)
  • The knn_vector field type in OpenSearch (dense_vector in Elasticsearch) enables vector similarity search
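Because the index mapping fixes the vector dimension, documents with a different vector length are rejected at index time. The following sketch checks an NDJSON string before indexing; check_dimensions is a hypothetical helper, and the second sample record (doc_bad) is invented to show a mismatch.

```python
import json


def check_dimensions(ndjson_text: str, field: str = "embedding",
                     expected: int = 8) -> list[tuple[int, int]]:
    """Return (line_number, vector_length) for each vector of the wrong length."""
    mismatches = []
    for line_number, line in enumerate(ndjson_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if field not in record:
            continue  # skip bulk action lines, which carry no vector
        length = len(record[field])
        if length != expected:
            mismatches.append((line_number, length))
    return mismatches


sample_ndjson = "\n".join([
    json.dumps({"id": "doc_001",
                "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3]}),
    # Hypothetical malformed record with only three dimensions.
    json.dumps({"id": "doc_bad", "embedding": [0.1, 0.2, 0.3]}),
])

print(check_dimensions(sample_ndjson))
```

For larger models, pass the matching value as expected (for example 384, 768, or 1536) and use the same value for the dimension in the index mapping.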