Austin Godber 9cf2f0e6fa this will load data from Opensearch.

it doesn't have prompts as well

2025-08-14 13:49:46 -07:00

Elasticsearch/OpenSearch Sample Data

This directory contains sample data files in Elasticsearch bulk index format for testing the OpenSearch integration in EmbeddingBuddy.

Files

Original NDJSON Files

- sample_data.ndjson - Original sample documents in EmbeddingBuddy format
- sample_prompts.ndjson - Original sample prompts in EmbeddingBuddy format

Elasticsearch Bulk Files

- sample_data_es_bulk.ndjson - Documents in ES bulk format (index: "embeddings")
- sample_prompts_es_bulk.ndjson - Prompts in ES bulk format (index: "prompts")

Usage

1. Index the data using curl

# Index main documents
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_data_es_bulk.ndjson

# Index prompts
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_prompts_es_bulk.ndjson
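
The same bulk request can be prepared from Python. The following is a minimal sketch using only the standard library; the helper name build_bulk_request is hypothetical, and the endpoint is assumed to be unauthenticated plain HTTP as in the curl commands above. It only builds the request. Passing the result to urllib.request.urlopen would send it.

```python
import urllib.request


def build_bulk_request(base_url: str, ndjson_payload: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request for the _bulk endpoint.

    Hypothetical helper: mirrors the curl commands above.
    """
    # The bulk API expects the body to end with a newline character.
    if not ndjson_payload.endswith(b"\n"):
        ndjson_payload += b"\n"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_bulk",
        data=ndjson_payload,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
```

To use it, read sample_data_es_bulk.ndjson in binary mode and pass the bytes together with http://localhost:9200.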

2. Create proper mappings (recommended)

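Create each index with an explicit knn_vector mapping before running the bulk commands in step 1. If the bulk request runs first, OpenSearch creates the index with dynamic mappings and the embedding field is not indexed as a k-NN vector.

# Create embeddings index with knn_vector mapping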
curl -X PUT "localhost:9200/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "index.knn": true
    },
    "mappings": {
      "properties": {
        "id": {"type": "keyword"},
        "embedding": {
          "type": "knn_vector",
          "dimension": 8,
          "method": {
            "engine": "lucene",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "text": {"type": "text"},
        "category": {"type": "keyword"},
        "subcategory": {"type": "keyword"},
        "tags": {"type": "keyword"}
      }
    }
  }'

# Create prompts index with the same knn_vector mapping and field names
curl -X PUT "localhost:9200/prompts" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "index.knn": true
    },
    "mappings": {
      "properties": {
        "id": {"type": "keyword"},
        "embedding": {
          "type": "knn_vector",
          "dimension": 8,
          "method": {
            "engine": "lucene",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "text": {"type": "text"},
        "category": {"type": "keyword"},
        "subcategory": {"type": "keyword"},
        "tags": {"type": "keyword"}
      }
    }
  }'

Then index the data using the bulk commands from step 1.
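
Because both index definitions differ only in the index name, the request body can be generated rather than copied. This is a sketch, not part of EmbeddingBuddy: the function name knn_index_body is hypothetical, and the body simply reproduces the JSON shown above with the vector dimension as a parameter.

```python
import json


def knn_index_body(dimension: int = 8) -> dict:
    """Return the index-creation body used above for a given vector dimension."""
    if not isinstance(dimension, int) or dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                        "name": "hnsw",
                        "parameters": {},
                    },
                },
                "text": {"type": "text"},
                "category": {"type": "keyword"},
                "subcategory": {"type": "keyword"},
                "tags": {"type": "keyword"},
            }
        },
    }


# Serialize for use as the body of PUT /embeddings or PUT /prompts.
body_json = json.dumps(knn_index_body(8), indent=2)
```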

3. Test in EmbeddingBuddy

For "embeddings" index

OpenSearch URL: http://localhost:9200
Index Name: embeddings
Field Mapping:
- Embedding Field: embedding
- Text Field: text
- ID Field: id
- Category Field: category
- Subcategory Field: subcategory
- Tags Field: tags

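For "embeddings-dense" index (alternative field names)

None of the bulk files listed above targets this index. These settings apply once documents using the alternative field names (see Data Structure) have been indexed into embeddings-dense.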

OpenSearch URL: http://localhost:9200
Index Name: embeddings-dense
Field Mapping:
- Embedding Field: vector
- Text Field: content
- ID Field: doc_id
- Category Field: type
- Subcategory Field: subtopic
- Tags Field: keywords
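
Conceptually, a field mapping like the one above tells the loader which source field to read for each canonical field. The sketch below illustrates that idea only; it is not EmbeddingBuddy's actual implementation, and the names ALT_FIELD_MAPPING and apply_field_mapping are hypothetical.

```python
# Canonical field name -> field name used in the "embeddings-dense" index.
ALT_FIELD_MAPPING = {
    "embedding": "vector",
    "text": "content",
    "id": "doc_id",
    "category": "type",
    "subcategory": "subtopic",
    "tags": "keywords",
}


def apply_field_mapping(source: dict, mapping: dict) -> dict:
    """Return a document keyed by canonical names, read from the mapped source fields.

    Source fields missing from the document are skipped rather than raising.
    """
    return {
        canonical: source[field]
        for canonical, field in mapping.items()
        if field in source
    }
```

With this mapping, a source document such as {"doc_id": "doc_001", "vector": [...], "content": "..."} is read as id, embedding, and text.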

Data Structure

Original Format (from NDJSON files)

{
  "id": "doc_001",
  "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
  "text": "Machine learning algorithms are transforming healthcare...",
  "category": "technology",
  "subcategory": "healthcare",
  "tags": ["ai", "medicine", "prediction"]
}

ES Bulk Format

{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [...], "text": "...", "category": "...", ...}
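
The bulk format pairs each document with an action line that names the target index and document ID. A minimal sketch of that conversion follows, assuming every document carries an "id" field as in the original format; the function name to_bulk_lines is hypothetical.

```python
import json


def to_bulk_lines(docs: list, index_name: str) -> str:
    """Convert documents to bulk NDJSON: one action line, then one source line."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"
```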

Alternative Field Names (dense vector format)

{"index": {"_index": "embeddings-dense", "_id": "doc_001"}}
{"doc_id": "doc_001", "vector": [...], "content": "...", "type": "...", ...}

Notes

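- All embedding vectors in these sample files are 8-dimensional, matching "dimension": 8 in the mappings above
- The alternative format demonstrates how EmbeddingBuddy's field mapping handles different field names
- For production use, you will likely need larger embedding dimensions (e.g., 384, 768, 1536), depending on the embedding model; the dimension in the mapping must match
- Vector similarity search relies on the knn_vector field type in OpenSearch (used in the mappings above) and on the dense_vector field type in Elasticsearch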
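
Since the vector length must match the dimension declared in the mapping, it can help to check a file before indexing it. The sketch below assumes the original NDJSON format (one document per line, not the bulk format, whose action lines carry no vector); the function name check_dimensions is hypothetical.

```python
import json


def check_dimensions(ndjson_text: str, field: str = "embedding", expected: int = 8) -> list:
    """Return the ids of documents whose vector is missing or has the wrong length."""
    bad_ids = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        vector = doc.get(field)
        if not isinstance(vector, list) or len(vector) != expected:
            bad_ids.append(doc.get("id"))
    return bad_ids
```

An empty result means every document in the file matches the expected dimension.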
