this will load data from Opensearch.

it doesn't have prompts as well
2025-08-14 13:49:46 -07:00
parent a2adc8b958
commit 9cf2f0e6fa
16 changed files with 1694 additions and 7 deletions
--- a/example/README_elasticsearch.md
+++ b/example/README_elasticsearch.md
@@ -0,0 +1,157 @@
+# Elasticsearch/OpenSearch Sample Data
+
+This directory contains sample data files in Elasticsearch bulk index format for testing the OpenSearch integration in EmbeddingBuddy.
+
+## Files
+
+### Original NDJSON Files
+
+- `sample_data.ndjson` - Original sample documents in EmbeddingBuddy format
+- `sample_prompts.ndjson` - Original sample prompts in EmbeddingBuddy format
+
+### Elasticsearch Bulk Files
+
+- `sample_data_es_bulk.ndjson` - Documents in ES bulk format (index: "embeddings")
+- `sample_prompts_es_bulk.ndjson` - Prompts in ES bulk format (index: "prompts")
+
+## Usage
+
+### 1. Index the data using curl
+
+```bash
+# Index main documents
+curl -X POST "localhost:9200/_bulk" \
+  -H "Content-Type: application/x-ndjson" \
+  --data-binary @sample_data_es_bulk.ndjson
+
+# Index prompts
+curl -X POST "localhost:9200/_bulk" \
+  -H "Content-Type: application/x-ndjson" \
+  --data-binary @sample_prompts_es_bulk.ndjson
+```
+
+### 2. Create proper mappings (recommended)
+
+First create the index with proper dense_vector mapping:
+
+```bash
+# Create embeddings index with dense_vector mapping
+curl -X PUT "localhost:9200/embeddings" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "settings": {
+      "index.knn": true
+    },
+    "mappings": {
+      "properties": {
+        "id": {"type": "keyword"},
+        "embedding": {
+          "type": "knn_vector",
+          "dimension": 8,
+          "method": {
+            "engine": "lucene",
+            "space_type": "cosinesimil",
+            "name": "hnsw",
+            "parameters": {}
+          }
+        },
+        "text": {"type": "text"},
+        "category": {"type": "keyword"},
+        "subcategory": {"type": "keyword"},
+        "tags": {"type": "keyword"}
+      }
+    }
+  }'
+
+# Create dense vector index with alternative field names
+curl -X PUT "localhost:9200/prompts" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "settings": {
+      "index.knn": true
+    },
+    "mappings": {
+      "properties": {
+        "id": {"type": "keyword"},
+        "embedding": {
+          "type": "knn_vector",
+          "dimension": 8,
+          "method": {
+            "engine": "lucene",
+            "space_type": "cosinesimil",
+            "name": "hnsw",
+            "parameters": {}
+          }
+        },
+        "text": {"type": "text"},
+        "category": {"type": "keyword"},
+        "subcategory": {"type": "keyword"},
+        "tags": {"type": "keyword"}
+      }
+    }
+  }'
+```
+
+Then index the data using the bulk files above.
+
+### 3. Test in EmbeddingBuddy
+
+#### For "embeddings" index
+
+- **OpenSearch URL**: `http://localhost:9200`
+- **Index Name**: `embeddings`
+- **Field Mapping**:
+  - Embedding Field: `embedding`
+  - Text Field: `text`
+  - ID Field: `id`
+  - Category Field: `category`
+  - Subcategory Field: `subcategory`
+  - Tags Field: `tags`
+
+#### For "embeddings-dense" index (alternative field names)
+
+- **OpenSearch URL**: `http://localhost:9200`
+- **Index Name**: `embeddings-dense`
+- **Field Mapping**:
+  - Embedding Field: `vector`
+  - Text Field: `content`
+  - ID Field: `doc_id`
+  - Category Field: `type`
+  - Subcategory Field: `subtopic`
+  - Tags Field: `keywords`
+
+## Data Structure
+
+### Original Format (from NDJSON files)
+
+```json
+{
+  "id": "doc_001",
+  "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
+  "text": "Machine learning algorithms are transforming healthcare...",
+  "category": "technology",
+  "subcategory": "healthcare",
+  "tags": ["ai", "medicine", "prediction"]
+}
+```
+
+### ES Bulk Format
+
+```json
+{"index": {"_index": "embeddings", "_id": "doc_001"}}
+{"id": "doc_001", "embedding": [...], "text": "...", "category": "...", ...}
+```
+
+### Alternative Field Names (dense vector format)
+
+```json
+{"index": {"_index": "embeddings-dense", "_id": "doc_001"}}
+{"doc_id": "doc_001", "vector": [...], "content": "...", "type": "...", ...}
+```
+
+## Notes
+
+- All embedding vectors are 8-dimensional for these sample files
+- The alternative format demonstrates how EmbeddingBuddy's field mapping handles different field names
+- For production use, you may want larger embedding dimensions (e.g., 384, 768, 1536)
+- The `dense_vector` field type in Elasticsearch/OpenSearch enables vector similarity search