4.1 KiB
4.1 KiB
Elasticsearch/OpenSearch Sample Data
This directory contains sample data files in Elasticsearch bulk index format for testing the OpenSearch integration in EmbeddingBuddy.
Files
Original NDJSON Files
sample_data.ndjson
- Original sample documents in EmbeddingBuddy formatsample_prompts.ndjson
- Original sample prompts in EmbeddingBuddy format
Elasticsearch Bulk Files
sample_data_es_bulk.ndjson
- Documents in ES bulk format (index: "embeddings")sample_prompts_es_bulk.ndjson
- Prompts in ES bulk format (index: "prompts")
Usage
1. Index the data using curl
# Index main documents
curl -X POST "localhost:9200/_bulk" \
-H "Content-Type: application/x-ndjson" \
--data-binary @sample_data_es_bulk.ndjson
# Index prompts
curl -X POST "localhost:9200/_bulk" \
-H "Content-Type: application/x-ndjson" \
--data-binary @sample_prompts_es_bulk.ndjson
2. Create proper mappings (recommended)
First create the index with proper dense_vector mapping:
# Create embeddings index with dense_vector mapping
curl -X PUT "localhost:9200/embeddings" \
-H "Content-Type: application/json" \
-d '{
"settings": {
"index.knn": true
},
"mappings": {
"properties": {
"id": {"type": "keyword"},
"embedding": {
"type": "knn_vector",
"dimension": 8,
"method": {
"engine": "lucene",
"space_type": "cosinesimil",
"name": "hnsw",
"parameters": {}
}
},
"text": {"type": "text"},
"category": {"type": "keyword"},
"subcategory": {"type": "keyword"},
"tags": {"type": "keyword"}
}
}
}'
# Create dense vector index with alternative field names
curl -X PUT "localhost:9200/prompts" \
-H "Content-Type: application/json" \
-d '{
"settings": {
"index.knn": true
},
"mappings": {
"properties": {
"id": {"type": "keyword"},
"embedding": {
"type": "knn_vector",
"dimension": 8,
"method": {
"engine": "lucene",
"space_type": "cosinesimil",
"name": "hnsw",
"parameters": {}
}
},
"text": {"type": "text"},
"category": {"type": "keyword"},
"subcategory": {"type": "keyword"},
"tags": {"type": "keyword"}
}
}
}'
Then index the data using the bulk files above.
3. Test in EmbeddingBuddy
For "embeddings" index
- OpenSearch URL:
http://localhost:9200
- Index Name:
embeddings
- Field Mapping:
- Embedding Field:
embedding
- Text Field:
text
- ID Field:
id
- Category Field:
category
- Subcategory Field:
subcategory
- Tags Field:
tags
- Embedding Field:
For "embeddings-dense" index (alternative field names)
- OpenSearch URL:
http://localhost:9200
- Index Name:
embeddings-dense
- Field Mapping:
- Embedding Field:
vector
- Text Field:
content
- ID Field:
doc_id
- Category Field:
type
- Subcategory Field:
subtopic
- Tags Field:
keywords
- Embedding Field:
Data Structure
Original Format (from NDJSON files)
{
"id": "doc_001",
"embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
"text": "Machine learning algorithms are transforming healthcare...",
"category": "technology",
"subcategory": "healthcare",
"tags": ["ai", "medicine", "prediction"]
}
ES Bulk Format
{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [...], "text": "...", "category": "...", ...}
Alternative Field Names (dense vector format)
{"index": {"_index": "embeddings-dense", "_id": "doc_001"}}
{"doc_id": "doc_001", "vector": [...], "content": "...", "type": "...", ...}
Notes
- All embedding vectors are 8-dimensional for these sample files
- The alternative format demonstrates how EmbeddingBuddy's field mapping handles different field names
- For production use, you may want larger embedding dimensions (e.g., 384, 768, 1536)
- The
dense_vector
field type in Elasticsearch/OpenSearch enables vector similarity search