This will load data from OpenSearch.

It does not load prompts yet.
2025-08-14 13:49:46 -07:00
parent a2adc8b958
commit 9cf2f0e6fa
16 changed files with 1694 additions and 7 deletions


@@ -0,0 +1,157 @@
# Elasticsearch/OpenSearch Sample Data
This directory contains sample data files in Elasticsearch bulk index format for testing the OpenSearch integration in EmbeddingBuddy.
## Files
### Original NDJSON Files
- `sample_data.ndjson` - Original sample documents in EmbeddingBuddy format
- `sample_prompts.ndjson` - Original sample prompts in EmbeddingBuddy format
### Elasticsearch Bulk Files
- `sample_data_es_bulk.ndjson` - Documents in ES bulk format (index: "embeddings")
- `sample_prompts_es_bulk.ndjson` - Prompts in ES bulk format (index: "prompts")
## Usage
### 1. Index the data using curl
```bash
# Index main documents
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_data_es_bulk.ndjson

# Index prompts
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_prompts_es_bulk.ndjson
```
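The `_bulk` endpoint can return HTTP 200 even when individual documents fail, so it is worth inspecting the JSON that the curl commands print. The following is a minimal sketch of such a check. The function name `failed_bulk_items` and the sample response are hypothetical; the only parts of the response it relies on are the top-level `errors` flag and the per-item `error` object.

```python
import json


def failed_bulk_items(response_text):
    """Return (doc_id, error) pairs for items that failed in a _bulk response."""
    response = json.loads(response_text)
    failures = []
    # The top-level "errors" flag is true if at least one item failed.
    if not response.get("errors"):
        return failures
    for item in response.get("items", []):
        # Each item is keyed by its action name ("index", "create", ...).
        for result in item.values():
            if "error" in result:
                failures.append((result.get("_id"), result["error"]))
    return failures


# Hypothetical response body, trimmed to the fields used above.
sample_response = json.dumps({
    "took": 5,
    "errors": True,
    "items": [
        {"index": {"_index": "embeddings", "_id": "doc_001", "status": 201}},
        {"index": {"_index": "embeddings", "_id": "doc_002", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "example"}}},
    ],
})

for doc_id, error in failed_bulk_items(sample_response):
    print(doc_id, error["type"])
```

To use it with the commands above, save the curl output to a file and pass the file contents to `failed_bulk_items`.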
### 2. Create proper mappings (recommended)
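Create each index with an explicit `knn_vector` mapping before bulk indexing. If the bulk request from step 1 runs first, dynamic mapping creates the index and types `embedding` as a plain float field, so delete any such index before running these commands: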
```bash
# Create the embeddings index with a knn_vector mapping
curl -X PUT "localhost:9200/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "index.knn": true
    },
    "mappings": {
      "properties": {
        "id": {"type": "keyword"},
        "embedding": {
          "type": "knn_vector",
          "dimension": 8,
          "method": {
            "engine": "lucene",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "text": {"type": "text"},
        "category": {"type": "keyword"},
        "subcategory": {"type": "keyword"},
        "tags": {"type": "keyword"}
      }
    }
  }'

# Create the prompts index with the same knn_vector mapping
curl -X PUT "localhost:9200/prompts" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "index.knn": true
    },
    "mappings": {
      "properties": {
        "id": {"type": "keyword"},
        "embedding": {
          "type": "knn_vector",
          "dimension": 8,
          "method": {
            "engine": "lucene",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "text": {"type": "text"},
        "category": {"type": "keyword"},
        "subcategory": {"type": "keyword"},
        "tags": {"type": "keyword"}
      }
    }
  }'
```
Then index the data using the bulk files above.
### 3. Test in EmbeddingBuddy
#### For "embeddings" index
- **OpenSearch URL**: `http://localhost:9200`
- **Index Name**: `embeddings`
- **Field Mapping**:
  - Embedding Field: `embedding`
  - Text Field: `text`
  - ID Field: `id`
  - Category Field: `category`
  - Subcategory Field: `subcategory`
  - Tags Field: `tags`
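#### For "embeddings-dense" index (alternative field names)
None of the bulk files listed above targets this index, so create and populate it separately with documents that use the alternative field names.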
- **OpenSearch URL**: `http://localhost:9200`
- **Index Name**: `embeddings-dense`
- **Field Mapping**:
  - Embedding Field: `vector`
  - Text Field: `content`
  - ID Field: `doc_id`
  - Category Field: `type`
  - Subcategory Field: `subtopic`
  - Tags Field: `keywords`
## Data Structure
### Original Format (from NDJSON files)
```json
{
"id": "doc_001",
"embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
"text": "Machine learning algorithms are transforming healthcare...",
"category": "technology",
"subcategory": "healthcare",
"tags": ["ai", "medicine", "prediction"]
}
```
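Before indexing, it can help to check that each record matches this shape. The sketch below is one possible check, not part of EmbeddingBuddy: the function name `validate_document` is hypothetical, and treating `id`, `embedding`, and `text` as the required fields is an assumption. The expected dimension of 8 comes from the sample files.

```python
# Assumption: these three fields are the minimum a record needs.
REQUIRED_FIELDS = ("id", "embedding", "text")
# The sample files use 8-dimensional vectors.
EXPECTED_DIM = 8


def validate_document(doc, dim=EXPECTED_DIM):
    """Return a list of problems found in one document (empty if none)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in doc]
    embedding = doc.get("embedding")
    if embedding is None:
        return problems
    if not isinstance(embedding, list):
        problems.append("embedding is not a list")
    elif len(embedding) != dim:
        problems.append(
            f"embedding has {len(embedding)} values, expected {dim}")
    elif not all(isinstance(v, (int, float)) and not isinstance(v, bool)
                 for v in embedding):
        problems.append("embedding contains non-numeric values")
    return problems


sample = {
    "id": "doc_001",
    "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
    "text": "Machine learning algorithms are transforming healthcare...",
    "category": "technology",
    "subcategory": "healthcare",
    "tags": ["ai", "medicine", "prediction"],
}
print(validate_document(sample))
```

A vector whose length differs from the `dimension` in the index mapping is rejected at index time, so catching it here gives a clearer error message.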
### ES Bulk Format
```json
{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [...], "text": "...", "category": "...", ...}
```
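The bulk format pairs every document with an action line that names the target index and document ID. As a rough illustration of how the original records could be converted, here is a small sketch. The helper name `to_bulk_lines` is hypothetical, and it assumes each record carries an `id` field to reuse as `_id`.

```python
import json


def to_bulk_lines(docs, index_name):
    """Build a _bulk request body: one action line, then one source line."""
    lines = []
    for doc in docs:
        # Reuse the document's own "id" as the index _id (assumption).
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"id": "doc_001",
     "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
     "text": "Machine learning algorithms are transforming healthcare...",
     "category": "technology",
     "subcategory": "healthcare",
     "tags": ["ai", "medicine", "prediction"]},
]
body = to_bulk_lines(docs, "embeddings")
print(body, end="")
```

Writing `body` to a file gives something that can be sent with the `--data-binary` curl commands from the Usage section.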
### Alternative Field Names (dense vector format)
```json
{"index": {"_index": "embeddings-dense", "_id": "doc_001"}}
{"doc_id": "doc_001", "vector": [...], "content": "...", "type": "...", ...}
```
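The alternative format differs from the original only in its field names. A minimal sketch of that renaming is shown below, using the name pairs from the two field mappings in this README. The helper `rename_fields` is hypothetical and only illustrates the correspondence; it is not how EmbeddingBuddy performs its mapping.

```python
# Original field name -> alternative field name, as listed in this README.
ALTERNATIVE_NAMES = {
    "id": "doc_id",
    "embedding": "vector",
    "text": "content",
    "category": "type",
    "subcategory": "subtopic",
    "tags": "keywords",
}


def rename_fields(doc, names=ALTERNATIVE_NAMES):
    """Return a copy of doc with keys renamed; unknown keys are kept as-is."""
    return {names.get(key, key): value for key, value in doc.items()}


original = {
    "id": "doc_001",
    "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
    "text": "Machine learning algorithms are transforming healthcare...",
    "category": "technology",
    "subcategory": "healthcare",
    "tags": ["ai", "medicine", "prediction"],
}
alternative = rename_fields(original)
print(sorted(alternative))
```

Inverting the dictionary gives the reverse direction, which mirrors what the field mapping settings in step 3 express.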
## Notes
- All embedding vectors are 8-dimensional for these sample files
- The alternative format demonstrates how EmbeddingBuddy's field mapping handles different field names
- For production use, you may want larger embedding dimensions (e.g., 384, 768, 1536)
- The `knn_vector` field type used in the mappings above is OpenSearch-specific and enables vector similarity search; Elasticsearch uses the `dense_vector` field type instead
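When moving to larger embeddings, the `dimension` in the index mapping has to match the vector length. The sketch below generates an index body with the same structure as the curl examples in the Usage section, with the dimension as a parameter. The function name `knn_index_body` is hypothetical, and the engine, space type, and method settings are simply copied from those examples rather than tuned.

```python
import json


def knn_index_body(dimension, space_type="cosinesimil"):
    """Build an index body mirroring the README's knn_vector mapping."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "engine": "lucene",
                        "space_type": space_type,
                        "name": "hnsw",
                        "parameters": {},
                    },
                },
                "text": {"type": "text"},
                "category": {"type": "keyword"},
                "subcategory": {"type": "keyword"},
                "tags": {"type": "keyword"},
            }
        },
    }


# Example: a body for 384-dimensional embeddings.
body_384 = knn_index_body(384)
print(json.dumps(body_384, indent=2))
```

The printed JSON can be passed to `curl -X PUT` with `-d` in place of the inline body.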


@@ -0,0 +1,40 @@
{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3], "text": "Machine learning algorithms are transforming healthcare by enabling predictive analytics and personalized medicine.", "category": "technology", "subcategory": "healthcare", "tags": ["ai", "medicine", "prediction"]}
{"index": {"_index": "embeddings", "_id": "doc_002"}}
{"id": "doc_002", "embedding": [0.1, 0.4, -0.2, 0.6, 0.3, -0.4, 0.8, 0.2], "text": "Climate change poses significant challenges to global food security and agricultural sustainability.", "category": "environment", "subcategory": "agriculture", "tags": ["climate", "food", "sustainability"]}
{"index": {"_index": "embeddings", "_id": "doc_003"}}
{"id": "doc_003", "embedding": [-0.3, 0.7, 0.1, -0.2, 0.9, 0.4, -0.1, 0.5], "text": "The rise of electric vehicles is reshaping the automotive industry and urban transportation systems.", "category": "technology", "subcategory": "automotive", "tags": ["electric", "transport", "urban"]}
{"index": {"_index": "embeddings", "_id": "doc_004"}}
{"id": "doc_004", "embedding": [0.5, -0.6, 0.3, 0.8, -0.2, 0.1, 0.7, -0.4], "text": "Renewable energy sources like solar and wind are becoming increasingly cost-competitive with fossil fuels.", "category": "environment", "subcategory": "energy", "tags": ["renewable", "solar", "wind"]}
{"index": {"_index": "embeddings", "_id": "doc_005"}}
{"id": "doc_005", "embedding": [0.8, 0.2, -0.5, 0.1, 0.6, -0.3, 0.4, 0.9], "text": "Financial markets are experiencing volatility due to geopolitical tensions and inflation concerns.", "category": "finance", "subcategory": "markets", "tags": ["volatility", "inflation", "geopolitics"]}
{"index": {"_index": "embeddings", "_id": "doc_006"}}
{"id": "doc_006", "embedding": [-0.1, 0.5, 0.7, -0.4, 0.2, 0.8, -0.6, 0.3], "text": "Quantum computing research is advancing rapidly with potential applications in cryptography and drug discovery.", "category": "technology", "subcategory": "research", "tags": ["quantum", "cryptography", "research"]}
{"index": {"_index": "embeddings", "_id": "doc_007"}}
{"id": "doc_007", "embedding": [0.4, -0.3, 0.6, 0.7, -0.8, 0.2, 0.5, -0.1], "text": "Ocean pollution from plastic waste is threatening marine ecosystems and biodiversity worldwide.", "category": "environment", "subcategory": "marine", "tags": ["pollution", "plastic", "marine"]}
{"index": {"_index": "embeddings", "_id": "doc_008"}}
{"id": "doc_008", "embedding": [0.3, 0.8, -0.2, 0.5, 0.1, -0.7, 0.6, 0.4], "text": "Artificial intelligence is revolutionizing customer service through chatbots and automated support systems.", "category": "technology", "subcategory": "customer_service", "tags": ["ai", "chatbots", "automation"]}
{"index": {"_index": "embeddings", "_id": "doc_009"}}
{"id": "doc_009", "embedding": [-0.5, 0.3, 0.9, -0.1, 0.7, 0.4, -0.2, 0.8], "text": "Global supply chains are being redesigned for resilience after pandemic-related disruptions.", "category": "business", "subcategory": "logistics", "tags": ["supply_chain", "pandemic", "resilience"]}
{"index": {"_index": "embeddings", "_id": "doc_010"}}
{"id": "doc_010", "embedding": [0.7, -0.4, 0.2, 0.9, -0.3, 0.6, 0.1, -0.8], "text": "Space exploration missions are expanding our understanding of the solar system and potential for life.", "category": "science", "subcategory": "space", "tags": ["space", "exploration", "life"]}
{"index": {"_index": "embeddings", "_id": "doc_011"}}
{"id": "doc_011", "embedding": [-0.2, 0.6, 0.4, -0.7, 0.8, 0.3, -0.5, 0.1], "text": "Cryptocurrency adoption is growing among institutional investors despite regulatory uncertainties.", "category": "finance", "subcategory": "crypto", "tags": ["cryptocurrency", "institutional", "regulation"]}
{"index": {"_index": "embeddings", "_id": "doc_012"}}
{"id": "doc_012", "embedding": [0.6, 0.1, -0.8, 0.4, 0.5, -0.2, 0.9, -0.3], "text": "Remote work technologies are transforming traditional office environments and work-life balance.", "category": "technology", "subcategory": "workplace", "tags": ["remote", "work", "balance"]}
{"index": {"_index": "embeddings", "_id": "doc_013"}}
{"id": "doc_013", "embedding": [0.1, -0.7, 0.5, 0.8, -0.4, 0.3, 0.2, 0.6], "text": "Gene therapy breakthroughs are offering new hope for treating previously incurable genetic diseases.", "category": "science", "subcategory": "medicine", "tags": ["gene_therapy", "genetics", "medicine"]}
{"index": {"_index": "embeddings", "_id": "doc_014"}}
{"id": "doc_014", "embedding": [-0.4, 0.2, 0.7, -0.1, 0.9, -0.6, 0.3, 0.5], "text": "Urban planning is evolving to create more sustainable and livable cities for growing populations.", "category": "environment", "subcategory": "urban", "tags": ["urban_planning", "sustainability", "cities"]}
{"index": {"_index": "embeddings", "_id": "doc_015"}}
{"id": "doc_015", "embedding": [0.9, -0.1, 0.3, 0.6, -0.5, 0.8, -0.2, 0.4], "text": "Social media platforms are implementing new policies to combat misinformation and protect user privacy.", "category": "technology", "subcategory": "social_media", "tags": ["social_media", "misinformation", "privacy"]}
{"index": {"_index": "embeddings", "_id": "doc_016"}}
{"id": "doc_016", "embedding": [-0.3, 0.8, -0.1, 0.4, 0.7, -0.5, 0.6, -0.9], "text": "Educational technology is personalizing learning experiences and improving student outcomes.", "category": "education", "subcategory": "technology", "tags": ["education", "personalization", "technology"]}
{"index": {"_index": "embeddings", "_id": "doc_017"}}
{"id": "doc_017", "embedding": [0.5, 0.3, -0.6, 0.2, 0.8, 0.1, -0.4, 0.7], "text": "Biodiversity conservation efforts are critical for maintaining ecosystem balance and preventing species extinction.", "category": "environment", "subcategory": "conservation", "tags": ["biodiversity", "conservation", "extinction"]}
{"index": {"_index": "embeddings", "_id": "doc_018"}}
{"id": "doc_018", "embedding": [0.2, -0.8, 0.4, 0.7, -0.1, 0.5, 0.9, -0.3], "text": "Healthcare systems are adopting telemedicine to improve access and reduce costs for patients.", "category": "technology", "subcategory": "healthcare", "tags": ["telemedicine", "healthcare", "access"]}
{"index": {"_index": "embeddings", "_id": "doc_019"}}
{"id": "doc_019", "embedding": [-0.7, 0.4, 0.8, -0.2, 0.3, 0.6, -0.1, 0.9], "text": "Autonomous vehicles are being tested extensively with promises of safer and more efficient transportation.", "category": "technology", "subcategory": "automotive", "tags": ["autonomous", "safety", "efficiency"]}
{"index": {"_index": "embeddings", "_id": "doc_020"}}
{"id": "doc_020", "embedding": [0.4, 0.7, -0.3, 0.9, -0.6, 0.2, 0.5, -0.1], "text": "Mental health awareness is increasing with new approaches to therapy and workplace wellness programs.", "category": "health", "subcategory": "mental", "tags": ["mental_health", "therapy", "wellness"]}


@@ -0,0 +1,20 @@
{"index": {"_index": "prompts", "_id": "prompt_001"}}
{"id": "prompt_001", "embedding": [0.15, -0.28, 0.65, 0.42, -0.11, 0.33, 0.78, -0.52], "text": "Find articles about machine learning applications", "category": "search", "subcategory": "technology", "tags": ["AI", "research"]}
{"index": {"_index": "prompts", "_id": "prompt_002"}}
{"id": "prompt_002", "embedding": [0.72, 0.18, -0.35, 0.51, 0.09, -0.44, 0.27, 0.63], "text": "Show me product reviews for smartphones", "category": "search", "subcategory": "product", "tags": ["mobile", "reviews"]}
{"index": {"_index": "prompts", "_id": "prompt_003"}}
{"id": "prompt_003", "embedding": [-0.21, 0.59, 0.34, -0.67, 0.45, 0.12, -0.38, 0.76], "text": "What are the latest political developments?", "category": "search", "subcategory": "news", "tags": ["politics", "current events"]}
{"index": {"_index": "prompts", "_id": "prompt_004"}}
{"id": "prompt_004", "embedding": [0.48, -0.15, 0.72, 0.31, -0.58, 0.24, 0.67, -0.39], "text": "Summarize recent tech industry trends", "category": "analysis", "subcategory": "technology", "tags": ["tech", "trends", "summary"]}
{"index": {"_index": "prompts", "_id": "prompt_005"}}
{"id": "prompt_005", "embedding": [-0.33, 0.47, -0.62, 0.28, 0.71, -0.18, 0.54, 0.35], "text": "Compare different smartphone models", "category": "analysis", "subcategory": "product", "tags": ["comparison", "mobile", "evaluation"]}
{"index": {"_index": "prompts", "_id": "prompt_006"}}
{"id": "prompt_006", "embedding": [0.64, 0.21, 0.39, -0.45, 0.13, 0.58, -0.27, 0.74], "text": "Analyze voter sentiment on recent policies", "category": "analysis", "subcategory": "politics", "tags": ["sentiment", "politics", "analysis"]}
{"index": {"_index": "prompts", "_id": "prompt_007"}}
{"id": "prompt_007", "embedding": [0.29, -0.43, 0.56, 0.68, -0.22, 0.37, 0.14, -0.61], "text": "Generate a summary of machine learning research", "category": "generation", "subcategory": "technology", "tags": ["AI", "research", "summary"]}
{"index": {"_index": "prompts", "_id": "prompt_008"}}
{"id": "prompt_008", "embedding": [-0.17, 0.52, -0.48, 0.36, 0.74, -0.29, 0.61, 0.18], "text": "Create a product recommendation report", "category": "generation", "subcategory": "product", "tags": ["recommendation", "report", "analysis"]}
{"index": {"_index": "prompts", "_id": "prompt_009"}}
{"id": "prompt_009", "embedding": [0.55, 0.08, 0.41, -0.37, 0.26, 0.69, -0.14, 0.58], "text": "Write a news brief on election updates", "category": "generation", "subcategory": "news", "tags": ["election", "news", "brief"]}
{"index": {"_index": "prompts", "_id": "prompt_010"}}
{"id": "prompt_010", "embedding": [0.23, -0.59, 0.47, 0.61, -0.35, 0.18, 0.72, -0.26], "text": "Explain how neural networks work", "category": "explanation", "subcategory": "technology", "tags": ["AI", "education", "neural networks"]}