This will load data from OpenSearch.
it doesn't have prompts as well
157
example/README_elasticsearch.md
Normal file
@@ -0,0 +1,157 @@
# Elasticsearch/OpenSearch Sample Data

This directory contains sample data files in Elasticsearch bulk index format for testing the OpenSearch integration in EmbeddingBuddy.

## Files

### Original NDJSON Files

- `sample_data.ndjson` - Original sample documents in EmbeddingBuddy format
- `sample_prompts.ndjson` - Original sample prompts in EmbeddingBuddy format

### Elasticsearch Bulk Files

- `sample_data_es_bulk.ndjson` - Documents in ES bulk format (index: "embeddings")
- `sample_prompts_es_bulk.ndjson` - Prompts in ES bulk format (index: "prompts")

## Usage

### 1. Index the data using curl

```bash
# Index main documents
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_data_es_bulk.ndjson

# Index prompts
curl -X POST "localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @sample_prompts_es_bulk.ndjson
```
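
A bulk request can return HTTP 200 even when individual documents are rejected; the response body carries a top-level `errors` flag and a per-item `items` array. The sketch below shows one way to list the failed items from a parsed response. The helper name `summarize_bulk_response` and the sample response are illustrative assumptions, not part of EmbeddingBuddy or real server output.

```python
def summarize_bulk_response(response: dict) -> list[str]:
    """Return one message per failed item in a parsed _bulk response."""
    if not response.get("errors"):
        return []
    failures = []
    for item in response.get("items", []):
        # Each item is keyed by its action name, e.g. {"index": {...}}.
        for action, result in item.items():
            if "error" in result:
                reason = result["error"].get("reason", "unknown")
                failures.append(f"{action} {result.get('_id')}: {reason}")
    return failures


# Hand-written stand-in for a bulk response (assumption, not captured output).
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "doc_001", "status": 201}},
        {
            "index": {
                "_id": "doc_002",
                "status": 400,
                "error": {
                    "type": "mapper_parsing_exception",
                    "reason": "failed to parse",
                },
            }
        },
    ],
}

for message in summarize_bulk_response(sample_response):
    print(message)
```

Saving the curl output to a file and passing `json.load(...)` of that file to this helper is one way to use it.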
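
### 2. Create proper mappings (recommended)

Create the indices with an explicit `knn_vector` mapping before running the bulk requests from step 1. If a bulk request has already auto-created an index, delete that index first; otherwise the `PUT` requests below fail because the index already exists.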

```bash
# Create embeddings index with knn_vector mapping
curl -X PUT "localhost:9200/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "embedding": {
        "type": "knn_vector",
        "dimension": 8,
        "method": {
          "engine": "lucene",
          "space_type": "cosinesimil",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "text": {"type": "text"},
      "category": {"type": "keyword"},
      "subcategory": {"type": "keyword"},
      "tags": {"type": "keyword"}
    }
  }
}'

# Create prompts index with the same knn_vector mapping
curl -X PUT "localhost:9200/prompts" \
  -H "Content-Type: application/json" \
  -d '{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "embedding": {
        "type": "knn_vector",
        "dimension": 8,
        "method": {
          "engine": "lucene",
          "space_type": "cosinesimil",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "text": {"type": "text"},
      "category": {"type": "keyword"},
      "subcategory": {"type": "keyword"},
      "tags": {"type": "keyword"}
    }
  }
}'
```

Then index the data using the bulk files above.
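
Both index definitions share the same structure and differ only in the index name, so the request body can be generated rather than copied. The following is a minimal sketch that rebuilds the mapping shown above for an arbitrary dimension; `build_knn_index_body` is a hypothetical helper, not an EmbeddingBuddy function.

```python
import json


def build_knn_index_body(dimension: int, embedding_field: str = "embedding") -> dict:
    """Build an index-creation body mirroring the curl examples in this README."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                embedding_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                        "name": "hnsw",
                        "parameters": {},
                    },
                },
                "text": {"type": "text"},
                "category": {"type": "keyword"},
                "subcategory": {"type": "keyword"},
                "tags": {"type": "keyword"},
            }
        },
    }


# The sample files use 8-dimensional vectors.
body = build_knn_index_body(8)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to `curl -d` in place of the inline body.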

### 3. Test in EmbeddingBuddy

#### For "embeddings" index

- **OpenSearch URL**: `http://localhost:9200`
- **Index Name**: `embeddings`
- **Field Mapping**:
  - Embedding Field: `embedding`
  - Text Field: `text`
  - ID Field: `id`
  - Category Field: `category`
  - Subcategory Field: `subcategory`
  - Tags Field: `tags`

#### For "embeddings-dense" index (alternative field names)

The bulk files listed above target only the `embeddings` and `prompts` indices. To try this configuration, index your own documents into `embeddings-dense` using the field names below.

- **OpenSearch URL**: `http://localhost:9200`
- **Index Name**: `embeddings-dense`
- **Field Mapping**:
  - Embedding Field: `vector`
  - Text Field: `content`
  - ID Field: `doc_id`
  - Category Field: `type`
  - Subcategory Field: `subtopic`
  - Tags Field: `keywords`

## Data Structure

### Original Format (from NDJSON files)

```json
{
  "id": "doc_001",
  "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
  "text": "Machine learning algorithms are transforming healthcare...",
  "category": "technology",
  "subcategory": "healthcare",
  "tags": ["ai", "medicine", "prediction"]
}
```

### ES Bulk Format

```json
{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [...], "text": "...", "category": "...", ...}
```
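
The bulk format pairs each document with an action line that names the target index and document ID. As a rough sketch of how the original NDJSON records map to that format, the hypothetical helper `to_bulk_lines` below emits the two lines per record; it is an illustration, not the script used to produce the sample files.

```python
import json


def to_bulk_lines(records: list[dict], index_name: str) -> list[str]:
    """Turn EmbeddingBuddy-style records into action/document line pairs."""
    lines = []
    for record in records:
        # Action line: which index and _id the following document goes to.
        action = {"index": {"_index": index_name, "_id": record["id"]}}
        lines.append(json.dumps(action))
        # Source line: the document itself, unchanged.
        lines.append(json.dumps(record))
    return lines


records = [
    {
        "id": "doc_001",
        "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
        "text": "Machine learning algorithms are transforming healthcare...",
        "category": "technology",
        "subcategory": "healthcare",
        "tags": ["ai", "medicine", "prediction"],
    }
]

bulk_lines = to_bulk_lines(records, "embeddings")
# A bulk request body must end with a newline.
bulk_body = "\n".join(bulk_lines) + "\n"
print(bulk_body, end="")
```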

### Alternative Field Names (dense vector format)

```json
{"index": {"_index": "embeddings-dense", "_id": "doc_001"}}
{"doc_id": "doc_001", "vector": [...], "content": "...", "type": "...", ...}
```
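
Documents with the alternative field names can be derived from the original records by renaming keys. The sketch below assumes the name pairs listed in the "embeddings-dense" field mapping; `FIELD_MAP` and `rename_fields` are illustrative names, not part of EmbeddingBuddy.

```python
# Original field name -> alternative field name, taken from the
# "embeddings-dense" field mapping in this README.
FIELD_MAP = {
    "id": "doc_id",
    "embedding": "vector",
    "text": "content",
    "category": "type",
    "subcategory": "subtopic",
    "tags": "keywords",
}


def rename_fields(record: dict, field_map: dict) -> dict:
    """Return a copy of record with keys renamed; unknown keys pass through."""
    return {field_map.get(key, key): value for key, value in record.items()}


original = {
    "id": "doc_001",
    "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3],
    "text": "Machine learning algorithms are transforming healthcare...",
    "category": "technology",
    "subcategory": "healthcare",
    "tags": ["ai", "medicine", "prediction"],
}

renamed = rename_fields(original, FIELD_MAP)
print(sorted(renamed))
```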

## Notes

- All embedding vectors are 8-dimensional for these sample files
- The alternative format demonstrates how EmbeddingBuddy's field mapping handles different field names
- For production use, you may want larger embedding dimensions (e.g., 384, 768, 1536)
- Vector similarity search relies on the `knn_vector` field type in OpenSearch (used in the mappings above) and on the `dense_vector` field type in Elasticsearch
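
Because the `dimension` in the mapping must match the length of every indexed vector, it can help to check a bulk file before sending it. The following is a minimal sketch of such a check; `find_dimension_mismatches` is a hypothetical helper, and the second sample document (`doc_x`) is invented to demonstrate a mismatch.

```python
import json


def find_dimension_mismatches(bulk_lines, expected_dim, embedding_field="embedding"):
    """Return the _id of each document whose vector length differs from expected_dim."""
    mismatched = []
    current_id = None
    for line in bulk_lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        # Assumption: an action line is an object whose only key is "index".
        if set(obj) == {"index"}:
            current_id = obj["index"].get("_id")
            continue
        if len(obj.get(embedding_field, [])) != expected_dim:
            mismatched.append(current_id)
    return mismatched


sample_lines = [
    '{"index": {"_index": "embeddings", "_id": "doc_001"}}',
    '{"id": "doc_001", "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3]}',
    # Invented document with a 2-dimensional vector, for illustration only.
    '{"index": {"_index": "embeddings", "_id": "doc_x"}}',
    '{"id": "doc_x", "embedding": [0.1, 0.2]}',
]

print(find_dimension_mismatches(sample_lines, expected_dim=8))
```

To check a real file, pass an open file handle for `sample_data_es_bulk.ndjson` as `bulk_lines`.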
40
example/sample_data_es_bulk.ndjson
Normal file
@@ -0,0 +1,40 @@
{"index": {"_index": "embeddings", "_id": "doc_001"}}
{"id": "doc_001", "embedding": [0.2, -0.1, 0.8, 0.3, -0.5, 0.7, 0.1, -0.3], "text": "Machine learning algorithms are transforming healthcare by enabling predictive analytics and personalized medicine.", "category": "technology", "subcategory": "healthcare", "tags": ["ai", "medicine", "prediction"]}
{"index": {"_index": "embeddings", "_id": "doc_002"}}
{"id": "doc_002", "embedding": [0.1, 0.4, -0.2, 0.6, 0.3, -0.4, 0.8, 0.2], "text": "Climate change poses significant challenges to global food security and agricultural sustainability.", "category": "environment", "subcategory": "agriculture", "tags": ["climate", "food", "sustainability"]}
{"index": {"_index": "embeddings", "_id": "doc_003"}}
{"id": "doc_003", "embedding": [-0.3, 0.7, 0.1, -0.2, 0.9, 0.4, -0.1, 0.5], "text": "The rise of electric vehicles is reshaping the automotive industry and urban transportation systems.", "category": "technology", "subcategory": "automotive", "tags": ["electric", "transport", "urban"]}
{"index": {"_index": "embeddings", "_id": "doc_004"}}
{"id": "doc_004", "embedding": [0.5, -0.6, 0.3, 0.8, -0.2, 0.1, 0.7, -0.4], "text": "Renewable energy sources like solar and wind are becoming increasingly cost-competitive with fossil fuels.", "category": "environment", "subcategory": "energy", "tags": ["renewable", "solar", "wind"]}
{"index": {"_index": "embeddings", "_id": "doc_005"}}
{"id": "doc_005", "embedding": [0.8, 0.2, -0.5, 0.1, 0.6, -0.3, 0.4, 0.9], "text": "Financial markets are experiencing volatility due to geopolitical tensions and inflation concerns.", "category": "finance", "subcategory": "markets", "tags": ["volatility", "inflation", "geopolitics"]}
{"index": {"_index": "embeddings", "_id": "doc_006"}}
{"id": "doc_006", "embedding": [-0.1, 0.5, 0.7, -0.4, 0.2, 0.8, -0.6, 0.3], "text": "Quantum computing research is advancing rapidly with potential applications in cryptography and drug discovery.", "category": "technology", "subcategory": "research", "tags": ["quantum", "cryptography", "research"]}
{"index": {"_index": "embeddings", "_id": "doc_007"}}
{"id": "doc_007", "embedding": [0.4, -0.3, 0.6, 0.7, -0.8, 0.2, 0.5, -0.1], "text": "Ocean pollution from plastic waste is threatening marine ecosystems and biodiversity worldwide.", "category": "environment", "subcategory": "marine", "tags": ["pollution", "plastic", "marine"]}
{"index": {"_index": "embeddings", "_id": "doc_008"}}
{"id": "doc_008", "embedding": [0.3, 0.8, -0.2, 0.5, 0.1, -0.7, 0.6, 0.4], "text": "Artificial intelligence is revolutionizing customer service through chatbots and automated support systems.", "category": "technology", "subcategory": "customer_service", "tags": ["ai", "chatbots", "automation"]}
{"index": {"_index": "embeddings", "_id": "doc_009"}}
{"id": "doc_009", "embedding": [-0.5, 0.3, 0.9, -0.1, 0.7, 0.4, -0.2, 0.8], "text": "Global supply chains are being redesigned for resilience after pandemic-related disruptions.", "category": "business", "subcategory": "logistics", "tags": ["supply_chain", "pandemic", "resilience"]}
{"index": {"_index": "embeddings", "_id": "doc_010"}}
{"id": "doc_010", "embedding": [0.7, -0.4, 0.2, 0.9, -0.3, 0.6, 0.1, -0.8], "text": "Space exploration missions are expanding our understanding of the solar system and potential for life.", "category": "science", "subcategory": "space", "tags": ["space", "exploration", "life"]}
{"index": {"_index": "embeddings", "_id": "doc_011"}}
{"id": "doc_011", "embedding": [-0.2, 0.6, 0.4, -0.7, 0.8, 0.3, -0.5, 0.1], "text": "Cryptocurrency adoption is growing among institutional investors despite regulatory uncertainties.", "category": "finance", "subcategory": "crypto", "tags": ["cryptocurrency", "institutional", "regulation"]}
{"index": {"_index": "embeddings", "_id": "doc_012"}}
{"id": "doc_012", "embedding": [0.6, 0.1, -0.8, 0.4, 0.5, -0.2, 0.9, -0.3], "text": "Remote work technologies are transforming traditional office environments and work-life balance.", "category": "technology", "subcategory": "workplace", "tags": ["remote", "work", "balance"]}
{"index": {"_index": "embeddings", "_id": "doc_013"}}
{"id": "doc_013", "embedding": [0.1, -0.7, 0.5, 0.8, -0.4, 0.3, 0.2, 0.6], "text": "Gene therapy breakthroughs are offering new hope for treating previously incurable genetic diseases.", "category": "science", "subcategory": "medicine", "tags": ["gene_therapy", "genetics", "medicine"]}
{"index": {"_index": "embeddings", "_id": "doc_014"}}
{"id": "doc_014", "embedding": [-0.4, 0.2, 0.7, -0.1, 0.9, -0.6, 0.3, 0.5], "text": "Urban planning is evolving to create more sustainable and livable cities for growing populations.", "category": "environment", "subcategory": "urban", "tags": ["urban_planning", "sustainability", "cities"]}
{"index": {"_index": "embeddings", "_id": "doc_015"}}
{"id": "doc_015", "embedding": [0.9, -0.1, 0.3, 0.6, -0.5, 0.8, -0.2, 0.4], "text": "Social media platforms are implementing new policies to combat misinformation and protect user privacy.", "category": "technology", "subcategory": "social_media", "tags": ["social_media", "misinformation", "privacy"]}
{"index": {"_index": "embeddings", "_id": "doc_016"}}
{"id": "doc_016", "embedding": [-0.3, 0.8, -0.1, 0.4, 0.7, -0.5, 0.6, -0.9], "text": "Educational technology is personalizing learning experiences and improving student outcomes.", "category": "education", "subcategory": "technology", "tags": ["education", "personalization", "technology"]}
{"index": {"_index": "embeddings", "_id": "doc_017"}}
{"id": "doc_017", "embedding": [0.5, 0.3, -0.6, 0.2, 0.8, 0.1, -0.4, 0.7], "text": "Biodiversity conservation efforts are critical for maintaining ecosystem balance and preventing species extinction.", "category": "environment", "subcategory": "conservation", "tags": ["biodiversity", "conservation", "extinction"]}
{"index": {"_index": "embeddings", "_id": "doc_018"}}
{"id": "doc_018", "embedding": [0.2, -0.8, 0.4, 0.7, -0.1, 0.5, 0.9, -0.3], "text": "Healthcare systems are adopting telemedicine to improve access and reduce costs for patients.", "category": "technology", "subcategory": "healthcare", "tags": ["telemedicine", "healthcare", "access"]}
{"index": {"_index": "embeddings", "_id": "doc_019"}}
{"id": "doc_019", "embedding": [-0.7, 0.4, 0.8, -0.2, 0.3, 0.6, -0.1, 0.9], "text": "Autonomous vehicles are being tested extensively with promises of safer and more efficient transportation.", "category": "technology", "subcategory": "automotive", "tags": ["autonomous", "safety", "efficiency"]}
{"index": {"_index": "embeddings", "_id": "doc_020"}}
{"id": "doc_020", "embedding": [0.4, 0.7, -0.3, 0.9, -0.6, 0.2, 0.5, -0.1], "text": "Mental health awareness is increasing with new approaches to therapy and workplace wellness programs.", "category": "health", "subcategory": "mental", "tags": ["mental_health", "therapy", "wellness"]}
20
example/sample_prompts_es_bulk.ndjson
Normal file
@@ -0,0 +1,20 @@
{"index": {"_index": "prompts", "_id": "prompt_001"}}
{"id": "prompt_001", "embedding": [0.15, -0.28, 0.65, 0.42, -0.11, 0.33, 0.78, -0.52], "text": "Find articles about machine learning applications", "category": "search", "subcategory": "technology", "tags": ["AI", "research"]}
{"index": {"_index": "prompts", "_id": "prompt_002"}}
{"id": "prompt_002", "embedding": [0.72, 0.18, -0.35, 0.51, 0.09, -0.44, 0.27, 0.63], "text": "Show me product reviews for smartphones", "category": "search", "subcategory": "product", "tags": ["mobile", "reviews"]}
{"index": {"_index": "prompts", "_id": "prompt_003"}}
{"id": "prompt_003", "embedding": [-0.21, 0.59, 0.34, -0.67, 0.45, 0.12, -0.38, 0.76], "text": "What are the latest political developments?", "category": "search", "subcategory": "news", "tags": ["politics", "current events"]}
{"index": {"_index": "prompts", "_id": "prompt_004"}}
{"id": "prompt_004", "embedding": [0.48, -0.15, 0.72, 0.31, -0.58, 0.24, 0.67, -0.39], "text": "Summarize recent tech industry trends", "category": "analysis", "subcategory": "technology", "tags": ["tech", "trends", "summary"]}
{"index": {"_index": "prompts", "_id": "prompt_005"}}
{"id": "prompt_005", "embedding": [-0.33, 0.47, -0.62, 0.28, 0.71, -0.18, 0.54, 0.35], "text": "Compare different smartphone models", "category": "analysis", "subcategory": "product", "tags": ["comparison", "mobile", "evaluation"]}
{"index": {"_index": "prompts", "_id": "prompt_006"}}
{"id": "prompt_006", "embedding": [0.64, 0.21, 0.39, -0.45, 0.13, 0.58, -0.27, 0.74], "text": "Analyze voter sentiment on recent policies", "category": "analysis", "subcategory": "politics", "tags": ["sentiment", "politics", "analysis"]}
{"index": {"_index": "prompts", "_id": "prompt_007"}}
{"id": "prompt_007", "embedding": [0.29, -0.43, 0.56, 0.68, -0.22, 0.37, 0.14, -0.61], "text": "Generate a summary of machine learning research", "category": "generation", "subcategory": "technology", "tags": ["AI", "research", "summary"]}
{"index": {"_index": "prompts", "_id": "prompt_008"}}
{"id": "prompt_008", "embedding": [-0.17, 0.52, -0.48, 0.36, 0.74, -0.29, 0.61, 0.18], "text": "Create a product recommendation report", "category": "generation", "subcategory": "product", "tags": ["recommendation", "report", "analysis"]}
{"index": {"_index": "prompts", "_id": "prompt_009"}}
{"id": "prompt_009", "embedding": [0.55, 0.08, 0.41, -0.37, 0.26, 0.69, -0.14, 0.58], "text": "Write a news brief on election updates", "category": "generation", "subcategory": "news", "tags": ["election", "news", "brief"]}
{"index": {"_index": "prompts", "_id": "prompt_010"}}
{"id": "prompt_010", "embedding": [0.23, -0.59, 0.47, 0.61, -0.35, 0.18, 0.72, -0.26], "text": "Explain how neural networks work", "category": "explanation", "subcategory": "technology", "tags": ["AI", "education", "neural networks"]}