# Pinecone Upsert Quickstart

Complete example showing how to upsert Skill Seekers documents to Pinecone and perform semantic search. This example:
- Creates a Pinecone serverless index
- Loads Skill Seekers-generated documents (LangChain format)
- Generates embeddings with OpenAI
- Upserts documents to Pinecone with metadata
- Demonstrates semantic search capabilities
- Provides interactive search mode
## Setup

```bash
# Install dependencies
pip install pinecone-client openai

# Set API keys
export PINECONE_API_KEY=your-pinecone-api-key
export OPENAI_API_KEY=sk-...
```

## Generate Documents

First, generate LangChain-format documents using Skill Seekers:
```bash
# Option 1: Use preset config (e.g., Django)
skill-seekers scrape --config configs/django.json
skill-seekers package output/django --target langchain

# Option 2: From GitHub repo
skill-seekers github --repo django/django --name django
skill-seekers package output/django --target langchain

# Output: output/django-langchain.json
```

## Run the Quickstart

```bash
cd examples/pinecone-upsert

# Run the quickstart script
python quickstart.py
```

You'll see:

- Index creation (if it doesn't exist)
- Documents loaded with category breakdown
- Batch upsert with progress tracking
- Example queries demonstrating semantic search
- Interactive search mode for your own queries
## Example Output

```text
============================================================
PINECONE UPSERT QUICKSTART
============================================================

Step 1: Creating Pinecone index...
✅ Index created: skill-seekers-demo

Step 2: Loading documents...
✅ Loaded 180 documents
   Categories: {'api': 38, 'guides': 45, 'models': 42, 'overview': 1, ...}

Step 3: Upserting to Pinecone...
Upserting 180 documents...
  Batch size: 100
  Upserted 100/180 documents...
  Upserted 180/180 documents...
✅ Upserted all documents to Pinecone
   Total vectors in index: 180

Step 4: Running example queries...

============================================================
QUERY: How do I create a Django model?
------------------------------------------------------------

Score: 0.892
Category: models
Text: Django models are Python classes that define the structure of your database tables...

Score: 0.854
Category: api
Text: To create a model, inherit from django.db.models.Model and define fields...

============================================================
INTERACTIVE SEMANTIC SEARCH
============================================================
Search the documentation (type 'quit' to exit)

Query: What are Django views?
```
## Features

- **Serverless Index** - Auto-scaling Pinecone infrastructure
- **Batch Upsert** - Efficient bulk loading (100 docs/batch)
- **Metadata Filtering** - Category-based search filters
- **Semantic Search** - Vector similarity matching
- **Interactive Mode** - Real-time query interface
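Metadata filtering happens at query time, alongside the vector similarity match. As a sketch of how the two combine (the helper name `search_category` is ours, not part of `quickstart.py`; it assumes a Pinecone index handle from the v3 client):

```python
def search_category(index, query_embedding, category, top_k=5):
    """Query a Pinecone index, restricting matches to one document category.

    `index` is a pinecone.Index; `query_embedding` is the query vector.
    The filter uses Pinecone's MongoDB-style metadata operators.
    """
    return index.query(
        vector=query_embedding,
        top_k=top_k,
        filter={"category": {"$eq": category}},  # only match this category
        include_metadata=True,
    )
```

Because the filter is applied server-side, only vectors whose metadata matches are considered, so the top-k results are all from the requested category.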
## Files

- `quickstart.py` - Complete working example
- `README.md` - This file
- `requirements.txt` - Python dependencies
## Cost Estimate

For 1,000 documents:

- Embeddings: ~$0.01 (OpenAI ada-002)
- Storage: ~$0.03/month (Pinecone serverless)
- Queries: ~$0.025 per 100k queries

Total first month: ~$0.04 + query costs
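The first-month total is just the one-time embedding cost plus one month of storage (queries billed separately), using the figures above:

```python
# Sanity check on the estimate above (figures from this README)
embedding_cost = 0.01     # ~$0.01 one-time, 1,000 docs with ada-002
storage_per_month = 0.03  # ~$0.03/month on Pinecone serverless
first_month = round(embedding_cost + storage_per_month, 2)
print(f"First month (excluding queries): ~${first_month}")
```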
## Customization

Change the index name:

```python
INDEX_NAME = "my-custom-index"  # Line 215
```

Change the batch size:

```python
batch_upsert(index, openai_client, documents, batch_size=50)  # Line 239
```

Filter search results by category:

```python
matches = semantic_search(
    index=index,
    openai_client=openai_client,
    query="your query",
    category="models"  # Only search in "models" category
)
```

Use a different embedding model:

```python
# In create_embeddings() function
response = openai_client.embeddings.create(
    model="text-embedding-3-small",  # Cheaper, smaller dimension
    input=texts
)

# Update index dimension to 1536 (for text-embedding-3-small)
create_index(pc, INDEX_NAME, dimension=1536)
```

## Troubleshooting

**"Index already exists"**
- Normal message if you've run the script before
- The script will reuse the existing index
**"PINECONE_API_KEY not set"**

- Get an API key from: https://app.pinecone.io/
- Set the environment variable: `export PINECONE_API_KEY=your-key`

**"OPENAI_API_KEY not set"**

- Get an API key from: https://platform.openai.com/api-keys
- Set the environment variable: `export OPENAI_API_KEY=sk-...`

**"Documents not found"**

- Make sure you've generated documents first (see "Generate Documents" above)
- Check that `DOCS_PATH` in `quickstart.py` matches your output location

**"Rate limit exceeded"**

- OpenAI or Pinecone rate limit hit
- Reduce `batch_size`: `batch_size=50` or `batch_size=25`
- Add delays between batches
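One way to implement "reduce batch size and add delays" is to chunk the document list and sleep between batches. A minimal sketch (the helper and its `upsert_fn` callback are hypothetical, not part of `quickstart.py`):

```python
import time

def upsert_in_batches(documents, upsert_fn, batch_size=50, delay_s=1.0):
    """Call upsert_fn on successive slices of `documents`,
    sleeping between batches to stay under API rate limits."""
    for start in range(0, len(documents), batch_size):
        upsert_fn(documents[start:start + batch_size])
        if start + batch_size < len(documents):
            time.sleep(delay_s)  # no sleep after the final batch
```

Tune `batch_size` and `delay_s` down together until the rate-limit errors stop; the trade-off is purely upsert wall-clock time.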
## Reusing the Index

Query an existing index without re-upserting. The original snippet left `query_embedding` undefined; it must come from the same embedding model used at upsert time (ada-002 in this example):

```python
from openai import OpenAI
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("skill-seekers-demo")

# Embed the query text with the same model used at upsert time
query_embedding = OpenAI().embeddings.create(
    model="text-embedding-ada-002",
    input="your query",
).data[0].embedding

# Query immediately (no need to re-upsert)
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
```

Update a document:

```python
# Upsert with the same ID to update
index.upsert(vectors=[{
    "id": "doc_123",
    "values": new_embedding,
    "metadata": updated_metadata
}])
```

Delete documents:

```python
# Delete by ID
index.delete(ids=["doc_123", "doc_456"])

# Delete by metadata filter
index.delete(filter={"category": {"$eq": "deprecated"}})

# Delete all vectors in a namespace
index.delete(delete_all=True)
```

Use namespaces:

```python
# Upsert to a namespace
index.upsert(vectors=vectors, namespace="production")

# Query a specific namespace
results = index.query(
    vector=query_embedding,
    namespace="production",
    top_k=5
)
```

Need help? Ask in GitHub Discussions.