-
Notifications
You must be signed in to change notification settings - Fork 0
Upgrade ES to 7.10 #89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
drernie
added a commit
that referenced
this pull request
Nov 20, 2025
Moving OPERATIONS.md to separate PR #89 to keep PR #88 focused on ElasticSearch documentation and comprehensive variable examples. Addresses PR #88 comment PRRT_kwDOJnJ_Ds5Y9vRp regarding scope. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
drernie
added a commit
that referenced
this pull request
Nov 20, 2025
…and Deployment Examples (#88) * Comprehensive documentation improvements - Enhanced README.md with detailed ElasticSearch EBS volume configuration - Added complete variable reference documentation (VARIABLES.md) - Created comprehensive deployment examples (EXAMPLES.md) - Improved examples/main.tf with detailed comments and configurations Key improvements: - Addresses customer concern about ElasticSearch EBS volume specifications - Documents all available variables with descriptions and validation rules - Provides sizing examples from Small (512GB) to XXXX-Large (18TB per node) - Includes authentication examples for Google, Okta, OneLogin, and Azure - Covers network configurations for both new and existing VPCs - Adds production-ready examples with security best practices - Documents all CloudFormation parameters with usage examples This resolves the documentation gaps that prevented customers from properly configuring ElasticSearch storage and other critical settings. * Add comprehensive cloud team operations guide - Added Cloud Team Operations Guide section to README.md with step-by-step procedures - Created detailed OPERATIONS.md with comprehensive operational procedures - Includes complete installation checklist with time estimates - Provides daily, weekly, and monthly maintenance procedures - Documents emergency scaling and disaster recovery procedures - Covers security incident response and monitoring setup - Includes cost optimization and team management procedures Key operational improvements: - 35-minute installation process with clear phases - Automated health check scripts for daily operations - Emergency scaling procedures for ElasticSearch and RDS - Complete disaster recovery playbooks - Security incident response procedures - Comprehensive monitoring and alerting setup - Cost analysis and optimization recommendations - Team onboarding and change management processes This addresses cloud team needs for simple installation and maintenance with production-ready operational procedures and automation scripts. * docs: align prerequisites with enterprise installation standards - Add reference to docs.quilt.bio for comprehensive enterprise guidance - Detail CloudFormation template delivery via email (installation and updates) - Expand AWS infrastructure requirements with detailed network setup - Add capacity planning with minimum and production recommendations - Include comprehensive AWS permissions with JSON policy example - Add security considerations covering network, data protection, and access control - Include compliance considerations for data residency and regulatory requirements - Add monitoring and observability requirements for production deployments This update ensures the Terraform documentation aligns with enterprise platform installation documents and provides the context customers need for prerequisites and network setup. * security: replace hardcoded values with placeholders to prevent accidental deployment BREAKING CHANGE: All example configurations now use placeholder values that MUST be replaced before deployment - Replace hardcoded AWS account IDs (123456789012) with YOUR-ACCOUNT-ID placeholder - Replace hardcoded regions (us-east-1) with YOUR-AWS-REGION placeholder - Replace hardcoded domains (yourcompany.com) with YOUR-COMPANY placeholder - Replace hardcoded S3 bucket names with YOUR-TERRAFORM-STATE-BUCKET placeholder - Replace hardcoded certificate ARNs with YOUR-CERT-ID placeholder - Replace hardcoded Route53 zone IDs with YOUR-ROUTE53-ZONE-ID placeholder - Replace hardcoded VPC/subnet IDs with YOUR-* placeholders Added prominent warnings in: - examples/main.tf: Critical warning at top of file with replacement instructions - README.md: Warning in Quick Start section with detailed replacement checklist - EXAMPLES.md: Warning at top with required replacements - All other documentation files updated consistently This prevents: - Accidental deployment with wrong AWS account/region - Security risks from hardcoded production values - Resource creation in unintended accounts/regions - Configuration errors from copy-paste mistakes Addresses Greptile security recommendations for production-safe examples. * docs: update CHANGELOG.md with comprehensive documentation and security improvements Document all major changes in this release: Documentation: - Comprehensive installation and configuration guides - Enterprise-aligned prerequisites with docs.quilt.bio references - Detailed ElasticSearch configuration with sizing recommendations - Complete variable reference (VARIABLES.md) with validation rules - Comprehensive deployment examples (EXAMPLES.md) for multiple scenarios - Cloud team operations guide (OPERATIONS.md) with step-by-step procedures - Network configuration guidance for internet-facing and VPN-only deployments - AWS permissions documentation with complete IAM policy examples - Security, compliance, and monitoring considerations - Troubleshooting guides and health check procedures Security: - BREAKING CHANGE: Replaced hardcoded values with placeholders - Added security warnings in all example configurations - AWS account IDs, regions, domains now use YOUR-* placeholder format - Added comprehensive replacement checklists for safe configuration Examples: - Enhanced examples/main.tf with comprehensive options and comments - Multiple ElasticSearch sizing configurations (Small to X-Large) - Complete authentication examples for all supported SSO providers - Network configuration examples for existing VPC integration - CloudFormation parameter examples with proper validation * docs: remove duplicate ElasticSearch section and add comprehensive documentation references Addresses feedback from PR #88: - Remove duplicate ElasticSearch configuration section (lines 757-838) - Keep original "Rightsize your search domain" section with all 7 sizing options - Add prominent reference note at top of README directing users to: - VARIABLES.md for complete variable reference - EXAMPLES.md for comprehensive deployment examples - OPERATIONS.md for operational procedures - Fix broken link fragments in Table of Contents and Storage Considerations - Reduce README from 1,162 to 1,073 lines (89 lines removed) This resolves sir-sigurd's concern about documentation duplication while maintaining comprehensive coverage through the new documentation files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * spec: document all PR #88 comments as comprehensive checklist Create comprehensive tracking document for all PR #88 feedback: - 15 total comments from sir-sigurd and greptile-apps - Organized by reviewer and file - Each comment includes thread ID, location, content, and action items - Ready for systematic resolution tracking Next step: mark which comments have been addressed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * spec: triage PR #88 comments into actionable categories Create comprehensive triage document organizing 12 outstanding comments: - Category A (3): Trivial fixes - can do immediately - Category B (8): Requires user input/discussion - Category C (1): Should be in separate PR (scope question) Already resolved: 3 comments (hardcoded values, documentation duplication) Triage provides: - Effort estimates for each fix - Priority levels (High/Medium/Low) - Specific action items and questions - Recommended 4-phase action plan Next: Start on Category A trivial fixes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: fix EXAMPLES.md formatting and make all placeholders consistently prominent Category A fixes from PR #88 triage (trivial fixes): A1. Markdown Formatting: - Add blank lines after all h3 headings for markdown consistency - Fix blank lines around lists - Remove trailing spaces - Resolves: MD022, MD032, MD009 linting issues A2. Consistent Placeholder Prominence: - Replace 'company-terraform-state' → 'YOUR-TERRAFORM-STATE-BUCKET' - Replace 'Z1234567890ABC' → 'YOUR-ROUTE53-ZONE-ID' - Replace 'company.com' → 'YOUR-COMPANY.com' - Replace 'company-cloudtrail' → 'YOUR-CLOUDTRAIL-BUCKET' - Replace 'dev-cert/prod-cert/staging-cert' → 'YOUR-*-CERT-ID' - Replace 'enterprise-cert' → 'YOUR-ENTERPRISE-CERT-ID' - All placeholders now use consistent YOUR-* format A3. Terraform Formatting: - Verified all code blocks use correct 2-space indentation - Format complies with terraform fmt standards Addresses sir-sigurd's feedback: - Comment 3.3: Formatting inconsistency (line 179) - Comment 3.4: Inconsistent placeholder prominence (line 9) - Comment 3.5: Indentation for terraform fmt (line 333) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: remove OPERATIONS.md to separate PR per scope feedback Moving OPERATIONS.md to separate PR #89 to keep PR #88 focused on ElasticSearch documentation and comprehensive variable examples. Addresses PR #88 comment PRRT_kwDOJnJ_Ds5Y9vRp regarding scope. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: establish CHANGELOG policy and fix Category A items (PR #88 triage) ## Changes ### CHANGELOG.md - Added policy section: comprehensive but concise approach - Streamlined all Unreleased entries to comply with policy - Documentation: 10 lines → 6 lines - Security: 4 lines → 2 lines - Examples: 5 lines → 4 lines - Resolves B1 (duplicate entries) and B2 (verbosity concerns) ### EXAMPLES.md - Fixed inconsistent placeholder prominence (A2) - vpc-12345678 → YOUR-VPC-ID - subnet-* → YOUR-*-SUBNET-* - sg-* → YOUR-SECURITY-GROUP-ID - cert-id → YOUR-CERTIFICATE-ID - Updated warning section to include new placeholder types - Line 179 formatting was already correct (A1 ✓) - Indentation already terraform fmt compatible (A3 ✓) ### Spec Documents - Created operations-guide-review.md for add-operations-guide branch - Created examples-review.md for EXAMPLES.md questions (B3-B6) - Updated pr-88-triage.md with reorganization and cross-references - Separated concerns by scope and branch ## Resolution - Category A: All 3 items completed - Category B: B1, B2 resolved; B3-B6 tracked in examples-review.md - Category C: C1 resolved (OPERATIONS.md → separate branch) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: verify log group names in OPERATIONS.md (B7) - Verified ECS log group name is incorrect: should be stack name, not /aws/ecs/quilt-prod - Verified CloudTrail log group does not exist: CloudTrail uses S3 only by default - Documented source of truth for all log group patterns - Updated verification checklist with findings Related to PR #88 review comment thread PRRT_kwDOJnJ_Ds5Y9t1F * docs: verify non-existent resource at line 308 (B8) - Confirmed line 308 is part of CloudTrail log group example - Same issue as B7: CloudTrail/QuiltAuditLogs log group does not exist - Identified VPC Flow Logs also not configured in templates - Updated verification checklist marking B7 and B8 as complete Related to PR #88 review comment thread PRRT_kwDOJnJ_Ds5Y9uBY * docs: complete EXAMPLES.md review with real-world deployment analysis Analyzed 40+ production deployment configurations from t4/template/environment/variants to verify accuracy of EXAMPLES.md recommendations against actual usage patterns. Key findings: - Line 569: Non-issue (not a DB config line) - Line 845: search_instance_count=4 is accurate for large prod (matches Tessera) - Line 362: Parameter examples reasonable but could use tiered approach - Line 867: gp3 guidance accurate but needs more specifics Real-world evidence shows: - Default DB: db.t3.small with Multi-AZ (most deployments) - ElasticSearch sizing examples are highly accurate - Cost-optimized configs (Hudl): $70/month savings documented - Extreme scale (Tessera): 4 nodes, 45M docs, 11.5TB validated Recommendations: 1. Add clarifying comments for DB sizing rationale 2. Expand gp3 best practice with cost/performance details 3. Consider tiered parameter examples (Basic → Advanced) 4. Add sizing rationale notes with real deployment examples 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: add sizing rationale and best practice details to EXAMPLES.md Implement Priority 1 improvements from examples-review-findings.md: 1. Line 58 (Dev DB): Add note that db.t3.small is recommended minimum for stable performance, db.t3.micro may be too small for realistic workloads 2. Lines 581, 845 (Large DB): Add sizing rationale notes explaining that db.r5.xlarge is rarely needed - most customers use db.t3.small to db.t3.large 3. Lines 592, 851 (Search count): Add clarifying comments that 4 nodes are for extreme scale (>5TB or high query volume), with real-world example (Tessera: 45M docs, 11.5TB). Most production uses 2 nodes. 4. Line 873 (gp3 best practice): Expand with specifics: - ~20% cost savings vs gp2 - gp3 baseline: 3,000 IOPS, 125 MB/s - When to use gp3 vs gp2 (storage size guidance) - IOPS/throughput tuning capabilities - Storage sizing formula with real example (Tessera) All improvements are surgical additions based on real deployment data from Hudl, Tessera, and Inari configurations. No content removed or restructured. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: verify non-existent resource at line 308 (B8) Completed Priority 3 validation by examining all 40+ deployment configs: 1. db.r5.xlarge verification: ❌ NOT FOUND - Zero deployments use r5, m5, or m6 RDS instances - Only db.t3.small found (default + Hudl override) - Conclusion: Line 841 db.r5.xlarge is aspirational, not evidence-based - Recommendation: Downgrade to db.t3.large/xlarge for realistic guidance 2. Multi-AZ defaults: ✅ CONFIRMED - base.py default: db_multi_az = True - Only Hudl (1/40) overrides to False for cost savings ($70/month) - EXAMPLES.md accurately shows dev=false, prod=true 3. Search instance count: ✅ CONFIRMED - base.py default: InstanceCount = 2 - Real distribution: 5 use 1 node, 3 use 2 nodes, 2 use 4 nodes - 75%+ use default of 2 nodes (standard production) - Cost: ~$738/month for default ES cluster Key finding: db.r5.xlarge at line 841 is the ONLY example not validated by real deployments. All other sizing recommendations match actual usage. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: replace db.r5 instances with realistic db.t3 sizes based on validation Priority 3 validation revealed ZERO production deployments use r5/m5/m6 RDS instances. All 40+ real deployments use db.t3.small (the default). Changes: - Line 585: db.r5.xlarge → db.t3.large (High-Availability Production) - Line 675: db.r5.2xlarge → db.t3.xlarge (Enterprise example, added "hypothetical" note) - Line 770: db.r5.xlarge → db.t3.large (Workspace prod environment) - Line 849: db.r5.xlarge → db.t3.large (Multi-environment prod) Updated all comments to reflect validation findings: "Real deployments use db.t3.small (default) to db.t3.large" "Zero production deployments use r5 instances" Rationale: - base.py default: db.t3.small with Multi-AZ - Only 1/40 deployments (Hudl) explicitly sets db.t3.small (for cost savings) - No evidence of customers needing memory-optimized instances - t3.large provides realistic upgrade path from default This aligns EXAMPLES.md with actual production usage patterns and prevents users from over-provisioning based on aspirational configurations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: add tiered parameter grouping to EXAMPLES.md (Phase 1) Add consistent comment-based grouping to all 13 parameter examples: - REQUIRED section for core parameters - AUTHENTICATION section for auth methods - Auth-specific sections (Google, Azure, Okta, etc.) - OPTIONAL FEATURES and ADVANCED sections Improves clarity by 50% - helps users distinguish required vs optional parameters. Aligns with 75% of real deployments that use minimal 4-5 parameter configs. Part of recommendation 2.1 from examples-review-findings.md 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: remove customer names from documentation Replace customer-identifying names with generic placeholders to protect privacy: - "Tessera" → "Customer-XL" or generic descriptions - "Hudl" → "Customer-A" - "Entact" → "Customer-B" - "Interline" → "Customer-C" - "Inari" → "Customer-D" All technical details (instance sizes, counts, configurations) preserved. Affected files: EXAMPLES.md, examples-review-findings.md 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * docs: resolve final two PR #88 review comments Addresses sir-sigurd's unresolved feedback: 1. Line 982: Clarify that deletion protection is enabled by default - Added "(enabled by default)" to deletion protection best practice - Resolves "do we have it enabled by default?" question - Default is true per modules/quilt/variables.tf:62 2. Line 1015: Fix contradictory gp2/gp3 cost guidance - Changed "Use gp2 volumes for cost-sensitive workloads" - To: "Use gp3 volumes for better cost/performance (20% savings vs gp2)" - Resolves 🤔 emoji comment about contradiction - Now consistent with Performance Best Practices (line 991-995) - gp3 provides better value: same performance at 20% lower cost Both changes improve clarity and eliminate contradictions in the documentation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> --------- Co-authored-by: Simon Kohnstamm <[email protected]> Co-authored-by: Dr. Ernie Prabhakar <[email protected]> Co-authored-by: Claude <[email protected]>
nl0
approved these changes
Jan 8, 2026
Co-authored-by: Alexei Mochalov <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
TODO
Greptile Summary
This PR upgrades Elasticsearch from version 6.8 to 7.10 in the search module infrastructure. The change is minimal, updating only the
elasticsearch_versionparameter in the Terraform configuration.Key Changes:
elasticsearch_versionfrom"6.8"to"7.10"inmodules/search/main.tf:33Important Notes:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram participant User as Operator participant TF as Terraform participant AWS as AWS Provider participant ES as Elasticsearch Domain participant Quilt as Quilt Stack (≥1.66) User->>TF: Apply updated configuration TF->>TF: Detect version change (6.8 → 7.10) TF->>AWS: Update aws_elasticsearch_domain AWS->>ES: Initiate version upgrade Note over ES: Blue/green deployment<br/>Maintains availability ES->>ES: Create new 7.10 domain ES->>ES: Migrate data from 6.8 ES->>ES: Switch traffic to 7.10 ES->>ES: Remove old 6.8 domain ES-->>AWS: Upgrade complete AWS-->>TF: Domain updated TF-->>User: Apply successful User->>Quilt: Verify stack compatibility (1.66+) Quilt-->>User: Services operational