Implement living knowledge hub for Delta Lake and Apache Iceberg #1
Creates a self-sustaining, community-driven knowledge platform for data lakehouse technologies with automated content validation, freshness detection, gamified contributions, and AI-powered resource curation.
Core Infrastructure
Governance
Content Architecture
Automation Workflows
CI/CD Pipelines (6 workflows)
- ci-code-recipes.yml: Matrix-based validation, executes validate.sh per recipe
- ci-docs.yml: Markdownlint, lychee link checker, Mermaid syntax validation
- stale-content-bot.yml: Weekly scan via git log, creates issues for docs >12mo old
- gamification-engine.yml: Event-driven contribution tracking, updates contributors.json
- update-leaderboard.yml: Daily cron, injects top 10 into README between markers
- awesome-list-aggregator.yml: RSS/web scraping, LLM-ready summarization, auto-PR

Automation Scripts (4 Python modules)
Documentation
Technical Content
- docs/BLUEPRINT.md: Complete implementation specification (15k+ words)
- docs/architecture/system-overview.md: Mermaid workflow diagrams
- docs/tutorials/getting-started.md: Comparative quickstart for both technologies
- docs/best-practices/production-readiness.md: Operational playbook

Code Recipes (standardized structure)
Data Model
Community Tracking (community/contributors.json):

[{ "username": "contributor", "points": 150, "contributions": {"prs_merged": 5, "reviews": 10}, "recent_activity": [...] }]

Resource Aggregation (community/processed_urls.json, scripts/config/trusted_sources.json)

Implementation Notes

- Leaderboard entries are injected between <!-- LEADERBOARD_START --> markers in the README
- Repository scales to thousands of contributors via JSON-based tracking
- LLM integration is optional (falls back to simple text extraction)
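The contributor data model is simple enough that the leaderboard refresh can be sketched in a few lines. The snippet below is illustrative only, not code from this PR: it assumes the documented start marker is paired with a hypothetical <!-- LEADERBOARD_END --> marker and that ranking is by the points field.

```python
# Illustrative sketch of the update-leaderboard logic (not the PR's code).
# Assumes a closing <!-- LEADERBOARD_END --> marker exists in the README.
import json
import re
from pathlib import Path

START = "<!-- LEADERBOARD_START -->"
END = "<!-- LEADERBOARD_END -->"  # assumed marker name

def render_leaderboard(contributors, top_n=10):
    """Build a Markdown table of the top contributors by points."""
    ranked = sorted(contributors, key=lambda c: c["points"], reverse=True)[:top_n]
    rows = [f"| {i + 1} | @{c['username']} | {c['points']} |" for i, c in enumerate(ranked)]
    return "\n".join(["| Rank | Contributor | Points |", "| --- | --- | --- |"] + rows)

def update_readme(readme_path=Path("README.md")):
    """Replace the README section between the leaderboard markers."""
    contributors = json.loads(Path("community/contributors.json").read_text())
    table = render_leaderboard(contributors)
    readme = readme_path.read_text()
    updated = re.sub(
        re.escape(START) + r".*?" + re.escape(END),
        f"{START}\n{table}\n{END}",
        readme,
        flags=re.DOTALL,
    )
    readme_path.write_text(updated)

if __name__ == "__main__":
    update_readme()
```

Run on a daily cron by update-leaderboard.yml, logic along these lines keeps the Contribution Spotlight current without manual README edits.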
Original prompt
Here is a detailed and comprehensive prompt designed to instruct a highly skilled AI software architect and developer to create the full design and implementation plan for the dynamic Delta Lake and Iceberg knowledge base.
Prompt: Blueprint for a Living, Community-Driven Data Engineering Knowledge Hub
Your Role: You are a world-class AI software architect and principal engineer. Your expertise spans building scalable open-source projects, fostering vibrant developer communities, and implementing robust CI/CD and automation pipelines.
Your Mission: Your task is to generate a complete and detailed technical blueprint for creating, maintaining, and expanding a definitive GitHub repository: delta-iceberg-knowledge-hub. This is not just a collection of files; it is a living, interactive, and self-sustaining ecosystem for data engineering best practices. Your output should be a comprehensive design document that includes directory structures, core file contents, detailed descriptions of automation workflows, function/method-level design for custom scripts, and a clear strategy for community engagement.
Part 1: Foundational Architecture and Core Content
Describe the foundational file and directory structure. For each core component, specify its purpose and provide a template or detailed content outline.
1.1. Root Directory Structure:
Generate the complete directory tree. For each directory, provide a one-sentence description of its purpose.
1.2. Core Governance and Onboarding Files:
For each file below, detail the essential sections it must contain:
README.md:
Vision Statement: A powerful, one-paragraph mission statement.
Quick Links: A navigable table of contents.
"Living Whitepaper" Philosophy: An explanation of the repository's dynamic nature.
Contribution Spotlight: A section that will be programmatically updated to feature top contributors.
Tech Stack: Icons and links for key technologies used (Delta Lake, Iceberg, Python, GitHub Actions, etc.).
CONTRIBUTING.md:
Contribution Workflow: Step-by-step guide from forking the repo to submitting a PR.
"Types of Contributions": Define how to contribute code recipes, documentation, bug fixes, and review others' work.
Style Guides: Links to style guides for Markdown (markdownlint), Python (black), and diagrams (mermaid).
Developer Certificate of Origin (DCO): Instructions for signing off on commits.
CODE_OF_CONDUCT.md: Implement the Contributor Covenant Code of Conduct.
LICENSE: Specify the Apache 2.0 License.
1.3. Content Architecture:
docs/:
Comparison Matrix (comparisons/feature-matrix.md): Design a detailed Markdown table comparing Delta and Iceberg across critical features (e.g., Time Travel, Schema Evolution, Partitioning, Compaction, Z-Ordering, Concurrency Control). Pre-populate with key features, leaving some cells as "Community contribution needed."
code-recipes/:
Recipe Structure: Define a mandatory template for every new recipe. This template must include:
problem.md: A clear description of the problem being solved.
solution.py / solution.sql: The fully commented code (an illustrative solution.py sketch appears after this section).
environment.yml / requirements.txt: A dependency file.
validate.sh: A simple script to run the recipe and validate its output.
Diagrams as Code: Mandate the use of Mermaid.js for all architectural diagrams. Provide a sample .md file showing how to embed a Mermaid diagram for version control and easy rendering on GitHub.
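To make the recipe template concrete, here is a minimal, hypothetical solution.py for a Delta Lake recipe, assuming PySpark with the delta-spark package installed; paths and names are illustrative, and an Iceberg recipe would follow the same shape with its own catalog configuration.

```python
# Hypothetical solution.py for a Delta Lake recipe (illustrative only).
# Assumes the delta-spark package is listed in the recipe's requirements.txt.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-recipe-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a tiny DataFrame as a Delta table, then read it back so validate.sh
# can check the printed row count.
df = spark.createDataFrame([(1, "bronze"), (2, "silver")], ["id", "layer"])
df.write.format("delta").mode("overwrite").save("/tmp/recipe-demo")

result = spark.read.format("delta").load("/tmp/recipe-demo")
print(f"rows={result.count()}")
spark.stop()
```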
Part 2: Automation, CI/CD, and "Living" Maintenance
Design the suite of GitHub Actions workflows required to automate validation, maintenance, and content enrichment. For each workflow, provide the filename (e.g., ci.yml) and a detailed, step-by-step description of the jobs and logic.
2.1. ci-code-recipes.yml (Code Validation Workflow):
Trigger: On pull requests targeting the code-recipes/ directory.
Jobs:
lint-code: Lints Python code with black and flake8.
validate-recipe-execution: A matrix-based job that iterates through each changed recipe directory (a sketch of the changed-recipe discovery appears after this list). For each recipe, it will:
Set up the specified environment (Python/Java).
Install dependencies from the requirements file.
Execute the validate.sh script.
Fail the PR if the validation script returns a non-zero exit code.
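The matrix itself has to come from somewhere. One plausible approach, shown below as an assumption rather than the workflow's actual implementation, is a small discovery script that a preceding job runs and whose JSON output feeds the matrix strategy; it presumes a full-history checkout with origin/main available.

```python
# Hypothetical scripts/list_changed_recipes.py used by a discovery job
# (an assumption, not part of the documented workflow).
import json
import subprocess
from pathlib import Path

def changed_recipe_dirs(base_ref="origin/main"):
    """Return code-recipes/<name> directories touched relative to the base branch."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    recipes = set()
    for changed in diff:
        parts = Path(changed).parts
        if len(parts) >= 2 and parts[0] == "code-recipes":
            recipes.add(f"{parts[0]}/{parts[1]}")
    return sorted(recipes)

if __name__ == "__main__":
    # JSON list that the workflow can expose as a job output for the matrix.
    print(json.dumps(changed_recipe_dirs()))
```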
2.2. ci-docs.yml (Documentation and Link Validation):
Trigger: On pull requests targeting .md files.
Jobs:
lint-markdown: Runs markdownlint to enforce style consistency.
check-broken-links: Uses a tool like lychee-action to check all internal and external links in the changed Markdown files to prevent link rot.
2.3. stale-content-bot.yml (Content Freshness Workflow):
Trigger: Scheduled to run weekly (schedule: cron).
Logic Design:
This workflow will run a custom Python script (scripts/find_stale_docs.py).
find_stale_docs.py: Design the functions for this script.
get_file_last_modified(filepath): Uses Git history to find the last commit date for a file.
main():
Iterates through all .md files in the docs/ and tutorials/ directories.
If a file's last modification date is older than 12 months, open a GitHub issue flagging the document for review (a minimal sketch of this script follows).
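A minimal sketch of find_stale_docs.py along these lines, assuming it runs in a full-history checkout and leaving issue creation to the surrounding workflow, might look like this:

```python
# Minimal sketch of scripts/find_stale_docs.py (illustrative, not the final script).
import subprocess
from datetime import datetime, timedelta, timezone
from pathlib import Path

STALE_AFTER = timedelta(days=365)  # roughly 12 months

def get_file_last_modified(filepath):
    """Return the date of the last commit that touched the file."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%cI", "--", str(filepath)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return datetime.fromisoformat(out)

def main():
    cutoff = datetime.now(timezone.utc) - STALE_AFTER
    for directory in ("docs", "tutorials"):
        for md_file in Path(directory).rglob("*.md"):
            if get_file_last_modified(md_file) < cutoff:
                # The workflow would open a GitHub issue here; printing keeps
                # this sketch self-contained.
                print(f"STALE: {md_file}")

if __name__ == "__main__":
    main()
```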