Commit 99cf4bd

mckeea and Matteo Mattiuzzi authored
feat(merge): promote develop to test (#5)
* update: editor's manual
* Update deploy-docs.yml
* Small update to editor's manual to check git workflow
* LLM annotator for intros and keywords
* corrected to the new url
* Add date,version fields to listings
* Fix: generate a correct sitemap.xml

---------

Co-authored-by: Matteo Mattiuzzi <matteo.mattiuzzi@eea.europa.eu>
1 parent 68b6358 commit 99cf4bd

6 files changed: +151 -36 lines changed


.github/scripts/build-docs.sh

Lines changed: 9 additions & 33 deletions
@@ -1,42 +1,15 @@
 #!/bin/bash
 set -e

-# echo "🐍 Setting up Python environment..."
-# apt-get update
-# apt-get install -y python3 python3-venv python3-pip
-
-# echo "📦 Creating virtual environment..."
-# python3 -m venv venv
-# source venv/bin/activate
-
-# echo "⬆️ Upgrading pip inside virtual environment..."
-# pip install --upgrade pip
-
-# echo "📦 Installing Python dependencies..."
-# pip install \
-# keybert \
-# ruamel.yaml \
-# pyyaml \
-# transformers==4.37.2 \
-# accelerate==0.27.2
-
-# source venv/bin/activate
-
-# echo "🛠 Setting up default Quarto configuration..."
-# mv _quarto_not_used.yaml _quarto.yaml
-
-# echo "🏷 Generating keywords..."
-# python scripts/render/generate_keywords.py
-
-#echo "🧹 Cleaning up cached _site directory..."
-#rm -rf _site
-
-
 echo "🖼 Render all documents into to HTML/DOCX"
 sudo cp /usr/bin/chromium /usr/bin/chromium-browser
-QUARTO_CHROMIUM_HEADLESS_MODE=new quarto render --to html
-QUARTO_CHROMIUM_HEADLESS_MODE=new quarto render --to docx --no-clean
+QUARTO_CHROMIUM_HEADLESS_MODE=new quarto render --to docx
 find _site -type f -name 'index.docx' -delete
+QUARTO_CHROMIUM_HEADLESS_MODE=new quarto render --to html --no-clean
+
+# Backup the correct sitemap as it may be overwritten by next operations
+sleep 5
+mv _site/sitemap.xml _site/sitemap.xml.bkp

 echo "🛠 Generate index.qmd files for all DOCS/* folders"e
 node .github/scripts/generate_index_all.mjs
@@ -63,6 +36,9 @@ echo '<!DOCTYPE html>
 </body>
 </html>' > _site/index.html

+# Revert the correct sitemap
+cp _site/sitemap.xml.bkp _site/sitemap.xml
+rm -f _site/sitemap.xml.bkp

 echo "📄 Converting .docx files to .pdf..."
 #chmod +x ./convert_docx_to_pdf.sh
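
The backup-and-restore steps this commit adds around index generation can be sketched in Python as a context manager. This is only an illustration under stated assumptions, not part of the commit: `preserve_file` and `demo` are hypothetical names, and the sketch copies the file instead of moving it, so the original stays in place while later steps run.

```python
from contextlib import contextmanager
from pathlib import Path
import shutil
import tempfile


@contextmanager
def preserve_file(path: Path):
    """Hypothetical helper: restore `path` to its current content on exit."""
    backup = path.with_name(path.name + ".bkp")
    shutil.copy2(path, backup)  # keep a copy of the known-good file
    try:
        yield backup
    finally:
        # Path.replace overwrites the target, so the backup both restores
        # the good content and disappears in a single step.
        backup.replace(path)


def demo() -> str:
    """Simulate a later build step clobbering sitemap.xml."""
    with tempfile.TemporaryDirectory() as tmp:
        sitemap = Path(tmp) / "sitemap.xml"
        sitemap.write_text("<urlset>good</urlset>", encoding="utf-8")
        with preserve_file(sitemap):
            # Stand-in for any step that rewrites the sitemap.
            sitemap.write_text("<urlset>overwritten</urlset>", encoding="utf-8")
        return sitemap.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(demo())
```

Because the restore runs in a `finally` clause, the good sitemap comes back even if an intermediate step raises, which the `mv`/`cp`/`rm` sequence under `set -e` would not guarantee.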

.github/scripts/generate_index_all.mjs

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ listing:
   type: table
   contents: .
   sort: title
-  fields: [title]
+  fields: [title, date, version]
 ---
 `;
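
The listing now names `date` and `version` alongside `title`; these values presumably come from each document's front matter. A rough sketch of a check that reports which of those keys a `.qmd` file lacks is shown here. It is not part of the commit: `missing_listing_fields` is a hypothetical helper, and it assumes the keys appear as plain top-level `key: value` lines.

```python
import re
from typing import List, Sequence

# Field names taken from the listing configuration in this diff.
REQUIRED_FIELDS = ("title", "date", "version")


def missing_listing_fields(
    qmd_text: str, required: Sequence[str] = REQUIRED_FIELDS
) -> List[str]:
    """Return the required front matter keys that the document lacks."""
    lines = qmd_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return list(required)  # no front matter at all
    present = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the front matter block
        match = re.match(r"^([A-Za-z_][\w-]*)\s*:", line)  # top-level keys only
        if match:
            present.add(match.group(1))
    return [field for field in required if field not in present]


if __name__ == "__main__":
    sample = '---\ntitle: "Example"\nversion: 0.6\n---\nBody text\n'
    print(missing_listing_fields(sample))
```

Running such a check before rendering would surface documents whose listing row would otherwise lack a date or version.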

.github/scripts/generate_intros_and_keywords.py

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+from pathlib import Path
+import json
+import time
+import re
+import google.generativeai as genai
+import tiktoken
+import yaml
+from io import StringIO
+import os
+from pathlib import Path
+
+# Configuration
+API_KEY = os.getenv("GEMINI_API_KEY")
+if not API_KEY:
+    raise EnvironmentError("GEMINI_API_KEY environment variable not set")
+MODEL_NAME = "gemini-2.0-flash"
+TOKEN_LIMIT_PER_MINUTE = 950_000  # Keep a safe margin below 1M
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+INPUT_DIR = (SCRIPT_DIR / "../../DOCS").resolve()
+
+PROMPT = """You are an AI assistant helping to enrich a Quarto Markdown (.qmd) technical document prepared for the European Environment Agency (EEA).
+
+Your tasks:
+1. Read and understand the entire attached document.
+2. Generate a professional, engaging **Introduction** (max 1 paragraph) that clearly explains the document’s purpose, scope, and technical focus.
+3. Extract exactly 10 **precise and conceptually meaningful keywords or key phrases** that reflect the core scientific or technical content of the document.
+
+Keyword guidance:
+- Do **not** use general terms like \"Urban Atlas\", \"metadata\", \"documentation\", \"nomenclature\", or \"report\".
+- Focus on **specific concepts, methods, environmental indicators, technical systems, data processing strategies**, or **analytical results** that are central to the document.
+- Use **multi-word phrases** when needed for clarity and specificity.
+- Think like an expert indexing the document for scientific search or semantic web use.
+
+Return only the result as a raw JSON object (no code block, no explanation):
+
+{
+  \"introduction\": \"...\",
+  \"keywords\": [\"keyword1\", \"keyword2\", ..., \"keyword10\"]
+}
+"""
+
+# Setup Gemini
+genai.configure(api_key=API_KEY)
+model = genai.GenerativeModel(MODEL_NAME)
+encoding = tiktoken.get_encoding("cl100k_base")
+total_tokens_sent = 0
+
+
+# Function to update YAML frontmatter using PyYAML
+def update_yaml_header(content: str, description: str, keywords_list: list):
+    lines = content.splitlines()
+    if lines[0].strip() != "---":
+        return content
+
+    try:
+        end_idx = lines[1:].index("---") + 1
+    except ValueError:
+        return content
+
+    yaml_block = "\n".join(lines[1:end_idx])
+    yaml_data = yaml.safe_load(yaml_block) or {}
+    yaml_data["description"] = description.replace("\n", " ").strip()
+    yaml_data["keywords"] = keywords_list
+
+    new_yaml_block = yaml.dump(yaml_data, sort_keys=False, allow_unicode=True).strip()
+    new_lines = ["---"] + new_yaml_block.splitlines() + ["---"] + lines[end_idx + 1 :]
+    return "\n".join(new_lines)
+
+
+# Function to process one document with Gemini
+def process_document_with_llm(doc_path: Path):
+    print("Processing ", doc_path)
+    global total_tokens_sent
+
+    file_contents = doc_path.read_text(encoding="utf-8")
+    input_tokens = len(encoding.encode(file_contents))
+    if total_tokens_sent + input_tokens > TOKEN_LIMIT_PER_MINUTE:
+        print(
+            f"[SKIPPED] {doc_path} would exceed token budget. Estimated at {input_tokens} tokens."
+        )
+        return
+
+    response = model.generate_content(
+        contents=[
+            {
+                "role": "user",
+                "parts": [
+                    {"text": PROMPT},
+                    {
+                        "inline_data": {
+                            "mime_type": "text/plain",
+                            "data": file_contents.encode("utf-8"),
+                        }
+                    },
+                ],
+            }
+        ]
+    )
+
+    total_tokens_sent += input_tokens
+
+    raw_text = response.text.strip()
+    if raw_text.startswith("```"):
+        raw_text = re.sub(r"^```(?:json)?\s*", "", raw_text)
+        raw_text = re.sub(r"\s*```$", "", raw_text)
+
+    try:
+        parsed_output = json.loads(raw_text)
+        introduction = parsed_output["introduction"]
+        keywords_list = parsed_output["keywords"]
+        keywords = ", ".join(keywords_list)
+    except (json.JSONDecodeError, KeyError) as e:
+        print(f"[ERROR] Invalid response for {doc_path}:", raw_text)
+        return
+
+    updated_content = update_yaml_header(file_contents, introduction, keywords_list)
+    output_file = doc_path.with_name(doc_path.stem + ".qmd")
+    output_file.write_text(updated_content, encoding="utf-8")
+
+    print("Estimated input tokens:", input_tokens)
+
+
+# Process all .qmd files
+BLACKLISTED_DIRS = {"templates", "includes", "theme"}
+
+for doc_path in INPUT_DIR.rglob("*.qmd"):
+    if any(part in BLACKLISTED_DIRS for part in doc_path.parts):
+        continue
+    process_document_with_llm(doc_path)
+
+print("Total tokens sent:", total_tokens_sent)
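
The script keeps a single running total for the whole run, so once `total_tokens_sent` approaches `TOKEN_LIMIT_PER_MINUTE` every remaining document is skipped instead of delayed. One possible alternative, sketched here under assumptions and not part of the commit, is a sliding one-minute window that waits until enough budget frees up. `MinuteTokenBudget` is a hypothetical class; the `clock` and `sleep` parameters exist only so the behaviour can be exercised without real waiting.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class MinuteTokenBudget:
    """Hypothetical sliding-window token budget (illustration only)."""

    def __init__(
        self,
        limit: int,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def _used(self) -> int:
        """Drop entries older than the window and sum what remains."""
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)

    def acquire(self, tokens: int) -> bool:
        """Block until `tokens` fit in the window; False if they never can."""
        if tokens > self.limit:
            return False  # a single document larger than the whole budget
        while self._used() + tokens > self.limit:
            oldest = self.events[0][0]
            # Sleep until the oldest recorded request leaves the window.
            self.sleep(max(0.0, oldest + self.window - self.clock()))
        self.events.append((self.clock(), tokens))
        return True


if __name__ == "__main__":
    now = [0.0]  # fake clock so the demo finishes instantly

    def fake_sleep(seconds: float) -> None:
        now[0] += seconds

    budget = MinuteTokenBudget(100, clock=lambda: now[0], sleep=fake_sleep)
    budget.acquire(60)
    budget.acquire(60)  # has to wait for the first request to expire
    print("simulated seconds waited:", now[0])
```

In the script, a call such as `budget.acquire(input_tokens)` could replace the skip branch, with a `False` result still reported as skipped. Note that the counts remain estimates either way, since they come from a `tiktoken` encoding rather than from the Gemini tokenizer.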

.github/workflows/deploy-docs.yml

Lines changed: 7 additions & 0 deletions
@@ -21,6 +21,13 @@ jobs:
         with:
           fetch-depth: 0

+      - name: Generate intros and keywords
+        uses: addnab/docker-run-action@v3
+        with:
+          image: mckeea/llm-doc-annotator:latest
+          options: -e GEMINI_API_KEY=${{ secrets.GEMINI_API_KEY }} -v ${{ github.workspace }}:/app
+          run: python .github/scripts/generate_intros_and_keywords.py
+
       - name: Build Docs
         run: .github/scripts/build-docs.sh

DOCS/guidelines/editor-manual.qmd

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 title: "Guide for Writing Techncial Documentation"
 subtitle: "Copernicus Land Monitoring Service"
 author: "European Environment Agency (EEA)"
-version: 0.5
+version: 0.6
 description: "A comprehensive guide for creating technical documentation for the Copernicus
   Land Monitoring Service using Quarto. It covers Markdown basics, document rendering,
   and the review process, ensuring consistency and clarity in documentation."

README.md

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@

 This repository contains technical documents for the CLMS, such as ATBD's, PUM's, or nomenclature guidelines.

-The CLMS documents library is deployed [here](https://eea.github.io/CLMS_documents/)
+The CLMS documents library is deployed [here](https://eea.github.io/CLMS_documents/main/DOCS/)
