Commit 19ec326 (parent: 7f635a9)

SOLR-17979 Improve changes2html.py for authors and PR detection (#3831)

- Authors with url and github nick handled
- Plain PR ref `#123` detected as PR#123 with github link
- Correct a changelog yaml missing JIRA issue
- Fix links in dev-docs/changelog.adoc
- Describe logchangeArchive task
- Remove mention of Perl as a requirement for build

File tree: 5 files changed, +243 −33 lines

changelog/v9.10.0/SOLR-17619 Use logchange for changelog management.yml

Lines changed: 3 additions & 0 deletions

@@ -5,3 +5,6 @@ authors:
   - name: Jan Høydahl
     nick: janhoy
     url: https://home.apache.org/phonebook.html?uid=janhoy
+links:
+  - name: SOLR-17619
+    url: https://issues.apache.org/jira/browse/SOLR-17619
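The `links` entry added above is what lets downstream tooling associate the change with its JIRA issue. As a quick illustration, a stdlib-only sketch (a hypothetical helper, not part of this commit) of recovering the issue key from such a URL:

```python
import re

def issue_key_from_url(url):
    """Pull a JIRA issue key (e.g. SOLR-17619) out of a browse URL."""
    match = re.search(r'/browse/([A-Z]+-\d+)', url)
    return match.group(1) if match else None

print(issue_key_from_url("https://issues.apache.org/jira/browse/SOLR-17619"))  # SOLR-17619
```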

dev-docs/changelog.adoc

Lines changed: 8 additions & 3 deletions

@@ -37,7 +37,7 @@ solr/

 == 3. The YAML format

-Below is an example of a changelog yaml fragment. The full yaml format is xref:https://logchange.dev/tools/logchange/reference/#tasks[documented here], but we normally only need `title`, `type`, `authors` and `links`. For a change without a JIRA, you can add the PR number in `issues`:
+Below is an example of a changelog yaml fragment. The full yaml format is https://logchange.dev/tools/logchange/reference/#yaml-entry-format[documented here], but we normally only need `title`, `type`, `authors` and `links`. For a change without a JIRA, you can add the PR number in `issues`:

 [source, yaml]
 ----

@@ -120,8 +120,13 @@ The logchange gradle plugin offers some tasks, here are the two most important:

 | `logchangeRelease`
 | Creates a new changelog release by moving files from `changelog/unreleased/` directory to `changelog/vX.Y.Z` directory
+
+| `logchangeArchive`
+| Archives the list of released versions up to (and including) the specified version by transferring their summaries to the `archive.md` file, merging all existing archives, and deleting the corresponding version directories.
 |===

+The `logchangeRelease` and `logchangeGenerate` tasks are used by the Release Wizard. The `logchangeArchive` task can be run once for every major release, or when the number of versioned changelog folders grows too large.
+
 These are integrated in the Release Wizard.

 === 6.2 Migration tool

@@ -242,5 +247,5 @@ Example report output (Json or Markdown):

 == 7. Further Reading

-* xref:https://github.com/logchange/logchange[Logchange web page]
-* xref:https://keepachangelog.com/en/1.1.0/[keepachangelog.com website]
+* https://github.com/logchange/logchange[Logchange web page]
+* https://keepachangelog.com/en/1.1.0/[keepachangelog.com website]

dev-docs/how-to-contribute.adoc

Lines changed: 1 addition & 1 deletion

@@ -33,7 +33,7 @@ In order to make a new contribution to Solr you will use the fork you have creat

 1. Create a new Jira issue in the Solr project: https://issues.apache.org/jira/projects/SOLR/issues
 2. Create a new branch in your Solr fork to provide a PR for your contribution on the newly created issue. Make any necessary changes for the given bug/feature in that branch. You can use additional information in these dev-docs to build and test your code as well as ensure it passes code quality checks.
 3. Once you are satisfied with your changes, get your branch ready for a PR by running `./gradlew tidy updateLicenses check -x test`. This will format your source code, update licenses of any dependency version changes and run all pre-commit tests. Commit the changes.
-* Note: the `check` command requires `perl` and `python3` to be present on your `PATH` to validate documentation.
+* Note: the `check` command requires `python3` to be present on your `PATH` to validate documentation.
 4. Open a PR of your branch against the `main` branch of the apache/solr repository. When you open a PR on your fork, this should be the default option.
 * The title of your PR should include the Solr Jira issue that you opened, i.e. `SOLR-12345: New feature`.
 * The PR description will automatically populate with a pre-set template that you will need to fill out.

dev-docs/solr-source-code.adoc

Lines changed: 1 addition & 1 deletion

@@ -34,7 +34,7 @@ To build the documentation, type `./gradlew -p solr documentation`.

 `./gradlew check` will assemble Solr and run all validation tasks unit tests.

-NOTE: the `check` command requires `perl` and `python3` to be present on your `PATH` to validate documentation.
+NOTE: the `check` command requires `python3` to be present on your `PATH` to validate documentation.

 To build the final Solr artifacts run `./gradlew assemble`.

gradle/documentation/changes-to-html/changes2html.py

Lines changed: 230 additions & 28 deletions

@@ -138,40 +138,232 @@ def __init__(self, title="Solr Changelog"):
              self.GITHUB_ISSUE_PREFIX, 'GITHUB#{0}')
         ]

-    def extract_issue_from_text(self, text):
+    def _format_issue_link(self, url_prefix, issue_id, label):
+        """Format a single issue reference as an HTML anchor tag"""
+        return f'<a href="{url_prefix}{issue_id}">{label}</a>'
+
+    def _extract_markdown_issue(self, text):
         """
-        Extract the first JIRA/GitHub issue from markdown text.
-        Returns (issue_link_html, text_without_issue)
+        Extract markdown-formatted JIRA/GitHub issues like [SOLR-123](url) or [PR#123](url).
+        Returns (issue_link_html, text_without_issue) or (None, text) if not found.
         """
         for pattern, url_prefix, label_fmt in self.issue_patterns:
             match = re.search(pattern, text)
             if match:
                 issue_id = match.group(1)
                 label = label_fmt.format(issue_id)
-                issue_html = f'<a href="{url_prefix}{issue_id}">{label}</a>'
+                issue_html = self._format_issue_link(url_prefix, issue_id, label)
                 text_without = (text[:match.start()] + text[match.end():]).strip()
                 return issue_html, text_without
+
         return None, text

+    def _extract_plain_pr_references(self, text):
+        """
+        Extract plain GitHub PR references like #123 or #123 #456.
+        Only matches PRs that appear before the author list (before opening paren or at end).
+        Returns (issue_link_html, text_without_issue) or (None, text) if not found.
+        """
+        # Pattern: #\d+ optionally followed by more #\d+ before opening paren or end of string
+        pattern = r'#(\d+)(?:\s+#(\d+))*\s*(?=\(|$)'
+        match = re.search(pattern, text)
+
+        if not match:
+            return None, text
+
+        # Extract all PR numbers from the matched text
+        pr_numbers = re.findall(r'#(\d+)', match.group(0))
+        if not pr_numbers:
+            return None, text
+
+        # Format each PR as an HTML link and join with commas
+        pr_links = [self._format_issue_link(self.GITHUB_PR_PREFIX, pr_num, f'PR#{pr_num}')
+                    for pr_num in pr_numbers]
+        issue_html = ', '.join(pr_links)
+
+        # Remove the PR references from the text
+        text_without = (text[:match.start()] + text[match.end():]).strip()
+        return issue_html, text_without
+
+    def extract_issue_from_text(self, text):
+        """
+        Extract the first issue reference from text.
+        Tries in order: markdown JIRA/GitHub issues, plain GitHub PR references.
+        Returns (issue_link_html, text_without_issue) or (None, text) if not found.
+        """
+        # Try markdown-formatted issues first
+        issue_html, text_without = self._extract_markdown_issue(text)
+        if issue_html:
+            return issue_html, text_without
+
+        # Fall back to plain GitHub PR references
+        return self._extract_plain_pr_references(text)
+
+    def _format_single_author(self, author_text):
+        """
+        Format a single author entry to HTML.
+        Supports:
+        - Plain name: "Jan Høydahl" -> "Jan Høydahl"
+        - Markdown link: "[Jan Høydahl](url)" -> "<a href=\"url\">Jan Høydahl</a>"
+        - Name with GitHub: "Jan Høydahl @janhoy" -> "<a href=\"https://github.com/janhoy\">Jan Høydahl</a>"
+        - Link with GitHub: "[Jan Høydahl](url) @janhoy" -> "<a href=\"url\">Jan Høydahl</a> <a href=\"https://github.com/janhoy\">@janhoy</a>"
+        """
+        author_text = author_text.strip()
+
+        # Extract markdown link: [text](url)
+        markdown_link_match = re.search(r'\[([^\]]+)\]\(([^)]+)\)', author_text)
+        # Extract GitHub handle: @username
+        github_match = re.search(r'@(\w+)', author_text)
+
+        if markdown_link_match:
+            # Has markdown link
+            link_text = markdown_link_match.group(1)
+            link_url = markdown_link_match.group(2)
+            html = f'<a href="{link_url}">{self.escape_html(link_text)}</a>'
+
+            if github_match:
+                # Has both markdown link and GitHub handle
+                github_handle = github_match.group(1)
+                html += f' <a href="https://github.com/{github_handle}">@{github_handle}</a>'
+
+            return html
+        elif github_match:
+            # Has GitHub handle but no markdown link - extract name and link it to GitHub
+            github_handle = github_match.group(1)
+            # Remove the @handle part to get just the name
+            name = author_text.replace(f'@{github_handle}', '').strip()
+            return f'<a href="https://github.com/{github_handle}">{self.escape_html(name)}</a>'
+        else:
+            # Plain name with no links
+            return self.escape_html(author_text)
+
+    def _extract_one_author_group(self, text, start_pos):
+        """
+        Extract one author group starting from start_pos (pointing to an opening paren).
+        Returns (author_content, end_pos) or (None, start_pos) if no valid group.
+        Handles markdown links [text](url) inside the group.
+        """
+        if start_pos >= len(text) or text[start_pos] != '(':
+            return None, start_pos
+
+        paren_depth = 0
+        bracket_depth = 0
+        content = []
+
+        for i in range(start_pos, len(text)):
+            char = text[i]
+
+            # Track brackets to know if we're inside [text]
+            if char == '[' and bracket_depth >= 0:
+                bracket_depth += 1
+            elif char == ']' and bracket_depth > 0:
+                bracket_depth -= 1
+            # Only track paren depth outside brackets
+            elif bracket_depth == 0:
+                if char == '(':
+                    paren_depth += 1
+                elif char == ')':
+                    paren_depth -= 1
+                    if paren_depth == 0:
+                        # Found matching closing paren
+                        return ''.join(content[1:]).strip(), i  # Skip opening paren
+
+            content.append(char)
+
+        return None, start_pos
+
     def extract_authors(self, text):
-        """Extract authors from trailing parentheses"""
-        # Match (author1) (author2) ... at the end
-        match = re.search(r'\s*(\([^)]+(?:\)\s*\([^)]+)*\))\s*$', text)
-        if match:
-            authors_text = match.group(1)
-            text_without_authors = text[:match.start()].strip()
-
-            # Parse individual authors
-            authors = re.findall(r'\(([^)]+)\)', authors_text)
-            authors_list = []
-            for author_group in authors:
-                # Split by comma or "and"
-                for author in re.split(r',\s*|\s+and\s+', author_group):
+        """Extract authors from trailing parentheses, handling markdown links [text](url)"""
+        authors_list = []
+
+        # Find all author groups at the end of the text
+        # Work backwards from the end to find opening parentheses
+        i = len(text) - 1
+
+        # Skip trailing whitespace
+        while i >= 0 and text[i] in ' \t\n\r':
+            i -= 1
+
+        if i < 0 or text[i] != ')':
+            return None, text
+
+        # Find all complete author groups by working backwards
+        author_positions = []  # List of (start, end) positions
+
+        while i >= 0:
+            if text[i] == ')':
+                # Find the matching opening paren for this closing paren
+                paren_depth = 1
+                bracket_depth = 0
+                j = i - 1
+
+                while j >= 0 and paren_depth > 0:
+                    char = text[j]
+
+                    # Track brackets
+                    if char == ']':
+                        bracket_depth += 1
+                    elif char == '[':
+                        bracket_depth -= 1
+                    # Track parens outside brackets
+                    elif bracket_depth == 0:
+                        if char == ')':
+                            paren_depth += 1
+                        elif char == '(':
+                            paren_depth -= 1
+
+                    j -= 1
+
+                if paren_depth == 0:
+                    # Found matching opening paren at j+1
+                    start_pos = j + 1
+
+                    # Check if this is part of a markdown link [text](url)
+                    # Markdown links have ] immediately before the (
+                    if start_pos > 0 and text[start_pos - 1] == ']':
+                        # This is a markdown link URL, not an author group
+                        # Continue searching backwards
+                        i = j
+                    else:
+                        # This is an author group
+                        author_positions.insert(0, (start_pos, i))
+
+                        # Move past this group
+                        i = j
+
+                        # Skip whitespace before next potential group
+                        while i >= 0 and text[i] in ' \t\n\r':
+                            i -= 1
+
+                        # Check if there's another author group right before
+                        if i >= 0 and text[i] != ')':
+                            # No more author groups
+                            break
+                else:
+                    break
+            else:
+                break
+
+        # Now process the found author groups
+        if author_positions:
+            # Extract text before first author group
+            first_start = author_positions[0][0]
+            text_without_authors = text[:first_start].strip()
+
+            # Extract and format each author group
+            for start_pos, end_pos in author_positions:
+                author_content = text[start_pos + 1:end_pos]
+
+                # Split by comma or "and" for multiple authors in one group
+                for author in re.split(r',\s*|\s+and\s+', author_content):
                     author = author.strip()
                     if author:
-                        authors_list.append(author)
+                        formatted_author = self._format_single_author(author)
+                        authors_list.append(formatted_author)
+
+            if authors_list:
+                return authors_list, text_without_authors

-            return authors_list, text_without_authors
         return None, text

     def format_changelog_item(self, item_text):

@@ -183,17 +375,27 @@ def format_changelog_item(self, item_text):
         # Extract the issue
         issue_html, text_after_issue = self.extract_issue_from_text(item_text)

-        if not issue_html:
-            return self.linkify_remaining_text(item_text)
+        # Always try to extract authors, whether or not we found an issue
+        authors_list, description = self.extract_authors(text_after_issue if issue_html else item_text)

-        # Extract authors and clean description
-        authors_list, description = self.extract_authors(text_after_issue)
-        description = re.sub(r'^[:\s]+', '', description).strip()
-
-        # Build HTML
-        html = f'{issue_html}: {self.escape_html(description)}'
+        if issue_html:
+            # We have an issue link
+            description = re.sub(r'^[:\s]+', '', description).strip()
+            html = f'{issue_html}: {self.escape_html(description)}'
+        else:
+            # No issue link found
+            if authors_list:
+                # We have authors but no issue - just use the description part
+                html = self.escape_html(description)
+            else:
+                # No issue and no authors - linkify the full text
+                return self.linkify_remaining_text(item_text)
+
+        # Add authors if we have them
         if authors_list:
-            html += f'<br /><span class="attrib">({self.escape_html(", ".join(authors_list))})</span>'
+            # Authors are already formatted as HTML, don't escape
+            html += f'<br /><span class="attrib">({", ".join(authors_list)})</span>'
+
         return html

     def linkify_remaining_text(self, text):
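The plain PR detection added in this commit hinges on one regex: a run of `#123` references is only treated as PRs when it sits directly before the trailing author parenthesis or at the end of the line. A standalone sketch using the same pattern (`find_pr_numbers` is a hypothetical wrapper, not a function in the script):

```python
import re

# Pattern from the diff above: one or more plain "#123" refs,
# matched only when followed by "(" (the author list) or end of string.
PR_PATTERN = r'#(\d+)(?:\s+#(\d+))*\s*(?=\(|$)'

def find_pr_numbers(text):
    """Return the PR numbers that would be linked, as strings."""
    match = re.search(PR_PATTERN, text)
    if not match:
        return []
    return re.findall(r'#(\d+)', match.group(0))

print(find_pr_numbers("Fix widget rendering #3831 (Jan Høydahl)"))   # ['3831']
print(find_pr_numbers("Fix widget rendering #123 #456"))             # ['123', '456']
print(find_pr_numbers("See issue #99 in the middle of a sentence"))  # []
```

The lookahead is why a `#99` buried mid-sentence is left alone while trailing references become `PR#99` links.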

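The other headline change is author handling ("Authors with url and github nick handled"). A trimmed-down, self-contained sketch of the `_format_single_author` logic from the diff above, substituting stdlib `html.escape` for the script's `escape_html` method (an assumption for the sake of a runnable example):

```python
import html
import re

def format_author(author_text):
    """Render one changelog author as HTML, linking a GitHub @handle if present."""
    author_text = author_text.strip()
    link = re.search(r'\[([^\]]+)\]\(([^)]+)\)', author_text)   # [name](url)
    handle = re.search(r'@(\w+)', author_text)                  # @nick
    if link:
        out = f'<a href="{link.group(2)}">{html.escape(link.group(1))}</a>'
        if handle:
            out += f' <a href="https://github.com/{handle.group(1)}">@{handle.group(1)}</a>'
        return out
    if handle:
        # Strip the @handle to recover the bare name, then link the name to GitHub
        name = author_text.replace(f'@{handle.group(1)}', '').strip()
        return f'<a href="https://github.com/{handle.group(1)}">{html.escape(name)}</a>'
    return html.escape(author_text)

print(format_author("Jan Høydahl @janhoy"))
# <a href="https://github.com/janhoy">Jan Høydahl</a>
```

Plain names fall through unchanged (escaped), which is why the new `format_changelog_item` no longer re-escapes the joined author list.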