
Conversation

@pedrogaudencio
Collaborator

@pedrogaudencio pedrogaudencio commented Jan 31, 2026

  • remove front matter
  • normalise list markers from hyphens to asterisks
  • fix link resolution into subjects
  • explicitly set first commit timestamp (upon article fetch)

Closes /issues/69



* add normalizeListMarkers function with pre-compiled regex (a minimal sketch follows this list)
* process markdown after htmlToMarkdown, before normalizeImageURLs
* preserve indentation for nested lists
* only affect line-start list markers, not mid-sentence hyphens
* add commitDateOptions struct for author/committer dates
* include dates field in createFileRequest with RFC3339 format
* ensures commit timestamp reflects when the tool runs
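For illustration only, here is a minimal sketch of what a normalizeListMarkers helper along these lines could look like, built around the pre-compiled listMarkerRE pattern quoted in the review comments further down; the names and structure are assumptions, not the actual implementation:

package main

import (
	"fmt"
	"regexp"
)

// Matches a hyphen bullet at the start of a line: captures the indentation
// and the single space that makes the hyphen a list marker.
var listMarkerRE = regexp.MustCompile(`(?m)^([ \t]*)-( )`)

// normalizeListMarkers rewrites "- item" bullets to "* item", preserving
// indentation for nested lists and leaving mid-sentence hyphens untouched.
func normalizeListMarkers(md string) string {
	return listMarkerRE.ReplaceAllString(md, "${1}*${2}")
}

func main() {
	fmt.Println(normalizeListMarkers("- item\n  - nested item\nkeep this - hyphen"))
	// * item
	//   * nested item
	// keep this - hyphen
}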
@pedrogaudencio pedrogaudencio self-assigned this Jan 31, 2026
@pedrogaudencio pedrogaudencio added the enhancement (New feature or request) label Jan 31, 2026
@pedrogaudencio pedrogaudencio marked this pull request as ready for review January 31, 2026 12:54
Copilot AI review requested due to automatic review settings January 31, 2026 12:54
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


Contributor

Copilot AI left a comment


Pull request overview

This PR enhances the wiki2md/article-creator tooling so that Wikipedia-derived Markdown is normalized for Forkana (list markers, internal links) and initial repository commits carry explicit timestamps, while dropping YAML front matter from generated articles.

Changes:

  • Add normalizeListMarkers to convert hyphen-based unordered list items to asterisk markers while preserving indentation and leaving non-list hyphens untouched.
  • Add normalizeInternalLinks, stripLinkTitle, and extractWikiArticleName to rewrite various Wikipedia-style internal links into /:root/subject/... URLs, with comprehensive unit tests for list and link normalization (a simplified sketch follows this list).
  • Update article-creator to send commit date metadata via a dates field when creating README files, adjust the initial commit message, and update copyrights.
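For orientation, a simplified, hedged sketch of the internal-link rewriting described above; it skips title stripping, URL re-encoding, and fragment handling, and the helper name, regex, and target path are taken from the review comments and tests quoted below rather than from the actual source:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Matches a markdown link: [text](target).
var internalLinkRE = regexp.MustCompile(`\[([^\]]+)\]\(([^)]+)\)`)

// normalizeInternalLinks rewrites relative wiki-style links such as
// [Text](./Article) into [Text](/:root/subject/Article), leaving image links
// (preceded by "!") and non-relative targets untouched.
func normalizeInternalLinks(md string) string {
	matches := internalLinkRE.FindAllStringSubmatchIndex(md, -1)
	result := md
	// Walk the matches in reverse so earlier indexes stay valid after edits.
	for i := len(matches) - 1; i >= 0; i-- {
		m := matches[i]
		start, end := m[0], m[1]
		if start > 0 && result[start-1] == '!' {
			continue // image link, leave as-is
		}
		text := result[m[2]:m[3]]
		target := result[m[4]:m[5]]
		if !strings.HasPrefix(target, "./") {
			continue // only rewrite relative wiki-style links
		}
		article := strings.TrimPrefix(target, "./")
		result = result[:start] + "[" + text + "](/:root/subject/" + article + ")" + result[end:]
	}
	return result
}

func main() {
	fmt.Println(normalizeInternalLinks("[Start](./Link) and ![map](./Egypt)"))
	// [Start](/:root/subject/Link) and ![map](./Egypt)
}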

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

custom/services/wiki2md/main.go: Integrates list marker normalization and internal Wikipedia link rewriting into the article processing pipeline, and comments out front matter injection while adding the supporting regex/utility functions.
custom/services/wiki2md/main_test.go: Adds unit tests for normalizeListMarkers, normalizeInternalLinks, stripLinkTitle, and extractWikiArticleName to validate the new normalization behavior and edge cases.
custom/services/article-creator/main.go: Refactors request payloads to include commit date options for the initial README commit, updates the commit message, and slightly tidies config struct field alignment.
custom/services/article-creator/main_test.go: Updates the file header copyright and removes a trailing blank line; no functional behavior changes.


@eighttrigrams
Member

@pedrogaudencio it seems counterintuitive to me to choose "*" over "-" list markers, i.e. to normalise the latter into the former. Was this just a coin flip, i.e. just making a choice (any one) and having it normalised, or was there a reason for favouring asterisks? I looked at it from a typing perspective (I normally use dashes, which seems easier, and asterisks are already used for marking italics and bold).

contentB64 := base64.StdEncoding.EncodeToString([]byte(content))

// Set commit timestamp to current time in RFC3339 format
now := time.Now().Format(time.RFC3339)
Member


not sure what we want this time to be. if i understand correctly, first we run wiki2md in batches, and then later on article-creator from the fetched files. so if those run in relatively close succession, the now time is probably good. but it could also be that the files sit for a while in the "staging" phase, and then the now time no longer reflects when we fetched the article from wikipedia. i see that we have fetched_at data available, as an option.

i'd lean to use fetched_at. what do you think?

Collaborator Author


That's a good point. This was mostly to address the first commit timestamp since this is what we generally display as "when the article was created":

Screenshot 2026-02-01 at 11 43 08

Instead of:

Screenshot 2026-02-01 at 11 53 29

(26 years ago means 0001-01-01T00:00:00Z, which is the zero value of time.Time in Go)
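For reference, a minimal demonstration of that zero value:

package main

import (
	"fmt"
	"time"
)

func main() {
	var t time.Time                     // zero value of time.Time
	fmt.Println(t.Format(time.RFC3339)) // 0001-01-01T00:00:00Z
	fmt.Println(t.IsZero())             // true
}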

fetched_at was being used for the front matter (also removed in this PR) in wiki2md:

Screenshot 2026-02-01 at 11 50 37

We could potentially keep it, but at this point it would be only for this.

Member


@pedrogaudencio I think I got that. We don't want to show when the article got created on Wikipedia, but rather when it got imported into our system. My point was more that our import is broken into two phases, which don't need to happen in rapid succession, so the shown import date (the commit date) might be slightly off when the second phase happens later: it might say the import is from today, but the article got fetched from Wikipedia 2 weeks ago. So the article we tag as from today might diverge from how it looks on Wikipedia as of today. This is a question of historical accuracy (which we might not care about at all, but if it is easy to address, one could do it).

Collaborator Author


Ah yes, you're right - understood now! I guess a safe fix would be to set the first commit timestamp to the article file's creation time? Then it would match that first phase of importing the article. @eighttrigrams let me know if this sounds good and I'll create an issue to track it. 🙏
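A hypothetical sketch of that follow-up idea (not part of this PR): derive the commit timestamp from the fetched article file on disk instead of time.Now(). Go only exposes the modification time portably, so it stands in for the creation time here; the function name and usage are assumptions.

package main

import (
	"fmt"
	"os"
	"time"
)

// commitTimestampFor returns an RFC3339 timestamp taken from the article file
// on disk rather than the current time, so the first commit reflects phase one
// of the import (when the article was fetched and written).
func commitTimestampFor(articlePath string) (string, error) {
	info, err := os.Stat(articlePath)
	if err != nil {
		return "", err
	}
	return info.ModTime().Format(time.RFC3339), nil
}

func main() {
	ts, err := commitTimestampFor("README.md")
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Println(ts)
}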

eighttrigrams previously approved these changes Jan 31, 2026
@taoeffect
Member

Great work @pedrogaudencio!

If we're good to go let me know and I'll merge, otherwise let me know once you've addressed whatever you feel is worth addressing and I'll merge!

Opus 4.5 (no tools)

Details

Bugs

wiki2md/main.go:441 - Regex allows mismatched quotes

The linkTitleRE pattern allows mismatched quotes (e.g., ./test "title' would match):

var linkTitleRE = regexp.MustCompile(`^(.+?)\s+["'].*["']$`)

This also uses .* which could cause backtracking issues with crafted input. Fix by ensuring matching quote pairs and using a negated character class:

var linkTitleRE = regexp.MustCompile(`^(.+?)\s+(?:"[^"]*"|'[^']*')$`)
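A quick check of the suggested pattern (sketch only), showing that paired quotes still match while mismatched ones no longer do:

package main

import (
	"fmt"
	"regexp"
)

var linkTitleRE = regexp.MustCompile(`^(.+?)\s+(?:"[^"]*"|'[^']*')$`)

func main() {
	fmt.Println(linkTitleRE.MatchString(`./test "title"`)) // true
	fmt.Println(linkTitleRE.MatchString(`./test 'title'`)) // true
	fmt.Println(linkTitleRE.MatchString(`./test "title'`)) // false: mismatched quotes rejected
}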

wiki2md/main.go:228-231 - Dead code with spurious nolint directives

The nolint directives serve no purpose alongside commented-out code and will confuse linters/reviewers:

	// Add front matter
	//nolint:nolintlint
	//nolint:gofumpt
	// md = addFrontMatter(title, md)

Either remove the front matter functionality entirely if it's no longer needed, or keep it active. If removing:

	// Front matter intentionally omitted per issue #XXX

If the addFrontMatter function is no longer called anywhere, it should be deleted from the codebase.


Potential Issues

wiki2md/main.go:481-485 - Fragment encoding inconsistency

The fragment is preserved as-is while the article name is decoded and re-encoded:

		var fragment string
		if hashIdx := strings.Index(articleName, "#"); hashIdx != -1 {
			fragment = articleName[hashIdx:]
			articleName = articleName[:hashIdx]
		}

If the original fragment contained URL-encoded characters (e.g., #Section%20Name), they remain encoded while the article name gets re-encoded. For consistency, consider decoding/re-encoding the fragment as well:

		var fragment string
		if hashIdx := strings.Index(articleName, "#"); hashIdx != -1 {
			fragment = articleName[hashIdx+1:] // exclude the #
			articleName = articleName[:hashIdx]
			if decodedFrag, err := url.PathUnescape(fragment); err == nil {
				fragment = "#" + url.PathEscape(decodedFrag)
			} else {
				fragment = "#" + fragment
			}
		}

Test Coverage Gap

wiki2md/main_test.go:654-658 - Missing test case for image with wiki-like path

The test confirms .png images aren't affected, but doesn't test images with paths that look like wiki links:

		{
			name:     "image link not affected",
			input:    "![Egypt](./Egypt.png)",
			expected: "![Egypt](./Egypt.png)",
		},

Add a test case to verify the ! check works properly:

		{
			name:     "image with wiki-like path not affected",
			input:    "![map](./Egypt)",
			expected: "![map](./Egypt)",
		},

Gemini 3 Flash (with tools)

Details

1. Link Normalization Fragility

In custom/services/wiki2md/main.go:440-515, normalizeInternalLinks uses regex to find and replace links. While it attempts to handle images by checking the preceding character (result[matchStart-1] == '!'), this logic is prone to index-out-of-bounds errors and context-matching failures.

  • Problem 1 (Index Out of Bounds): result[matchStart-1] will panic if the markdown string starts exactly with [ (e.g., at the very beginning of the file).
  • Problem 2 (Regex limitation): The internalLinkRE pattern \[([^\]]+)\]\(([^)]+)\) can match nested brackets incorrectly (though rare in basic markdown) and doesn't account for escaped characters.

Suggestion: Use a safer check for image links and handle the start-of-string case.

// custom/services/wiki2md/main.go:465
        // Check if this is an image link (preceded by !)
        if matchStart > 0 && result[matchStart-1] == '!' {
            continue
        }

Improved Code:

        // Safer check for image link or start of string
        if matchStart > 0 && result[matchStart-1] == '!' {
            continue
        }

2. Commented-out Code instead of Removal

In custom/services/wiki2md/main.go:232-235, the front matter logic is commented out with a nolint directive.

    // Add front matter
    //nolint:nolintlint
    //nolint:gofumpt
    // md = addFrontMatter(title, md)

Review: Since the requirement is to "Remove front matter," the call to addFrontMatter should be deleted entirely rather than commented out. The addFrontMatter function itself should be removed if it's no longer used in production code, keeping only what's necessary for tests.

3. List Normalization Scope

In custom/services/wiki2md/main.go:424, normalizeListMarkers replaces hyphen markers with asterisks.

var listMarkerRE = regexp.MustCompile(`(?m)^([ \t]*)-( )`)

Issue: This regex accounts for spaces and tabs but may fail on mixed indentation or specific edge cases where a hyphen might start a line but isn't a list (though the ( ) requirement mitigates this). A more robust approach would be to use a Markdown parser (like goldmark) to manipulate the AST, ensuring only list items are transformed. However, for a simple cleanup tool, the regex is acceptable if the index out-of-bounds in link normalization is fixed.

4. API Consistency for Commit Timestamps

In custom/services/article-creator/main.go:374-381, you are explicitly setting the Dates field.

        Dates: commitDateOptions{
            Author:    now,
            Committer: now,
        },

Review: While this matches Gitea's CommitDateOptions, ensure that now (generated via time.RFC3339) is actually desired. By setting this, you are overriding the server's default behavior of using the current server time. If the goal is "Import article from Wikipedia," you might eventually want to use the FetchedAt timestamp from the Wikipedia metadata instead of time.Now().
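To make the shape of that payload concrete, a hedged sketch of the request structs this implies; the "dates" field mirrors Gitea's create-file contents API, while the Go identifiers and the commit message used here are guesses for illustration, not the actual article-creator code:

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"time"
)

type commitDateOptions struct {
	Author    string `json:"author"`
	Committer string `json:"committer"`
}

type createFileRequest struct {
	Content string            `json:"content"` // base64-encoded file body
	Message string            `json:"message"`
	Dates   commitDateOptions `json:"dates"`
}

// newReadmeRequest stamps both author and committer dates with the current
// time in RFC3339; swapping in a fetched_at value would only change `now`.
func newReadmeRequest(content, message string) createFileRequest {
	now := time.Now().Format(time.RFC3339)
	return createFileRequest{
		Content: base64.StdEncoding.EncodeToString([]byte(content)),
		Message: message,
		Dates:   commitDateOptions{Author: now, Committer: now},
	}
}

func main() {
	req := newReadmeRequest("# Egypt\n", "Initial import")
	out, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(out))
}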

5. Redundant Regex Compilation in Loop

In custom/services/wiki2md/main.go, linkTitleRE, relativeWikiLinkRE, etc., are defined globally (which is good), but stripLinkTitle and extractWikiArticleName are called inside the reverse loop of normalizeInternalLinks.

// custom/services/wiki2md/main.go:480
        linkURL = stripLinkTitle(linkURL)
        articleName := extractWikiArticleName(linkURL)

Review: The logic is sound and follows the pre-compilation pattern established in the file. No immediate performance issue, but ensure url.PathUnescape errors are logged rather than silently ignored to assist in debugging malformed Wikipedia links.

6. Test Coverage Improvements

The test TestNormalizeInternalLinks in custom/services/wiki2md/main_test.go:604 is excellent. However, it lacks a case for the "start of string" link mentioned in point 1.

Suggestion: Add this test case to verify the fix for the potential panic:

        {
            name:     "link at start of string",
            input:    "[Start](./Link)",
            expected: "[Start](/:root/subject/Link)",
        },

* fix linkTitleRE to use alternation for matching quote pairs, preventing mismatched quotes like "title' from matching
* decode and re-encode URL fragments consistently with article names for uniform encoding
* add test case for images with wiki-like paths (no file extension) to verify ! check works correctly
* remove commented-out addFrontMatter call with spurious nolint directives
* delete unused addFrontMatter function from main code
* remove orphaned TestAddFrontMatter and TestAddFrontMatterSourceURL test functions
* remove unused gopkg.in/yaml.v3 import from test file
@pedrogaudencio
Collaborator Author

@pedrogaudencio it seems counterintuitive to me to choose "*" over "-" list markers, i.e. to normalise the latter into the former. Was this just a coin flip, i.e. just making a choice (any one) and having it normalised, or was there a reason for favouring asterisks? I looked at it from a typing perspective (I normally use dashes, which seems easier, and asterisks are already used for marking italics and bold).

@eighttrigrams I too tend to use hyphens when writing markdown (unless it's in git commits), but for some reason the wysiwyg editor tool we use in Forkana converts the bullet points into asterisks once it renders the content for editing, and that often "breaks" the diffs (after submitting them as changes):

Screenshot 2026-02-01 at 11 28 00

IMO it doesn't make sense to consider that the user edited a bullet point from hyphen format to asterisk when they actually didn't:

Screenshot 2026-02-01 at 11 28 51

Note: the change here only converts hyphens into asterisks when they are bullet points (i.e. at the beginning of the line, followed by a space: `- `).

Of course, another solution is to fix this instead/also directly in the wysiwyg editor tool we use in the frontend.

Additional context: this tool might only be used in development to populate Forkana with articles, and it doesn't necessarily follow that batch-importing articles from other sources to populate production will run into the same content-parsing issues. So in the end, this change (applied when importing articles from Wikipedia) solves a specific problem rather than mandating the format of bullet points in Forkana - whereas the wysiwyg editor certainly does.

@pedrogaudencio
Collaborator Author

Addressed

  • ✅ Regex allows mismatched quotes in linkTitleRE: changed to (?:"[^"]*"|'[^']*') so only matching quote pairs are accepted
  • ✅ Dead code with spurious nolint directives: removed commented code, addFrontMatter function, and orphaned test functions
  • ✅ Fragment encoding inconsistency: decoded/re-encoded consistently with article names
  • ✅ Missing test case for image with wiki-like path: added test case "image with wiki-like path not affected"

Skipped

  • Index out of bounds in normalizeInternalLinks: the matchStart > 0 check handles the start-of-string case correctly and is covered by a test
  • List normalisation scope: regex is enough
  • API consistency for commit timestamps: commit time reflects article creation time, not Wikipedia fetch time
  • Redundant regex compilation in loop: not an issue
  • Missing test case for link at start of string: the existing input [Egypt](./Egypt) already starts at position 0 and the test passes

@eighttrigrams
Member


@pedrogaudencio Yeah, ok, I wasn't under the impression that this mandates the use of "*" in Forkana; I was only wondering whether we care about how the initial pages of the Wikipedia imports end up looking. Because those are many, and people use them as starting points. They would then use "*" for continuity; they wouldn't necessarily make change requests just to switch those to "-". But I think this is really not a pressing issue right now. Rather something to be addressed when we actually do a trial run of the imports, right before we do them.

@taoeffect taoeffect merged commit 7ad6a3d into master Feb 1, 2026
32 checks passed
@taoeffect taoeffect deleted the improve-populate-tools branch February 1, 2026 14:35

Labels

enhancement (New feature or request), modifies/go


Development

Successfully merging this pull request may close these issues.

Improve populate tools to better format the imported articles
