Markdown URLs with non-file extensions are incorrectly recognized as citations #2017

@EMjetrot

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

I use a mixture of PDF and markdown files as data sources, and noticed that links in the markdown text get interpreted as citations. Here is an example of markdown with a link inside (the text is Danish; an English translation follows it):

I kraft af dit ansættelsesforhold i styrelsen indsamler, opbevarer og behandler Koncern HR en række personlige oplysninger om dig. Du kan læse meget mere herom lige her [Sådan behandler Koncern HR persondata om dig](https://somefile.pdf).

(In English: "By virtue of your employment with the agency, Koncern HR collects, stores, and processes a range of personal information about you. You can read much more about it right here [How Koncern HR processes personal data about you](https://somefile.pdf).")

Since the link title [Sådan behandler Koncern HR persondata om dig] is enclosed in square brackets, just like a citation, the title wrongly ends up listed among the sources, and the link is not formatted correctly in the answer.
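To illustrate why the link title is picked up, here is a minimal sketch of the splitting step. It assumes the parser splits the answer on square-bracketed spans using a regex with a capturing group (the pattern and the helper name `splitOnBrackets` are my assumptions for illustration, not copied from the repo):

```typescript
// Assumption: bracketed spans are captured, so their contents land at odd indices.
const BRACKET_PATTERN = /\[([^\]]+)\]/g;

// Hypothetical helper that mimics the splitting step of the answer parser.
function splitOnBrackets(answer: string): string[] {
    return answer.split(BRACKET_PATTERN);
}

const sampleAnswer = "Read more [Link title](https://somefile.pdf) and see [info.pdf].";
const sampleParts = splitOnBrackets(sampleAnswer);

// Every odd index holds bracketed text: here both the markdown link title
// and the real citation, which the parser cannot tell apart by position alone.
const bracketedParts = sampleParts.filter((_, index) => index % 2 === 1);
```

With this input, the link title and `info.pdf` both end up in `bracketedParts`, while the `(https://somefile.pdf)` remainder is left behind as plain text, which matches the broken link formatting described above.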

(Screenshot: before_change)

Any log messages given by the failure

None

Expected/desired behavior

If we assume that a citation is always a file, we can modify the code to check whether the citation candidate contains a valid file extension. Only candidates with a recognized file extension will be treated as citations, while any other text enclosed in square brackets will remain unchanged.

Here is the code from line 32 in ./app/frontend/src/components/Answer/AnswerParser.tsx before the change:

const fragments: string[] = parts.map((part, index) => {
        if (index % 2 === 0) {
            return part;
        } else {
            let citationIndex: number;
            if (citations.indexOf(part) !== -1) {
                citationIndex = citations.indexOf(part) + 1;
            } else {
                citations.push(part);
                citationIndex = citations.length;
            }

            const path = getCitationFilePath(part);

            return renderToStaticMarkup(
                <a className="supContainer" title={part} onClick={() => onCitationClicked(path)}>
                    <sup>{citationIndex}</sup>
                </a>
            );
        }
    });

And here is the changed code where citation candidates are checked for common file extensions:

const fragments: string[] = parts.map((part, index) => {

        if (index % 2 === 0) {
            // This is text outside square brackets (regular text)
            return part;
        } else {
            // This is text inside square brackets (citation or markdown URL)
            // Check if the citation contains a recognized file extension
            const validFileExtensions = [
                ".pdf",
                ".html",
                ".docx",
                ".pptx",
                ".xlsx",
                ".jpg",
                ".jpeg",
                ".png",
                ".bmp",
                ".tiff",
                ".heif",
                ".txt",
                ".json",
                ".csv",
                ".md"
            ];

            // Check if the part is a valid citation file
            const citationIsFile = validFileExtensions.some(ext => part.trim().toLowerCase().includes(ext));

            if (!citationIsFile) {
                // Return the part with square brackets, as it originally was
                return `[${part}]`;
            }

            // Handle citation: either it exists already or is new
            let citationIndex: number;
            if (citations.indexOf(part) !== -1) {
                citationIndex = citations.indexOf(part) + 1;
            } else {
                citations.push(part);
                citationIndex = citations.length;
            }

            const path = getCitationFilePath(part);

            // Return the citation as a clickable link with superscript
            return renderToStaticMarkup(
                <a className="supContainer" title={part} onClick={() => onCitationClicked(path)}>
                    <sup>{citationIndex}</sup>
                </a>
            );
        }
    });

This change ensures that markdown links are displayed correctly, and the citations section only includes valid source files.
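One caveat with the `includes` check is that it also matches an extension anywhere inside the bracketed text (for example `.md` inside `notes.mdx`). A stricter variant could look like the sketch below. The helper name `isCitationFile` and the constant `CITATION_EXTENSIONS` are hypothetical, and the handling of a `#page=2` style suffix is an assumption about how citations may be formatted:

```typescript
// Hypothetical allow-list; adjust it to the file types your index actually holds.
const CITATION_EXTENSIONS: readonly string[] = [
    ".pdf", ".html", ".docx", ".pptx", ".xlsx", ".jpg", ".jpeg", ".png",
    ".bmp", ".tiff", ".heif", ".txt", ".json", ".csv", ".md"
];

// Returns true when the bracketed text names a file with a recognized extension.
// Assumption: a citation may carry a URL-style fragment such as "#page=2".
function isCitationFile(candidate: string): boolean {
    const normalized = candidate.trim().toLowerCase();
    const hashIndex = normalized.indexOf("#");
    // Drop the fragment (if any) so the extension is the end of the string.
    const fileName = hashIndex === -1 ? normalized : normalized.slice(0, hashIndex);
    return CITATION_EXTENSIONS.some(ext => fileName.endsWith(ext));
}
```

In the parser, `isCitationFile(part)` would replace the inline `validFileExtensions.some(...)` expression. Keeping the list outside the `map` callback also avoids rebuilding the array for every fragment.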

(Screenshot: after_change)

Would it make sense to implement this code change in the repo?

OS and Version?

macOS (Sierra)

azd version?

azd version 1.10.1 (commit 31409a33266fb4a5fdbb644bc83988e725d6c7c9)

Versions

I'm running the release titled "2024-08-23: Optional speech output is now on-demand"

Thanks!

Thank you so much for this awesome repo!

Labels: bug, open issue
