Description
This issue is for a: (mark with an x)
- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
I use a mixture of PDF and Markdown files as data sources, and noticed that links in the Markdown text are interpreted as citations. Here is an example of Markdown containing a link (the text is in Danish, but the language does not matter):
I kraft af dit ansættelsesforhold i styrelsen indsamler, opbevarer og behandler Koncern HR en række personlige oplysninger om dig. Du kan læse meget mere herom lige her [Sådan behandler Koncern HR persondata om dig](https://somefile.pdf).
Since the title [Sådan behandler Koncern HR persondata om dig] of the link uses square brackets like citations, the title wrongly ends up being listed among the sources while the link doesn't get formatted correctly in the answer.
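To illustrate the behavior described above, here is a minimal sketch. It assumes the parser splits the answer on square brackets with a capturing regex, which is what the `index % 2` check in the code below suggests; the exact pattern and the sample `[handbook.pdf]` citation are my assumptions, not copied from the repo.

```typescript
// Assumed pattern: capture everything between "[" and "]".
const citationPattern = /\[([^\]]+)\]/g;

const answer =
    "Read more here [Sådan behandler Koncern HR persondata om dig](https://somefile.pdf). " +
    "See also [handbook.pdf].";

// split() with a capturing group keeps the captured text in the result,
// so bracketed content lands at the odd indexes.
const parts: string[] = answer.split(citationPattern);
const bracketed: string[] = parts.filter((_, index) => index % 2 === 1);

// Both the real citation and the Markdown link title show up as citation candidates,
// and the URL is left behind in the following text part without its title.
console.log(bracketed);
console.log(parts[2]);
```

The link title is indistinguishable from a citation at this stage, which is why it ends up in the sources list.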

Any log messages given by the failure
None
Expected/desired behavior
If we assume that a citation is always a file, we can modify the code to check whether the citation candidate contains a valid file extension. Only candidates with a recognized file extension will be treated as citations, while any other text enclosed in square brackets will remain unchanged.
Here is the code from line 32 in ./app/frontend/src/components/Answer/AnswerParser.tsx before the change:
const fragments: string[] = parts.map((part, index) => {
    if (index % 2 === 0) {
        return part;
    } else {
        let citationIndex: number;
        if (citations.indexOf(part) !== -1) {
            citationIndex = citations.indexOf(part) + 1;
        } else {
            citations.push(part);
            citationIndex = citations.length;
        }
        const path = getCitationFilePath(part);
        return renderToStaticMarkup(
            <a className="supContainer" title={part} onClick={() => onCitationClicked(path)}>
                <sup>{citationIndex}</sup>
            </a>
        );
    }
});
And here is the changed code where citation candidates are checked for common file extensions:
const fragments: string[] = parts.map((part, index) => {
    if (index % 2 === 0) {
        // This is text outside square brackets (regular text)
        return part;
    } else {
        // This is text inside square brackets (citation or markdown link title)
        // File extensions that mark a candidate as a citation
        const validFileExtensions = [
            ".pdf",
            ".html",
            ".docx",
            ".pptx",
            ".xlsx",
            ".jpg",
            ".jpeg",
            ".png",
            ".bmp",
            ".tiff",
            ".heif",
            ".txt",
            ".json",
            ".csv",
            ".md"
        ];
        // Check if the part contains a known file extension (case-insensitive)
        const citationIsFile = validFileExtensions.some(ext => part.trim().toLowerCase().includes(ext));
        if (!citationIsFile) {
            // Return the part with square brackets, as it originally was
            return `[${part}]`;
        }
        // Handle citation: either it exists already or is new
        let citationIndex: number;
        if (citations.indexOf(part) !== -1) {
            citationIndex = citations.indexOf(part) + 1;
        } else {
            citations.push(part);
            citationIndex = citations.length;
        }
        const path = getCitationFilePath(part);
        // Return the citation as a clickable link with superscript
        return renderToStaticMarkup(
            <a className="supContainer" title={part} onClick={() => onCitationClicked(path)}>
                <sup>{citationIndex}</sup>
            </a>
        );
    }
});
This change ensures that markdown links are displayed correctly, and the citations section only includes valid source files.
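One possible tightening, which is not part of the change above: because the check uses `includes`, a link title that merely mentions an extension (for example "See the .md docs") would still count as a citation. The sketch below moves the check into a standalone helper that requires the extension at the end of the file name. The helper name `isCitationFile`, the constant name, and the handling of a trailing `#page=N` fragment are my assumptions for illustration.

```typescript
// Hypothetical helper; names and the extension list are illustrative assumptions.
const CITATION_EXTENSIONS: readonly string[] = [
    ".pdf", ".html", ".docx", ".pptx", ".xlsx", ".jpg", ".jpeg", ".png",
    ".bmp", ".tiff", ".heif", ".txt", ".json", ".csv", ".md"
];

function isCitationFile(candidate: string): boolean {
    const normalized = candidate.trim().toLowerCase();
    // Assumption: citations may carry a fragment such as "#page=3"; ignore it.
    const hashIndex = normalized.indexOf("#");
    const fileName = hashIndex === -1 ? normalized : normalized.slice(0, hashIndex);
    // Require the extension at the end of the file name instead of anywhere in the text.
    return CITATION_EXTENSIONS.some(ext => fileName.endsWith(ext));
}

console.log(isCitationFile("handbook.pdf"));
console.log(isCitationFile("Sådan behandler Koncern HR persondata om dig"));
```

With a helper like this, the check inside `parts.map` could become a single `if (!isCitationFile(part))` line, and the check itself can be unit tested without rendering anything.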

Would it make sense to implement this code change in the repo?
OS and Version?
macOS (Sierra)
azd version?
azd version 1.10.1 (commit 31409a33266fb4a5fdbb644bc83988e725d6c7c9)
Versions
I'm running the release titled "2024-08-23: Optional speech output is now on-demand"
Thanks!
Thank you so much for this awesome repo!