Commit 99aa0e8

Fix PDF extraction when MIME type contains charset
#1198 Fixes #915
2 parents ba211e3 + 5d72f69 commit 99aa0e8

3 files changed: 13 additions, 1 deletion

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -2,6 +2,14 @@

 All changes that impact users of this module are documented in this file, in the [Common Changelog](https://common-changelog.org) format with some additional specifications defined in the CONTRIBUTING file. This codebase adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## Unreleased [patch]
+
+> Development of this release was supported by the [French Ministry for Foreign Affairs](https://www.diplomatie.gouv.fr/fr/politique-etrangere-de-la-france/diplomatie-numerique/) through its ministerial [State Startups incubator](https://beta.gouv.fr/startups/open-terms-archive.html) under the aegis of the Ambassador for Digital Affairs.
+
+### Fixed
+
+- Increase robustness of PDF content type detection
+
 ## 9.1.0 - 2025-10-01

 _Full changeset and discussions: [#1197](https://github.com/OpenTermsArchive/engine/pull/1197)._

src/archivist/extract/index.js

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ export { ExtractDocumentError } from './errors.js';
  */
 export default async function extract(sourceDocument) {
   try {
-    if (sourceDocument.mimeType == mime.getType('pdf')) {
+    if (mime.getExtension(sourceDocument.mimeType) == 'pdf') {
       return await extractFromPDF(sourceDocument);
     }
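
The one-line change above can be illustrated with a minimal, dependency-free sketch. It assumes that `mime.getType('pdf')` yields `'application/pdf'` and that the failure came from comparing that value directly against a header such as `application/pdf; charset=utf-8`. The helper `stripMimeParameters` is hypothetical and is not part of the engine or of the `mime` package; it only mimics the parameter-insensitive behaviour that the new condition relies on.

```javascript
// Hypothetical helper (not in the engine, not in the `mime` package):
// reduces a Content-Type value to its bare media type by dropping
// parameters such as "; charset=utf-8".
function stripMimeParameters(mimeType) {
  return String(mimeType).split(';')[0].trim().toLowerCase();
}

// Assumed to be the value that mime.getType('pdf') returns.
const PDF_MIME_TYPE = 'application/pdf';

const withCharset = 'application/pdf; charset=utf-8';

// Old-style check: a direct comparison fails once a parameter is present.
const oldCheck = withCharset == PDF_MIME_TYPE;

// Parameter-insensitive check, similar in spirit to the new condition.
const newCheck = stripMimeParameters(withCharset) == PDF_MIME_TYPE;

console.log({ oldCheck, newCheck });
```

In this sketch `oldCheck` is `false` and `newCheck` is `true`, which mirrors why the added test case uses a MIME type carrying a charset parameter.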

src/archivist/extract/index.test.js

Lines changed: 4 additions & 0 deletions
@@ -534,6 +534,10 @@ describe('Extract', () => {
       expect(await extract({ content: pdfContent, mimeType: mime.getType('pdf') })).to.equal(expectedExtractedContent);
     });

+    it('extracts content from PDF when MIME type includes charset parameter', async () => {
+      expect(await extract({ content: pdfContent, mimeType: 'application/pdf; charset=utf-8' })).to.equal(expectedExtractedContent);
+    });
+
     context('when PDF contains no text', () => {
       it('throws an ExtractDocumentError error', async () => {
         await expect(extract({ content: await fs.readFile(path.resolve(__dirname, '../../../test/fixtures/termsWithoutText.pdf')), mimeType: mime.getType('pdf') })).to.be.rejectedWith(ExtractDocumentError, /contains no text/);
