Exclude PDF-Files #453
Closed
Description
I am aware of the "Common Heritrix Use Cases" documentation in the wiki on mirroring only HTML files or excluding rich media. Still, I cannot get my job to work: it should simply not download and/or write to WARC any PDF files (and the few ZIPs). The site I am crawling has tons of PDF files in databases (meeting notes, government decisions, policy reports, etc.), and I want to reliably exclude them.
What usually works is this bean:
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="listLogicalOr" value="true" />
<property name="regexList"> <!-- Adjust the list after log analysis; possibly move it to an external file -->
<list>
<value>.*\.[Pp][Dd][Ff]$</value>
</list>
</property>
</bean>
But this only excludes downloads based on file extensions, so I added another rule:
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regex" value="^application\/[pz][di][fp]$"/>
</bean>
This has no effect, and there are no entries in scope.log.
In crawl.log I have these entries:
2021-12-07T12:02:26.669Z 200 171746 https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?previousAction1=geschaeft&previousAction2=&previousAction3=&previousAction4=&action=download&dokumentId=79e664176005402cabea26e8b591cf77-332&dokumentVersion=5&dokumentAnsicht=Dokument&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 LLLRL https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?action=geschaeft&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 application/pdf #010 20211207120226169+466 sha1:5UZWSGMUDEYGYZDENDJZFGTUVJ3BGJFS https://www.government.example.com -
My last idea was to reject them on write, i.e. add the following property to the warcWriter bean:
<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regex" value="^application\/[pz][di][fp]$"/>
</bean>
</property>
This has no effect.
Help would be very much appreciated.
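One direction that may be worth checking, offered as an untested sketch: scope rules are evaluated before a URI is fetched, so the content type is not yet known at that point, which would explain why a `ContentTypeMatchesRegexDecideRule` in the scope never fires. The fragment below assumes that the `FetchHTTP` bean in the job's `crawler-beans.cxml` has the id `fetchHttp` (as in the default profile) and that its `shouldFetchBodyRule` property is consulted after the response headers arrive but before the body is read. The regex is a narrower alternative to the character-class pattern above and is likewise an assumption, not something verified against this site.

```xml
<!-- Sketch only: merge the property into the existing fetchHttp bean rather than defining a second one. -->
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- Assumption: a REJECT here aborts the fetch once the headers are known, so the body is not downloaded. -->
  <property name="shouldFetchBodyRule">
    <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
      <property name="decision" value="REJECT"/>
      <!-- Matches application/pdf and application/zip, with or without trailing parameters. -->
      <property name="regex" value="^application/(pdf|zip).*$"/>
    </bean>
  </property>
</bean>
```

If this behaves as assumed, the URIs would still appear in crawl.log, since the request is made and the headers are received, and a header-only record may still be written to the WARC.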