Exclude PDF-Files #453

@oschihin

I am aware of the "Common Heritrix Use Cases" documentation in the wiki on mirroring only HTML files or excluding rich media. Still, I can't get my job to do one simple thing: not download PDF files (and the few ZIPs), or at least not write them to the WARC. The site I am crawling serves tons of PDF files from databases (meeting notes, government decisions, policy reports, etc.), and I want to exclude them reliably.

What usually works is this bean:

<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="listLogicalOr" value="true" />
    <property name="regexList"> <!-- Adjust the list after log analysis; possibly move it to an external file -->
        <list>
            <value>.*\.[Pp][Dd][Ff]$</value>
        </list>
    </property>
</bean>

But this only excludes downloads based on the file extension in the URL, so I added another rule:

<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="regex" value="^application\/[pz][di][fp]$"/>
</bean>

This has no effect, and there are no corresponding entries in scope.log.

In crawl.log I still have entries like this one:

2021-12-07T12:02:26.669Z   200     171746 https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?previousAction1=geschaeft&previousAction2=&previousAction3=&previousAction4=&action=download&dokumentId=79e664176005402cabea26e8b591cf77-332&dokumentVersion=5&dokumentAnsicht=Dokument&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 LLLRL https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?action=geschaeft&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 application/pdf #010 20211207120226169+466 sha1:5UZWSGMUDEYGYZDENDJZFGTUVJ3BGJFS https://www.government.example.com -
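
As a sanity check outside Heritrix, here is a minimal sketch that runs the two regexes from the beans above against the URL and MIME type of this log entry. It uses Python's `re` module rather than Java's regex engine, so it is only indicative. The helper names are my own, and whole-string matching is an assumption about how the rules apply the regex.

```python
import re

# Regexes copied verbatim from the two REJECT beans above.
URL_RULE = re.compile(r".*\.[Pp][Dd][Ff]$")
CONTENT_TYPE_RULE = re.compile(r"^application\/[pz][di][fp]$")

# URL and MIME type copied from the crawl.log entry above.
LOGGED_URL = (
    "https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html"
    "?previousAction1=geschaeft&previousAction2=&previousAction3=&previousAction4="
    "&action=download&dokumentId=79e664176005402cabea26e8b591cf77-332"
    "&dokumentVersion=5&dokumentAnsicht=Dokument"
    "&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76"
)
LOGGED_MIME = "application/pdf"


def url_rule_matches(url: str) -> bool:
    """Return True if the whole URL matches the file-extension regex."""
    return URL_RULE.fullmatch(url) is not None


def content_type_rule_matches(mime: str) -> bool:
    """Return True if the whole MIME type matches the content-type regex."""
    return CONTENT_TYPE_RULE.fullmatch(mime) is not None


if __name__ == "__main__":
    print("URL rule matches:", url_rule_matches(LOGGED_URL))
    print("Content-type rule matches:", content_type_rule_matches(LOGGED_MIME))
```

For this entry the URL rule does not match, because the URL ends in the geschaeftId parameter rather than in .pdf, while the content-type regex does match application/pdf. So the regex itself does not seem to be the problem.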

My last idea was to reject them on write, i.e. add the following property to the warcWriter bean:

<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
<property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="regex" value="^application\/[pz][di][fp]$"/>
    </bean>
</property>

This has no effect either.

Help would be very much appreciated.
