Skip to content

Add histotable serializer for highest uri precedence provider#647

Merged
ato merged 6 commits intointernetarchive:masterfrom
adam-miller:add-histotable-serializer-for-HighestUriPrecedenceProvider
Apr 24, 2025
Merged

Add histotable serializer for highest uri precedence provider#647
ato merged 6 commits intointernetarchive:masterfrom
adam-miller:add-histotable-serializer-for-HighestUriPrecedenceProvider

Conversation

@adam-miller
Copy link
Copy Markdown
Contributor

@adam-miller adam-miller commented Apr 22, 2025

This fixes an issue when using the HighestUriPrecedenceProvider where the nested generics in the Histotable fail to serialize properly.

Sample config:

  <bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
    <property name="queuePrecedencePolicy">
      <bean class="org.archive.crawler.frontier.precedence.HighestUriQueuePrecedencePolicy"/>
    </property>
    <property name="maxRetries" value="40"/>
    <property name="retryDelaySeconds" value="180"/>
    <property name="dumpPendingAtClose" value="true"/>
  </bean>

@adam-miller
Copy link
Copy Markdown
Contributor Author

adam-miller commented Apr 22, 2025

I'm left with the following warning:

00:02  WARN: Class is not registered: org.archive.crawler.frontier.precedence.HighestUriQueuePrecedencePolicy$HighestUriPrecedenceProvider
Note: To register this class use: kryo.register(org.archive.crawler.frontier.precedence.HighestUriQueuePrecedencePolicy.HighestUriPrecedenceProvider.class);

However, this class is in the engine subproject and importing it into KryoBinding would create a circular dependency. I see that other classes such as BdbWorkQueue seem to get around this with an autoregisterTo() method, but I'm not clear on how to get this to get that function called if implemented in HighestUriQueuePrecedencePolicy

@ato
Copy link
Copy Markdown
Collaborator

ato commented Apr 22, 2025

I find it confusing, but it seems to me that in the pool create() in KryoBinding, the autoregisterTo static method gets called for the value class the BDB object cache is created for. So for example in BdbFrontier that's BdbWorkQueue:

this.allQueues = bdb.getObjectCache("allqueues", isRecovery, WorkQueue.class, BdbWorkQueue.class);

If we look at BdbWorkQueue's autoregisterTo() it calls kryo.autoregister() on the other classes that it references. If those classes in turn also define an autoregisterTo() they'll pull in more classes until in theory every class necessary for BdbWorkQueue and everything referenced from it is registered.

    public static void autoregisterTo(AutoKryo kryo) {
        kryo.register(BdbWorkQueue.class);
        kryo.autoregister(FetchStats.class); 
        kryo.autoregister(HashSet.class);
        kryo.autoregister(SimplePrecedenceProvider.class);
        kryo.autoregister(byte[].class);
    }

So I'm guessing maybe the intended way to do it is to add an autoregister for HighestUriQueuePrecedencePolicy to the end of BdbWorkQueue.autoregisterTo:

kryo.autoregister(HighestUriQueuePrecedencePolicy.class);

and then you can define a HighestUriQueuePrecedencePolicy.autoregisterTo() that in turn autoregisters Histotable.

I don't love this autoregister mechanism because the order determines the IDs kryo uses to identify the classes, which means you often can't load a checkpoint with a different Heritrix version.

@adam-miller adam-miller marked this pull request as ready for review April 23, 2025 01:02
@adam-miller
Copy link
Copy Markdown
Contributor Author

Thank you, that was what I was missing. This seems to work now, and fixes the warnings.

@ato ato merged commit 91cd4b8 into internetarchive:master Apr 24, 2025
3 checks passed
ato added a commit that referenced this pull request Apr 24, 2025
@adam-miller adam-miller deleted the add-histotable-serializer-for-HighestUriPrecedenceProvider branch April 24, 2025 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants