Commit 7d2533d

committed
initial revision
1 parent c0fbd48 commit 7d2533d

12 files changed

+1100
-1
lines changed

README.md

Lines changed: 65 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,66 @@
11
# anserini-solr-plugin
2-
Solr Plugin that supports Anserini style query expansion and reranking against Solr indexes
2+
3+
Solr Plugin that supports [Anserini](https://github.com/castorini/anserini) style query expansion and reranking against Solr indexes.
4+
5+
### Description
6+
7+
Supports the following similarity implementations for paragraph text.
8+
9+
* **Query Likelihood (QL)** -- via built-in DirichletLM Similarity
10+
* **BM25** -- via built-in BM25 Similarity (default)
11+
12+
Supports the following query rewriting functionality (query A).
13+
14+
* **Bag of Words (BoW)** -- constructs an OR query out of the individual terms.
15+
* **Sequential Dependency Model (SDM)** -- constructs a query out of the individual terms and their bigrams (both ordered and unordered).
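
The two rewriting modes above can be illustrated with a minimal sketch. It is not the plugin's code: it builds plain Lucene query-syntax strings instead of `Query` objects, approximates the ordered window with an exact phrase and the unordered window with a sloppy phrase, and assumes the terms are already analyzed and contain no query-syntax special characters. The class name, method names, and the slop value of 8 are hypothetical; the weights mirror the `sdm.*` defaults in `solr/update-plugin.sh`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and structure are not taken from the plugin.
final class QueryRewriteSketch {

  // Bag of Words: one clause per term, OR-ed together.
  static String bagOfWords(String field, List<String> terms) {
    List<String> clauses = new ArrayList<>();
    for (String term : terms) {
      clauses.add(field + ":" + term);
    }
    return String.join(" OR ", clauses);
  }

  // SDM: weighted mix of single terms, ordered bigrams (exact phrases),
  // and unordered bigrams (approximated here by sloppy phrases).
  static String sdm(String field, List<String> terms,
                    double termWeight, double orderedWeight,
                    double unorderedWeight, int unorderedSlop) {
    String termPart = "(" + bagOfWords(field, terms) + ")^" + termWeight;
    if (terms.size() < 2) {
      return termPart; // no bigrams available for a single-term query
    }
    List<String> ordered = new ArrayList<>();
    List<String> unordered = new ArrayList<>();
    for (int i = 0; i + 1 < terms.size(); i++) {
      String bigram = terms.get(i) + " " + terms.get(i + 1);
      ordered.add(field + ":\"" + bigram + "\"");
      unordered.add(field + ":\"" + bigram + "\"~" + unorderedSlop);
    }
    return termPart
        + " OR (" + String.join(" OR ", ordered) + ")^" + orderedWeight
        + " OR (" + String.join(" OR ", unordered) + ")^" + unorderedWeight;
  }

  public static void main(String[] args) {
    List<String> terms = List.of("nails", "made");
    System.out.println(bagOfWords("para_text_bm", terms));
    System.out.println(sdm("para_text_bm", terms, 0.85, 0.1, 0.05, 8));
  }
}
```

With the default weights, most of the score comes from the individual terms, and the bigram clauses act as a small boost for documents where the query terms occur close together.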
16+
17+
Supports the following query reranking functionality (query B). Constructs a more complex query based on the results returned from query A and applies it to the top ${rerankCutoff} results from query A.
18+
19+
* **Relevance Model 3 (RM3)** -- builds one feature vector from the original query and another from the top terms of the top documents returned by query A, then interpolates the two to create a new reranking query.
20+
* **Axiomatic Reranker** -- computes mutual information between the query terms and the terms in the top ${rerankCutoff} documents plus random documents from outside the top results, scores the terms, and uses the top K terms to create a new reranking query.
21+
* **Identity Reranker** -- a do-nothing reranker that passes the results from query A through unchanged. Useful for debugging.
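
The interpolation step in RM3 can be sketched as follows. This is a simplified illustration, not the plugin's implementation: it assumes both vectors are plain term-to-weight maps, normalizes each to sum to 1, and mixes them with a single weight that plays the role of `rm3.originalQueryWeight` in `solr/update-plugin.sh`. The class and method names are hypothetical, and the selection of feedback terms from the top documents is left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the RM3 interpolation step only.
final class Rm3InterpolationSketch {

  // Scale weights so they sum to 1; an empty or all-zero input yields an empty map.
  static Map<String, Double> normalize(Map<String, Double> weights) {
    double sum = 0.0;
    for (double w : weights.values()) {
      sum += w;
    }
    Map<String, Double> out = new TreeMap<>();
    if (sum == 0.0) {
      return out;
    }
    for (Map.Entry<String, Double> e : weights.entrySet()) {
      out.put(e.getKey(), e.getValue() / sum);
    }
    return out;
  }

  // weight(t) = alpha * query(t) + (1 - alpha) * feedback(t), over the union of terms.
  static Map<String, Double> interpolate(Map<String, Double> query,
                                         Map<String, Double> feedback,
                                         double originalQueryWeight) {
    Map<String, Double> out = new TreeMap<>();
    for (Map.Entry<String, Double> e : normalize(query).entrySet()) {
      out.merge(e.getKey(), originalQueryWeight * e.getValue(), Double::sum);
    }
    for (Map.Entry<String, Double> e : normalize(feedback).entrySet()) {
      out.merge(e.getKey(), (1.0 - originalQueryWeight) * e.getValue(), Double::sum);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Double> query = Map.of("nails", 1.0, "made", 1.0);
    Map<String, Double> feedback = Map.of("nails", 2.0, "iron", 2.0);
    System.out.println(interpolate(query, feedback, 0.5));
  }
}
```

Terms that appear in both vectors accumulate weight from each side, so a query term that is also prominent in the feedback documents ends up with the highest weight in the reranking query.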
22+
23+
### Building
24+
25+
Steps to build the JAR file from the code and deploy to Solr are as follows:
26+
27+
```bash
28+
$ mvn clean package
29+
$ mkdir -p ${SOLR_HOME}/server/solr/lib
30+
$ cp target/anserini-solr-plugins-1.0-SNAPSHOT.jar ${SOLR_HOME}/server/solr/lib/
31+
```
32+
33+
### Configuration
34+
35+
The plugin expects the additional field types `text_bm` and `text_ql` to be defined in the managed schema at `${SOLR_HOME}/server/solr/${INDEX_NAME}/conf/managed-schema`. Their definitions can be found in [solr/schema-additions.xml](solr/schema-additions.xml). They are needed to support the QL and BM25 similarities described above.
36+
37+
The plugin requires two fields, `para_text_bm` and `para_text_ql`, with field types `text_bm` and `text_ql` respectively, as defined in the previous step. There are no other specific field requirements. An example schema can be found in [solr/update-schema.sh](solr/update-schema.sh).
38+
39+
Restart Solr after these steps so that its class loader picks up the JAR file deployed in the Building section.
40+
41+
The plugin is defined (in `${SOLR_HOME}/server/solr/${INDEX_NAME}/conf/solrconfig.xml`) as detailed in [solr/update-plugin.sh](solr/update-plugin.sh).
42+
43+
### Running
44+
45+
The plugin can be run using HTTP GET requests. A typical URL looks like the following.
46+
47+
```
48+
http://localhost:8983/solr/my_index_name/anserini?q=what+are+nails+made+of
49+
```
50+
51+
The main parameters for tweaking behavior are listed below.
52+
53+
* q -- question, URL encoded. Mandatory parameter.
54+
* sim -- ql (Query Likelihood) or bm (BM25), default bm.
55+
* qtype -- Query Expansion type. Valid values are bow (Bag of Words) or sdm (Sequential Dependency Model), default is bow.
56+
* rtype -- Reranking type. Valid values are ax (Axiomatic), rm3 (Relevance Model 3), and id (Identity), default is rm3.
57+
* start and rows -- for pagination, defaults to 0 and 10 respectively.
58+
59+
For certain qtype and rtype values, additional parameters with the prefixes "sdm.", "ax.", and "rm3." are available; they are listed in [solr/update-plugin.sh](solr/update-plugin.sh).
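
As a rough sketch of how a client might assemble such a request URL, the helper below URL-encodes each parameter and appends it to the handler path. The class and method names are hypothetical, the host and index name are taken from the example URL above, and the sketch only builds the string; it does not send the request. It assumes Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side helper; not part of the plugin.
final class AnseriniUrlSketch {

  // Appends URL-encoded key=value pairs to the base URL, in map iteration order.
  static String buildUrl(String baseUrl, Map<String, String> params) {
    StringBuilder sb = new StringBuilder(baseUrl);
    char separator = '?';
    for (Map.Entry<String, String> e : params.entrySet()) {
      sb.append(separator)
        .append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
        .append('=')
        .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
      separator = '&';
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
    params.put("q", "what are nails made of");
    params.put("sim", "ql");
    params.put("qtype", "sdm");
    params.put("rm3.fbTerms", "20");
    System.out.println(
        buildUrl("http://localhost:8983/solr/my_index_name/anserini", params));
  }
}
```

Parameters passed on the URL override the handler defaults, so only the values that differ from the defaults need to be supplied.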
60+
61+
### Dependencies
62+
63+
Currently the only dependency is Solr, since we have copied the relevant parts of Anserini's functionality in the interests of time. The plan is to make Anserini a dependency and use its functionality directly.
64+
65+
* Solr 8.1.1
66+

pom.xml

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
2+
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
3+
<modelVersion>4.0.0</modelVersion>
4+
<groupId>com.elsevier</groupId>
5+
<artifactId>anserini-solr-plugins</artifactId>
6+
<packaging>jar</packaging>
7+
<version>1.0-SNAPSHOT</version>
8+
<name>anserini-solr-plugins</name>
9+
<url>http://maven.apache.org</url>
10+
11+
<dependencies>
12+
<!-- https://mvnrepository.com/artifact/org.apache.solr/solr-core -->
13+
<dependency>
14+
<groupId>org.apache.solr</groupId>
15+
<artifactId>solr-core</artifactId>
16+
<version>8.1.1</version>
17+
</dependency>
18+
<!-- https://mvnrepository.com/artifact/junit/junit -->
19+
<dependency>
20+
<groupId>junit</groupId>
21+
<artifactId>junit</artifactId>
22+
<version>4.12</version>
23+
<scope>test</scope>
24+
</dependency>
25+
26+
</dependencies>
27+
</project>

solr/schema-additions.xml

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
<!-- following XML blocks must be copy-pasted inside the schema element of managed-schema.
2+
Ordering does not matter; managed-schema is regenerated by Solr and everything will
3+
be rearranged anyway.
4+
This needs to be done before setting up the fields.
5+
-->
6+
7+
<similarity class="solr.SchemaSimilarityFactory">
8+
<str name="defaultSimFromFieldType">text_bm</str>
9+
</similarity>
10+
11+
<fieldType name="text_bm" class="solr.TextField" positionIncrementGap="100" multiValued="true">
12+
<analyzer type="index">
13+
<tokenizer class="solr.StandardTokenizerFactory"/>
14+
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
15+
<filter class="solr.LowerCaseFilterFactory"/>
16+
</analyzer>
17+
<analyzer type="query">
18+
<tokenizer class="solr.StandardTokenizerFactory"/>
19+
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
20+
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
21+
<filter class="solr.LowerCaseFilterFactory"/>
22+
</analyzer>
23+
<similarity class="solr.BM25SimilarityFactory">
24+
<str name="b">0.75</str>
25+
<str name="k1">1.2</str>
26+
</similarity>
27+
</fieldType>
28+
29+
<fieldType name="text_ql" class="solr.TextField" positionIncrementGap="100" multiValued="true">
30+
<analyzer type="index">
31+
<tokenizer class="solr.StandardTokenizerFactory"/>
32+
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
33+
<filter class="solr.LowerCaseFilterFactory"/>
34+
</analyzer>
35+
<analyzer type="query">
36+
<tokenizer class="solr.StandardTokenizerFactory"/>
37+
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
38+
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
39+
<filter class="solr.LowerCaseFilterFactory"/>
40+
</analyzer>
41+
<similarity class="solr.LMDirichletSimilarityFactory">
42+
<str name="mu">2000</str>
43+
</similarity>
44+
</fieldType>
45+

solr/update-plugin.sh

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
#!/bin/bash
2+
curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/qaindex/config -d '{
3+
"add-requesthandler": {
4+
"name": "/anserini",
5+
"class": "com.elsevier.asp.AnseriniRequestHandler",
6+
"defaults": {
7+
"sim" : "bm",
8+
"qtype" : "bow",
9+
"rtype" : "rm3",
10+
"rerankCutoff" : "50",
11+
"sdm.termWeight" : "0.85",
12+
"sdm.orderedWindowWeight" : "0.1",
13+
"sdm.unorderedWindowWeight" : "0.05",
14+
"rm3.fbTerms" : "10",
15+
"rm3.fbDocs" : "10",
16+
"rm3.originalQueryWeight" : "0.5",
17+
"ax.R" : "20",
18+
"ax.N" : "20",
19+
"ax.K" : "1000",
20+
"ax.M" : "30",
21+
"ax.beta" : "0.4",
22+
"start" : "0",
23+
"rows" : "10",
24+
"fl" : "pii,isbns_f,book_title,chapter_title,para_id,para_text"
25+
}
26+
}
27+
}'

solr/update-schema.sh

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
#!/bin/bash
2+
curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/qaindex/schema -d '{
3+
"add-field": {
4+
"name": "pii",
5+
"type": "string",
6+
"stored": true,
7+
"indexed": true
8+
},
9+
"add-field": {
10+
"name": "isbns_f",
11+
"type": "string",
12+
"stored": true,
13+
"indexed": false,
14+
"multiValued": true
15+
},
16+
"add-field": {
17+
"name": "isbns_u",
18+
"type": "string",
19+
"stored": true,
20+
"indexed": true,
21+
"multiValued": true
22+
},
23+
"add-field": {
24+
"name": "book_title",
25+
"type": "text_general",
26+
"stored": true,
27+
"indexed": true
28+
},
29+
"add-field": {
30+
"name": "chapter_title",
31+
"type": "text_general",
32+
"stored": true,
33+
"indexed": true
34+
},
35+
"add-field": {
36+
"name": "para_id",
37+
"type": "string",
38+
"stored": true,
39+
"indexed": true
40+
},
41+
"add-field": {
42+
"name": "para_text_bm",
43+
"type": "text_bm",
44+
"stored": true,
45+
"indexed": true,
46+
"termVectors": true,
47+
"termPositions": true,
48+
"termOffsets": true
49+
},
50+
"add-field": {
51+
"name": "para_text_ql",
52+
"type": "text_ql",
53+
"stored": true,
54+
"indexed": true,
55+
"termVectors": true,
56+
"termPositions": true,
57+
"termOffsets": true
58+
}
59+
}'
Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
package com.elsevier.asp;
2+
3+
import java.io.IOException;
4+
import java.util.ArrayList;
5+
import java.util.List;
6+
7+
import org.apache.lucene.analysis.Analyzer;
8+
import org.apache.lucene.analysis.TokenStream;
9+
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
10+
11+
public class AnalyzerUtils {
12+
13+
public static List<String> tokenizeQuery(String queryString, String fieldName, Analyzer analyzer) {
14+
List<String> queryTokens = new ArrayList<>();
15+
try {
16+
TokenStream tokenStream = analyzer.tokenStream(fieldName, queryString);
17+
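// addAttribute returns the existing attribute or registers it; getAttribute throws if it is absent
CharTermAttribute termAttr = tokenStream.addAttribute(CharTermAttribute.class);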
18+
tokenStream.reset();
19+
while (tokenStream.incrementToken()) {
20+
String token = termAttr.toString();
21+
if (token.length() == 0) continue;
22+
queryTokens.add(token);
23+
}
24+
tokenStream.end();
25+
tokenStream.close();
26+
} catch (IOException e) {
27+
e.printStackTrace();
28+
}
29+
return queryTokens;
30+
}
31+
32+
}
