18 changes: 1 addition & 17 deletions .pre-commit-config.yml
@@ -162,28 -162,12 @@ repos:
name: Fix Markdown
language: system
entry: uv
args: [ 'run', 'rumdl', 'fmt' ]
args: [ 'run', 'rumdl', 'check', '--fix' ]
env:
UV_PROJECT: dev-tools
UV_FROZEN: "1"
types: [ 'markdown']
require_serial: true
exclude:
glob:
# TODO: fix formatting of these files separately
- .github/PULL_REQUEST_TEMPLATE.md
- CONTRIBUTING.md
- dev-docs/file-formats.md
- dev-docs/github-issues-howto.md
- dev-tools/aws-jmh/README.md
- dev-tools/scripts/README.md
- lucene/backward-codecs/README.md
- lucene/distribution/src/binary-release/README.md
- lucene/luke/README.md
- lucene/luke/src/distribution/README.md
- lucene/MIGRATE.md
- lucene/SYSTEM_REQUIREMENTS.md
- README.md

- id: ruff-check
name: Fix Python
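The hunk above switches the Markdown hook from `rumdl fmt` to `rumdl check --fix` and drops the per-file exclude list. Pieced together from the lines shown here (the hook `id` sits outside the hunk, so that field is a placeholder), the resulting hook entry would look roughly like:

```yaml
# Sketch of the post-change hook entry; the id value is assumed,
# everything else is taken from the hunk above.
- id: rumdl  # placeholder: the real id is not visible in this hunk
  name: Fix Markdown
  language: system
  entry: uv
  args: [ 'run', 'rumdl', 'check', '--fix' ]
  env:
    UV_PROJECT: dev-tools
    UV_FROZEN: "1"
  types: [ 'markdown' ]
  require_serial: true
```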
8 changes: 8 additions & 0 deletions .rumdl.toml
@@ -3,3 +3,11 @@
line-length = 0
# not really a markdown file, but a template
exclude = [ "lucene/documentation/src/markdown/index.template.md" ]

[MD007]
# match indentation set in .editorconfig for least friction
indent = 4

[per-file-ignores]
# doesn't start with level 1 heading on purpose
".github/PULL_REQUEST_TEMPLATE.md" = [ "MD041" ]
8 changes: 4 additions & 4 deletions CONTRIBUTING.md
@@ -59,9 +59,9 @@ In case your contribution fixes a bug, please create a new test case that fails
### IDE support

- *IntelliJ* - IntelliJ idea can import and build gradle-based projects out of the box. It will default to running tests by calling the gradle wrapper, and while this works, it can be a bit slow. If instead you configure IntelliJ to use its own built-in test runner by (in 2024 version) navigating to settings for Build Execution & Deployment/Build Tools/Gradle (under File/Settings menu on some platforms) and selecting "Build and Run using: IntelliJ IDEA" and "Run Tests using: IntelliJ IDEA", then some tests will run faster. However some other tests will not run using this configuration.
- *Eclipse* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L7)).
- *VSCode* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L23)).
- *Neovim* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L32)).
- *Eclipse* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L7)).
- *VSCode* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L23)).
- *Neovim* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L32)).
- *Netbeans* - Not tested.

## Benchmarking
@@ -78,7 +78,7 @@ Feel free to share your findings (especially if your implementation performs bet

## Contributing your work

You can open a pull request at https://github.com/apache/lucene.
You can open a pull request at <https://github.com/apache/lucene>.

Please be patient. Committers are busy people too. If no one responds to your patch after a few days, please make friendly reminders. Please incorporate others' suggestions into your patch if you think they're reasonable. Finally, remember that even a patch that is not committed is useful to the community.

6 changes: 3 additions & 3 deletions README.md
@@ -27,7 +27,7 @@ written in Java.

## Online Documentation

This README file only contains basic setup instructions. For more
This README file only contains basic setup instructions. For more
comprehensive documentation, visit:

- Latest Releases: <https://lucene.apache.org/core/documentation.html>
@@ -38,7 +38,7 @@ comprehensive documentation, visit:

## Building

### Basic steps:
### Basic steps

1. Install JDK 25 using your package manager or download manually from
[OpenJDK](https://jdk.java.net/),
@@ -48,7 +48,7 @@ comprehensive documentation, visit:
2. Clone Lucene's git repository (or download the source distribution).
3. Run gradle launcher script (`gradlew`).

We'll assume that you know how to get and set up the JDK - if you don't, then we suggest starting at https://jdk.java.net/ and learning more about Java, before returning to this README.
We'll assume that you know how to get and set up the JDK - if you don't, then we suggest starting at <https://jdk.java.net/> and learning more about Java, before returning to this README.

## Contributing

4 changes: 3 additions & 1 deletion dev-docs/file-formats.md
@@ -36,12 +36,14 @@ on their own.
## How to split the data into files?

Most file formats split the data into 3 files:

- metadata,
- index data,
- raw data.

The metadata file contains all the data that is read once at open time. This
helps on several fronts:

- One can validate the checksums of this data at open time without significant
overhead since all data needs to be read anyway, this helps detect
corruptions early.
@@ -124,4 +126,4 @@ by merges. All default implementations do this.

## How to make backward-compatible changes to file formats?

See [here](../lucene/backward-codecs/README.md).
See [Index Backwards Compatibility](../lucene/backward-codecs/README.md).
2 changes: 1 addition & 1 deletion dev-docs/github-issues-howto.md
@@ -29,7 +29,7 @@ All issues/PRs associated with a milestone must be resolved before the release,

Once the release is done, the Milestone should be closed then a new Milestone for the next release should be created.

You can see the list of current active (opened) Milestones here. https://github.com/apache/lucene/milestones
You can see the list of current active (opened) Milestones here. <https://github.com/apache/lucene/milestones>

See [GitHub documentation](https://docs.github.com/en/issues/using-labels-and-milestones-to-track-work/about-milestones) for more details.

14 changes: 6 additions & 8 deletions dev-tools/aws-jmh/README.md
@@ -15,23 +15,21 @@
limitations under the License.
-->

# EC2 Microbenchmarks

Runs lucene microbenchmarks across a variety of CPUs in EC2.

Example:

```console
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=yyyy
make PATCH_BRANCH=rmuir:some-speedup
```
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=yyyy
make PATCH_BRANCH=rmuir:some-speedup

Results file will be in build/report.txt

You can also pass additional JMH args if you want:

```console
make PATCH_BRANCH=rmuir:some-speedup JMH_ARGS='float -p size=756'
```
make PATCH_BRANCH=rmuir:some-speedup JMH_ARGS='float -p size=756'

Prerequisites:

1 change: 0 additions & 1 deletion dev-tools/scripts/README.md
@@ -194,4 +194,3 @@ and prints a regular expression that will match all of them
### gitignore-gen.sh

TBD

51 changes: 27 additions & 24 deletions lucene/MIGRATE.md
@@ -36,18 +36,21 @@ Starting with Lucene 11.0.0, the index upgrade policy has been relaxed to allow

#### Upgrade Scenarios

**Scenario 1: No format breaks (wider upgrade span)**
##### Scenario 1: No format breaks (wider upgrade span)

- Index created with Lucene 10.x can be opened directly in Lucene 11.x, 12.x, 13.x, 14.x (as long as MIN_SUPPORTED_MAJOR stays ≤ 10)
- Simply open the index with the new version; segments will be upgraded gradually through normal merging
- Optional: Call `forceMerge()` or use `UpgradeIndexMergePolicy` to upgrade segment formats immediately
- **Important**: You still only get one upgrade per index lifetime. Once MIN_SUPPORTED_MAJOR is bumped above 10, the index becomes unopenable and must be reindexed.

**Scenario 2: Format breaks occur**
##### Scenario 2: Format breaks occur

- If a major version introduces incompatible format changes, `MIN_SUPPORTED_MAJOR` will be bumped
- Indexes created before the new minimum will throw `IndexFormatTooOldException`
- Full reindexing is required for such indexes

**Scenario 3: After using your upgrade**
##### Scenario 3: After using your upgrade

- Index created with Lucene 10.x, successfully opened with Lucene 14.x
- The index's creation version is still 10 (this never changes)
- When Lucene 15+ bumps MIN_SUPPORTED_MAJOR above 10, this index becomes unopenable
@@ -72,6 +75,7 @@ try (Directory dir = FSDirectory.open(indexPath)) {
#### Error Handling

Enhanced error messages will clearly indicate:

- Whether the index creation version is below `MIN_SUPPORTED_MAJOR` (reindex required)
- Whether segments are too old to read directly (sequential upgrade required)

@@ -85,7 +89,7 @@ number of segments that may be merged together.
Query caching is now disabled by default. To enable caching back, do something
like below in a static initialization block:

```
```java
int maxCachedQueries = 1_000;
long maxRamBytesUsed = 50 * 1024 * 1024; // 50MB
IndexSearcher.setDefaultQueryCache(new LRUQueryCache(maxCachedQueries, maxRamBytesUsed));
@@ -124,11 +128,11 @@ DataInput.readGroupVInt method: subclasses should delegate or reimplement it ent

### OpenNLP dependency upgrade

[Apache OpenNLP](https://opennlp.apache.org) 2.x opens the door to accessing various models via the ONNX runtime. To migrate you will need to update any deprecated OpenNLP methods that you may be using.
[Apache OpenNLP](https://opennlp.apache.org) 2.x opens the door to accessing various models via the ONNX runtime. To migrate you will need to update any deprecated OpenNLP methods that you may be using.

### Snowball dependency upgrade

Snowball has folded the "German2" stemmer into their "German" stemmer, so there's no "German2" anymore. For Lucene APIs (TokenFilter, TokenFilterFactory) that accept String, "German2" will be mapped to "German" to avoid breaking users. If you were previously creating German2Stemmer instances, you'll need to change your code to create GermanStemmer instances instead. For more information see https://snowballstem.org/algorithms/german2/stemmer.html
Snowball has folded the "German2" stemmer into their "German" stemmer, so there's no "German2" anymore. For Lucene APIs (TokenFilter, TokenFilterFactory) that accept String, "German2" will be mapped to "German" to avoid breaking users. If you were previously creating German2Stemmer instances, you'll need to change your code to create GermanStemmer instances instead. For more information see <https://snowballstem.org/algorithms/german2/stemmer.html>

### Romanian analysis

@@ -155,6 +159,7 @@ Instead, call storedFields()/termVectors() to return an instance which can fetch
and will be garbage-collected as usual.

For example:

```java
TopDocs hits = searcher.search(query, 10);
StoredFields storedFields = reader.storedFields();
@@ -230,7 +235,6 @@ for the currently-positioned document (doing so will result in undefined behavio
`IOContext.READONCE` for opening internally, as that's the only valid usage pattern for checksum input.
Callers should remove the parameter when calling this method.


### DaciukMihovAutomatonBuilder is renamed to StringsToAutomaton and made package-private

The former `DaciukMihovAutomatonBuilder#build` functionality is exposed through `Automata#makeStringUnion`.
Expand Down Expand Up @@ -300,7 +304,7 @@ access the members using method calls instead of field accesses. Affected classe
- `TermAndVector` (GITHUB#13772)
- Many basic Lucene classes, including `CollectionStatistics`, `TermStatistics` and `LeafMetadata` (GITHUB#13328)

### Boolean flags on IOContext replaced with a new ReadAdvice enum.
### Boolean flags on IOContext replaced with a new ReadAdvice enum

The `readOnce`, `load` and `random` flags on `IOContext` have been replaced with a new `ReadAdvice`
enum.
@@ -324,6 +328,7 @@ To migrate, use a provided `CollectorManager` implementation that suits your use
to follow the new API pattern. The straight forward approach would be to instantiate the single-threaded `Collector` in a wrapper `CollectorManager`.

For example

```java
public class CustomCollectorManager implements CollectorManager<CustomCollector, List<Object>> {
@Override
@@ -354,12 +359,12 @@ List<Object> results = searcher.search(query, new CustomCollectorManager());

1. `IntField(String name, int value)`. Use `IntField(String, int, Field.Store)` with `Field.Store#NO` instead.
2. `DoubleField(String name, double value)`. Use `DoubleField(String, double, Field.Store)` with `Field.Store#NO` instead.
2. `FloatField(String name, float value)`. Use `FloatField(String, float, Field.Store)` with `Field.Store#NO` instead.
3. `LongField(String name, long value)`. Use `LongField(String, long, Field.Store)` with `Field.Store#NO` instead.
4. `LongPoint#newDistanceFeatureQuery(String field, float weight, long origin, long pivotDistance)`. Use `LongField#newDistanceFeatureQuery` instead
5. `BooleanQuery#TooManyClauses`, `BooleanQuery#getMaxClauseCount()`, `BooleanQuery#setMaxClauseCount()`. Use `IndexSearcher#TooManyClauses`, `IndexSearcher#getMaxClauseCount()`, `IndexSearcher#setMaxClauseCount()` instead
6. `ByteBuffersDataInput#size()`. Use `ByteBuffersDataInput#length()` instead
7. `SortedSetDocValuesFacetField#label`. `FacetsConfig#pathToString(String[])` can be applied to path as a replacement if string path is desired.
3. `FloatField(String name, float value)`. Use `FloatField(String, float, Field.Store)` with `Field.Store#NO` instead.
4. `LongField(String name, long value)`. Use `LongField(String, long, Field.Store)` with `Field.Store#NO` instead.
5. `LongPoint#newDistanceFeatureQuery(String field, float weight, long origin, long pivotDistance)`. Use `LongField#newDistanceFeatureQuery` instead
6. `BooleanQuery#TooManyClauses`, `BooleanQuery#getMaxClauseCount()`, `BooleanQuery#setMaxClauseCount()`. Use `IndexSearcher#TooManyClauses`, `IndexSearcher#getMaxClauseCount()`, `IndexSearcher#setMaxClauseCount()` instead
7. `ByteBuffersDataInput#size()`. Use `ByteBuffersDataInput#length()` instead
8. `SortedSetDocValuesFacetField#label`. `FacetsConfig#pathToString(String[])` can be applied to path as a replacement if string path is desired.

### Auto I/O throttling disabled by default in ConcurrentMergeScheduler (GITHUB#13293)

@@ -439,7 +444,6 @@
|org.apache.lucene:lucene-analyzers-smartcn |org.apache.lucene:lucene-analysis-smartcn |
|org.apache.lucene:lucene-analyzers-stempel |org.apache.lucene:lucene-analysis-stempel |


### LucenePackage class removed (LUCENE-10260)

`LucenePackage` class has been removed. The implementation string can be
@@ -563,7 +567,7 @@ User dictionary now strictly validates if the (concatenated) segment is the same
unexpected runtime exceptions or behaviours.
For example, these entries are not allowed at all and an exception is thrown when loading the dictionary file.

```
```text
# concatenated "日本経済新聞" does not match the surface form "日経新聞"
日経新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞

@@ -631,7 +635,7 @@ is discouraged in favor of the default `MMapDirectory`.
### Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014)

`SpanQuery` and `PhraseQuery` now always calculate their slops as
`(1.0 / (1.0 + distance))`. Payload factor calculation is performed by
`(1.0 / (1.0 + distance))`. Payload factor calculation is performed by
`PayloadDecoder` in the `lucene-queries` module.
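The slop factor above is easy to sanity-check numerically; here is a minimal standalone sketch (plain Java, not Lucene code — the class and method names are invented for illustration):

```java
public class SlopFactorDemo {
    // Slop factor as described above: 1.0 / (1.0 + distance)
    static double slopFactor(int distance) {
        return 1.0 / (1.0 + distance);
    }

    public static void main(String[] args) {
        System.out.println(slopFactor(0)); // exact match: prints 1.0
        System.out.println(slopFactor(3)); // three positions apart: prints 0.25
    }
}
```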

### Scorer must produce positive scores (LUCENE-7996)
@@ -645,9 +649,9 @@

### CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099)

Instead use `FunctionScoreQuery` and a `DoubleValuesSource` implementation. `BoostedQuery`
Instead use `FunctionScoreQuery` and a `DoubleValuesSource` implementation. `BoostedQuery`
and `BoostingQuery` may be replaced by calls to `FunctionScoreQuery.boostByValue()` and
`FunctionScoreQuery.boostByQuery()`. To replace more complex calculations in
`FunctionScoreQuery.boostByQuery()`. To replace more complex calculations in
`CustomScoreQuery`, use the `lucene-expressions` module:

```java
@@ -666,7 +670,6 @@
(`FieldType.indexOptions() != IndexOptions.NONE`) then all documents must have
the same index options for that field.


### IndexSearcher.createNormalizedWeight() removed (LUCENE-8242)

Instead use `IndexSearcher.createWeight()`, rewriting the query first, and using
@@ -744,7 +747,7 @@ Lucene.
### LeafCollector.setScorer() now takes a Scorable rather than a Scorer (LUCENE-6228)

`Scorer` has a number of methods that should never be called from `Collector`s, for example
those that advance the underlying iterators. To hide these, `LeafCollector.setScorer()`
those that advance the underlying iterators. To hide these, `LeafCollector.setScorer()`
now takes a `Scorable`, an abstract class that scorers can extend, with methods
`docId()` and `score()`.

@@ -981,10 +984,10 @@ removed in favour of the newly introduced `search(LeafReaderContextPartition[] p
### Indexing vectors with 8 bit scalar quantization is no longer supported but 7 and 4 bit quantization still work (GITHUB#13519)

8 bit scalar vector quantization is no longer supported: it was buggy
starting in 9.11 (GITHUB#13197). 4 and 7 bit quantization are still
supported. Existing (9.11) Lucene indices that previously used 8 bit
starting in 9.11 (GITHUB#13197). 4 and 7 bit quantization are still
supported. Existing (9.11) Lucene indices that previously used 8 bit
quantization can still be read/searched but the results from
`KNN*VectorQuery` are silently buggy. Further 8 bit quantized vector
`KNN*VectorQuery` are silently buggy. Further 8 bit quantized vector
indexing into such (9.11) indices is not permitted, so your path
forward if you wish to continue using the same 9.11 index is to index
additional vectors into the same field with either 4 or 7 bit
2 changes: 1 addition & 1 deletion lucene/SYSTEM_REQUIREMENTS.md
@@ -21,7 +21,7 @@ Apache Lucene runs on Java 25 or greater.

It is also recommended to always use the latest update version of your
Java VM, because bugs may affect Lucene. An overview of known JVM bugs
can be found on https://cwiki.apache.org/confluence/display/LUCENE/JavaBugs
can be found on <https://cwiki.apache.org/confluence/display/LUCENE/JavaBugs>

With all Java versions it is strongly recommended to not use experimental
`-XX` JVM options.
1 change: 1 addition & 0 deletions lucene/backward-codecs/README.md
@@ -34,6 +34,7 @@ we create fresh copies of the codec and format, and move the existing ones
into backwards-codecs.

Older codecs are tested in two ways:

* Through unit tests like TestLucene80NormsFormat, which checks we can write
then read data using each old format
* Through TestBackwardsCompatibility, which loads indices created in previous
7 changes: 3 additions & 4 deletions lucene/distribution/src/binary-release/README.md
@@ -23,9 +23,9 @@ This is a binary distribution of Lucene. Lucene is a Java full-text
search engine. Lucene is not a complete application, but rather a code library
and an API that can easily be used to add search capabilities to applications.

* The Lucene web site is at: https://lucene.apache.org/
* The Lucene web site is at: <https://lucene.apache.org/>
* Please join the Lucene-User mailing list by sending a message to:
java-user-subscribe@lucene.apache.org
<java-user-subscribe@lucene.apache.org>

## Files in this binary distribution

@@ -42,8 +42,7 @@ Third-party licenses and notice files.

Please note that this package does not include all the binary dependencies
of all Lucene modules. Up-to-date dependency information for each Lucene
module is published to Maven central (as Maven POMs).
module is published to Maven central (as Maven POMs).

To review the documentation, read the main documentation page, located at:
`docs/index.html`

2 changes: 1 addition & 1 deletion lucene/luke/README.md
@@ -21,4 +21,4 @@ Integrated desktop GUI tool: a utility for browsing, searching and maintaining i

## Older releases

Older releases of Luke (prior to 8.1) can be found at https://github.com/DmitryKey/luke
Older releases of Luke (prior to 8.1) can be found at <https://github.com/DmitryKey/luke>
9 changes: 3 additions & 6 deletions lucene/luke/src/distribution/README.md
@@ -20,14 +20,11 @@
This is Luke, Apache Lucene low-level index inspection and repair utility.

Luke requires Java ${required.java.version}. You can start it with:
```
java -jar ${luke.cmd}
```

java -jar ${luke.cmd}

or, using Java modules:

```
java --module-path . --add-modules jdk.unsupported --module org.apache.lucene.luke
```
java --module-path . --add-modules jdk.unsupported --module org.apache.lucene.luke

Happy index hacking!