The contrib/miscellaneous package contains a great many other small tools and classes, which we’ll list briefly here:
IndexSplitter and MultiPassIndexSplitter are two tools for taking an existing index and breaking it into multiple parts. IndexSplitter can only break the index according to its existing segments, but is fast because it does simple file-level copying. MultiPassIndexSplitter can break at arbitrary points (equally by document count), but is slower because it visits documents one at a time and makes multiple passes.
BalancedSegmentMergePolicy is a custom MergePolicy that tries to avoid creating large segments while also preventing too many small segments from accumulating in the index. The idea is to prevent enormous merges from occurring; because such merges are I/O- and CPU-intensive, they can degrade ongoing search performance in a near-real-time search application. MergePolicy is covered in section 2.13.6.
TermVectorAccessor enables you to access term vectors from an index even in cases where the document wasn’t indexed with term vectors. You pass in a TermVectorMapper, described in section 5.9.3, that will receive the term vectors. If term vectors were stored in the index, they’re loaded directly and sent to the mapper. If not, the information is regenerated by visiting every term in the index and skipping to the requested document. Note that this regeneration process can be very slow on a large index.
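To see why regeneration is slow, consider the following minimal sketch in plain Java. It doesn't use Lucene at all: the nested map is a toy stand-in for the inverted index, and the regenerate method is a hypothetical helper, not part of TermVectorAccessor. It shows only the general idea of rebuilding one document's terms by scanning the postings of every term.

```java
import java.util.Map;
import java.util.TreeMap;

class TermVectorSketch {
    // Toy inverted index (an assumption for illustration):
    // term -> (document number -> term frequency).
    // Hypothetical helper: rebuilds the term/frequency pairs of one document
    // by checking the postings of every term, so the cost grows with the
    // number of unique terms in the index, not with the size of the document.
    static Map<String, Integer> regenerate(
            Map<String, Map<Integer, Integer>> postings, int docId) {
        Map<String, Integer> vector = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            Integer freq = e.getValue().get(docId);
            if (freq != null) {
                vector.put(e.getKey(), freq);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        postings.put("lucene", Map.of(0, 2, 1, 1));
        postings.put("search", Map.of(1, 3));
        postings.put("index", Map.of(0, 1));
        System.out.println(regenerate(postings, 0));
    }
}
```

Every term is visited even though document 0 contains only two of them, which is the cost that stored term vectors avoid.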
FieldNormModifier is a standalone tool (it defines a static main method) that allows you to recompute all norms in your index according to a specified similarity class. It visits all terms in the inverted index for the field you specify, computing the length, in terms, of that field for all nondeleted documents, and then uses the provided similarity class to compute and set a new norm for each document. This is useful for quickly experimenting with different ways to boost fields according to their length by using a custom Similarity class.
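The recomputation can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the tool's code: the nested map stands in for one field's inverted index, recomputeNorms is a hypothetical helper, and an IntToDoubleFunction plays the role of the similarity class. We also assume that field length is the sum of term frequencies.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

class NormSketch {
    // Hypothetical helper: postings is term -> (document number -> frequency)
    // for a single field. Returns one norm per document; deleted or empty
    // documents keep a norm of 0.0 in this sketch.
    static double[] recomputeNorms(Map<String, Map<Integer, Integer>> postings,
                                   int maxDoc,
                                   Set<Integer> deleted,
                                   IntToDoubleFunction lengthNorm) {
        // Pass 1: visit every term and accumulate each document's field length.
        int[] length = new int[maxDoc];
        for (Map<Integer, Integer> docs : postings.values()) {
            for (Map.Entry<Integer, Integer> e : docs.entrySet()) {
                if (!deleted.contains(e.getKey())) {
                    length[e.getKey()] += e.getValue();
                }
            }
        }
        // Pass 2: turn each length into a norm with the supplied function.
        double[] norms = new double[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!deleted.contains(doc) && length[doc] > 0) {
                norms[doc] = lengthNorm.applyAsDouble(length[doc]);
            }
        }
        return norms;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = Map.of(
            "lucene", Map.of(0, 2, 1, 1),
            "search", Map.of(1, 3),
            "index", Map.of(0, 1, 2, 5));
        // 1/sqrt(length) is used here as an example length normalization.
        double[] norms = recomputeNorms(postings, 3, Set.of(2),
                                        n -> 1.0 / Math.sqrt(n));
        for (double norm : norms) {
            System.out.println(norm);
        }
    }
}
```

Swapping in a different function for the last argument corresponds to trying a different Similarity class.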
HighFreqTerms is a standalone tool that opens the index at the directory path you provide, optionally also taking a specific field, and then prints out the top 100 most frequent terms in the index.
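Selecting the most frequent terms is a classic top-N problem, and a bounded priority queue is one common way to solve it. The following plain-Java sketch illustrates that technique only; it isn't the tool's source. The topTerms helper and the map of term to frequency are hypothetical, and we assume each term has a single frequency count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopTermsSketch {
    // Hypothetical helper: returns the n most frequent terms, most frequent first.
    static List<String> topTerms(Map<String, Integer> freqs, int n) {
        // Min-heap on frequency: the least frequent retained term is on top,
        // so it is the one evicted when the heap grows past n entries.
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<Map.Entry<String, Integer>>(
                (a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            heap.add(e);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        // Polling yields ascending order, so insert at the front to reverse it.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(0, heap.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = Map.of("the", 950, "lucene", 120, "index", 300, "zebra", 2);
        System.out.println(topTerms(freqs, 2)); // prints [the, index]
    }
}
```

The heap never holds more than n + 1 entries, so memory stays small even when the index contains millions of unique terms.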
IndexMergeTool is a standalone tool that opens a series of indexes at the paths you provide, merging them together using IndexWriter.addIndexes. The first argument is the directory that all subsequent directories will be merged into.
SweetSpotSimilarity is an alternative Similarity implementation that provides a plateau of equally good lengths when computing field boost. You have to configure it with the “sweet spot” range of typical lengths for your documents, but this can result in solid improvements to Lucene’s relevance. http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_IBM_Haifa_Team describes a set of experiments on the TREC 2007 Million Queries Track, including SweetSpotSimilarity, that provided sizable improvements to Lucene’s relevance.
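To make the plateau concrete, here is a small plain-Java sketch of a length norm of this shape. It assumes the norm takes the form 1/sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1), where x is the number of terms in the field; the class name, method name, and parameter values are illustrative, not Lucene's API.

```java
class SweetSpotSketch {
    // Illustrative length norm with a plateau: every length between min and max
    // (inclusive) gets a norm of 1.0, and the norm falls off on either side at
    // a rate controlled by steepness.
    static double lengthNorm(int numTerms, int min, int max, double steepness) {
        double distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * distance + 1.0);
    }

    public static void main(String[] args) {
        // Sweet spot of 10 to 20 terms, chosen arbitrarily for the example.
        for (int numTerms : new int[] {5, 10, 15, 20, 23, 100}) {
            System.out.println(numTerms + " terms -> " + lengthNorm(numTerms, 10, 20, 0.5));
        }
    }
}
```

Inside the sweet spot the two absolute values sum to exactly max - min, so the distance term is zero and the norm is 1.0. That is the plateau: a 10-term field is no longer favored over a 20-term one.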
PrecedenceQueryParser is an alternative QueryParser that tries to handle operator precedence in a more consistent manner.
AnalyzingQueryParser is an extension to QueryParser that also passes the text for FuzzyQuery, PrefixQuery, TermRangeQuery, and WildcardQuery instances through the analysis process (the core QueryParser doesn’t).
ComplexPhraseQueryParser is an extension to QueryParser that permits embedding of wildcard and fuzzy queries within a phrase query, such as "(john jon jonathan~) peters*".