13.2. The text extraction pool

One feature that separates a Jackrabbit content repository from a relational database is the ease with which it handles ordinary files. You can drop digital documents such as PowerPoint presentations or PDFs into a content repository and have them searchable by content without any custom indexing setup. Let’s see how Jackrabbit does this.

Whenever a node is added, modified, or removed in Jackrabbit, the integrated Lucene index is updated to match the change. If the node contains binary properties, the contents of those properties are extracted with Tika and added to the index as text. Since text extraction can be time-consuming for some documents, Jackrabbit uses a set of background threads for this purpose. This allows the index to be updated immediately during a save, and then updated again as soon as the extracted text becomes available. Together these updates create the illusion of a super-fast index whose accuracy improves incrementally over time.
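
To make the two-step update concrete, here is a minimal sketch of the idea using only the Java standard library. It is not Jackrabbit's actual implementation: the class `TwoPhaseIndex`, its methods, and the pluggable `extractor` function are hypothetical names invented for illustration, a plain map stands in for the Lucene index, and the extractor stands in for a Tika parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch of an index that is updated twice per save. */
class TwoPhaseIndex {
    // Stand-in for the Lucene index: node path -> text currently indexed.
    private final Map<String, String> indexedText = new ConcurrentHashMap<>();
    // Background threads reserved for slow text extraction.
    private final ExecutorService extractionPool;
    // Stand-in for a Tika parser: turns binary content into plain text.
    private final Function<byte[], String> extractor;

    TwoPhaseIndex(int threads, Function<byte[], String> extractor) {
        this.extractionPool = Executors.newFixedThreadPool(threads);
        this.extractor = extractor;
    }

    /** Indexes the title right away and queues extraction of the binary. */
    Future<?> save(String path, String title, byte[] binary) {
        // First update: cheap properties are searchable as soon as save returns.
        indexedText.put(path, title);
        // Second update: runs later on a pool thread, once extraction finishes.
        return extractionPool.submit(() -> {
            String text = extractor.apply(binary);
            indexedText.put(path, title + " " + text);
        });
    }

    /** Returns true if the text indexed for the path contains the term. */
    boolean matches(String path, String term) {
        String text = indexedText.get(path);
        return text != null && text.contains(term);
    }

    /** Stops accepting work and waits briefly for queued extractions. */
    void shutdown() {
        extractionPool.shutdown();
        try {
            extractionPool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Simulates a slow parser by sleeping before decoding the bytes.
    private static String slowExtract(byte[] bytes) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        TwoPhaseIndex index = new TwoPhaseIndex(2, TwoPhaseIndex::slowExtract);
        byte[] body = "revenue grew".getBytes(StandardCharsets.UTF_8);
        Future<?> pending = index.save("/docs/report.pdf", "report", body);

        System.out.println("title hit after save: "
                + index.matches("/docs/report.pdf", "report"));
        pending.get(); // wait for the background extraction to finish
        System.out.println("content hit after extraction: "
                + index.matches("/docs/report.pdf", "revenue"));
        index.shutdown();
    }
}
```

The point of the design is that `save` never waits for the extractor: a query issued right after the save can already find the node by its cheap properties, and a later query also finds it by its extracted content. A real implementation would additionally have to deal with nodes that are removed or saved again while an extraction is still in flight, which this sketch ignores.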


  
