Free Trial

Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.


  • Create BookmarkCreate Bookmark
  • Create Note or TagCreate Note or Tag
  • DownloadDownload
  • PrintPrint

1.3. Summary

We’ve introduced you to Apache Tika, an extensible Java-based framework for content analysis and detection. First, we explained the motivation for Tika by describing the proliferation of content types such as PDFs, Word, Excel, and HTML, and tools associated with performing functions on those types. The sad fact is that the typical pattern involves specializing knowledge of these types to particular applications, and needing to maintain a set of applications to deal with each type of file (the Microsoft Office suite, Adobe Photoshop, XML editors, and so forth). Beyond applications, many application programming interfaces (APIs) exist that handle these document types, but they’re highly heterogeneous—they make different assumptions, provide different interfaces, and support varying qualities of service. Enter Apache Tika, a mechanism to bridge the content type diversity and to deal with file types in a uniform way.

The nuances and complexity in writing an application (such as a search engine) that must deal with all of these content types at once quickly grow to be untenable without a technology like Tika. Clearly, the problem of obtaining information from these file types is centered around the ability to identify the type of file automatically (especially when dealing with large numbers of such files), extract the textual information, and extract common metadata (such as Dublin Core) useful for quickly comparing and understanding the file types. Language detection is also a needed feature (similar to MIME detection) to determine means for extracting out the textual information from each file type. As it turns out, Apache Tika (big surprise!) provides simple mechanisms to address these functions in modern software.


  

You are currently reading a PREVIEW of this book.

                                                                                        

Get instant access to over
$1 million worth of books and videos.

  

Start a Free Trial