ApacheCon North America will be held at the InterContinental Miami in Miami, Florida. Register now for the event taking place May 16-18, 2017.
Thursday, May 18 • 2:40pm - 3:30pm
Evaluating Text Extraction: Apache Tika's™ New Tika-Eval Module - Tim Allison, The MITRE Corporation

Text extraction tools are essential for obtaining the textual content and metadata of computer files for use in a wide variety of applications, including search and natural language processing tools. Techniques and tools for evaluating text extraction tools are missing from academia and industry.

Apache Tika™ detects file types and extracts metadata and text from many file types. Tika is a crucial component in a wide variety of tools, including Solr™, Nutch™, Alfresco, Elasticsearch and Sleuth Kit®/Autopsy®.

In this talk, we will give an overview of the new tika-eval module, which allows developers to evaluate Tika and other content extraction systems. The talk will end with a brief discussion of the results of taking this evaluation methodology public and evaluating Tika on large batches of public domain documents on a public VM over the last two years.

Speakers
Tim Allison

Principal Artificial Intelligence Engineer, The MITRE Corporation
Tim has been working in natural language processing since 2002. In recent years, his focus has shifted to advanced search and content/metadata extraction. Tim is a committer and PMC member on Apache PDFBox (since September 2016), and on Apache POI and Apache Tika (since July 2013...


Thursday May 18, 2017 2:40pm - 3:30pm EDT
Brickell