IBM and DARPA target unstructured data - UIMA open sourced

Acronyms galore in this post...

UIMA stands for Unstructured Information Management Architecture. It is a Java-based framework for performing semantic analysis on unstructured data such as the text inside emails, Word documents, spreadsheets, PDFs, etc. In my post earlier today about Adobe I mentioned the use of PDF as an archival format for unstructured documents. One of the features Adobe has added in recent versions is full-text search within a PDF as part of the reader. The textual contents of PDFs can also be indexed without too much trouble, allowing developers to enable search across a whole collection of PDF documents at once. All of this is useful and available today, but keyword search doesn't provide much precision, nor does it make it easy for an analyst to spot trends in a large amount of unstructured data.

UIMA attempts to take analysis one step further by providing a framework for semantic search - meaning that the PDFs (for example) could be searched for concepts and related topics rather than by straight keyword matching. The working group behind this technology was sponsored by DARPA (Defense Advanced Research Projects Agency) in the US and organised by IBM Research. The framework has now been released to the open source community and is available on SourceForge.
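To make the difference between keyword matching and concept search a little more concrete, here is a toy sketch in Python. It is not UIMA's API and has nothing to do with how UIMA is implemented; the concept dictionary, the document names and every function below are made up purely for illustration. The only assumption is that a "concept" can be approximated by a hand-written set of terms.

```python
import re

# Hypothetical concept dictionary: each concept maps to terms that signal it.
CONCEPTS = {
    "finance": {"finance", "invoice", "payment", "budget", "refund"},
    "travel": {"travel", "flight", "hotel", "itinerary"},
}

# Made-up documents standing in for text extracted from PDFs or emails.
DOCS = {
    "memo.pdf": "Please approve the invoice before the budget review.",
    "trip.pdf": "Your flight and hotel are booked.",
    "note.pdf": "The finance team meets on Monday.",
}


def tokens(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(word, docs):
    """Return the sorted names of documents containing the literal word."""
    return sorted(name for name, text in docs.items()
                  if word.lower() in tokens(text))


def annotate(text, concepts):
    """Return the set of concepts with at least one term present in the text."""
    words = tokens(text)
    return {concept for concept, terms in concepts.items() if words & terms}


def concept_search(concept, docs, concepts):
    """Return the sorted names of documents annotated with the given concept."""
    return sorted(name for name, text in docs.items()
                  if concept in annotate(text, concepts))
```

With this toy data, `keyword_search("finance", DOCS)` finds only `note.pdf`, while `concept_search("finance", DOCS, CONCEPTS)` also finds `memo.pdf`, because "invoice" and "budget" are listed under the finance concept even though the word "finance" never appears in that document. A real semantic analysis framework would derive such annotations with far more sophisticated analysis than a lookup table.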

More technical detail is available on IBM alphaWorks.

Copyright © Nathan R. Doughty 1994-2005