Tuesday, October 9, 2012

Text Analytics and Taxonomies

What does text analytics have to do with taxonomies? Not so much, I had previously assumed, other than serving a similar objective of information retrieval. After all, text analytics is known as a natural language processing technology designed to obtain meaning from text without the traditional process of indexing to a taxonomy. At the recent Text Analytics World conference in Boston on October 3 and 4, however, I learned that text analytics is much more and that the ties between text analytics and taxonomies are greater than I assumed.

The concept of text analytics is used more broadly than I realized, and, as defined in the opening keynote given by conference chair Tom Reamy, encompasses:
  • Text mining, based on natural language processing, statistics, and machine learning
  • Entity extraction, semantic technology that enables “fact extraction”
  • Sentiment analysis, comprising various methods to look for positive and negative words
  • Auto-categorization, which is often rules-based
I was a presenter at this conference, and since I always talk about what I know, which is taxonomies, I endeavored to make a connection between taxonomies and text analytics. To my surprise, I was not the only one talking about taxonomies at Text Analytics World. Two other presentations featured “taxonomies” in their titles, and together with mine they made up a half-afternoon “Text Analytics and Taxonomies” track. Furthermore, the subject of taxonomies was central to four other presentations and was mentioned in a couple of others.

My presentation, "Taxonomies for Text Analytics and Auto-Indexing," described how text analytics can be used with auto-categorization and taxonomies to achieve relatively high-quality automated indexing results. Auto-categorization is a type of automated indexing that tends to make use of taxonomies, as categorization requires categories (taxonomy terms). Text analytics can be used as a technology to generate meaningful terms from texts, which in turn can be used to auto-categorize content against a pre-existing taxonomy. Auto-categorization typically relies on either complex rules to match terms or algorithms and machine learning. In either case, the terms picked up in auto-categorization would be more meaningful if they were first extracted with text analytics technologies based on natural language processing.
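To make the rules-based variety concrete, here is a minimal sketch in Python. It is my own illustration, not code from any presentation or product: the tiny taxonomy, the variant terms, and the function name `categorize` are all hypothetical, and real auto-categorization rules are usually far richer (proximity, negation, weighting).

```python
import re

# Hypothetical taxonomy: each preferred term maps to the variant
# terms (simple rules) that should trigger it.
TAXONOMY = {
    "Agriculture": ["farming", "crop", "crops", "livestock"],
    "Space exploration": ["spacecraft", "orbit", "launch vehicle"],
}


def categorize(text, taxonomy=TAXONOMY, min_hits=1):
    """Return {taxonomy term: hit count} for terms whose variants occur in text."""
    lowered = text.lower()
    results = {}
    for term, variants in taxonomy.items():
        hits = 0
        for variant in variants:
            # Match the variant as a whole word or phrase, not as a substring.
            pattern = r"\b" + re.escape(variant) + r"\b"
            hits += len(re.findall(pattern, lowered))
        if hits >= min_hits:
            results[term] = hits
    return results


print(categorize("The crop yields improved after new farming methods."))
```

Raising `min_hits` is one crude way to trade recall for precision. A text analytics step that first extracts meaningful terms would replace the plain string matching used here.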

Another presentation looked at a different side of the relationship between taxonomies and text analytics. Text analytics is also used as a means to build taxonomies in the first place, by providing suggested terms that a taxonomist can then edit. Edee Edwards and Rena Morse of Silverchair Information Systems presented a case study on using text analytics to generate terms for taxonomy development. It required multiple iterations and refinements.
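As a rough idea of what "suggested terms" can mean at its simplest, the sketch below counts frequent two-word phrases across a set of documents. This is only an assumption-laden stand-in of my own: it uses plain frequency counting rather than real natural language processing, the stopword list and the name `candidate_terms` are hypothetical, and it is not the method used in the case study.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with"}


def candidate_terms(documents, top_n=5):
    """Suggest frequent two-word phrases as candidate taxonomy terms."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        # Count adjacent word pairs in which neither word is a stopword.
        for first, second in zip(words, words[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[first + " " + second] += 1
    return [term for term, _ in counts.most_common(top_n)]


docs = [
    "Heart disease is common.",
    "Treatment of heart disease varies.",
]
print(candidate_terms(docs, top_n=1))
```

The output is only a list of candidates. A taxonomist would still need to review, merge, and arrange the terms, which is where the multiple iterations come in.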

Other presenters on the subject of taxonomies and text analytics included the following:
  • Heather Edwards of the Associated Press explained how AP classifies the news using a custom-built taxonomy and rule-based auto-classification system.
  • Evelyn Kent of MCT SmartContent also presented how news items are classified using a “context-based language” (taxonomy), and even demonstrated how the taxonomy is managed in the taxonomy tool (SmartLogic Semaphore Ontology Manager).
  • Anna Divoli of Pingar presented survey results of taxonomy user interface preferences from cases that involved automatically generated hierarchical and faceted taxonomies.
  • Alyona Medelyan also of Pingar discussed “controlled indexing” in her case study, which featured results of comparing human versus automated indexing (using machine learning and training sets) using the same taxonomy (the Agrovoc agriculture thesaurus of the FAO).
  • Sarah Ann Berndt of the Johnson Space Center spoke about “automatic generation of semantic markup” in a presentation that turned out to be mostly about the application of a taxonomy.
The subject of taxonomies had also come up in the opening keynote. Tom Reamy described three themes in text analytics: big data, sentiment analysis of social media, and enterprise text analytics. In all three areas he mentioned taxonomies. In the area of text mining and big data, text analytics can serve as a means of semi-automated taxonomy development. In sentiment analysis, new kinds of taxonomies are being developed for emotional sentiments. In enterprise search, text analytics bridges the gap between taxonomies and documents.

Even if text analytics and taxonomies are combined in different ways, what is common is that combining techniques, tools, and technologies in more challenging situations achieves better results. Techniques, tools, and technologies in this field do not have to compete, but can complement each other.


  1. Great post! Sums up the connection between taxonomy and text analytics really well. I notice a trend of companies moving away from rules-based text analytics in favour of semantic machine learning technologies. I think this shift is intended to save the resources required to constantly update taxonomy/categorization rules. I wonder if machines do better at understanding how to apply tags or categorize content over time. It will be interesting to see if any studies emerge in the future.

  2. Great summary. Thanks. A very useful steer.