Sunday, October 6, 2013

Taxonomies and Text Analytics Compared

Last week (September 30 – October 1) I attended the Text Analytics World conference in Boston as an invited speaker.  This is the second year was fortunate to present at and attend this conference, which also meets in San Francisco in the spring. I posted a blog about the conference last fall, “Text Analytics and Taxonomies,” discussing the strong connections between taxonomies and text analytics in serving similar data/information retrieval goals. That connection between the two was again apparent at this year’s conference, with many speakers mentioning taxonomies, and I came away with additional analogies, beyond their shared purpose.

Problematic definition

Both taxonomies and text analytics are not well defined, and can have both a narrow definition and a broad definition. For taxonomies, the narrower meaning is a hierarchical tree of concepts arranged with broader and narrower relationships. The broad meaning of taxonomy is any controlled vocabulary, whether hierarchies, facets, thesauri, authority files, or simple terms lists to fill metadata fields. For text analytics, the narrower meaning is “text mining”, the process of deriving high-quality information contained in natural language text. But the conference chair, Tom Reamy of the KAPS Group, explained that the conference takes a broader definition of text analytics to include not only text mining but also, auto-categorization, sentiment analysis, predictive analytics, entity extraction, and machine learning.

There is also the issue of whether the name is appropriate. Some people don’t like the name taxonomies, and try to avoid it. Similarly, there are issues with the designation of “text analytics.” Discussion in the conference’s expert sessions and closing session, brought up the issue that perhaps a better name is needed for the field. Both “text” and “analytics” have issues, as they both have assumed narrower meanings. It comes out of the field of knowledge management, but that field is too broad. A more accurate label that Tom Reamy suggested was “unified data insights,” but it will stay text analytics for now.

Technology and human effort

Both taxonomies and text analytics rely on technology/software, but neither is a 100% automated solution, nor can the software products be used an out-of-the-box solutions without significant trained and skilled usage. If we consider the software as “tools” rather than “solutions,” we have a more realistic understanding of what the software can do. The process of building a taxonomy is aided by taxonomy or thesaurus management software, which is kind of a tool that an experienced taxonomist uses to manage the terms, relationships, synonyms, notes/definitions, and other term attributes. Similarly text analytics software, and auto-classification software in particular, requires expertise to leverage the tool for desired results. This was the theme of a presentation on selecting text analytics tools by Janine Johnson of Versik Analytics (who also used “tool” in her presentation title).

As I explained in my presentation, “Taxonomies for Auto-Tagging Unstructured Content,” both of the leading methods of auto-categorization, rules-based machine learning statistical methods, require considerable human input. In rules-based auto-categorization, experts need to write or edit rules for each taxonomy concept that leverage combinations of synonyms and proximity or other Boolean operators; and in machine-learning auto-categorization, experts need to identify and essentially pre-index a large set of sample documents for each taxonomy term, for the system to learn from the human indexed example.


Multidisciplinary background

Both taxonomies and text analytics are seen as a fields of expertise, methods of knowledge management, and at least parts of a solution to an organization’s information management problem. However they are not academic disciplines or majors. Rather, the educational background and skills of people who work in the fields of both taxonomies and text analytics is somewhat varied and multidisciplinary.

In taxonomies, library/information science is the most dominant background, but probably does not account for any more than half of practicing taxonomies. Information architecture/user experience design, database design, knowledge management, editorial, and subject matter (health, law, science, business, etc.) expertise are also common backgrounds.

In text analytics, computer science is the most common background. A show of hands of the conference participants indicated that the majority had computer science or engineering backgrounds. But linguistics is also important (although the small minority at this conference were more hesitant to reveal themselves). The keynote speaker, Dr. James Pennebaker, was a psychologist and explained why psychology is also important to text analytics. Participants in the closing expert panel answered my question on educational background with a similar answer of a combination of computer science/programming, linguistics, and cognitive sciences.

In addition to the interdisciplinary background of taxonomists and text analytics professionals, the applications of taxonomies and text analytics also span all disciplines and industries. Conference case studies included applications of text analytics in education, pharmaceuticals, healthcare, publishing, telecommunications, and federal agencies.