The Accidental Taxonomist: Text analytics

Sunday, October 6, 2013

Taxonomies and Text Analytics Compared

Last week (September 30 – October 1) I attended the Text Analytics World conference in Boston as an invited speaker. This is the second year I was fortunate to present at and attend this conference, which also meets in San Francisco in the spring. I posted a blog entry about the conference last fall, “Text Analytics and Taxonomies,” discussing the strong connections between taxonomies and text analytics in serving similar data/information retrieval goals. That connection was again apparent at this year’s conference, with many speakers mentioning taxonomies, and I came away with additional analogies beyond their shared purpose.

Problematic definition

Neither taxonomies nor text analytics is well defined, and each can have both a narrow definition and a broad definition. For taxonomies, the narrower meaning is a hierarchical tree of concepts arranged with broader and narrower relationships. The broad meaning of taxonomy is any controlled vocabulary, whether hierarchies, facets, thesauri, authority files, or simple term lists to fill metadata fields. For text analytics, the narrower meaning is “text mining,” the process of deriving high-quality information contained in natural language text. But the conference chair, Tom Reamy of the KAPS Group, explained that the conference takes a broader definition of text analytics to include not only text mining but also auto-categorization, sentiment analysis, predictive analytics, entity extraction, and machine learning.

There is also the issue of whether the name is appropriate. Some people don’t like the name taxonomies, and try to avoid it. Similarly, there are issues with the designation of “text analytics.” Discussion in the conference’s expert sessions and closing session brought up the issue that perhaps a better name is needed for the field. Both “text” and “analytics” have issues, as they both have assumed narrower meanings. Text analytics comes out of the field of knowledge management, but that field is too broad. A more accurate label that Tom Reamy suggested was “unified data insights,” but it will stay text analytics for now.

Technology and human effort

Both taxonomies and text analytics rely on technology/software, but neither is a 100% automated solution, nor can the software products be used as out-of-the-box solutions without significant trained and skilled usage. If we consider the software as “tools” rather than “solutions,” we have a more realistic understanding of what the software can do. The process of building a taxonomy is aided by taxonomy or thesaurus management software, which is the kind of tool that an experienced taxonomist uses to manage the terms, relationships, synonyms, notes/definitions, and other term attributes. Similarly, text analytics software, and auto-classification software in particular, requires expertise to leverage the tool for desired results. This was the theme of a presentation on selecting text analytics tools by Janine Johnson of Versik Analytics (who also used “tool” in her presentation title).

As I explained in my presentation, “Taxonomies for Auto-Tagging Unstructured Content,” both of the leading methods of auto-categorization, rules-based methods and machine-learning statistical methods, require considerable human input. In rules-based auto-categorization, experts need to write or edit rules for each taxonomy concept that leverage combinations of synonyms and proximity or other Boolean operators. In machine-learning auto-categorization, experts need to identify and essentially pre-index a large set of sample documents for each taxonomy term, so that the system can learn from the human-indexed examples.
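To make the rules-based approach more concrete, here is a minimal sketch in Python of what such rules might look like. It is only an illustration: the rule format, the concept names, and the helper functions (tokenize, within, categorize) are all hypothetical and do not reflect the syntax of any particular auto-categorization product. Each taxonomy concept gets a list of synonyms plus optional proximity conditions, and a document is tagged with the concept if any synonym matches or any proximity condition is met.

```python
import re

# Hypothetical rule set: one entry per taxonomy concept.
# "synonyms": any one of these at the start of a word triggers the concept.
# "near": (word_a, word_b, window) fires when both word stems occur
#         within `window` tokens of each other.
RULES = {
    "Mergers and acquisitions": {
        "synonyms": ["merger", "acquisition", "takeover"],
        "near": [("acquire", "company", 5)],
    },
    "Clinical trials": {
        "synonyms": ["clinical trial", "placebo"],
        "near": [("drug", "trial", 4)],
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def within(tokens, a, b, window):
    """True if a token starting with `a` is within `window` tokens of one starting with `b`."""
    pos_a = [i for i, t in enumerate(tokens) if t.startswith(a)]
    pos_b = [i for i, t in enumerate(tokens) if t.startswith(b)]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def categorize(text, rules=RULES):
    """Return the taxonomy concepts whose rule matches the text (Boolean OR of conditions)."""
    tokens = tokenize(text)
    normalized = " ".join(tokens)
    matched = []
    for concept, rule in rules.items():
        synonym_hit = any(
            re.search(r"\b" + re.escape(s), normalized) for s in rule["synonyms"]
        )
        proximity_hit = any(within(tokens, a, b, n) for a, b, n in rule["near"])
        if synonym_hit or proximity_hit:
            matched.append(concept)
    return matched


print(categorize("The firm plans to acquire a rival company next year."))
```

Even in this toy form, the human effort is visible: someone has to choose the synonyms, decide how close two words must be, and revise the rules when they tag the wrong documents.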

Multidisciplinary background

Both taxonomies and text analytics are seen as fields of expertise, methods of knowledge management, and at least parts of a solution to an organization’s information management problem. However, they are not academic disciplines or majors. Rather, the educational backgrounds and skills of people who work in the fields of both taxonomies and text analytics are somewhat varied and multidisciplinary.

In taxonomies, library/information science is the most dominant background, but probably does not account for any more than half of practicing taxonomists. Information architecture/user experience design, database design, knowledge management, editorial, and subject matter (health, law, science, business, etc.) expertise are also common backgrounds.

In text analytics, computer science is the most common background. A show of hands of the conference participants indicated that the majority had computer science or engineering backgrounds. But linguistics is also important (although the small minority at this conference were more hesitant to reveal themselves). The keynote speaker, Dr. James Pennebaker, was a psychologist and explained why psychology is also important to text analytics. Participants in the closing expert panel answered my question on educational background similarly, citing a combination of computer science/programming, linguistics, and cognitive sciences.

In addition to the interdisciplinary background of taxonomists and text analytics professionals, the applications of taxonomies and text analytics also span all disciplines and industries. Conference case studies included applications of text analytics in education, pharmaceuticals, healthcare, publishing, telecommunications, and federal agencies.

Tuesday, October 9, 2012

Text Analytics and Taxonomies

What does text analytics have to do with taxonomies? Not so much, I had previously assumed, other than serving a similar objective of information retrieval. After all, text analytics is known as a natural language processing technology designed to obtain meaning from text without the traditional process of indexing to a taxonomy. At the recent Text Analytics World conference in Boston on October 3 and 4, however, I learned that text analytics is much more and that the ties between text analytics and taxonomies are greater than I assumed.

The concept of text analytics is used more broadly than I realized, and, as defined in the opening keynote given by conference chair Tom Reamy, encompasses:

Text mining, based on natural language processing, statistics, and machine learning
Entity extraction, semantic technology that enables “fact extraction”
Sentiment analysis, comprising various methods to look for positive and negative words
Auto-categorization, which is often rules-based

I was a presenter at this conference, and since I always talk about what I know, which is taxonomies, I endeavored to make a connection between taxonomies and text analytics. But to my surprise I was not the only one talking about taxonomies at Text Analytics World. Two other presentations featured “taxonomies” in their titles, and together with mine they made up a half-afternoon “Text Analytics and Taxonomies” track. Furthermore, the subject of taxonomies was central to four other presentations and mentioned in a couple of others.

My presentation, "Taxonomies for Text Analytics and Auto-Indexing," described how text analytics can be used with auto-categorization and taxonomies to achieve relatively high quality automated indexing results. Auto-categorization is a type of automated indexing that tends to make use of taxonomies, as categorization requires categories (taxonomy terms). Text analytics can be used as a technology to generate meaningful terms from texts, which in turn can be used to auto-categorize content against a pre-existing taxonomy. Auto-categorization typically relies on either complex rules to match terms or on algorithms and machine learning. In either case, the terms picked up in auto-categorization would be more meaningful if they were first extracted with text analytics technologies based on natural language processing.
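As a rough illustration of the machine-learning side of auto-categorization, the following Python sketch trains a very small classifier from documents that a person has already indexed with taxonomy terms. It is a sketch under stated assumptions, not the method used by any product or presenter: the technique chosen here is a basic naive Bayes word-count model, and the sample sentences, the taxonomy terms "Crops" and "Livestock", and the function names are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Build word counts per taxonomy term from (text, term) pairs indexed by a human."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, term in samples:
        word_counts[term].update(tokenize(text))
        doc_counts[term] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the taxonomy term with the highest naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_term, best_score = None, float("-inf")
    for term in doc_counts:
        total_words = sum(word_counts[term].values())
        score = math.log(doc_counts[term] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # ignore words never seen in training
                # Add-one smoothing so unseen word/term pairs do not zero out the score
                score += math.log(
                    (word_counts[term][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_term, best_score = term, score
    return best_term


# Invented, human-indexed training examples (a real set would be far larger).
SAMPLES = [
    ("Wheat and maize yields rose after the rainy season", "Crops"),
    ("Farmers planted rice and wheat early this year", "Crops"),
    ("Cattle and sheep were vaccinated against disease", "Livestock"),
    ("Dairy cattle herds grew in the region", "Livestock"),
]

model = train(SAMPLES)
print(classify("Wheat harvest and rice prices", model))
```

The classifier only knows the words it saw in the sample documents, which is why the choice and quantity of pre-indexed examples matters so much, and why terms first extracted with natural language processing can make better input than raw words.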

Another presentation looked at a different side of the relationship between taxonomies and text analytics. Text analytics is also used as a means to build taxonomies in the first place, by providing suggested terms that a taxonomist can then edit. Edee Edwards and Rena Morse of Silverchair Information Systems presented a case study on using text analytics to generate terms for taxonomy development. It required multiple iterations and refinements.

Other presenters on the subject of taxonomies and text analytics included the following:

Heather Edwards of the Associated Press explained how AP classifies the news using a custom-built taxonomy and rule-based auto-classification system.
Evelyn Kent of MCT SmartContent also presented how news items are classified using a “context-based language” (taxonomy), and even demonstrated how the taxonomy is managed in the taxonomy tool (SmartLogic Semaphore Ontology Manager).
Anna Divoli of Pingar presented survey results of taxonomy user interface preferences from cases that involved automatically generated hierarchical and faceted taxonomies.
Alyona Medelyan also of Pingar discussed “controlled indexing” in her case study, which featured results of comparing human versus automated indexing (using machine learning and training sets) using the same taxonomy (the Agrovoc agriculture thesaurus of the FAO).
Sarah Ann Berndt of the Johnson Space Center spoke about “automatic generation of semantic markup” in a presentation that turned out to be mostly about the application of a taxonomy.

The subject of taxonomies had also come up in the opening keynote. Tom Reamy described three themes in text analytics: big data, sentiment analysis of social media, and enterprise text analytics. In all three areas he mentioned taxonomies. In the area of text mining and big data, text analytics can serve as a means of semi-automated taxonomy development. In sentiment analysis, new kinds of taxonomies are being developed for emotional sentiments. In enterprise search, text analytics bridges the gap between taxonomies and documents.

Even if text analytics and taxonomies are combined in different ways, what is common is that combining techniques, tools, and technologies in more challenging situations achieves better results. Techniques, tools, and technologies in this field do not have to compete, but can complement each other.
