The Accidental Taxonomist: Auto-categorization

Friday, November 24, 2017

Auto-categorization and Taxonomies

Taxonomies and thesauri are only truly useful if their terms are appropriately indexed or tagged to content. My path to becoming a taxonomist was through indexing, so I have always valued human indexers. Nevertheless, I must acknowledge that automated indexing, also called auto-categorization, is becoming increasingly common and important.

At the most recent Taxonomy Boot Camp conference (November 6-7, in Washington, DC), a trend I discerned was the increasingly commonplace use of auto-categorization (or at least machine-aided indexing) with taxonomies. Conference presentations didn’t present auto-categorization as something new but rather as something matter-of-fact: by the way, the software vendor used in this case is so-and-so. There were also sessions on artificial intelligence and taxonomy and on leveraging taxonomy management with machine learning. There is also a lot of interest in text analytics, a field broader than auto-categorization, which justified the first Text Analytics Forum conference, co-located with and immediately following Taxonomy Boot Camp (which I, unfortunately, did not have time for).

Conference speakers and others state that automated indexing has been proven repeatedly in test comparisons to be more “reliable” and more “consistent” than human/manual indexing. While that is true, it does not mean that automated indexing is better. Human indexing is certainly not as consistent, as two trained indexers will not index exactly the same way, but the differences between them are rarely substantial. One indexer may add an additional index term. Another indexer may index with a slightly different, but related, term. Automated indexing, on the other hand, while consistent, is not as correct. Depending on the method, it can be approximately 20% inaccurate, indexing with completely wrong terms or completely missing the most appropriate terms. That’s where “machine-aided indexing” comes in, where indexing is initially automated, but a human quickly reviews the suggested terms, adding or deleting terms as appropriate.
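The machine-aided indexing workflow described above can be sketched in a few lines of Python. This is only an illustration: the function names `review_suggestions` and `rejection_rate`, and the sample terms, are hypothetical and not taken from any particular indexing product.

```python
def review_suggestions(suggested, rejected=(), added=()):
    """Apply a human reviewer's decisions to machine-suggested index terms."""
    rejected = set(rejected)
    # Keep every suggested term the reviewer did not reject.
    final = [term for term in suggested if term not in rejected]
    # Append terms the reviewer added that the software missed.
    for term in added:
        if term not in final:
            final.append(term)
    return final


def rejection_rate(suggested, rejected):
    """Share of machine-suggested terms that the reviewer threw out."""
    if not suggested:
        return 0.0
    return len(set(suggested) & set(rejected)) / len(suggested)


# Invented example: the software suggests five terms, one of them clearly wrong.
suggested = ["Taxonomies", "Indexing", "Machine learning",
             "Conferences", "Boot camps (military)"]
rejected = ["Boot camps (military)"]
added = ["Auto-categorization"]

final_terms = review_suggestions(suggested, rejected, added)
rate = rejection_rate(suggested, rejected)
```

Tracking the rejection rate over time is one simple way a team could judge whether the automated suggestions are good enough to reduce the amount of human review.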

The primary reason for implementing automated indexing is not so much to achieve consistent indexing, but rather to achieve efficient indexing. This is because the amount of content to be indexed in many organizations is growing too fast for manual indexing to keep up. Publishers of external content for subscribers have also transitioned to partially automated indexing or machine-aided indexing.

While enterprise search engines do not utilize taxonomies by default (but can be configured to make use of them), auto-categorization software generally uses some form of taxonomy. Search engines can function out-of-the-box without any taxonomies or controlled vocabularies, although a search thesaurus (a.k.a. synonym ring) can significantly improve search precision and recall. Auto-categorization software, on the other hand, relies on “categories,” which can be simple controlled vocabularies or hierarchical or faceted taxonomies. Thus, as auto-categorization gains wider adoption, the need for taxonomies to support it is also growing.
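To illustrate how a synonym ring can help search, here is a minimal sketch of query expansion. The rings, the function names `expand_query` and `matches`, and the sample document are all invented for illustration; real search engines configure synonyms in their own formats, and this sketch handles single-word synonyms only.

```python
# Each ring is a set of terms treated as equivalent for searching (hypothetical data).
SYNONYM_RINGS = [
    {"car", "automobile", "auto"},
    {"physician", "doctor"},
]


def expand_query(query):
    """Replace each query word with the sorted list of all terms in its ring."""
    expanded = []
    for word in query.lower().split():
        ring = next((r for r in SYNONYM_RINGS if word in r), {word})
        expanded.append(sorted(ring))
    return expanded


def matches(document, expanded_query):
    """True if the document contains at least one variant of every query word."""
    words = set(document.lower().split())
    return all(any(v in words for v in variants) for variants in expanded_query)


doc = "Automobile repair manual"
plain = [["car"], ["repair"]]           # the query without a synonym ring
expanded = expand_query("car repair")   # the query with the ring applied
```

Here the plain query misses the document because it says "automobile" rather than "car", while the expanded query finds it. That is the recall improvement a search thesaurus can provide.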

Automated indexing technologies have not advanced significantly in recent years, but there have been improvements in auto-categorization software that effectively combines more than one technology method within the same software product. The main technology methods are (1) rules-based and (2) machine learning. Regardless of the method, automated indexing is still not fully automated. Humans are required to put in time and effort beforehand to either write or edit rules for each taxonomy term, or to provide and test training sets of sample documents to index for machine learning. These tasks could be dedicated roles or additional tasks to be performed by the taxonomist.
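The machine-learning method depends on training sets that a human has already indexed. As a rough sketch of the idea, the following uses a small naive Bayes classifier written with the Python standard library. Naive Bayes is my assumption for illustration only; commercial auto-categorization products use their own, usually more sophisticated, algorithms, and the training documents here are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(training_set):
    """training_set: (document text, taxonomy term) pairs pre-indexed by a human."""
    word_counts = defaultdict(Counter)  # taxonomy term -> word frequencies
    doc_counts = Counter()              # taxonomy term -> number of sample documents
    for text, term in training_set:
        word_counts[term].update(tokenize(text))
        doc_counts[term] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the taxonomy term with the highest naive Bayes score for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for term in doc_counts:
        total_words = sum(word_counts[term].values())
        score = math.log(doc_counts[term] / total_docs)
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                # Add-one smoothing so an unseen word does not zero out the score.
                score += math.log((word_counts[term][word] + 1)
                                  / (total_words + len(vocabulary)))
        scores[term] = score
    return max(scores, key=scores.get)


# Invented training set: two human-indexed sample documents per taxonomy term.
training_set = [
    ("The central bank raised interest rates", "Finance"),
    ("Stock markets fell as interest rates rose", "Finance"),
    ("The team won the championship game", "Sports"),
    ("The striker scored in the final game", "Sports"),
]
model = train(training_set)
```

With only two samples per term this is a toy; in practice the human effort lies in assembling and testing a much larger set of sample documents for every taxonomy term.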

Auto-categorization is also becoming more common because software products that effectively combine taxonomy management with auto-categorization have become more established and better integrated. Although many organizations continue to use separate software for taxonomy management and for auto-categorization, organizations newer to taxonomy adoption prefer to have a single solution. Synaptica is the one major taxonomy management vendor which does not yet include fully integrated auto-categorization, and they are very actively working on incorporating the technology. I have separate chapters in my book, The Accidental Taxonomist, for taxonomy management software and for auto-categorization software, but in my second edition I ended up repeating more vendors in both chapters.

Sunday, October 6, 2013

Taxonomies and Text Analytics Compared

Last week (September 30 – October 1) I attended the Text Analytics World conference in Boston as an invited speaker. This is the second year I was fortunate to present at and attend this conference, which also meets in San Francisco in the spring. I posted a blog about the conference last fall, “Text Analytics and Taxonomies,” discussing the strong connections between taxonomies and text analytics in serving similar data/information retrieval goals. That connection between the two was again apparent at this year’s conference, with many speakers mentioning taxonomies, and I came away with additional analogies, beyond their shared purpose.

Problematic definition

Neither taxonomies nor text analytics is well defined, and each can have both a narrow definition and a broad definition. For taxonomies, the narrower meaning is a hierarchical tree of concepts arranged with broader and narrower relationships. The broad meaning of taxonomy is any controlled vocabulary, whether hierarchies, facets, thesauri, authority files, or simple term lists to fill metadata fields. For text analytics, the narrower meaning is “text mining,” the process of deriving high-quality information contained in natural language text. But the conference chair, Tom Reamy of the KAPS Group, explained that the conference takes a broader definition of text analytics to include not only text mining but also auto-categorization, sentiment analysis, predictive analytics, entity extraction, and machine learning.

There is also the issue of whether the name is appropriate. Some people don’t like the name taxonomies, and try to avoid it. Similarly, there are issues with the designation of “text analytics.” Discussion in the conference’s expert sessions and closing session brought up the issue that perhaps a better name is needed for the field. Both “text” and “analytics” have issues, as they both have assumed narrower meanings. Text analytics comes out of the field of knowledge management, but that field is too broad. A more accurate label that Tom Reamy suggested was “unified data insights,” but it will stay text analytics for now.

Technology and human effort

Both taxonomies and text analytics rely on technology/software, but neither is a 100% automated solution, nor can the software products be used as out-of-the-box solutions without significant trained and skilled usage. If we consider the software as “tools” rather than “solutions,” we have a more realistic understanding of what the software can do. The process of building a taxonomy is aided by taxonomy or thesaurus management software, which is a kind of tool that an experienced taxonomist uses to manage the terms, relationships, synonyms, notes/definitions, and other term attributes. Similarly, text analytics software, and auto-classification software in particular, requires expertise to leverage the tool for desired results. This was the theme of a presentation on selecting text analytics tools by Janine Johnson of Versik Analytics (who also used “tool” in her presentation title).

As I explained in my presentation, “Taxonomies for Auto-Tagging Unstructured Content,” both of the leading methods of auto-categorization, rules-based methods and machine-learning statistical methods, require considerable human input. In rules-based auto-categorization, experts need to write or edit rules for each taxonomy concept that leverage combinations of synonyms and proximity or other Boolean operators. In machine-learning auto-categorization, experts need to identify and essentially pre-index a large set of sample documents for each taxonomy term, so that the system can learn from the human-indexed examples.
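To give a sense of what a rule for a single taxonomy concept might involve, here is a minimal sketch that combines synonyms, a proximity condition, and a Boolean NOT. The rule format (`any`, `near`, `not`), the function names, and the sample rule are hypothetical; each auto-categorization product has its own rule syntax, and this sketch does no stemming, so plural forms would need to be listed explicitly.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(tokens, phrase):
    """Start indexes where a (possibly multi-word) phrase occurs in the tokens."""
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def near(tokens, phrases_a, phrases_b, distance):
    """True if any phrase from each list starts within `distance` words of the other."""
    pos_a = [p for ph in phrases_a for p in positions(tokens, ph)]
    pos_b = [p for ph in phrases_b for p in positions(tokens, ph)]
    return any(abs(a - b) <= distance for a in pos_a for b in pos_b)


# One hypothetical rule per taxonomy concept, written and maintained by a person.
RULES = {
    "Mergers and acquisitions": {
        "any": ["merger", "acquisition", "takeover", "buyout"],  # synonyms
        "near": (["company", "firm", "corporation"], 10),        # proximity context
        "not": ["museum"],                                        # Boolean exclusion
    },
}


def categorize(text, rules=RULES):
    """Return the taxonomy concepts whose rules match the text."""
    tokens = tokenize(text)
    assigned = []
    for concept, rule in rules.items():
        if any(positions(tokens, p) for p in rule.get("not", [])):
            continue  # an excluded word is present, so skip this concept
        context, distance = rule["near"]
        if near(tokens, rule["any"], context, distance):
            assigned.append(concept)
    return assigned
```

For example, `categorize("The firm announced a takeover of its rival.")` assigns the concept, while a sentence about a museum acquiring a painting does not. Multiply the effort of writing and testing such a rule by every concept in the taxonomy and the amount of human input becomes clear.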

Multidisciplinary background

Both taxonomies and text analytics are seen as fields of expertise, methods of knowledge management, and at least parts of a solution to an organization’s information management problem. However, they are not academic disciplines or majors. Rather, the educational backgrounds and skills of people who work in the fields of both taxonomies and text analytics are varied and multidisciplinary.

In taxonomies, library/information science is the most dominant background, but it probably does not account for any more than half of practicing taxonomists. Information architecture/user experience design, database design, knowledge management, editorial, and subject matter (health, law, science, business, etc.) expertise are also common backgrounds.

In text analytics, computer science is the most common background. A show of hands of the conference participants indicated that the majority had computer science or engineering backgrounds. But linguistics is also important (although the small minority at this conference were more hesitant to reveal themselves). The keynote speaker, Dr. James Pennebaker, was a psychologist and explained why psychology is also important to text analytics. Participants in the closing expert panel answered my question on educational background with a similar answer of a combination of computer science/programming, linguistics, and cognitive sciences.

In addition to the interdisciplinary background of taxonomists and text analytics professionals, the applications of taxonomies and text analytics also span all disciplines and industries. Conference case studies included applications of text analytics in education, pharmaceuticals, healthcare, publishing, telecommunications, and federal agencies.

Tuesday, October 9, 2012

Text Analytics and Taxonomies

What does text analytics have to do with taxonomies? Not so much, I had previously assumed, other than serving a similar objective of information retrieval. After all, text analytics is known as a natural language processing technology designed to obtain meaning from text without the traditional process of indexing to a taxonomy. At the recent Text Analytics World conference in Boston on October 3 and 4, however, I learned that text analytics is much more and that the ties between text analytics and taxonomies are greater than I assumed.

The concept of text analytics is used more broadly than I realized, and, as defined in the opening keynote given by conference chair Tom Reamy, encompasses:

Text mining, based on natural language processing, statistics, and machine learning
Entity extraction, semantic technology that enables “fact extraction”
Sentiment analysis, comprising various methods to look for positive and negative words
Auto-categorization, which is often rules-based

I was a presenter at this conference, and since I always talk about what I know, which is taxonomies, I endeavored to make a connection between taxonomies and text analytics. But to my surprise I was not the only one talking about taxonomies at Text Analytics World. Two other presentations featured “taxonomies” in their titles, and together with mine they made up a half-afternoon “Text Analytics and Taxonomies” track. Furthermore, the subject of taxonomies was central to four other presentations and mentioned in a couple of others.

My presentation, “Taxonomies for Text Analytics and Auto-Indexing,” described how text analytics can be used with auto-categorization and taxonomies to achieve relatively high-quality automated indexing results. Auto-categorization is a type of automated indexing that tends to make use of taxonomies, as categorization requires categories (taxonomy terms). Text analytics can be used as a technology to generate meaningful terms from texts, which in turn can be used to auto-categorize content against a pre-existing taxonomy. Auto-categorization typically involves either complex rules to match terms or algorithms and machine learning. In either case, the terms picked up in auto-categorization would be more meaningful if they were first extracted with text analytics technologies based on natural language processing.

Another presentation looked at a different side of the relationship between taxonomies and text analytics. Text analytics is also used as a means to build taxonomies in the first place, by providing suggested terms that a taxonomist can then edit. Edee Edwards and Rena Morse of Silverchair Information Systems presented a case study on using text analytics to generate terms for taxonomy development. It required multiple iterations and refinements.
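As a very rough illustration of generating suggested terms from text, the sketch below counts frequent single words and two-word phrases across a set of documents. This simple frequency count is my stand-in for illustration; it is not the method used in the Silverchair case study, and real text analytics tools rely on more capable natural language processing. The stopword list and sample documents are invented.

```python
import re
from collections import Counter

# A tiny stopword list for the example; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "with"}


def candidate_terms(documents, min_count=2):
    """Return (term, count) pairs for words and two-word phrases seen min_count+ times."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        for n in (1, 2):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip candidates that begin or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


documents = [
    "Heart failure is treated with beta blockers.",
    "Beta blockers reduce mortality in heart failure.",
    "Chronic heart failure patients",
]
suggestions = dict(candidate_terms(documents))
```

The output is only a list of candidates. A taxonomist still has to decide which ones are real concepts, pick preferred labels, and merge variants, which is why the process takes several iterations.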

Other presenters on the subject of taxonomies and text analytics included the following:

Heather Edwards of the Associated Press explained how AP classifies the news using a custom-built taxonomy and rule-based auto-classification system.
Evelyn Kent of MCT SmartContent also presented how news items are classified using a “context-based language” (taxonomy), and even demonstrated how the taxonomy is managed in the taxonomy tool (SmartLogic Semaphore Ontology Manager).
Anna Divoli of Pingar presented survey results of taxonomy user interface preferences from cases that involved automatically generated hierarchical and faceted taxonomies.
Alyona Medelyan, also of Pingar, discussed “controlled indexing” in her case study, which featured results of comparing human versus automated indexing (using machine learning and training sets) using the same taxonomy (the Agrovoc agriculture thesaurus of the FAO).
Sarah Ann Berndt of the Johnson Space Center spoke about “automatic generation of semantic markup” in a presentation that turned out to be mostly about the application of a taxonomy.
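Comparisons of human versus automated indexing against the same taxonomy, such as the one mentioned in the list above, are commonly scored with precision and recall. The sketch below is a generic illustration of those measures with invented terms; it is not taken from any of the presentations and does not reproduce their results.

```python
def precision_recall_f1(human_terms, machine_terms):
    """Score machine-assigned terms against human-assigned terms for one document."""
    human, machine = set(human_terms), set(machine_terms)
    correct = human & machine
    # Precision: how many of the machine's terms the human also chose.
    precision = len(correct) / len(machine) if machine else 0.0
    # Recall: how many of the human's terms the machine found.
    recall = len(correct) / len(human) if human else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example terms for a single agricultural document.
human = ["Irrigation", "Soil fertility", "Crop yield", "Rice"]
machine = ["Irrigation", "Crop yield", "Rice", "Rivers"]

precision, recall, f1 = precision_recall_f1(human, machine)
```

In this made-up case three of the four machine terms are correct and three of the four human terms are found. Treating the human indexing as the gold standard is itself an assumption, since two human indexers rarely agree completely.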

The subject of taxonomies had also come up in the opening keynote. Tom Reamy described three themes in text analytics: big data, sentiment analysis of social media, and enterprise text analytics. In all three areas he mentioned taxonomies. In the area of text mining and big data, text analytics can serve as a means of semi-automated taxonomy development. In sentiment analysis, new kinds of taxonomies are being developed for emotional sentiments. In enterprise search, text analytics bridges the gap between taxonomies and documents.

Even if text analytics and taxonomies are combined in different ways, what is common is that combining techniques, tools, and technologies in more challenging situations achieves better results. Techniques, tools, and technologies in this field do not have to compete, but can complement each other.