Sunday, November 22, 2020

What is a Thesaurus and What is it Good For

It is somewhat ironic that, in the domain of controlled vocabularies and knowledge organization systems, there continue to exist differing meanings for “controlled vocabulary,” “taxonomy,” “thesaurus,” “ontology,” and “knowledge graph.” Hopefully, I have provided some clarification regarding what a taxonomy is and is not in my previous posts on taxonomy vs. classification, taxonomy vs. navigation, and when a taxonomy should not be hierarchical. Let’s turn now to thesauri.

Different meanings of thesaurus

I recently attended a webinar on taxonomies, ontologies, and knowledge graphs, in which a thesaurus was described as a set of synonyms for each identified concept in a list. This is not the right definition for this context. A set of synonyms for each of a list of concepts is what we taxonomists call a “synonym ring,” and what administrators of enterprise search engines would call a “search thesaurus.” The use of the word “thesaurus” in this case refers to the dictionary-type thesaurus (as in the default Thesaurus entry in Wikipedia), such as Roget’s Thesaurus, where synonyms are presented for each word. Synonyms are included to support search: they match the words and phrases that users might enter into the search box with the words and phrases that are likely to occur in the text of content, so that content is not missed because the searcher used a different synonym.

The “search thesaurus” (synonym ring) differs from the synonym-dictionary thesaurus in several ways, however, because of their different uses:

  • A search thesaurus includes phrases, not just single words as in a dictionary thesaurus.
  • A search thesaurus comprises concepts that are nouns, verbal nouns, or noun phrases, rather than any part of speech, as a dictionary thesaurus may include.
  • The “synonyms” in a search thesaurus are appropriately equivalent terms that can be used interchangeably in all cases for the content repository, not synonyms that may be used in only some cases, as the dictionary suggests.
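As a rough illustration of how a synonym ring supports search, the following Python sketch expands a user's query to every term in the same ring. The ring contents and the function name expand_query are made up for this example and do not reflect any particular search engine's configuration.

```python
# Hypothetical synonym rings: each ring is a set of terms treated as
# fully interchangeable. No term in a ring is "preferred" over the others.
SYNONYM_RINGS = [
    {"car", "automobile", "motor vehicle"},
    {"physician", "doctor", "medical doctor"},
]


def expand_query(query: str) -> set[str]:
    """Return the normalized query plus every term that shares a ring with it."""
    normalized = query.strip().lower()
    expanded = {normalized}
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            expanded |= ring
    return expanded
```

A search for “Doctor” would then also match content that says “physician” or “medical doctor,” while a term that is in no ring is searched as entered. Note that the rings hold multi-word phrases and that there are no relationships between rings, which is what separates this from an information-retrieval thesaurus.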

However, in the context of taxonomies/ontologies (not the context of search administration), the designation thesaurus has a significantly different meaning. This kind of thesaurus is also referred to as an information thesaurus or information-retrieval thesaurus (to distinguish it from the synonym-dictionary type), and it has a separate entry in Wikipedia, Thesaurus (Information Retrieval), which defines it as “a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects.” This is the meaning that relates to taxonomies and ontologies. More significant than the Wikipedia definition are the published standards/guidelines for how to construct thesauri: ISO 25964 Thesauri and interoperability with other vocabularies and ANSI/NISO Z39.19-2005 (R2010) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. While the latter does not name thesauri in its title (although it did in an earlier version), it is essentially about thesauri and defines a thesaurus in section 4.1 Definitions: “A controlled vocabulary arranged in a known order and structured so that the various relationships among terms are displayed clearly and identified by standardized relationship indicators.”

So, a thesaurus is a kind of controlled vocabulary or a kind of knowledge organization system which is quite structured and has certain standard features: terms that are noun phrases, hierarchical relationships between terms, associative (related, but not hierarchical) relationships between terms, “synonyms” or variants, which are called nonpreferred terms, and scope notes on terms. Other metadata on terms is possible, and variations of hierarchical and associative relationships may also be possible.
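The standard features listed above can be sketched as a small data structure. This is only an illustration under my own assumptions: the class names, method names, and sample terms are hypothetical, and the comments use the conventional relationship indicators (BT, NT, RT, UF, SN) for orientation, not as a rendering of any standard's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    preferred_term: str
    nonpreferred_terms: set[str] = field(default_factory=set)  # UF (used for)
    broader: set[str] = field(default_factory=set)             # BT
    narrower: set[str] = field(default_factory=set)            # NT
    related: set[str] = field(default_factory=set)             # RT
    scope_note: str = ""                                       # SN


class Thesaurus:
    def __init__(self) -> None:
        self.concepts: dict[str, Concept] = {}

    def add(self, term, nonpreferred=(), scope_note=""):
        concept = Concept(term, set(nonpreferred), scope_note=scope_note)
        self.concepts[term] = concept
        return concept

    def add_hierarchy(self, broader, narrower):
        # Hierarchical relationships are reciprocal: BT on one side, NT on the other.
        self.concepts[broader].narrower.add(narrower)
        self.concepts[narrower].broader.add(broader)

    def add_related(self, a, b):
        # Associative relationships are symmetric: RT in both directions.
        self.concepts[a].related.add(b)
        self.concepts[b].related.add(a)

    def resolve(self, entry_term):
        """Map a preferred or nonpreferred term to its preferred term, or None."""
        needle = entry_term.lower()
        for concept in self.concepts.values():
            variants = {t.lower() for t in concept.nonpreferred_terms}
            if needle == concept.preferred_term.lower() or needle in variants:
                return concept.preferred_term
        return None


# Hypothetical sample data.
thesaurus = Thesaurus()
thesaurus.add("Vehicles")
thesaurus.add("Automobiles", nonpreferred={"Cars", "Motor cars"},
              scope_note="Passenger road vehicles; excludes trucks.")
thesaurus.add("Automotive engineers")
thesaurus.add_hierarchy("Vehicles", "Automobiles")
thesaurus.add_related("Automobiles", "Automotive engineers")
```

With this sample data, thesaurus.resolve("cars") leads to the preferred term “Automobiles,” which in turn has a broader term, a related term, and a scope note. Unlike a synonym ring, one term per concept is designated as preferred.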

Thesaurus usefulness

On the continuum chart of controlled vocabulary (knowledge organization system) types, a thesaurus falls between a taxonomy and an ontology in its level of complexity and support for semantics.

[Figure: Chart of controlled vocabulary types]

Since both taxonomies and ontologies are recognized as useful, it would seem illogical that something that falls in between should not be considered at least as useful. A thesaurus has the benefits of supporting more semantics than a taxonomy while not being as complex as an ontology.

Even if most relationships are hierarchical, there may be times when creating an associative relationship between related subjects seems logical and would be helpful to users, such as relating a process and its agent, an action and a property, a cause and its effect, an object and its origins, a discipline and its practitioners, etc. Or the related concepts might not be subjects. For example, an ecommerce site may want to recommend “related” product categories, or content on activities could relate activities to products. In an expert people finder, person names can be related to subject areas of expertise. If the scope of “related” types is kept limited, then the generic associative relationship (“related term”) may suffice without getting to the level of complexity of an ontology, where there are multiple types of defined semantic relationships.

The added associative relationships and comprehensive inclusion of synonyms/nonpreferred terms also support better (more comprehensive) tagging, whether manual or automated, by providing suggestions to the indexers or providing context for the auto-classification tool.

Finally, the overall structure of a thesaurus is more flexible than that of a taxonomy. A taxonomy groups concepts into categories with a limited number of top concepts (or “top terms”). A concept which has no broader and no narrower concept relationships, sometimes called an “orphan,” is considered an error in a taxonomy. In a thesaurus, on the other hand, where an over-arching hierarchical structure is not required (although one may exist) and associative relationships are included, it is OK to have a concept with no broader and no narrower relationships, as long as it has at least one associative relationship. Thus, the taxonomist does not always have to force a new concept into an existing hierarchy where it might not be an ideal fit.
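The difference in how the two vocabulary types treat orphans can be expressed as a simple validation check. The sketch below assumes a hypothetical dictionary layout (term mapped to lists of broader, narrower, and related terms) and a made-up function name, so it is an illustration of the rule and not how any particular tool implements it.

```python
def find_orphans(concepts: dict[str, dict], allow_associative_only: bool) -> list[str]:
    """Return terms that violate the orphan rule.

    allow_associative_only=False applies the taxonomy rule: every concept
    needs a broader or narrower relationship.
    allow_associative_only=True applies the thesaurus rule: a concept with
    no hierarchy is acceptable if it has at least one related term.
    """
    orphans = []
    for term, relations in concepts.items():
        has_hierarchy = bool(relations.get("broader") or relations.get("narrower"))
        has_related = bool(relations.get("related"))
        if has_hierarchy:
            continue
        if allow_associative_only and has_related:
            continue
        orphans.append(term)
    return sorted(orphans)


# Hypothetical sample vocabulary.
concepts = {
    "Vehicles": {"narrower": ["Automobiles"]},
    "Automobiles": {"broader": ["Vehicles"], "related": ["Road safety"]},
    "Road safety": {"related": ["Automobiles"]},
    "Miscellany": {},
}
```

Checked as a taxonomy (allow_associative_only=False), both “Road safety” and “Miscellany” are flagged. Checked as a thesaurus (allow_associative_only=True), only “Miscellany” is flagged, because “Road safety” has an associative relationship.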

Software for thesaurus management

Software to support the development and maintenance of thesauri has also been available for some time. (Taxobank has a historic list, not updated since 2013.) There actually is no such thing as “taxonomy” management software, because the software used to create taxonomies is really “thesaurus” management software, and the added thesaurus features, such as associative relationships, are just not utilized when creating a simple taxonomy.

As taxonomies have become more popular than thesauri, the software vendors have reflected that by making a hierarchical display (instead of an alphabetical one) the default, and by marketing their solutions for taxonomies and ontologies while de-emphasizing or omitting mention of thesauri. For example, the basic core module of the PoolParty Semantic Suite is appropriately named Thesaurus Server, since you can easily create thesauri with it, but the default hierarchical display suggests use for taxonomies, and the website's product page says it’s for “Enterprise Taxonomy and Ontology Management.”

Thesauri today

Thesaurus design principles are applicable to both thesauri and taxonomies. Therefore, thesauri continue to be taught in library science and information science degree programs, including courses on information architecture. The book Information Architecture for the Web and Beyond (Rosenfeld, Morville, and Arango), aka the polar bear book due to its cover design, even in its 4th edition of 2015, devotes 20 pages, nearly half the chapter “Thesauri, Controlled Vocabularies and Metadata,” to thesauri.

The main impediment to thesauri is that the most common implementations these days, variations of off-the-shelf content management systems (CMS), usually do not support features of thesauri. Associative relationships are rarely supported. Synonyms/nonpreferred terms may be only partially supported (such as in the tagging view but not in retrieval). Thus, we tend to see thesauri implemented only in custom (home-grown) end-user systems, such as those of publishers of information retrieval databases.

Information retrieval thesauri have been around for a long time, and perhaps that is also part of the problem in their acceptance today in business and industry. People may consider thesauri as some kind of legacy knowledge organization system that was more predominant when we only had printed systems, not digital systems. It’s true that thesauri are designed to be useful in print, but their design is also adaptable and relevant to digital implementations. They can also form part of a larger system of interlinked controlled vocabularies.

This brings us to the next topic, ontologies, which can link to thesauri. Next month’s blog post will address the different meanings of ontology.

 


