Sunday, March 28, 2021

Industry Uses for Taxonomies

It’s always interesting to hear about new and different uses of taxonomies. For example, I recently learned that a company would like to use a taxonomy in a way I had not heard of before: to help its RFP team find content more efficiently when putting together responses to the RFPs (requests for proposals) of prospective clients. Someone inquiring about my course recently wrote: “Typically, I work with creative asset manager... However, I’m fascinated by the use and application of Taxonomies in broader industries.” So, today I will address the use of taxonomies in various industries.

In a broad sense, taxonomies are most often used to support information management and retrieval. With the features of controlled concepts, synonyms, and structure, taxonomies provide better results than full-text search, and they also provide guidance to users.

Taxonomies may be used to support both users within an organization and those who are external, sometimes with the same taxonomy and the same content, but often with different taxonomies and different content, and occasionally with different taxonomies for the same (or subset of the same) content.

The information management use of taxonomies (content lifecycle management, content reuse, data analysis, etc.) is an internal use of taxonomies by employees of an organization with respect to internal content. The information retrieval use of taxonomies pertains to both internal and external users of taxonomies. Internal users who make use of taxonomies for information retrieval do so to better find information within the organization to do their job. External users of taxonomies use them to help find published information or content, products or services, jobs, activities, etc. within a site or online service.

Taxonomies for external users of content and information

Organizations or agencies where information publishing or sharing for external audiences is core to their business or mission have long recognized the importance of thesauri or taxonomies. This began with periodical/journal articles and has since spread to include all kinds of content and data. Examples include the following:

  • News or other online report publishers, library/research database vendors, or other subscription content services.
  • Organizations where public or member information is significant, including government agencies with content-rich websites, international organizations, non-profit organizations, and professional and trade associations.
  • E-commerce and marketplaces, including B2C, B2B, and C2C (marketplaces), and product information sharing between manufacturers, distributors, wholesalers, and retailers.
  • Educational publishers or education technology companies, which provide digital learning.
  • Services selling or distributing digital media, such as movies, music, ebooks, images, animations, video clips, graphics, etc.
  • Job board websites or sites that match consultants/contractors/freelancers to projects.
In addition, with the growth of content marketing, which is the use of web-based content by companies and organizations to attract visitors to their sites, the content of websites has grown immensely, with more pages, blog posts, posted media files, documents to download, etc. To help website visitors find information and content on their sites, taxonomies have now become important for organizations in all kinds of businesses or services.

In all of these cases of publishing content for external users of taxonomies, there continues to be an internal use of taxonomies as well, for managing the content, including content reuse, rights management, retention/lifecycle management, quality management, new content integration, multilingual management, etc.

Taxonomies for internal users of content and information

Over the past decades, the amount of internal digital information in organizations, and the number of applications that contain it, have grown exponentially. Taxonomies can aid in the management and retrieval of information and content items in content management systems, document management systems, digital asset management systems, collaboration spaces, intranets, etc. This applies to all industries, although in some sectors the management of internal information or assets is especially critical.

  • Media, entertainment, advertising, and marketing, as industries or functions that deal with large volumes of digital assets or media (images, videos, audio files) that need to be managed.
  • Highly regulated industries, such as pharmaceuticals, banking and finance, energy, and telecommunications, which need to manage documents, information, and data better to help with regulatory compliance.
  • Manufacturing, technology, engineering, R&D, and related industries which have large volumes of technical documentation, manuals, policies, and procedures that have become digitized.
  • Organizations such as professional service or research-focused firms that have critical content management tasks, such as internally publishing reports, proposals, or presentations which involve a degree of content reuse.

In addition, large companies in any industry now have so much content that taxonomies have become valuable in helping their employees find the information they need quickly, whether on an intranet or other enterprise content management system. Having content in multiple systems could lead to multiple taxonomies, so a centrally managed taxonomy that is kept in sync with multiple types of content systems is a recommended strategy.
 

Saturday, February 6, 2021

Who Should Create Taxonomies?


More and more organizations of various types and sizes are recognizing the benefits of information/content taxonomies, which make it easier to find information quickly and accurately, to receive recommendations of information, and to formulate complex queries of data.

In many cases, however, where taxonomies are not central to the product/service of a company (such as e-commerce retail or information publishing) or function of an organization (such as research), the task of creating and maintaining a taxonomy is not big enough to justify hiring a professional taxonomist. Creating a taxonomy is a temporary project, and then updating it is often a part-time task, which could even be shared among several people.

Taxonomy creation should not be underestimated, however. It may appear easy to create a taxonomy, but it is not easy to create a good taxonomy. If a taxonomy is not well-designed, it cannot serve its purpose well. You may as well rely on a search engine alone rather than try to use a bad taxonomy.

Not creating the taxonomy yourself

Some approaches to developing a taxonomy without a dedicated taxonomist include using existing taxonomies, creating a taxonomy by term extraction, or hiring a consultant.

Reusing existing taxonomies

To serve its purpose best, a taxonomy should be custom-created to serve its content, users, and system. An existing external taxonomy is usually not adequate. It may be suitable for a limited scope, such as a geographic taxonomy, an industrial classification, a list of organization names, or a list of languages. More information about licensing taxonomies is in my blog post “Taxonomy Licensing.” Even when using an existing taxonomy, there is still work to edit and adapt the external taxonomy, which requires taxonomy expertise.

Creating a taxonomy by automatically extracting terms from content

Software, including some taxonomy management software, such as PoolParty, can extract candidate taxonomy terms from a body of content (documents or web pages) that is intended to be tagged with the taxonomy. This is an effective method to enhance a taxonomy, to add missing concepts and alternative labels (synonyms). However, this is not a practical way to start creating a taxonomy, which requires a logical structure. Taxonomy-creation expertise is still needed.

Hiring a taxonomy consultant or temporary contractor

This is a good idea. A consultant or contractor will provide a combination of guidance and actual taxonomy building, although a consultant tends to provide more guidance, and a contractor tends to do more taxonomy building. A contractor requires a certain time commitment, such as 3-6 months full-time, whereas there is lots of flexibility in engaging a consultant. After the consultant or contractor is finished, though, someone needs to maintain and update the taxonomy to the same specifications.

When a taxonomy is not very large, it may be more efficient and cost-effective to create it from scratch oneself without reusing an existing taxonomy or relying on a consultant or contractor, although getting a consultant to at least review the taxonomy might still be a good idea.

Taxonomy management as part of a role

What is much more common for an organization than to have a taxonomist is to have one or more positions where taxonomy management is part of the job description. Searches on web job boards return hundreds of job openings with “taxonomy” in the job description, whereas only a small fraction of them have taxonomy or taxonomist in the job title. Common job titles include: Content Designer, Content Manager, Content Strategist, Data Architect, Data Catalog…, Data Strategist, Digital Asset Manager, Digital Content…, Digital Librarian, Information Architect, Information Scientist, Knowledge Engineer, Knowledge Management…, Metadata Specialist, Product Manager, SharePoint Developer, Solutions Architect, etc. There are also positions more centered in marketing and in web development.

Often, though, the need for a taxonomy emerges at a time when a new position is not created, so an existing employee must take on the task. This common scenario is behind the title of my book and this blog, The Accidental Taxonomist. Those that take on taxonomy work may come from a wide variety of roles or departments including marketing for a website taxonomy, IT or human resources for an intranet taxonomy, IT for content/document management systems administration, and technical documentation/publishing. Knowledge management and metadata/data management are also good candidate roles for taxonomy management.

In situations where the taxonomy is used to manage and retrieve content in specialized subject areas, subject matter experts may also be involved in taxonomy creation, at least for the parts of the taxonomy that correspond to their expertise. 

Not having sufficient taxonomy skills

In either case, whether taxonomy management was originally part of the job description or not, people who assume partial taxonomy responsibilities often do not have the skills. This is usually the case when a taxonomy project first arises. Even when someone is newly hired, successful applicants may not meet all job description duties, such as taxonomy experience, especially if the skill is only a minor part of the job.

Related job skills may make it easier to create taxonomies, but without experience or training, one cannot simply create a good taxonomy. Related skills tend to be in the area of library/information science, indexing, information architecture, digital asset management, content management, records management, and possibly product management.

Librarians tend to have training in cataloging and classification, sometimes in thesaurus creation, and less likely in taxonomy creation. Taxonomies resemble classification schemes, but function differently, so it would be a mistake to model a taxonomy as a classification scheme. See my blog post "Classification Systems vs. Taxonomies." I had taught a continuing education course on taxonomies through a graduate school of library and information science for years, since MLIS graduates had not learned taxonomies as part of their degree program.

Information architects know how to organize information in a web user interface well, so they may have a good sense of how to structure a taxonomy at a high level. However, there are details and nuances of a large taxonomy, such as the development of synonyms/alternative labels, with which they may not have experience. Also, a taxonomy should not be confused with a navigation scheme, as explained in my blog post "Navigation Schemes vs. Taxonomies."

Digital asset managers, content managers, and product managers know about the metadata management for their content, and taxonomies usually fit into the larger metadata scheme. However, their experience with taxonomy creation is usually limited to a subject area and the context and constraints of the system in which they are working. So, the very basic taxonomy skills that they develop may not be transferable to another system or another subject domain.

Subject matter or domain experts, including product managers, often play an important role in taxonomy development. From my experience in working with subject matter experts, though, they often tend to design more of a classification scheme for their domain and create taxonomy concepts that are too granular to be practical for end-user search and retrieval.


Where to learn taxonomy skills

There are many continuing education options to learn taxonomy creation, some through library/information science schools, some through professional associations, and some through commercial conference and training programs. I have been providing taxonomy training since 2007, through online courses, conference workshops, and corporate workshops, both in-person and virtual. I have been impressed with the diversity of backgrounds, job roles, organization types, and global locations of the workshop participants over the years.

The current situation of all-virtual conferences means that I am teaching more virtual workshops than usual this spring, and they are accessible to more people. Following is a list of upcoming live virtual taxonomy workshops, all with interactive participation, and thus with limited enrollment. They vary slightly in their focus and scheduling. All times indicated are Eastern.

"Taxo Update: Latest in Designing & Maintaining Taxonomies"
Monday, March 22, 12:00 - 4:00 pm ET (4 hours)
A preconference workshop of Computers in Libraries with separate registration (no need to register for the entire conference)

"Taxonomy and Metadata Design"
Monday-Tuesday, March 29-30, 10:00am - 2:00pm ET each day (8 hours over two days)
Through Technology Transfer, Rome (with the availability of simultaneous interpretation and slides translated into Italian).

"Connecting Users to Content: An Introduction to Taxonomy Design & Creation"
Wednesday-Friday, April 21-23, 2:00-4:00 pm EDT each day (6 hours over three days)
A preconference workshop of the IAConference with separate registration (no need to register for the entire conference)

Wednesday, January 20, 2021

Hierarchies in Taxonomies, Thesauri, Ontologies, and Beyond

Hierarchies are a defining feature of taxonomies, and they are also characteristic of other controlled vocabularies or knowledge organization systems, such as classification schemes, thesauri, and ontologies. The problem is that the definitions and rules for hierarchies vary depending on the kind of knowledge organization system, so you cannot assume that a hierarchy in one system converts to a hierarchy in another system.

“Hierarchy” can have various types and uses. Not all kinds of hierarchies are reflected even in taxonomies, which tend to be quite flexible. The rules are stricter when it comes to thesauri. Finally, in ontologies, there is only one kind of hierarchy.

The hierarchies permitted in thesauri are specified in the ANSI/NISO Z39.19 and ISO 25964-1 standards, as a reciprocal inverse relationship pair of Broader term (BT) / Narrower term (NT). There are three kinds specified in these standards:

  • Generic-specific – which refers to “is a” or “are a kind of”
       Example:
       Basketball is a kind of sport.
       Basketball BT Sports; Sports NT Basketball
       Basketball has broader concept Sports; Sports has narrower concept Basketball
  • Generic-instance – which refers to “is a named entity instance of”
       Example:
       Michael Jordan is a named basketball player.
       Jordan, Michael BT Basketball players; Basketball players NT Jordan, Michael
       Jordan, Michael has broader concept Basketball players; Basketball players has narrower concept Jordan, Michael
  • Whole-part – which refers to “is in” or “is an integral part or component of”
       (not to be confused with “part” as a participant taking part in, or member of)
       Example:
       Locker rooms are in athletic facilities.
       Locker rooms BT Athletic facilities; Athletic facilities NT Locker rooms
       Locker rooms has broader concept Athletic facilities; Athletic facilities has narrower concept Locker rooms
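
The reciprocal nature of the BT/NT pair can be sketched in a few lines of Python. This is only an illustration, not how any particular thesaurus tool stores its data: the class name, method names, and relationship-kind labels are assumptions made for the example, and the terms are the ones from the list above.

```python
from collections import defaultdict

# Hypothetical labels for the three kinds of hierarchical relationship.
GENERIC = "generic-specific"
INSTANCE = "generic-instance"
PARTITIVE = "whole-part"


class Thesaurus:
    """Stores BT/NT links so that each link is always recorded with its inverse."""

    def __init__(self):
        self.broader = defaultdict(dict)   # term -> {broader term: kind}
        self.narrower = defaultdict(dict)  # term -> {narrower term: kind}

    def add_bt(self, term, broader_term, kind):
        # Recording the BT side also records the reciprocal NT side.
        self.broader[term][broader_term] = kind
        self.narrower[broader_term][term] = kind

    def statements(self, term):
        """Return the BT and NT statements for a term, sorted for stable output."""
        lines = [f"{term} BT {b}" for b in sorted(self.broader[term])]
        lines += [f"{term} NT {n}" for n in sorted(self.narrower[term])]
        return lines


thesaurus = Thesaurus()
thesaurus.add_bt("Basketball", "Sports", GENERIC)
thesaurus.add_bt("Jordan, Michael", "Basketball players", INSTANCE)
thesaurus.add_bt("Locker rooms", "Athletic facilities", PARTITIVE)

print(thesaurus.statements("Sports"))
print(thesaurus.statements("Locker rooms"))
```

Because `add_bt` always writes both directions, the two sides of a pair cannot drift apart, which is the point of treating BT/NT as a single reciprocal relationship.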

The types of hierarchies permitted in taxonomies include all of those designated for thesauri, plus a little more flexibility due to the absence of the associative relationships. In thesauri, if the relationship between a pair of concepts is better described as associative (“Related term” - RT) than hierarchical, then they cannot be hierarchically related. In a taxonomy which lacks associative relationships, in some cases a relationship that is not accepted as hierarchical in a thesaurus may be accepted as hierarchical in a taxonomy. An example is the pair of concepts Stress and Stress management. Technically, the relationship between these two concepts is associative and not hierarchical, because Stress management is not a kind of or a part of Stress. But in a taxonomy (not a thesaurus), designating Stress management as a narrower concept of Stress may be acceptable.

As for classification schemes, despite their name, they do not always conform to class-subclass (as "is a kind of") conventions. For example, in the Dewey Decimal Classification system, 910 Geography & travel comes under 900 History. But geography and travel are not kinds of/sub-categories of history. Classification schemes may have a tendency to force a hierarchy when it’s not really an accepted taxonomic hierarchy.

Despite the looser rules for hierarchies in taxonomies and classification schemes, there are also kinds of hierarchies that are not taxonomic. These include organizational chart hierarchies, hierarchies of (military) rank, family tree hierarchies, and orderings of social science concepts, such as Maslow’s hierarchy of needs or Bloom’s Taxonomy of learning objectives. The hierarchies in these cases are not broader/narrower, but rather reflect importance, influence, sequence, or some other aspect of the notion of hierarchical order. In taxonomies and thesauri, concepts in such organizational hierarchies need to be treated instead as siblings at the same level all sharing the same broader concept, such as Learning objectives as the single broader concept for all six of Bloom's learning objectives, Needs as the single broader concept for all five of Maslow's needs, Military ranks as the single broader concept for all ranks, or Job titles as the single broader concept for all job titles.

Finally, in ontologies, hierarchies may be of less significance, but they are still a feature. While relations between concepts/entities are “semantic,” with specific descriptive labels, and thus are not necessarily hierarchical, there may be hierarchical relations between classes, when designating subclasses of classes. However, the kind of hierarchical relationship that is created between ontology classes and subclasses is limited strictly to the generic-specific type, for “is a kind of.”

Conclusions

These distinctions in hierarchies have ramifications if you want to combine, import, or convert one knowledge organization system to another. When converting a thesaurus to a taxonomy, it is possible that some of the associative relationships could be accepted as hierarchical. When converting a taxonomy to a thesaurus, existing hierarchical relationships should be reviewed to see if any should be converted to associative.

Converting a taxonomy or thesaurus to an ontology would require identifying and removing whole-part hierarchical relationships (and adding new broader concept relations to the orphaned concepts) and converting generic-instance hierarchical relationships to class-individual relationships rather than class-subclass. In fact, this may involve so much effort, which cannot be automated, that the better approach to converting a taxonomy to an ontology is probably to apply a more generic ontology as a layer to the existing taxonomy/thesaurus, which some software tools, such as PoolParty, support. Extending a taxonomy into an ontology is the subject of my next conference presentation “Ontology Design by Enriching Taxonomies” at the Data-Centric Architecture Forum on February 3.
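
To make the conversion rules concrete, here is a small Python sketch of the sorting step. It rests on a large assumption: that every broader/narrower pair has already been labeled with its kind of hierarchy. Taxonomies typically record only plain broader/narrower links, so producing those labels is exactly the manual effort described above. The function name and the plan keys are hypothetical.

```python
# Each tuple is (narrower concept, broader concept, kind of hierarchy).
# The "kind" labels are assumed to have been assigned by a person beforehand.
relations = [
    ("Basketball", "Sports", "generic-specific"),
    ("Jordan, Michael", "Basketball players", "generic-instance"),
    ("Locker rooms", "Athletic facilities", "whole-part"),
]


def plan_ontology_conversion(relations):
    """Sort hierarchical pairs into what an ontology keeps, retypes, or drops."""
    plan = {"subclass_of": [], "individual_of": [], "needs_review": []}
    for narrower, broader, kind in relations:
        if kind == "generic-specific":
            # "Is a kind of" carries over as class-subclass.
            plan["subclass_of"].append((narrower, broader))
        elif kind == "generic-instance":
            # Named instances become individuals of a class, not subclasses.
            plan["individual_of"].append((narrower, broader))
        else:
            # Whole-part links are removed; the orphaned narrower concept
            # needs a new broader concept chosen by a person.
            plan["needs_review"].append((narrower, broader))
    return plan


plan = plan_ontology_conversion(relations)
for outcome, pairs in plan.items():
    print(outcome, pairs)
```

The sketch only flags whole-part pairs for review. Deciding where each orphaned concept belongs remains a judgment call.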

Saturday, December 5, 2020

Differing Definitions of Ontologies

In my last blog post I discussed the different definitions and features of thesauri. Now, I will turn to the next kind of knowledge organization system in the spectrum of complexity: ontologies.

Actually, to consider an ontology as a more (or most) complex type of controlled vocabulary or knowledge organization system, after thesauri, due to additional features, is just one perspective or definition of ontologies, which is not universally shared.

When I first learned about ontologies, coming from my taxonomist perspective, I considered ontologies as merely a more complex type of taxonomy or thesaurus, characterized by customized semantic relationships between concepts (rather than merely hierarchical or associative relationships), more expressive attributes for concepts (rather than mere scope notes), and the grouping of concepts into classes to manage the semantic relationships and attribute types. In fact, I wrote in 2008 for the first edition of my book “An ontology can be considered a type of taxonomy with even more complex relationships than in a thesaurus,” which the following graphic represents.

As my understanding has evolved, I would consider this just to be one kind of understanding or definition of ontology among others.  In other words, a controlled vocabulary that has the features of semantic relationships, classes of concepts, and attributes for concepts, can be considered a kind of ontology, but there are other definitions and understanding of ontology within the field of information/knowledge management.

While we usually refer to “controlled vocabularies” as the over-arching category for these things, it is probably better to go up a further level and call an ontology a kind of “knowledge  organization system,” rather than a kind of controlled vocabulary. Controlled vocabularies are kinds of knowledge organization systems, where the emphasis is on managed terms or concepts for the purpose of tagging or categorizing and information retrieval. Ontologies, by themselves, are not necessarily for information retrieval, at least not directly. And this is one of the points of differing definitions of ontologies.

Differing definitions and perspectives

There are differing definitions of the word ontology: (1) branch of philosophy that studies existence, being, becoming, and reality (Wikipedia: Ontology), and (2) a representation, formal naming, and definition of categories, entities, properties, and relations within a domain (Wikipedia: Ontology (information science)). Of course, we are interested in the second definition, although there are some connections between the two. 

The second definition, however, is already multidisciplinary, as it is a concept shared in both information science and computer science. Information scientists (including librarians, taxonomists, and knowledge managers) and computer scientists do not have different definitions of ontologies, but rather different approaches to and perspectives of ontologies and different purposes for the ontologies they create.  For computer scientists, modeling data and information helps them design a computer program to perform desired functions. For information scientists, modeling data and information makes it easier to retrieve information with complex queries. Information scientists consider an ontology as a kind of knowledge organization system, whereas computer scientists tend to consider an ontology as a form of knowledge representation.

Yet even among information scientists, who consider ontologies as knowledge organization systems and have the same objectives in developing ontologies, there are different understandings of what exactly constitutes an ontology and how it relates to other knowledge organization systems, such as taxonomies. This is due to (1) different emphasis on various ontology components, (2) the question of adherence to ontology standards, and (3) the way different ontology software tools model ontologies and their relations to taxonomies differently.

Differing understandings of ontology components

There is a shared understanding that ontologies are composed of things, their properties/attributes, and their relationships.

Ontology example with components: classes, relations, and attributes
However, there are differences in the understanding of the two kinds of “things”: classes and individuals. Classes are categories or groups of things with shared characteristics, whereas individuals are specific instances of things. This seems obvious, but if you approach ontology design from the perspective of taxonomy design it can become less certain. Is an individual the most specific concept (also called “leaf node”) in a hierarchy, or is an individual a named entity/proper noun? The definition of components of ontologies does not answer this question, because ontology structures are meant to model data, not to organize taxonomy concepts that could be either generic (common nouns) or named entities (proper nouns). Drawing the line between classes and individuals can be challenging, but whether this matters may depend on what tool you are using.

Furthermore, ontologies may have other components, such as axioms, rules, restrictions, events, and function terms, but ontologies as knowledge organization systems rarely have most of these.

Differing ontology standards or languages

In 2004 the World Wide Web Consortium (W3C) published the Web Ontology Language (OWL) specification, which is based on the Resource Description Framework (RDF), as “a Semantic Web language designed to represent rich and complex knowledge about things, groups of things, and relations between things,” which has become widely adopted. Now it is common to think that ontologies must follow OWL guidelines. But (information science) ontologies have existed before OWL, and an ontology does not have to follow OWL to be called an ontology. There are other ontology languages besides OWL, but they are not as common. To share and reuse ontologies, it is recommended to follow the OWL standard.

Differing ontology modeling software

While one could design the high-level model of an ontology in a mind-mapping tool, there would be no enforcement of standards or best practices (preventing duplications or incomplete data, etc.), and it’s difficult to scale, so dedicated ontology modeling software is recommended. However, ontology modeling/editing software does not model ontologies all in the same way.

The main difference is probably between stand-alone ontology software (such as Protégé or TopBraid Composer) and software that combines ontology with taxonomy/thesaurus development and editing (such as PoolParty, Semaphore, or Graphite). Stand-alone ontology editing software supports creating a detailed ontology as a single model, thus including classes, multiple levels of subclasses, and individuals (instance concepts). In integrated software that combines taxonomy/thesaurus development with ontology development, the taxonomy or thesaurus (or multiple controlled vocabularies) is created in one space with one set of software features, and the ontology is created in another space with a different set of features. The ontology (or even just parts of it) is then applied to the taxonomy, so that concepts in the taxonomy inherit the attribute types and relationships of their associated class, and the taxonomy concepts are like individuals in the ontology. The ontology can be considered a semantic layer in the model, as the following graphic illustrates.

These two different approaches to ontology modeling thus result in different definitions of an ontology. An ontology is likely to be considered as a more complex type of knowledge organization system by users of stand-alone ontology software, whereas an ontology is likely to be considered an expressive semantic layer applied to one or more taxonomies by users of integrated taxonomy/ontology software.
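
The "semantic layer" idea from the integrated approach can be sketched as follows. This is a simplified illustration under stated assumptions, not the data model of PoolParty or any other product: the class names, attribute types, relation types, and concepts are all hypothetical, and each concept is assigned to exactly one class.

```python
# Hypothetical ontology layer: each class lists the attribute types and
# relationship types that its member concepts are allowed to have.
ONTOLOGY_CLASSES = {
    "Person": {"attributes": ["Birth date", "Birth place"], "relations": ["Creator of"]},
    "Named work": {"attributes": ["Publication date"], "relations": ["Created by"]},
}

# Hypothetical taxonomy concepts, each assigned to one class of the ontology.
CONCEPT_CLASS = {
    "Austen, Jane": "Person",
    "Pride and Prejudice": "Named work",
}


def permitted_properties(concept):
    """Look up what a taxonomy concept inherits from its associated class."""
    class_name = CONCEPT_CLASS[concept]
    inherited = ONTOLOGY_CLASSES[class_name]
    return {"class": class_name, **inherited}


print(permitted_properties("Austen, Jane"))
print(permitted_properties("Pride and Prejudice"))
```

The taxonomy concepts themselves hold no schema. They behave like individuals, and what they may carry is defined once, at the class level.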

Ontology lite or ontology-like

When I was still considering ontologies more akin to thesauri with semantic relationships, and I expressed such views in a discussion forum, someone (whom I don’t remember) referred to this kind of ontology as “ontology lite,” since it has features of an ontology, but does not fully follow an ontology model and standards. This is not necessarily a bad thing. Controlled vocabularies and knowledge organization systems can be considered along a continuum, and you should build what works for your situation.

Another kind of ontology-like structure is when you start linking multiple controlled vocabularies together. My initial experience with working on commercially implemented ontologies had been with such ontology-like systems, which were not actually called ontologies, at a former employer, Gale. There we had controlled vocabularies (also called object classes) for subjects, persons, places, events, products, companies/organizations, named works, etc., many of which had customized reciprocal relationship pairs between them (such as the relationship pair Creator/Creatby, between person names who were authors, and named works) and many customized term attributes (such as Birthdate, Death date, Birth city/state/country, Death city/state/country for persons).

I also heard this approach recently from a speaker, Ahren Lehnert, at the Taxonomy Boot Camp conference, who described the linking of controlled vocabularies with related match (not equivalent match) relationships as “trending toward” creating an ontology.

 

Sunday, November 22, 2020

What is a Thesaurus and What is it Good For

It is somewhat ironic that in the domain of controlled vocabularies and knowledge organization systems there continue to exist differing meanings for “controlled vocabulary,” “taxonomy,” “thesaurus,” “ontology,” and “knowledge graph.” Hopefully, I have provided some clarification regarding what a taxonomy is and is not in my previous posts on taxonomy vs. classification, taxonomy vs. navigation, and when a taxonomy should not be hierarchical. Let’s turn now to thesauri.

Different meanings of thesaurus

I recently attended a webinar on taxonomies, ontologies, and knowledge graphs, in which a thesaurus was described as a set of synonyms for each identified concept in a list. This is not the right definition for this context. A set of synonyms for each of a list of concepts is what we taxonomists call a “synonym ring,” and what administrators of enterprise search engines would call a “search thesaurus.” The use of the word “thesaurus” in this case refers to the dictionary-type thesaurus (the default Thesaurus entry in Wikipedia), such as Roget’s Thesaurus, where synonyms are presented for each word. Synonyms are included to support search, by matching potential words and phrases entered by users into the search box with the words and phrases that likely occur in the text of content, so that content is not missed due to the searcher using a different synonym.

The “search thesaurus” (synonym ring) differs from the synonym-dictionary thesaurus, however, in several ways, due to their different uses:
  • A search thesaurus includes phrases, not just single words as in a dictionary thesaurus.
  • A search thesaurus comprises concepts that are nouns, verbal nouns, or noun phrases, not just any part of speech as a dictionary may include.
  • The “synonyms” in a search thesaurus are appropriately equivalent terms that can be used interchangeably in all cases for the content repository, not synonyms that may be used in only some cases, as the dictionary suggests.
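
How a synonym ring supports search can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the rings, the function names expand_query and search, and the sample documents are all hypothetical, and real search engines match on tokens rather than on the naive substring test used here.

```python
# Hypothetical synonym rings: every term in a ring is treated as equivalent,
# and rings may contain multi-word phrases, not just single words.
SYNONYM_RINGS = [
    {"request for proposal", "rfp", "tender"},
    {"automobile", "car", "motor vehicle"},
]


def expand_query(query: str) -> set[str]:
    """Return the query plus every term that shares a synonym ring with it."""
    normalized = query.strip().lower()
    expanded = {normalized}
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            expanded |= ring
    return expanded


def search(query: str, documents: list[str]) -> list[str]:
    """Return documents containing any expanded form of the query.

    Uses a simple case-insensitive substring match, for illustration only.
    """
    terms = expand_query(query)
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


docs = ["Used car prices rose", "Automobile sales fell", "Bicycle lanes"]
print(search("Car", docs))
```

Because no term in a ring is preferred over the others, a search on any member retrieves content that uses any other member.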

However, in the context of taxonomies/ontologies (not the context of search administration), the designation thesaurus has a significantly different meaning. Also referred to as an information thesaurus or information-retrieval thesaurus (to distinguish it from the synonym-dictionary type), it has a separate entry in Wikipedia, Thesaurus (Information Retrieval), which defines it as “a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects.” This is the meaning that relates to taxonomies and ontologies. More significant than the Wikipedia definition are the published standards/guidelines for how to construct thesauri: ISO 25964 Thesauri and interoperability with other vocabularies and ANSI/NISO Z39.19-2005 (R2010) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. While the latter does not name thesauri in its title (although it did in an earlier version), it is essentially about thesauri and defines a thesaurus, in section 4.1 Definitions, as: “A controlled vocabulary arranged in a known order and structured so that the various relationships among terms are displayed clearly and identified by standardized relationship indicators.”

So, a thesaurus is a kind of controlled vocabulary, or a kind of knowledge organization system, which is quite structured and has certain standard features: terms that are noun phrases, hierarchical relationships between terms, associative (related, but not hierarchical) relationships between terms, “synonyms” or variants, which are called nonpreferred terms, and scope notes on terms. Other metadata on terms is possible, and variations of hierarchical and associative relationships may also be possible.
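
The standard features listed above can be pictured as a simple data structure. The following Python sketch is an illustration under assumptions, not an implementation of either standard: the class names Concept and Thesaurus, the method names, and the sample terms are hypothetical. The BT/NT/RT labels in the comments are the conventional thesaurus abbreviations for broader term, narrower term, and related term.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    preferred_term: str
    nonpreferred_terms: set[str] = field(default_factory=set)  # "synonyms"/variants
    broader: set[str] = field(default_factory=set)   # BT (hierarchical)
    narrower: set[str] = field(default_factory=set)  # NT (hierarchical)
    related: set[str] = field(default_factory=set)   # RT (associative)
    scope_note: str = ""


class Thesaurus:
    """Illustrative in-memory thesaurus keyed by preferred term."""

    def __init__(self) -> None:
        self.concepts: dict[str, Concept] = {}

    def add(self, term, nonpreferred=(), scope_note=""):
        concept = Concept(term, set(nonpreferred), scope_note=scope_note)
        self.concepts[term] = concept
        return concept

    def add_hierarchy(self, broader, narrower):
        # Hierarchical relationships are reciprocal: BT on one side, NT on the other.
        self.concepts[broader].narrower.add(narrower)
        self.concepts[narrower].broader.add(broader)

    def add_related(self, a, b):
        # The generic associative relationship is symmetric.
        self.concepts[a].related.add(b)
        self.concepts[b].related.add(a)

    def resolve(self, entry):
        """Map a preferred or nonpreferred entry term to its preferred term."""
        wanted = entry.lower()
        for concept in self.concepts.values():
            names = {concept.preferred_term, *concept.nonpreferred_terms}
            if wanted in {name.lower() for name in names}:
                return concept.preferred_term
        return None


# Sample terms, chosen only for illustration.
t = Thesaurus()
t.add("Beverages")
t.add("Coffee", nonpreferred=["Java"], scope_note="The drink, not the plant.")
t.add("Coffee makers")
t.add_hierarchy("Beverages", "Coffee")
t.add_related("Coffee", "Coffee makers")
print(t.resolve("java"))
```

Unlike a synonym ring, each concept here has one preferred term, and the nonpreferred terms only point to it.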

Thesaurus usefulness

On the continuum chart of controlled vocabulary (knowledge organization system) types, a thesaurus falls between a taxonomy and an ontology in its level of complexity and support for semantics.


Controlled vocabulary types

Since both taxonomies and ontologies are recognized as useful, it would seem illogical that something in between should not be considered at least as useful. A thesaurus has the benefit of supporting more semantics than a taxonomy while not being as complex as an ontology.

Even if most relationships are hierarchical, there may be times when creating an associative relationship between related subjects seems logical and would be helpful to users, such as relating a process and its agent, an action and a property, a cause and its effect, an object and its origins, a discipline and its practitioner, etc. Or the related concepts might not be subjects. For example, an ecommerce site may want to recommend “related” product categories, or content on activities could relate activities to products. In an expert people finder, person names can be related to subject areas of expertise. If the scope of “related” types is kept limited, then the generic associative relationship (“related term”) may suffice without getting to the level of complexity of an ontology, where there are multiple types of defined semantic relationships.

The added associative relationships and comprehensive inclusion of synonyms/nonpreferred terms also support better (more comprehensive) tagging, whether manual or automated, by providing suggestions to the indexers or providing context for the auto-classification tool.

Finally, the overall structure of a thesaurus is more flexible than that of a taxonomy. A taxonomy groups concepts into categories with a limited number of top concepts (or “top terms”). A concept which has no broader and no narrower concept relationships, sometimes called an “orphan,” is considered an error in a taxonomy. In a thesaurus, on the other hand, where an over-arching hierarchical structure is not required (although one may exist) and associative relationships are included, it is OK to have a concept with no broader and no narrower relationships, as long as it has at least an associative relationship. Thus, the taxonomist does not always have to force a new concept into an existing hierarchy where it might not fit well.
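
The different treatment of orphans can be expressed as a small validation check. The Python sketch below is a hypothetical illustration: the function name find_orphans, the allow_associative flag, and the sample concepts are assumptions, not features of any particular thesaurus management tool.

```python
def find_orphans(concepts, allow_associative):
    """Return the terms that should be flagged as orphans, sorted by name.

    concepts: dict mapping each term to a dict with 'broader', 'narrower',
    and 'related' sets.
    allow_associative: False applies the taxonomy rule (any concept lacking
    hierarchical relationships is flagged); True applies the thesaurus rule
    (an associative relationship is enough to keep a concept).
    """
    orphans = []
    for term, rels in concepts.items():
        has_hierarchy = bool(rels["broader"] or rels["narrower"])
        has_related = bool(rels["related"])
        if has_hierarchy:
            continue
        if allow_associative and has_related:
            continue
        orphans.append(term)
    return sorted(orphans)


# Sample concepts, chosen only for illustration.
concepts = {
    "Beverages": {"broader": set(), "narrower": {"Coffee"}, "related": set()},
    "Coffee": {"broader": {"Beverages"}, "narrower": set(), "related": {"Coffee makers"}},
    "Coffee makers": {"broader": set(), "narrower": set(), "related": {"Coffee"}},
    "Sustainability": {"broader": set(), "narrower": set(), "related": set()},
}

print(find_orphans(concepts, allow_associative=False))  # taxonomy rule
print(find_orphans(concepts, allow_associative=True))   # thesaurus rule
```

In this sample, “Coffee makers” is flagged under the taxonomy rule but accepted under the thesaurus rule, while “Sustainability,” which has no relationships at all, is flagged under both.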

Software for thesaurus management

Software to support the development and maintenance of thesauri has also been available for some time. (Taxobank has a historic list, not updated since 2013.) There actually is no such thing as “taxonomy” management software, because the software used to create taxonomies is really “thesaurus” management software, and the added thesaurus features, such as associative relationships, are just not utilized when creating a simple taxonomy.

As taxonomies have become more popular than thesauri, the software vendors have reflected that by making a hierarchical display (instead of alphabetical) the default, and by marketing their solutions for taxonomies and ontologies while de-emphasizing or omitting mention of thesauri. For example, the basic core module of the PoolParty Semantic Suite is appropriately named Thesaurus Server, since you can easily create thesauri with it, but the default hierarchical display suggests its use for taxonomies, and the website’s product page says it’s for “Enterprise Taxonomy and Ontology Management.”

Thesauri today

Thesaurus design principles are applicable to both thesauri and taxonomies. Therefore, thesauri continue to be taught in library science and information science degree programs, including courses on information architecture. The book Information Architecture for the Web and Beyond (Rosenfeld, Morville, and Arango), aka the polar bear book due to its cover design, even in its 4th edition of 2015 devotes 20 pages, nearly half the chapter “Thesauri, Controlled Vocabularies and Metadata,” to thesauri.

The main impediment to thesauri is that the most common implementations these days, variations of off-the-shelf content management systems (CMS), usually do not support features of thesauri. Associative relationships are rarely supported. Synonyms/nonpreferred terms may be only partially supported (such as in the tagging view but not in retrieval). Thus, we tend to see thesauri implemented only in custom (home-grown) end-user systems, such as those of publishers of information retrieval databases.

Information retrieval thesauri have been around for a long time, and perhaps that is also part of the problem in their acceptance today in business and industry. People may consider thesauri as some kind of legacy knowledge organization system that was more predominant when we only had printed systems, not digital systems. It’s true that thesauri are designed to be useful in print, but their design is also adaptable and relevant to digital implementations. They can also form part of a larger system of interlinked controlled vocabularies.

This brings us to the next topic, ontologies, which can link to thesauri. Next month’s blog post will address the different meanings of ontology.