The Accidental Taxonomist

Sunday, April 30, 2023

Taxonomies for Content Components

The primary purpose of taxonomies is to support consistent topical tagging (indexing) of content and full and accurate content retrieval based on the tagged taxonomy concepts that the end-user selects. The unit of content that is tagged makes a difference in the retrieval results and user experience. Users want to find specific content, such as a paragraph, a captioned image, or a timestamped section within an audio or video file. This is not always possible. The traditional method of tagging is to tag the entire file, document, or web page, even if the specific topic with the desired information is only part of the larger file, such as a few sentences within a web page or document of multiple paragraphs. The user then spends time (or wastes time) trying to find the desired information in the larger file.

Content components

Fortunately, there are methods to tag and retrieve content at smaller units, such as a text section identified with a heading, within a longer document. These methods depend on having “structured” content, where sections are marked off using a markup language, most commonly Extensible Markup Language (XML). As XML is rather generic, there have emerged standards specifically for XML-based component-based content management, including DITA (Darwin Information Typing Architecture).
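As a rough illustration of tagging and retrieval at the component level, the sketch below parses a small XML document and indexes each section by its subject tag. The element and attribute names (section, subject, id) are invented for this example and are not DITA markup; a real CCMS would manage this differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical structured document: element and attribute names are
# illustrative only and are not taken from DITA or any other standard.
DOC = """
<document id="user-guide">
  <section id="s1" subject="Installation">
    <title>Installing the software</title>
  </section>
  <section id="s2" subject="Data protection">
    <title>Protecting customer data</title>
  </section>
  <section id="s3" subject="Data protection">
    <title>Encrypting backups</title>
  </section>
</document>
"""

def index_sections(xml_text):
    """Map each subject tag to the IDs of the sections tagged with it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for section in root.iter("section"):
        index[section.get("subject")].append(section.get("id"))
    return dict(index)

section_index = index_sections(DOC)
print(section_index["Data protection"])
```

A search on the concept returns the two relevant sections rather than the whole document, which is the retrieval benefit described above.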

www.dita-ot.org

Structuring content was not originally developed for the purpose of detailed topical tagging/indexing and retrieval, though, but rather for the purpose of creating (authoring) and publishing content, especially to the web, more efficiently. Originally, the focus of structured content was on marking up the document style and supporting keyword tags for the entire document. The first content management systems (CMSs) were developed shortly after the web in the 1990s to facilitate the publishing of web pages, although later a distinction emerged between web content management systems and enterprise content management systems.

By the early 2000s, component content management systems (CCMSs) emerged, whereby content is managed in units (components) smaller and more specific than an entire document. CCMSs enable content publishing to be more modular and flexible, supporting content reuse, and making it easier to update content, by updating only the relevant components, instead of the entire document. CCMSs are especially used for creating technical documentation, but they are not limited to that use. Examples of CCMSs include Adobe FrameMaker, Documentum, Heretto, Kontent.ai, Quark, Paligo, Sanity, and Tridion Docs. While more precise tagging was not the original goal of CCMSs, it is a beneficial outcome.

Taxonomies and component content management

CCMSs, along with all CMSs, have come to support taxonomies and tagging better over the years. This includes both support for more taxonomy features, such as hierarchies and synonyms (alternative labels), and support for importing and exporting taxonomies in standard interoperable formats. With respect to CCMSs, taxonomies can be built out to a greater level of detail, with concepts specific to the component topics of a CCMS. However, whoever is creating the taxonomy should remember not to create concepts that are so specific that a concept is applicable to only a single component topic. A single taxonomy concept should retrieve multiple results.
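One way to check for overly specific concepts is to count how many components each concept is applied to. The following is a minimal sketch with invented tagging data; the threshold of two components is an assumption for illustration, not a rule from any standard.

```python
from collections import Counter

# Hypothetical tagging data: component ID -> taxonomy concepts applied to it.
component_tags = {
    "topic-001": ["Installation", "Licensing"],
    "topic-002": ["Installation", "Troubleshooting"],
    "topic-003": ["Troubleshooting", "Error code 4012"],
}

def overly_specific_concepts(tags_by_component, minimum=2):
    """Return concepts applied to fewer than `minimum` components."""
    counts = Counter(
        concept
        for tags in tags_by_component.values()
        for concept in set(tags)  # count each concept once per component
    )
    return sorted(c for c, n in counts.items() if n < minimum)

print(overly_specific_concepts(component_tags))
```

Concepts flagged this way are candidates for review: they might be merged into a broader concept, or they may simply not have enough tagged content yet.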

CCMSs, along with all CMSs, can also connect to or integrate with taxonomies managed in dedicated taxonomy management systems, such as PoolParty. Since organizations tend to have multiple CMSs, each for different kinds of content and purposes, they are likely to end up creating multiple, separate (siloed) taxonomies with similar or overlapping concepts. Therefore, the best strategy for enterprise taxonomy management is to manage taxonomies centrally, either as a single master taxonomy or with multiple taxonomies linked together in dedicated taxonomy management software, which can connect to CMSs with APIs (application programming interfaces) to push the taxonomy out to the CMSs, including CCMSs. Additionally, prebuilt integrations of taxonomy management systems and CCMSs, such as PoolParty and Tridion Docs, are becoming more common.

There is also a growing interest in taxonomies at conferences dealing with component content management. Last October I attended the LavaCon conference for content strategy for the first time, where my pre-conference workshop on taxonomies was well attended. Two weeks ago, I participated in the ConVEx conference, where there is more focus on component content management than at LavaCon. (ConVEx was formerly the DITA North America conference.) In contrast to LavaCon’s two presentations on taxonomies, ConVEx had a track with the “taxonomy” theme and five presentations focused on taxonomies and another three presentations with topics related to taxonomies.

Component content management enables more targeted topic tagging and opens up more possibilities for rich taxonomies. Thus, as a taxonomist, I look forward to learning more about CCMSs and how taxonomies can best be applied in these systems.

Friday, March 31, 2023

Taxonomy and Information Architecture Compared

There is considerable overlap between the fields of information taxonomies and information architecture. Both involve information organization, labeling, search, and findability. In some organizations the job roles and titles are combined. I previously blogged on “Information Architecture and Taxonomies,” observing that “information architecture” in name seemed to be declining while aspects of its practice continued to be strong, since it was an underlying theme in several of the talks at the major taxonomy conference, Taxonomy Boot Camp, in 2013.

Photo of Information Architecture Conference opening: welcome on the screen and a jazz band playing

Information Architecture Conference opening. Photo: Marisela Meskus

This week, for the first time, I am attending in person the Information Architecture Conference, being held in New Orleans March 28 - April 1, so it’s been interesting to hear how information architects consider taxonomies.

How Information Architecture and Taxonomy Overlap

The fields of information architecture and taxonomy are related beyond the stated shared practices of information organization, labeling, search, and findability.

When I give an introduction to taxonomies, I explain that a taxonomy is an intermediary between users and content to connect users to content by means of terms that the users understand and by the display of the terms in hierarchies, facet-filters, or type-ahead suggestions, which enable users to explore and interact with the taxonomy. This is clearly an aspect of information architecture.

In my own career path, I discovered taxonomy and information architecture at the same time. I had been working as a “controlled vocabulary editor” and had the opportunity to work on an interdisciplinary team for a newly designed information product. The product, a school library research database, included a hierarchical taxonomy that was designed to fit with a particular user interface.

At the Information Architecture Conference, I asked my session audience for a show of hands of how many had worked with taxonomies, and it seemed to be over 80%. At the conference, I met information architects who specialized in taxonomies, and taxonomists who had an interest in information architecture and had done some work in it. Even though I identify as a taxonomist, I already knew a number of speakers at the Information Architecture Conference due to the overlapping communities.

How Information Architecture and Taxonomy Differ

Information architecture is a discipline and a profession that is larger and more established than that of taxonomies. Although taxonomy work is growing, there are still more college courses on information architecture than on taxonomies, more books on information architecture than on taxonomies, and more people with “information architect” than “taxonomist” as a job title (based on LinkedIn searches).

Listening to sessions at the Information Architecture Conference and having discussions with participants, I began to see a clearer picture on how the fields of information architecture and taxonomies differ.

The Information Architecture Conference brings together a community of professionals who share ideas and experiences. There is no comparable taxonomist community, as taxonomy work, compared to information architecture work, tends to be done by those with different professional backgrounds: information architects, librarians, content managers, metadata architects, indexers, ontologists, etc. It’s telling that there is not just one conference at which I present about taxonomies but multiple. (Knowledge management, content strategy, knowledge graphs, and data science are the fields of conferences at which I have spoken about taxonomies in the past year.) The only conference about taxonomies, Taxonomy Boot Camp, is more of a specialized track within the KM World conference, and aims to provide taxonomy best practices and case studies to managers and directors of content, product, or knowledge management. It is not really a forum for taxonomists to discuss topics of their profession, as the Information Architecture Conference is.

It seems that information architecture is more of a discipline and a field, whereas taxonomy is more of a tool or system (although a very important one). In addition to information architects in organizations in various industries and consultants, the Information Architecture Conference includes professors and students in the field. By contrast, taxonomy is not a field of study, research, or focus in academia. It is a focus area only in industry and consulting. Information architecture seems to allow more room for theory than does the taxonomy field.

How Information Architecture and Taxonomy Are Related

From a "taxonomic" perspective, which is broader? For information architects, taxonomy is narrower than information architecture. There is no doubt that information architecture is broader in various ways, including content/information organization, design, user experience, and even organization of non-digital information spaces. For example, information architects are concerned not only with taxonomies to support searching and browsing for information, but also with content organization and navigation menu structuring in websites and in software user interfaces.

Taxonomists, on the other hand, do not consider taxonomies as a sub-field of information architecture, but rather consider the two fields as adjacent and closely related. This is because the taxonomies that information architects create tend to be small, such as term lists for metadata properties or facets or as hierarchies to model menu navigation or site maps. Professional taxonomists tend to work on large dynamic taxonomies or thesauri that are used to tag/index and retrieve content or data in one or more systems, often where the user interface is already prescribed.

The related fields or disciplines are also different. Information architecture has a closer relationship with fields of design, user experience, sociology, and psychology. Taxonomy has a closer relationship with indexing/tagging, natural language processing, ontologies, Semantic Web technologies, and knowledge management. One related field shared by both information architecture and taxonomy is structured content, which was also a subject of presentations at this year's Information Architecture conference and the field of my next conference.

Saturday, February 25, 2023

Related Concepts in Taxonomies


A and B are related; C and D are related.

Taxonomies and thesauri are characterized by having hierarchical relationships linking their terms. The associative relationship (or related concept, Related Term, or RT), on the other hand, is a fundamental feature of thesauri, but it is merely an optional feature of taxonomies.

An over-simplistic distinction between taxonomies and thesauri is the presence of associative relationships, although I would disagree, because taxonomies can have associative relationships, and there are other structural design differences between taxonomies and thesauri. (See my past blog posts Taxonomies vs. Thesauri and Taxonomies vs. Thesauri: Practical Implementations)

The associative (related) relationship is a generic, nonhierarchical, symmetrical (same in both directions), reciprocal relationship between pairs of terms/concepts in a thesaurus or taxonomy. "Related concept" actually refers to a kind of relationship, not a kind of concept. The following figure illustrates that Data protection and Privacy are related.
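The symmetric, reciprocal nature of the relationship can be sketched in a few lines. This is only an illustration with a hypothetical class name (RelatedIndex); taxonomy management software typically handles the reciprocal link automatically.

```python
from collections import defaultdict

class RelatedIndex:
    """Stores associative (related) links so each one holds in both directions."""

    def __init__(self):
        self._related = defaultdict(set)

    def relate(self, a, b):
        """Record that a and b are related; the reverse link is added as well."""
        if a == b:
            raise ValueError("a concept cannot be related to itself")
        self._related[a].add(b)
        self._related[b].add(a)

    def related_to(self, concept):
        """Return the concepts related to the given concept, sorted by label."""
        return sorted(self._related.get(concept, set()))

links = RelatedIndex()
links.relate("Data protection", "Privacy")
print(links.related_to("Privacy"))
print(links.related_to("Data protection"))
```

Creating the link once from either side makes it visible from both concepts, which is what distinguishes it from a directed hierarchical (broader/narrower) relationship.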

It is true that many taxonomies do not have associative relationships. This is for various reasons. The function of the taxonomy in the user interface may not require the support of related concepts, such as when the taxonomy is displayed only as facets for refining results or only as type-ahead taxonomy term suggestions when a user enters a search string into a search box. The taxonomy may be implemented in a system (such as a commercial off-the-shelf content management system or SharePoint) that does not support the links/navigating to related concepts in the user interface. A taxonomy may be too small to make beneficial use of associative relationships if most of the taxonomy can quickly be browsed and seen. Finally, and perhaps of the greatest potential significance, is that relationships across different types of concepts can instead be better supported with customized semantic relationships based on custom schema and ontologies, which can be applied to a taxonomy. For example, having Physicians practice Medicine and Medicine isPracticedBy Physicians, instead of Physicians related Medicine.

It is not so much the presence but rather the extent of associative relationships that also distinguishes thesauri from taxonomies. In a traditional thesaurus, associative relationships are as prolific as hierarchical relationships, and perhaps even more so, and they occur between terms of all different kinds and different types of relatedness. The thesaurus standards (ANSI/NISO Z39.19 and ISO 25964-1) provide a list of possible types of associative relationships (process and agent, action and target, cause and effect, object and property, object and origins, and discipline and object, among many others). When taxonomies have associative relationships, they tend to be limited to only certain categories, facets, or concept schemes of the taxonomy.

Related Concepts and SKOS Concept Schemes

Most taxonomies these days, if they are of any significant size (hundreds or thousands of concepts) and intended for use in more than one application, are created in the SKOS (Simple Knowledge Organization System) data model. (Smaller taxonomies might be created in a spreadsheet and imported into a content management system.) The highest level of organizational structure in SKOS is the concept scheme. SKOS-based taxonomy management software will group and display multiple concept schemes together in a single “project” or “knowledge model,” which is intended for a single business use, set of content, user audience, or implementation (with some overlap of multiple use cases acceptable). While SKOS does not provide any recommendation on what you should use concept schemes for, it has become common practice to designate a concept scheme for a taxonomy facet or a metadata property/field. Even when concept schemes are not currently implemented as facets, they might be in the future, so it is good practice to create concept schemes to represent facets. The structure of concept schemes representing facets is also a good organizing principle for constructing any taxonomy. Concept schemes also tend to reflect top-level “classes” of ontologies (although not the very esoteric top class of “Thing”).

SKOS permits the creation of related concept relationships both within and between concept schemes. SKOS also has mapping relationships called matching properties, including relatedMatch, for use between concept schemes, whether they are in the same “project” (sharing the same, initial, domain part of a URI) or not. The option to use either related or relatedMatch across concept schemes of the same project can be a source of confusion.

Best Practices for SKOS Related Concepts

If you are implementing concept schemes each as a facet/filter/refinement in a user interface, then it is best practice not to create associative (related) relationships between concepts in different concept schemes. Facets function as mutually exclusive aspects or dimensions of content items and queries. Any “relatedness” is implicit based on the search results, but not from the taxonomy itself, which should be flexible to allow any combination of concepts from facets and not prescribe relatedness. For example, a user may want to filter a search on movies by which movies meet selected criteria (facets) of a chosen genre, actor, director, topical theme, and country of production, and the result set will implicitly indicate the movies in which these aspects are related.
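The movie example can be sketched as a simple faceted filter. The catalog, facet names, and function below are hypothetical; the point is only that relatedness across facets emerges from the tagged content, not from links in the taxonomy.

```python
# Hypothetical catalog: each movie is tagged with one or more values per facet.
movies = [
    {"title": "Movie A", "genre": {"Drama"}, "director": {"Director X"}, "country": {"France"}},
    {"title": "Movie B", "genre": {"Drama"}, "director": {"Director Y"}, "country": {"Canada"}},
    {"title": "Movie C", "genre": {"Comedy"}, "director": {"Director X"}, "country": {"France"}},
]

def filter_by_facets(items, **selected):
    """Keep items whose tags match every selected facet value (AND across facets)."""
    return [
        item["title"]
        for item in items
        if all(value in item[facet] for facet, value in selected.items())
    ]

print(filter_by_facets(movies, genre="Drama", country="France"))
print(filter_by_facets(movies, director="Director X"))
```

No relationship between Drama and France is stored anywhere; the result set alone shows which movies combine the two.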

Enriching a taxonomy with the semantics of an ontology, in addition to supporting additional data attributes (such as movie production year, actor nationality and birth date, etc.), supports connections across concept types that can be utilized in a front-end application. The user can search not only for movies, but also search for other entities, such as actors (who appear in movies of a certain genre directed by a certain director), or directors (who directed movies on certain themes from certain countries), etc. This involves creating customized, semantic relationships between classes which correspond to the concept schemes: Actor performsIn Movie title and Movie title hasActor Actor, Movie title isProducedIn Country and Country isOriginOf Movie title, etc. These semantic relationships, of course, make any generic SKOS related relationships across the concept schemes unnecessary, redundant, and rather meaningless.
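As a minimal sketch of such paired semantic relationships, the following stores statements as subject-predicate-object triples and adds the inverse of each. The relation names follow the examples in the text; the data, the INVERSES table, and the helper functions are assumptions for illustration and do not use any real ontology library.

```python
# Relation names follow the examples in the text; the inverse table is illustrative.
INVERSES = {"performsIn": "hasActor", "isProducedIn": "isOriginOf"}

def with_inverses(triples):
    """Add the inverse statement for every relation that has one declared."""
    result = set(triples)
    for subject, predicate, obj in triples:
        if predicate in INVERSES:
            result.add((obj, INVERSES[predicate], subject))
    return result

# Hypothetical statements about one movie.
graph = with_inverses({
    ("Actor 1", "performsIn", "Movie A"),
    ("Actor 2", "performsIn", "Movie A"),
    ("Movie A", "isProducedIn", "France"),
})

def objects(subject, predicate):
    """Return everything the subject is linked to by the given relation."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)

print(objects("Movie A", "hasActor"))
print(objects("France", "isOriginOf"))
```

Because each relation carries a specific meaning and direction, a query can start from an actor, a movie, or a country, which a generic related link cannot express.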

Thus, regardless of the use of your concept schemes, the related concept relationship is best not used between concepts in different concept schemes. Rather, the related concept relationship is better used between concepts within a concept scheme, especially topical (subject) concepts, for example, relating the concepts Data quality and Quality management. Relatedness between named entities within a concept scheme, on the other hand, such as concept schemes for People, Organizations, and Geographic places, is best left to be implicit from the retrieved content and not prescribed in a taxonomy, since such relatedness may depend on the content, change over time, and be too subjective.

Even if the current end-user application of a taxonomy does not support user interaction with related links, associative relationships can support tagging, both manual and automated. Finally, a taxonomy typically has a longer life than a single application, so incorporating related concept relationships while the taxonomy is being built and regularly maintained is a good practice for the future use of the taxonomy.

Tuesday, January 31, 2023

Taxonomies vs. Ontologies

The question often comes up: how are taxonomies and ontologies different? While there are some short simple answers (such as: taxonomies are hierarchies, and ontologies are semantic networks), it is understandable that the distinction is not that clear. There is considerable overlap. Ontologies may contain taxonomies, and taxonomies can be semantically enriched to become ontology-like. The same software tools, for example PoolParty, support the creation of both.

One of the trends in data/information/knowledge management is the convergence of systems, methods, and technologies, including the convergence of taxonomies and ontologies. It’s gotten to the point that some people will refer to taxonomies and ontologies almost interchangeably, as if they are essentially the same thing. They are not, although they are increasingly combined. It’s interesting that one of the most active discussion channels within the Taxonomy Talk community on Discord is on ontologies.

Taxonomy vs. Ontology (https://graphviews.poolparty.biz/GraphViews)

Uses

Although both taxonomies and ontologies are kinds of knowledge organization systems, which support access to information, their specific uses tend to differ. The primary use of information taxonomies is for consistent tagging and accurate and comprehensive retrieval of content items. These could be documents, components (sections) of documents, web or intranet pages, or digital assets (image, audio, video files, etc.). Ontologies, with their inclusion of or linkages to instances/individuals, with their various attributes, are more focused on the specifics of data: data retrieval, data comparison, and data analysis. Taxonomies are primarily for what a content item is about (although content/document types may also be part of a taxonomy), as in “get me all the information resources about…,” or “get me a list of products with…” and specifying a set of features and price range as filters. Ontologies, on the other hand, can support more complex, multistep queries, such as “get me a list of products with…” a set of features and price range, whose vendors are located in Canada and have a minimum annual revenue of CAD $50 million.
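The difference between the two query styles can be sketched with invented data. In the example below, the first function filters on a product's own tags and price, and the second adds a step that follows the link from product to vendor and tests the vendor's attributes. All names and figures are hypothetical, and a real implementation would more likely use a graph database and a query language.

```python
# Hypothetical data: products link to vendors, and vendors carry their own attributes.
vendors = {
    "Vendor A": {"country": "Canada", "annual_revenue_cad": 80_000_000},
    "Vendor B": {"country": "Canada", "annual_revenue_cad": 20_000_000},
    "Vendor C": {"country": "Germany", "annual_revenue_cad": 90_000_000},
}
products = [
    {"name": "Product 1", "features": {"wireless"}, "price": 120, "vendor": "Vendor A"},
    {"name": "Product 2", "features": {"wireless"}, "price": 110, "vendor": "Vendor B"},
    {"name": "Product 3", "features": {"wireless"}, "price": 130, "vendor": "Vendor C"},
    {"name": "Product 4", "features": {"wired"}, "price": 90, "vendor": "Vendor A"},
]

def taxonomy_style(feature, max_price):
    """Single-step filter on the product's own tags and price."""
    return [p for p in products if feature in p["features"] and p["price"] <= max_price]

def ontology_style(feature, max_price, country, min_revenue):
    """Second step: follow the product-to-vendor link and test vendor attributes."""
    return [
        p["name"]
        for p in taxonomy_style(feature, max_price)
        if vendors[p["vendor"]]["country"] == country
        and vendors[p["vendor"]]["annual_revenue_cad"] >= min_revenue
    ]

print([p["name"] for p in taxonomy_style("wireless", 150)])
print(ontology_style("wireless", 150, "Canada", 50_000_000))
```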

In comparing retrieval of content and data, for example, taxonomies can retrieve a spreadsheet file, whereas ontologies can retrieve data from individual cells in the spreadsheet. Ontologies can traverse data in a database. While this could be a relational database, increasingly ontologies are used with graph databases, since ontologies are also structured as graphs.

Origins

Another major difference between taxonomies and ontologies is their origins. Information taxonomies (not biological taxonomies) originated in the discipline of library science. Specifically, I would say that taxonomies have evolved as a kind of flexible hybrid of classification systems and thesauri. Ontologies, on the other hand, (when not in philosophy) tend to be taught and researched as a part of computer science. Again, there has also been convergence of library science and computer science in the field of information science. Nevertheless, library/information science and computer/information science are different approaches.

Taxonomies have also become an area of interest in information architecture, user experience design, content management, and digital asset management. Taxonomies are also related to terminology management and information search and retrieval. Ontologies, on the other hand, have become an area of interest in data science, data engineering, and graph data management. Ontologies also borrow concepts from set theory in mathematics and logic from philosophy.

Taxonomies and ontologies follow different standards, but the standards have also converged in a way. Taxonomies have no standard of their own but follow the thesaurus standards (ANSI/NISO Z39.19 and ISO 25964) for recommended best practices. Ontologies are based on W3C standards of RDF, RDF-Schema, and the formal language of OWL (Web Ontology Language). The W3C then published a recommendation for taxonomies, thesauri, and other knowledge organization systems called SKOS (Simple Knowledge Organization System) in 2009, and since then it has become widely adopted. SKOS is based on RDF, as is the ontology standard RDF-S. As a result, SKOS and RDF-S statements or namespaces can be combined in the same knowledge organization system, and taxonomies and ontologies can thus be combined.

Features

Both taxonomies and ontologies aim to describe a knowledge domain with collections of entities structured into groups or types, with relationships between them. Ontologies go further in describing the relationships in more detail. Attributes are also more extensive in ontologies. Both support the options for notes or definitions.

Concepts or Entities

Taxonomies are comprised of concepts (sometimes called terms), which are things. Concepts can be generic or specific and may even include named entities (unique proper nouns). Taxonomies do not differentiate between generic concepts and named entities, which correspond to “individuals” in an ontology. Ontologies, on the other hand, distinguish between two types of entities: classes and individuals. Classes can be broad or specific, but, as the name implies, they are intended to contain something, either subclasses or individuals. By contrast, leaf nodes (the narrowest concepts in a hierarchy) in a taxonomy could actually be quite broad in meaning.

Individuals, as defined by an ontology, tend to be named entities (proper nouns), and they should be uniquely individual. This may not be obvious. A brand name product is a proper noun, but technically it is not an individual, because there are numerous specific instances of the product owned by different people. There may be some differences of opinion on how to define individuals.

Relationships

Taxonomies follow thesaurus standards for relationships. Thesaurus hierarchical relationships comprise three types: generic-specific or “is a” kind of relationship, generic-instance (where the instance is a named entity or proper noun), and whole-part. Ontologies have only generic-specific “is a” hierarchical relationships, which are between classes and subclasses. The relationship between an individual and a class is not considered hierarchical in an ontology but rather a relationship of class-member. Also, the whole-part relationship is not considered hierarchical in ontologies (but could be created as a semantic relationship).

While generic-instance is a permitted hierarchical relationship type in a taxonomy, named entity concepts (proper nouns) are not so often narrower to a corresponding generic concept, but rather tend to be grouped in their own separate concept scheme to serve as a separate search facet or filter.

A generic associative (“related”) relationship may exist in taxonomies, although it is more of a feature of thesauri. It is bidirectional and reciprocal, and it tends to be used between concepts within the same concept scheme, which often corresponds to a class in an ontology. Ontologies do not have a generic associative relationship. Instead, ontologies have semantic relations which are designated by the ontology creator, just as the classes are designated, and they are not used within classes but across a specified pair of classes. Suggestions of what might be of related interest to the end-user are not within the scope of an ontology’s purpose, which is more structured and based on rules. Ontologies may have other bidirectional reciprocal relationships, such as “goes with,” “has sibling,” “accompanies,” etc.

Equivalency and alternative labels

In a taxonomy, each concept has a single preferred label in each language for display and any number of alternative labels and hidden labels per language to help match on searching or tagging. In the traditional thesaurus model, “nonpreferred” terms redirect to “preferred” terms. The alternative labels are sufficiently equivalent in the context of the taxonomy and content to be used for a given concept, and thus might not be exact synonyms. Alternative labels include synonyms, near synonyms, and possibly even narrower terms not deemed needed as concepts with preferred labels.
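The way alternative labels support matching can be sketched as a lookup from any label to its concept, which is then displayed by its preferred label. The concepts and labels below are invented, and the sketch ignores languages and hidden labels for brevity.

```python
# Hypothetical concepts with one preferred label and several alternative labels.
concepts = {
    "c1": {"pref": "Automobiles", "alt": ["Cars", "Motor cars", "Autos"]},
    "c2": {"pref": "Data protection", "alt": ["Data security"]},
}

def build_label_index(concept_map):
    """Map every label (lowercased) to the ID of the concept it belongs to."""
    index = {}
    for concept_id, labels in concept_map.items():
        for label in [labels["pref"], *labels["alt"]]:
            index[label.lower()] = concept_id
    return index

label_index = build_label_index(concepts)

def preferred_label(search_term):
    """Resolve a search term to the preferred label, or None if no label matches."""
    concept_id = label_index.get(search_term.strip().lower())
    return concepts[concept_id]["pref"] if concept_id else None

print(preferred_label("cars"))
print(preferred_label("Bicycles"))
```

A user or tagger who enters any of the alternative labels is led to the same concept, while only the preferred label is displayed.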

In ontologies, the OWL element sameAs is intended for equivalency of individuals, and equivalentClass is for the equivalency of classes, and they mean exact equivalence. But there is no designation of one name being preferred and the other alternative. They all are preferred. The elements sameAs and equivalentClass are not intended for use within a single ontology, but rather across different ontologies. So, those OWL elements are similar to the SKOS exactMatch relationship, which is used across concept schemes or taxonomies. They do not support search within the same data set as alternative labels do.

Enforcement of rules

SKOS is a data model for taxonomies and thesauri, but it does not specify any rules for usage. Rather, the taxonomy creator should attempt to follow the guidelines, not exactly rules, in the thesaurus standards (ANSI/NISO Z39.19 and ISO 25964-1). The quality standards include disjoint labels (a label can be used only once for a concept, preferred or alternative, and for only one concept), single relationships (a pair of concepts may have hierarchical or associative relationships between them, but not both), and no hierarchical cycles. The standard for ontologies, on the other hand, OWL, has many rules built into it. This makes OWL ontologies more powerful by supporting inferencing and reasoning.
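Since these guidelines are not enforced by the data model, they are often verified with separate checks. The sketch below shows what three such checks might look like on invented data. It makes simplifying assumptions: each concept has at most one broader concept, label comparison ignores case, and only direct hierarchical links are compared against related links.

```python
# Hypothetical taxonomy data used only to exercise the checks.
labels = {  # concept -> all of its labels (preferred and alternative)
    "Vehicles": ["Vehicles"],
    "Cars": ["Cars", "Automobiles"],
    "Trucks": ["Trucks", "Automobiles"],  # "Automobiles" is reused: a label clash
}
broader = {"Cars": "Vehicles", "Trucks": "Vehicles"}  # child -> parent (one parent assumed)
related = {("Cars", "Vehicles")}  # this pair is also linked hierarchically: a conflict

def duplicate_labels(label_map):
    """Return labels that are used on more than one concept (case-insensitive)."""
    seen, clashes = {}, set()
    for concept, names in label_map.items():
        for name in names:
            key = name.lower()
            if key in seen and seen[key] != concept:
                clashes.add(name)
            seen.setdefault(key, concept)
    return sorted(clashes)

def relationship_conflicts(broader_map, related_pairs):
    """Return pairs linked both hierarchically (directly) and associatively."""
    hierarchical = {frozenset(pair) for pair in broader_map.items()}
    return sorted(
        tuple(sorted(pair)) for pair in related_pairs if frozenset(pair) in hierarchical
    )

def has_cycle(broader_map):
    """Detect a hierarchical cycle by walking from each concept up through its parents."""
    for start in broader_map:
        seen, current = {start}, start
        while current in broader_map:
            current = broader_map[current]
            if current in seen:
                return True
            seen.add(current)
    return False

print(duplicate_labels(labels))
print(relationship_conflicts(broader, related))
print(has_cycle(broader))
```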

Conclusions

Taxonomies and ontologies share some features, but each has its own additional features. Thus, a combination of a SKOS taxonomy with an OWL ontology combines the features of both. Furthermore, the combination of a taxonomy with an ontology also enables a combination of uses, namely the search and retrieval for both content and data together. Rather than a convergence of taxonomies and ontologies, they are carefully and deliberately combined to maximize their benefits.

Friday, December 30, 2022

Taxonomy Definition

I usually explain that a taxonomy is a structured kind of controlled vocabulary, which is a list of terms (or concepts) usually used to tag content to aid in its retrieval. The structure can be hierarchical, faceted, or a combination. Other people have defined taxonomies for a general audience in more simplistic ways as a kind of hierarchical classification system. So, while a taxonomy has two main features (naming and structure), my preferred definition has focused on the controlled vocabulary and naming aspect, whereas other definitions focus on the hierarchical classification aspect of taxonomies. However, a taxonomy and a classification system are not necessarily the same. While it is understandable that a definition is simplified for a general audience, it should not be simplified to the extent of being misleading.

I have blogged previously on the differences between taxonomies and classification systems, so I won’t repeat all the differences again. The main point is that a classification system is generic and rigid and is intended to be used widely, such as the Dewey Decimal Classification for libraries, whereas a taxonomy tends to be customized for a particular use case and context and is flexible and undergoes changes.

Meanwhile, there are also a few well-known classification systems that are called “taxonomies,” such as the Linnaean taxonomy of organisms and Bloom’s taxonomy of educational objectives. These seem quite different from the information-retrieval type of taxonomy. The Linnaean hierarchical levels have names (Kingdom, Phylum, Class, etc.). The relationships of the hierarchical levels to each other do not cover all of the types in the thesaurus standards: generic-specific, generic-instance, or whole-part. Rather, the Linnaean taxonomic relationships are generic-specific only, or more precisely that of member of class or subclass. Bloom's taxonomy has a completely different hierarchical model that does not follow thesaurus standards at all.

How does a taxonomy of concepts for information retrieval relate to a scientific taxonomy? They are similar, and the differences are not so great that they should be considered different meanings of the word “taxonomy.” If we consider that taxonomies are systems to name and organize things hierarchically, then a taxonomy for information retrieval, comprised of terms for tagging and retrieving content (documents, images, etc.), can be considered a taxonomy of a controlled vocabulary, in contrast to taxonomies of things, such as organisms. This is a slightly different perspective than to consider a taxonomy as a kind of controlled vocabulary, as I previously had. The following diagram illustrates a possible way to consider how information-retrieval taxonomies relate to classification systems and controlled vocabularies.

Diagram showing that information taxonomies are at the intersection of classification systems and controlled vocabularies

Several kinds of knowledge organization systems are defined by their published standards. For thesauri, there are ANSI/NISO Z39.19 and ISO 25964. For terminologies, there is ISO/TC 37/SC 3 and other related standards. For ontologies, there is OWL (Web Ontology Language) from the W3C. There is no standard, however, specifically for “taxonomies” or even for “classification systems,” which is a reason why these remain difficult to define. The designations “classification system,” “classification scheme,” and “taxonomy” have been used interchangeably.

Wikipedia provides the definition at the entry for Taxonomy: “A taxonomy (or taxonomical classification) is a scheme of classification, especially a hierarchical classification, in which things are organized into groups or types.” But then it goes on to say, “it may refer to a categorisation of things or concepts.” Thus, an information-retrieval taxonomy is a categorization of concepts (also called terms in a controlled vocabulary). It is not a classification system, since the goal is not to classify things, not even the things tagged with the taxonomy concepts, but rather to organize the set of concepts that have been identified as appropriate for tagging and retrieving a set of content.