The Accidental Taxonomist: SKOS

Tuesday, December 30, 2025

Taxonomy Benefits Over an Ontology

In a recent conversation based on a LinkedIn post, someone asked “Why choose a taxonomy over an ontology?” This is a good question, since there has been a growing understanding that ontologies build upon taxonomies by adding more semantics, which enable additional benefits. I have presented at conferences on the topic of extending a taxonomy with an ontology. Taxonomies, however, have benefits that ontologies alone cannot provide.

I have compared taxonomies and ontologies in a past blog post (Taxonomies vs. Ontologies). Compared with taxonomies, ontologies support more complex multi-part searches, enable searching on data and not just content or full documents, and can connect across data in different repositories and sources, which leads to creating knowledge graphs or a semantic layer. Additionally, ontologies support modeling and exploration of complex relationships, graph visualizations, and reasoning and inferencing based on logic. Meanwhile, ontologies also include the basic taxonomy feature of unlimited hierarchies of classes and subclasses. Thus, it may seem as if ontologies are superior to taxonomies and provide greater benefits.

Taxonomies, however, especially those based on the SKOS (Simple Knowledge Organization System) data model, have features and benefits not supported by ontologies alone which are based only on OWL and RDFS standards. These taxonomy (or more broadly “controlled vocabulary”) features include the incorporation of synonyms to support searching and tagging, the support of multilingual concepts, the inclusion of definitions and notes in a standardized manner, the ability to map and link taxonomies together based on equivalent or related concepts, the alignment of the taxonomy with end-user applications including browsable hierarchies and facets for filtering, and finally the ease of implementation into various content systems.

Taxonomies are richer than ontologies in their linguistic aspects, including both synonyms and labels in other languages. Taxonomies are traditionally based on thesauri, which include the feature of having “equivalence” among multiple terms, whereby a preferred term may be “used for” other nonpreferred terms. The SKOS data model specifies a preferred label and any number of alternative labels and hidden labels for a concept. Furthermore, concepts may have labels in multiple languages, and this supports tagging content in different languages and retrieval by users of different languages.
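
As a rough illustration of how preferred, alternative, and hidden labels in multiple languages support search and tagging, here is a minimal Python sketch. The `Concept` class, the `build_label_index` function, the example URI, and the sample labels are all assumptions made up for this example; they are not part of SKOS or of any taxonomy tool.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A concept with SKOS-style labels, keyed by language tag (illustrative only)."""
    uri: str
    pref_label: dict                                    # lang -> single preferred label
    alt_labels: dict = field(default_factory=dict)      # lang -> list of alternative labels
    hidden_labels: dict = field(default_factory=dict)   # lang -> list of hidden labels

    def labels(self, lang):
        """All labels in one language: preferred first, then alternative, then hidden."""
        preferred = [self.pref_label[lang]] if lang in self.pref_label else []
        return preferred + self.alt_labels.get(lang, []) + self.hidden_labels.get(lang, [])


def build_label_index(concepts, lang):
    """Map every case-folded label in the given language to its concept."""
    index = {}
    for concept in concepts:
        for label in concept.labels(lang):
            index[label.casefold()] = concept
    return index


physicians = Concept(
    uri="https://example.org/concept/physicians",  # hypothetical URI
    pref_label={"en": "Physicians", "de": "Ärzte"},
    alt_labels={"en": ["Doctors", "Medical doctors"]},
    hidden_labels={"en": ["Physicans"]},  # misspelling: matched, never displayed
)

index_en = build_label_index([physicians], "en")
index_de = build_label_index([physicians], "de")

# A search on a variant label or on a German label resolves to the same concept.
print(index_en["doctors"].pref_label["en"])
print(index_de["ärzte"].pref_label["en"])
```

Whatever variant the user types or the text contains, the match leads back to one concept, and only the preferred label needs to be displayed.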

In ontologies, there exists the OWL property of sameAs for equivalence of individuals and equivalentClass for equivalence of classes, but both tend to be used to declare equivalence across different datasets rather than for use within a single ontology, as there is no designation of preferred and alternative names. So, these OWL properties are more like mapping properties than support of synonyms within a controlled vocabulary. As such they do not support the basic purpose of alternative labels in a taxonomy, which is to enable matches to support searching on variant labels and tagging despite different words in texts for the same thing.

The SKOS data model for taxonomies defines properties for scope notes, editorial notes, history notes, examples, and definitions. These are standardized fields and thus the meanings of these notes fields are consistent across taxonomies, supporting interoperability and migration. In OWL ontologies there exists an annotation property, but its use broadly includes labels, definitions, synonyms, attribution, notes, or comments. With such inconsistent use, annotations are not well supported in importing, exporting, or linking of ontologies.

SKOS also has a set of mapping relationships. While OWL supports equivalence with sameAs and equivalentClass, SKOS taxonomies have not only the equivalence relationship, exactMatch, but also closeMatch, narrowMatch, broadMatch, and relatedMatch, and thus all concepts in two separate taxonomies can be mapped to each other, unlike two ontologies, which may share only a few matches. The full mapping of one taxonomy to another supports various uses, including using one taxonomy in the front end and the other in the back end, tagged to content.
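
The front-end/back-end use of mappings could be sketched as follows. This is only an illustration under stated assumptions: the `nav:` and `tag:` prefixes, the sample concepts, and the choice of which match types should drive retrieval are all hypothetical, not something SKOS prescribes.

```python
from enum import Enum


class Match(Enum):
    """The five SKOS mapping relationships."""
    EXACT = "exactMatch"
    CLOSE = "closeMatch"
    BROAD = "broadMatch"
    NARROW = "narrowMatch"
    RELATED = "relatedMatch"


# Hypothetical mappings from a front-end (navigation) taxonomy
# to a back-end (tagging) taxonomy: (source, match type, target).
MAPPINGS = [
    ("nav:Cancer", Match.EXACT, "tag:Neoplasms"),
    ("nav:Cancer treatment", Match.NARROW, "tag:Radiation therapy"),
    ("nav:Cancer treatment", Match.NARROW, "tag:Chemotherapy"),
    ("nav:Wellness", Match.RELATED, "tag:Preventive medicine"),
]

# Assumption: only equivalent or narrower targets should be used for retrieval.
RETRIEVAL_MATCHES = {Match.EXACT, Match.CLOSE, Match.NARROW}


def backend_concepts_for(front_end_concept):
    """Back-end concepts whose tagged content should be retrieved for a front-end concept."""
    return [
        target
        for source, match, target in MAPPINGS
        if source == front_end_concept and match in RETRIEVAL_MATCHES
    ]


print(backend_concepts_for("nav:Cancer treatment"))
```

Keeping the match type on each mapping lets an application decide how loosely to follow the links, for example ignoring relatedMatch for retrieval but using it for "see also" suggestions.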

Finally, taxonomies are better suited for various content-based implementations and applications, especially with out-of-the-box systems, such as web content management systems, digital asset management systems, SharePoint, etc. A taxonomy modeled as several SKOS concept schemes can designate each concept scheme as a facet in a faceted search/browse system, in which a facet serves as a filter. A taxonomy built as a hierarchy tree can be implemented so that users can expand the tree to browse to narrower concepts and then retrieve content tagged with the most specific concept desired. Ontologies, even if they contain hierarchies of classes and subclasses, are typically visualized as graphs, and any hierarchies are not displayed in a front-end application. Furthermore, ontology visualizations are usually not linked to actual content or data, as they serve just for visualizing.

In sum, while ontologies add richer semantics/meaning to relationships and attributes, taxonomies have richer semantics/meaning for concepts. Combining a taxonomy and ontology can bring the best of both worlds, and the semantic web standards of SKOS, OWL, and RDFS are all compatible for combining within a single project, since they are all based on the RDF (Resource Description Framework) data model. However, in many cases, a taxonomy with rich meaning for concepts, support for synonyms in search and tagging, along with interactive displays of hierarchies and/or facets, is all that is needed. You can always add an ontology later.

Sunday, November 9, 2025

Schema Vocabularies and Value Vocabularies

There are different types of controlled vocabularies for information and knowledge management. Usually, we think of the various kinds of controlled vocabularies for purposes of tagging and finding information, such as term lists, authority files, thesauri, and taxonomies. In the broader context of information and knowledge management, there also exist higher-level controlled vocabularies called schema vocabularies. In this context, the better known (default) controlled vocabularies comprising specific concepts or terms for tagging content are called value vocabularies, since their terms/concepts are considered values.

This dichotomy of schema and value vocabularies occurs particularly within the context of metadata. Metadata management comprises two components: (1) a list of metadata types, also called elements, properties, or fields; and (2) the terms or values possible for each metadata element. I discussed types of metadata in more detail in my last blog post, "Types of Metadata Schema." Thus, a schema vocabulary comprises the names of metadata elements, and a value vocabulary is a list of terms/concepts for a specific metadata element. For example, a schema vocabulary might include Country, Language, Source, and Topic; and the multiple value vocabularies would be the lists of approved countries, languages, sources, and topics. It should be noted that in some systems, e.g. RDF, OWL, etc., the distinction between metadata elements and metadata values can be fuzzy. Furthermore, not all schema vocabulary elements have a corresponding value vocabulary (a controlled vocabulary), as some metadata elements may be for such values as title, description, and date.
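
To make the two layers concrete, here is a small Python sketch that checks a metadata record against a schema vocabulary (the allowed elements) and, where one exists, a value vocabulary (the allowed terms). The element names follow the example above; the specific values, the `validate` function, and the sample records are assumptions invented for illustration.

```python
# Schema vocabulary: the names of the metadata elements.
SCHEMA_VOCABULARY = {"Title", "Date", "Country", "Language", "Source", "Topic"}

# Value vocabularies: controlled terms, for those elements that have them.
# Title and Date are free-form, so they have no value vocabulary.
VALUE_VOCABULARIES = {
    "Country": {"Spain", "Germany", "Canada"},
    "Language": {"English", "Spanish", "German"},
    "Topic": {"Metadata", "Taxonomies", "Knowledge graphs"},
}


def validate(record):
    """Return a list of problems found in a metadata record (empty if none)."""
    problems = []
    for element, value in record.items():
        if element not in SCHEMA_VOCABULARY:
            problems.append(f"unknown element: {element}")
        elif element in VALUE_VOCABULARIES and value not in VALUE_VOCABULARIES[element]:
            problems.append(f"uncontrolled value for {element}: {value}")
    return problems


good = {"Title": "Any free text", "Country": "Spain", "Topic": "Taxonomies"}
bad = {"Title": "Any free text", "Country": "Atlantis", "Colour": "Blue"}

print(validate(good))
print(validate(bad))
```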

In my observation, we speak of “vocabularies” rather than “controlled vocabularies” in this context, especially with respect to schema, for various reasons. Schema vocabularies are referred to simply as “vocabularies,” rather than “controlled vocabularies,” because they are not traditional controlled vocabularies used for tagging, and also because their “control” is different from the control of value vocabularies. Value vocabularies can be changed but through defined policies and procedures, which depend on the implementation and ownership, and changes can be frequent, e.g. weekly, monthly, quarterly, or annually. Schema vocabularies, on the other hand, are intended to be standard, and are updated only very infrequently, such as once per 5-10 years, and usually by a standards body. Schema vocabularies provide control by their very nature. Meanwhile, it is often necessary to call out the controlled feature of value vocabularies, since some metadata properties may have uncontrolled keywords as their values.

Schema vocabularies may be metadata schema, such as Dublin Core (for published resources) or IPTC metadata (for photos), but other kinds of information and content management schema can also be considered as schema vocabularies in that a “vocabulary” defines the various elements. Such other schema vocabularies include SKOS (Simple Knowledge Organization System), DCAT (Data Catalog Vocabulary), and iiRDS (intelligent information Request and Delivery Standard), among others. Our panel “Using Schema and Value Vocabularies to Provide Consistency Across Structured Content” addressed these schema and other data frameworks, which are similar to but not the same as schema, such as OWL and DITA, at the recent DCMI (Dublin Core Metadata Initiative) conference in Barcelona in October. Other speakers were Joseph Busch, who had the idea of this topic for a conference panel, Lief Erickson, Noz Urbina, and Peter Winstanley.

DCMI 2025 Panel: "Schema and Value Vocabularies for Consistency"

My presentation on the DCMI panel was "Schema and Value Vocabularies for Thesauri and Taxonomies," which explained that SKOS is a schema vocabulary, and specific SKOS-based taxonomies and thesauri are value vocabularies. SKOS (Simple Knowledge Organization System) is the W3C data model schema for knowledge organization systems, especially taxonomies and thesauri. It can also be considered a schema vocabulary, because it has standard elements with defined display names and machine-readable concatenated forms. In fact, the designation “elements” is what is used in the SKOS model. SKOS, however, is a special kind of schema vocabulary, and it’s not a metadata schema. When SKOS-based taxonomies or thesauri serve as the value vocabularies for metadata elements, those metadata elements are managed as specific SKOS Concept Schemes. In a faceted taxonomy, each Concept Scheme serves as a facet.

Taxonomists don’t usually think of vocabularies being classified as either "schema vocabularies" or "value vocabularies." However, as taxonomies have increasingly been integrated with metadata and serve purposes beyond just browsing, searching and retrieving content, it’s important to see the bigger picture of where taxonomies as value vocabularies fit in, and where taxonomies can provide more benefits.

Thursday, September 18, 2025

Narrower Terms vs. Alternative Terms

A number of years ago I worked on a project of cleaning up a large taxonomy on occupations and job titles. My client contact was sometimes confused between terms to be used as synonyms/variants for a preferred term and terms to be used as narrower terms to a preferred term. This initially surprised me, because the difference seemed so obvious. A more recent project raised the issue again, and I now realize the challenges.

The word “term” can be confusing, considering the different types of terms that exist. Both variant terms (also called synonyms, nonpreferred terms, or entry terms) and narrower terms are kinds of terms. By contrast, focusing on concepts that may have various labels, the distinction between a concept’s narrower concepts and its alternative labels is quite clear. The widely adopted SKOS (Simple Knowledge Organization System) data model standard follows the concept-based approach. SKOS is now followed by all dedicated taxonomy management software systems.

Many taxonomies, however, are not yet managed in dedicated taxonomy management systems but rather in spreadsheets or internally developed tools, neither of which follows SKOS. This is the case of both my projects in question. Each “term” in the spreadsheet-based tool had its own row, which resulted in multiple rows for the same concept. Broader categories were in another column to the right. This format is potentially confusing because the variants appeared in a column as did the hierarchical levels, and you had to remember which column was which.

Regardless of the tool used, what makes it even more confusing is that a narrower concept could be either a variant term or a hierarchically narrower term. What may variously be called synonyms, variants, nonpreferred terms, entry terms, or alternative labels are not merely literal synonyms, but they could be any terms or labels that may be used in tagging to trigger the use of the concept or preferred term. This includes terms whose meaning is narrower or more specific than the term/concept in question, since the latter includes more specific terms within its scope. So, tagging the occurrence of a concept with a “broader” concept is acceptable.

For example, in a medical taxonomy a concept can be Radiation therapy. Radiotherapy is an alternative label. But then there are specific types of radiation therapy, such as Brachytherapy, Radioimmunotherapy, and Radionuclide therapy. These could be added to the taxonomy either as narrower concepts or as alternative labels to Radiation therapy, depending on how specific the taxonomy should be.
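
The effect of that choice on tagging can be shown with a small Python sketch. It models the same terms two ways, once with the specific therapies as alternative labels and once as narrower concepts, and tags a sentence with each. The dictionary layout, the `tag` function, and the naive substring matching are assumptions for illustration only; real auto-tagging is more sophisticated.

```python
def build_index(taxonomy):
    """Map each case-folded preferred or alternative label to its preferred label."""
    index = {}
    for pref, entry in taxonomy.items():
        index[pref.casefold()] = pref
        for alt in entry.get("alt", []):
            index[alt.casefold()] = pref
    return index


def tag(text, taxonomy):
    """Return the concepts whose labels occur in the text (naive substring match)."""
    lowered = text.casefold()
    return sorted({pref for label, pref in build_index(taxonomy).items() if label in lowered})


# Option 1: the specific therapies are alternative labels of one concept.
coarse = {
    "Radiation therapy": {
        "alt": ["Radiotherapy", "Brachytherapy", "Radioimmunotherapy", "Radionuclide therapy"],
    },
}

# Option 2: the specific therapies are narrower concepts in their own right.
detailed = {
    "Radiation therapy": {
        "alt": ["Radiotherapy"],
        "narrower": ["Brachytherapy", "Radioimmunotherapy", "Radionuclide therapy"],
    },
    "Brachytherapy": {},
    "Radioimmunotherapy": {},
    "Radionuclide therapy": {},
}

text = "The patient received brachytherapy."
print(tag(text, coarse))    # tagged with the broader concept
print(tag(text, detailed))  # tagged with the specific concept
```

In the first model the content is tagged with the broader concept, which is acceptable; in the second it is tagged with the more specific one.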

When creating or editing a taxonomy, it is often difficult to decide how specific the taxonomy should be in certain places. Terms that are too specific to warrant use as concepts should then be relegated to the status of variants/alternative labels. Deciding what is too specific depends on the concept’s relative specificity within the entire taxonomy in addition to considering the potential usage of the specific concept.

In sum, if you are not ready to adopt SKOS-based taxonomy management software, at the very least you should adopt a SKOS-based approach in conceptualizing and labeling your taxonomy. Call things “concepts” and “labels”, not “terms.” Concepts are in hierarchical relationships to each other. Labels are the names for concepts. The “preferred label” is the displayed form of the name (such as in facets in the front-end application), and “alternative labels” are variant labels to match against strings of text that may be used for the concept and trigger tagging with the concept. Furthermore, alternative labels could be displayed differently from preferred labels, such as in italics and/or a different colored shaded cell.

Saturday, February 25, 2023

Related Concepts in Taxonomies


A and B are related; C and D are related.

Taxonomies and thesauri are characterized by having hierarchical relationships linking their terms. The associative relationship (or related concept, Related Term, or RT), on the other hand, is a fundamental feature of thesauri, but it is merely an optional feature of taxonomies.

An over-simplistic distinction between taxonomies and thesauri is the presence of associative relationships, although I would disagree, because taxonomies can have associative relationships, and there are other structural design differences between taxonomies and thesauri. (See my past blog posts Taxonomies vs. Thesauri and Taxonomies vs. Thesauri: Practical Implementations)

The associative (related) relationship is a generic, nonhierarchical, symmetrical (same in both directions), reciprocal relationship between pairs of terms/concepts in a thesaurus or taxonomy. "Related concept" actually refers to a kind of relationship, not a kind of concept. The following figure illustrates that Data protection and Privacy are related.

It is true that many taxonomies do not have associative relationships. This is for various reasons. The function of the taxonomy in the user interface may not require the support of related concepts, such as when the taxonomy is displayed only as facets for refining results or only as type-ahead taxonomy term suggestions when a user enters a search string into a search box. The taxonomy may be implemented in a system (such as a commercial off-the-shelf content management system or SharePoint) that does not support the links/navigating to related concepts in the user interface. A taxonomy may be too small to make beneficial use of associative relationships if most of the taxonomy can quickly be browsed and seen. Finally, and perhaps of the greatest potential significance, is that relationships across different types of concepts can instead be better supported with customized semantic relationships based on custom schema and ontologies, which can be applied to a taxonomy. For example, having Physicians practice Medicine and Medicine isPracticedBy Physicians, instead of Physicians related Medicine.

It is not so much the presence but rather the extent of associative relationships that also distinguishes thesauri from taxonomies. In a traditional thesaurus, associative relationships are as prolific as hierarchical relationships, and perhaps even more so, and they occur between terms of all different kinds and different types of relatedness. The thesaurus standards (ANSI/NISO Z39.19 and ISO 25964-1) provide a list of possible types of associative relationships (process and agent, action and target, cause and effect, object and property, object and origins, and discipline and object, among many others). When taxonomies have associative relationships, they tend to be limited to only certain categories, facets, or concept schemes of the taxonomy.

Related Concepts and SKOS Concept Schemes

Most taxonomies these days, if they are of any significant size (hundreds or thousands of concepts) and intended for use in more than one application, are created in the SKOS (Simple Knowledge Organization System) data model. (Smaller taxonomies might be created in a spreadsheet and imported into a content management system.) The highest level of organizational structure in SKOS is the concept scheme. SKOS-based taxonomy management software will group and display multiple concept schemes together in a single “project” or “knowledge model,” which is intended for a single business use, set of content, user audience, or implementation (with some overlap of multiple use cases acceptable). While SKOS does not provide any recommendation on what you should use concept schemes for, it has become common practice to designate a concept scheme for a taxonomy facet or a metadata property/field. Even when concept schemes are not currently implemented as facets, they might be in the future, so it is good practice to create concept schemes to represent facets. The structure of concept schemes representing facets is also a good organizing principle for constructing any taxonomy. Concept schemes also tend to reflect top-level “classes” of ontologies (although not the very esoteric top class of “Thing”).

SKOS permits the creation of related concept relationships both within and between concept schemes. SKOS also has mapping relationships called matching properties, including relatedMatch, for use between concept schemes, whether they are in the same “project” (sharing the same, initial, domain part of a URI) or not. The option to use either related or relatedMatch across concept schemes of the same project can be a source of confusion.

Best Practices for SKOS Related Concepts

If you are implementing concept schemes each as a facet/filter/refinement in a user interface, then it is best practice not to create associative (related) relationships between concepts in different concept schemes. Facets function as mutually exclusive aspects or dimensions of content items and queries. Any “relatedness” is implicit based on the search results, but not from the taxonomy itself, which should be flexible to allow any combination of concepts from facets and not prescribe relatedness. For example, a user may want to filter a search on movies by which movies meet selected criteria (facets) of a chosen genre, actor, director, topical theme, and country of production, and the result set will implicitly indicate the movies in which these aspects are related.
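
A minimal Python sketch of such faceted filtering follows. The movie records, facet values, and `filter_items` function are made-up placeholders, and the combination logic (AND across facets, OR within a facet) is an assumption about a typical implementation. Note that the taxonomy itself holds no relationships between the facets; the connections surface only in the results.

```python
# Each content item is tagged with concepts from several facets (concept schemes).
MOVIES = [
    {"title": "Movie 1", "Genre": {"Drama"}, "Director": {"Director X"}, "Country": {"Spain"}},
    {"title": "Movie 2", "Genre": {"Comedy"}, "Director": {"Director X"}, "Country": {"Canada"}},
    {"title": "Movie 3", "Genre": {"Drama", "Comedy"}, "Director": {"Director Y"}, "Country": {"Spain"}},
]


def filter_items(items, selections):
    """Keep items matching every selected facet (AND across facets, OR within a facet)."""
    return [
        item["title"]
        for item in items
        if all(item.get(facet, set()) & wanted for facet, wanted in selections.items())
    ]


# Any combination of facet values is allowed; relatedness shows up only in the result set.
print(filter_items(MOVIES, {"Genre": {"Drama"}, "Country": {"Spain"}}))
print(filter_items(MOVIES, {"Director": {"Director X"}, "Genre": {"Comedy"}}))
```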

Enriching a taxonomy with the semantics of an ontology, in addition to supporting additional data attributes (such as movie production year, actor nationality and birth date, etc.), supports connections across concept types that can be utilized in a front-end application. The user can search not only for movies, but also search for other entities, such as actors (who appear in movies of a certain genre directed by a certain director), or directors (who directed movies on certain themes from certain countries), etc. This involves creating customized semantic relationships between classes, which correspond to the concept schemes: Actor performsIn Movie title and Movie title hasActor Actor, Movie title isProducedIn Country and Country isOriginOf Movie title, etc. These semantic relationships, of course, make any generic SKOS related relationships across the concept schemes unnecessary, redundant, and rather meaningless.
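
The following Python sketch shows how named, paired relationships of this kind could be stored and traversed to answer a question about actors rather than movies. The relationship names come from the examples above; the `Graph` class, the sample actors, movies, and countries, and the query function are hypothetical.

```python
from collections import defaultdict

# Each custom relationship has a named inverse.
_PAIRS = [("performsIn", "hasActor"), ("isProducedIn", "isOriginOf")]
INVERSE_OF = {}
for forward, backward in _PAIRS:
    INVERSE_OF[forward] = backward
    INVERSE_OF[backward] = forward


class Graph:
    """Stores each statement together with its inverse statement."""

    def __init__(self):
        self._edges = defaultdict(set)  # (subject, relationship) -> objects

    def add(self, subject, relationship, obj):
        self._edges[(subject, relationship)].add(obj)
        self._edges[(obj, INVERSE_OF[relationship])].add(subject)

    def objects(self, subject, relationship):
        return set(self._edges.get((subject, relationship), set()))


g = Graph()
g.add("Actor A", "performsIn", "Movie 1")
g.add("Actor B", "performsIn", "Movie 2")
g.add("Movie 1", "isProducedIn", "Spain")
g.add("Movie 2", "isProducedIn", "Canada")


def actors_in_movies_from(country):
    """Follow Country isOriginOf Movie, then Movie hasActor Actor."""
    actors = set()
    for movie in g.objects(country, "isOriginOf"):
        actors |= g.objects(movie, "hasActor")
    return actors


print(actors_in_movies_from("Spain"))
```

Because each relationship carries its own meaning and direction, a generic "related" link between Actor A and Spain would add nothing.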

Thus, regardless of the use of your concept schemes, the related concept relationship is best not used between concepts in different concept schemes. Rather, the related concept relationship is better used between concepts within a concept scheme, especially topical (subject) concepts, for example, relating the concepts Data quality and Quality management. Relatedness between named entities within a concept scheme, on the other hand, such as concept schemes for People, Organizations, and Geographic places, is best left to be implicit from the retrieved content and not prescribed in a taxonomy, which may be dependent on the content, change over time, and be too subjective.
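
If you wanted to enforce this practice in a homegrown tool, a check along the following lines would do it. This Python sketch is an assumption-laden illustration: the `Taxonomy` class and its methods are invented here, and the rule it enforces is the practice recommended above, not a SKOS constraint (SKOS itself permits related relationships across concept schemes).

```python
from collections import defaultdict


class Taxonomy:
    """Keeps related links symmetrical and confined to a single concept scheme."""

    def __init__(self):
        self.scheme_of = {}               # concept -> concept scheme
        self.related = defaultdict(set)   # concept -> related concepts

    def add_concept(self, concept, scheme):
        self.scheme_of[concept] = scheme

    def relate(self, a, b):
        # House rule (not a SKOS rule): no related links across concept schemes.
        if self.scheme_of[a] != self.scheme_of[b]:
            raise ValueError(f"{a!r} and {b!r} are in different concept schemes")
        # The associative relationship is symmetrical, so store both directions.
        self.related[a].add(b)
        self.related[b].add(a)


taxonomy = Taxonomy()
taxonomy.add_concept("Data quality", "Topics")
taxonomy.add_concept("Quality management", "Topics")
taxonomy.add_concept("Spain", "Geographic places")

taxonomy.relate("Data quality", "Quality management")
print(sorted(taxonomy.related["Quality management"]))
```

Calling `taxonomy.relate("Data quality", "Spain")` would raise a `ValueError`, since the two concepts belong to different concept schemes.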

Even if the current end-user application of a taxonomy does not support user interaction with related links, associative relationships can support tagging, both manual and automated. Finally, a taxonomy typically has a longer life than a single application, so incorporating related concept relationships while the taxonomy is being built and regularly maintained is a good practice for the future use of the taxonomy.

Wednesday, August 31, 2022

SKOS-XL for Taxonomies

I recently posted about SKOS (Simple Knowledge Organization System). If you have read anything about SKOS, then you might have come across SKOS-XL (SKOS eXtension for Labels) and wondered what that is. The World Wide Web Consortium (W3C) released its recommendations for SKOS and SKOS-XL at the same time in 2009 but chose to make them separate recommendations. One way to see it is that, by separating out SKOS-XL, SKOS is indeed truly “simple.” In the detailed SKOS reference, SKOS-XL is an appendix.

www.w3.org/TR/skos-reference/skos-xl.html

Extending labels to become resources

“Things, not strings” is a tagline for semantic models, such as SKOS, which emphasize concepts in taxonomies and other knowledge organization systems and not terms or words. Of course, strings of text exist, and when associated with concepts they are called “labels.” The distinction between a label and the concept that the label describes may seem negligible or perhaps just philosophical. The main difference is that concepts are unique within a taxonomy, but labels are not. A concept may have multiple labels (synonyms or names in different languages), and the same label might apply to different concepts (homographs).

SKOS specifies preferred labels, alternative labels, and hidden labels as options for concepts. Hidden labels can be considered as a type of alternative label that should never be displayed. Alternative labels may display, depending on the front-end application. Preferred labels are what are displayed, especially in hierarchies and facets.

Concepts, as things, have properties or characteristics. Labels do not. But sometimes there are reasons to assign properties to labels, such as to indicate the purpose or use of different labels. In this sense, you would want to turn a string into a thing. More correctly, a thing is called a resource, as described by the Resource Description Framework (RDF), the model upon which SKOS is based. This is what SKOS-XL supports: converting labels to resources. It does this by adding three more elements not found in SKOS: label, label relation, and literal form. The label relation in particular enables links between labels, while SKOS-XL counterparts of the preferred, alternative, and hidden label properties link a concept to its label resources. Further details are in the W3C's SKOS-XL recommendation, which I am not going to repeat here.

Use for SKOS-XL

A typical use case for assigning properties to labels with SKOS-XL is when you want to have different labels for different user groups, such as in a medical taxonomy for shared medical content to be accessed by both medical professionals and lay people. Medical professionals may prefer a concept labeled Neoplasms, while lay people could call it Cancer. Different user groups could be based in different regions. Although different ISO-code based language labels can be used to distinguish regions in addition to language (such as en-US and en-GB), you may not want to duplicate the vast majority of preferred labels and merely distinguish the few that are actually different.

While SKOS permits multiple alternative labels, aside from hidden labels, there is no way to distinguish their types or purposes in SKOS. You may want alternative labels to support search in one front-end application and not another. You may want to designate official acronyms as distinct from other alternative labels. You may even want to distinguish between different kinds of hidden labels, such as those that should be hidden because they might be pejorative or offensive, and those that you wish to hide only from a type-ahead display because they are near duplicates of other alternative labels and too many alternative labels would clutter up the display. Finally, there may be alternative labels used by only certain users or in certain regions.

SKOS-XL lets you assign properties or attributes to labels. Assigning the purpose or use of the label is only one possibility, although it is the most common use of SKOS-XL. You may wish to manage more administrative metadata about labels, such as the source or origin of different labels.
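
The idea of labels as resources with their own properties can be sketched in plain Python. The `Label` and `Concept` classes below only mimic the SKOS-XL pattern of a label with a literal form; the `audience` and `source` properties, the example URIs, and the `display_label` function are hypothetical, since SKOS-XL leaves it to you to define which properties labels carry.

```python
from dataclasses import dataclass, field


@dataclass
class Label:
    """A label treated as a resource: it has its own identifier and properties."""
    uri: str
    literal_form: str
    lang: str = "en"
    properties: dict = field(default_factory=dict)  # custom properties (assumed, not SKOS-XL)


@dataclass
class Concept:
    uri: str
    pref_labels: list = field(default_factory=list)
    alt_labels: list = field(default_factory=list)


def display_label(concept, audience, lang="en"):
    """Pick the label marked for this audience; otherwise fall back to the preferred label."""
    for label in concept.pref_labels + concept.alt_labels:
        if label.lang == lang and label.properties.get("audience") == audience:
            return label.literal_form
    for label in concept.pref_labels:
        if label.lang == lang:
            return label.literal_form
    return None


neoplasms = Concept(
    uri="https://example.org/concept/neoplasms",  # hypothetical URIs throughout
    pref_labels=[Label("https://example.org/label/1", "Neoplasms",
                       properties={"audience": "professional", "source": "clinical vocabulary"})],
    alt_labels=[Label("https://example.org/label/2", "Cancer",
                      properties={"audience": "lay"})],
)

print(display_label(neoplasms, "lay"))
print(display_label(neoplasms, "professional"))
```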

Implementing SKOS-XL

The principle of SKOS-XL is not complex, but implementation can be more challenging, and if you are building taxonomies with the SKOS-XL capability, you would want to use taxonomy management software that supports SKOS-XL, such as PoolParty. Taxonomy management software products are quite consistent when it comes to their user interface for supporting the editing of basic SKOS taxonomies, but they are not the same for creating and editing SKOS-XL labels, which is a less common function.

Having properties, such as types, for terms is not new, but required some more innovation in the SKOS model of things (concepts), not strings (terms). It was common for non-SKOS taxonomy/thesaurus management software, which treated different terms with the same meaning as equivalence relationships, to support the customization of relationships, including the equivalence relationship. SKOS-XL ensures that this earlier feature is supported in the current standard, in machine-readable format.

For SKOS-XL to be more widely used and maybe even more elegantly supported requires greater sharing of use cases. I hope the taxonomist community will share their experiences with SKOS-XL, so we can talk about practices and recommendations and not just theory.

Further information:

“Taxonomy Management Based on SKOS-XL” 2016 presentation slides
“From SKOS over SKOS-XL to Custom Ontologies” 2016 webinar video and slides
“What SKOS-XL adds to SKOS” 2011 blog post by Bob DuCharme

Sunday, June 26, 2022

SKOS Taxonomies

Over the 26 years that I have been involved in controlled vocabularies, thesauri, and taxonomies, the biggest change I have seen in the field is the adoption of SKOS (Simple Knowledge Organization System) as a schema model and standard.

If you are creating taxonomies exclusively within a single system (such as the SharePoint Term Store or controlled tags or categories of a content management system, documentation management system, DAM, etc.), then you probably have not paid much attention to SKOS. It’s true that taxonomies created within and used within a single system do not have to follow an external standard. But that is not the trend of information management and technology anymore. Connectivity, interoperability, data sharing and reuse, data-centric architecture, vendor-neutral formats, linked data and linked open data, breaking down data silos, enterprise-wide knowledge, and enterprise knowledge graphs have become the preferred trends and directions.

Different Kinds of Standard

With respect to standards, there exist two basic kinds: (1) standards for design, functionality, and a consistent user experience, and (2) standards for compatibility, interoperability, and machine-readability. For this reason, there are two separate sets of standards for taxonomies and other knowledge organization systems. Another way to think of it is that there are standards for each of the front end (user interface and experience) and the back end (computer-readable code) of taxonomies, and they are somewhat independent yet still compatible with each other.

For taxonomies and thesauri, more has been written about the front-end design and best practice standards than the back-end interoperability standards. This is for several reasons. The design and best practices standards (ANSI/NISO Z39.19 and ISO 25964 and its predecessors ISO 2788 and ISO 5964), have been around longer. They are lengthier and more detailed than interoperability standards, and they apply to taxonomies and thesauri regardless of their digital or nondigital format. So, this article will focus instead on the back-end, interoperability standard, which is SKOS.

SKOS Background

SKOS is a recommendation for "a common data model for sharing and linking knowledge organization systems via the Semantic Web". These knowledge organization systems include thesauri (as defined by the ANSI/NISO and ISO thesaurus standards), taxonomies, classification schemes, subject heading systems, and other controlled vocabularies. SKOS is based on RDF (Resource Description Framework), a World Wide Web Consortium (W3C) standard for description and exchange of graph data. RDF specifies that all statements consist of subject-predicate-object triples, and all resources have URIs (uniform resource identifiers).
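
As an illustration of subject-predicate-object triples, here is a small Python sketch that represents a few SKOS statements as plain tuples and queries them by pattern. The SKOS and RDF namespace URIs are the real ones; the `https://example.org/taxonomy/` namespace, the sample concepts, and the `match` helper are assumptions for this example, and literals are shown as simple (text, language) pairs rather than in a real RDF serialization.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.org/taxonomy/"  # hypothetical namespace for this taxonomy

# Every statement is a (subject, predicate, object) triple; resources are URIs.
TRIPLES = [
    (EX + "radiation-therapy", RDF_TYPE, SKOS + "Concept"),
    (EX + "radiation-therapy", SKOS + "prefLabel", ("Radiation therapy", "en")),
    (EX + "radiation-therapy", SKOS + "altLabel", ("Radiotherapy", "en")),
    (EX + "brachytherapy", RDF_TYPE, SKOS + "Concept"),
    (EX + "brachytherapy", SKOS + "prefLabel", ("Brachytherapy", "en")),
    (EX + "brachytherapy", SKOS + "broader", EX + "radiation-therapy"),
]


def match(s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o)
    ]


# Which concepts have Radiation therapy as their broader concept?
for subject, _predicate, _obj in match(p=SKOS + "broader", o=EX + "radiation-therapy"):
    print(subject)
```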

The development of SKOS aimed to build upon RDF to provide a recommended schema for thesauri. SKOS development was first undertaken as the Semantic Web Advanced Development for Europe (SWAD-Europe) project before being adopted and supported by the W3C in 2004. The W3C formally released the SKOS recommendation in 2009.

Meanwhile, the W3C had been working on other recommendations for web-based ontologies, including RDF Schema (RDFS) and Web Ontology Language (OWL). SKOS is compatible with RDFS and OWL, and elements from the different models can be combined. Furthermore, SKOS can even be considered as a very generic upper ontology itself, and the W3C documentation describes SKOS in terms of OWL and RDFS expressions.

The main types of elements of SKOS are concepts, lexical labels, documentation properties (notes), semantic relationships, mapping properties, and concept collections. (Concepts, concept schemes, and collections are ontology classes, and the others are ontology properties.) In their machine-readable form, the SKOS elements are concatenated with no spaces, such as prefLabel, scopeNote, and exactMatch.

SKOS Concepts, Labels, and Notes

SKOS is concept-centric. Making a distinction between concepts and labels is the biggest departure from traditional thesaurus standards and past controlled vocabulary practice. A concept is an idea of something, and a label is a name for that idea. Thus, a concept may have multiple labels. For the organization of a vocabulary, especially as a hierarchy, one of the various labels needs to be designated as the preferred displayed label. The others are alternative labels and hidden labels, the latter of which may be used to designate that the label should not display to end-users. Labels for the same concept may exist in multiple languages, but there may be only one preferred label per language.
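
The concept-and-label distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Concept class, its field names, and the example labels are hypothetical and not part of SKOS or any particular tool. It shows one preferred label per language, with alternative and hidden labels still available for matching a search or tagging term.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    uri: str
    pref_labels: dict = field(default_factory=dict)    # language -> the one preferred label
    alt_labels: dict = field(default_factory=dict)     # language -> list of alternative labels
    hidden_labels: dict = field(default_factory=dict)  # language -> list of hidden labels

    def set_pref_label(self, lang, text):
        # Enforce at most one preferred label per language.
        if lang in self.pref_labels:
            raise ValueError(f"{self.uri} already has a preferred label for {lang!r}")
        self.pref_labels[lang] = text

    def matches(self, query, lang):
        """True if the query equals any label of this concept in the given language."""
        candidates = [self.pref_labels.get(lang)]
        candidates += self.alt_labels.get(lang, [])
        candidates += self.hidden_labels.get(lang, [])
        return query.casefold() in {c.casefold() for c in candidates if c}


car = Concept("http://example.org/taxonomy/automobiles")  # hypothetical URI
car.set_pref_label("en", "Automobiles")
car.set_pref_label("fr", "Voitures")
car.alt_labels["en"] = ["Cars"]
car.hidden_labels["en"] = ["Automobles"]  # a common misspelling, matched but never displayed

# A misspelled query still finds the concept, and the preferred label is what gets displayed.
print(car.matches("automobles", "en"), car.pref_labels["en"])
```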

Notation is intended for a code that accompanies a label, such as an alphanumeric code of the kind commonly used in classification schemes.

Documentation comprises various types of notes, including scope note, editorial note, change note, and history note. Definition and example are additional documentation types. Scope notes are commonly used in thesauri to clarify the usage of a concept in tagging/indexing for the specific context of the controlled vocabulary and its set of content. They serve an important role for manual tagging. Other note types may be utilized for administration and management of the controlled vocabulary. Definitions may be entered for more technical controlled vocabularies or when the controlled vocabulary also serves the function of a glossary.

SKOS Concept Schemes and Collections

What constitutes an individual "taxonomy," "thesaurus" or other controlled vocabulary? This may not be very clear. SKOS introduces the formal organizing unit called a concept scheme, as a “collection of concepts.” A concept scheme is a single controlled vocabulary, thesaurus, hierarchical taxonomy, facet within a faceted taxonomy, or metadata property within a larger metadata schema.

There are some advanced, lesser-used features of SKOS, including in scheme, which allows you to control whether a concept is in a concept scheme regardless of whether it’s within the concept scheme’s hierarchy (which is otherwise the default). There is also a special designation of top concept for the top concepts of a concept scheme, a designation which could be utilized for a front-end display implementation.

Collections are an additional optional way to designate a grouping of concepts for a purpose, such as the taxonomy concepts to be used in only specified implementations or those of subject categories for subject matter expert review. Furthermore, concepts can be ordered within collections.

SKOS Relations and Mapping Properties

SKOS includes what are called semantic relations, although this name could cause confusion, since they are the basic thesaurus relationships (broader, narrower, and related), not customizable semantic relations characteristic of ontologies. These thesaural-type relationships are used between concepts within the same concept scheme. In addition, SKOS specifies broader transitive and narrower transitive, meaning the inheritance of the relationship to additional levels of the hierarchy. This is usually assumed to be the case by default, and thus these specifically transitive relations are rarely implemented, but if there are reasons not to inherit and extend the logical hierarchy by default, then the transitive relations may be used. (I have not come across a use case, though.)
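
As a rough illustration of what the transitive relations express, the following Python sketch derives every ancestor of a concept by following direct broader links repeatedly. The concept names and the function name broader_transitive are hypothetical, and a real tool may compute or store this differently.

```python
# Direct broader links: each concept maps to its immediate broader concepts.
broader = {
    "Poodles": ["Dogs"],
    "Dogs": ["Mammals"],
    "Mammals": ["Animals"],
    "Animals": [],
}


def broader_transitive(concept):
    """Collect every ancestor reachable by following broader links repeatedly."""
    seen = set()
    stack = list(broader.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, []))
    return seen


print(sorted(broader_transitive("Poodles")))  # ['Animals', 'Dogs', 'Mammals']
```

Content tagged with Poodles could then also be retrieved under Dogs, Mammals, or Animals, which is the behavior most implementations assume for a hierarchy.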

Since SKOS specifies concept schemes, SKOS also specifies an additional set of relation types called mapping properties that are to be used between concepts in different concept schemes or different taxonomies. These comprise exact match, close match, narrower match, broader match, and related match. Exact match and close match are used to map existing taxonomies together, often so that one is used in the tagging and the other is used in the retrieval. The other mapping relations may be used to extend one taxonomy with another while still maintaining a distinction between the two.
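
The tagging-versus-retrieval use of exact match and close match can be sketched as follows. Everything here is a hypothetical illustration: the idx: and web: prefixes stand for two made-up concept schemes, and the documents, mappings, and the retrieve function are assumptions for the example.

```python
# Hypothetical mappings from an indexing vocabulary (idx:) to a search vocabulary (web:).
mappings = [
    ("idx:Myocardial_infarction", "exactMatch", "web:Heart_attack"),
    ("idx:Neoplasms", "closeMatch", "web:Cancer"),
    ("idx:Cardiology", "relatedMatch", "web:Heart_health"),
]

# Documents are tagged only with concepts from the indexing vocabulary.
tags = {
    "doc1": {"idx:Myocardial_infarction"},
    "doc2": {"idx:Neoplasms"},
    "doc3": {"idx:Cardiology"},
}

# Only these mapping types are treated as equivalent enough for retrieval.
EQUIVALENCE = {"exactMatch", "closeMatch"}


def retrieve(search_concept):
    """Find documents whose tags map to the search concept by exact or close match."""
    sources = {s for s, p, o in mappings if o == search_concept and p in EQUIVALENCE}
    return sorted(doc for doc, doc_tags in tags.items() if doc_tags & sources)


print(retrieve("web:Heart_attack"))  # ['doc1']
```

Treating only exact and close matches as retrieval equivalents is a design choice in this sketch. The related match for Cardiology is deliberately ignored, so a search on web:Heart_health returns nothing.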

Following is a table of SKOS elements by type (class or property) with the concatenated machine-readable forms.
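
For readers who prefer to see the machine-readable forms in code, here is a minimal Python sketch that lists a selection of names from the SKOS vocabulary and expands a name into its full URI. The element names and the namespace URI come from SKOS itself, but the grouping headings and the helper name skos_uri are illustrative assumptions.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

# A selection of SKOS element names, grouped under informal headings.
skos_elements = {
    "classes": ["Concept", "ConceptScheme", "Collection", "OrderedCollection"],
    "lexical labels": ["prefLabel", "altLabel", "hiddenLabel"],
    "notations": ["notation"],
    "documentation": ["note", "scopeNote", "definition", "example",
                      "editorialNote", "changeNote", "historyNote"],
    "scheme membership": ["inScheme", "hasTopConcept", "topConceptOf"],
    "semantic relations": ["broader", "narrower", "related",
                           "broaderTransitive", "narrowerTransitive"],
    "mapping properties": ["exactMatch", "closeMatch", "broadMatch",
                           "narrowMatch", "relatedMatch"],
    "collections": ["member", "memberList"],
}


def skos_uri(name):
    """Expand a SKOS element name into its full URI, rejecting unknown names."""
    known = {n for names in skos_elements.values() for n in names}
    if name not in known:
        raise KeyError(f"{name!r} is not in the element list above")
    return SKOS_NS + name


print(skos_uri("prefLabel"))
```

Classes begin with a capital letter and properties with a lowercase letter. Note that the machine-readable forms of broader match and narrower match are broadMatch and narrowMatch.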

Implementation of SKOS

Most commercial and open-source taxonomy/thesaurus management software now supports SKOS. There are also simple free tools called SKOS editors. In these tools, SKOS elements are presented by their full human-readable names (such as Preferred Label, instead of prefLabel), so they are intuitive to understand. Thus, taxonomists don’t have to worry about the machine-readable details of SKOS, but should at least be familiar with its principles. Familiarity with SKOS makes it easier to switch from using one software package to another. Software packages may vary, however, in how well they support some of the less common features, such as in scheme, collections, and broader/narrower transitive.

Taxonomy/thesaurus management software often has the additional administrative grouping of related concept schemes for the same implementation into what may be called a “project” or “knowledge model.” SKOS mapping relations tend to be used more often across concept schemes that are managed in different projects, rather than within the same project. Within the same project, concept schemes tend to represent facets (which have no relations between them) or ontology classes (which have customized semantic relations between them).

Since all elements of SKOS are standardized and machine-readable, you can leverage any element with rules for usage, such as for how tagging should be done and how concepts and relationships are displayed. Custom applications of SKOS vocabularies are thus common.

If you want to dive into all the details of SKOS, consult the SKOS documentation published by the W3C.

SKOS is intended to be flexible, and it is more suggestive than restrictive. Thus, a SKOS-based taxonomy or thesaurus could still be poorly designed, and that’s why the other standards for best practices, ANSI/NISO Z39.19 and ISO 25964, are also important.

Friday, December 17, 2021

Named Entities in Taxonomies

I have long felt that there is some uncertainty as to where named entities (names of specific people, places, organizations, products, etc.) fit into taxonomies. Standards suggest one way, and practice tends to follow a different way in dealing with these proper nouns. As taxonomy trends evolve, so does the position on these named entities. The fact that taxonomies are not well-defined leaves it open to question as to whether taxonomies should have any named entities in them, or if taxonomies should comprise only topics.

Historical trends

A historical perspective is needed. Modern, digital information retrieval taxonomies evolved out of thesauri. Thesauri, which originally came out in print format, first appeared in the 1960s and then were formalized by various standards published in the 1970s. The thesaurus standards state clearly that the relationship between a named instance and its type is one of the three kinds of hierarchical relationships permitted and supported in thesauri (the other two being generic-specific and whole-part). While taxonomies may omit the associative (related term) relationship of thesauri, they tend to follow the hierarchical standards of thesauri. Thus, named entities could be included in the taxonomy as the narrowest terms, narrower to a term for whatever “type” they are. But should it always be this way?

Then faceted taxonomies started being implemented in the early 2000s, first in ecommerce and then by the end of the decade in intranets, content management systems, digital asset management systems, and various content-rich websites. Once facets were adopted in information retrieval applications (aside from ecommerce), it became obvious from a user design perspective that named entities belonged in a different facet than the subjects. Facets are for refining a complex search query by different aspects. Sometimes these aspects follow the types of questions: What? Who? Where? When? “What” is usually for a subject, but “who,” “where,” and “when” (for taxonomy terms naming events, not date ranges) refer to named entities. Sometimes people start a query about a subject, and sometimes people start a query about a named entity, and facets allow people to start off searching any way they wish.

Then in 2009 the World Wide Web Consortium published the Simple Knowledge Organization System (SKOS) recommendation for taxonomies, thesauri, and other controlled vocabularies, which over the following decade became adopted as the standard model for building machine-readable taxonomies. One of the elements described in SKOS is that of the concept scheme, which is defined merely as “an aggregation of one or more SKOS concepts.” There is nothing comparable in the thesaurus standards. While a taxonomist may choose what to do with an “aggregation” of concepts, it has proven practical to separate out different kinds of named entities into concept schemes separate from concept schemes for topics. Thus, the widespread adoption of SKOS has contributed to the trend of separating different named entity sets, which had already started with faceted taxonomies.

My initial, and longest, experience in the domain of taxonomies and controlled vocabularies was as a controlled vocabulary editor at the library database vendor Gale. At Gale (and its predecessor company), named entity controlled vocabularies ("name authorities") have been separate from the subjects, but there were reasons for this. The named entity types (named persons, companies, organizations and agencies, named works, products, laws, events, and fictional characters) have each had different sets of attributes and rules for maintenance. Some even have different customized relationships with other controlled vocabularies. Interestingly, it was not always this way. Before I joined in the mid-1990s, some of these named entities (agencies, organizations, works, geographics, and events) were mixed in with the “descriptors” in a Subject MegaFile. But eventually specific attributes and relations, not to mention the growing number of terms and a new vocabulary management system, combined to make it more logical to split off each of the named entity vocabularies. The Events were the last to be split out of the Subjects. So, it’s not because the controlled vocabularies were named entities per se, but rather their growing specialized maintenance needs due to an increase in specific attributes that led to managing them as separate controlled vocabularies. Attributes include, for example, birth date and place for a person, latitude and longitude for a location, and website URL and address for companies and organizations, among many more.

Taxonomies and ontologies

This feature of attributes brings us to the most recent trend in taxonomies, which is the occasional, but growing, convergence of taxonomies and ontologies. Ontologies divide up a knowledge domain into classes, and each class (like the Gale named-entity controlled vocabularies) has its own set of attributes and customized relationships with other classes. Ontologies, according to the Web Ontology Language (OWL) standard, however, have a different perspective on named entities. Ontologies are composed of classes and subclasses, in hierarchies, which, in turn, contain “instances” or “individuals,” which are unique named entities. The relationship between an instance and a class (or subclass) is not, however, considered hierarchical, but rather of a “member” type. Thus, while thesauri make no distinction for named entities, and taxonomies separate out named entities when it’s practical, ontologies make a strict distinction.

Furthermore, for ontologies, which originated in the domains of philosophy and computer science, a named entity as a proper noun is not what matters. Rather, it’s the fact that the instance is unique, and there is only one. This is true for people, companies/organizations, and places. It is not true for brand name products, though. A named product is a proper noun, such as MacBook Pro or Honda Accord, but it is not a unique instance, because there are millions of individual MacBook Pros and Honda Accords in existence. It’s a similar matter for named works, such as books, where one title has millions of copies. “Named entities” or “proper nouns” are grammatical or linguistic designations, which are OK for taxonomies and thesauri, but are not a feature of ontologies, with their philosophical origins.

Fortunately, you don’t have to worry about this philosophical problem if you choose to follow the approach of applying a high-level ontology model to an existing taxonomy or set of controlled vocabularies to extend the ontology with specific terms and named entities (or, from the other direction, to extend the taxonomy with semantic relations and attributes). The OWL-based ontology then may comprise only as many classes and subclasses as are needed to designate the usage of distinct custom relations and attributes. With this approach, a different ontology class is mapped to each subset or hierarchy or SKOS concept scheme of a larger taxonomy. Each named entity type would typically correspond to a different ontology class, based on the named entity’s own attributes and relations. So, each named entity type would be in its own controlled vocabulary or SKOS concept scheme.
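
A minimal Python sketch of this class-to-scheme mapping might look like the following. The class names, concept scheme names, attribute names, and the function invalid_attributes are all hypothetical, loosely based on the attribute examples given earlier in this post. It is an illustration of the idea, not a description of how any particular tool implements it.

```python
# Hypothetical ontology classes, each with the attributes it allows.
class_attributes = {
    "Person": {"birthDate", "birthPlace"},
    "Place": {"latitude", "longitude"},
    "Organization": {"websiteUrl", "address"},
    "Topic": set(),
}

# Each concept scheme of the taxonomy is mapped to exactly one ontology class.
scheme_to_class = {
    "People": "Person",
    "Locations": "Place",
    "Companies": "Organization",
    "Subjects": "Topic",
}


def invalid_attributes(scheme, attributes):
    """Return the attribute names not allowed by the class mapped to the scheme."""
    allowed = class_attributes[scheme_to_class[scheme]]
    return sorted(set(attributes) - allowed)


# A location concept carrying a person attribute is flagged.
print(invalid_attributes("Locations", {"latitude": 48.85, "birthDate": "1900-01-01"}))
```

Because the attributes hang off the class rather than off individual concepts, every concept in the Locations scheme gets the same attribute set without any of them needing to be modeled as an OWL instance.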

The fact that OWL ontologies may include named instances as members of a subclass does not mean you have to set up your knowledge model that way. This is similar to the idea of the thesaurus standard, which permits named entities to be narrower terms to generic subjects, but you don’t have to set it up that way. Omitting an option described in the thesaurus or ontology standards does not mean you are not in compliance with those standards.

So, in conclusion, while some things about taxonomies have remained constant, other things, such as where to put named entities, have changed over time.