The Accidental Taxonomist: Taxonomy definitions

Wednesday, July 31, 2024

Subject Headings vs. Taxonomies

When I spoke about taxonomies at the recent SLA (Special Libraries Association) annual conference, I was asked how a taxonomy differs from a subject heading scheme. Librarians are very familiar with subject headings, which are used to catalog books and other library materials. I answered this interesting question briefly in my presentation session, but I’d like to explain further.

I have previously written about how a taxonomy differs from a classification in “Classification Systems vs. Taxonomies.” Taxonomies are more similar to subject heading schemes. Libraries use both classification systems (such as the Dewey Decimal Classification), which determine the physical location of books and other library materials on shelves based on their codes, and subject heading schemes (such as Library of Congress Subject Headings), which are used to identify books and other materials by their specific subject matter. The same subject could be used to catalog books and materials of different types (nonfiction, fiction, sound recordings, children’s) with very different classifications.

How Taxonomies and Subject Heading Schemes are Similar

Taxonomies and subject heading schemes are both considered types of controlled vocabularies, and they share similar uses and features. They both serve users who are looking up subjects to find information or resources available on the subject, rather than (or not yet) for identifying the physical location of the resource. In addition, they both:

have structures, but their focus is on the concepts
can be both searched and browsed
exist for both general and specific subject domains (Medical Subject Headings (MeSH) published by the National Library of Medicine is an example of a specific subject-domain subject heading scheme.)
have some structured, thesaurus-type relationships between terms, including broader/narrower and related
bring together different names, as synonyms/alternative labels/nonpreferred terms/used for terms
may include named entities (proper nouns for people, organizations, or geographic places) alongside topical subjects
may have scope notes on select terms
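
The shared features listed above can be pictured as a small data structure. The following is a minimal sketch in Python, not the format of any particular scheme or tool; the class name `Concept`, its field names, the `build_lookup` helper, and the sample terms are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One term in a controlled vocabulary (hypothetical structure)."""
    pref_label: str
    alt_labels: list[str] = field(default_factory=list)  # synonyms / nonpreferred terms
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)
    scope_note: str = ""


def build_lookup(concepts):
    """Map every label (preferred or nonpreferred), lowercased, to its preferred label."""
    lookup = {}
    for c in concepts:
        lookup[c.pref_label.lower()] = c.pref_label
        for alt in c.alt_labels:
            lookup[alt.lower()] = c.pref_label
    return lookup


# Illustrative sample data only, not taken from a published vocabulary.
concepts = [
    Concept("Automobiles", alt_labels=["Cars", "Motor cars"],
            broader=["Motor vehicles"], narrower=["Electric automobiles"],
            related=["Roads"]),
    Concept("Motor vehicles", narrower=["Automobiles"]),
]

lookup = build_lookup(concepts)
print(lookup["cars"])  # Automobiles
```

Whether the structure is called a taxonomy or a subject heading scheme, a user who looks up a nonpreferred name such as "Cars" is led to the same preferred term.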

How Taxonomies and Subject Heading Schemes Differ

With so many similarities, one might wonder if there are any differences between subject heading schemes and taxonomies.

Subject heading schemes and taxonomies have different histories and originally different formats. Subject heading schemes were designed for the print format and have been adapted to digital environments, whereas information “taxonomies” as we know them have existed only after the emergence of digital navigation and search systems.

Structural Differences with Subdivisions

The name “subject headings” refers to the traditional browsable display of headings in an index, and under headings may appear sub-headings or subdivisions to further refine multiple references/citations/linked results. This structure is the main difference between subject heading schemes and taxonomies. The heading-subheading/subdivision structure is characteristic of back-of-the-book indexes and indexes to articles when such indexes previously appeared in print, although it is still used online.

A subject heading may be subdivided by the addition of different types of subdivisions: topical, geographical (such as a country name), chronological (such as a century, decade, or wartime period), and form (for the content type, such as Periodicals). Some topical subdivisions are rather generic and can be applied to many headings, such as “Management,” “Research,” or “Law and legislation,” but most are specific to only a limited number of headings. For example, the subdivision “Lighting” is to be used under headings for structures, rooms, vehicles, installations, etc. See the full list of Library of Congress subdivisions.

The way that subdivisions refine a heading can be compared to the function of facets in a faceted taxonomy, which was noted by someone in the audience of my conference session. (See also the post “Faceted Classification and Faceted Taxonomies.”) Subdivisions and facets are both aspects of something. That does not mean, however, that a faceted taxonomy and a subject heading scheme are the same.

The structure of a faceted taxonomy has facets at the top-level, and the facets are relevant to a specific set of content, so they are aspects of the content, rather than aspects of a heading term.
There can be hierarchies of terms within a facet of a faceted taxonomy, but subdivisions do not have internal hierarchy. Instead, subdivisions may subdivide each other, but this is more like a prescribed navigation path, and they must follow a standard sequence. For example:
English literature—20th century—History and criticism

Application Differences of Subdivisions vs. Attributes

Another facet-like implementation of taxonomies is to have attributes to refine the search results of a specific term within a hierarchical taxonomy. Attributes are common in e-commerce taxonomies, which involve a hierarchical taxonomy for product categories and attributes for product features. Attributes are more like subdivisions, in the way that they refine topics from the hierarchical taxonomy, but they are applied (tagged) differently than subdivisions.

The combination of a subject heading and a subdivision is done at the time of indexing an article or cataloging a book, and there are rules about which combinations are permitted. The combinations are indexed as if they were a single compound concept. Catalogers are required to use established heading-subdivision combinations and cannot just make up their own. Any string of multiple subdivisions must be applied in a prescribed order, such as geographic-topical-chronological-form for Library of Congress Subject Headings that are topics authorized for geographic subdivision.
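
The prescribed-order rule can be sketched in code. This is a simplified illustration that assumes only the geographic-topical-chronological-form sequence mentioned above. The function name `build_heading_string`, the `ORDER` list, and the sample heading are hypothetical, and the sample string has not been checked against the Library of Congress authority file. Real cataloging rules also check whether each heading-subdivision combination is authorized, which this sketch does not attempt.

```python
# Assumed sequence, taken from the order described above.
ORDER = ["geographic", "topical", "chronological", "form"]


def build_heading_string(heading, subdivisions):
    """Join a heading with its subdivisions in the prescribed order.

    `subdivisions` maps a subdivision type to its value, e.g.
    {"form": "Periodicals", "geographic": "France"}.
    """
    unknown = set(subdivisions) - set(ORDER)
    if unknown:
        raise ValueError(f"Unknown subdivision types: {sorted(unknown)}")
    parts = [heading] + [subdivisions[t] for t in ORDER if t in subdivisions]
    return "--".join(parts)


# The input order does not matter; the output order is fixed.
print(build_heading_string(
    "Agriculture",
    {"form": "Periodicals", "chronological": "20th century", "geographic": "France"},
))
# Agriculture--France--20th century--Periodicals
```

The point of the sketch is that the indexer supplies the parts, but the scheme, not the indexer, decides the sequence in which they are strung together.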

Unlike subject headings and subdivisions in cataloging or indexing, taxonomy terms and their refining attributes:

are assigned more independently of each other, although the type of taxonomy term may restrict which attributes are available
have a greater number of attribute types available, and a piece of content is typically tagged with values from most or all of the attribute types
may have more than one value of the same attribute type applied (such as an item having two colors)
have no ranked order in which attributes are applied or searched
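
By contrast, attribute tagging can be sketched as a set of independent, unordered, possibly multi-valued fields. The following is a minimal illustration that assumes a hypothetical e-commerce item; the category path, the attribute names, and the `matches` helper are invented for this example.

```python
# A tagged item: one hierarchical category plus independent attributes.
item = {
    "category": "Clothing > Outerwear > Jackets",
    "attributes": {
        "color": {"black", "red"},  # two values of the same attribute type
        "material": {"wool"},
        "size": {"M"},
    },
}


def matches(item, filters):
    """True if the item has at least one requested value for every filtered attribute.

    `filters` maps an attribute name to a set of acceptable values;
    the order of the filters is irrelevant.
    """
    return all(
        item["attributes"].get(name, set()) & wanted
        for name, wanted in filters.items()
    )


print(matches(item, {"color": {"red"}, "size": {"M", "L"}}))  # True
print(matches(item, {"size": {"M"}, "color": {"blue"}}))      # False
```

Nothing here is combined into a single compound string at tagging time; the combinations are made only when a user applies filters.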

Convergence of Subject Heading Schemes and Taxonomies

While subject heading schemes and taxonomies have traditionally had different styles, they have become more similar in more recent decades.

Many subject heading schemes and taxonomies have both adopted thesaurus features. Originally, the Library of Congress Subject Headings had only See (Use) and See also relationships (as in an index), but in 1987 it adopted the thesaurus relationships of broader term/narrower term, and related term in place of See also. Meanwhile, the differences between taxonomies and thesauri have also been blurred, as taxonomies may have related-term relationships, and thesauri may have an over-arching hierarchical structure. The leading reason taxonomies and thesauri are difficult to distinguish, in my opinion, is that the same software tools are used to develop and manage both, and the software makes no distinction between “taxonomy” and “thesaurus.”

Another way in which subject headings have become more like taxonomies is that subject headings may be used without subdivisions. This is increasingly common as subject headings get reused in search and retrieval systems that do not support the complexity of subdivisions. For example, newer online publishers of medical information have adopted Medical Subject Headings without their subdivisions, which are still used by the National Library of Medicine. Additionally, auto-tagging is not easily done with multiple levels of indexing. Without subdivisions, subject heading schemes are essentially the same as taxonomies, as long as they have a hierarchical structure.

Conclusions

Taxonomies have similarities and differences to both classification systems and to subject heading schemes. In fact, I would say that the modern information taxonomies have inherited features of both. Taxonomies are not always well defined, but they are flexible and adaptable to business needs.

Controlled vocabularies have existed for a long time, but their applications are becoming more varied. This has led to differences and also convergences of their features. Nevertheless, certain controlled vocabularies are more common in certain implementations. Subject heading schemes remain common in libraries, whereas taxonomies are more common in business and commercial implementations.

Friday, December 30, 2022

Taxonomy Definition

I usually explain that a taxonomy is a structured kind of controlled vocabulary, which is a list of terms (or concepts) usually used to tag content to aid in its retrieval. The structure can be hierarchical, faceted, or a combination. Other people have defined taxonomies for a general audience in more simplistic ways as a kind of hierarchical classification system. So, while a taxonomy has two main features (naming and structure), my preferred definition has focused on the controlled vocabulary and naming aspect, whereas other definitions focus on the hierarchical classification aspect of taxonomies. However, a taxonomy and a classification system are not necessarily the same. While it is understandable that a definition is simplified for a general audience, it should not be simplified to the extent of being misleading.

I have blogged previously on the differences between taxonomies and classification systems, so I won’t repeat all the differences again. The main point is that a classification system is generic and rigid and is intended to be used widely, such as the Dewey Decimal Classification for libraries, whereas a taxonomy tends to be customized for a particular use case and context and is flexible and undergoes changes.

Meanwhile, there are also a few well-known classification systems that are called “taxonomies,” such as the Linnaean taxonomy of organisms and Bloom’s taxonomy of educational objectives. These seem quite different from the information-retrieval type of taxonomy. The Linnaean hierarchical levels have names (Kingdom, Phylum, Class, etc.). The relationships of the hierarchical levels to each other do not include all of those in the thesaurus standards: generic-specific, generic-instance, or whole-part. Rather, the Linnaean taxonomic relationships are generic-specific only, or more precisely that of member of class or subclass. Bloom's taxonomy has a completely different hierarchical model that does not follow thesaurus standards at all.

How does a taxonomy of concepts for information retrieval relate to a scientific taxonomy? They are similar, and the differences are not so great that they should be considered different meanings of the word “taxonomy.” If we consider that taxonomies are systems to name and organize things hierarchically, then a taxonomy for information retrieval, comprised of terms for tagging and retrieving content (documents, images, etc.), can be considered a taxonomy of a controlled vocabulary, in contrast to taxonomies of things, such as organisms. This is a slightly different perspective from considering a taxonomy as a kind of controlled vocabulary, as I previously had. The following diagram illustrates a possible way to consider how information-retrieval taxonomies relate to classification systems and controlled vocabularies.

Diagram showing that information taxonomies are at the intersection of classification systems and controlled vocabularies

Several kinds of knowledge organization systems are defined by their published standards. For thesauri, there are ANSI/NISO Z39.19 and ISO 25964. For terminologies, there is ISO/TC 37/SC 3 and other related standards. For ontologies, there is OWL (Web Ontology Language) from the W3C. There is no standard, however, specifically for “taxonomies” or even for “classification systems,” which is a reason why these remain difficult to define. The designations “classification system,” “classification scheme,” and “taxonomy” have been used interchangeably.

Wikipedia provides the definition at the entry for Taxonomy: “A taxonomy (or taxonomical classification) is a scheme of classification, especially a hierarchical classification, in which things are organized into groups or types.” But then it goes on to say, “it may refer to a categorisation of things or concepts.” Thus, an information-retrieval taxonomy is a categorization of concepts (also called terms in a controlled vocabulary). It is not a classification system, since the goal is not to classify things, not even the things tagged with the taxonomy concepts, but rather to organize the set of concepts that have been identified as appropriate for tagging and retrieving a set of content.

Sunday, February 9, 2020

Classification Systems vs. Taxonomies

Is a taxonomy the same as a classification scheme or system? Or, to put it another way, is a classification system, such as the Dewey Decimal System, a kind of taxonomy? Both of these kinds of knowledge organization systems have the feature of arranging topical terms in a hierarchy of multiple levels, without having related-term relationships or necessarily synonyms/nonpreferred terms, which are features of thesauri. So, it appears as if the only difference is that classification systems have some kind of notation or alphanumeric code associated with each term, and taxonomies do not. The differences, however, are greater than that.

Classification systems

The codes/notations in classifications are not merely shortcut conveniences. They represent a way to divide up the area of knowledge into broad classes, sub-classes, sub-sub-classes, etc. The codes/notations are not an after-thought but are planned from the beginning of the design of a classification system.

The classification is comprehensive; everything in the subject domain is covered with a classification code + label. There is often not a lot of room for expansion, except for a few unused sub-unit codes in each area for new topics. The word classification means to put into a predefined class or grouping. The approach to classification is thinking “where does this go?” (Digital documents may go into more than one classification.)

Classification systems are not just used in libraries, but in corporate settings too, such as for research literature or detailed manufacturing product catalogs. The standard for defining knowledge organization systems for interoperability on the web, the Simple Knowledge Organization System (SKOS), developed by the World Wide Web Consortium (W3C), recognizes classification systems by having a designated element for “notation.”
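
As an illustration of the notation element, the sketch below writes a classification entry as SKOS in Turtle syntax using plain string formatting. `skos:notation`, `skos:prefLabel`, and `skos:broader` are SKOS properties; the `ex:` namespace, the function name `to_turtle`, the concept identifiers, and the choice of sample class are assumptions for illustration. The sketch does not escape quotation marks in labels, so it is suitable only for simple values.

```python
def to_turtle(concept_id, notation, label, broader_id=None):
    """Return one SKOS concept as a Turtle snippet (prefix declarations not included)."""
    lines = [
        f"ex:{concept_id} a skos:Concept ;",
        f'    skos:notation "{notation}" ;',
        f'    skos:prefLabel "{label}"@en',
    ]
    if broader_id is not None:
        lines[-1] += " ;"
        lines.append(f"    skos:broader ex:{broader_id}")
    lines[-1] += " ."
    return "\n".join(lines)


# The ex: namespace is a placeholder, not a real vocabulary.
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix ex: <http://example.org/classification/> .\n"
)

print(PREFIXES)
print(to_turtle("c510", "510", "Mathematics", broader_id="c500"))
```

The code/notation is carried as its own property alongside the label, which is what lets a classification system be exchanged in the same format as a taxonomy or thesaurus.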

Taxonomies

A taxonomy is a kind of knowledge organization system that has its terms hierarchically related to each other. The starting point in creating a taxonomy might be a few top terms or facets, but then the focus of taxonomy development is on the specific terms needed, rather than the division of a domain into classes and subclasses, etc. What this means is that the terms do not have to comprehensively cover the subject domain in an abstract manner. Rather the terms have to “cover” the topics appearing in the body of content to be tagged with the taxonomy.

The taxonomy is used for tagging or indexing, not for classification or cataloging. So, rather than thinking where (into what class) does this document go, the question is, what is/are the main topic(s) of this document. The topics might not fall into neat balanced hierarchies. For example, an intranet taxonomy might have a term for Temporary employees, because there are some human resources policies dealing with this topic specifically, but have no term for Full-time employees, since that is the default, and the term would not be useful (and likely inconsistently tagged).

Taxonomies vs. Classification Systems Comparison Table

Different mindsets

Lumpers and splitters are historically two opposing viewpoints in categorization and classification: whether you "lump" items into large categories, focusing on the similarities, or "split" items into a larger number of smaller categories, focusing on the differences. Of course, there is often a combination of both approaches, but it is my feeling that the design of modern taxonomies tends to involve more lumping, whereas the design of classification systems has involved more splitting.

One of the challenges of working with subject matter experts (SMEs) in building a taxonomy is that SMEs, as experts in their domain, may tend to think of how to classify their domain, and propose a taxonomy that resembles a classification system, even if it lacks the codes/notations. So, it’s very important to provide precise guidelines to SMEs contributing to a taxonomy, explaining that the terms are intended for tagging common topics that appear in the content and are for limiting/filtering search results, and that full classification is not necessary.

Students of library science may also tend to think of classification systems as serving for taxonomies. They learn about classification systems when they study cataloging, and subject cataloging is also about where the book or other library material belongs (often literally, on the shelf). So, even librarians need training on taxonomies and the taxonomy mindset if they want to become taxonomists. I will be giving a taxonomy workshop at the Computers in Libraries conference in March, so I will be sharing these ideas with those who attend.

Wednesday, June 22, 2016

Taxonomies vs. Thesauri: Practical Implementations

The differences between taxonomies and thesauri, and when to implement which, have been the subject of previous presentations of mine and a previous blog post, Taxonomies vs. Thesauri. Most recently, I presented a case study of controlled vocabularies at Cengage Learning at the “Taxonomy Café” session at the SLA annual conference this month, and the post-presentation roundtable discussions got me thinking more about the differences in practical implementations.

To summarize the differences, while both taxonomies and thesauri have hierarchical relationships among their terms, in a taxonomy all terms are connected into a few large hierarchies with a limited number of top terms so as to serve top-down navigation or drilling-down of topics. While faceted taxonomies function differently, each facet label can be seen as a top term. Associative relationships (related terms) are a standard feature of thesauri but not of taxonomies. Synonyms/nonpreferred terms/alternate labels are required for thesauri, but could be optional in small taxonomies. Taxonomies serve browsing and drilling down by end users who are exploring topics, whereas thesauri serve users who search for (look up) a specific concept and then may follow “use” (preferred term), broader, narrower, or related term links to find the best term. A taxonomy works well for a controlled vocabulary that is limited in scope and easily categorized into hierarchies, whereas a thesaurus works better for content and a set of terms that is not easily categorizable and does not have a limited scope.
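
The top-down navigation that a taxonomy serves can be sketched as a walk through a nested structure. This is a minimal illustration only; the sample terms and the `drill_down` function are hypothetical and do not reflect any particular product's data model.

```python
# Hypothetical taxonomy with a limited number of top terms, stored as nested dicts.
taxonomy = {
    "Products": {
        "Beverages": {"Coffee": {}, "Tea": {}},
        "Snacks": {},
    },
    "Services": {
        "Delivery": {},
    },
}


def drill_down(tree, path):
    """Return the sorted narrower terms found by following `path` from the top."""
    node = tree
    for term in path:
        node = node[term]  # raises KeyError if the path is not in the taxonomy
    return sorted(node)


print(drill_down(taxonomy, []))                         # ['Products', 'Services']
print(drill_down(taxonomy, ["Products", "Beverages"]))  # ['Coffee', 'Tea']
```

Every term is reachable from a top term, which is what makes browsing possible. A thesaurus term, by contrast, may be reachable only by lookup and by following its relationship links.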

In practice, I have found that taxonomies are useful for classifying products and services (such as in ecommerce), general enterprise document management, implementations in content management systems which support taxonomies, and all faceted or filtering implementations (SharePoint search, Endeca, and other post-search filtering enterprise search software). Thesauri, on the other hand, are more suitable for indexing and retrieving research literature (articles, white papers, conference presentations and proceedings, patents, etc.), whether commercially published or not.

Taxonomies are easier to create and often easier to implement than thesauri. They generally do not have associative (related term) relationships. In the absence of associative relationships between terms and with the emphasis on creating large top-term hierarchies, the thesaurus standard (ANSI/NISO Z39.19) rules for hierarchical relationships do not always have to be strictly followed. The inclusion of synonyms/nonpreferred terms also tends to be less thorough in taxonomies than in thesauri. Thesauri, on the other hand, require greater expertise in the field of information/knowledge organization, particularly to distinguish between hierarchical and associative relationships and to create the optimal number of those relationships and the optimal number of nonpreferred terms. Taxonomies, whether hierarchical or faceted, also tend to be easy to understand and use, are accommodated by out-of-the-box content management software, and are easier to maintain (and could be maintained by subject matter experts instead of taxonomists). Therefore, if a taxonomy, rather than a thesaurus, will suffice, then it makes more sense to create and maintain a taxonomy.

Thesauri, on the other hand, are more appropriate for indexing repositories of content for research because they do not restrict the inclusion of terms to established hierarchies. Any terms that represent a minimal threshold of content can be added, even if at first glance they may seem out of scope. For example, a term “Hot drinks” would not likely fit into a taxonomy on health/medicine, but the term would be desired for articles on research correlating the drinking of very hot beverages to esophageal cancer. Thesauri allow for inclusion of terms that, in combination with other terms, can achieve a more nuanced meaning, which may be needed in the research and discovery of what is contained in a body of research literature.

Indeed, in practice, the majority of new controlled vocabularies that are being created are taxonomies, not thesauri, and in fact taxonomies are usually all that are needed. The new implementations tend to be of the kind that are suitable for taxonomies. New repositories of documents for research, on the other hand, while highly important to be indexed with thesauri, do not arise as frequently. More often, collections of documents for researching are already established and often already have thesauri. These thesauri do require the work of taxonomists to update and maintain them, though.