Sunday, August 18, 2024

Taxonomies and Ontologies as Semantic Models

In describing what taxonomies and ontologies are and what they can do, we are hearing the word “semantics” more often. “Semantics” means “meaning,” which is nothing new, and taxonomies and ontologies are not new. What is new is that taxonomies and ontologies are now combined more, and we need a way to describe them together, and that involves the description of “semantic.” Furthermore, taxonomies and ontologies are being implemented in new and expanded applications, where the word semantic(s) has significance.

Semantics in Taxonomies and Ontologies

Taxonomies have semantics in their concepts. A taxonomy is not just a term base or a term list, but rather is an organized set of concepts, each with its own unambiguous meaning. The concepts bring together different labels, like “synonyms” for the same thing, and their meaning and usage are further clarified by their arrangement in a hierarchy. It’s often said that a taxonomy comprises “things” (concepts), not mere “strings” (of text).

Ontologies have a higher level of semantics than taxonomies. Even if they don’t contain synonyms, the relationships between concepts (entities) and sets (classes) of entities have additional semantics. The relationships in an ontology convey meanings beyond mere hierarchy or a generic “related term.” For example, relationships between entities may be “is located in,” “has customer,” and “sells product.” Furthermore, entities in an ontology may have various types of attributes, such as contact information for offices and people, which is another application of semantic data.

Bringing Together Taxonomies and Ontologies

Taxonomies and ontologies have different origins, but now they are increasingly based on shared Semantic Web data models and guidelines, which enables them to be integrated seamlessly. Taxonomies have their origins in library science structures, including thesauri, subject headings, and classification schemes. Ontologies have their origins in computer science and data science with a focus on data models.

Combining them brings the benefits of both: the linguistic aspect of controlled terms and their synonyms with hierarchical structure in taxonomies, and the custom semantic relationships and other additional properties provided by ontologies. This allows users to search for concepts/things, not just text strings, while also linking to other things related in a specific way and creating complex multi-step queries.
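To make the idea of a multi-step query concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the concept IDs, labels, and the “has office” relationship are hypothetical, and a real implementation would normally sit on a graph database or triple store rather than in-memory lists. The sketch resolves any synonym to a concept (the taxonomy side) and then follows named relationships (the ontology side).

```python
# Hypothetical concepts: one ID per "thing"; the first label is the preferred one.
LABELS = {
    "c1": ["Acme Corporation", "Acme Corp", "Acme"],
    "c2": ["Boston office", "Boston branch"],
    "c3": ["Massachusetts", "MA"],
    "c4": ["Widget Pro"],
}

# Hypothetical typed relationships, stored as (subject, relation, object) triples.
TRIPLES = [
    ("c1", "has office", "c2"),
    ("c2", "is located in", "c3"),
    ("c1", "sells product", "c4"),
]


def find_concept(text):
    """Resolve any label (case-insensitive) to a concept ID, or None."""
    wanted = text.strip().lower()
    for concept_id, labels in LABELS.items():
        if wanted in (label.lower() for label in labels):
            return concept_id
    return None


def follow(concept_id, relation):
    """Return the concepts reached from concept_id through one named relation."""
    return [o for s, r, o in TRIPLES if s == concept_id and r == relation]


def office_locations(company_text):
    """Two-step query: company -> its offices -> where those offices are located."""
    company = find_concept(company_text)
    if company is None:
        return []
    places = []
    for office in follow(company, "has office"):
        for place in follow(office, "is located in"):
            places.append(LABELS[place][0])  # report the preferred label
    return places


print(office_locations("acme corp"))  # ['Massachusetts']
```

The search string “acme corp” is never matched against document text here. It is matched to a concept, and the answer comes from following the relationships.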

Taxonomies are considered a kind of “controlled vocabulary” or “knowledge organization system.” Ontologies are considered a kind of “knowledge model” and a knowledge representation system, rather than a knowledge organization system. When we combine taxonomies and ontologies or speak of them collectively, it’s logical to use the word “semantic,” whether as semantic structures or semantic models, because they both involve semantics and both are usually based on Semantic Web guidelines.

Taxonomies are increasingly based on the Semantic Web recommendation (published by the World Wide Web Consortium) of SKOS (Simple Knowledge Organization System), which is based on RDF (Resource Description Framework). Most ontologies are based on RDF-Schema, an extension of RDF, and OWL (Web Ontology Language), another Semantic Web recommendation. The data models of SKOS, RDF, RDF-S, and OWL may all be integrated into the same knowledge model for a combined taxonomy-ontology. Most software for dedicated taxonomy-ontology management uses these data models.
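As a rough illustration of what “based on SKOS” means in practice, the following Python sketch turns a taxonomy concept into RDF-style triples using the SKOS properties prefLabel, altLabel, and broader. The SKOS and RDF namespace URIs are the published ones; the example.org namespace, the Concept class, and the to_triples function are assumptions made up for this sketch. A real project would more likely use an RDF library or a taxonomy management tool, and labels would usually carry language tags, which are omitted here.

```python
from dataclasses import dataclass, field

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/vocab/"  # hypothetical namespace for this sketch


@dataclass
class Concept:
    local_id: str
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: list = field(default_factory=list)  # local IDs of broader concepts


def to_triples(concept):
    """Express one concept as (subject, predicate, object) triples."""
    subject = EX + concept.local_id
    triples = [
        (subject, RDF_TYPE, SKOS + "Concept"),
        (subject, SKOS + "prefLabel", concept.pref_label),
    ]
    triples += [(subject, SKOS + "altLabel", label) for label in concept.alt_labels]
    triples += [(subject, SKOS + "broader", EX + parent) for parent in concept.broader]
    return triples


cars = Concept("cars", "Cars", alt_labels=["Automobiles"], broader=["vehicles"])

# Because everything is a triple, an ontology-style statement with a custom
# (hypothetical) property can live in the same graph as the SKOS statements.
graph = to_triples(cars) + [(EX + "acme", EX + "sellsProduct", EX + "cars")]

for triple in graph:
    print(triple)
```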

Semantic Search and Semantic Tagging


Taxonomies support semantic search and tagging. “Semantic search” is the third-ranked autocomplete suggested search phrase in a Google search I did recently on “semantic,” so this is clearly a popular application of semantics. Semantic search refers to search that focuses on concepts and meaning rather than just strings of text. This is not new, but since search that is based on text strings and statistical algorithms is so common, improving search results with the focus on semantics is getting more attention.

Semantic search is best enabled with the tagging of taxonomy concepts, which we may call “semantic tagging” (which I first heard of when asked to write an article on it in 2008). Advanced text analytics technologies, going beyond entity recognition and natural language processing to include natural language understanding so as to analyze sentence structure, syntax, and sentiment, can also yield search results based somewhat on meaning and not just words.
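The difference between string-based search and search over tagged concepts can be sketched in a few lines of Python. This is a toy example with hypothetical documents, tags, and labels, not a description of how any particular search engine works.

```python
# Hypothetical documents, each with its text and the taxonomy concepts it is tagged with.
DOCS = {
    "doc1": {"text": "Quarterly automobile sales rose sharply.", "tags": {"cars"}},
    "doc2": {"text": "New cars arrive at dealers in May.", "tags": {"cars"}},
    "doc3": {"text": "Railway cars were delayed by snow.", "tags": {"rail transport"}},
}

# Hypothetical taxonomy concepts with their labels (preferred and alternative).
CONCEPT_LABELS = {
    "cars": ["cars", "automobiles", "autos"],
    "rail transport": ["rail transport", "railways", "trains"],
}


def string_search(query):
    """Match the query as a text string anywhere in the document text."""
    q = query.lower()
    return sorted(doc_id for doc_id, doc in DOCS.items() if q in doc["text"].lower())


def semantic_search(query):
    """Resolve the query to a concept, then return documents tagged with it."""
    q = query.lower()
    concepts = {c for c, labels in CONCEPT_LABELS.items() if q in labels}
    return sorted(doc_id for doc_id, doc in DOCS.items() if doc["tags"] & concepts)


print(string_search("cars"))    # ['doc2', 'doc3']
print(semantic_search("cars"))  # ['doc1', 'doc2']
```

The string search misses the document that says “automobile” and wrongly retrieves the one about railway cars. The tag-based search returns the two documents that are actually about cars.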

Semantic Data

Taxonomies are traditionally for tagging and retrieving content, whereas ontologies are traditionally for exploring and retrieving data. The combination of a taxonomy and an ontology enables users to retrieve both content and data that are related to each other. Semantics for content is a given, because content (whether text, image, or other media), by its very nature, has meaning. Data by itself may not have much meaning, unless it is related to other data and that relationship has meaning, too. Thus, “semantic data” is significant. We hear reference to “semantic data” much more often than to “semantic content.”

You don’t need to add a taxonomy to content to make it “semantic” and understood (rather a taxonomy helps you find the content). However, depending on how data is presented, you may need to add an ontology or at least a semantic data model (a method to describe objects in a database and their relationship to one another) to make data “semantic.” Experts can analyze raw data, but the data is more valuable if non-experts can understand it, too, and that’s why “semantic data” is important. There is also a lot of attention on “semantic data models.”

Semantic Layer

The idea of a “semantic layer” as a framework or approach to make an organization’s information, both data and content, more structured, findable, and actionable, has been gaining popularity recently. Whether the “semantic layer” is new or just a new way of describing something is arguable.

A semantic layer is a standardized framework that organizes and abstracts organizational data and serves as a connector for all knowledge assets. It’s a method to bridge content and data silos through a structured and consistent approach to connecting data rather than consolidating it, as data warehouses do. The idea of a “layer” is that it is part of an enterprise-wide architecture of information, data and content, that connects horizontally across siloed content and data repositories. Taxonomies and ontologies, in addition to potentially other knowledge organization systems, such as a business glossary, are key components of a semantic layer.

More Talk of Semantics with Taxonomies and Ontologies

I’ve definitely been hearing of “semantics” more in the world of taxonomies and ontologies, and now I am bringing the word more into my own presentations. Following are some past and future examples.

Wednesday, July 31, 2024

Subject Headings vs. Taxonomies

When I spoke about taxonomies at the recent SLA (Special Libraries Association) annual conference, I was asked how a taxonomy differs from a subject heading scheme. Librarians are very familiar with subject headings, which are used to catalog books and other library materials. This is an interesting question, which I answered briefly in my presentation session, but I’d like to explain further.

I have previously written about how a taxonomy differs from a classification in “Classification Systems vs. Taxonomies.” Taxonomies are more similar to subject heading schemes. Libraries use both classification systems (such as the Dewey Decimal), which are for determining the physical location of books and other library materials on shelves based on their codes, and subject heading schemes (such as Library of Congress Subject Headings), which are used to identify books and other materials by their specific subject matter. The same subject could be used to catalog books and materials of different types (nonfiction, fiction, sound recordings, children’s) with very different classifications.

How Taxonomies and Subject Heading Schemes are Similar

Taxonomies and subject heading schemes are both considered types of controlled vocabularies, and they share similar uses and features. They both serve users who are looking up subjects to find information or resources available on the subject, rather than (or not yet) for identifying the physical location of the resource. In addition, they both:

  • have structures, but their focus is on the concepts
  • can be both searched and browsed
  • exist for both general and specific subject domains (Medical Subject Headings (MeSH) published by the National Library of Medicine is an example of a specific subject-domain subject heading scheme.)
  • have some structured, thesaurus-type of relationships between terms, including broader/narrower, and related.
  • bring together different names, as synonyms/alternative labels/nonpreferred terms/used for terms
  • may include named entities (proper nouns for people, organizations, or geographic places) alongside topical subjects
  • may have scope notes on select terms

How Taxonomies and Subject Heading Schemes Differ

With so many similarities, one might wonder if there are any differences between subject heading schemes and taxonomies.

Subject heading schemes and taxonomies have different histories and originally different formats. Subject heading schemes were designed for the print format and have been adapted to digital environments, whereas information “taxonomies” as we know them have existed only since the emergence of digital navigation and search systems.

Structural Differences with Subdivisions

The name “subject headings” refers to the traditional browsable display of headings in an index, and under headings may appear sub-headings or subdivisions to further refine multiple references/citations/linked results. This structure is the main difference between subject heading schemes and taxonomies. The heading-subheading/subdivision structure is characteristic of back-of-the-book indexes and indexes to articles when such indexes previously appeared in print, although it is still used online.

A subject heading may be subdivided by the addition of different types of subdivisions: topical, geographical (such as a country name), chronological (such as century, decade, or wartime), and form (for the content type, such as Periodicals). Some topical subdivisions are rather generic and can be applied to many headings, such as “Management,” “Research,” or “Law and legislation,” but most are specific to only a limited number of headings. For example, the subdivision “Lighting” is to be used under headings for structures, rooms, vehicles, installations, etc. See the full list of Library of Congress subdivisions.

The way that subdivisions refine a heading can be compared to the function of facets in a faceted taxonomy, which was noted by someone in the audience of my conference session. (See also the post “Faceted Classification and Faceted Taxonomies.”) Subdivisions and facets are both aspects of something. That does not mean, however, that a faceted taxonomy and a subject heading scheme are the same.

  • The structure of a faceted taxonomy has facets at the top-level, and the facets are relevant to a specific set of content, so they are aspects of the content, rather than aspects of a heading term.

  • There can be hierarchies of terms within a facet of a faceted taxonomy, but subdivisions do not have internal hierarchy. Instead, subdivisions may subdivide each other, but this is more like a prescribed navigation path, and they must follow a standard sequence. For example:
    English literature—20th century—History and criticism

Application Differences of Subdivisions vs. Attributes

Another facet-like implementation of taxonomies is to have attributes to refine the search results of a specific term within a hierarchical taxonomy. Attributes are common in e-commerce taxonomies, which involve a hierarchical taxonomy for product categories and attributes for product features. Attributes are more like subdivisions, in the way that they refine topics from the hierarchical taxonomy, but they are applied (tagged) differently than subdivisions.

The combination of a subject heading and a subdivision is done at the time of indexing an article or cataloging a book, and there are rules about which combinations are permitted. The combinations are indexed as if they were a single compound concept. Catalogers are required to use established heading-subdivision combinations and cannot just make up their own. Any string of multiple subdivisions must be applied in a prescribed order, such as geographic-topical-chronological-form for Library of Congress Subject Headings that are topics authorized for geographic subdivision.
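The prescribed-order rule can be sketched as a small validation function in Python. This is only an illustration: the order used is the geographic-topical-chronological-form example mentioned above, the sample heading is made up rather than taken from the Library of Congress authority file, the “--” separator is an assumed plain-text convention, and the sketch does not check whether a given heading-subdivision combination is actually established.

```python
# Example order from the text; other headings may follow a different prescribed order.
PRESCRIBED_ORDER = ["geographic", "topical", "chronological", "form"]
RANK = {kind: position for position, kind in enumerate(PRESCRIBED_ORDER)}
SEPARATOR = "--"  # assumed plain-text separator between heading and subdivisions


def follows_prescribed_order(subdivision_types):
    """True if the subdivision types never step backwards in the prescribed order."""
    ranks = [RANK[kind] for kind in subdivision_types]  # KeyError on unknown types
    return all(earlier <= later for earlier, later in zip(ranks, ranks[1:]))


def build_heading(heading, subdivisions):
    """Join a heading with (type, text) subdivisions, rejecting a wrong order."""
    types = [kind for kind, _ in subdivisions]
    if not follows_prescribed_order(types):
        raise ValueError("subdivisions are not in the prescribed order: %s" % types)
    return SEPARATOR.join([heading] + [text for _, text in subdivisions])


# Hypothetical example string, for illustration only.
print(build_heading("Hospitals", [
    ("geographic", "France"),
    ("topical", "Management"),
    ("chronological", "20th century"),
    ("form", "Periodicals"),
]))
# Hospitals--France--Management--20th century--Periodicals
```

The result is treated as one compound string, which reflects how heading-subdivision combinations are indexed as if they were a single concept.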

Unlike the practice of cataloging or indexing with subject headings and subdivisions, taxonomy terms and attributes for refinement:

  • are assigned more independently of each other, although the type of taxonomy term may restrict which attributes are available

  • come with a greater number of attribute types, and a piece of content is typically tagged with values from most or all of the attribute types

  • may have more than one attribute value of the same type applied (such as an item having two colors)

  • have no ranked order for applying attributes or for searching on them

Convergence of Subject Headings Schemes and Taxonomies

While subject heading schemes and taxonomies have traditionally had different styles, they have become more similar in more recent decades.

Many subject heading schemes and taxonomies have both adopted thesaurus features. Originally, the Library of Congress Subject Headings had only See (Use) and See also relationships (like in an index), but in 1987 it adopted thesaurus relationships of broader term/narrower term, and related term in place of See also. Meanwhile the differences between taxonomies and thesauri have also been blurred, as taxonomies may have related-term relationships, and thesauri may have an over-arching hierarchical structure. The leading reason taxonomies and thesauri are difficult to distinguish, in my opinion, is because the same software tools are used to develop and manage both, and the software makes no distinction between “taxonomy” and “thesaurus.”

Another way in which subject headings have become more like taxonomies is that subject headings may be used without subdivisions. This is increasingly common as subject headings get reused in search and retrieval systems which do not support the complexity of subdivisions. For example, newer online publishers of medical information have adopted Medical Subject Headings without their subdivisions, which are still used by the National Library of Medicine. Additionally, auto-tagging is not easily done with multiple levels of indexing. Without subdivisions, subject heading schemes are essentially the same as taxonomies, as long as they have a hierarchical structure.

Conclusions

Taxonomies have similarities and differences to both classification systems and to subject heading schemes. In fact, I would say that the modern information taxonomies have inherited features of both. Taxonomies are not always well defined, but they are flexible and adaptable to business needs.

Controlled vocabularies have existed for a long time, but their applications are becoming more varied. This has led to differences and also convergences of their features. Nevertheless, certain controlled vocabularies are more common in certain implementations. Subject heading schemes remain common in libraries, whereas taxonomies are more common in business and commercial implementations.

 

Monday, May 20, 2024

Tagging with a New Taxonomy


The benefits to information users of having content tagged with a taxonomy are great. They include increased accuracy and comprehensiveness of search results, speed and efficiency in obtaining results, the ability to filter search results, the opportunity to explore and discover related information, greater confidence in the completeness of results, and an overall better user experience. The benefits are worth the challenges of creating a taxonomy, and the benefits should be worth the challenges of properly tagging with a taxonomy as well.


Often the greatest challenge to taxonomy adoption is the ability to tag all of the content with the taxonomy terms as intended. Issues include allocating resources for tagging, implementing a new content management workflow, establishing criteria and quality control for tagging, and tagging a large volume of legacy untagged content.

Tagging Resources

While taxonomy development has one-time project expenses (such as the hours of a consultant or contractor), the ongoing tagging with a taxonomy requires an annual budget on top of some startup expenses, whether tagging is manual or automated. Manual tagging requires budgeting for the working hours, while auto-tagging typically requires an annual software license. Automated tagging also requires some human involvement for quality checks and refinements of tagging parameters.

Which method, manual or automated, to choose depends on the volume and speed of tagging required, the nature of the content, and the need for accuracy. Automated methods are more cost effective for large volumes of content tagging and can tag more quickly. Automated (AI) methods can tag text or images, but the same tool/technology does not do both, so for mixed content, manual tagging may be a more practical and affordable option. Automated methods are also better for content of a consistent type (e.g. all resumes, all news, all technical support articles), whereas a diversity of content (e.g. everything on the intranet or on the public website), can be tagged more accurately if done manually. Manual tagging may not be as consistent as automated methods, but unlike automated tagging, it is rarely wrong. If 10-15% mis-tagged content cannot be tolerated, then manual tagging may be preferred.

Automated tagging is not free from manual labor. If tagging is done by machine learning, then the machine needs to learn from examples, and sample tagged content may need to be prepared and submitted to the system as such examples. If tagging is done by rules, then rules need to be written for most of the taxonomy concepts. Prebuilt starter taxonomies may be pre-trained or have tagging rules included, but they will likely need refinement. In fact, any auto-tagging needs to be tuned and refined as the content and the taxonomy evolve.

Tagging Workflow

Whether manual or automated, tagging content requires setting up new content management workflows. It needs to be determined who does the tagging: the author, the editor, or someone else. Unless trained professional indexers tag the content, tagging review by an editor may be desired.

While manual tagging can be done within the same system (some kind of content management system) where the content is stored, these systems usually don’t have the functionality of auto-tagging built in. Automated tagging is typically done by establishing an integration between the auto-tagging tool (which may be a module of a taxonomy management system) and the content management system and the setting up of a data “pipeline” for the tagging tool. Setting this up may require some additionally billed services of the software vendor.

Also as part of the tagging workflow should be a method for taggers or those who review automated tagging to be able to suggest new terms to add to the taxonomy, as they see new concepts in the content.

Tagging Standards

Establishing criteria and quality control for tagging begins with setting tagging policy and guidelines. This includes setting the policy regarding the level of detail to tag, how many terms of each type may be tagged to a single piece of content, whether a certain taxonomy term type is required or not for tagging, and whether the tagging of certain terms should trigger the additional tagging of another term (such as a broader term). These policies can be set as parameters for auto-tagging. For manual tagging, some of the tagging policies can be system enforced, but other policies cannot be.

Tagging has both policies (rules) and guidelines (best practices/recommendations).  A policy, for example, would be the minimum and maximum number of tags permitted, whereas a guideline would be a suggested narrower range of tags.
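The distinction between a policy and a guideline can be sketched in Python as the difference between an error and a warning. All names and numbers below (the TaggingRules class, the limits, the broader-term table) are hypothetical and only illustrate how such rules might be expressed as parameters. They are not taken from any tagging tool.

```python
from dataclasses import dataclass


@dataclass
class TaggingRules:
    # Policy (hard rule): hypothetical minimum and maximum number of tags.
    min_tags: int = 1
    max_tags: int = 10
    # Guideline (recommendation): hypothetical narrower suggested range.
    suggested_min: int = 3
    suggested_max: int = 6


# Hypothetical broader-term lookup for the "trigger a broader term" policy.
BROADER = {"Electric cars": "Cars", "Cars": "Vehicles"}


def add_broader_terms(tags):
    """Return the tags plus the immediate broader term of each tag, if any."""
    result = set(tags)
    for tag in tags:
        if tag in BROADER:
            result.add(BROADER[tag])
    return result


def check_tags(tags, rules=None):
    """Return (errors, warnings): policy violations versus guideline reminders."""
    rules = rules or TaggingRules()
    errors, warnings = [], []
    count = len(tags)
    if count < rules.min_tags or count > rules.max_tags:
        errors.append("policy: %d tags, allowed %d-%d"
                      % (count, rules.min_tags, rules.max_tags))
    elif not rules.suggested_min <= count <= rules.suggested_max:
        warnings.append("guideline: %d tags, suggested %d-%d"
                        % (count, rules.suggested_min, rules.suggested_max))
    return errors, warnings


tags = add_broader_terms({"Electric cars"})
print(sorted(tags))      # ['Cars', 'Electric cars']
print(check_tags(tags))  # ([], ['guideline: 2 tags, suggested 3-6'])
```

A system could block saving on errors and merely display warnings, which is one way that some policies can be system enforced while guidelines remain advisory.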

Whether manual or automated, tagging should be occasionally checked for accuracy, as a periodic quality control function. Based on the results, revisions may be needed for the taxonomy, and/or the tagging guidelines/policies may need to be revised.
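One simple way to run such a periodic check is to have a reviewer correct the tags on a sample of items and then compare. The Python sketch below is an assumption about how that comparison might be scored, using precision (how many assigned tags were right) and recall (how many of the right tags were assigned). The item IDs and tags are made up.

```python
def tagging_accuracy(assigned_tags, reviewed_tags):
    """Compare assigned tags with reviewer-approved tags for the sampled items.

    Both arguments map an item ID to a set of tags. Only items present in
    reviewed_tags (the quality-control sample) are scored.
    """
    correct = wrong = missed = 0
    for item_id, approved in reviewed_tags.items():
        assigned = assigned_tags.get(item_id, set())
        correct += len(assigned & approved)  # tags the reviewer agreed with
        wrong += len(assigned - approved)    # tags the reviewer removed
        missed += len(approved - assigned)   # tags the reviewer added
    precision = correct / (correct + wrong) if correct + wrong else 0.0
    recall = correct / (correct + missed) if correct + missed else 0.0
    return precision, recall


# Hypothetical sample of two reviewed items.
assigned = {"a": {"Cars", "Trains"}, "b": {"Email"}}
reviewed = {"a": {"Cars"}, "b": {"Email", "Sales"}}

precision, recall = tagging_accuracy(assigned, reviewed)
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```

Low precision suggests that tagging rules or training examples need tightening, while low recall may point to missing synonyms or missing concepts in the taxonomy.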

Legacy Content Tagging

Even if there is an established workflow for tagging newly added content, there is the challenge of tagging all the legacy content that is already in the system. It’s rare that a taxonomy is implemented before any content is already collected and made available for searching.

Automated tagging may be a good way to handle the backlog of untagged content. However, software is intended to be licensed for at least a year and be a part of the regular workflow, rather than for a one-time backlog tagging project. So, the long-term use of auto-tagging software needs to be considered.

If only manual tagging will be the selected method for the long term, then you should consider the tagging services of a freelancer, contractor, temp, or intern (library science student) to take care of tagging the initial backlog of content. Freelance indexers can be found through the American Society for Indexing and indexing societies in other countries. They prefer to call the activity “indexing,” rather than “tagging.”

While taxonomy creation is a project, taxonomy management and maintenance are an on-going program, and it’s the same with tagging. Backlog tagging will be a project, but ongoing tagging is a related program, and should be related to taxonomy management and maintenance. Tagging should be an important part of an information and content management strategy and not an afterthought.

Tuesday, April 30, 2024

Synonym Rings (or Search Thesaurus)

A synonym ring is a simple kind of controlled vocabulary that, as the name suggests, has controlled synonyms for concepts and nothing more. I have long included mention of synonym rings in presentations I’ve given with sections listing and describing controlled vocabulary types, and the synonym ring has appeared on diagrams illustrating comparative complexity and included features of the various controlled vocabularies, progressing from the simplest term lists to synonym rings, name authorities, taxonomies, thesauri, and finally ontologies.

However, until now, I have not gone into detail about synonym ring use and design.  

The name “synonym ring” is generally known only by taxonomists and other information professionals. It is called a “ring” because all synonyms point to each other, as in a circle or ring, rather than to a preferred term/label. Another name for it is a “search thesaurus,” although it should be clear that “thesaurus” is meant to be the Roget’s type and not the information retrieval type (similar to a taxonomy). I have also read the name “synset” but have not heard it in practice.

 

What we are talking about is a managed set of concepts, each with one or more synonyms, created specifically for supporting search, matching end-user search strings to text strings in the content being searched, for commonly searched concepts. The synonyms also match to variant names of the concept throughout the body of text that is being searched. Because the synonym ring’s purpose is to support search, it is not browsed and thus not displayed to the end users. Therefore, a preferred term or preferred label for each concept is not needed and thus not included.
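In code terms, a synonym ring can be sketched as a set of terms with no preferred member, used to expand a search string before matching. The following Python sketch is a simplified illustration with made-up rings and documents. It uses naive substring matching, and it assumes each term belongs to only one ring.

```python
# Hypothetical synonym rings: plain sets, with no term marked as preferred.
RINGS = [
    {"cars", "automobiles", "autos"},
    {"gdp", "gross domestic product"},
    {"email", "e-mail"},
]


def build_index(rings):
    """Map every term to its whole ring, so all synonyms point to each other."""
    index = {}
    for ring in rings:
        members = frozenset(term.lower() for term in ring)
        for term in members:
            index[term] = members  # if a term were in two rings, the last would win
    return index


def expand_query(query, index):
    """Return the query plus all of its synonyms (just the query if it has no ring)."""
    term = query.strip().lower()
    return index.get(term, frozenset({term}))


def search(query, documents, index):
    """Return documents whose text contains the query or any of its synonyms."""
    variants = expand_query(query, index)
    return [doc for doc in documents if any(v in doc.lower() for v in variants)]


INDEX = build_index(RINGS)
DOCUMENTS = ["Automobiles are selling well.", "GDP grew 2%.", "Email us today."]

print(search("cars", DOCUMENTS, INDEX))                    # ['Automobiles are selling well.']
print(search("Gross domestic product", DOCUMENTS, INDEX))  # ['GDP grew 2%.']
```

Nothing in this structure is meant for display, which is why no preferred label is needed.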

Whether in a synonym ring or in another controlled vocabulary or taxonomy, “synonyms” refer to concept variants and not literal grammatical synonyms. In a controlled vocabulary, they are often phrases, not single words, and they are for things/concepts, and not all kinds of words (different parts of speech) found in a dictionary. They also don’t have to be exact synonyms, but rather sufficiently synonymous for the context of the content being searched.

Features of a synonym ring (search thesaurus)

  • It includes only concepts for which there are “synonyms.” Each concept must have at least two synonyms. If there are no synonyms for the concept, then the concept is not included in the synonym ring (in contrast to a regular controlled vocabulary). So, important concepts may be absent.
  • Synonyms are not displayed to the users, so slang, deprecated, potentially offensive terms, etc. may be included.
  • It supports searching only and not tagging. People doing manual tagging or systems doing auto-tagging will not be able to make use of the synonyms to identify the best concept to tag with. (They could utilize another taxonomy implemented in another system for tagging.)

Implementation of synonym rings

Typically, when taxonomists are called upon to design a taxonomy, they design it with synonyms (aka alternative labels, nonpreferred terms, variants, etc.) included. Thus, creating a dedicated synonym ring type of controlled vocabulary is not common, since the necessary synonyms are already included in the taxonomy. Small taxonomies may not have synonyms, though.

Search that is built into content/record management systems may support search synonyms, but this tends to be more ad hoc than as a managed controlled vocabulary. Recently I looked into the synonym support in controlled vocabularies and taxonomies in Salesforce Service Cloud. It supports the creation of “custom synonym groups,” where each group is a synonym ring of up to six synonyms per concept, but these have to be entered individually in the user interface, rather than imported as a list. As such, it’s not really a “controlled vocabulary” set.

Some content management systems with included taxonomies only enable synonyms as part of their standard displayed taxonomies and not as non-displayed search synonyms. Other systems, such as SharePoint, support the use of synonyms for their taxonomies (managed in SharePoint’s Term Store) for tagging but not for searching.

In systems that support search synonyms, adding them is often a systems administrator feature, which is something that the technical systems administrators may do, while taxonomists, information architects, and knowledge managers may not know about it. After all, a set of synonyms is not a “taxonomy,” so taxonomist involvement may not even be considered. Thus, communication is necessary between those who advocate the need for comprehensive search synonyms and know how best to create them and those who are in a technical role for implementing them in a system.

Advantages of synonym rings

A synonym ring is relatively easy to develop. While there are nuances to creating synonyms (described below), it’s easier than creating other controlled vocabularies or taxonomies, since there is no need to worry about which term should be preferred and how to best create a hierarchy. Since it is not displayed, getting input from users is not required.

Focusing only on searching, and not also on tagging, makes the task of coming up with synonyms simpler as well. Sometimes you want synonyms that support search but not tagging, and sometimes synonyms that support tagging but not searching (such as when the synonyms display to users), and trying to design for both scenarios in the same taxonomy is not easy.

When searching is the primary way that users access content, rather than browsing and filtering, a synonym ring may be an ideal solution. It might not make sense to go to the effort to design and create a hierarchical taxonomy for terms that users are searching on, if the goal is to simply enhance search.

A taxonomy runs the risk of being too broad or too specific, but a synonym ring never has that issue. The size of a synonym ring type of controlled vocabulary is flexible, and it can be built out gradually over time with no detriment.

Disadvantages of synonym rings

A synonym ring is not a standard controlled vocabulary type and is not supported in the SKOS (Simple Knowledge Organization System) data model standard of the World Wide Web Consortium. This is because a SKOS controlled vocabulary (including taxonomies) needs to have preferred labels for its concepts. Thus, synonym rings are not interoperable in the same way that other controlled vocabularies are. You cannot link to external synonym rings, and you cannot even import or export them easily. They are managed within a siloed system.

Since synonym rings do not support tagging, an additional tagging controlled vocabulary with synonyms, which is somewhat redundant in its subject scope, may need to be created.

Creating synonyms for a synonym ring

“Synonyms” can include dictionary synonyms, synonyms for individual words within multi-word phrases (e.g. political protests / political demonstrations), formal and colloquial names, acronyms, etc. Following is a list of example types:

  • synonyms: Cars / Automobiles
  • quasi-synonyms: Learning / Training
  • variant spellings: Email / E-mail
  • lexical variants: Selling / Sales
  • foreign language names: München / Munich
  • acronyms/spelled out: GDP / Gross domestic product
  • scientific/popular names: Neoplasms / Cancer
  • older/current names: Near East / Middle East

Care should be taken not to include synonyms that are not sufficiently equivalent or may be vague and have other usages, such as “development” (which could refer to software development, nonprofit fundraising, or something else). It depends on context, so, for example, “tools” as a synonym for “software” would be acceptable if the content were only about technology and did not include manufacturing, construction, etc.

Synonyms can be identified when doing research for concepts to include, including manual content analysis, automatic term extraction, lists of uncontrolled keyword tags, and search log reports. Search logs are especially suitable for synonym rings, since their usage is the same: user search strings. However, often searches are on single words, whose meaning is vague. For example, a search string word of “application” is too vague and should not be used as a synonym. You should take search strings from search logs only if their meaning is clear.
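A first pass over a search log can be automated before a person makes the final call. The Python sketch below is only a rough triage under stated assumptions: the log entries and the minimum-count threshold are made up, and treating single-word searches as “possibly vague” is a crude proxy. Some single words are perfectly clear, so they are set aside for human review rather than discarded.

```python
from collections import Counter


def synonym_candidates(search_log, min_count=2):
    """Split frequent search strings into likely candidates and ones needing review."""
    counts = Counter(q.strip().lower() for q in search_log if q.strip())
    candidates, needs_review = [], []
    for query, count in counts.most_common():
        if count < min_count:
            continue  # too rare to be worth adding
        if len(query.split()) == 1:
            needs_review.append(query)  # single word: meaning may be vague
        else:
            candidates.append(query)
    return candidates, needs_review


# Hypothetical search log entries.
LOG = [
    "application", "Application", "job application", "job application",
    "software application", "gross domestic product", "Gross Domestic Product",
]

print(synonym_candidates(LOG))
# (['job application', 'gross domestic product'], ['application'])
```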

Finally, developing synonyms for a synonym ring implemented in an internal content management system is not the same as developing synonyms for a public website to support web search engine optimization (SEO), for which they are also called “search synonyms.” For SEO, web search engine algorithms need to be considered, and obtaining the greatest number of visitors is the goal, even if those site visitors did not intend to come to the website. In such cases, more specific concepts (e.g. iPhone as synonym for cell phone) as “synonyms” would be fine. If website visitors do not find what they are looking for, that’s OK. By contrast, users of an enterprise CMS or search system would consider it a waste of their time if they retrieved additional content that did not match their search. Although sample user testing is not needed, search testing to check the accuracy of results should be performed.

Sunday, March 24, 2024

History of Modern Information Taxonomies

The word “taxonomy” was coined in 1813 by the Swiss botanist A. P. de Candolle, who developed a new method of classifying plants. The word is derived from the combination of Greek words τάξις (taxis), meaning “order” or “arrangement,” and νόμος (nomos), meaning “method” or “law.” The designation of taxonomy was then applied after-the-fact to Carl Linnaeus’ binomial nomenclature system that had been published under the title Systema Naturae initially in 1735.

Today’s information taxonomies have their origins in a combination of classification systems, library subject heading schemes, and literature retrieval thesauri, and thus combine features of all of these. Despite their name, information taxonomies are closer to subject heading schemes and thesauri than they are to classification systems.

Classification systems

Classification systems have a multi-level hierarchy of classes, where a subclass is fully contained in its parent class, and consequently members of a subclass are also members of the parent class. Members (things) can belong to only one class, though. Historic examples include:

  • Linnaean classification of organisms (1735-1758)
  • Paris Bookseller's classification (1842)
  • International Classification of Diseases (originally Bertillon Classification of Causes of Death, 1860)
  • Dewey Decimal Classification (1876) and other library classifications
  • Industry classification systems:
    • Standard Industrial Classification System (U.S.) (1937)
    • International Standard Industrial Classification (U.N.) (1948)

The requirement that a thing (an organism, book, document, medical diagnosis, economic establishment) can go into only one class supports various purposes other than information retrieval:

  • Understanding an organism’s evolutionary background; identifying potential medicinal herbs
  • Locating and reshelving a book on its shelf
  • Performing health data analysis from hospital records; billing health insurance companies appropriately
  • Doing economic analysis of industries by aggregating establishment data

When it comes to information resources, classification systems may be used to determine in what (virtual) file folder a document belongs, or to support machine-learning-based auto-classification.

Classification systems are also useful for data analysis, since content or records are assigned to only one classification, and this prevents any double counting. Large, data-heavy organizations might have developed their own internal classification systems for data tracking purposes. Such classifications do not serve the same purpose as a tagging/information retrieval taxonomy and should not substitute for a taxonomy, but rather exist alongside it for separate purposes.
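
The double-counting point can be shown with a small sketch. The records, field names, and values below are hypothetical and only illustrate the contrast between a single assigned class and multiple tags per record.

```python
from collections import Counter

# Hypothetical records: each has exactly one class and any number of tags.
records = [
    {"id": 1, "industry_class": "Retail", "tags": ["E-commerce", "Logistics"]},
    {"id": 2, "industry_class": "Retail", "tags": ["E-commerce"]},
    {"id": 3, "industry_class": "Manufacturing", "tags": ["Logistics", "Automation"]},
]

class_counts = Counter(r["industry_class"] for r in records)
tag_counts = Counter(tag for r in records for tag in r["tags"])

# Class counts add up to the number of records, so nothing is double counted.
print(sum(class_counts.values()) == len(records))  # True

# Tag counts can exceed the number of records, since a record may carry several tags.
print(sum(tag_counts.values()))  # 5
```

Totals by class are therefore safe to aggregate in reports, while totals by tag describe how often a subject occurs, not how many records exist.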

Subject heading schemes

Subject heading schemes were developed to help people find books, and later also articles, on various subjects with more detail and flexibility for growth than classification systems. Subject headings are used for cataloguing and indexing, not for classification. Unlike classification (for shelf location), in which an item is assigned to only one class, subject indexing allows an item (book, article, other media) to have multiple subjects.

Features of subject heading schemes:

  • Alphabetical arrangement of a very large number of subjects and/or named entities (proper nouns)
  • Cross-references of See (Use) and See also (Related)
  • Headings with large numbers of citations broken down to group the citations by a sub-heading or subdivision, in what is also called pre-coordination. For example, China – Foreign relations.

Back-of-the-book indexes, whose format evolved over the first half of the 20th century, follow a similar style.

Examples of early subject heading schemes:

  • Library of Congress Subject Headings (1898) and other national library systems
  • U.S. National Library of Medicine’s Medical Subject Headings (1954)

Library subject headings were adopted for periodical article indexes early on. The Readers’ Guide to Periodical Literature, published by the H.W. Wilson Company, had been using subject headings, including subdivisions and cross-references, since shortly after its introduction in 1901 (as can be seen in the 1900-1905 cumulative index excerpted in the screenshot below).

(The two-digit years are from the prior century.)

Eventually, subject heading schemes adopted the thesaurus features of Broader term, Narrower term, and Related term relationships, as was the case for Library of Congress Subject Headings, starting in 1985. Thus, subject heading schemes and thesauri have become very similar. The name “heading” in subject headings implies that there also exist some sub-headings/subdivisions, though, a feature that is not typical of thesauri.

Thesauri

Information thesauri (in contrast to a dictionary thesaurus, like Roget’s) emerged in the mid-20th century outside of libraries for the more specialized subject needs of the federal government, scientific publishers, and technology companies. The word “thesaurus” was first used to refer to a controlled vocabulary, as a set of words/terms, not classification codes, for information retrieval in the 1950s.

Early thesauri include:

  • E. I. du Pont de Nemours and Company’s thesaurus (1959)
  • Thesaurus of Armed Services Technical Information Agency (ASTIA) Descriptors, U.S. Department of Defense (1960)
  • Chemical Engineering Thesaurus, published by the American Institute of Chemical Engineers (1961)

Additional professional organization publishers of scientific journals created their own thesauri in the 1960s. Dialog, the first online information service for article citations, which also utilized thesauri of information publishers, was launched in 1966.

Soon thereafter, standards for thesauri were developed and published:

  • UNESCO Guidelines for the establishment and development of monolingual thesauri (1970)
  • DIN 1463 (Deutsches Institut für Normung) Guidelines for the establishment and development of monolingual thesauri (1972)
  • ISO 2788 Guidelines for the establishment and development of monolingual thesauri (1974) (superseded by ISO 25964-1:2011)
  • ANSI American National Standard for Thesaurus Structure, Construction, and Use (1974) (superseded by ANSI/NISO Z39.19-1993)

Modern information taxonomies

The word “taxonomy” for a hierarchical structure (like a classification scheme) of terms for tagging and retrieval (like a thesaurus) gradually became popular in the 1990s. These new taxonomy-like thesauri became popular largely due to advancements in software and website user interfaces that enabled interactive displays of hierarchies. Taxonomies had the same primary purpose as thesauri, which is information findability and retrieval, but taxonomy implementations introduced new designs for browsing and expanding hierarchies. It was found that “taxonomy” also tended to resonate with business audiences better than “thesaurus.” By the end of the 1990s, software vendors and consultants had started to recognize a market for business and commercial taxonomies.

Combining an interactive user interface with a database enabled dynamic filters, or refinements of searches, by selected taxonomy terms based on different aspects. Thus, faceted taxonomies emerged and have since become a popular, if not dominant, implementation of taxonomies for many different use cases. Because faceted taxonomies combine search terms for refinement, they do not need to be as large and detailed as thesauri.
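
To make the idea of faceted refinement concrete, here is a minimal sketch. The items, facet names (type, topic, region), and helper functions are all hypothetical and stand in for what a search or content platform would provide. It shows how selecting one term from each of several small facets narrows a result set.

```python
# Hypothetical content items tagged with terms from three small facets.
items = [
    {"title": "Q3 sales report", "type": "Report", "topic": "Sales", "region": "Europe"},
    {"title": "Sales onboarding guide", "type": "Guide", "topic": "Sales", "region": "Asia"},
    {"title": "HR policy update", "type": "Policy", "topic": "Human resources", "region": "Europe"},
    {"title": "EU sales playbook", "type": "Guide", "topic": "Sales", "region": "Europe"},
]


def refine(items, **selected):
    """Keep items that match every selected facet value (logical AND)."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count the remaining items per value of one facet, as often shown next to filters."""
    counts = {}
    for item in items:
        counts[item[facet]] = counts.get(item[facet], 0) + 1
    return counts


results = refine(items, topic="Sales", region="Europe")
print([item["title"] for item in results])  # ['Q3 sales report', 'EU sales playbook']
print(facet_counts(results, "type"))  # {'Report': 1, 'Guide': 1}
```

Each facet here has only a few terms, yet combining them expresses specific requests such as “sales content for Europe” without needing a single pre-built term for that combination.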

As for the next chapter in the history of taxonomies, that involves a convergence with ontologies. You can read more about that in my past blog article “Taxonomies vs. Ontologies.”