Sunday, March 24, 2024

History of Modern Information Taxonomies

The word “taxonomy” was coined in 1813 by the Swiss botanist A. P. de Candolle, who developed a new method of classifying plants. The word combines the Greek words τάξις (taxis), meaning “order” or “arrangement,” and νόμος (nomos), meaning “method” or “law.” The term was then applied after the fact to Carl Linnaeus’s binomial nomenclature system, which had first been published under the title Systema Naturae in 1735.

Today’s information taxonomies have their origins in a combination of classification systems, library subject heading schemes, and literature retrieval thesauri, and thus combine features of all three. Despite their name, information taxonomies are closer to subject heading schemes and thesauri than they are to classification systems.

Classification systems

Classification systems have a multi-level hierarchy of classes, where a subclass is fully contained in its parent class, and consequently members of a subclass are also members of the parent class. A member (thing) can be assigned to only one class at any given level, though. Historic examples include:

  • Linnaean classification of organisms (1735-1758)
  • Paris Bookseller's classification (1842)
  • International Classification of Diseases (originally Bertillon Classification of Causes of Death, 1860)
  • Dewey Decimal Classification (1876) and other library classifications
  • Industry classification systems:
    • Standard Industrial Classification System (U.S.) (1937)
    • International Standard Industrial Classification (U.N.) (1948)

The requirement that a thing (an organism, book, document, medical diagnosis, economic establishment) can go into only one class supports various purposes other than information retrieval:

  • Understanding an organism’s evolutionary background; identifying potential medicinal herbs
  • Locating and reshelving a book on its shelf
  • Performing health data analysis from hospital records; billing health insurance companies appropriately
  • Doing economic analysis of industries by aggregating establishment data

When it comes to information resources, classification systems may be used to determine in which (virtual) file folder a document belongs or to support machine-learning-based auto-classification.

Classification systems are also useful for data analysis, since each content item or record is assigned to only one class, which prevents double counting. Large, data-heavy organizations might have developed their own internal classification systems for data tracking purposes. Such classifications do not serve the same purpose as a tagging/information retrieval taxonomy and should not substitute for a taxonomy, but rather exist alongside it for separate purposes.
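To illustrate why single-class assignment avoids double counting, here is a minimal sketch in Python. The class names, the establishments, and the roll_up function are all hypothetical and invented for illustration; they are not taken from any real industry classification.

```python
from collections import Counter

# Hypothetical two-level classification: each subclass has exactly one parent class.
PARENT = {
    "Bakeries": "Manufacturing",
    "Breweries": "Manufacturing",
    "Bookstores": "Retail",
}

# Each record (here, an establishment) is assigned to exactly one class.
records = {
    "Establishment A": "Bakeries",
    "Establishment B": "Breweries",
    "Establishment C": "Bookstores",
    "Establishment D": "Bakeries",
}

def roll_up(assignments, parent):
    """Count records per top-level class by following each subclass to its parent."""
    return Counter(parent[cls] for cls in assignments.values())

totals = roll_up(records, PARENT)

# Because every record sits in exactly one class, the class totals
# always add up to the number of records.
assert sum(totals.values()) == len(records)
```

If records could carry several classes at once, as with tagging, the per-class totals could add up to more than the number of records, which is why such counts are better taken from a classification than from a retrieval taxonomy.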

Subject heading schemes

Subject heading schemes were developed to help people find books, and later also articles, on various subjects, with more detail and flexibility for growth than classification systems offer. Subject headings are used for cataloguing and indexing, not for classification. Unlike a classification for shelf location, of which an item has only one, subject headings are not exclusive: an item (book, article, other media) can have multiple subjects.

Features of subject heading schemes:

  • Alphabetical arrangement of a very large number of subjects and/or named entities (proper nouns)
  • Cross-references of See (Use) and See also (Related)
  • Headings with large numbers of citations broken down to group the citations by a sub-heading or subdivision, in what is also called pre-coordination. For example, China – Foreign relations.

Back-of-the-book indexes, whose format evolved over the first half of the 20th century, follow a similar style.
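The features listed above can be sketched as a small data structure. This is only an illustration in Python: the headings other than “China – Foreign relations,” the item names, and the look_up function are hypothetical and do not come from any actual subject heading scheme.

```python
# See (Use) references: a non-preferred entry term points to the preferred heading.
USE = {"People's Republic of China": "China"}

# See also (Related) references: a heading points to related headings.
SEE_ALSO = {"China": ["East Asia"]}

# Pre-coordinated headings: citations are grouped under heading, then subdivision.
INDEX = {
    "China": {
        "Foreign relations": ["Item 1", "Item 2"],
        "History": ["Item 3"],
    },
    "East Asia": {
        "History": ["Item 2"],  # the same item can appear under several subjects
    },
}

def look_up(term):
    """Resolve an entry term to its preferred heading and return what is filed there."""
    heading = USE.get(term, term)
    return {
        "heading": heading,
        "subdivisions": INDEX.get(heading, {}),
        "see_also": SEE_ALSO.get(heading, []),
    }

result = look_up("People's Republic of China")
```

Here a lookup on the non-preferred term is redirected to “China,” where the citations are grouped by subdivision and a See also reference points onward to a related heading.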

Examples of early subject heading schemes:

  • Library of Congress Subject Headings (1898) and other national library systems
  • U.S. National Library of Medicine’s Medical Subject Headings (1954)

Library subject headings were adopted for periodical article indexes early on. The Readers’ Guide to Periodical Literature, published by the H. W. Wilson Company, had been using subject headings, including subdivisions and cross-references, since shortly after its introduction in 1901 (as can be seen in the 1900-1905 cumulative index excerpted in the screenshot below).

(The two-digit years are from the prior century.)

Eventually, subject heading schemes adopted the thesaurus features of Broader term, Narrower term, and Related term relationships, as was the case for Library of Congress Subject Headings, starting in 1985. Thus, subject heading schemes and thesauri have become very similar. The name “heading” in subject headings implies that there are also sub-headings or subdivisions, though, a feature that is not typical of thesauri.

Thesauri

Information thesauri (in contrast to a dictionary thesaurus, like Roget’s) emerged in the mid-20th century outside of libraries for the more specialized subject needs of the federal government, scientific publishers, and technology companies. The word “thesaurus” was first used in the 1950s to refer to a controlled vocabulary for information retrieval, consisting of words or terms rather than classification codes.

Early thesauri include:

  • E. I. du Pont de Nemours and Company’s thesaurus (1959)
  • Thesaurus of Armed Services Technical Information Agency (ASTIA) Descriptors, U.S. Department of Defense (1960)
  • Chemical Engineering Thesaurus, published by the American Institute of Chemical Engineers (1961)

Additional professional organization publishers of scientific journals created their own thesauri in the 1960s. Dialog, the first online information service for article citations, which also utilized thesauri of information publishers, was launched in 1966.

Soon thereafter, standards for thesauri were developed and published:

  • UNESCO Guidelines for the establishment and development of monolingual thesauri (1970)
  • DIN 1463 (Deutsches Institut für Normung) Guidelines for the establishment and development of monolingual thesauri (1972)
  • ISO 2788 Guidelines for the establishment and development of monolingual thesauri (1974) (superseded by ISO 25964-1:2011)
  • ANSI American National Standard for Thesaurus Structure, Construction, and Use (1974) (superseded by ANSI/NISO Z39.19-1993)

Modern information taxonomies

The word “taxonomy” for a hierarchical structure (like a classification scheme) of terms for tagging and retrieval (like a thesaurus) gradually became popular in the 1990s. These new taxonomy-like thesauri caught on largely because advances in software and website user interfaces enabled interactive displays of hierarchies. Taxonomies had the same primary purpose as thesauri, which is information findability and retrieval, but taxonomy implementations introduced new designs for browsing and expanding hierarchies. The word “taxonomy” also tended to resonate with business audiences better than “thesaurus.” By the end of the 1990s, software vendors and consultants had started to recognize a market for business and commercial taxonomies.

Combining an interactive user interface with a database enabled the introduction of dynamic filters, or refinements, that narrow a search by selected taxonomy terms representing different aspects. Thus faceted taxonomies emerged, and they have since become a popular, if not dominant, implementation of taxonomies for many different use cases. Because search terms from different facets are combined for refinement, faceted taxonomies do not need to be as large and detailed as thesauri.
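Faceted refinement can be sketched in a few lines of Python. The facet names, terms, documents, and the refine function below are hypothetical examples of my own, and real implementations typically run such filters as database or search engine queries rather than in application code.

```python
# Each document is tagged with terms from several small facets (hypothetical data).
documents = [
    {"title": "Doc 1", "Topic": {"Budgeting"}, "Content type": {"Report"}, "Region": {"Europe"}},
    {"title": "Doc 2", "Topic": {"Budgeting", "Hiring"}, "Content type": {"Policy"}, "Region": {"Asia"}},
    {"title": "Doc 3", "Topic": {"Hiring"}, "Content type": {"Report"}, "Region": {"Europe"}},
]

def refine(docs, selections):
    """Keep only the documents that carry every selected term in its facet."""
    return [
        doc for doc in docs
        if all(term in doc.get(facet, set()) for facet, term in selections.items())
    ]

# Selecting one term narrows the results; selecting a second term narrows them further.
by_topic = [d["title"] for d in refine(documents, {"Topic": "Budgeting"})]
by_topic_and_region = [
    d["title"] for d in refine(documents, {"Topic": "Budgeting", "Region": "Europe"})
]
```

Three small facets are enough here to single out one document, which shows how combining terms can stand in for the long, highly specific terms that a large thesaurus would otherwise need.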

The next chapter in the history of taxonomies involves a convergence with ontologies. You can read more about that in my past blog article “Taxonomies vs. Ontologies.”