Tuesday, April 2, 2013

Taxonomies vs. Classification

A question came up in one of my classes about how classification differs from taxonomies/thesauri. As part of an assignment to find thesauri on the web, a student sought to find “how the Federal Government classifies its publications and was expecting to find a very elaborate Thesaurus … and instead found… the Superintendent of Documents classification system,” and so the student asked how that classification system fits into the scheme of definitions for taxonomies, controlled vocabularies, and thesauri. That is what I will attempt to explain here.

We are familiar with classification schemes used to catalog and locate books and other materials in libraries, such as the Dewey Decimal system or, for academic libraries, the Library of Congress Classification (letter-based call “numbers”). In addition to the U.S. federal government’s “Superintendent of Documents” classification system, many other national governments and international organizations also have their own document classification schemes, and states and provinces may have modified versions. There are also classification systems for industries, such as the NAICS (North American Industry Classification System) codes. Corporations with large volumes of documents may have their own internal document classification systems.

I sum up the differences between classification schemes and taxonomies/thesauri as follows:

Classification systems:

  • used for books, monographs, documents, reports, contracts, or other media
  • developed for the classification of physical items for their location on shelves, drawers, or filing cabinets and physical file folders
  • based on alpha-numeric codes
  • involves assigning an item only one classification code
  • manually assigned to each item
  • classification codes may include additional information, such as date, title, author, or publishing department information within the same classification code
  • rarely gets changed (due to the pre-established numeric code hierarchy)
  • helps document managers and librarians organize documents and helps users locate pre-identified documents and materials

Taxonomy/Controlled Vocabulary/thesauri:

  • used for articles, images, electronic files, paragraphs or sections of text if separated out as digital content units
  • used primarily in online/digital space
  • based on descriptive words and phrases (terms). Codes, if any, are secondary.
  • involves assigning an item multiple taxonomy terms
  • manually or automatically (auto-tagging, auto-classification, etc.) assigned to content items
  • taxonomy terms restricted to subject information (not to include date, title, author, publishing department, etc.)
  • can easily be revised and updated
  • helps users identify which content items they want

Another way to think of the comparison:
Classification is for: where to put things/where does this document or item go.
Taxonomy is for: how to describe content/what is this text, image, or other media about.

So, while classification and taxonomy are related and both fall within the realm of information science, they are really quite different. Since they serve different purposes, they can actually co-exist and both be applied to the same corpus of documents. Libraries utilize both at the same time: a classification system (the Dewey Decimal or Library of Congress Classification call numbers on books and media) and a form of a taxonomy in the catalog subject headings (usually Library of Congress Subject Headings, which are not to be confused with Library of Congress Classification).
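To make the co-existence concrete, here is a minimal sketch of a catalog record that carries both kinds of metadata. Everything in it is an assumption for illustration: the `CatalogRecord` class, the function names, the placeholder titles, and the made-up call numbers and subject terms are not taken from any real catalog.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CatalogRecord:
    """Hypothetical record combining classification and taxonomy metadata."""
    title: str
    call_number: str           # classification: exactly one code, fixes the shelf location
    subjects: Tuple[str, ...]  # taxonomy: any number of descriptive subject terms


# Made-up sample data; the call numbers and subjects are placeholders.
RECORDS = [
    CatalogRecord("Book A", "025.4 AAA", ("Taxonomies", "Information retrieval")),
    CatalogRecord("Book B", "025.4 BBB", ("Thesauri", "Information retrieval")),
    CatalogRecord("Book C", "658.4 CCC", ("Knowledge management", "Taxonomies")),
]


def shelf_order(records: List[CatalogRecord]) -> List[str]:
    """Classification answers 'where does this item go': one position per item."""
    return [r.title for r in sorted(records, key=lambda r: r.call_number)]


def subject_index(records: List[CatalogRecord]) -> Dict[str, List[str]]:
    """Taxonomy answers 'what is this about': an item appears under every term it has."""
    index: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for subject in record.subjects:
            index[subject].append(record.title)
    return dict(index)


if __name__ == "__main__":
    print(shelf_order(RECORDS))
    print(subject_index(RECORDS)["Taxonomies"])
```

In this sketch each book occupies a single place in the shelf order, yet "Book A" and "Book C" sit on different shelves and still turn up together under the subject "Taxonomies".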

Taxonomy and classification may each involve different people, too: catalogers for classification and taxonomists for taxonomies. While some information professionals may do both, you cannot assume that all catalogers know how to create taxonomies or that all taxonomists understand classification. There is, of course, a larger and growing need for taxonomies, in contrast to classification and cataloging systems, as more content migrates online. Furthermore, taxonomies are more adaptable to change and thus in need of continual maintenance, in comparison to the rather static classification systems. Many catalogers are taking an interest in learning about taxonomies these days.

Taxonomists who understand something about classification can also put that knowledge to use. There are many large corporations and agencies with documents organized by customized classification systems, which are now migrating over to dynamic online content/document management and taxonomies. The legacy classification systems then need to be re-formed into (or replaced by) taxonomies, and then the legacy codes need to be mapped to the new taxonomy terms to ensure the continued retrieval of legacy documents. I did this kind of work as a consulting project for a large financial institution not long ago. There were thousands of legacy alpha-numeric codes, most of which combined both a document type attribute and a subject matter attribute into a single code, a typical feature of classification codes when a document can get only one code. A taxonomy, on the other hand, may have one facet for document type and another facet for subject, and a document can be assigned multiple subject taxonomy terms in addition to the document type term.
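The mapping step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in the actual project: the legacy codes (`RPT-CR`, `POL-AML`), the taxonomy terms, and the names `LEGACY_CODE_MAP`, `TaggedDocument`, `migrate`, and `find_by_legacy_code` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical mapping table: each legacy code bundles a document type and a
# subject, which the taxonomy splits into two separate facets.
LEGACY_CODE_MAP: Dict[str, dict] = {
    "RPT-CR": {"document_type": "Report", "subjects": ["Credit risk"]},
    "POL-AML": {"document_type": "Policy",
                "subjects": ["Anti-money laundering", "Compliance"]},
}


@dataclass
class TaggedDocument:
    doc_id: str
    legacy_code: str                     # kept so legacy lookups keep working
    document_type: Optional[str] = None  # document type facet: a single term
    subjects: List[str] = field(default_factory=list)  # subject facet: many terms


def migrate(doc_id: str, legacy_code: str) -> TaggedDocument:
    """Translate one legacy code into faceted taxonomy terms.

    Codes missing from the mapping table produce a document with empty facets,
    so they can be found and reviewed by hand.
    """
    entry = LEGACY_CODE_MAP.get(legacy_code)
    if entry is None:
        return TaggedDocument(doc_id, legacy_code)
    return TaggedDocument(doc_id, legacy_code,
                          entry["document_type"], list(entry["subjects"]))


def find_by_legacy_code(docs: List[TaggedDocument], code: str) -> List[str]:
    """Retrieve documents by their old code after migration."""
    return [d.doc_id for d in docs if d.legacy_code == code]


if __name__ == "__main__":
    docs = [migrate("DOC-001", "RPT-CR"), migrate("DOC-002", "POL-AML"),
            migrate("DOC-003", "XYZ-00")]
    for d in docs:
        print(d)
```

Keeping the legacy code on the migrated record is one possible design choice: it lets people who still know the old codes find their documents while new users search by document type and subject.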

As long as there are physical books, documents, and media, there is a need for classification, but if the entire content repository is digital, then taxonomies are the way to go.

  1. I like to think of it this way: you use a taxonomy to classify information. You cannot have a thesaurus or a controlled vocabulary without applying a particular classification scheme.