The Accidental Taxonomist: April 2013

Monday, April 22, 2013

Capitalization in Taxonomies

The question often comes up: what is the preferred style for the capitalization of taxonomy terms? Other than all proper nouns being capitalized, there is no strict rule for generic terms. In making the determination, it’s important to address the following questions. What kind of taxonomy is it? How will it be used? Who are the users, and what might they be accustomed to or expect?

The ANSI/NISO standard Z39.19-2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies states: “predominantly lowercase terms should be used for terms in controlled vocabularies” and continues: “capitals should be used only for the initial letter of proper names, trade names and those components of taxonomic names, such as genus, which are conventionally capitalized.” But remember that ANSI/NISO Z39.19 comprises guidelines and not strict requirements, so the stylistic matter of case does not have to follow ANSI/NISO Z39.19, if a house style dictates otherwise.

Note that there are three options, not just two, for non-proper nouns/names, as these explanations themselves illustrate:
1.    all lower case (including the first letter of the first word)
2.    First letter of first word upper case
3.    Title Case (First Letter of the First and Main Words Capitalized)
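
As a rough illustration, the three styles could be sketched in Python as follows. The function names and the short list of minor words are assumptions made for this example, not part of any standard. The sketch also lowercases everything first, so it would flatten proper nouns, which need separate handling in a real vocabulary.

```python
# Hypothetical helpers illustrating the three capitalization options.
# MINOR_WORDS is an assumed, incomplete list of words kept lowercase in title case.
MINOR_WORDS = {"a", "an", "and", "as", "at", "by", "for", "in", "of", "on", "or", "the", "to"}


def all_lower(term: str) -> str:
    """Option 1: every letter lowercase, including the first."""
    return term.lower()


def initial_cap(term: str) -> str:
    """Option 2: only the first letter of the first word is uppercase."""
    lowered = term.lower()
    return lowered[:1].upper() + lowered[1:]


def title_case(term: str) -> str:
    """Option 3: the first word and all main words are capitalized."""
    words = term.lower().split()
    result = []
    for i, word in enumerate(words):
        if i == 0 or word not in MINOR_WORDS:
            result.append(word[:1].upper() + word[1:])
        else:
            result.append(word)
    return " ".join(result)
```

For example, `title_case("management of controlled vocabularies")` gives "Management of Controlled Vocabularies", while `initial_cap` on the same input capitalizes only the first "M".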

While the distinctions between “controlled vocabularies,” “thesauri,” and hierarchical or faceted “taxonomies” can be blurred, these different types do tend to have different practices for capitalization.

A “controlled vocabulary,” as the word “vocabulary” might suggest, is a list of terms (as single words or phrases), similar to what might be found in a glossary, with the possible added feature of synonyms/variants for each preferred term. Capitalization, therefore, could be expected to follow dictionary rules and thus not be used except for proper names. A “synonym ring” type of controlled vocabulary, in which no terms are designated as “preferred” and none are even displayed to the user, has no need for any capitalization.

A “thesaurus” is a more complex type of controlled vocabulary with hierarchical and/or associative relationships relating various terms to each other. What are called thesauri tend to be more term-focused than hierarchically focused, and they tend to be large with many detailed terms. The terms can be quite specific, and proper nouns can be mixed in. Thesauri have traditionally been used by indexers to manually index multiple documents consistently over time. The resulting display of terms associated with content for the end-user to browse through is a type of index. Indexes (such as those at the backs of books) often follow the style of lower-case entries for non-proper names, too. If the terms are numerous and specific, they will appear to be, and will be used as, “index terms” rather than “categories.” Thus, if it’s called a thesaurus, it will more likely have terms in lower case. The choice of initial capitalization for a thesaurus, though, would not be incorrect, and is probably becoming more common, just as initial capitalization is becoming more common in main entries in back-of-the-book indexes.

A “taxonomy” implies a hierarchical classification or categorization of concepts. When we think of categories we think of labels or headings with subcategories. Headings in general tend to have initial capitalization or title capitalization. Thus, if it’s a strictly hierarchical taxonomy, where all terms are interconnected into a single hierarchy or a limited number of hierarchies, then it will more likely have initial capitalization or title capitalization. Such capitalization is particularly common on the relatively smaller/less detailed taxonomies that are proliferating on websites, intranets, and content management systems. It fits in with the web design style of capitalization on headings and categories.

In faceted taxonomies, which have become more popular in web/online taxonomies, proper names can be separated into their own facet(s), and confusion between proper names and generic terms is reduced. However, I would still recommend capitalizing only the first letter of the first word, rather than using title case, to minimize any confusion with proper names. The facet name itself, however, could be in title capitalization, since it represents a category heading and not a term for indexing. In fact, it might even be desirable to distinguish the facet labels from the values/terms within each facet by use of a different case style.

A mixed style of different capitalization at different levels is possible in hierarchical taxonomies, too. But I would recommend that only the top terms, if any, have a different capitalization style. It would not be a good idea to have only the bottom-level terms (“leaf nodes”) in a different case style, because they could change. If you decided that a leaf node should later have narrower terms added, you wouldn't want to have to worry about changing the case of the term. A good application of the mixed capitalization style is if the top-level terms were not actually to be used in indexing/tagging but are really just categories/groupings of the actual index terms, which in turn are arranged hierarchically underneath. (Other typographical methods of distinction could also be used for any non-indexable top-level categories.)
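
The reasoning about leaf nodes can be shown with a small Python sketch. It assumes a hypothetical `Term` tree and chooses the case style from a term's depth alone, so that adding narrower terms under a former leaf never changes that term's display. The class, the function names, and the sample terms are invented for this illustration, and `str.title()` is only a naive stand-in for title case.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A hypothetical taxonomy term with optional narrower terms."""
    label: str
    children: list["Term"] = field(default_factory=list)


def styled_label(label: str, depth: int) -> str:
    """Pick the case style from depth alone, never from leaf status."""
    if depth == 0:
        # Naive title case for top-level categories; str.title() also
        # capitalizes minor words such as "and", so real labels need more care.
        return label.title()
    lowered = label.lower()
    return lowered[:1].upper() + lowered[1:]


def render(term: Term, depth: int = 0) -> list[str]:
    """Return the indented display lines for a term and its narrower terms."""
    lines = ["  " * depth + styled_label(term.label, depth)]
    for child in term.children:
        lines.extend(render(child, depth + 1))
    return lines


tree = Term("office supplies", [Term("writing instruments", [Term("ballpoint pens")])])
before = render(tree)

# Give the former leaf a narrower term; its own display line stays the same.
tree.children[0].children[0].children.append(Term("retractable ballpoint pens"))
after = render(tree)
```

Had the style depended on whether a term is a leaf, "Ballpoint pens" would have needed recasing the moment it gained a narrower term.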

In sum, all-lower case is most appropriate for non-displayed controlled vocabularies, any controlled vocabularies or thesauri that integrate proper nouns into the same hierarchies as generic terms, and large thesauri used to support manual indexing. Initial capitalization is fine for end-user browsable hierarchical taxonomies on the web. Title capitalization is OK for facet labels or the top categories in a hierarchical taxonomy. Whichever style is chosen, however, should be applied consistently.

Tuesday, April 2, 2013

Taxonomies vs. Classification

A question had come up in one of my classes on how classification differs from taxonomies/thesauri. As part of an assignment to find thesauri on the web a student sought to find “how the Federal Government classifies its publications and was expecting to find a very elaborate Thesaurus … and instead found… the Superintendent of Documents classification system,” and so the student asked how that classification system fits into the scheme of definitions for taxonomies, controlled vocabularies, and thesauri. That I will attempt to explain here.

We are familiar with classification schemes used to catalog and locate books and other materials in libraries, such as the Dewey Decimal system or, for academic libraries, the Library of Congress Classification (letter-based call “numbers”). In addition to the U.S. federal government’s “Superintendent of Documents” classification system, many other national governments and international organizations also have their own document classification schemes, and states and provinces may have modified versions. There are also classification systems for industries, such as the NAICS (North American Industry Classification System) codes. Corporations with large volumes of documents may have their own internal document classification systems.

I sum up the differences between classification schemes and taxonomies/thesauri as follows:

Classification:

used for books, monographs, documents, reports, contracts, or other media
developed for the classification of physical items for their location on shelves, drawers, or filing cabinets and physical file folders
based on alpha-numeric codes
involves assigning an item only one classification code
manually assigned to each item
classification codes may include additional information, such as date, title, author, or publishing department information within the same classification code
rarely gets changed (due to the pre-established numeric code hierarchy)
helps document managers and librarians organize documents and helps users locate pre-identified documents and materials

Taxonomy/controlled vocabulary/thesaurus:

used for articles, images, electronic files, paragraphs or sections of text if separated out as digital content units
used primarily in online/digital space
based on descriptive words and phrases (terms). Codes, if any, are secondary.
involves assigning an item multiple taxonomy terms
manually or automatically (auto-tagging, auto-classification, etc.) assigned to content items
taxonomy terms restricted to subject information (not to include date, title, author, publishing department, etc.)
can easily be revised and updated
helps users identify which content items they want

Another way to think of the comparison:
Classification is for: where to put things/where does this document or item go.
Taxonomy is for: how to describe content/what is this text, image, or other media about.

So, while both classification and taxonomy are related and are within the realm of information science, they are really quite different. Since they serve different purposes, they can actually co-exist and both be applied to the same corpus of documents. Libraries utilize both at the same time: a classification system (the Dewey Decimal or Library of Congress Classification call numbers on books and media) and a form of a taxonomy in the catalog subject headings (usually Library of Congress Subject Headings, which are not to be confused with Library of Congress Classification).

Taxonomy and classification may each involve different people, too: catalogers for classification and taxonomists for taxonomies. While some information professionals may do both, you cannot assume that all catalogers know how to create taxonomies or that all taxonomists understand classification. There is, of course, a larger and growing need for taxonomies, in contrast to classification and cataloging systems, as more content migrates online. Furthermore, taxonomies are more adaptable to change and thus in need of continual maintenance, in comparison to the rather static classification systems. Many catalogers are taking an interest in learning about taxonomies these days.

Taxonomists who understand something about classification can also put that knowledge to use. There are many large corporations and agencies with documents organized by customized classification systems, which are now migrating over to dynamic online content/document management and taxonomies. The legacy classification systems then need to be re-formed into (or replaced by) taxonomies, and then the legacy codes need to be mapped to the new taxonomy terms to ensure the continual retrieval of legacy documents. I did this kind of work as a consulting project for a large financial institution not long ago. There were thousands of legacy alpha-numeric codes, most of which combined both a document type attribute and a subject matter attribute into a single code, a typical feature of classification codes when a document can get only one code. A taxonomy, on the other hand, may have one facet for document type and another facet for subject, and a document can be assigned multiple subject taxonomy terms in addition to the document type term.
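
To make the mapping idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: the code format (a document-type prefix and a subject suffix joined by a hyphen), the lookup tables, and the term labels are invented for illustration and are not taken from the actual project.

```python
from dataclasses import dataclass, field

# Hypothetical legacy code parts: the prefix encodes document type,
# the suffix encodes subject matter.
DOC_TYPE_BY_PREFIX = {"RPT": "Reports", "CTR": "Contracts"}
SUBJECT_BY_SUFFIX = {"MTG": "Mortgages", "INS": "Insurance"}


@dataclass
class TaggedDocument:
    """A document tagged with one document type term and any number of subject terms."""
    legacy_code: str
    document_type: str
    subjects: list[str] = field(default_factory=list)


def map_legacy_code(code: str) -> TaggedDocument:
    """Split a single legacy code into a document type term and a subject term."""
    prefix, _, suffix = code.partition("-")
    return TaggedDocument(
        legacy_code=code,
        document_type=DOC_TYPE_BY_PREFIX[prefix],
        subjects=[SUBJECT_BY_SUFFIX[suffix]],
    )


doc = map_legacy_code("RPT-MTG")
# Unlike the single legacy code, the taxonomy allows further subject terms.
doc.subjects.append("Insurance")
```

Keeping the legacy code on the record alongside the new terms is one way to preserve retrieval of legacy documents by their old codes. A real mapping would also need to handle codes that have no clean equivalent.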

As long as there are physical books, documents, and media, there is a need for classification, but if the entire content repository is digital, then taxonomies are the way to go.