The Accidental Taxonomist

Wednesday, July 31, 2013

Tags and Categories

What does a taxonomy comprise and how does it work? Professional taxonomists may speak of “terms,” “nodes,” or “labels,” whereas most other people with a basic understanding of taxonomy might refer to “tags” or “categories.” A category is a well understood concept, and social media sites have made the notion of “tag” well known.

In addition to the different professional level of such jargon, there is also a distinction in meaning. Ironically, it’s the professional terminology that is vague and the layman terminology that is more specific. Taxonomy “terms,” “nodes,” or “labels,” are all pretty generic and can all have various applications for different kinds of taxonomies, both for broad categorization and for specific indexing. “Tags” and “categories,” on the other hand, each tend to have distinct meanings. It’s not so much what they are, or even how they are organized, but rather how they are used.

Tags are for tagging.
That seems obvious. As for what is meant by “tagging,” that implies you put a tag on something. In fact, you can put more than one tag on something, and that’s typically encouraged in tagging. “Something” is typically an electronic file of some form of content, a document, image, video, database record, blog post, etc. Tags tend to be a brief label indicating what something is about. Tags can be very specific or relatively broad. Information professionals might prefer to call them “index terms.” An organized, alphabetized list of tags could serve as an index.

Categories are for categorizing.
This can also be called grouping or classifying. It implies putting something into a category, often represented as a file folder, whether an actual electronic folder path, or just a depiction of a folder icon. While categories have different levels of specificity, the name category implies a collection of things, so there is an implicit understanding that categories don’t get too specific. An organized structure of categories typically constitutes a hierarchical taxonomy.

Can something go into more than one category? In physical folders, no (unless you make a photocopy of the document for each folder). In the digital world, the answer is often yes, but not always (again requiring the copying of files). It depends on the system, and it may involve some workaround. Even when it is possible to put a content item into more than one category, unlike with tags, it is still preferable to have most content items assigned to only one category, with a smaller number of them belonging in two categories. For example, there may be a breadcrumb trail for the hierarchy of categories, and the breadcrumb trail may only take a single path. The idea is that the categories retain distinct meaning and usage through mostly distinct content.

Tags and categories together
Because tags and categories are different, it is possible to have both at the same time, especially if the categories are deliberately kept broad and the tags are relatively specific. Content management systems and digital asset management systems increasingly offer features of both categories and tags for managing content. In these cases, the challenge is to decide how much of the classification the categories should carry and how much should be left to the tags. That's exactly what I have done as a taxonomist on two recent consulting projects.
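
The difference in usage can be sketched as a small data model. This is only an illustration under stated assumptions: the names ContentItem, category_path, and tag_index are hypothetical and do not come from any particular content management system. Each item gets one path through broad categories (which a breadcrumb trail can follow) and any number of specific tags (which can be listed as an index).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """Hypothetical content item: one broad category path, many tags."""
    title: str
    # A single path through the category hierarchy, broadest first.
    category_path: tuple[str, ...]
    # Tags are unordered, and one item can carry several of them.
    tags: set[str] = field(default_factory=set)

    def breadcrumb(self) -> str:
        """Render the single category path as a breadcrumb trail."""
        return " > ".join(self.category_path)


def tag_index(items: list[ContentItem]) -> dict[str, list[str]]:
    """Build an alphabetized index mapping each tag to the titles that carry it."""
    index: dict[str, list[str]] = {}
    for item in items:
        for tag in item.tags:
            index.setdefault(tag, []).append(item.title)
    # Sort the tags and the titles under each tag so the result reads like an index.
    return {tag: sorted(titles) for tag, titles in sorted(index.items())}
```

In this sketch the category drives navigation and the tags drive lookup, which is one way of keeping the categories broad while the tags stay specific.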

For the amateur taxonomist and indexer, one of the most common exposures to tags and categories is through blogs. Blogging software may permit the blog author to assign a tag or category to a blog post. Whether the tags and categories are appropriately named and used is another issue, though. Blogger.com provides only one option, which it calls "Labels" and utilizes an icon for a tag in the blogging interface, but then displays them when published in the right margin under a heading called "Categories." No wonder my "categories" don't look good; I had created them as if they were tags. Furthermore, the very specific subject matter of "The Accidental Taxonomist" blog makes its posts more suited for tagging than for categorizing. WordPress, on the other hand, gives the blogger both tools: tags and categories. If “The Accidental Taxonomist” blog eventually moves, you’ll know why.

Thursday, June 6, 2013

How Many Facets

Faceted taxonomies (taxonomies with attributes, dimensions, filters, etc. to limit search results based on the combination of selected criteria) are becoming increasingly popular with the support of web database technology. Unlike traditional hierarchical taxonomies, designing a faceted taxonomy first requires a decision on how many facets to create. There are various factors to take into consideration.

What the content supports

The nature of the content is always the most important factor. It may seem ironic, but content that is more limited in scope can support more facets than content that is broad in scope. For example, an ecommerce site selling just computers could have a relatively large number of facets by which to limit laptop computers: brand, price range, hard drive, screen size, operating system, processor brand, processor type, webcam inclusion, and online/in-store availability (9 facets). On the other hand, if a content repository comprises all kinds of articles, then there is not much else beyond “subject” and article type to classify them by (2 facets). (Other metadata fields, such as author, title, and date, may also be used to limit results, but these do not involve taxonomy terms.)
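
To make "limiting results by a combination of selected criteria" concrete, here is a minimal sketch of faceted filtering. The function name filter_by_facets, the record layout, and the brand names are all assumptions made up for illustration. Values selected within one facet are treated as alternatives, and different facets narrow the results together.

```python
def filter_by_facets(items, selections):
    """Keep the items that match every selected facet.

    Values chosen within one facet are alternatives (OR);
    different facets narrow the results together (AND).
    """
    return [
        item
        for item in items
        if all(item.get(facet) in allowed for facet, allowed in selections.items())
    ]


# Hypothetical records: each key is a facet, each value a facet term.
laptops = [
    {"name": "Laptop A", "brand": "Acme", "screen size": "13 inch", "operating system": "Linux"},
    {"name": "Laptop B", "brand": "Acme", "screen size": "17 inch", "operating system": "Windows"},
    {"name": "Laptop C", "brand": "Globex", "screen size": "15 inch", "operating system": "Windows"},
]

# The user picks one brand and two acceptable screen sizes.
selected = {"brand": {"Acme"}, "screen size": {"13 inch", "15 inch"}}
matches = [item["name"] for item in filter_by_facets(laptops, selected)]
print(matches)  # ['Laptop A']
```

The more attributes every record reliably has, the more facets a filter like this can offer, which is why narrowly scoped content supports more facets.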

What the end-user user interface supports

More facets can be included if they are stacked one above another vertically, such as in a left margin, than if they are displayed horizontally across the width of the screen. This is because horizontal scrolling is something users dislike and is avoided in content design, whereas limited vertical scrolling is acceptable.

Sometimes a website or intranet is created in a web content management system that does not give as much flexibility in taxonomy display. For example, SharePoint requires a horizontal list of facets, if the facets are to be used to filter content displayed in “columns,” where facet names are the column headers. Furthermore, SharePoint will by default create columns for document format type, content type, author, date created, and date modified. While you can hide these columns, if you want to use some of these defaults, that will limit the number of other descriptive facets for columns to about three or four.

Facets that limit search results are typically displayed in the left margin, so more facets can be created. However, the number of facets should be limited so that all of the facet labels (although not necessarily all of their contents/facet values/terms) display by default without scrolling. The first 4-6 terms or values within a facet should be displayed to give the user a good understanding of what is in there, with a link or button to “show more.” Scrolling can be used when a facet category is expanded. So, what needs to be considered is the vertical space if all facets display at least some values, and if that does not fit, whether some facets can be collapsed by default. For example, the facets for limiting people search results on LinkedIn display by default two facets with the first 6 terms, one facet with all 5 terms, and 12 facets collapsed (an unusually high number of facets).
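
The display rule described above (every facet label visible, a short preview of values, the rest behind “show more” or collapsed) can be sketched roughly as follows. The function facet_sidebar and its parameters preview and expanded are hypothetical names, and the defaults are only illustrative.

```python
def facet_sidebar(facets, preview=6, expanded=3):
    """Plan a sidebar: every facet label is listed, but only the first few are open.

    facets   -- dict mapping a facet label to its list of values, in display order
    preview  -- how many values an open facet shows before "show more"
    expanded -- how many facets are open by default; the rest are collapsed
    """
    plan = []
    for position, (label, values) in enumerate(facets.items()):
        is_open = position < expanded
        shown = values[:preview] if is_open else []
        plan.append({
            "label": label,
            "shown": shown,
            # Number of values hidden behind "show more" in an open facet.
            "show_more": len(values) - len(shown) if is_open else 0,
            "collapsed": not is_open,
        })
    return plan
```

Counting the rows such a plan produces (labels plus shown values) gives a rough estimate of the vertical space the facets need before any scrolling.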

What the tagging process supports

For manual tagging, you have to consider who is doing the tagging, what their knowledge and experience is, what level of training is practical, how much time and effort can practically be devoted to tagging, and what the tagging user interface looks like. As with the end-user UI, the tagging interface also needs to display all facets and facet values in an easy-to-use manner. Usually, people who tag content for internal content management are not dedicated indexers. To simplify tagging and ensure that it is done correctly, and done at all, there should not be too many facets for internal tagging (around 3, for example).

Organizations which tag/index content for subscription sale, on the other hand, where content indexing is core to their business, will invest in dedicated indexers who can be given thorough training in assigning terms from multiple facets and will also check their indexing for quality. Thus, for professional indexing, a greater number of facets can be supported.

In automated tagging, it’s not so much a matter of how many facets, but rather how distinct the facets are and how well suited they are to automated tagging. There are different technologies out there, but, in general, named entities/proper nouns are easier to distinguish than topical subjects. So, facets for author, location, department, product name, etc., are easy to classify automatically. Language and a document type that is based on file format are also straightforward for auto-classification. Subject or Topic could be a catch-all for high-ranked keywords. If you want to create facets for different kinds of topics, though, such as Purpose, Activity, Significance, Origin, etc., the distinctions will likely be too challenging for an auto-classification tool.

Monday, May 6, 2013

Topics and Document Types in Taxonomies

It’s quite common in a faceted taxonomy to have a Document/Content Type facet (which I’ll call DocType here), whose terms define what a content item “is” (a report, a blog post, a form, a contract, a letter, a policy, etc.), and also a Topic or Subject facet, whose terms describe what a content item is “about” (legal compliance, training, new business, insurance, company information, etc.). While usually it’s pretty clear-cut what belongs in the DocType facet and what belongs in the Topic facet, occasionally there are some ambiguous concepts, so asking the questions “what is it?” versus “what is it about?” helps in making the distinction.

Often the taxonomist can resolve ambiguity by editing the term so that a one-word generic document type is appended to a descriptive word. For example, “Marketing” by itself is a Topic, but “Marketing Material” is a DocType. This kind of decision is reached only after looking at the set of documents and determining whether there is a significant number of them that are really marketing materials versus a significant number of them that are really about marketing (and there could be both). You then have to decide how far to go with this. You could force otherwise topical concepts into DocTypes by adding the word “Document” to the end of many terms. For example, “Compliance” becomes “Compliance Document,” and “Client Management” becomes “Client Management Document.” Depending on your overall content set and taxonomy design, this may or may not be acceptable practice.

Another complicating issue that may come up in designing such a faceted taxonomy is what to do if certain Topics only occur in certain types of documents. This is not unusual. While DocTypes such as Report, Evaluation, Meeting Minutes, Memo, Article, Review, etc., are rather generic and could all be associated with any number of the same shared set of Topics, other DocTypes that are customized for a specific content set are more limited in their application. For example, there may be Topics for different types of approval to be used only with a DocType of “Approval Letter,” or Topics for types of product information to be used only with a DocType of “Product Information Sheet.”

There are two ways to handle this issue:

1. Create rules making certain Topics available as options only when certain DocTypes are assigned
This requires that DocType be assigned (tagged, indexed, matched, etc.) to a content item first, before the Topic is assigned. This can be seen as: the Topic is dependent on the DocType, or DocType terms drive the Topics, or the DocType takes precedence over the Topic. This is feasible with these facets, since a content item can be assigned only one DocType (in contrast with the possibility of getting assigned more than one Topic). What gets complicated, though, is if there are additional rules between other facets, with the terms in one facet driving the availability of terms in other facets, such as File Type, Source, Department, etc.

2. Merge the DocType and Topic facets into a single facet
This may seem extreme, but it could be practical, especially if it’s easier for the end-user. It works if there are not so many Topic terms (not many more than the total number of DocType terms), the majority of them are applicable to a single DocType term, and a user interface can be designed that supports an expandable/collapsible hierarchy, so a user clicks on a DocType and the applicable Topics underneath it display. Traditionally, taxonomies are hierarchical, after all. If a Topic term is valid for more than one DocType, then a valid polyhierarchy results. There could still be a distinct facet for File Type/Format (such as HTML, text, image, PDF, etc.), for which there would be no ambiguity, in contrast to the occasional ambiguity between DocTypes and Topics.

In either case—whether rules for the terms of one facet driving the availability of terms in another or whether a merged expandable hierarchical facet is created—collaboration is needed between the taxonomist and the technical experts who configure the implementation of taxonomy in the content/document management system.
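
As a starting point for that conversation, the first option (Topics that depend on the assigned DocType) might be expressed as a simple rule table. This is a sketch only: the Topic terms, the names SHARED_TOPICS and RESTRICTED_TOPICS, and both functions are hypothetical, and a real content management system would have its own way of configuring such dependencies.

```python
# Hypothetical terms. Shared Topics are available with any DocType.
SHARED_TOPICS = {"Compliance", "Training", "Insurance"}

# Restricted Topics become available only once a specific DocType is assigned.
RESTRICTED_TOPICS = {
    "Approval Letter": {"Budget approval", "Hiring approval"},
    "Product Information Sheet": {"Specifications", "Pricing"},
}


def available_topics(doc_type):
    """Return the Topics a tagger may choose after assigning this DocType."""
    return SHARED_TOPICS | RESTRICTED_TOPICS.get(doc_type, set())


def invalid_topics(doc_type, topics):
    """Return the assigned Topics that the DocType does not permit, sorted."""
    return sorted(set(topics) - available_topics(doc_type))
```

Because the lookup is keyed on a single DocType, the sketch relies on each content item having exactly one DocType, while the Topics passed to invalid_topics can be several.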

Monday, April 22, 2013

Capitalization in Taxonomies

The question often comes up: what is the preferred style for the capitalization of taxonomy terms? Other than all proper nouns being capitalized, there is no strict rule for generic terms. In making the determination, it’s important to address the following questions. What kind of taxonomy is it? How will it be used? Who are the users, and what might they be accustomed to or expect?

The ANSI/NISO standard Z39.19-2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies states: “predominantly lowercase terms should be used for terms in controlled vocabularies” and continues: “capitals should be used only for the initial letter of proper names, trade names and those components of taxonomic names, such as genus, which are conventionally capitalized.” But remember that ANSI/NISO Z39.19 comprises guidelines and not strict requirements, so the stylistic matter of case does not have to follow ANSI/NISO Z39.19, if a house style dictates otherwise.

Note that there are three options, not just two, for non-proper nouns/names, as these explanations themselves illustrate:
1.    all lower case (including the first letter of the first word)
2.    First letter of first word upper case
3.    Title Case (First Letter of the First and Main Words Capitalized)
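
The three styles can be sketched as small conversion functions, for instance to normalize a term list to a house style. This is a simplified illustration: the function names and the MINOR_WORDS list are assumptions, and the code handles generic terms only. Proper nouns would need an exception list, since lowercasing would wrongly strip their capitals.

```python
# Hypothetical list of short words left in lower case within title case.
MINOR_WORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}


def all_lower(term):
    """Style 1: all lower case, including the first letter of the first word."""
    return term.lower()


def initial_cap(term):
    """Style 2: only the first letter of the first word in upper case."""
    lowered = term.lower()
    return lowered[:1].upper() + lowered[1:]


def title_case(term):
    """Style 3: first letter of the first word and of each main word in upper case."""
    words = term.lower().split()
    return " ".join(
        word if position > 0 and word in MINOR_WORDS else word[:1].upper() + word[1:]
        for position, word in enumerate(words)
    )


print(all_lower("Classification of Documents"))    # classification of documents
print(initial_cap("classification of documents"))  # Classification of documents
print(title_case("classification of documents"))   # Classification of Documents
```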

While the distinctions between “controlled vocabularies,” “thesauri,” and hierarchical or faceted “taxonomies” can be blurred, these different types do tend to have different practices for capitalization.

A “controlled vocabulary,” as the word “vocabulary” might suggest, is a list of terms (as single words or phrases), similar to what might be found in a glossary, with the possible added feature of synonyms/variants for each preferred term. Capitalization, therefore, could be expected to follow dictionary rules and thus not used except for proper names. A “synonym ring” type of controlled vocabulary, in which no terms are designated as “preferred” and none are even displayed to the user, has no need for any capitalization.

A “thesaurus” is a more complex type of controlled vocabulary with hierarchical and/or associative relationships relating various terms to each other. What are called thesauri tend to be more term-focused than hierarchically focused, and they tend to be large with many detailed terms. The terms can be quite specific, and proper nouns can be mixed in. Thesauri have traditionally been used by indexers to manually index multiple documents consistently over time. The resulting display of terms associated with content for the end-user to browse through is a type of index. Indexes (such as those at the backs of books) often follow the style of lower-case entries for non-proper names, too. If the terms are numerous and specific, they will appear to be, and will be used as, “index terms” rather than “categories.” Thus, if it’s called a thesaurus, it will more likely have terms in lower case. The choice of initial capitalization for a thesaurus, though, would not be incorrect, and is probably becoming more common, just as initial capitalization is becoming more common in main entries in back-of-the-book indexes.

A “taxonomy” implies a hierarchical classification or categorization of concepts. When we think of categories we think of labels or headings with subcategories. Headings in general tend to have initial capitalization or title capitalization. Thus, if it’s a strictly hierarchical taxonomy, where all terms are interconnected into a single hierarchy or a limited number of hierarchies, then it will more likely have initial capitalization or title capitalization. Such capitalization is particularly common on the relatively smaller/less detailed taxonomies that are proliferating on websites, intranets, and content management systems. It fits in with the web design style of capitalization on headings and categories.

In faceted taxonomies, which have become more popular in web/online taxonomies, proper names can be separated into their own facet(s), and confusion between proper names and generic terms is reduced. However, I would still recommend capitalizing only the first letter of the first word, rather than using title case, to minimize any confusion with proper names. The facet name itself, however, could be in title capitalization, since it represents a category heading and not a term for indexing. In fact, it might even be desirable to distinguish the facet labels from the values/terms within each facet by use of a different case style.

A mixed style of different capitalization at different levels is possible in hierarchical taxonomies, too. But I would recommend that only the top terms, if any, have a different capitalization style. It would not be a good idea to have only the bottom-level terms (“leaf nodes”) in a different case style, because they could change. If you later decided that a leaf node should have narrower terms added, you wouldn't want to have to worry about changing the case of the term. A good application of the mixed capitalization style is if the top-level terms were not actually to be used in indexing/tagging but are really just categories/groupings of the actual index terms, which in turn are arranged hierarchically underneath. (Other typographical methods of distinction could also be used for any non-indexable top-level categories.)

In sum, all-lower case is most appropriate for non-displayed controlled vocabularies, any controlled vocabularies or thesauri that integrate proper nouns into the same hierarchies as generic terms, and large thesauri used to support manual indexing. Initial capitalization is fine for end-user browsable hierarchical taxonomies on the web. Title capitalization is OK for facet labels or the top categories in a hierarchical taxonomy. Whichever style is chosen, however, should be applied consistently.

Tuesday, April 2, 2013

Taxonomies vs. Classification

A question came up in one of my classes on how classification differs from taxonomies/thesauri. As part of an assignment to find thesauri on the web, a student sought to find “how the Federal Government classifies its publications and was expecting to find a very elaborate Thesaurus … and instead found… the Superintendent of Documents classification system,” and so the student asked how that classification system fits into the scheme of definitions for taxonomies, controlled vocabularies, and thesauri. That I will attempt to explain here.

We are familiar with classification schemes used to catalog and locate books and other materials in libraries, such as the Dewey Decimal system or, for academic libraries, the Library of Congress Classification (letter-based call “numbers”). In addition to the U.S. federal government’s “Superintendent of Documents” classification system, many other national governments and international organizations also have their own document classification schemes, and states and provinces may have modified versions. There are also classification systems for industries, such as the NAICS (North American Industry Classification System) codes. Corporations with large volumes of documents may have their own internal document classification systems.

I sum up the differences between classification schemes and taxonomies/thesauri as follows:

Classification:

used for books, monographs, documents, reports, contracts, or other media
developed for the classification of physical items for their location on shelves, drawers, or filing cabinets and physical file folders
based on alpha-numeric codes
involves assigning an item only one classification code
manually assigned to each item
classification codes may include additional information, such as date, title, author, or publishing department information within the same classification code
rarely gets changed (due to the pre-established numeric code hierarchy)
helps document managers and librarians organize documents and helps users locate pre-identified documents and materials

Taxonomy/Controlled Vocabulary/Thesauri:

used for articles, images, electronic files, paragraphs or sections of text if separated out as digital content units
used primarily in online/digital space
based on descriptive words and phrases (terms). Codes, if any, are secondary.
involves assigning an item multiple taxonomy terms
manually or automatically (auto-tagging, auto-classification, etc.) assigned to content items
taxonomy terms restricted to subject information (not to include date, title, author, publishing department, etc.)
can easily be revised and updated
helps users identify which content items they want

Another way to think of the comparison:
Classification is for: where to put things/where does this document or item go.
Taxonomy is for: how to describe content/what is this text, image, or other media about.

So, while both classification and taxonomy are related and are within the realm of information science, they are really quite different. Since they serve different purposes, they can actually co-exist and both be applied to the same corpus of documents. Libraries utilize both at the same time: a classification system (the Dewey Decimal or Library of Congress Classification call numbers on books and media) and a form of a taxonomy in the catalog subject headings (usually Library of Congress Subject Headings, which are not to be confused with Library of Congress Classification).

Taxonomy and classification may each involve different people, too: catalogers for classification and taxonomists for taxonomies. While some information professionals may do both, you cannot assume that all catalogers know how to create taxonomies or that all taxonomists understand classification. There is, of course, a larger and growing need for taxonomies, in contrast to classification and cataloging systems, as more content migrates online. Furthermore, taxonomies are more adaptable to change and thus in need of continual maintenance, in comparison to the rather static classification systems. Many catalogers are taking an interest in learning about taxonomies these days.

Taxonomists who understand something about classification can also put that knowledge to use. There are many large corporations and agencies with documents organized by customized classification systems, which are now migrating over to dynamic online content/document management and taxonomies. The legacy classification systems then need to be re-formed into (or replaced by) taxonomies, and then the legacy codes need to be mapped to the new taxonomy terms to ensure the continued retrieval of legacy documents. I did this kind of work as a consulting project for a large financial institution not long ago. There were thousands of legacy alpha-numeric codes, most of which combined both a document type attribute and a subject matter attribute into a single code, a typical feature of classification codes when a document can get only one code. A taxonomy, on the other hand, may have one facet for document type and another facet for subject, and a document can be assigned multiple subject taxonomy terms in addition to the document type term.
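
The mapping step can be pictured as a lookup table from each legacy code to one document type term plus one or more subject terms. The following is a sketch under assumptions, not a record of the actual project: the codes, the terms, and the names LEGACY_CODE_MAP and migrate are invented for illustration.

```python
# Hypothetical legacy codes; each one fused a document type and a subject.
LEGACY_CODE_MAP = {
    "LN-APR": {"doc_type": "Approval Letter", "subjects": ["Loans"]},
    "MT-RPT": {"doc_type": "Report", "subjects": ["Mortgages", "Risk"]},
}


def migrate(document, code_map=LEGACY_CODE_MAP):
    """Return a copy of the document tagged with taxonomy terms.

    The legacy code is kept on the record so that legacy documents
    can still be retrieved by it after the migration.
    """
    mapping = code_map.get(document["legacy_code"])
    if mapping is None:
        # Unmapped codes are flagged for manual review rather than guessed at.
        return {**document, "doc_type": None, "subjects": [], "needs_review": True}
    return {
        **document,
        "doc_type": mapping["doc_type"],
        "subjects": list(mapping["subjects"]),
        "needs_review": False,
    }
```

Splitting each code into two fields is what later allows a document to gain additional subject terms without touching its document type.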

As long as there are physical books, documents, and media, there is a need for classification, but if the entire content repository is digital, then taxonomies are the way to go.