Sunday, May 13, 2018

Creating Subject Terms for a Faceted Taxonomy


Faceted taxonomies—those that allow users to limit or filter search results by selecting terms or attributes from each of several types/aspects—are becoming increasingly common. They are easy and effective for end-users with various abilities in searching. When it comes to designing facets, some of the facets and their terms for a content collection may be obvious: Document or Content Type, Location, Audience, Purpose, etc. Creating a facet for Subject, for tagging topics the content is about, however, can be quite daunting.

Some faceted taxonomies do not have a Subject facet. Product taxonomies, such as for ecommerce, don’t have Subjects, but rather product categories. Enterprise taxonomies, such as those used in enterprise content or document management systems, also typically don’t have a Subject facet, but rather they have detailed terms in the Document Type facet and may have facets for Business Activity/Function, Department, Line of Business, or even something for Life Cycle/Phase/Stage.

The Subject facet is important, and can be quite large, in taxonomies used for tagging and retrieving content in a collection, library, or repository of published articles, research studies or reports, manuals, presentations, speeches, educational/training materials, images, videos, etc. If a large number of terms are needed to adequately cover the breadth and depth of the content, the Subject facet may comprise its own internal hierarchical taxonomy or thesaurus.

Coming up with the numerous Subject terms is more work and may require a different approach than for the terms of the other facets, which may be based on user needs and expectations. The terms in the Subject facet need to be based primarily on the subject of the content items being tagged. Other techniques for developing taxonomy terms, such as stakeholder interviews and search query logs, are helpful for other facets, but not so much for Subjects.

A taxonomy is built in a combination of top-down (identifying facets and top terms) and bottom-up (identifying the individual terms needed for indexing) tasks. The Subjects are developed a little bit top-down but more bottom-up. The top-down approach for Subjects starts with identifying the subject domain and scope and then any primary divisions in that domain, based on familiarity with the subject area and the content collection. The bottom-up approach involves looking at numerous individual content items to determine the main topics they are about and developing terms for these topics.

Determining what content items are about and what terms describe them is the activity of descriptive indexing. I prefer to use the word indexing rather than tagging here, especially in the absence of a taxonomy/controlled vocabulary, which has yet to be created, because it is an analytical task. (See my earlier blog post “Tagging vs. Indexing.”) So, at this stage it may help to have someone who has experience as an indexer do test-indexing of a rather large, representative sample of the content. Guidelines should be established at the start: for example, that each document is to be assigned index terms for the document as a whole, not for each section, and that a document should be assigned no more than three Subject index terms.
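
For anyone keeping test-indexing results in a script rather than a spreadsheet, here is a minimal sketch of how guidelines like these might be enforced. It assumes one record per whole document and a cap of three Subject terms; the names (`IndexingSample`, `assign`, `MAX_SUBJECT_TERMS`) and the sample documents and terms are hypothetical, not taken from any particular tool.

```python
from collections import Counter

# Assumed guideline: no more than three Subject terms per document.
MAX_SUBJECT_TERMS = 3


class IndexingSample:
    """Hypothetical record of test-indexing: document ID -> Subject terms."""

    def __init__(self, max_terms=MAX_SUBJECT_TERMS):
        self.max_terms = max_terms
        self.assignments = {}

    def assign(self, doc_id, terms):
        """Assign Subject terms to a whole document, enforcing the cap."""
        # Trim whitespace, skip empty strings, and drop duplicates in order.
        cleaned = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
        if len(cleaned) > self.max_terms:
            raise ValueError(
                f"{doc_id}: {len(cleaned)} terms exceeds the limit of {self.max_terms}"
            )
        self.assignments[doc_id] = cleaned

    def term_counts(self):
        """Count how many documents each term was assigned to."""
        return Counter(t for terms in self.assignments.values() for t in terms)


sample = IndexingSample()
sample.assign("doc-001", ["Solar energy", "Energy policy"])
sample.assign("doc-002", ["Solar energy", "Tax incentives", " Energy policy "])
counts = sample.term_counts()
```

The term counts give a quick view of which candidate terms recur across the sample, which is useful input for the periodic review of terms.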

The terms will need to be individually reviewed so that similar terms can be considered for merging into a single concept (and alternative labels/synonyms might be created). To keep the number of terms more manageable for review, it’s best to review and edit the terms periodically, before completing the test-indexing of the entire sample set of content. Thus, developing the taxonomy of Subjects by means of test-indexing is an iterative process. You will probably see trends, patterns, and possibly subcategories emerge from the terms as you collect them. The initial terms that come out of test-indexing can be quite specific and then made broader later. It’s easy to edit specific terms into broader terms, while it is not possible to go the other direction without reviewing the content again. In some cases, you can identify the key terms from the title of a document or the caption/description of an image, but often for text you need to read headings/subheadings and skim the text or look at an image.
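
The merging step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the decision about which terms to merge is still made by a person during review, and the function name `merge_terms`, the `merges` mapping, and the sample terms and counts are all hypothetical.

```python
from collections import Counter


def merge_terms(term_counts, merges):
    """Fold variant terms into the preferred terms chosen during review.

    term_counts: Counter of raw terms collected in test-indexing.
    merges: dict mapping a variant term to its preferred label.
    Returns (merged_counts, alt_labels), where alt_labels maps each
    preferred label to the variants retained as alternative labels.
    """
    merged = Counter()
    alt_labels = {}
    for term, count in term_counts.items():
        preferred = merges.get(term, term)
        merged[preferred] += count
        if preferred != term:
            # Keep the merged-away variant as a synonym of the concept.
            alt_labels.setdefault(preferred, []).append(term)
    return merged, alt_labels


# Hypothetical raw terms and usage counts from a round of test-indexing.
raw = Counter({"Solar power": 4, "Solar energy": 7, "Photovoltaics": 2, "Wind power": 3})
# Review decision (made by a person): treat "Solar power" as a synonym.
merges = {"Solar power": "Solar energy"}

merged, alt_labels = merge_terms(raw, merges)
```

Here "Photovoltaics" is deliberately left as its own specific term. It can always be folded into a broader term in a later review round, whereas splitting a broad term back out would mean going back to the content.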

Ideally, this test-indexing can be saved, so when actual indexing is done with the final taxonomy, the indexing work does not have to be repeated. But this is often not possible. So, test-indexing should not be too thorough or laborious. Before I became a taxonomist, I was an indexer, so I am quite efficient at this task, and I enjoy it.

Keep in mind that a taxonomy should continue to get updated even after it is implemented. This is especially the case for the Subjects, as new content will introduce new topics not yet included within the Subject terms. Thus, the test-indexing need not be completely comprehensive. It is understood that more Subject terms will be added as needed later. What is important is that new terms are added only in accordance with established policy.
