Faceted taxonomies—those that allow users to limit or filter search results by
selecting terms or attributes from each of several types/aspects—are becoming
increasingly common. They are easy and effective for end-users with various
abilities in searching. When it comes to designing facets, some of the facets and
their terms for a content collection may be obvious: Document or Content Type, Location,
Audience, Purpose, etc. Creating a Subject facet, for tagging the topics the
content is about, can be quite daunting, however.
Some faceted
taxonomies do not have a Subject facet. Product taxonomies, such as for
ecommerce, don’t have Subjects, but rather product categories. Enterprise
taxonomies, such as those used in enterprise content or document management
systems, also typically don’t have a Subject facet, but rather they have
detailed terms in the Document Type facet and may have facets for Business
Activity/Function, Department, Line of Business, or even something for Life
Cycle/Phase/Stage.
The Subject facet is
important, and can be quite large, in taxonomies for tagging and retrieving
content in a collection, library, or repository of published articles, research
studies or reports, manuals, presentations, speeches, educational/training materials,
images, videos, etc. If a large number of terms are needed to adequately cover
the breadth and depth of the content, the Subject facet may comprise its own internal
hierarchical taxonomy or thesaurus.
Coming up with the numerous
Subject terms is more work and may require a different approach than for the
terms of the other facets, which may be based on user needs and expectations.
The terms in the Subject facet need to be based primarily on the subject of the
content items being tagged. Other techniques for developing taxonomy terms,
such as stakeholder interviews and search query logs, are helpful for other
facets, but not so much for Subjects.
A taxonomy is built in
a combination of top-down (identifying facets and top terms) and bottom-up
(identifying the individual terms needed for indexing) tasks. The Subjects are
developed a little bit top-down but more bottom-up. The top-down approach for
Subjects starts with identifying the subject domain and scope and then any
primary divisions in that domain, based on familiarity with the subject area
and the content collection. The bottom-up approach involves looking at numerous
individual content items to determine the main topics they are about and
developing terms for these topics.
Determining what
content items are about and what terms describe them is the activity of descriptive
indexing. I prefer to use the word indexing rather than tagging here, especially in the
absence of a taxonomy/controlled vocabulary, which has yet to be created,
because it is an analytical task. (See my earlier blog post “Tagging vs. Indexing.”) So, at this stage it may help to have
someone who has experience as an indexer do test-indexing of a rather large,
representative sample of the content. Guidelines should be established at the
start, for example, that each document is to be assigned index terms for the document as
a whole, not for each section, and that a document should be assigned no more
than three Subject index terms.
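If the test-indexing is captured in a script or spreadsheet export rather than on paper, guidelines like these can be enforced as the terms are recorded. The following is only a minimal sketch: the `IndexRecord` class, its field names, and the document ID are hypothetical, and the limit of three simply mirrors the example guideline above.

```python
from dataclasses import dataclass, field

# Example guideline from the text: no more than three Subject terms
# per document. Adjust to whatever your own guidelines say.
MAX_SUBJECT_TERMS = 3


@dataclass
class IndexRecord:
    """Hypothetical record: one document, indexed as a whole."""
    doc_id: str
    subject_terms: list[str] = field(default_factory=list)

    def add_term(self, term: str) -> None:
        term = term.strip()
        if term in self.subject_terms:
            return  # ignore exact repeats for the same document
        if len(self.subject_terms) >= MAX_SUBJECT_TERMS:
            raise ValueError(
                f"{self.doc_id} already has {MAX_SUBJECT_TERMS} Subject terms"
            )
        self.subject_terms.append(term)


record = IndexRecord("doc-001")
for term in ["Solar energy", "Energy policy", "Solar energy", "Tax incentives"]:
    record.add_term(term)
```

A fourth distinct term raises an error, which prompts the indexer to decide which three topics the document is mainly about.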
The terms will need to
be individually reviewed so that similar terms can be considered for merging
into a single concept (and alternative labels/synonyms might be created). To
keep the number of terms more manageable for review, it’s best to review and
edit the terms periodically, before completing the test-indexing of the entire
sample set of content. Thus, developing the taxonomy of Subjects by means of
test-indexing is an iterative process. You will probably see trends, patterns,
and possibly subcategories emerge from the terms as you collect them. The
initial terms that come out of test-indexing can be quite specific and then
made broader later. It’s easy to edit specific terms into broader terms, while
it is not possible to go the other direction without reviewing the content again.
In some cases, you can identify the key terms from the title of a document or
the caption/description of an image, but often you need to read the
headings/subheadings and skim the text, or look at the image itself.
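Part of the periodic review can be supported with a simple script that surfaces near-duplicate terms as candidates for merging. This is a rough sketch under stated assumptions: the `normalize` and `merge_candidates` functions and the sample terms are invented for illustration, and the normalization (lowercasing and dropping a trailing plural "s") is a crude heuristic, not a substitute for the taxonomist's judgment about which terms represent the same concept.

```python
from collections import Counter, defaultdict


def normalize(term: str) -> str:
    """Crude grouping key: lowercase, collapse spaces, drop a trailing plural 's'."""
    key = " ".join(term.lower().split())
    if key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def merge_candidates(assigned_terms: list[str]) -> dict[str, list[str]]:
    """Group term variants that share a normalized key.

    Only groups with more than one variant are returned. Within a group,
    the most frequently assigned variant comes first, as a possible
    preferred label; the others are possible alternative labels.
    """
    counts = Counter(assigned_terms)
    groups = defaultdict(list)
    for term in counts:
        groups[normalize(term)].append(term)
    return {
        key: sorted(variants, key=lambda t: (-counts[t], t))
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Hypothetical terms collected so far from test-indexing.
collected = ["Solar panels", "solar panel", "Solar Panels", "Wind turbines", "solar panel"]
candidates = merge_candidates(collected)
```

Each group is only a suggestion for review. True synonyms with different wording will not be caught this way and still need to be spotted by reading through the term list.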
Ideally, this
test-indexing can be saved, so when actual indexing is done with the final
taxonomy, the indexing work does not have to be repeated. But this is often
not possible. So, test-indexing should not be too thorough or laborious. Before
I became a taxonomist, I was an indexer, so I am quite efficient at this task, and
I enjoy it.
Keep in mind that a
taxonomy should continue to get updated even after it is implemented. This is
especially the case for the Subjects, as new content will introduce new topics
not yet included within the Subject terms. Thus, the test-indexing need not be
completely comprehensive. It is understood that more Subject terms will be
added as needed later. What is important is that new terms are added only in
accordance with established policy.