Subjects, topics, index terms, keywords, controlled
vocabulary, thesaurus, taxonomy. These all refer to an organized, precise way
to find and retrieve desired information, where that information has been
indexed to terms. Indexing content with subject terms can be manual or
automated, but in either case the focus is on what the content is about, not
what words appear in the text. The subject terms represent unambiguous
concepts. A concept may have synonyms, and these are often included as
cross-references that redirect to the preferred term name and thus to the same
set of content. Before the era of digital content, subject categories or index
terms were the only method to find specific information, such as in a
back-of-the-book index or business categories in the yellow pages.
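As a rough illustration of how synonym cross-references lead to the same set of content, here is a minimal Python sketch. The terms, document IDs, and function names are all hypothetical, invented for this example, and a real thesaurus system would be considerably richer.

```python
# Hypothetical mini-thesaurus: each non-preferred term (synonym) points
# to one preferred term, like a "see" cross-reference in a book index.
USE_REFERENCES = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "motor vehicles": "Automobiles",
    "physicians": "Doctors",
}

# Hypothetical index: preferred term -> IDs of the content indexed to it.
INDEX = {
    "Automobiles": ["doc1", "doc4", "doc7"],
    "Doctors": ["doc2", "doc5"],
}


def resolve(term):
    """Return the preferred term for what the user typed, or None if unknown."""
    key = term.strip().lower()
    # The entry may already be a preferred term (matched case-insensitively).
    for preferred in INDEX:
        if preferred.lower() == key:
            return preferred
    # Otherwise follow the cross-reference from the synonym, if one exists.
    return USE_REFERENCES.get(key)


def lookup(term):
    """Return the content indexed to the concept, whichever name was used."""
    return INDEX.get(resolve(term), [])
```

In this sketch, `lookup("cars")`, `lookup("autos")`, and `lookup("Automobiles")` all return the same three documents, because each entry resolves to the one preferred term.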
Using subject terms to find desired content contrasts with using
a search engine for full text search. Search is based on the occurrence of
words, not concepts, so appropriate results can be missed if they use different
wording for the same concept, and inappropriate results can be retrieved if a
word has multiple meanings. The accuracy of search, without the additional
support of index terms/subjects, is dependent instead on the sophistication of
algorithms. The combinations of algorithms have improved only slightly in the
past decade or two. What has made a bigger difference in retrieving good
results through search (without subject indexing) is that in many cases the
volume of content has grown, and when search results are arranged by relevancy,
a larger number of initially displayed search results are satisfactory.
There are two issues with this kind of search. First, ordering
results by relevancy is not always the preferred option. Sometimes searchers
are interested in timely stories, so they want their results to be ordered by
date, newest first, but when relying on a search engine, the newest results might
not all be sufficiently relevant. Second,
such results are good for the searcher who only wants to get some or enough
information on a topic. If instead, the searcher wants to perform an exhaustive
search and retrieve everything available on a topic, there will likely be
relevant content that is missed in the search retrieval because it was worded
differently. Indexing with subject terms improves both precision (accuracy,
where incorrect content is not retrieved) and recall (comprehensiveness, where
appropriate content is not missed).
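To make the two measures concrete, the following Python sketch computes precision and recall for a single search. The document IDs and the function name are made up for illustration. The figures describe only this toy example and are not measurements of any real system.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one search.

    retrieved: IDs the search returned
    relevant:  IDs that are actually about the topic
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # correct results that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents are truly about the topic.
relevant_docs = {"doc1", "doc2", "doc3", "doc4"}

# A keyword search finds two of them, plus one false hit (doc9) caused by
# a word with another meaning, and misses two that were worded differently.
keyword_results = {"doc1", "doc2", "doc9"}

precision, recall = precision_recall(keyword_results, relevant_docs)
# precision is 2/3 (one of three results is wrong)
# recall is 2/4 (half of the relevant documents were missed)
```

If all four relevant documents had been indexed to the same subject term, a search on that term would retrieve exactly those four, giving precision and recall of 1.0 in this toy case.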
The role that index terms play in the search process has
evolved. Originally, researchers started with browsing a full list of subjects
that may have been arranged alphabetically (as a traditional book-style index)
or hierarchically (as a taxonomy), and they navigated the index to find more
specific subdivisions as aspects of the main heading, or they navigated the
taxonomy to drill down to the most specific term. As the volume of indexed
documents or other content items has grown over the years, browsing and
selecting a term from a taxonomy or thesaurus is often no longer as practical
or sufficient. An individual term may have too many records indexed to it. Furthermore,
many taxonomies and thesauri have grown too large to easily browse.
So, instead of taxonomy terms being used as the primary starting
point to find desired content, taxonomy terms are more often being used to
narrow or filter search results. The
user executes a search in the search engine, and if they get too many results,
they can limit or filter the results by various aspects listed in the margin, including
by indexed subject. (Other aspects could be date, document type, author, source,
etc.) The subjects can display in order of frequency of occurrence on the
records in the search result set, and the user can select among them, rather
than having to browse the entire taxonomy or thesaurus.
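The filtering behavior described above can be sketched in a few lines of Python. The records, subject terms, and function names here are hypothetical, and real search platforms implement this inside the search engine itself. The sketch only shows the idea of counting subjects across a result set and then narrowing by the one the user selects.

```python
from collections import Counter

# Hypothetical result set returned by a search; each record carries the
# subject terms it was indexed with.
RESULTS = [
    {"id": "doc1", "subjects": ["Taxonomies", "Indexing"]},
    {"id": "doc2", "subjects": ["Indexing"]},
    {"id": "doc3", "subjects": ["Search engines", "Indexing"]},
    {"id": "doc4", "subjects": ["Taxonomies"]},
]


def subject_facet(results):
    """List (subject, count) pairs, most frequent in the result set first."""
    counts = Counter(s for record in results for s in record["subjects"])
    return counts.most_common()


def filter_by_subject(results, subject):
    """Keep only the records indexed with the selected subject."""
    return [record for record in results if subject in record["subjects"]]


facet = subject_facet(RESULTS)                      # shown in the margin
narrowed = filter_by_subject(RESULTS, "Taxonomies")  # after the user clicks
```

Only subjects that actually occur in the current results appear in the facet list, which is why the user does not have to browse the entire taxonomy or thesaurus.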
Use of subjects and other attributes to limit search results
is becoming very common across various implementations, such as enterprise
search systems to find internal corporate documents, ecommerce websites for
selecting products, and library databases for selecting research articles, so
most people are familiar with using them.
The use of subjects to limit search results is similar to a
faceted taxonomy, although the designation “faceted taxonomy” typically refers
to a taxonomy where different types of terms are grouped into multiple facets.
In other words, a faceted taxonomy involves several facets or filters, whereas
a traditional taxonomy or thesaurus may comprise a single facet or filter, which
may be used in combination with other, non-taxonomy filters.
I will be exploring and demonstrating this topic,
specifically in the case of library subscription databases, in a presentation “Customer
Focused Thesauri,” in addition to a pre-conference workshop on taxonomy
creation, at the Computers in Libraries conference in Arlington, VA, in April.