Taxonomies and thesauri are only truly useful if their terms
are appropriately indexed or tagged to content. My path to becoming a taxonomist was
as an indexer, so I have always valued human indexers. Nevertheless,
I must acknowledge that automated indexing, also called auto-categorization, is
becoming increasingly common and important.
At the most recent Taxonomy Boot Camp conference (November
6-7, in Washington, DC), a trend I discerned was the increasingly
commonplace use of auto-categorization (or at least machine-aided indexing) with taxonomies. Conference
presentations didn’t present auto-categorization as something new but rather
mentioned it matter-of-factly, as in: by the way, the software vendor used in
this case is so-and-so. There were also sessions on artificial intelligence and
taxonomy and on leveraging taxonomy management with machine learning. There is
also a lot of interest in text analytics, a field broader than
auto-categorization, which justified the first Text Analytics Forum conference
co-located with and immediately following Taxonomy Boot Camp (which I,
unfortunately, did not have time for).
Conference speakers and others state that automated
indexing has been proven repeatedly in test comparisons to be more “reliable”
and more “consistent” than human/manual indexing. While that is true, it does not
mean automated indexing is better. Human indexing is certainly not as consistent, as two
trained indexers will not index exactly the same way, but the differences
are rarely substantial. One indexer may add an extra index term. Another
indexer may index with a slightly different, but related, term. Automated
indexing, on the other hand, while consistent, is not as accurate. Depending on
the method, approximately 20% of its indexing can be inaccurate, assigning completely
wrong terms or completely missing the most appropriate terms. That’s where
“machine-aided indexing” comes in, where indexing is initially automated, but a
human quickly reviews the suggested terms, adding or deleting terms as
appropriate.
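As a rough illustration of that review step, here is a minimal Python sketch. The function name `review_suggestions` and the sample terms are hypothetical and not taken from any particular product; the sketch assumes the reviewer's decisions arrive as two simple lists, terms to reject and terms to add.

```python
def review_suggestions(suggested, rejected=(), added=()):
    """Apply a human reviewer's decisions to automatically suggested terms.

    All names here are illustrative assumptions, not a real product's API.
    """
    rejected = set(rejected)
    # Keep the suggested terms the reviewer did not reject, in original order.
    final = [term for term in suggested if term not in rejected]
    # Append terms the reviewer added, skipping any already present.
    for term in added:
        if term not in final:
            final.append(term)
    return final


# Hypothetical example: the software suggests three terms for an article
# about central banks; the reviewer drops one wrong term and adds a missed one.
suggested = ["Banking", "Rivers", "Interest rates"]
final_terms = review_suggestions(
    suggested, rejected=["Rivers"], added=["Monetary policy"]
)
print(final_terms)
```

The point of the design is that the human only handles the exceptions, which is what makes machine-aided indexing faster than indexing from scratch.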
The primary reason for implementing automated indexing is
not so much to achieve consistent indexing, but rather to achieve efficient
indexing. This is because the amount of content to be indexed in many
organizations is growing too fast for manual indexing to keep up. Publishers
of external content for subscribers have also transitioned to partially automated
indexing or machine-aided indexing.
While enterprise search engines do not utilize taxonomies by
default (but can be configured to make use of them), auto-categorization software
generally uses some form of taxonomies. Search engines can function
out-of-the-box without any taxonomies or controlled vocabularies, although a
search thesaurus (a.k.a. synonym ring) can significantly improve search
precision and recall. Auto-categorization software, on the other hand, relies
on “categories,” which can be simple controlled vocabularies or hierarchical or
faceted taxonomies. Thus, as auto-categorization gains wider adoption, the
need for taxonomies to support it is also growing.
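To make the synonym ring idea concrete, here is a minimal Python sketch of query expansion. The rings, the name `SYNONYM_RINGS`, and the function `expand_query` are hypothetical examples, not part of any search engine's API. The sketch shows only the recall side: a search for one term also retrieves content that uses an equivalent term.

```python
# Hypothetical synonym rings: each set holds terms treated as equivalent,
# with no preferred term among them.
SYNONYM_RINGS = [
    {"car", "automobile", "motor vehicle"},
    {"physician", "doctor"},
]


def expand_query(term, rings=SYNONYM_RINGS):
    """Return the search term plus every equivalent from rings containing it."""
    key = term.lower()
    expanded = {key}
    for ring in rings:
        if key in ring:
            expanded |= ring
    return sorted(expanded)


print(expand_query("Doctor"))  # ['doctor', 'physician']
print(expand_query("taxonomy"))  # ['taxonomy'] (no ring, so no expansion)
```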
Automated indexing technologies have not advanced
significantly in recent years, but there have been improvements in auto-categorization
software by effectively combining more than one technology method within the
same software product. The main technology methods are (1) rules-based and (2) machine-learning.
Regardless of the method, automated indexing is still not fully automated.
Humans are required to put in time and effort beforehand to either write or
edit rules for each taxonomy term, or to provide and test training sets of
sample documents to index for machine learning. These could be dedicated roles
or additional tasks to be performed by the taxonomist.
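The human preparation each method requires can be sketched in a few lines of Python. Everything below is a hypothetical illustration: the rule format, the taxonomy terms, and the function names are assumptions, and the second function is a deliberately crude word-count stand-in for a real machine-learning classifier, not how any commercial product works.

```python
from collections import Counter

# (1) Rules-based: a person writes a rule for each taxonomy term.
# Hypothetical rule format: trigger phrases ("any") and blocking phrases ("none").
RULES = {
    "Mergers and acquisitions": {"any": ["merger", "acquisition"], "none": []},
    "Banking": {"any": ["bank", "lender"], "none": ["river bank"]},
}


def categorize_by_rules(text, rules=RULES):
    """Assign every term whose rule has a trigger phrase and no blocking phrase."""
    lowered = text.lower()
    terms = []
    for term, rule in rules.items():
        has_trigger = any(phrase in lowered for phrase in rule["any"])
        is_blocked = any(phrase in lowered for phrase in rule["none"])
        if has_trigger and not is_blocked:
            terms.append(term)
    return terms


# (2) Machine learning: a person supplies sample documents for each term.
def train(training_sets):
    """Build a word-frequency profile per term from its sample documents."""
    return {
        term: Counter(word for doc in docs for word in doc.lower().split())
        for term, docs in training_sets.items()
    }


def categorize_by_examples(text, profiles):
    """Pick the term whose profile overlaps most with the words of the text."""
    words = text.lower().split()
    scores = {
        term: sum(profile[word] for word in words)
        for term, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Hypothetical training sets, far smaller than anything usable in practice.
TRAINING_SETS = {
    "Banking": ["bank loans and interest rates", "the bank raised interest rates"],
    "Sports": ["the team won the match", "players scored in the match"],
}
profiles = train(TRAINING_SETS)

print(categorize_by_rules("The bank announced a merger with a rival lender."))
print(categorize_by_rules("We walked along the river bank."))
print(categorize_by_examples("interest rates at the bank", profiles))
```

In both cases the quality of the output depends on the human work done up front: the blocking phrase for "river bank" had to be anticipated by whoever wrote the rule, and the example-based method can only be as good as its sample documents.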
Auto-categorization is also becoming more common because
software products that effectively combine taxonomy management with
auto-categorization have become more established and better integrated. Although
many organizations continue to use distinctly separate software
for taxonomy management and for auto-categorization, organizations newer to
taxonomy adoption prefer to have a single solution. Synaptica is the one major
taxonomy management vendor that does not yet include fully integrated
auto-categorization, and they are very actively working on incorporating the
technology. My book, The Accidental Taxonomist, has separate chapters for taxonomy management software and
auto-categorization software, but in the second edition I ended up listing
more of the same vendors in both chapters.