The Accidental Taxonomist

Friday, April 11, 2014

Taxonomy Software Directories

It's difficult to find a list of taxonomy management software that is both comprehensive and up to date, yet not overwhelmed with related products and services. I define taxonomy management software as a tool to manually build and edit taxonomies, controlled vocabularies, and thesauri in accordance with industry standards. It should be the primary tool used by those who work as taxonomists. Lists of “taxonomy software,” however, may include more than just tools for taxonomy management, such as auto-classification/auto-categorization/auto-indexing software, search software that utilizes taxonomies, or mind-mapping and other graphical categorization tools, etc.

Taxonomy maintenance, unfortunately, is just too small a niche for the major evaluators of software, whether consultancies, industry research firms, or trade publications, to find it worth their time to study. Companies that research the information technology market, such as Forrester Research, Gartner, International Data Corporation (IDC), and Real Story Group, won't get the commercial payoff from preparing studies of the taxonomy management software industry and products.

At the time I wrote my book, the most comprehensive directory of taxonomy software I found, and to which I referred my readers, was that of the British consultant Leonard Will, on the website of his consulting business Willpower Information, which lists 38 software packages, both commercial and freeware. Leonard Will had contacted each vendor and thus provided descriptive and contact information for each tool. The fact that this was a directory of "thesaurus" software and not “taxonomy” software is not an issue, and it was probably a good thing to include only software that meets thesaurus expectations. This directory was very comprehensive, including lesser-known free and open source software, which over time tended to become unsupported or even unavailable. With an interest in posterity, Leonard Will kept the unavailable software listed in his directory, merely adding a note to that effect. This may have been interesting for anyone thinking of developing their own thesaurus software, as they may be able to track down these other developers. For someone looking for a good commercial solution, however, there are far too many outdated products to weed through.

After Leonard Will retired, he decided he did not want to spend the time maintaining his directory, which he last updated in 2007, and in 2011 he offered the content of his directory to someone else, specifically contacting both Margie Hlava of Access Innovations and myself. Then Margie and I had to figure out which one of us would take it, fully aware that the rich content on a website would help our own respective business websites, yet it would also take quite a bit of time and effort to set up and maintain. After a year of hoping to find time, I finally conceded that I would not and told Margie she could take it. The successor to the Willpower Thesaurus software directory, maintained by Margie’s employee Eric Ziecker, now resides at http://www.taxobank.org/content/thesauri-and-vocabulary-control-thesaurus-software

The core of TaxoBank's directory “Software for building and editing thesauri” at present is still essentially the same as the Willpower site, maintaining the original tabular content, style, colors, etc. of that site, so visitors to the TaxoBank site may recognize it from Willpower. Posterity still seems to be valued, as all but one of the same 38 software packages are still there, although in two cases there is a note saying “The particular software referenced above is no longer available.” The notes section for many packages has been updated with additional content extracted from the vendor websites. More updating is still pending, though, as operating systems listed are dated, such as “Windows 95/98/NT/2000/XP.”

The main difference from the original Willpower site is the addition of 63 other products in a new section, separated by the note “Additional indexing, taxonomy, controlled vocabulary, thesaurus, classification, mapping and ontology software and services not referenced in Leonard Will's original listing follows below.” These additional products include many products not specific to “building and editing thesauri,” such as Apache Lucene, EMC Documentum, Oracle Endeca, Google site search, HP Autonomy, IBM InfoSphere, and Microsoft SharePoint, along with one taxonomy consulting service. In my opinion, it might be better to have the related products and services on a separate web page to avoid possible confusion and to keep the list to a manageable length, as the total web page is currently 145 printed pages long. Despite these issues, I praise Margie and Eric for making the effort to maintain this valuable resource.

As for a shorter list focused on current commercial software dedicated to supporting the manual creation and editing of thesauri and taxonomies, that may have to wait until the next edition (not yet started) of my book. For now, there are the products, as of early 2010, listed under Chapter 5 on the links page of The Accidental Taxonomist book website. To this list, I would now add at least PoolParty and TopBraid Enterprise Vocabulary Net, both introduced since the book went to press. Meanwhile, taxonomy consultants remain a valuable source of advice on taxonomy/thesaurus management software.

Saturday, March 15, 2014

Indexing vs. Thesaurus Creation

The activities of back-of-the-book indexing, document/digital asset indexing, and thesaurus/taxonomy creation all require similar skills, but each has its own unique requirements. Indeed, a typical career path toward becoming an accidental taxonomist is to first work as an indexer. You might think that the two kinds of indexing are similar to each other and thesaurus creation differs more, but having done all three, I can attest that back-of-the-book indexing and thesaurus/taxonomy creation are more similar to each other than the two kinds of indexing are.

What is indexing

In my previous blog post “Tagging vs. Indexing,” I explain that indexing involves designating descriptive terms or labels for what some content is about, and that these terms are organized into a browsable index. There are two kinds of indexing:

“Closed indexing,” or back-of-the-book indexing, where the index is created based solely on concepts that the indexer identifies within the text of a single monograph. The index is created for that one monograph and then is finished ("closed").
“Open indexing”, or what has been called “database indexing,” for the indexing of articles, documents, content items, or digital assets, whereby the indexer pulls index terms from a controlled vocabulary or thesaurus and assigns them to multiple individual documents or digital assets. The set of content grows over time, and the same terms in the index will point to increasingly more documents over time. It is called “open” indexing, because the task is ongoing. The thesaurus helps ensure consistent indexing over time.

Both kinds of indexing require the skill of analyzing content to determine what concepts are important and deserve indexing. The biggest difference between back-of-the-book indexing and database indexing is that book indexing requires that the indexer additionally invent the index terms and not merely pull them from a thesaurus.

What is a thesaurus

I use the designation thesaurus here, because I mean the type of taxonomy that features the full set of relationship types between its terms, with each term designating an unambiguous concept (noun or noun phrase). The relationship types are:

Hierarchical (broader term/narrower term)
Equivalence (use/used for; “nonpreferred terms” or “synonyms”)
Associative (related terms)

To best support manual indexing, the existence of all these different kinds of relationships helps direct the indexers to the most appropriate terms to describe the content they are indexing. The same thesaurus, or parts of it, may be displayed to the end-users to help guide them to find the most appropriate terms to describe the idea about which they are searching for information. The thesaurus thus not only standardizes the language for the concepts, but also provides a guiding structure.
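As a rough illustration of the three relationship types described above, here is a minimal Python sketch. The class names, method names, and sample terms are all hypothetical and chosen only for illustration; real thesaurus management software stores far more (scope notes, term attributes, change tracking) and this is not modeled on any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Term:
    """A preferred term and its three kinds of relationships."""
    label: str
    broader: Set[str] = field(default_factory=set)    # BT
    narrower: Set[str] = field(default_factory=set)   # NT
    related: Set[str] = field(default_factory=set)    # RT
    used_for: Set[str] = field(default_factory=set)   # UF: nonpreferred terms


class Thesaurus:
    """Hypothetical in-memory thesaurus, for illustration only."""

    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}
        # Lowercased entry (preferred or nonpreferred) -> preferred label.
        self._entries: Dict[str, str] = {}

    def add_term(self, label: str) -> Term:
        self._entries.setdefault(label.lower(), label)
        return self.terms.setdefault(label, Term(label))

    def add_hierarchy(self, broader: str, narrower: str) -> None:
        """Hierarchical relationship, recorded reciprocally as BT/NT."""
        self.add_term(broader).narrower.add(narrower)
        self.add_term(narrower).broader.add(broader)

    def add_related(self, a: str, b: str) -> None:
        """Associative relationship (RT), recorded in both directions."""
        self.add_term(a).related.add(b)
        self.add_term(b).related.add(a)

    def add_nonpreferred(self, preferred: str, variant: str) -> None:
        """Equivalence relationship: variant USE preferred."""
        self.add_term(preferred).used_for.add(variant)
        self._entries[variant.lower()] = preferred

    def lookup(self, entry: str) -> Optional[Term]:
        """Resolve what an indexer or searcher typed to a preferred term."""
        preferred = self._entries.get(entry.lower())
        return self.terms[preferred] if preferred else None


thesaurus = Thesaurus()
thesaurus.add_hierarchy("Vehicles", "Automobiles")
thesaurus.add_related("Automobiles", "Traffic")
thesaurus.add_nonpreferred("Automobiles", "Cars")

term = thesaurus.lookup("cars")
print(term.label, sorted(term.broader), sorted(term.related))
```

Looking up the nonpreferred entry "cars" leads to the preferred term, from which the indexer can see its broader and related terms. That is the guiding structure in miniature.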

How they are related

Open/database indexing and thesaurus creation are obviously related, because the thesaurus is used to support this kind of indexing. In an organization which is involved in such indexing, it is not unusual for former indexers to become editors of the thesaurus, since they are already very familiar with it and understand the needs of the indexer-users.

Closed/book indexing and thesaurus creation are related, because they both involve the development of original terms and relationships between them.

Thesaurus and book index similarities and differences

Thesauri and back-of-the-book indexes both have what can be called multiple points of entry. In a book index these can be either See cross-references or “double-posts,” whereby additional variant terms or synonyms are included in the index, and they all point to the same set of page numbers. In a thesaurus, these are the equivalence relationships, where nonpreferred terms or synonyms point to the preferred terms (Use/UF). The difference is that a thesaurus distinguishes between the preferred and nonpreferred terms, whereas double-posts in a book index are all of equal standing and none is “preferred.”

Thesauri and back-of-the-book indexes both have hierarchical structure among their terms. In a thesaurus there are narrower terms to a broader term (BT/NT). In an index, there are subentries indented under a main entry. However, these hierarchies are not identical. In a thesaurus, narrower terms must be generic types, instances or integral parts of the broader term. In a book index, subentries are any aspect of the main entry or merely another concept in combination. In fact, an indexer may choose to switch the main entry and subentry (the subentry becoming a main entry and the main entry becoming its subentry) with no problems. Don’t try to do that in a thesaurus or taxonomy!

Finally, thesauri and back-of-the-book indexes both have indications of related concepts. Thesauri have the associative relationship called Related Term (RT), and book indexes have See also cross-references. While in general these function the same, the rules for thesauri are stricter. If the “related” terms are really hierarchical, then they must have the hierarchical relationship instead. In a book index, it is acceptable to have a See also between two terms where one is actually broader in meaning than the other.
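The stricter thesaurus rule for related terms lends itself to a simple automated check. The following Python sketch is only an illustration under assumed data structures (a dictionary of broader-term links and a list of RT pairs, both hypothetical); it flags any RT pair in which one term is already an ancestor of the other.

```python
def ancestors(term, broader):
    """Return every term reachable from `term` by following BT links upward."""
    seen = set()
    stack = list(broader.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, ()))
    return seen


def misplaced_related_pairs(broader, related):
    """Flag RT pairs where one term is already broader than the other."""
    flagged = []
    for a, b in related:
        if a in ancestors(b, broader) or b in ancestors(a, broader):
            flagged.append((a, b))
    return flagged


# Hypothetical sample data: term -> set of its broader terms (BT).
broader = {
    "Poodles": {"Dogs"},
    "Dogs": {"Mammals"},
}
# Hypothetical RT pairs proposed by an editor.
related = [
    ("Dogs", "Veterinary medicine"),
    ("Poodles", "Mammals"),
]

flagged = misplaced_related_pairs(broader, related)
print(flagged)
```

Here the pair of "Poodles" and "Mammals" would be flagged, because "Mammals" is reachable from "Poodles" through broader-term links. A book index, by contrast, would have no reason to run such a check.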

I will be giving a presentation on this in greater detail at the annual conference of the American Society for Indexing, on April 30, 2015, in Seattle, WA.

Friday, February 28, 2014

Tagging vs. Indexing

I have blogged before on the difference between tags and categories, but recently someone asked me about the difference between tagging and indexing (the manual kind). It's not a simple answer.

One important way in which tagging and indexing differ is that tagging involves any kind of designation about a piece of content, what it is or what it is about, whereas indexing is restricted to descriptive labels for what content is about. Tagging can include content type, date, creator, source, audience, location, rights, keywords, etc., whereas indexing is for the subjects of the content. In this sense, tagging is sort of the modern word for cataloging or the assignment of metadata.

But what if we are concerned with just the descriptive labeling of content and not other metadata? That might be called tagging or it might be called indexing. In this case, the difference is more nuanced, and to a certain extent it is historical.

When I first entered this field in the early 1990s, the notion of "tagging" was not really known. Indexing, on the other hand, was a recognized activity. There are two kinds of indexing:
1) Closed indexing or back-of-the-book indexing, where the index is created based solely on concepts found in a single monograph, and the index is created for that one monograph and is then finished ("closed").
2) Open indexing, or what was then called database indexing, whereby index terms taken from a controlled vocabulary or thesaurus are assigned to multiple individual documents or digital assets, with the content ever growing and the same index terms pointing to increasingly more documents over time.

Then, with the rise of social media, "tagging" became popular in the form of assigning keywords and names to photos or blogposts or other digital content. Initially, tagging was clearly different from indexing, because:
1) Tagging did not use a controlled vocabulary (aka thesaurus or taxonomy)
2) Tagging was done by creators and consumers of content, and not trained indexers. "Indexer" is a profession; "tagger" is not.

Indexing also differs from tagging in what results from it. If we look to the origin of the word "index", it means to indicate or to point (as with your index finger). So, the result of indexing is an "index" that the user can browse to locate referenced (if in print) or linked (if electronic) content. A thesaurus/taxonomy and an index (a structured list of the terms that had been used for indexing) could be essentially the same thing. Sometimes only a section of the index is browsable, via a type-ahead scroll-box feature, rather than the entire index. Tagging, on the other hand, lacking a controlled vocabulary, does not result in any created work, just a folksonomy, which, with its multiple terms with the same or overlapping meaning, is not suitable for browsing. If displayed, tagging terms are shown by popularity instead, such as in a tag cloud, which is interesting, but not an accurate method for content findability and retrieval.
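To make the folksonomy problem concrete, here is a small Python sketch of how free-form tags with overlapping meanings might be reconciled against a controlled vocabulary. The vocabulary, the tags, and the function names are all invented for illustration and do not describe any particular tool.

```python
from collections import Counter

# Hypothetical controlled vocabulary: preferred term -> known variants.
VOCABULARY = {
    "Automobiles": ["cars", "autos", "automobile"],
    "Photography": ["photos", "photo", "pics"],
}


def build_lookup(vocabulary):
    """Map every lowercased preferred term and variant to its preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


def reconcile(tags, vocabulary):
    """Count tags by preferred term; unmatched tags become candidate terms."""
    lookup = build_lookup(vocabulary)
    matched, candidates = Counter(), Counter()
    for tag in tags:
        key = tag.strip().lower()
        if key in lookup:
            matched[lookup[key]] += 1
        else:
            candidates[key] += 1
    return matched, candidates


# Hypothetical tags as users might have typed them.
user_tags = ["Cars", "autos", "pics", "Photo", "roadtrip", "cars ", "Roadtrip"]
matched, candidates = reconcile(user_tags, VOCABULARY)
print(dict(matched))
print(dict(candidates))
```

Tags that match a variant are counted under a single preferred term, while the unmatched ones are collected for a taxonomist to review as possible new terms.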

In time, enterprise software adopted social media methods, user interfaces, and features. As a consequence, tagging became more formalized as an employee task, and folksonomies got edited into controlled vocabularies or taxonomies, if not at least becoming sources for taxonomy terms. So, now tagging may be done with or without a controlled vocabulary, and both consumers and professional editors/content managers (if not “taggers”) do tagging.

"Tags" and "tagging" are now also designated features of content management and digital asset management software, and content editors "tag" with terms from a controlled list. As such, the distinctions between "indexing" and "tagging" have become blurred, and what this activity is called may depend on what the software vendor, the industry (publishing may prefer to call it indexing, whereas ecommerce calls it tagging), and the corporate culture prefer to call it.

The designation of “indexing”, as open index creation, is also becoming less common as the full display of indexes has become less common. Search boxes (even if what the user enters into them is matched against a thesaurus) have often replaced long alphabetized lists of subject entries and subentries. We continue to find indexes at the back of books, but online for electronic content the displayed browsable index is less common than it used to be.

Tuesday, January 28, 2014

Taxonomies vs. Thesauri

Two taxonomy consulting projects I worked on last year seemed to lend themselves more to the development of a thesaurus than a set of hierarchical taxonomies. But clients usually ask for a taxonomy and not a thesaurus. Perhaps we need to ask what is in mind with the notion of a “taxonomy.” When someone wants a “taxonomy” developed, do they want a structured kind of controlled vocabulary to support consistent indexing/tagging and retrieval (the broad meaning of taxonomy), or do they specifically want a browse display of topics in a top-down navigation structure in a user interface (the narrower meaning of taxonomy)? The broad meaning of “taxonomy” includes thesauri, too. So, if you are looking for the former, maybe it is actually a thesaurus that you want.

In its broad meaning, “taxonomy” often refers to any of various kinds of controlled vocabularies: synonym rings to support search without being displayed (which a search vendor might call a “thesaurus”), hierarchical topic trees without synonyms, faceted taxonomies, and finally the more complex taxonomies that include all of hierarchical relationships, associative relationships, and synonyms. The last of these is what may be called a thesaurus. In such a case, I would be asked for “a taxonomy with hierarchical relationships, associative relationships, and synonyms, and possibly term notes or definitions,” rather than “a thesaurus.” The word “taxonomy” has become the standard term of reference in the business, outside library applications.

The usual distinction between a strictly defined taxonomy (its narrower meaning) and a thesaurus is that a thesaurus has all the features of a taxonomy plus the addition of associative relationships. This is largely true, and I will add that a thesaurus also must have equivalence relationships (between a “preferred term” and its synonyms or nonpreferred terms), whereas synonyms/nonpreferred terms are merely optional in taxonomies, depending on the taxonomy size. Thesauri should also be built according to the standards of ANSI/NISO Z39.19 or ISO 25964, whereas taxonomies can be a little more flexible in their adherence to standards.

The extent of hierarchies

However, in my experience, I would say there is another very important distinction between a narrowly defined taxonomy and a thesaurus. A taxonomy has hierarchical relationships that bring all of the terms/concepts into one or more (but a limited number of) hierarchical tree structures or facets. (We can consider a facet as a simple two-level hierarchy comprising the facet label and its narrower facet values.) Think of a taxonomy as supporting classification, categorization, and concept organization, with a basis in the Linnaean taxonomy of animals and plants that is the most well-known meaning of “taxonomy.” The user typically enters a taxonomy from the top down.

In a thesaurus, by contrast, it is not necessary to structure all concepts (terms) into a limited number of top-level hierarchies. A thesaurus focuses on terms and their immediate relationships with other terms. Hierarchical relationships between terms may result in extended hierarchies of various degrees, whether just two terms or more, but do not extend the depth of the entire taxonomy. Thus, numerous isolated hierarchies could exist. What this means is that a top-down hierarchical display of a thesaurus would not comprise simply a few equally sized hierarchies, but rather numerous hierarchies of varied sizes and specificities. “Top terms” are not all of equal weight, importance, or generality. Therefore, while any thesaurus could be displayed hierarchically, it might not be desirable to display it that way. Instead, the user might browse the terms of the thesaurus alphabetically to select a term. A selected term will then indicate that term’s hierarchical relationships.
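The point about numerous hierarchies of varied sizes can be shown with a short Python sketch. It assumes a hypothetical thesaurus stored as a dictionary of narrower-term links, finds the top terms (those that are never anyone's narrower term), and counts how many terms sit in each hierarchy. The terms themselves are made up for illustration.

```python
def top_terms(narrower):
    """Terms that are never listed as another term's narrower term."""
    children = {child for kids in narrower.values() for child in kids}
    return sorted(term for term in narrower if term not in children)


def hierarchy_size(top, narrower):
    """Count the distinct terms in the hierarchy under `top`, including `top`."""
    seen = {top}
    stack = [top]
    while stack:
        for child in narrower.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)


# Hypothetical thesaurus fragment: term -> its narrower terms (NT).
narrower = {
    "Animals": ["Mammals", "Birds"],
    "Mammals": ["Dogs", "Cats"],
    "Dogs": ["Poodles"],
    "Indexing software": ["Thesaurus software"],
    "Copyright": [],
}

sizes = {top: hierarchy_size(top, narrower) for top in top_terms(narrower)}
print(sizes)
```

In this sample the three top terms head hierarchies of six terms, two terms, and a single term, which is why a top-down display of a thesaurus can look lopsided compared with a taxonomy designed around a few balanced trees.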

The idea of navigating without high-level hierarchies through which to drill down may seem odd, especially since hierarchy trees have become so common in website navigation. But there is no single right way to navigate. “Navigate” and “browse” are not synonymous with “drill down” through a hierarchy. Browsing could start out alphabetically and then jump from one term to the next via both hierarchical and associative relationships.

Blurred distinctions

You may have a hierarchical taxonomy with the additional thesaurus features of associative relationships, synonyms, scope notes for terms, etc., and then you can call it “a taxonomy with thesaurus features.” On the other hand, you may have a thesaurus that does in fact have an over-arching hierarchical structure, and you may call it “a thesaurus with a taxonomy structure.” Both of these kinds of “taxonomies” and “thesauri” would thus have essentially the same structure.

An organization might start calling its taxonomy a “thesaurus” if it chose to follow the terminology of its selected thesaurus software vendor. The following vendors, for example, call their products thesaurus management software and the results “thesauri”: Synaptica, Data Harmony, PoolParty, and MultiTes. Vendors have developed software that is full-featured, so not only can the software be used to create simple hierarchical taxonomies, but it also supports the full range of relationship types (hierarchical, associative, and equivalence) along with term notes, term attributes, and other maintenance tracking features. Thus, it is thesaurus management software that may be used for either thesauri or taxonomies, anything in between, and other simpler types of controlled vocabularies.

Choosing the approach

The choice between adopting a hierarchical taxonomy vs. a thesaurus depends on the nature of the content and the users.
A hierarchical taxonomy would be fine if:
- The content is of a homogeneous type that can be characterized by the same set of facets.
- The nature of the topics for the content falls neatly into a hierarchy.
- Users are not experts in the subjects and need to be guided by hierarchies.
A thesaurus would be more suitable if:
- Multiple, overlapping subject areas or domains are covered with diverse content.
- The terms need to be highly specific for detailed indexing.
- The topics do not lend themselves to neat hierarchies.
- Users are knowledgeable about the subject and will likely look for specific terms.

Monday, December 9, 2013

Taxonomy Governance

Recently I was asked to speak on a panel on taxonomy governance, so this gave me an opportunity to reflect more on the subject. "Metadata Enhancement for Improved Content Management - Taxonomies and Governance" was the title of a panel I spoke on at the Gilbane Conference 2013: Content and the Digital Experience in Boston on December 3.

When I had first heard of "governance" with respect to knowledge management and taxonomies, in 2005, it did not sound like a subject of interest to me. Perhaps I was thinking of it in terms of business process management in general, which is not my field. Over the years I have come to realize that governance is a very important part of any taxonomy, and while governance can be limited to governing the taxonomy itself, it can extend to other areas that are related to the taxonomy, such as indexing and content management. Most significantly, though, there is a synergy or dualism of taxonomies and governance: to be effective, taxonomies must be governed, yet the existence of a taxonomy itself is a form of governance. A taxonomy, after all, is a kind of controlled vocabulary, and “controlled” means governed. It's better to describe what taxonomy governance entails than to try to define it. Taxonomy governance comprises the policies, procedures, and documentation for the ongoing management and use of a taxonomy.

My main points in my brief presentation were:

Governance process begins when taxonomy development begins.
Each taxonomy is unique and has its own governance policy.
Governance includes both:
- Documented editorial policies
- Taxonomy management procedures and responsibilities
There are minimal guidelines to a taxonomy when it is started.
Decisions reached on questions as they come up in the process are documented and eventually become policy.
Taxonomy policy/guidelines include both:
- Taxonomy specifications, style and maintenance
- Taxonomy usage and indexing/tagging/categorization policy (manual or automated)

Reflecting on the different taxonomy jobs I have had and projects I have worked on, taxonomy governance has taken many forms beyond the obvious one of documenting the taxonomy editorial policies. Even though I did not hear of taxonomy governance until I had been working for years with taxonomies, I actually had been involved with governance for many years prior, just not by that name. My first job working with taxonomies (then called controlled vocabularies) was with the title of Vocabulary and Quality Management Specialist. In addition to maintaining the controlled vocabularies according to prescribed procedures, my duties included writing guidelines for the indexers using the vocabularies, especially for new topics and current events, and checking the published content for possible vocabulary-related quality issues. At my next employer, a developer of search software with built-in taxonomies, documenting how to create the taxonomies in a consistent style was simply a part of documenting how to use the software. Later, on an assignment with a consulting firm, an ongoing contract involved making regular updates to an ecommerce client's product taxonomy, following a certain procedure and workflow that was tracked in SharePoint. Finally, in more recent years as an independent taxonomy consultant, I have made sure that taxonomy editorial policies and maintenance guidelines are always a part of my project plans.

When a taxonomy project is short on time or budget, there may be a temptation to skip the governance documentation and planning. But in the long term, that will cost more. Time will be wasted by the taxonomy editors going back through old emails to try to find out what was decided when individual questions came up. Taxonomy editors will also waste time having to redo some of their work, after realizing that they were not following a consistent style or policy. Finally, and most crucially, lack of governance will likely result in an inconsistently developed taxonomy, which in turn leads to inconsistent indexing/tagging, no matter the method used. Then the main purpose of the taxonomy is defeated.

Taxonomy governance might not be as hot a topic as it was a few years ago, but that's only because it has become standard, accepted practice. Yet there is still a lot that an organization owning a taxonomy can learn about governance in the form of best practices and case studies. While organizations may not want to share their taxonomies, as intellectual property, hopefully they will share their experiences and tips on taxonomy governance.