The Accidental Taxonomist

Tuesday, January 28, 2014

Taxonomies vs. Thesauri

Two taxonomy consulting projects I worked on last year seemed to lend themselves more to the development of a thesaurus than a set of hierarchical taxonomies. But clients usually ask for a taxonomy and not a thesaurus. Perhaps we need to ask what is in mind with the notion of a “taxonomy.” When someone wants a “taxonomy” developed, do they want a structured kind of controlled vocabulary to support consistent indexing/tagging and retrieval (the broad meaning of taxonomy), or do they specifically want a browse display of topics in a top-down navigation structure in a user interface (the narrower meaning of taxonomy)? The broad meaning of “taxonomy” includes thesauri, too. So, if you are looking for the former, maybe it is actually a thesaurus that you want.

In its broad meaning, “taxonomy” often refers to any of various kinds of controlled vocabularies: synonym rings to support search without being displayed (which a search vendor might call a “thesaurus”), hierarchical topic trees without synonyms, faceted taxonomies, and finally the more complex taxonomies that include all of hierarchical relationships, associative relationships, and synonyms. The last of these is what may be called a thesaurus. In such a case, I would be asked for “a taxonomy with hierarchical relationships, associative relationships, and synonyms, and possibly term notes or definitions,” rather than “a thesaurus.” The word “taxonomy” has become the standard term of reference in the business, outside library applications.

The usual distinction between a strictly defined taxonomy (its narrower meaning) and a thesaurus is that a thesaurus has all the features of a taxonomy plus associative relationships. This is largely true, and I will add that a thesaurus also must have equivalence relationships (between a “preferred term” and its synonyms or nonpreferred terms), whereas synonyms/nonpreferred terms are merely optional in taxonomies, depending on the taxonomy size. Thesauri should also be built according to the standards of ANSI/NISO Z39.19 or ISO 25964, whereas taxonomies can be a little more flexible in their adherence to standards.
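As a rough illustration of these three relationship types, here is a minimal Python sketch. The class name, field names, sample terms, and the check itself are all assumptions made for illustration; they are not taken from either standard or from any vendor's software.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term and its immediate relationships (hypothetical model)."""
    label: str
    broader: set = field(default_factory=set)       # hierarchical (broader terms)
    narrower: set = field(default_factory=set)      # hierarchical (narrower terms)
    related: set = field(default_factory=set)       # associative (related terms)
    nonpreferred: set = field(default_factory=set)  # equivalence (synonyms)


def looks_like_thesaurus(terms):
    """Heuristic only: true if both associative and equivalence relationships are used."""
    has_related = any(t.related for t in terms)
    has_equivalence = any(t.nonpreferred for t in terms)
    return has_related and has_equivalence


# A strictly hierarchical taxonomy: broader/narrower relationships only.
taxonomy_only = [
    Term("Vehicles", narrower={"Cars"}),
    Term("Cars", broader={"Vehicles"}),
]

# The same hierarchy with thesaurus features added.
with_thesaurus_features = [
    Term("Vehicles", narrower={"Cars"}),
    Term("Cars", broader={"Vehicles"}, related={"Driving"}, nonpreferred={"Automobiles"}),
    Term("Driving", related={"Cars"}),
]
```

The point of the sketch is only that the difference is structural: the same terms become thesaurus-like once related terms and nonpreferred terms are recorded alongside the hierarchy.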

The extent of hierarchies

However, in my experience, I would say there is another very important distinction between a narrowly defined taxonomy and a thesaurus. A taxonomy has hierarchical relationships that bring all of the terms/concepts into one or more (but a limited number of) hierarchical tree structures or facets. (We can consider a facet as a simple two-level hierarchy comprising the facet label and its narrower facet values.) Think of a taxonomy as supporting classification, categorization, and concept organization, with a basis in the Linnaean taxonomy of animals and plants, which is the best-known meaning of “taxonomy.” The user typically enters a taxonomy from the top down.

In a thesaurus, by contrast, it is not necessary to structure all concepts (terms) into a limited number of top-level hierarchies. A thesaurus focuses on terms and their immediate relationships with other terms. Hierarchical relationships between terms may result in extended hierarchies of various degrees, whether just two terms or more, but these do not necessarily extend through the entire vocabulary. Thus, numerous isolated hierarchies could exist. What this means is that a top-down hierarchical display of a thesaurus would not comprise simply a few equally sized hierarchies, but rather numerous hierarchies of varied sizes and specificities. “Top terms” are not all of equal weight, importance, or generality. Therefore, while any thesaurus could be displayed hierarchically, a hierarchical display might not be desirable. Instead, the user might browse the terms of the thesaurus alphabetically to select a term. A selected term will then indicate that term’s hierarchical relationships.
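To make the idea of numerous, unevenly sized hierarchies concrete, here is a small Python sketch. The sample terms and the `broader` map are invented for illustration; a real thesaurus would hold this information in thesaurus management software.

```python
# Hypothetical broader-term map: term -> set of its broader terms.
broader = {
    "Animals": set(),
    "Mammals": {"Animals"},
    "Dogs": {"Mammals"},
    "Cats": {"Mammals"},
    "Veterinary medicine": set(),
    "Animal vaccines": {"Veterinary medicine"},
    "Pet adoption": set(),
}


def top_terms(broader):
    """Terms with no broader term; in a thesaurus there may be many of these."""
    return sorted(t for t, parents in broader.items() if not parents)


def hierarchy_size(top, broader):
    """Count the top term plus every term that sits anywhere beneath it."""
    members = {top}
    changed = True
    while changed:
        changed = False
        for term, parents in broader.items():
            if term not in members and parents & members:
                members.add(term)
                changed = True
    return len(members)


sizes = {t: hierarchy_size(t, broader) for t in top_terms(broader)}
print(sizes)
```

In this made-up vocabulary the three top terms head hierarchies of four, one, and two terms, which is why a top-down display of a thesaurus can look lopsided compared with a taxonomy designed around a few balanced trees.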

The idea of navigating without high-level hierarchies through which to drill down may seem odd, especially since hierarchy trees have become so common in website navigation. But there is no single right way to navigate. “Navigate” and “browse” are not synonymous with “drill down” through a hierarchy. Browsing could start out alphabetically and then jump from one term to the next via both hierarchical and associative relationships.

Blurred distinctions

You may have a hierarchical taxonomy with the additional thesaurus features of associative relationships, synonyms, scope notes for terms, etc., and then you can call it “a taxonomy with thesaurus features.” On the other hand, you may have a thesaurus that does in fact have an over-arching hierarchical structure, and you may call it “a thesaurus with a taxonomy structure.” Both of these kinds of “taxonomies” and “thesauri” would thus have essentially the same structure.

An organization might start calling its taxonomy a “thesaurus” if it chose to follow the terminology of its selected thesaurus software vendor. The following vendors, for example, call their products thesaurus management software and the results “thesauri”: Synaptica, Data Harmony, PoolParty, and MultiTes. Vendors have developed software that is full-featured, so not only can the software be used to create simple hierarchical taxonomies, but it also supports the full range of relationship types (hierarchical, associative, and equivalence) along with term notes, term attributes, and other maintenance tracking features. Thus, it is thesaurus management software that may be used for thesauri, taxonomies, anything in between, and other simpler types of controlled vocabularies.

Choosing the approach

The choice between adopting a hierarchical taxonomy and a thesaurus depends on the nature of the content and the users.
A hierarchical taxonomy would be fine if:
- The content is of a homogeneous type that can be characterized by the same set of facets.
- The nature of the topics for the content falls neatly into a hierarchy.
- Users are not experts in the subjects and need to be guided by hierarchies.
A thesaurus would be more suitable if:
- Multiple, overlapping subject areas or domains are covered with diverse content.
- The terms need to be highly specific for detailed indexing.
- The topics do not lend themselves to neat hierarchies.
- Users are knowledgeable about the subject and will likely look for specific terms.

Monday, December 9, 2013

Taxonomy Governance

Recently I was asked to speak on a panel on taxonomy governance, so this gave me an opportunity to reflect more on the subject. "Metadata Enhancement for Improved Content Management - Taxonomies and Governance" was the title of a panel I spoke on at the Gilbane Conference 2013: Content and the Digital Experience in Boston on December 3.

When I first heard of "governance" with respect to knowledge management and taxonomies, in 2005, it did not sound like a subject of interest to me. Perhaps I was thinking of it in terms of business process management in general, which is not my field. Over the years I have come to realize that governance is a very important part of any taxonomy, and while governance can be limited to governing the taxonomy itself, it can extend to other areas that are related to the taxonomy, such as indexing and content management. Most significantly, though, there is a synergy or dualism of taxonomies and governance: to be effective, taxonomies must be governed, yet the existence of a taxonomy itself is a form of governance. A taxonomy, after all, is a kind of controlled vocabulary, and “controlled” means governed. It's better to describe what taxonomy governance entails than to try to define it. Taxonomy governance comprises the policies, procedures, and documentation for the ongoing management and use of a taxonomy.

My main points in my brief presentation were:

Governance process begins when taxonomy development begins.
Each taxonomy is unique and has its own governance policy.
Governance includes both:
- Documented editorial policies
- Taxonomy management procedures and responsibilities
There are minimal guidelines for a taxonomy when it is started.
Decisions reached on questions as they come up in the process are documented and eventually become policy.
Taxonomy policy/guidelines includes both:
- Taxonomy specifications, style and maintenance
- Taxonomy usage and indexing/tagging/categorization policy (manual or automated)

Reflecting on the different taxonomy jobs I have had and projects I have worked on, taxonomy governance has taken many forms beyond the obvious one of documenting the taxonomy editorial policies. Even though I did not hear of taxonomy governance until I had been working for years with taxonomies, I actually had been involved with governance for many years prior, just not by that name. My first job working with taxonomies (then called controlled vocabularies) was with the title of Vocabulary and Quality Management Specialist. In addition to maintaining the controlled vocabularies according to prescribed procedures, my duties included writing guidelines for the indexers using the vocabularies, especially for new topics and current events, and checking the published content for possible vocabulary-related quality issues. At my next employer, a developer of search software with built-in taxonomies, documenting how to create the taxonomies in a consistent style was simply a part of documenting how to use the software. Later, on an assignment with a consulting firm, an ongoing contract involved making regular updates to an ecommerce client's product taxonomy, following a certain procedure and workflow that was tracked in SharePoint. Finally, in more recent years as an independent taxonomy consultant, I have made sure that taxonomy editorial policies and maintenance guidelines are always a part of my project plans.

When a taxonomy project is short on time or budget, there may be a temptation to skip the governance documentation and planning. But in the long term, that will cost more. Time will be wasted by the taxonomy editors going back through old emails to try to find out what was decided when individual questions came up. Taxonomy editors will also waste time having to redo some of their work, after realizing that they were not following a consistent style or policy. Finally, and most crucially, lack of governance will likely result in an inconsistently developed taxonomy, which in turn leads to inconsistent indexing/tagging, no matter the method used. Then the main purpose of the taxonomy is defeated.

Taxonomy governance might not be as hot a topic as it was a few years ago, but that's only because it has become standard, accepted practice. Yet there is still a lot that an organization owning a taxonomy can learn about governance in the form of best practices and case studies. While organizations may not want to share their taxonomies, as intellectual property, hopefully they will share their experiences and tips on taxonomy governance.

Saturday, November 9, 2013

Information Architecture and Taxonomies

While interest in “information architecture” by that name has declined in the past decade, interest in what information architecture involves continues to be strong, and perhaps there is some merging of the fields of taxonomy and information architecture.

At one point in my career I wanted to be an information architect, to organize the pages and menus of websites and intranets. The discipline’s leading professional association, the IA Institute, additionally describes the field as “The structural design of shared information environments.” But within a couple of years, I found that interest in my information architecture skills, at least for small websites (“little IA”), was getting squeezed out in favor of skills in either graphic design or technical web development. Over time it also seemed as if information architecture was being replaced by the growing field of user experience design (UXD). Indeed, Google search trends show a definite decline in interest in the phrase “information architecture” during the same period of steady growth in interest in “user experience.”

I was therefore pleasantly surprised to find that information architecture was one of the themes at this year’s Taxonomy Boot Camp (Washington, DC, November 5-6, 2013), the leading conference dedicated to taxonomies.

Information architecture was a central part of the keynote “Taxonomy Is Power: Bringing It All Together,” presented by Bob Boiko. He started off explaining that information systems are a triad of people, information, and technology. But he, too, had observed that information architecture (IA) has often been “captured” by user experience (UX), moving away from technology toward the user, while the “information” piece of the triad sometimes gets lost along the way and needs more attention. Bob defined information architecture as “the art and science of designing information structures” and said that information architects live in the space between art (design) and science (technology). Information architecture is also about naming things, and taxonomies can help engineers and designers name things for both the front end and back end of an information system. Bob said that taxonomists should look at and “own” the concept of information architecture.

The conference also featured a session of three presentations under the heading “User Experience (UX) in Taxonomy Design.” Michael Rudy, of the consultancy Factor, spoke on the benefits of integrating user experience with information management, and Bram Wessel, also of Factor, presented on how different methods of user research, common in user experience design, such as card sorting, tree testing, personas, and prototyping, are also applicable to taxonomies. Taking a different angle to the issue, Ben Licciardi of PPC presented methods of designing the manual indexing/tagging interface for taxonomy use.

There are various perspectives and approaches to this field, whether stressing structure as in “architecture,” naming, as in “taxonomy,” or meaning, as in “semantics.” Different labels may resonate better with different audiences. The week of the conference I was also indexing a book on user experience design (a small project to do on the plane and to broaden my knowledge of the subject). While “taxonomy” was not mentioned in this light book, “semantic design” was the name of a section which mentioned information architecture, organizing information, and metadata.

Several years ago, perhaps 2007, when I introduced myself as a taxonomist to someone at a professional conference, I was asked what the difference was between taxonomists and information architects. My answer then is the same as it is now: there is definitely a significant area of overlap between the skills, tasks, and responsibilities in both professions, although there are some areas that concern information architects and not most taxonomists, and there are areas that concern taxonomists and not most information architects. So, it may only depend on what kind of information architect or kind of taxonomist you are. I hope one day to also attend the main information architecture conference, the IA Summit, and continue this discussion, as interest in taxonomies remains strong.

Sunday, October 6, 2013

Taxonomies and Text Analytics Compared

Last week (September 30 – October 1) I attended the Text Analytics World conference in Boston as an invited speaker. This is the second year I was fortunate to present at and attend this conference, which also meets in San Francisco in the spring. I posted a blog about the conference last fall, “Text Analytics and Taxonomies,” discussing the strong connections between taxonomies and text analytics in serving similar data/information retrieval goals. That connection between the two was again apparent at this year’s conference, with many speakers mentioning taxonomies, and I came away with additional analogies, beyond their shared purpose.

Problematic definition

Neither taxonomies nor text analytics is well defined, and each can have both a narrow definition and a broad definition. For taxonomies, the narrower meaning is a hierarchical tree of concepts arranged with broader and narrower relationships. The broad meaning of taxonomy is any controlled vocabulary, whether hierarchies, facets, thesauri, authority files, or simple term lists to fill metadata fields. For text analytics, the narrower meaning is “text mining,” the process of deriving high-quality information contained in natural language text. But the conference chair, Tom Reamy of the KAPS Group, explained that the conference takes a broader definition of text analytics to include not only text mining but also auto-categorization, sentiment analysis, predictive analytics, entity extraction, and machine learning.

There is also the issue of whether the name is appropriate. Some people don’t like the name taxonomies, and try to avoid it. Similarly, there are issues with the designation of “text analytics.” Discussion in the conference’s expert sessions and closing session brought up the issue that perhaps a better name is needed for the field. Both “text” and “analytics” have issues, as they both have assumed narrower meanings. Text analytics comes out of the field of knowledge management, but that field is too broad. A more accurate label that Tom Reamy suggested was “unified data insights,” but it will stay text analytics for now.

Technology and human effort

Both taxonomies and text analytics rely on technology/software, but neither is a 100% automated solution, nor can the software products be used as out-of-the-box solutions without significant trained and skilled usage. If we consider the software as “tools” rather than “solutions,” we have a more realistic understanding of what the software can do. The process of building a taxonomy is aided by taxonomy or thesaurus management software, which is a kind of tool that an experienced taxonomist uses to manage the terms, relationships, synonyms, notes/definitions, and other term attributes. Similarly, text analytics software, and auto-classification software in particular, requires expertise to leverage the tool for desired results. This was the theme of a presentation on selecting text analytics tools by Janine Johnson of Verisk Analytics (who also used “tool” in her presentation title).

As I explained in my presentation, “Taxonomies for Auto-Tagging Unstructured Content,” both of the leading methods of auto-categorization, rules-based and machine-learning (statistical) methods, require considerable human input. In rules-based auto-categorization, experts need to write or edit rules for each taxonomy concept that leverage combinations of synonyms and proximity or other Boolean operators; and in machine-learning auto-categorization, experts need to identify and essentially pre-index a large set of sample documents for each taxonomy term, for the system to learn from the human-indexed examples.
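To give a sense of the human input that rules-based auto-categorization involves, here is a toy Python sketch. The rule format, the concept names, and the `categorize` function are all hypothetical; commercial auto-classification tools have their own rule syntaxes, and this sketch omits proximity operators and stemming entirely.

```python
import re

# Hypothetical rules: each taxonomy concept lists synonyms (any one may match)
# and optional words that must also appear (a Boolean AND condition).
RULES = {
    "Mergers and Acquisitions": {"any": ["merger", "acquisition", "takeover"], "all": []},
    "Interest Rates": {"any": ["interest rate", "rate hike"], "all": ["bank"]},
}


def normalize(text):
    """Lowercase the text and reduce it to single-spaced words, padded with spaces."""
    return " " + " ".join(re.findall(r"[a-z]+", text.lower())) + " "


def categorize(text, rules=RULES):
    """Return the concepts whose rule matches the text (exact word or phrase match only)."""
    normalized = normalize(text)
    matched = []
    for concept, rule in rules.items():
        any_hit = any(f" {phrase} " in normalized for phrase in rule["any"])
        all_hit = all(f" {word} " in normalized for word in rule["all"])
        if any_hit and all_hit:
            matched.append(concept)
    return matched


print(categorize("The central bank announced a rate hike."))
```

Even in this simplified form, someone has to decide which synonyms and which required words belong to each concept, and that decision has to be made and maintained for every concept in the taxonomy.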

Multidisciplinary background

Both taxonomies and text analytics are seen as fields of expertise, methods of knowledge management, and at least parts of a solution to an organization’s information management problem. However, they are not academic disciplines or majors. Rather, the educational backgrounds and skills of people who work in the fields of both taxonomies and text analytics are somewhat varied and multidisciplinary.

In taxonomies, library/information science is the most dominant background, but probably does not account for any more than half of practicing taxonomists. Information architecture/user experience design, database design, knowledge management, editorial, and subject matter (health, law, science, business, etc.) expertise are also common backgrounds.

In text analytics, computer science is the most common background. A show of hands of the conference participants indicated that the majority had computer science or engineering backgrounds. But linguistics is also important (although the small minority at this conference were more hesitant to reveal themselves). The keynote speaker, Dr. James Pennebaker, was a psychologist and explained why psychology is also important to text analytics. Participants in the closing expert panel answered my question on educational background with a similar answer of a combination of computer science/programming, linguistics, and cognitive sciences.

In addition to the interdisciplinary background of taxonomists and text analytics professionals, the applications of taxonomies and text analytics also span all disciplines and industries. Conference case studies included applications of text analytics in education, pharmaceuticals, healthcare, publishing, telecommunications, and federal agencies.

Tuesday, September 17, 2013

Taxonomy Terms with “And”

In considering best practices for developing taxonomy term labels or names, there is the question about the use of the word “and” within taxonomy terms. My previous two blog posts were called “Tags and Categories” and “Card Sorting and Taxonomies,” which demonstrate how common it is to have the word “and” in titles, headings, or other labels. By extension, does it work in taxonomy terms?

The standards for taxonomies, ANSI/NISO Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies and ISO 25964-1 Thesauri and Interoperability with Other Vocabularies, make no mention of terms with the word “and.” While it is not explicitly prohibited, neither is it mentioned as an acceptable form among the rather exhaustive list of term format types. Even the section on compound terms makes no mention of terms with the word “and.” So, one might conclude that terms should not have the word “and” within them. Yet it is not uncommon, especially in larger, more specialized taxonomies and thesauri.

The simple little word “and” can actually have two different meanings:

1) the intersection of two concepts, to include only that which belongs to both, which is the Boolean operator AND

2) the combination or union of two concepts, to include any of either, which is actually the Boolean operator OR.

When it comes to taxonomy terms, the word “and” could have either of the above two usages, and it’s very important to know which it is in which case.
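The two meanings can be shown with simple set operations. In this small Python sketch the document IDs are made up, and treating each concept as the set of documents tagged with it is an assumption for illustration.

```python
# Hypothetical document IDs tagged with each single concept.
children = {1, 2, 3}
television = {2, 3, 4, 5}

# "and" as Boolean AND: only documents about both concepts (intersection).
children_and_television = children & television

# "and" as Boolean OR: documents about either concept (union).
children_or_television = children | television

print(sorted(children_and_television))
print(sorted(children_or_television))
```

The same word “and” in a term label therefore implies either a smaller or a larger body of content than the component concepts, depending on which meaning is intended.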

“And” meaning AND

My blog post title “Card Sorting and Taxonomies” involves the first meaning, the intersection of both concepts, which in this case is the use and suitability of card sorting specifically for taxonomies. “Card Sorting and Taxonomies” is more concise than saying “the suitability of card sorting for taxonomies,” and taxonomy terms need to be concise. Examples of the use of “and” in this (Boolean AND) meaning in taxonomy terms that I have run across include:

Children and Television

Gender and Poverty

The choice of using “and” is significant. It means any intersection/relation of these two concepts. “Children and Television” comprises all of the following: children’s television shows, the impact of television (not just children’s programming) on children, the depiction of children in television, etc. Similarly, “Gender and Poverty” covers various issues, such as data on poverty rates by gender, how poverty affects the genders differently, and reasons why more women are poor in developing countries.

It is easy to identify this meaning of the word “and” when the two concepts linked by the conjunction are quite distinct. In many taxonomies, the preferred policy is to avoid creating such terms, lest the taxonomy become too large and complex.

“And” meaning OR

My blog post title “Tags and Categories” involves the second meaning, the combination of both concepts. I described what tags were and what categories were and compared them. Examples of the use of “and” in this (Boolean OR) meaning in taxonomy terms that I have run across include:

Measurement and Analysis

Laws and Regulations

Roads and Highways

Maintenance and Repair

An additional example is the title of the online course I teach: “Taxonomies and Controlled Vocabularies.”

The main reason to create such terms is that, while some content deals with one or the other of the two linked words, a significant amount of content really has to do with both, and users probably don’t care to make the distinction either, so it’s better to have just a single concept in the taxonomy. But one word is not equivalent to the other, so a taxonomy term cannot be created from just one word and the other designated as its nonpreferred term/synonym. Another situation for these types of taxonomy terms is a small browsable taxonomy that does not utilize/support synonyms. An additional reason to create them is that they can boost SEO (search engine optimization) in website labels by giving more words prominence. Finally, the combined terms can also appease competing stakeholders who both want their preferred label as part of the term name.

The difference in a taxonomy

If you have taxonomy terms with the word “and” in them, it needs to be clear which of these two Boolean meanings it is, not only to ensure accurate content tagging, but also to ensure the proper relationship of the term to other terms in the taxonomy. Recently I was reviewing a taxonomy with the term “Investment and Trade,” and by itself, I could not determine whether it meant the intersection or combination of these two words, so I did not know how it should be related to the terms “Investment” and “Trade.”

A term with the Boolean AND is a narrower term to terms of both its component parts, what is known as polyhierarchy. “Children and Television” is narrower to both “Children” and to “Television.” When there occurs a term with Boolean OR, such as “Measurement and Analysis,” it is expected that the component words do not exist as preferred terms in the taxonomy. Rather, each word “Measurement” and “Analysis” could be nonpreferred terms/synonyms for “Measurement and Analysis.”
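As a sketch of how the two cases differ structurally, the following Python fragment records a Boolean AND term with two broader terms and a Boolean OR term with its component words as nonpreferred terms. The dictionaries and function names are hypothetical and are not drawn from any particular taxonomy tool.

```python
# Hypothetical vocabulary fragment.
broader = {
    # Boolean AND: narrower to both component concepts (polyhierarchy).
    "Children and Television": {"Children", "Television"},
    # Boolean OR: a single combined concept with no component parents.
    "Measurement and Analysis": set(),
}

# Boolean OR: component words point to the combined preferred term.
nonpreferred = {
    "Measurement": "Measurement and Analysis",
    "Analysis": "Measurement and Analysis",
}


def resolve(entry):
    """Map a user's entry term to its preferred term, if one is designated."""
    return nonpreferred.get(entry, entry)


def conflicting_terms(preferred_terms):
    """List nonpreferred entries that also, wrongly, exist as preferred terms."""
    return sorted(set(nonpreferred) & set(preferred_terms))


print(resolve("Analysis"))
```

A check like `conflicting_terms` is one simple way to catch the ambiguity described above, where a combined term and one of its component words both appear as preferred terms.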