The Accidental Taxonomist

Tuesday, May 20, 2014

Creating Taxonomies from Scratch

When I first got into taxonomy work, my impression was that the trend was increasingly toward revising, redesigning, merging, and updating existing taxonomies and less toward creating new ones. As taxonomies became more common in large organizations, it seemed obvious that there would be fewer needs for original taxonomy creation and more for taxonomy improvement. Taxonomies need to be updated when content changes, terminology changes, users change, indexing methods change, content/document management systems change, etc. Older taxonomies may also need to be repurposed, merged, or mapped. While there is no shortage of work on existing taxonomies, to my pleasant surprise I have found recently that there are many projects for new taxonomies as well.

Who needs taxonomies from scratch

In the field of taxonomy consulting, different taxonomy projects go to different consultancies. Large organizations with large taxonomy projects tend to hire taxonomy consultancies with multiple consultants to handle their projects, and it is the large organizations that by now tend to already have some taxonomies, even if they need a lot of work. Smaller organizations tend to hire independent consultant-contractors, and smaller organizations are more likely to be new to taxonomies and to need one built from scratch. When I started out consulting, I was employed by or subcontracted to consultancies that served larger clients and worked more on taxonomy redesign projects, but when I became an independent consultant I was contacted by and often served smaller clients, including startups, and thus became involved with more projects to build original taxonomies.

The types of projects that start-ups have for taxonomies are really quite interesting and they reflect a trend in innovative content-based products and services. In the past couple of years I have been contacted about creating taxonomies (some of which I did) for the following:

A subscription, web-based software with taxonomy for photographers to tag and classify their own images
A web-based market place for craftspeople and customers to meet to buy/sell customized objects
A website of quotes by famous and not-so-famous women with related content
A web database of yoga poses associated with a yoga studio
A web service of sites for artists to promote themselves
A loyalty marketing and data software platform for retailers
A mobile app that pulls content from LinkedIn to help professionals and job seekers make connections and obtain career advice

Yet it may not even be the size of the organization seeking taxonomies that has an impact on the demand for new taxonomies from scratch. It could also be that taxonomies are becoming better known across all industries, not just the fields of publishing, information services, and ecommerce. There is also no doubt that the growing amount of content in all areas necessitates better methods of organization and retrieval.

How taxonomies are built from scratch

Even taxonomists with considerable experience in editing taxonomies might not know where to begin if they were to create a taxonomy from scratch. There is some uncertainty over whether to take a predominantly top-down or bottom-up approach. I recommend a hybrid approach, with some initial top-level development, but most of the work on the specific taxonomy terms built from the bottom. If a navigational tree hierarchy is to be displayed to the users, then at least some initial top-down development is needed.

Developing the top terms (or facets, as the case may be) is based on best practices, understanding the users, adapting to any user interface constraints, and general experience as a taxonomist. Developing all the detailed terms within the taxonomy from below, however, is quite a different task that requires different taxonomist skills. Despite the fact that a spreadsheet, such as Excel, is inappropriate for managing taxonomies, I have found that even with taxonomy management software available, Excel is the most usable tool for the initial stage for gathering candidate terms along with information about their sources and/or for comparing terms side-by-side from multiple sources and at the same time putting them into a hierarchy. Finally, if a taxonomy is somewhat specialized and technical in nature and to be used by subject matter experts, it’s also possible to let the subject matter experts propose their own taxonomy and then review it with them and heavily revise it to bring it up to standards.
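The side-by-side comparison of candidate terms from multiple sources can also be sketched in a few lines of Python. This is only an illustration of the idea, not a substitute for a spreadsheet or for taxonomy management software. The source names and terms below are made up, and matching terms by lowercase spelling is an assumed simplification (real candidate terms usually need manual review for plurals, variants, and synonyms).

```python
from collections import defaultdict

# Hypothetical candidate terms gathered from three made-up sources.
candidates = {
    "search_logs": ["Portraits", "landscapes", "Weddings"],
    "existing_folders": ["Landscapes", "Portraits", "Still life"],
    "stakeholder_list": ["weddings", "Portraits", "Events"],
}


def compare_sources(candidates):
    """Return {normalized term: sorted list of sources that proposed it}."""
    table = defaultdict(set)
    for source, terms in candidates.items():
        for term in terms:
            # Assumed normalization: trim whitespace and lowercase.
            table[term.strip().lower()].add(source)
    return {term: sorted(sources) for term, sources in table.items()}


comparison = compare_sources(candidates)

# List terms proposed by the most sources first, then alphabetically.
for term, sources in sorted(comparison.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(f"{term:12} {len(sources)}  {', '.join(sources)}")
```

Terms that several sources agree on are strong candidates for the taxonomy, while terms from a single source deserve a closer look before they are placed into the hierarchy.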

I will discuss this in more detail in my presentation, “Taxonomies: Everything you Need to Know to Start a Taxonomy from Scratch,” at the SLA conference in Vancouver, BC, on June 8.

Friday, April 11, 2014

Taxonomy Software Directories

It's difficult to find a list of taxonomy management software that is both comprehensive and up to date, yet not overwhelmed with related products and services. I define taxonomy management software as a tool to manually build and edit taxonomies, controlled vocabularies, and thesauri in accordance with industry standards. It should be the primary tool used by those who work as taxonomists. Lists of “taxonomy software,” however, may include more than just tools for taxonomy management, such as auto-classification/auto-categorization/auto-indexing software, search software that utilizes taxonomies, or mind-mapping and other graphical categorization tools, etc.

Taxonomy maintenance, unfortunately, is just too small of a niche area for the major evaluators of software, whether consultancies, industry research firms, or trade publications, to find it worth their time to study. Companies that research the information technology market, such as Forrester Research, Gartner, International Data Corporation (IDC), and Real Story Group, won't get the commercial payoff from preparing studies of the taxonomy management software industry and products.

At the time I wrote my book, the most comprehensive directory of taxonomy software that I found, and to which I referred my readers, was that of the British consultant Leonard Will, on the website of his consulting business Willpower Information, which lists 38 software packages, both commercial and freeware. Leonard Will had contacted each vendor and thus provided descriptive and contact information for each tool. The fact that this was a directory of "thesaurus" software and not “taxonomy” software is not an issue, and it was probably a good thing to include only software that meets thesaurus expectations. This directory was very comprehensive, including lesser-known free and open source software, which over time tended to become unsupported or even unavailable. With an interest in posterity, Leonard Will kept the unavailable software listed in his directory merely with a note to that effect. This may have been interesting for anyone thinking of developing their own thesaurus software, as they may be able to track down these other developers. For someone looking for a good commercial solution, however, there are far too many outdated products to weed through.

After Leonard Will retired, he decided he did not want to spend the time maintaining his directory, which he last updated in 2007, and in 2011 he offered the content of his directory to someone else, specifically contacting both Margie Hlava of Access Innovations and myself. Then Margie and I had to figure out which one of us would take it, fully aware that the rich content on a website would help our own respective business websites, yet it would also take quite a bit of time and effort to set up and maintain. After a year of hoping to find time, I finally admitted that I would not and told Margie she could take it. The successor to the Willpower Thesaurus software directory, maintained by Margie’s employee Eric Ziecker, now resides at http://www.taxobank.org/content/thesauri-and-vocabulary-control-thesaurus-software

The core of TaxoBank's directory “Software for building and editing thesauri” at present is still essentially the same as the Willpower site, maintaining the original tabular content, style, colors, etc. of that site, so visitors to the TaxoBank site may recognize it from Willpower. Posterity still seems to be valued, as all but one of the same 38 software packages are still there, although in two cases there is a note saying “The particular software referenced above is no longer available.” The notes section for many packages has been updated with additional content extracted from the vendor websites. More updating is still pending, though, as operating systems listed are dated, such as “Windows 95/98/NT/2000/XP.”

The main difference from the original Willpower site is the addition of 63 other products in a new section, separated by the note “Additional indexing, taxonomy, controlled vocabulary, thesaurus, classification, mapping and ontology software and services not referenced in Leonard Will's original listing follows below.” These additional products include many products not specific to “building and editing thesauri,” such as Apache Lucene, EMC Documentum, Oracle Endeca, Google site search, HP Autonomy, IBM Infosphere, and Microsoft SharePoint, along with one taxonomy consulting service. In my opinion, it might be better to have the related products and services on a separate web page to avoid possible confusion and to keep the list to a manageable length, as the total web page is currently 145 printed pages long. Despite these issues, I praise Margie and Eric for making the effort to maintain this valuable resource.

As for a shorter list focused on current commercial software dedicated to supporting the manual creation and editing of thesauri and taxonomies, that may have to wait until the next edition (not yet started) of my book. For now, there are the products, as of early 2010, listed under Chapter 5 on the links page of The Accidental Taxonomist book website. To this list, I would now add at least PoolParty and TopBraid Enterprise Vocabulary Net, both introduced since the book went to press. Meanwhile, taxonomy consultants remain a valuable source of advice on taxonomy/thesaurus management software.

Saturday, March 15, 2014

Indexing vs. Thesaurus Creation

The activities of back-of-the-book indexing, document/digital asset indexing, and thesaurus/taxonomy creation all require similar skills, but each has its own unique requirements. Indeed, a typical career path toward becoming an accidental taxonomist is to first work as an indexer. You might think that the two kinds of indexing are similar to each other and thesaurus creation differs more, but having done all three, I can attest that back-of-the-book indexing and thesaurus/taxonomy creation are more similar to each other than the two kinds of indexing are.

What is indexing

In my previous blog post “Tagging vs. Indexing,” I explain that indexing involves designating descriptive terms or labels for what some content is about, and that these terms are organized into a browsable index. There are two kinds of indexing:

“Closed indexing,” or back-of-the-book indexing, where the index is created based solely on concepts that the indexer identifies within the text of a single monograph. The index is created for that one monograph and then is finished ("closed").
“Open indexing”, or what has been called “database indexing,” for the indexing of articles, documents, content items, or digital assets, whereby the indexer pulls index terms from a controlled vocabulary or thesaurus and assigns them to multiple individual documents or digital assets. The set of content grows over time, and the same terms in the index will point to increasingly more documents over time. It is called “open” indexing, because the task is ongoing. The thesaurus helps ensure consistent indexing over time.

Both kinds of indexing require the skill of analyzing content to determine what concepts are important and deserve indexing. The biggest difference between back-of-the-book indexing and database indexing is that book indexing requires that the indexer additionally invent the index terms and not merely pull them off of a thesaurus.

What is a thesaurus

I use the designation thesaurus here, because I mean the type of taxonomy that features the full set of relationship types between its terms, with each term designating an unambiguous concept (noun or noun phrase). The relationship types are:

Hierarchical (broader term/narrower term)
Equivalence (use/used for “nonpreferred terms” or “synonyms”)
Associative (related terms)

To best support manual indexing, the existence of all these different kinds of relationships helps direct the indexers to the most appropriate terms to describe the content they are indexing. The same thesaurus, or parts of it, may be displayed to the end-users to help guide them to find the most appropriate terms to describe the idea about which they are searching for information. The thesaurus thus not only standardizes the language for the concepts, but also provides a guiding structure.
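For readers who think in terms of data structures, the three relationship types can be sketched as follows. This is a minimal, hypothetical model in Python, not how any particular thesaurus management software stores its data. The class names, method names, and example terms are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A preferred term and its immediate relationships."""
    label: str
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT
    used_for: set = field(default_factory=set)   # UF (nonpreferred terms)


class Thesaurus:
    def __init__(self):
        self.terms = {}  # preferred label -> Term
        self.use = {}    # lowercased nonpreferred label -> preferred label

    def add_term(self, label):
        return self.terms.setdefault(label, Term(label))

    def add_hierarchy(self, broader, narrower):
        # Hierarchical relationships are reciprocal: BT one way, NT the other.
        self.add_term(broader).narrower.add(narrower)
        self.add_term(narrower).broader.add(broader)

    def add_equivalence(self, preferred, nonpreferred):
        # The nonpreferred term only points to the preferred term (USE).
        self.add_term(preferred).used_for.add(nonpreferred)
        self.use[nonpreferred.lower()] = preferred

    def add_related(self, a, b):
        # Associative relationships are symmetric.
        self.add_term(a).related.add(b)
        self.add_term(b).related.add(a)

    def lookup(self, entry):
        """Resolve what an indexer typed to a preferred Term, or None."""
        if entry in self.terms:
            return self.terms[entry]
        return self.terms.get(self.use.get(entry.lower()))


# Made-up example terms.
t = Thesaurus()
t.add_hierarchy("Vehicles", "Automobiles")
t.add_equivalence("Automobiles", "Cars")
t.add_related("Automobiles", "Roads")

term = t.lookup("cars")
print(term.label, term.broader, term.related)
```

An indexer who types the nonpreferred term "cars" is directed to the preferred term Automobiles, from which the broader and related terms can then be browsed.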

How they are related

Open/database indexing and thesaurus creation are obviously related, because the thesaurus is used to support this kind of indexing. In an organization which is involved in such indexing, it is not unusual for former indexers to become editors of the thesaurus, since they are already very familiar with it and understand the needs of the indexer-users.

Closed/book indexing and thesaurus creation are related, because they both involve the development of original terms and relationships between them.

Thesaurus and book index similarities and differences

Thesauri and back-of-the-book indexes both have what can be called multiple points of entry. In a book index these can be either See cross-references or “double-posts,” whereby additional variant terms or synonyms are included in the index, and they all point to the same set of page numbers. In a thesaurus, these are the equivalence relationships, where nonpreferred terms or synonyms point to the preferred terms (Use/UF). The difference is that a thesaurus distinguishes between the preferred and nonpreferred terms, whereas double-posts in a book index are all of equal standing and none is “preferred.”

Thesauri and back-of-the-book indexes both have hierarchical structure among their terms. In a thesaurus there are narrower terms to a broader term (BT/NT). In an index, there are subentries indented under a main entry. However, these hierarchies are not identical. In a thesaurus, narrower terms must be generic types, instances or integral parts of the broader term. In a book index, subentries are any aspect of the main entry or merely another concept in combination. In fact, an indexer may choose to switch the main entry and subentry (the subentry becoming a main entry and the main entry becoming its subentry) with no problems. Don’t try to do that in a thesaurus or taxonomy!

Finally, thesauri and back-of-the-book indexes both have indications of related concepts. Thesauri have the associative relationship called Related Term (RT), and book indexes have See also cross-references. While in general these function the same, the rules for thesauri are stricter. If the “related” terms are really hierarchical, then they must have the hierarchical relationship instead. In a book index, it is acceptable to have a See also between two terms where one is actually broader in meaning than the other.
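The stricter thesaurus rule for related terms lends itself to a simple automated check. The following Python sketch, with made-up terms and hypothetical function names, flags any Related Term pair in which one term is already an ancestor of the other through broader-term links. It assumes the hierarchy is supplied as a plain dictionary, which is a simplification of what thesaurus software does internally.

```python
def ancestors(term, broader):
    """All terms reachable by following BT links upward from term."""
    seen = set()
    stack = list(broader.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, ()))
    return seen


def misplaced_related(broader, related_pairs):
    """Return RT pairs in which one term is an ancestor of the other."""
    bad = []
    for a, b in related_pairs:
        if a in ancestors(b, broader) or b in ancestors(a, broader):
            bad.append((a, b))
    return bad


# Made-up example: term -> list of its broader terms.
broader = {"Poodles": ["Dogs"], "Dogs": ["Mammals"]}
related_pairs = [("Dogs", "Dog training"), ("Poodles", "Mammals")]

flagged = misplaced_related(broader, related_pairs)
print(flagged)
```

Here the pair Poodles/Mammals is flagged, because Mammals is already a broader term of Poodles two levels up, while Dogs/Dog training is a legitimate associative relationship. A book index would not need such a check, since a See also between a broader and a narrower concept is acceptable there.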

I will be giving a presentation on this in greater detail at the annual conference of the American Society for Indexing, on April 30, 2015, in Seattle, WA.

Friday, February 28, 2014

Tagging vs. Indexing

I have blogged before on the difference between tags and categories, but recently someone asked me about the difference between tagging and indexing (the manual kind). It's not a simple answer.

One important way in which tagging and indexing differ is that tagging involves any kind of designation about a piece of content, what it is or what it is about, whereas indexing is restricted to descriptive labels for what content is about. Tagging can include content type, date, creator, source, audience, location, rights, keywords, etc., whereas indexing is for the subjects of the content. In this sense, tagging is sort of the modern word for cataloging or the assignment of metadata.

But what if we are concerned with just the descriptive labeling of content and not other metadata? That might be called tagging or it might be called indexing. In this case, the difference is more nuanced, and to a certain extent it is historical.

When I first entered this field in the early 1990s, the notion of "tagging" was not really known. Indexing, on the other hand, was a recognized activity. There are two kinds of indexing:
1) Closed indexing or back-of-the-book indexing, where the index is created based solely on concepts found in a single monograph, and the index is created for that one monograph and is then finished ("closed").
2) Open indexing, or what was then called database indexing, whereby index terms taken from a controlled vocabulary or thesaurus are assigned to multiple individual documents or digital assets, with the content growing over time and the same index terms pointing to increasingly more documents.

Then, with the rise of social media, "tagging" became popular in the form of assigning keywords and names to photos or blogposts or other digital content. Initially, tagging was clearly different from indexing, because:
1) Tagging did not use a controlled vocabulary (aka thesaurus or taxonomy)
2) Tagging was done by creators and consumers of content, and not trained indexers. "Indexer" is a profession; "tagger" is not.

Indexing is also different from tagging by what results from it. If we look to the origin of the word "index", it means to indicate or to point (as with your index finger). So, the result of indexing is an "index" that the user can browse to locate referenced (if in print) or linked (if electronic) content. A thesaurus/taxonomy and an index (a structured list of the terms that had been used for indexing) could be essentially the same thing. Sometimes only a section of the index is browsable, via a type-ahead scroll-box feature, rather than the entire index. Tagging, on the other hand, with the lack of controlled vocabulary, does not result in any created work, just a folksonomy, which, with its multiple terms with the same or overlapping meaning, is not suitable for browsing. If displayed, tagging terms are shown by popularity instead, such as in a tag cloud, which is interesting, but not an accurate method for content findability and retrieval.

In time, enterprise software adopted social media methods, user interfaces, and features. As a consequence, tagging became more formalized as an employee task, and folksonomies got edited into controlled vocabularies or taxonomies, if not at least becoming sources for taxonomy terms. So, now tagging may be done with or without a controlled vocabulary, and both consumers and professional editors/content managers (if not “taggers”) do tagging.

"Tags" and "tagging" are now also designated features of content management and digital asset management software, and content editors "tag" with terms from a controlled list. As such, the distinctions between "indexing" and "tagging" have become blurred, and what this activity is called may depend on what the software vendor, the industry (publishing may prefer to call it indexing, whereas ecommerce calls it tagging), and the corporate culture prefer to call it.

The designation of “indexing”, as open index creation, is also becoming less common as the full display of indexes has become less common. Search boxes (even if what the user enters into it is matched against a thesaurus) have often replaced long alphabetized lists of subject entries and subentries. We continue to find indexes at the back of books, but online for electronic content the displayed browsable index is less common than it used to be.

Tuesday, January 28, 2014

Taxonomies vs. Thesauri

Two taxonomy consulting projects I worked on last year seemed to lend themselves more to the development of a thesaurus than a set of hierarchical taxonomies. But clients usually ask for a taxonomy and not a thesaurus. Perhaps we need to ask what is in mind with the notion of a “taxonomy.” When someone wants a “taxonomy” developed, do they want a structured kind of controlled vocabulary to support consistent indexing/tagging and retrieval (the broad meaning of taxonomy), or do they specifically want a browse display of topics in a top-down navigation structure in a user interface (the narrower meaning of taxonomy)? The broad meaning of “taxonomy” includes thesauri, too. So, if you are looking for the former, maybe it is actually a thesaurus that you want.

In its broad meaning, “taxonomy” often refers to any of various kinds of controlled vocabularies: synonym rings to support search without being displayed (which a search vendor might call a “thesaurus”), hierarchical topic trees without synonyms, faceted taxonomies, and finally the more complex taxonomies that include all of hierarchical relationships, associative relationships, and synonyms. The last of these is what may be called a thesaurus. In such a case, I would be asked for “a taxonomy with hierarchical relationships, associative relationships, and synonyms, and possibly term notes or definitions,” rather than “a thesaurus.” The word “taxonomy” has become the standard term of reference in the business, outside library applications.

The usual distinction between a strictly defined taxonomy (its narrower meaning) and a thesaurus is that a thesaurus has all the features of a taxonomy plus the addition of associative relationships. This is largely true, and I will add that a thesaurus also must have equivalence relationships (between a “preferred term” and its synonyms or nonpreferred terms), whereas synonyms/nonpreferred terms are merely optional in taxonomies, depending on the taxonomy size. Thesauri should also be built according to the standards of ANSI/NISO Z39.19 or ISO 25964, whereas taxonomies can be a little more flexible in their adherence to standards.

The extent of hierarchies

However, in my experience, I would say there is another very important distinction between a narrowly defined taxonomy and a thesaurus. A taxonomy has hierarchical relationships that bring all of the terms/concepts into one or more (but a limited number of) hierarchical tree structures or facets. (We can consider a facet as a simple two-level hierarchy comprising the facet label and its narrower facet values.) Think of a taxonomy as supporting classification, categorization, and concept organization, with a basis in the Linnaean taxonomy of animals and plants that is the most well-known meaning of “taxonomy.” The user typically enters a taxonomy from the top down.

In a thesaurus, by contrast, it is not necessary to structure all concepts (terms) into a limited number of top level hierarchies. A thesaurus focuses on terms and their immediate relationships with other terms. Hierarchical relationships between terms may result in extended hierarchies of various degrees, whether just two terms or more, but do not extend the depth of the entire taxonomy. Thus, numerous isolated hierarchies could exist. What this means is that a top down hierarchical display of a thesaurus would not comprise simply a few equally sized hierarchies, but rather numerous hierarchies of varied sizes and specificities. “Top terms” are not all of the same weight, importance, or generality. Therefore, while any thesaurus could be displayed hierarchically, a hierarchical display might not be desirable. Instead, the user might browse the terms of the thesaurus alphabetically to select a term. A selected term will then indicate that term’s hierarchical relationships.
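To make the point about uneven hierarchies concrete, here is a small Python sketch that finds the top terms of a thesaurus and counts how many terms fall under each. The terms and function names are hypothetical, and the hierarchy is assumed to be given as a dictionary of broader term to narrower terms. Terms with no hierarchical relationships at all do not appear in such a dictionary and would have to be listed separately.

```python
def top_terms(narrower):
    """Terms that have narrower terms but no broader term of their own."""
    children = {c for kids in narrower.values() for c in kids}
    return sorted(t for t in narrower if t not in children)


def hierarchy_size(term, narrower):
    """Count a term plus all of its descendants."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(narrower.get(current, ()))
    return len(seen)


# Made-up example: broader term -> list of narrower terms.
narrower = {
    "Musical instruments": ["String instruments", "Percussion"],
    "String instruments": ["Violins", "Guitars"],
    "Copyright": ["Fair use"],
}

sizes = {t: hierarchy_size(t, narrower) for t in top_terms(narrower)}
print(sizes)
```

In this tiny example one top term heads a five-term hierarchy and the other heads a hierarchy of just two terms, which is typical of a thesaurus and is one reason an alphabetical browse can serve users better than a top-down tree.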

The idea of navigating without high-level hierarchies through which to drill down may seem odd, especially since hierarchy trees have become so common in website navigation. But there is no single right way to navigate. “Navigate” and “browse” are not synonymous with “drill down” through a hierarchy. Browsing could start out alphabetically and then jump from one term to the next via both hierarchical and associative relationships.

Blurred distinctions

You may have a hierarchical taxonomy with the additional thesaurus features of associative relationships, synonyms, scope notes for terms, etc., and then you can call it “a taxonomy with thesaurus features.” On the other hand, you may have a thesaurus that does in fact have an over-arching hierarchical structure, and you may call it “a thesaurus with a taxonomy structure.” Both of these kinds of “taxonomies” and “thesauri” would thus have essentially the same structure.

An organization might start calling its taxonomy a “thesaurus” if it chose to follow the terminology of its selected thesaurus software vendor. The following vendors, for example, call their products thesaurus management software and call the results “thesauri”: Synaptica, Data Harmony, PoolParty, and MultiTes. Vendors have developed software that is full-featured, so not only can the software be used to create simple hierarchical taxonomies, but it also supports the full range of relationship types (hierarchical, associative, and equivalence) along with term notes, term attributes, and other maintenance tracking features. Thus, it is thesaurus management software that may be used for thesauri, taxonomies, anything in between, and other simpler types of controlled vocabularies.

Choosing the approach

The choice between adopting a hierarchical taxonomy vs. a thesaurus depends on the nature of the content and the users.
A hierarchical taxonomy would be fine if:
- The content is of a homogeneous type that can be characterized by the same set of facets.
- The nature of the topics for the content falls neatly into a hierarchy.
- Users are not experts in the subjects and need to be guided by hierarchies.
A thesaurus would be more suitable if:
- Multiple, overlapping subject areas or domains are covered with diverse content.
- The terms need to be highly specific for detailed indexing.
- The topics do not lend themselves to neat hierarchies.
- Users are knowledgeable of the subject and will likely look for specific terms.