Sunday, January 28, 2018

Best Practices for Different Taxonomies

A question was recently posted to a group: “I'm wondering if anyone knows of a standard for designing taxonomies for industrial components (widgets).” So far, no one has replied.

To clarify, taxonomies for different subject areas and different content don’t have different standards. Standards, whether for interoperability, such as SKOS, or for structural design, such as ANSI/NISO Z39.19 or ISO 25964, are the same and just as relevant for taxonomies and thesauri in all subject areas. Taxonomies for different subject areas and content may have different design best practices, though. The published standards don’t spell out everything; there is room for design and style differences for different taxonomies, including those that differ in their subject domain and content.

Areas of taxonomy design best practices that may differ include:
  • Degree of term specificity or granularity
  • Depth of hierarchical levels
  • Number of terms at the same level (i.e. the number of narrower terms a term has)
  • Length of terms
  • Use of parenthetical modifiers and other term label fields
  • Additional attribute details for terms (notes or controlled value fields)
There are also issues of relationships between terms (whether a term may have more than one broader term, and whether there should be associative/related-term relationships) and how extensive alternative labels/synonyms should be. Best practices for these issues, however, depend more upon the implementation and user interface for the taxonomy than on the subject area of the taxonomy.

In the case of an industrial component taxonomy, best practices for the aforementioned points would likely be the following:
  • There should be a relatively high level of specificity of terms, so as to include all components.
  • The depth of the hierarchy should accurately reflect standard component categories and subcategories, so it could be deeper than in other, business taxonomies. Also, the levels of depth may vary in different parts of the taxonomy.
  • The number of terms at the same level should also accurately reflect standard component categories and subcategories, so there could be a large number of terms at the same level.
  • Terms should be as long as needed to be complete and unambiguous, but any component number should be managed in a separate field.
  • It may be desired to use some additional numeric or alphanumeric classification system. If so, the classification code would be another field or component of the term, separate from the term name, for the purpose of sorting.
  • Additional attribute details for each term would be desired and expected. These may include a component number, size, price, and other specifications. (Attribute fields may or may not be searchable. They are not for filtering, though, as facets are.)
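
As a rough illustration of keeping the term name, classification code, component number, and attributes in separate fields, here is a minimal Python sketch. The class name, field names, codes, and sample values are all hypothetical, not taken from any particular taxonomy tool or standard.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentTerm:
    """One taxonomy term for an industrial component (hypothetical structure)."""
    label: str                     # complete, unambiguous term name
    classification_code: str = ""  # optional code kept apart from the label, used for sorting
    component_number: str = ""     # managed in its own field, not in the label
    attributes: dict = field(default_factory=dict)  # size, price, other specifications


# Made-up sample terms for illustration only.
terms = [
    ComponentTerm("Hex head cap screws", "20.10", "HX-1001", {"size": "M8 x 40 mm"}),
    ComponentTerm("Deep groove ball bearings", "10.20", "BB-2040", {"bore": "20 mm"}),
    ComponentTerm("Flat washers", "20.30", "FW-0310", {"size": "M8"}),
]

# Sort by classification code rather than alphabetically by label.
by_code = sorted(terms, key=lambda t: t.classification_code)
sorted_labels = [t.label for t in by_code]
print(sorted_labels)
```

Because the code lives in its own field, the display order can follow the classification scheme while the term names stay clean for searching and browsing.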
In contrast, a consumer products ecommerce taxonomy would follow different best practices:
  • Terms should not be too specific, not more specific than what users would be familiar with. Specificity should reflect the number of units (SKUs) covered by the term category. A term that refers to only 1-5 products is probably too specific. If there are additional refinement filters, then a category term may be broad enough to include 10-50 items.
  • Hierarchy should not be too deep, probably no more than 3 levels.
  • Terms per level should be limited, such as 3-12 terms per hierarchy level
  • Term names should be concise, for easy browsing, yet unambiguous, usually 1-3 words
  • Terms should probably not have any other fields/components or parenthetical qualifiers
  • Attribute details would include at minimum product number/SKU and description. Price would be managed as a separate filter, rather than as merely an attribute.
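
Guidelines like these can be checked automatically once they are written down. The following is a minimal Python sketch, assuming the taxonomy is held as a nested dictionary of labels. The thresholds come from the list above, while the function name and sample data are hypothetical.

```python
# Thresholds taken from the consumer ecommerce guidelines above.
MAX_DEPTH = 3
MIN_TERMS, MAX_TERMS = 3, 12
MAX_WORDS = 3


def check_taxonomy(tree, level=1):
    """Return warnings for a nested dict of the form {label: {child_label: {...}}}."""
    warnings = []
    # Check the number of sibling terms, skipping empty child sets (leaf terms).
    if tree and not (MIN_TERMS <= len(tree) <= MAX_TERMS):
        warnings.append(
            f"level {level}: {len(tree)} sibling terms (expected {MIN_TERMS}-{MAX_TERMS})"
        )
    for label, children in tree.items():
        if level > MAX_DEPTH:
            warnings.append(f"'{label}' is at level {level} (deeper than {MAX_DEPTH})")
        if len(label.split()) > MAX_WORDS:
            warnings.append(f"'{label}' has more than {MAX_WORDS} words")
        warnings.extend(check_taxonomy(children, level + 1))
    return warnings


# Made-up sample taxonomy for illustration only.
sample = {
    "Kitchen": {
        "Cookware": {},
        "Small appliances": {},
        "Stainless steel mixing bowl sets": {},
    },
    "Bedding": {},
    "Lighting": {},
}

warnings = check_taxonomy(sample)
for message in warnings:
    print(message)
```

A check like this only flags candidates for review. Whether a flagged term is actually a problem remains a judgment call for the taxonomist.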
These best practices are not “standards” because they tend not to be shared outside of an organization. Each organization comes up with its own policies and guidelines, just as it has its own taxonomies. The best practices could be considered internal standards, though. Regardless of what they are called, these guidelines should be documented and overseen as part of a taxonomy governance plan.

Sunday, December 31, 2017

Engaging Others in Taxonomy Building

Whether you are building a new taxonomy from scratch or redesigning one based on an existing taxonomy, it’s important to engage other people in the process. There are two primary reasons:
  1. getting input from those who will use the taxonomy, so that it will better suit their needs
  2. getting buy-in and support from various stakeholders and users, so that it will continually be used, maintained, and funded
Even if you are the only person given the responsibility for creating and editing the taxonomy, and it’s no one else’s job to be involved, you should still seek the input of others.

It’s just common sense to get input from those who will use the taxonomy, while being respectful of their time in the process. If the taxonomy project can be stretched out over time, requests for input need not come too frequently.

Getting buy-in and support is an issue that is particularly pertinent to taxonomies. People don’t always understand or respect the full benefits of a taxonomy. They may think that search alone or automatically generated keywords will suffice. Or they may prefer the simplicity of a search box. People responsible for tagging might prefer to avoid that task to save time and thus not adequately or correctly apply the taxonomy. Getting such people involved in the taxonomy creation process both educates them about the benefits of the taxonomy and gives them a sense of ownership by contributing to the process that is meant to serve them.

Others in your organization will use the taxonomy. If it’s for tagging and retrieving content internally, some people will use the taxonomy to tag content, and other people will use the taxonomy to help find content, but representatives from both groups of people should be invited to give input. If the taxonomy is to be used for public-facing content, those who will be tagging the content are still within your organization, and they should be consulted in the taxonomy design process.

Taxonomy consultants might lead 2-day taxonomy brainstorming workshops of stakeholders, conduct a series of interviews with stakeholders and users, and develop detailed use cases. While there is usually little interest in demanding so much time of others when the taxonomy project has been assigned to a person on staff, the taxonomist should still engage other employees, at least at a minimal level. To get initial input, instead of in-person interviews, this could involve emailing a short list of questions to key users and stakeholders and then following up with a phone conversation to go over the answers. Draft versions of a starter taxonomy, which can be developed by means other than a brainstorming workshop, should be shared with stakeholders and subject-matter experts for feedback.

A rather technical taxonomy could start with asking the subject matter experts to submit lists of terms, which the taxonomist would review with them and edit into a proper taxonomy. In most cases, however, the taxonomist develops the draft starter taxonomy and then seeks feedback.

If you are not asking stakeholders to provide the starter taxonomy, where do you start? Someone asked me that question recently: “Where do I go to get my starter/basic/beginner list before I actually sit with staff?” I advised that you don't need much prior to talking with people. My recommendation, in the case of an enterprise taxonomy, would be to consider the following sources for draft taxonomy terms:

  1. Folder names from shared drives
  2. A prior attempt at creating a starter taxonomy, even if never implemented
  3. Categories or tags set up in a content management system, whether or not fully used
  4. Intranet/website site map or navigation menu labels
  5. Product catalog top categories (as a starting point for a "Product" facet/Term set)
  6. Org chart (as a starting point for a "Department" facet/Term set)
  7. High-frequency use terms from search log reports
Include at least two of these sources to share with people in meetings. I also recommend not editing them prior to initial meetings with people, but rather getting their input on this raw data.
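
If the raw lists are available as exports, they can be collated side by side without editing any of them. Here is a minimal Python sketch, assuming each source has already been saved as a simple list of strings. The source names and terms are made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw term lists; in practice these would be exported from each source.
sources = {
    "shared drive folders": ["HR Policies", "Marketing", "Product Specs"],
    "intranet navigation": ["Human Resources", "Marketing", "Products"],
    "search log": ["marketing", "product specs", "expense report"],
}


def collate(sources):
    """Map each term (compared case-insensitively) to the sources where it appears."""
    seen = defaultdict(set)
    for source_name, terms in sources.items():
        for term in terms:
            seen[term.strip().lower()].add(source_name)
    return seen


collated = collate(sources)

# Terms found in two or more sources are natural discussion points for meetings.
shared = sorted(term for term, where in collated.items() if len(where) >= 2)
print(shared)
```

The original lists stay untouched. The collation only shows where the sources already agree, while near-matches such as “HR Policies” and “Human Resources” are deliberately left for people to discuss.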

Taxonomy work is thus people-oriented work, like information architecture, in contrast to the related fields of indexing and cataloging.

Friday, November 24, 2017

Auto-categorization and Taxonomies



Taxonomies and thesauri are only truly useful if their terms are appropriately indexed or tagged to content. My path to taxonomist had been as an indexer, so I always value the importance of human indexers. Nevertheless, I must acknowledge that automated indexing, also called auto-categorization, is becoming increasingly common and important.

At the most recent Taxonomy Boot Camp conference (November 6-7, in Washington, DC), a trend I discerned was the increasingly commonplace use of auto-categorization (or at least machine-aided indexing) with taxonomies. Conference presentations didn’t present auto-categorization as something new but rather as something more matter-of-fact: by the way, the software vendor used in this case is so-and-so. There were also sessions on artificial intelligence and taxonomy and on leveraging taxonomy management with machine learning. There is also a lot of interest in text analytics, a field broader than auto-categorization, which justified the first Text Analytics Forum conference co-located with and immediately following Taxonomy Boot Camp (which I, unfortunately, did not have time for).

Conference speakers and others state that automated indexing has been proven repeatedly in test comparisons to be more “reliable” and more “consistent” than human/manual indexing. While that is true, it does not mean that automated indexing is better. Human indexing is certainly not as consistent, as two trained indexers will not index exactly the same way, but the ways they differ are rarely substantial. One indexer may add an additional index term. Another indexer may index with a slightly different, but related, term. Automated indexing, on the other hand, while consistent, is not as correct. Depending on the method, it can be approximately 20% inaccurate, indexing with completely wrong terms or completely missing the most appropriate terms. That’s where “machine-aided indexing” comes in, where indexing is initially automated, but a human quickly reviews the suggested terms, adding or deleting terms as appropriate.

The primary reason for implementing automated indexing is not so much to achieve consistent indexing, but rather to achieve efficient indexing. This is because the amount of content to be indexed in many organizations is growing too fast for manual indexing to keep up. Publishers of external content for subscribers have also transitioned to partially automated indexing or machine-aided indexing.

While enterprise search engines do not utilize taxonomies by default (but can be configured to make use of them), auto-categorization software generally uses some form of taxonomy. Search engines can function out-of-the-box without any taxonomies or controlled vocabularies, although a search thesaurus (a.k.a. synonym ring) can significantly improve search precision and recall. Auto-categorization software, on the other hand, relies on “categories,” which can be simple controlled vocabularies or hierarchical or faceted taxonomies. Thus, as auto-categorization is gaining wider adoption, the need for taxonomies to support it is also growing.

Automated indexing technologies have not advanced significantly in recent years, but there have been improvements in auto-categorization software by effectively combining more than one technology method within the same software product. The main technology methods are (1) rules-based and (2) machine-learning. Regardless of the method, automated indexing is still not fully automated. Humans are required to put in time and effort beforehand to either write or edit rules for each taxonomy term, or to provide and test training sets of sample documents to index for machine learning. These could be dedicated roles or additional tasks to be performed by the taxonomist.
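
To give a sense of what writing rules for each taxonomy term involves, here is a deliberately simplified Python sketch of the rules-based method. Real auto-categorization products use much richer rule languages. The taxonomy terms, trigger phrases, and function name below are all hypothetical.

```python
import re

# Hypothetical hand-written rules: each taxonomy term maps to trigger phrases.
RULES = {
    "Mergers and acquisitions": ["merger", "acquisition", "takeover"],
    "Interest rates": ["interest rate", "federal reserve"],
    "Product recalls": ["recall"],
}


def categorize(text, rules=RULES):
    """Return the taxonomy terms whose trigger phrases appear in the text."""
    lowered = text.lower()
    matched = []
    for term, phrases in rules.items():
        # A word boundary only at the start lets plural forms such as
        # "interest rates" match the phrase "interest rate".
        if any(re.search(r"\b" + re.escape(p), lowered) for p in phrases):
            matched.append(term)
    return matched


suggested = categorize("The company announced a takeover bid as interest rates rose.")
print(suggested)
```

Even this toy version shows where the human effort goes: someone has to write and maintain the phrases for every term. A rule as loose as "recall" would also tag a sentence like "I recall the meeting," which is the kind of error a human reviewer catches in machine-aided indexing.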

Auto-categorization is also becoming more common because software products that effectively combine taxonomy management with auto-categorization have become more established and better integrated. Although there are many organizations which continue to use distinctly separate software for each of taxonomy management and auto-categorization, organizations newer to taxonomy adoption prefer to have a single solution. Synaptica is the one major taxonomy management vendor which does not yet include fully integrated auto-categorization, and they are very actively working on incorporating the technology. I have separate chapters in my book, The Accidental Taxonomist, for software for taxonomy management and software for auto-categorization, but in my second edition I ended up repeating more vendors in both sections.

Saturday, October 21, 2017

Taxonomies for Specific Business Needs

Designing controlled vocabularies to meet specific business needs was the topic of my latest conference presentation at Taxonomy Boot Camp London on October 17. There are two aspects to this topic: (1) the type of controlled vocabulary to choose, and (2) whether to have the same controlled vocabulary or distinct controlled vocabularies to serve different business needs.

For choosing the type of controlled vocabulary, the most common choices are a thesaurus, a hierarchical taxonomy, or a faceted taxonomy. It is also possible to have some kind of combination or hybrid type of these. I’ve discussed the difference between taxonomies and thesauri in previous blog posts, “Taxonomies vs. Thesauri” and “Taxonomies vs. Thesauri: Practical Implementations.” So, now I will focus on whether to have the same controlled vocabulary or distinct controlled vocabularies to serve different business needs.

What are different business needs? Taxonomies may be needed to make different kinds of information organized and easily searched or discovered and retrieved by different users, including:
  • Internal documentation, including policies and procedures, market and product research, etc.
  • Digital assets for content managers to reuse in publishing content 
  • Product information, such as a product catalog for ecommerce, for customers
  • Curated, premium content for subscribers
  • Informational content for the public
While different organizations have their own needs, the same organization could have more than one business need for a taxonomy, such as an internal use and an external customer-facing use. An organization in the business of publishing content may even have quite different published products for different users and purposes and consider each of those as separate business needs.

Taxonomies are versatile, so it is possible and worth considering having a single, master taxonomy serve all business needs, with terms classified for different uses. Terms managed in a taxonomy management system can be tagged with a category assignment as to which use they are for, such as some for an internal use, some for an external use, and perhaps many of them for both. You determine the type of category, let’s say “audience,” determine what audience types there are, such as “internal” and “external,” and set that up in the categories option of your taxonomy management software. Then you assign the categories, as appropriate, to each term. This method works if the same terms and the same structure are being used in both cases, with one use having more specific terms in some areas. The other use may have less specific terms, or also more specific terms in other areas.
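
The category approach can be pictured as a simple filter over the master taxonomy. This is a minimal Python sketch, not the data model of any particular taxonomy management product. The class, function, and sample terms are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A term in the master taxonomy with its audience category assignments."""
    label: str
    audiences: set = field(default_factory=set)  # e.g. {"internal", "external"}


# Made-up master taxonomy for illustration only.
master = [
    Term("Products", {"internal", "external"}),
    Term("Pricing", {"internal", "external"}),
    Term("Competitor analysis", {"internal"}),
    Term("Customer support", {"external"}),
]


def terms_for(audience, taxonomy=master):
    """Return the labels of terms assigned to the given audience category."""
    return [t.label for t in taxonomy if audience in t.audiences]


internal_view = terms_for("internal")
external_view = terms_for("external")
print(internal_view)
print(external_view)
```

Each term is stored once, and each use sees only the terms assigned to it. Note that this only works when the shared terms carry the same label in every view.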

The method of using the same taxonomy for different uses, designating use by categories on terms, requires:
  • that when a concept in the taxonomy has more than one use, the same preferred term label is used for the concept in both/all cases
  • that concepts/terms in the taxonomy have the same relationships to each other
Sometimes, however, different business needs require different preferred labels for a concept, such as “Customers” vs. “Clients.” It is possible to maintain multiple preferred labels for a concept, if you manage them as you would manage multiple language versions of a term in a multilingual taxonomy, but this is more complexity than necessary when only some of the terms have different preferred labels.

If you want to maintain links between equivalent terms, whether they have the same preferred labels or not, in different business-use versions, it’s not necessary to maintain them in the same taxonomy akin to multilingual versions. Rather, if you created two separate taxonomies, you could still set up inter-taxonomy links between the equivalent terms. This is not necessary, but might be desirable.
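
Inter-taxonomy links amount to a mapping between term identifiers in the two vocabularies, comparable in spirit to an exact-match mapping in SKOS. The following Python sketch uses made-up identifiers and labels, and plain dictionaries rather than any real taxonomy tool.

```python
# Two separate taxonomies, each with its own term IDs and preferred labels (made up).
internal_taxonomy = {"int:001": "Clients", "int:002": "Contracts"}
external_taxonomy = {"ext:101": "Customers", "ext:102": "Agreements", "ext:103": "Support"}

# Equivalence links between term IDs in the two taxonomies.
equivalents = [("int:001", "ext:101"), ("int:002", "ext:102")]


def build_lookup(links):
    """Make the links navigable in both directions."""
    lookup = {}
    for a, b in links:
        lookup[a] = b
        lookup[b] = a
    return lookup


lookup = build_lookup(equivalents)
all_labels = {**internal_taxonomy, **external_taxonomy}


def equivalent_label(term_id):
    """Return the preferred label of the equivalent term in the other taxonomy, if any."""
    other_id = lookup.get(term_id)
    return all_labels[other_id] if other_id is not None else None


print(equivalent_label("int:001"))
print(equivalent_label("ext:103"))
```

Each taxonomy keeps its own preferred labels and structure, and a term without a counterpart, such as “Support” here, simply has no link.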

Whether to maintain one or more taxonomies also depends on the size of each. If one of the business use cases requires only a small taxonomy, of a couple hundred terms or less, then it is not too much trouble to maintain distinct taxonomies for each.

Saturday, September 30, 2017

Vocabulary Management Issues



“Issues in Vocabulary Management” is the latest Technical Report (TR-06-2017) published by the National Information Standards Organization (NISO), approved on September 25, 2017. I had the honor of serving on its working group, specifically on its subgroup for Vocabulary Use/Reuse.

The most significant NISO publication for controlled vocabularies is ANSI/NISO Z39.19-2005 (R2010) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, which is referenced several times in TR-06. ANSI/NISO Z39.19 focuses on how to design and create controlled vocabularies (especially thesauri and taxonomies), whereas TR-06 addresses issues in the use of controlled vocabularies. Furthermore, as a Technical Report, rather than a Standard, this 49-page document does not contain requirements, but rather serves an informative purpose. It does have a page of recommendations, though, which are for a vocabulary’s definition and attribute types, its best practices for documentation, and its licensing or provisions for use and reuse.

Over time, the need to create new controlled vocabularies from scratch diminishes, as more vocabularies come into existence, especially those that are made available for sharing or licensing (see my blog post Directories and Databases of Published Controlled Vocabularies), but the need to maintain, revise, and reuse them grows, so this Technical Report serves a valuable role.

What are the “issues” in vocabulary management? They could vary, based on the organization and implementation, but this document considers three areas:

  • Vocabulary use and reuse, dealing with permissions, licenses, maintenance, versioning, extending and mapping vocabularies.
  • Vocabulary documentation, dealing with governance issues and how to document vocabulary properties.
  • Vocabulary preservation, dealing with issues of abandoned or “orphaned” vocabularies, which is especially the case of vocabularies developed by nonprofit organizations which have lost their funding to maintain them.

These issues are relevant to both proprietary controlled vocabularies, which may be reused through licensing agreements, and publicly available vocabularies, which are shared and reused increasingly through linked data on the web, or more specifically the Semantic Web and the Linked Open Data environment. For publicly available or open vocabularies, there are also the issues of simply finding or discovering suitable and sustainable vocabularies, evaluating them, and then communicating between the vocabulary owner and user.

TR-06 takes a somewhat broader view of “vocabularies,” not just “controlled vocabularies,” but also including ontologies, unstructured term lists, terminologies, synonym rings, etc. I explored these differences and definitions in detail in my blog post Vocabularies and Controlled Vocabularies, which I wrote shortly after starting work on the NISO working group. The vocabularies of concern of TR-06 also include element sets, which comprise metadata properties/fields and not merely the controlled vocabulary terms/values within those properties.

TR-06 does not read much like a “technical report.” It also includes several real-life examples and use cases. To a certain extent, it explains by example. Appendices include a glossary of terms with extensive definitions; a descriptive list of vocabulary directories, repositories or collections (something that I worked on); a list of free and open vocabulary tools (far more extensive than those I described in a previous blog post Free Taxonomy Management Software); and a list of additional resources with links, besides its bibliography, making this quite a valuable resource.

TR-06 “Issues in Vocabulary Management” will now be added to my list of recommended resources for controlled vocabulary and taxonomy management, and I hope that many of those who manage taxonomies will take a look at it.