Monday, July 17, 2017

Metadata and Taxonomies

Metadata and taxonomies are related. In The Accidental Taxonomist, 2nd edition (pp. 15-18), I explain that most, but not all, taxonomies (not purely navigational taxonomies) serve to populate terms/values in metadata fields/elements; and some, but definitely not all, metadata fields are populated by terms/values from controlled vocabularies or, more specifically, taxonomies (in contrast to free text or key words).

The question remains whether to start with creating the overall metadata strategy and schema and then build taxonomies as part of it as needed, or to start with creating a taxonomy and then, in the process, identify the various descriptive metadata.  Ideally the two are developed in combination, as part of an integrated strategy. However, an expert in taxonomy development (a taxonomist) and an expert in metadata design (a metadata architect) are usually not the same person.

A metadata architect can become an accidental taxonomist, and a taxonomist can become an accidental metadata architect, or the two experts can work together on the same project, although it is not so common for an organization to have both such experts on staff.  Whether an organization has a metadata architect or taxonomist depends on the nature of the organization’s content and content organization needs.

Organizations that start with the metadata expertise and approach to information management tend to be those with significant needs in digital asset management (with image or other media collections), records management (in highly regulated industries), publishing, or cultural preservation (museums or libraries). Organizations that start with the taxonomy expertise and approach include product or service providers, distributors and retailers (especially in ecommerce), and organizations focused on providing information resources.

A hierarchical taxonomy can be integrated with metadata, when one of the metadata fields is for “Topic” or “Subject,” and there is a hierarchical taxonomy of subject terms associated with that field. However, it is the faceted type of taxonomy in particular that unites the tasks of taxonomy creation and metadata design.

Faceted Taxonomies and Metadata

A faceted taxonomy comprises a set of facets, each an individual controlled vocabulary, whose terms are generally not linked/related to terms in the other facets. Terms from several facets are combined to tag the same set of content, and users search/filter on terms in combination from various facets. Examples of facets may be Product/Service, Market Segment, Location, Document Type, Supplier, etc. A faceted taxonomy is a common type for both enterprise taxonomies and ecommerce or product review taxonomies, and it’s a type of taxonomy that taxonomists are familiar with creating. It’s called a “taxonomy” even though it differs from the classical hierarchical “tree” type of taxonomy, because it involves controlled vocabulary and classification. The name of each facet and the terms within the facet constitute a simple two-level hierarchy.

Each facet is also a metadata field/element. The taxonomist designing a faceted taxonomy is thus also designing metadata, at least some of it. There are usually more metadata fields to describe the content beyond those which comprise the taxonomy facets. For a faceted taxonomy to best serve the user who is trying to find/discover content based on what it is and what it is about, the number of facets should be limited. (See my earlier post "How Many Facets.") Metadata, however, can serve additional purposes beyond helping users find content. Metadata may describe content for purposes of full identification, source citation, or information on how the content can be used, including rights data.  The taxonomist or metadata architect needs to decide which metadata fields will constitute a displayed faceted taxonomy for the end-user to utilize in search/discovery, and which metadata fields will not but will rather display on a selected content record.

On the other hand, there may be additional metadata fields beyond the scope and definition of “taxonomy” that are nevertheless made available to the end-user to filter/refine results alongside the other, taxonomy facets. These could be for author/creator, date, title keyword, text keyword, file format, etc. Sometimes the distinction between taxonomy facet and other metadata in this case is not so clear, such as for Document/Content Type, Audience, or Language, when these fields utilize controlled vocabularies. Due to this overlap and blurred distinction between taxonomy facets and displayed metadata for filtering, it is a good idea to design the taxonomy and metadata together as an integrated strategy.
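
As a rough illustration of facets doubling as metadata fields, here is a minimal Python sketch. The facet names, vocabulary terms, and content records are all hypothetical, and a real search or content management system would handle this through its own configuration. The sketch shows only the idea: some metadata fields (here "title" and "rights") merely display on the record, while the fields that are facets draw their values from controlled vocabularies and can be combined as filters.

```python
# Hypothetical facets: each one is a metadata field with its own controlled vocabulary.
FACETS = {
    "document_type": {"Report", "Policy", "Presentation"},
    "location": {"Boston", "London"},
    "market_segment": {"Healthcare", "Retail"},
}

# Hypothetical content records. "title" and "rights" are metadata fields that
# are not facets: they display on the record but are not offered as filters.
RECORDS = [
    {"title": "Q1 retail outlook", "rights": "Internal use only",
     "document_type": ["Report"], "location": ["Boston"],
     "market_segment": ["Retail"]},
    {"title": "Hospital data policy", "rights": "Public",
     "document_type": ["Policy"], "location": ["London"],
     "market_segment": ["Healthcare"]},
    {"title": "London retail report", "rights": "Public",
     "document_type": ["Report"], "location": ["London"],
     "market_segment": ["Retail"]},
]


def invalid_tags(record):
    """Return (facet, value) pairs that are not in the facet's controlled vocabulary."""
    problems = []
    for facet, vocabulary in FACETS.items():
        for value in record.get(facet, []):
            if value not in vocabulary:
                problems.append((facet, value))
    return problems


def filter_records(records, **selected):
    """Keep the records tagged with the selected term in every selected facet."""
    return [
        record for record in records
        if all(term in record.get(facet, []) for facet, term in selected.items())
    ]


# Combine terms from two facets, as an end-user would when refining results.
hits = filter_records(RECORDS, document_type="Report", market_segment="Retail")
print([record["title"] for record in hits])
```

Adding a filter such as author or date to this sketch would make it a displayed metadata field without making it part of the taxonomy, which is where the distinction starts to blur.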

Sunday, June 18, 2017

Standards for Taxonomies

Since “taxonomies” are rather loosely defined, standards specifically for taxonomies do not exist, but there are standards that are relevant to taxonomies. A taxonomy is a kind of controlled vocabulary, and there are standards for controlled vocabularies. There are also standards specifically for thesauri, a kind of controlled vocabulary with which taxonomies typically share many features. 

Standards serve various purposes. Two leading purposes for standards are:
  1. To ensure consistency and ease of use across different products or systems used by different users.
  2. To ensure interoperability, the sharing or exchange of products/services/information.

Standards for Consistency

Standards aimed at ensuring consistency and ease of use would include those for buttons on devices, menus in user interfaces, and pedals in cars. With such standards, users can expect the same experience from manufacturers or service providers and thus they are able to easily use products or systems from different manufacturers/providers/vendors. In the case of information systems, this kind of standard includes those for the design and style of book indexes and thesauri. These “standards” tend to be guidelines, recommendations, or accepted conventions, and not exactly strict standards, even if issued from a standards body. For thesauri, the “standard” is issued by the National Information Standards Organization (NISO), but it is called a “guideline”: ANSI/NISO Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. The corresponding ISO standard is ISO 25964 Part 1: Thesauri for Information Retrieval.

These guidelines cover style and form of terms, circumstances for creating the various kinds of relationships between terms, use of notes on terms, etc. They are all about how to create well-formed thesauri with consistent design features that are then easy and intuitive to use. For example, when a user sees that two terms are in a hierarchical relationship, the user understands that the narrower term is a kind of, instance of, or integral part of the broader term, and not merely an aspect of or some other related concept of the broader term. In fact, the end-user of a thesaurus does not even need to know and understand thesaurus principles to be able to make use of a thesaurus to find desired concepts and content.

Standards for Interoperability

The other kind of standards, those aimed at ensuring interoperability, would include standards for size and units of measure, data exchange, and communications protocols. Interoperability standards are important for those controlled vocabularies which are intended to be shared or reused. Thus, the content to which controlled vocabularies link can be accessed by third parties or made publicly accessible over the Web. Controlled vocabularies may be “reused” if the original creator of a controlled vocabulary decides to license the vocabulary (without linked content) to other publishers to use on their own content, so that the second publisher does not have to reinvent a controlled vocabulary that already exists in the same subject area.

Interoperability standards for controlled vocabularies include ZThes (a thesaurus schema for XML, which has since gone out of style), World Wide Web Consortium (W3C) specifications for the Semantic Web including SKOS (Simple Knowledge Organization System) and the Web Ontology Language (OWL) for ontologies, and ISO 25964 Part 2: Interoperability with other vocabularies. Indeed, ISO 25964 covers consistency standards in its first part and interoperability standards in its second part.
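
To give a feel for what a SKOS representation looks like, here is a minimal Python sketch that writes a tiny hierarchy out as Turtle, one of the common RDF text formats. The skos:Concept, skos:prefLabel, and skos:broader names come from the SKOS vocabulary, but the example.org namespace, the way identifiers are derived from labels, and the sample terms are assumptions made for illustration. In practice a taxonomy management tool or an RDF library would generate this output.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/taxonomy/"  # hypothetical namespace for the concepts


def slug(label):
    """Derive a simple identifier from a label (assumes plain words and spaces)."""
    return label.lower().replace(" ", "-")


def to_turtle(broader_of):
    """Serialize {preferred label: broader label or None} as SKOS in Turtle.

    Assumes the labels contain no double quotes or other characters
    that would need escaping in Turtle.
    """
    lines = [f"@prefix skos: <{SKOS_NS}> .", f"@prefix ex: <{BASE}> .", ""]
    for label, broader in broader_of.items():
        lines.append(f"ex:{slug(label)} a skos:Concept ;")
        if broader is not None:
            lines.append(f"    skos:broader ex:{slug(broader)} ;")
        lines.append(f'    skos:prefLabel "{label}"@en .')
        lines.append("")
    return "\n".join(lines)


# A two-term hierarchy: Glaucoma is a narrower term of Eye diseases.
print(to_turtle({"Eye diseases": None, "Glaucoma": "Eye diseases"}))
```

Because each concept gets a URI and the relationships use shared SKOS properties, another system can read the vocabulary without knowing anything about the tool that produced it.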

Metadata Schema

Since taxonomies or other controlled vocabularies may be used to provide terms that fill a certain metadata element/property/field within a larger set of metadata, the use of a standard metadata schema or model is yet another way in which interoperability standards involve taxonomies.  If structured content is to be shared or exchanged, the metadata fields need to be standardized with the same names, abbreviations, and purposes.

Examples of standard metadata schemas include MARC for library materials, Dublin Core (DCMI) for generic online networked resources, IPTC (International Press Telecommunications Council) for photographs and other media, DDI (Data Documentation Initiative) for describing data from the social sciences, and PREMIS (Preservation Metadata: Implementation Strategies) for repositories of digital objects. Adopting such a metadata schema would be another way to enable sharing of content tagged with the metadata.
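
Standardizing field names for exchange can be pictured as a simple crosswalk. The Python sketch below maps an organization's local field names onto the fifteen elements of the Dublin Core element set. The local field names, the sample record, and the "dc:" key convention are hypothetical, so treat this as an illustration of the idea and not as a prescribed mapping.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical crosswalk from local field names to Dublin Core elements.
CROSSWALK = {
    "headline": "title",
    "author": "creator",
    "topic": "subject",      # typically populated from a taxonomy
    "published": "date",
    "doc_type": "type",
    "lang": "language",
}


def to_dublin_core(local_record):
    """Return (record keyed by Dublin Core element, list of unmapped local fields)."""
    dc_record = {}
    unmapped = []
    for field, value in local_record.items():
        element = CROSSWALK.get(field)
        if element in DC_ELEMENTS:
            dc_record[f"dc:{element}"] = value
        else:
            unmapped.append(field)
    return dc_record, unmapped


record = {
    "headline": "Managing glaucoma",
    "topic": ["Glaucoma"],
    "internal_code": "A-17",  # no counterpart in the crosswalk
}
print(to_dublin_core(record))
```

Fields that end up in the unmapped list are the ones to review before sharing content, since the receiving party would have no standard element in which to put them.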

I was pleased to have the opportunity to learn more about information and publishing standards recently at the Society for Scholarly Publishing conference by attending the pre-meeting seminar “All About Standards.”

Wednesday, May 24, 2017

Adjective and Verb Terms in Taxonomies

Terms in a taxonomy are generally nouns or noun phrases, but this does not mean that a taxonomy cannot comprise adjectives or verbs instead. There may be differences of opinion on this, though.

A thesaurus, another kind of controlled vocabulary, by contrast, is expected to follow standards (ANSI/NISO Z39.19 or ISO 25964), which dictate that the terms be only nouns or noun phrases. Since a thesaurus is more structured than a taxonomy, it might be assumed that a thesaurus is a kind of taxonomy with additional features (nonpreferred terms, associative relationships, scope notes, etc.), but that the basic format of the terms is the same.  In general, this is true. Terms in the vast majority of taxonomies follow the same format as terms in thesauri, and the differences between these two different knowledge organization systems lie rather in their use of term relationships and additional attributes.

Taxonomists should attempt to follow the thesaurus standards when creating taxonomies, to the extent that is practical or relevant. Reflecting the content and serving the users are always the first priorities for taxonomies. So, there may be cases when terms as adjectives or verbs are practical.

Taxonomies vary more than thesauri do, though. While the structure of a thesaurus is consistent, taxonomies can be based on hierarchies or on facets or a combination of both. Facets are lists of terms to describe certain attributes, aspects, limit-by/filter-by categories, or metadata fields. Facets could include types such as color, size, speed, etc., in which the terms in these facets are adjectives, for example the names of individual colors.

Taxonomies with terms that are verbs are even less common than taxonomies with terms that are adjectives. Taxonomy terms of verbs (not merely verbal nouns ending in -ing) are found in only very special-purpose taxonomies. As with taxonomies with adjectives, the verb terms would not comprise or be scattered throughout an entire hierarchical taxonomy, but would rather serve as shorter term lists or facets. A good example is Bloom’s taxonomy of educational outcomes, which is just the short list of the following verbs in this order: Remember, Understand, Apply, Analyze, Evaluate, and Create. Taxonomists might dismiss Bloom’s as not really a “taxonomy,” but it is very common to use Bloom’s terms in a facet within a faceted taxonomy for educational content.

Sets of longer verb phrases may stretch the definition of taxonomy or controlled vocabulary, but they still serve the same purpose of a controlled list within a metadata property used to tag content. This is the case for learning objectives used to tag educational content. An example of a learning objective is: “Classify costs as direct versus indirect.” Learning objectives can even be put into a hierarchy, like other taxonomies.

Metadata of phrases that begin with verbs could also be used to describe processes or procedures. I was once asked to design a “taxonomy” for the steps and options of statements/questions to be made by sales representatives as they go through the process of achieving a sale. These “terms” would have been verbal statements as complex as learning objectives. The issue I had with calling it a taxonomy is that the statements would not be arranged hierarchically as broader/narrower, but rather in a flow-chart procedure format. Indeed, this would have violated the definition of a taxonomy, which has to have some hierarchy. However, it would have resembled an ontology with its semantic relationships. So, such a procedure system still would be a kind of knowledge organization system.

Sunday, April 23, 2017

Taxonomy Term Specificity

One of the challenges in creating or editing taxonomies is determining how specific the terms should be. This is a key issue in making a taxonomy customized for a certain implementation, which involves a unique set of content to be tagged/indexed and a certain set of users. Highly specific terms tend to be the consequence of deeper hierarchies. So, the decision of how specific the terms should be is also related to the decision of how many hierarchical levels deep the taxonomy should be. Taxonomies that are organized into multiple facets, on the other hand, tend to have more limited hierarchy, if any, and terms that are not so specific.

Having taxonomy terms that are more specific than necessary inevitably means that there are more taxonomy terms than necessary. The larger taxonomy is more difficult to maintain both in currency and consistency. Terms that are more specific than necessary are also likely to be more specific than expected by the users and might get overlooked and not even used. If the taxonomy is searched, the users will not likely search for such highly specific terms. If the taxonomy is browsed, the users might stop at a higher-level broader term and be satisfied with that. Furthermore, users like to retrieve multiple results (content items or references) for a single search term, so that they can browse the list and evaluate what they want. Highly specific terms will match fewer content items, so retrieved results could comprise only one or two items per taxonomy term, which may not satisfy most users. Having a greater number of more specific terms can also lead to more inconsistency in the indexing/tagging, whether manual or automated. 

Having taxonomy terms that are not specific enough means that each taxonomy term is indexed to a relatively large number of content items, and the users may have to scroll through multiple screens of returned results and look at multiple items to find what they really want. The availability of additional filters or facets can help limit the results, though. Having terms that are not specific enough also makes it more difficult for users to “discover” potential related topics of interest, whether the terms have “related-term”/“see also” relationships between them or whether “related” terms are suggested by shared tagged occurrence among content items.

Taxonomists sometimes refer to term specificity as “granularity” or a taxonomy being “granular.” There is the irony that, although the scope and meaning of specific terms is granular/narrow/small, the terms themselves are not small. The “granular” terms tend to be longer, more complex, multi-word terms. If combining multiple concepts into a single term, such terms might also be called "pre-coordinated" terms. Following are examples of specific, granular taxonomy terms from different specialized taxonomies:

  • Possessed object access systems (in an information technology taxonomy)
  • Fingerstick blood sugar testing (in a health care taxonomy)
  • Standard manufacturing overhead cost (in a business taxonomy)

The taxonomist typically creates specific/granular terms, based on the concepts of sample content to be tagged. There may be a document with the phrase in the title, an image with the phrase in its caption, a product with this description as its type, a department with the phrase in its name, etc. Obviously, source phrases would need to be edited to become well-formed taxonomy terms, but they may still be multi-word, complex terms. Creating a taxonomy from scratch usually involves a combination of a top-down and bottom-up approach in the development of terms and hierarchical relationships. The specific/granular terms are the result of the bottom-up component of taxonomy development.

Taxonomies available for license might be appropriate in their subject area and scope, but chances are that their terms are either too specific or not specific enough for a given implementation. Thus, if you choose to license a taxonomy, make sure your license allows you to customize the taxonomy so that you can either delete terms that are too specific or add more specific terms, as narrower terms to existing terms, to suit your content.

Creating or deleting specific terms is also part of periodic taxonomy maintenance. If a term, which has no narrower terms, is heavily used in indexing, it might be time to “break it up” by creating a few more specific, narrower terms so that the large content set is indexed and retrieved with more specific terms for more manageable result numbers. If, over a period of time, a specific term has been applied in indexing very few times, or not at all, it should probably be deleted. The deleted term can be changed to a variant/nonpreferred term/alternative label for an existing broader concept. The specificity of a taxonomy should match the specificity of the content being tagged with it, and this can change over time.
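
This kind of maintenance review can be partly automated from indexing counts. The Python sketch below is only an illustration: the function name, the thresholds (500 and 3), and the sample counts are hypothetical and would need tuning to the size of the content set. Limiting retirement suggestions to terms without narrower terms is also an assumption of the sketch.

```python
def review_terms(usage_counts, narrower_terms, split_above=500, retire_below=3):
    """Suggest maintenance actions from indexing counts.

    usage_counts:   {term: number of content items tagged with the term}
    narrower_terms: {term: list of its narrower terms}
    The thresholds are hypothetical defaults, not recommended values.
    """
    suggestions = {}
    for term, count in usage_counts.items():
        has_narrower = bool(narrower_terms.get(term))
        if count >= split_above and not has_narrower:
            # Heavily used term with no narrower terms: candidate to break up.
            suggestions[term] = "split"
        elif count < retire_below and not has_narrower:
            # Rarely used term: candidate to become a nonpreferred label
            # of its broader term.
            suggestions[term] = "retire"
    return suggestions


usage = {
    "Eye diseases": 900,
    "Glaucoma": 40,
    "Diabetes": 1200,
    "Fingerstick blood sugar testing": 0,
}
narrower = {"Eye diseases": ["Glaucoma"]}
print(review_terms(usage, narrower))
```

The output is a list of candidates for a taxonomist to review, not a set of changes to apply automatically, since a rarely used term may still be needed for new content.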

Friday, March 17, 2017

Taxonomies as Knowledge Organization Systems

A taxonomy is a kind of controlled vocabulary. A taxonomy is also a kind of knowledge organization system. So, the question is: what’s the difference, if any, between a controlled vocabulary and a knowledge organization system? When I first heard of “knowledge organization system” I perceived it as merely a more academic term for controlled vocabulary. While it’s true that knowledge organization systems are discussed more in library and information science literature and courses than they are in corporate enterprises, there are additional nuanced differences between the two.

Controlled vocabularies comprise simple term lists, synonym rings (search thesauri), authority files, taxonomies, and thesauri. Knowledge organization systems comprise all of these, plus categorization schemes, classification schemes, dictionaries, gazetteers, glossaries, ontologies, semantic networks, subject heading schemes, and terminologies. As such, knowledge organization systems can be considered to be broader than controlled vocabularies, including all kinds of controlled vocabularies and more.

Yet, it’s not simply a matter of more types that distinguishes knowledge organization systems. Knowledge organization systems include “schemes” that go beyond how the terms are organized and related to each other. Categorization schemes, classification schemes, semantic networks, and ontologies present not only terms and relationships but also models of how information/knowledge can be managed and organized. These typically involve additional specifications and documentation on how they are to be used. There is indeed something to the name “knowledge organization system.” A “system” is more than just terms and their relationships.

As such, there is more discourse around knowledge organization systems than controlled vocabularies, per se (separate from discussions specifically about taxonomies or thesauri). Conference sessions of the Association for Information Science & Technology (ASIS&T) more often have “knowledge organization systems” in their titles than “controlled vocabularies.” There is even a professional association dedicated to knowledge organization systems, the International Society for Knowledge Organization (ISKO). There is no comparable organization for controlled vocabularies or just taxonomies or thesauri. ISKO holds conferences with sessions around the various issues of knowledge organization systems, including taxonomies. Recognizing that taxonomies are an important kind of knowledge organization system, the ISKO UK chapter co-sponsors the Taxonomy Boot Camp London conference.

Taxonomies are not only included within knowledge organization systems, but they are also a part of the field of knowledge management. As a consultant, I worked with clients who managed taxonomies within their knowledge management services, headed by a manager or director of knowledge management. Also, at a consultancy where I previously worked, taxonomy consulting was part of the larger knowledge management consulting practice.

I used to describe taxonomies as only a kind of controlled vocabulary, but now I will start referring to them as knowledge organization systems as well.

Sunday, February 19, 2017

Avoiding Mistakes in Taxonomy Hierarchical Relationships

Perhaps the most important issue in designing a hierarchical taxonomy is creating hierarchical relationships between terms correctly. This makes the taxonomy intuitively easy to understand and navigate by all kinds of users, regardless of whether they have had any training on using a taxonomy.

The basic principles of the hierarchical relationship are described in the ANSI/NISO Z39.19 and ISO 25964-1 standards for thesauri. As a quick summary, the relationship is created between terms in the following circumstances:

  • a broader term which is generic and a narrower term which is a more specific type of the generic broader term,
  • a broader term which is generic and a narrower term which is a named instance (proper noun) of the generic broader term,
  • a broader term which is a whole entity and a narrower term which is an integral part.

It is the first, generic-specific type that is most common, but it is also most prone to errors by those not experienced in creating taxonomies. Typical errors include confusing refinements with narrower terms, too closely reflecting the source content hierarchy, and creating narrower terms that are applications, uses, or examples of a broader term.

Confusing Refinements with Narrower Terms

We envision users browsing a hierarchical taxonomy from top down, from broad topic to more specific topic. A more specific topic is a narrower term (NT) of a broader topic. However, instead of providing more specific topics, the creator of a taxonomy might mistakenly provide refinements of the broader topic, which are aspects of the topic, but not actually narrower terms. A term that is an aspect or refinement is not a unique stand-alone term/concept, but rather it is meant to be used in combination with its parent term.

An example of such an erroneous hierarchy would be:

  Eye diseases
     NT: Diagnosis

Diagnosis is an aspect or refinement of Eye diseases (and of other disease-type terms), and not a narrower term. A narrower term would be a specific type of eye disease:

  Eye diseases
     NT: Glaucoma

A refinement term might not be as obvious as it is in the above example. If the same term, however, appears duplicated as a narrower term to different broader terms, but with a different implied/contextual meaning in each case, this should be a red flag that the duplicated narrower term is really a refinement term. For example, the duplication of the term Waiver in a legal taxonomy as:

  Objections to evidence
     NT: Waiver

  Right to jury trial
     NT: Waiver

In this case, the duplicate narrower term should be changed to be specific in each case, such as: Objections to evidence waiver and Right to jury trial waiver.
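
A check for this red flag is easy to script. The Python sketch below lists every term that appears as a narrower term under more than one broader term. The function name and the sample hierarchy are hypothetical, and the script cannot judge meaning: a term may legitimately have more than one broader term, so each hit is only a prompt for a taxonomist to decide whether the term means the same thing in each place.

```python
from collections import defaultdict


def find_repeated_narrower_terms(hierarchy):
    """Return {narrower term: sorted broader terms} for terms with several parents.

    hierarchy maps each broader term to a list of its narrower terms.
    """
    parents = defaultdict(list)
    for broader, narrower_list in hierarchy.items():
        for narrower in narrower_list:
            parents[narrower].append(broader)
    return {
        term: sorted(broader_terms)
        for term, broader_terms in parents.items()
        if len(broader_terms) > 1
    }


# Hypothetical fragment of a legal taxonomy.
legal = {
    "Objections to evidence": ["Waiver", "Hearsay objections"],
    "Right to jury trial": ["Waiver"],
}
print(find_repeated_narrower_terms(legal))
```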

Novice taxonomists might create such incorrect broader term-narrower term relationships because they have seen them formed as such elsewhere, such as Library of Congress Subject Headings plus Subdivisions or back-of-the-book index main entries plus subentries. A subheading or a subentry is not the same as a narrower term, because a subheading or a subentry only has usage and meaning in the context of the main heading it is associated with (appears under). A taxonomy narrower term, on the other hand, is not a different kind of term, but is rather a description of a relationship between terms. The meaning of a term in a taxonomy is constant and not dependent on its location in the taxonomy.

Too Closely Reflecting the Source Content Hierarchy

Some taxonomies are based heavily on certain text sources, such as the table of contents of one or a limited number of books or manuals, where the text is structured into units, chapters, main heading sections, subheading sections, etc. It is thus natural to make use of the structure of the text as a basis for the structure of the hierarchy. But there can be issues.

In the following example of a chapter and its headings from a textbook, greater hierarchical structure is needed for the corresponding taxonomy terms, and one of the topics (Units of Measure) does not belong within this hierarchy.

  Microbiology Laboratory
  --Microbiology Lab Personnel
  --Introduction to the Microscope
  --Units of Measure
  --Types of Microscopes
  --Laboratory Staining Methods
  --Culture Media

These concepts may appear in a taxonomy arranged hierarchically as follows:

  Medical laboratory technology
  NT: Laboratory equipment and supplies
       NT: Culture media
       NT: Microscopes
            NT: Microscope types
  NT: Laboratory personnel
  NT: Microscope use
       NT: Microscopy stains
  NT: Serology

  NT: Measurements and calculations
       NT: Units of measure

Another issue is that, even when the hierarchy from the source is acceptable, the subheading-based terms are short, generic, and without context. An example is as follows:

  Eye Medications
  --Anti-inflammatory Agents
  --Antiglaucoma Agents
  --Local Anesthetics

The only correct narrower term above is Antiglaucoma Agents, as the other terms are not specific to eye medications. They could be linked as related terms instead.

Applications, Uses, or Example-Type Terms

Relying too much on certain text sources for the taxonomy may also result in erroneously creating narrower terms for the applications, uses, or examples of the broader term concept, because the text presents content that way.

Following are several examples:

  Web Applications
  --Tourism and Travel
  --Higher Education
  --Financial Institutions
  --Software Distribution
  --Health Care

  Decision making issues
  --Ethical conflicts
  --Information sources
  --Intraorganizational conflicts
  --Social influences

  Globalization challenges
  --Cultural differences
  --Economic risk
  --Political risk
  --Managerial limitations

Each of these so-called narrower terms is merely an example within the context of the broader term. All "narrower terms" could have other uses beyond the context of the broader term. To make the hierarchy correct, either:
1) the relationship should be changed from narrower-term (NT) to related-term (RT). This would be the case if these terms can logically exist elsewhere in the taxonomy. Also, indexing of the concepts may require a pair of terms (such as Globalization challenges AND Economic risk), or
2) the narrower terms should be modified and clarified, such as Cultural challenges to globalization, Economic risk challenges to globalization, Political challenges to globalization, and Managerial challenges to globalization. This would be the case if these terms did not exist elsewhere in the taxonomy.

In conclusion, hierarchical relationships need to be constructed independent of any sources for terms, and they need to be universal and not subject to certain contexts.