The Accidental Taxonomist

Thursday, August 22, 2019

Taxonomy Mapping

As more taxonomies get created, we see a growing need to “map” taxonomies to each other, that is, to link individual terms or concepts in each taxonomy so that the taxonomies may be used in some combination. Mapping is not new, but as it has become more frequent it is now reflected in newer standards and in taxonomy management software features.

Mapping taxonomies

Reasons or use cases for mapping include:

Selected content with an enterprise taxonomy is made available on a public web site with a different public-facing taxonomy.
A provider of scientific/technical/medical content with a technical thesaurus creates a simpler taxonomy aimed at laypeople.
Content will be made available in a different language region, and a comparable taxonomy already exists in the other language.
A knowledge graph is built to aggregate data from multiple repositories, each with its own taxonomy.
An enterprise search is based on “federated search” and different areas have different search-support thesauri.
Terms from search engine logs are mapped to a taxonomy to add alternative labels.
Terms from an open source or licensed vocabulary are mapped to a taxonomy to enrich it.

I’ve worked on occasional taxonomy mapping projects since the late 1990s, and I discuss mapping in a section of my book, The Accidental Taxonomist (2nd edition, pp. 369-73) and in an earlier blog post. I’ve also presented at conferences before on mapping taxonomies, as early as 2009, but only briefly and in the wider context of the related activities of merging taxonomies and creating multilingual taxonomies. My next conference presentation (not including a pre-conference workshop), “Mapping Taxonomies, Thesauri, and Ontologies” (SEMANTiCS 2019 in Karlsruhe, Germany), will be dedicated to the subject of mapping.

In talking recently with more people about mapping, both clients and software vendors, I’ve learned that my previous view of mapping was somewhat narrow. I had considered mapping to be only one-way directional from terms in a tagged taxonomy to terms in a retrieval taxonomy.

One-way directional taxonomy mapping

I still think this model applies to the majority of use cases, but mapping has a broader meaning in the standards and in taxonomy management software capabilities.

Standards for Taxonomy Mapping

The SKOS (Simple Knowledge Organization System) W3C standard adopted in 2009 for a controlled vocabulary model and interchangeable format specifies not only the familiar thesaurus relationships of broader, narrower, and related, but also what are called mapping relationships, comprising exactMatch, closeMatch, broadMatch, narrowMatch, and relatedMatch. How these different mapping relationship types are to be used is really up to the taxonomy owner. The broadMatch and narrowMatch relationships are directional but reciprocal, so using these permits bidirectional mapping. However, there is no reason why you cannot use just one mapping relationship type if you are mapping in only a single direction. Or you could use just two, such as exactMatch and broadMatch.
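To make the reciprocity concrete, here is a minimal Python sketch. It uses plain tuples rather than an RDF library, and the concept names and the "a:" and "b:" vocabulary prefixes are invented for illustration. The inverse table reflects how SKOS defines the five mapping properties: exactMatch, closeMatch, and relatedMatch are symmetric, while broadMatch and narrowMatch are inverses of each other.

```python
# Inverse of each SKOS mapping property.
# exactMatch, closeMatch, and relatedMatch are symmetric;
# broadMatch and narrowMatch are inverses of each other.
INVERSE = {
    "skos:exactMatch": "skos:exactMatch",
    "skos:closeMatch": "skos:closeMatch",
    "skos:relatedMatch": "skos:relatedMatch",
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
}


def reciprocal(mapping):
    """Return the reverse-direction statement implied by a mapping triple."""
    source, prop, target = mapping
    return (target, INVERSE[prop], source)


# Hypothetical concepts from two vocabularies, "a" and "b".
mappings = [
    ("a:Sedans", "skos:broadMatch", "b:Cars"),       # b:Cars is the broader concept
    ("a:Automobiles", "skos:exactMatch", "b:Cars"),  # same meaning
]

for m in mappings:
    print(m, "implies", reciprocal(m))
```

If you map in only one direction, you would simply store the original statements and ignore the implied reverse ones.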

The international standard ISO 25964-2 Thesauri and Interoperability with Other Vocabularies – Part 2: Interoperability with Other Vocabularies (published in 2013) is substantially about mapping. Interoperability is not synonymous with mapping but covers more, including using a standard format such as SKOS. However, the ISO standard discusses mapping in more detail than any other form of interoperability. The introduction states that “inter-vocabulary mapping will be the principal focus of this part of ISO 25964.” (The slightly older American standard, ANSI/NISO Z39.19-2005, is comparable with ISO 25964 Part 1, which is all about thesauri, and lacks any explanation of mapping.) While SKOS provides standardized labels, useful for porting and linking vocabularies between different systems and the web, ISO 25964-2 provides guidance on the theory and practice of various types of mappings.

ISO 25964-2 defines mapping broadly as the “process of establishing relationships between the concepts of one vocabulary and those of another.” Like SKOS, it also covers different kinds of mapping relationships, although it describes more types: equivalence, compound equivalence, hierarchical, associative, exact, inexact, and partial equivalence. It also discusses mapping on the high level between pairs or multiple vocabularies and in what kind of direction/arrangement. The standard also includes examples. There is really a lot to consider, and I’ll definitely re-read ISO 25964-2 in detail before embarking on my next mapping project.

Software for Taxonomy Mapping

When I first did taxonomy mapping, Excel files of each vocabulary were compared either with the features of Excel or through scripting. Now, mapping can also be done within taxonomy management software, once both vocabularies are in the software, usually requiring that at least one be imported.

As most commercial taxonomy/thesaurus/ontology management software now supports the SKOS standard, such software also supports the SKOS mapping relationships between vocabularies. The leading vendors, PoolParty, Smartlogic, and Synaptica, additionally include an auto-mapping tool that uses “smart” or “fuzzy” match techniques, including some stemming, to automatically match equivalences or near-matches between concepts in two different vocabularies, which can then be manually reviewed and approved or rejected. For the mapping to be done correctly, a taxonomist should perform this review. Automated mapping also takes alternative labels (nonpreferred terms) into consideration and creates a proposed match if an alternative label in one vocabulary matches a preferred label in another.
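The general idea of auto-mapping can be sketched in a few lines of Python. This is only an illustration and not how any of the vendors' tools actually work: it assumes each vocabulary is a simple dictionary of preferred labels to alternative labels, uses the standard library's difflib for fuzzy comparison, and stands in for stemming with a crude plural-stripping rule. The vocabularies, the threshold of 0.85, and the function names are all invented.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase and drop a trailing 's' (a very crude stand-in for stemming)."""
    label = label.lower().strip()
    return label[:-1] if label.endswith("s") and len(label) > 3 else label


def similarity(a, b):
    """Fuzzy similarity between two labels, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def propose_matches(source, target, threshold=0.85):
    """Propose candidate mappings for a taxonomist to review.

    source and target each map a preferred label to a list of alternative labels.
    A pair is compared when at least one of the two labels is a preferred label.
    """
    proposals = []
    for s_pref, s_alts in source.items():
        for t_pref, t_alts in target.items():
            pairs = [(s, t_pref) for s in [s_pref, *s_alts]]
            pairs += [(s_pref, t) for t in t_alts]
            score = max(similarity(s, t) for s, t in pairs)
            if score >= threshold:
                proposals.append((s_pref, t_pref, round(score, 2)))
    return proposals


# Hypothetical vocabularies: preferred label -> alternative labels.
source = {"Automobiles": ["Cars"], "Physicians": ["Doctors"]}
target = {"Cars": [], "Doctor": ["MD"], "Nurses": []}

for s_pref, t_pref, score in propose_matches(source, target):
    print(f"{s_pref} -> {t_pref} (score {score}): review and approve or reject")
```

The output is only a list of proposals. Approving or rejecting each one, and deciding which mapping relationship type it should get, remains the taxonomist's job.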

The software’s mapping feature is agnostic to your intentions and direction of mapping, so it’s important to plan the mapping so that it supports mapping in the direction you want. In addition to terms with equivalent meaning, it is also acceptable to map from a narrower to a broader concept as the narrower is an example of the broader and can be used for it, but the mapping won’t work in the other direction. It is also acceptable to map from a term that is a preferred label to a concept where that term is an alternative label/nonpreferred term, and that mapping also won’t work in the other direction.
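Why direction matters can be shown with a small Python sketch of one-way mapping from a tagging taxonomy to a retrieval taxonomy. The concept names, document IDs, and the retrieve function are all hypothetical.

```python
# One-way mappings: tagging-taxonomy concept -> retrieval-taxonomy concept.
MAPPINGS = {
    "Sedans": "Cars",       # narrower -> broader: acceptable
    "Automobiles": "Cars",  # equivalent meaning
}

# Content tagged with concepts from the tagging taxonomy.
TAGGED_CONTENT = {
    "doc1": {"Sedans"},
    "doc2": {"Automobiles"},
    "doc3": {"Trucks"},
}


def retrieve(retrieval_concept):
    """Return IDs of documents whose tags map to the given retrieval concept."""
    sources = {s for s, t in MAPPINGS.items() if t == retrieval_concept}
    return sorted(doc for doc, tags in TAGGED_CONTENT.items() if tags & sources)


print(retrieve("Cars"))
```

A search on "Cars" correctly brings back the document tagged "Sedans," since a sedan is a kind of car. Had the mapping been reversed, from "Cars" to "Sedans," a search on "Sedans" would return everything tagged "Cars," much of which is not about sedans.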

If planning your mapping project seems daunting, the software vendors, PoolParty, Smartlogic, Synaptica, and Access Innovations (vendor of Data Harmony Thesaurus Master) will provide assistance or the full service of mapping. In fact, Access Innovations has not included an auto-mapping feature in DH Thesaurus Master, because customized results may be better achieved through offline mapping.

Mapping is not just between taxonomies, but can be between taxonomies and thesauri, thesauri and ontologies, or other controlled vocabularies, something else that ISO 25964-2 covers. If you need assistance with mapping, I'd be happy to help.

Friday, July 19, 2019

Onsite Corporate Taxonomy Training

I enjoy teaching about taxonomies. The feedback I get from my students or workshop participants helps me improve my methods of communication, teaching, and consulting, and I learn about the varied implementations of taxonomies. The courses evolve and improve over time. I teach online courses, conference workshops, and corporate onsite workshops. I’ve been making enhancements to the last of these offerings and this week led a two-day onsite workshop at a major company on the West Coast.

Heather Hedden leading an onsite corporate training workshop in taxonomy design and creation.

Accommodating a varied audience

The participants in my “introductory” workshops, whether at conferences or at their corporate offices, have varied knowledge and experience with taxonomies. Some are complete beginners and are curious to learn about taxonomies and what they can do. Others have been tasked to build a taxonomy with little instruction and are looking for best practices and guidelines. Some have read my book but have not had the opportunity to put what they have read into practice, so the workshop’s exercises are very helpful. Finally, some participants are experienced taxonomists seeking to fill in the gaps in their knowledge.

The absolute beginners may feel overwhelmed at the amount of information on taxonomies presented in one of my workshops, but I feel it’s important to provide enough instruction to enable people to actually create basic taxonomies (while ideally still getting feedback from someone more experienced). Also, I expect people to combine instruction from my workshop with other methods of learning taxonomies, such as reading my book, taking my online course, attending conference sessions on taxonomies, or getting advice from a taxonomist in their organization. While I would like to offer more advanced workshops, it’s difficult to find enough experienced practicing taxonomists at the same location. (At a conference it is possible, but sometimes conference organizers equate advanced taxonomy topics with ontologies.)

Interactive exercises

Workshop participants doing a card-sorting exercise

Participants like interactive or hands-on exercises. One of the learning benefits of my onsite workshops is that they include interactive exercises that involve the entire group or class. My online course includes exercises or assignments, so that participants learn from the practice and from the feedback I provide, but only the onsite workshops offer the opportunity to work on assignments with others and thus learn from others. Creating taxonomies, like designing websites or software user interfaces, needs to consider different views and is somewhat subjective. The classroom setting brings those different views together.

Small-group exercises are the best for this kind of learning. My full-length workshops include small-group exercises for designing a set of facets and for doing a card-sorting exercise to categorize topics. Groups may comprise from three to six participants, depending on the total number. In addition to hearing ideas from their group members, participants then share the resulting taxonomy outline with the larger class, and I provide comments. Even exercises that do not involve small groups, but are assignments to consider and shout out answers, are beneficial, because we obtain, discuss, and evaluate various answers beyond the answers that any one individual might consider.

Remote participation is also possible, especially if the remote participants are co-located in the same office. They can form their own small group for the small group exercises, and they can do the card-sorting exercise online. This was the case in my latest corporate workshop.

Customizing corporate workshops

Heather Hedden leading a corporate onsite training workshop in taxonomy design and creation

To what extent I should customize the workshops for a specific organization was a question when I first offered corporate workshops. It’s not necessary, nor worth the time, to customize every example of taxonomy terms in the workshop presentation with something from the client’s domain of content. Rather, I found that it is sufficient yet instructive to customize just a few slides, such as those with examples of content types and use cases.

Another way I customize the workshops is by the outline and topics included. While all workshops include the basics (taxonomy types, definitions, uses and benefits, standards, structural design, best practices for creating terms and relationships, and governance), optional topics include: user interface display options, metadata and taxonomies, testing taxonomies, tagging, mapping taxonomies, multilingual taxonomies, integration with search, and taxonomy management software.

Finally, I customize the group exercises so that the choices for topics for facets would be applicable, and the card-sorting exercise may take an actual example, especially if the client has a public taxonomy I can use as a basis for the exercise. I also include discussion questions, so that the participants can share and discuss the taxonomy issues as pertinent to their organization. In any case, I sign an NDA, so the client can comfortably share information with me which I may use in the workshop.

Continuous improvement

I found that by asking the client for some input on possible customization, I can also generalize the issues to enhance the workshop presentation for future use. In other words, the client input on “customization” is not always that, but rather leads to a general improvement. The result has been to make the workshop presentation based more on real-world scenarios and less theoretical than my previous conference presentations. I actually did not consider my conference presentations to be that theoretical in the first place (since, after all, my knowledge of taxonomies is based on my work experience, not on studies for a degree in library/information science). But now I have made the workshops even more practical.

Input from the client can also lead to topics for clarification, such as differing use of terminology. For example, a client wanted me to discuss taxonomy “mapping,” which we taxonomists understand to mean the creation of equivalence links between terms in one taxonomy and another, so that one taxonomy may be used to retrieve content that was tagged in the other taxonomy. However, what my client meant by “mapping” was a kind of “see also” related-term relationships between terms in two different taxonomies. Now I know to clarify and discuss both kinds of links between taxonomies.

Just as I am an accidental taxonomist and then an accidental consultant, so am I now also an accidental trainer. Details of my corporate training offerings are on my website.

Sunday, June 30, 2019

Taxonomy Sessions at the 2019 SLA Conference

SLA (Special Libraries Association) offered a good number of taxonomy-related sessions at this year’s annual conference, held June 14-18 in Cleveland, Ohio, thanks to the organizing efforts of its Taxonomy Division. There were enough taxonomy sessions so that there was always at least one session of interest at any time.

SLA is a membership association of librarians and information professionals, particularly involved in “special” libraries or information services. Special libraries include corporate, specialized academic, government, military, law, medical, business, and nonprofit. I’m not a librarian (I’m an accidental taxonomist), so I didn't become a member of this professional organization until a Taxonomy Division was created 10 years ago. I’ve attended and presented at some, but not all, of the SLA conferences in the past 10 years, as the taxonomy-related offerings vary, and presentation topics are usually the choice of the Taxonomy Division conference planning committee. This year, for the first time, I presented not one but two sessions: I taught the full-day preconference “continuing education” workshop and co-presented a session on taxonomy management software.

I was very pleased that there was such a rich program of other taxonomy sessions this year, especially compared to last year, thanks to Taxonomy Division conference program chair Janice Keeler and program committee members Edee Edwards and Margaret Nunez. There were also two sessions on knowledge management, which I found very interesting. The taxonomy-related sessions were:

"Introduction to Taxonomy Design & Creation" (full-day preconference workshop)
"Ensuring Semantic Interoperability and Creating Interoperable Taxonomies" (90 minutes, 2 speakers)
"Taxonomy Governance in Real Life" (90 minutes, panel discussion, 2 speakers and moderator)
"Taxonomy Roundtable" (90 minutes, three roundtables of participant discussions)
"Big Data and Controlled Vocabularies" (30 minutes, 2 speakers)
"Taxonomy Basics" (30 minutes)
"Keeping your Taxonomy Fresh and Relevant" (60 minutes, 2 speakers)
"Taxonomy-Ontology Conversions: Case Studies" (75 minutes, 3 speakers)
"Taxonomy Tools and Tool Evaluation" (60 minutes, 2 speakers)

“Ensuring Semantic Interoperability & Creating Interoperable Taxonomies” was a densely packed presentation covering the different types of, and issues in, controlled vocabulary interoperability, with two presenters in turn: Margie Hlava, President and Chairman, Access Innovations Inc., and Marcia Zeng, Professor of Library and Information Science, Kent State University. Marcia explained that interoperability is at different levels: system level, semantic level, and structural level, and her focus was on semantic interoperability, which is addressed in the international standard ISO 25964-2. She discussed in detail each of the different kinds of controlled vocabulary interoperability. There are methods that are based on working from an existing knowledge organization system: derivation from an original source and expansion; and there are methods that involve working between/among existing vocabularies: integration/combination and interoperation/shared/harmonization.

“Taxonomy Governance in Real Life” featured two panelists from very different organizations: Paula McCoy, manager from ProQuest, and Susannah Woodbury, taxonomist from Overstock. Taxonomy governance was defined as maintaining the content of a controlled vocabulary (adds/deletes), maintaining the integrity of a vocabulary (standards and usage), and implementing a vocabulary (managing those who work in and use the vocabulary). Topics of discussion included working with stakeholders, the governance of change and how decisions are made, and staying flexible through the iterative process.

The “Taxonomy Roundtable” was a purely discussion-based session, whereby attendees divided into three groups of about 6-7, and each group got to discuss three of the four predefined topics in turn: taxonomy ROI, adding taxonomy to the workflow, implementing taxonomies in search, and taxonomies in user interface design. These topics were chosen based on a Taxonomy Division survey of members’ interests. Each table then reported the outcomes of their discussions to the larger group.

“Big Data and Controlled Vocabularies” was presented by Camille Matthew, Information Science Specialist, NASA Jet Propulsion Laboratory. Camille explained what big data is: an accumulation of data that is too large and complex for processing by traditional database management tools. Big data is big by 5 Vs: volume, velocity, variety, veracity (varied dates or outdated data, for example), and value. The issue is combining “stuff” (big data) and “strategy.” Strategies include controlled vocabulary, taxonomy, ontology, and metadata standards. Data structure cannot be assumed, so we design for unstructured content.

“Taxonomy Basics” was presented by Heather Kotula, Vice President of Marketing and Communications, Access Innovations Inc. This quick session was aimed at those new to taxonomies. It comprised definitions of taxonomies and other types of controlled vocabularies and also included quite a bit of history of the field of classification and naming.

“Keeping your Taxonomy Fresh and Relevant: The APA Thesaurus” was presented by Marisa Hughes, Taxonomist, American Psychological Association (APA). Marisa had recently led a thorough thesaurus update that took about a year and was completed in February 2019, and this presentation was largely based on lessons learned from that project. Change management is a key part of taxonomy governance. Change is constant. Creating a responsive and relevant taxonomy involves a set of activities: adapt, determine, engage, delineate, data, identify. One needs to know when and why to change, and a roadmap is also needed.

“Taxonomy-Ontology Conversions: Case Studies” comprised three case-study presenters: Edee Edwards, Ontology Architect at the National Fire Protection Association (NFPA); Mary Chitty, Library Director & Taxonomist at Cambridge HealthTech; and David Bender, Manager, Medical Ontology, Radiological Society of North America. The genesis of this session came out of a conversation at the SLA conference the previous year, when someone from LexisNexis asked how you build your ontologies: from taxonomies or from data. It was anticipated that the case studies would all be conversions from taxonomies to ontologies, but that was not necessarily so.

Edee Edwards explained that the NFPA was building a data science team, which was very interested in ontologies. There was also a data governance group involved. At that time, they also needed a system upgrade of vocabulary software, and the new one is SKOS-based. They did a proof of concept with the data science group. NFPA’s primary use for the ontology was auto-tagging.

Mary Chitty's presentation “Preparing your taxonomy to be ready for data scientists & machine readability” presented the case of taxonomies at Cambridge HealthTech. It is still a taxonomy, not an ontology, but Cambridge HealthTech has recently partnered with OntoForce, a semantic search and data science company, to use their search engine. The presentation was more about the issues than any solution. Ongoing challenges include dealing with legacy data, integrating acquired companies’ data, scaling up, and dealing with ambiguity.

David Bender’s presentation “The Big Maybe: Should You Convert Your Taxonomy to an Ontology?” presented the example of the controlled vocabulary of the Radiological Society of North America (RSNA), RadLex. RadLex, a model of anatomical procedure and modality as it pertains to radiology, is referred to as a lexicon or terminology, although it is arranged as a hierarchical tree/taxonomy. It was decided to have a structure, as an ontology, but there are still more questions than answers. In moving in the direction of ontology, they are using the tool Protégé and have converted RadLex into an OWL form, but otherwise it is still kept as a taxonomy.

Heather Hedden presenting on taxonomy tools at the SLA 2019 conference

I presented the full-day preconference workshop "Introduction to Taxonomy Design & Creation" and presented on taxonomy management software in the session "Taxonomy Tools and Tool Evaluation," while my co-presenter, Marti Heyman of OCLC, presented on how to evaluate taxonomy tools.

SLA Taxonomy Division members can read detailed reports of each of these taxonomy sessions in the next issue of the Division’s “Taxonomy Times” newsletter.

SLA, an international organization, holds its annual conference in June in different cities in North America. It will next be in Charlotte, North Carolina, June 6-9, 2020.