Sunday, December 11, 2016

Use Cases for Taxonomy Development

Developing use cases in the initial design of a taxonomy is something I did not learn about until I went into consulting, but it is a useful approach to taxonomy and metadata design in any circumstance, regardless of the involvement of an external taxonomy consultant.
The use case technique comes from the field of systems analysis, and especially software and systems engineering, but use cases are increasingly applied in the development of systems and structures for knowledge management, content management, information management, etc. Typically use cases for a taxonomy are not limited to the taxonomy alone but are for the design of all metadata and the broader information or knowledge management system. “System” means the combination of software, content, metadata/taxonomies, and users.

What is a use case?

A use case describes a scenario of how a user uses a system to accomplish a particular goal. A use case should not be confused with a case study. It need not be long or detailed, although use cases vary in their descriptive length. All use cases include:
  1. A designated user type and role (sometimes called “actor”), which could be as simple as an internal organization job title. Examples of external users could be designated as: undergraduate college student, paralegal, pharmaceutical corporate librarian, experienced online shopper, etc.
  2. A task that the user is engaged in which uses the system. This will likely be described in more detail than the description of the user. Taxonomy use cases would typically involve a specific aspect of one of the following tasks: indexing/tagging, using search to find information, using browse to find information, discovering/exploring for related information, and finding/retrieving certain content items.
  3. A goal and perhaps ultimate purpose of the user’s task.
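
The three parts listed above can be captured as a simple record. The following is only a minimal sketch in Python; the class name, field names, and the summary sentence pattern are my own assumptions for illustration, not part of any standard use case template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """Hypothetical record for one taxonomy use case (illustrative names)."""
    user: str  # designated user type or role (the "actor")
    task: str  # what the user does with the system
    goal: str  # the goal or ultimate purpose behind the task

    def summary(self) -> str:
        # Join the three parts into one readable sentence.
        return f"{self.user} wants to {self.task} in order to {self.goal}."


# Example based on the online shopper scenario; the wording is illustrative.
shopper = UseCase(
    user="An experienced online shopper",
    task="filter search results by price, color, and reviews",
    goal="purchase carry-on luggage",
)
print(shopper.summary())
```

Keeping the three parts as separate fields makes it easy to see at a glance when a draft use case is missing its user, task, or goal.
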
I once participated in a consulting project in which the stakeholders were advised to create use cases that went so far as identifying fictitious personas, a practice that is common in marketing planning. I don’t think it’s necessary to go that far in taxonomy use case development, although it might be useful if there are users of the taxonomy who are external customers/clients.

Why create use cases for taxonomies and other metadata?

The task of developing taxonomies and other metadata can benefit from use cases in particular ways:
  • It grounds the taxonomy in reality, ensuring that it is designed to be usable, rather than being an academic taxonomy on a subject domain.
  • It engages the users and other stakeholders in the taxonomy development process, who then become more interested in supporting/promoting or using the taxonomy, especially when the taxonomy serves their user needs and solves their problems.
  • It provides sample situations which can then be utilized for testing the draft taxonomy before the taxonomy and content are fully implemented in the system. As a taxonomist who has led taxonomy testing activities among sample users, I have personally found use cases to be valuable for this purpose.

What are examples of use cases for taxonomies and other metadata?

The following brief fictitious use case examples are of the kind that could be used for taxonomy development.

Internal organization use cases:

  • A subject-matter-expert author who is required to tag authored documents with subject categories so that users can find documents by subject.
  • A digital asset manager in an advertising agency, who needs to ensure that image files are assigned the proper copyright information.
  • A content manager at a publishing company who, as a major responsibility, needs to assign full metadata to XML file content for various downstream purposes to assemble digital content products.
  • A marketing copywriter seeking an expert on a specific subject among a company’s employees to give feedback on the accuracy of a blog post the copywriter is writing and who is inclined to browse subjects if available. 
  • A manager who wants to find historical information on a product offered in order to prepare a presentation about the product.
  • A digital marketer who needs to update the public website with seasonal images that were not used last year (but two years ago is OK).
External/customer use cases:
  • An undergraduate student who uses the default search to look for information on the events leading up to the fall of the Berlin Wall for a history class paper.
  • An experienced online shopper who is searching to purchase carry-on luggage and wants to filter results by price, color, and positive reviews.
  • A corporate librarian conducting competitive intelligence research on market strategies of leading competitor companies in the same industry and who would like to use advanced and/or Boolean searching, if possible.
  • A lawyer specialized in commercial law who needs to find out where and how to file a financing statement in the proper jurisdiction for a client of his who is seeking to secure a loan, but who lacks experience in legal research.
  • A cancer patient searching for an oncologist with a certain type of cancer specialty, acceptance of certain insurance, within a certain geographic region, and with a number of good patient reviews.
  • A compliance officer who needs to find regulations and associated policies and procedures that pertain to various departments and product lines of his employer, who knows the names of statutes but not the titles of associated regulations.

How are taxonomy use cases utilized?

In addition to serving the purposes of engaging stakeholders and ensuring the taxonomy is content- and user-focused, use cases can have additional specific applications, such as:

  • Identifying or validating who all the different types of users are, so that their issues and feedback can be taken into consideration in the future.
  • Suggesting improvements in the user interface design.
  • Developing walk-through scenarios, with specific search criteria or topics of browsing spelled out, for offline testing of the taxonomy usability (including adequate depth and breadth) for both indexing/tagging and retrieval. (Read more at the post "Testing Taxonomies.")
  • Providing scenarios that can be used in other taxonomy/knowledge management project research, such as ROI (return on investment) research.
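
As a rough illustration of the walk-through idea, the search phrases spelled out in each scenario can be checked against the draft taxonomy's terms and their variants before any live testing. This is only a sketch in Python under my own assumptions: the function name, the sample terms, and the exact-match rule are all hypothetical, and real testing would also involve people judging the results.

```python
def find_gaps(taxonomy, scenarios):
    """Return, per scenario, the search phrases with no matching term or variant.

    taxonomy:  dict mapping each preferred term to a list of its variants
    scenarios: dict mapping a scenario name to the phrases a user would try
    """
    # Collect every known label in lowercase for case-insensitive matching.
    known = set()
    for term, variants in taxonomy.items():
        known.add(term.lower())
        known.update(v.lower() for v in variants)

    gaps = {}
    for name, phrases in scenarios.items():
        missing = [p for p in phrases if p.lower() not in known]
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical draft taxonomy and walk-through scenarios.
taxonomy = {
    "Carry-on luggage": ["cabin bags", "hand luggage"],
    "Suitcases": [],
}
scenarios = {
    "online shopper": ["hand luggage", "suitcases"],
    "digital marketer": ["seasonal images"],
}
print(find_gaps(taxonomy, scenarios))
```

A scenario that shows up in the result points to a possible gap in breadth or in variants, which is exactly the kind of finding offline testing is meant to surface.
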

Wednesday, November 30, 2016

Popular Topics in Taxonomies

This month marks the 5th anniversary of The Accidental Taxonomist blog, so it is a fitting time to look back and see which posts were most popular. Following are the top 10 posts with the most visits (pageviews) from the time they were published to date, with the number of visits indicated:

1)  3722  E-Commerce Taxonomies (Nov 26, 2012)
2)  3267  Taxonomy Software Directories (Apr 11, 2014)
4)  2462  Taxonomies vs. Classification (Apr 2, 2013)
5)  1859  Taxonomies vs. Thesauri (Jan 28, 2014)                         
6)  1743  Digital Asset Management and Taxonomies (May 28, 2012)
7)  1725  Information Architecture and Taxonomies (Nov 9, 2013)       
8)  1670  Taxonomy Design for Content Management Systems (May 4, 2016)                     
9)  1621  Taxonomy Governance (Dec 9, 2013)
10) 1448  Topics and Document Types in Taxonomies (May 6, 2013)

The topic of taxonomies for e-commerce has been the most popular blog post since shortly after it was published. This does not necessarily mean that e-commerce is the most common implementation of taxonomies, but it is clearly defined, whereas others, such as enterprise taxonomies, could go by different names, such as business taxonomies, internal taxonomies, organizational taxonomies, intranet taxonomies, etc. Nevertheless, e-commerce is a very significant application of taxonomies. Among my presentations on SlideShare, the presentation on e-commerce taxonomies is also by far the most popular.

Other popular blog post topics on taxonomies tend to be those in combination with other significant topics in the blog title, such as software, ontologies, digital asset management, content management, content management systems, information architecture, and governance. This is not surprising. I am a little more surprised at the popularity of the topics “Taxonomies vs. Classification,” “Taxonomies vs. Thesauri,” and especially “Topics and Document Types in Taxonomies.”

Other posts with high pageview numbers (although not in the top 10) include “Card Sorting and Taxonomies,” “Taxonomies and Content Management,” “Evaluating Taxonomies,” “Faceted Search vs. Faceted Browse,” and “Business Taxonomies.”

Blog posts that were less popular (besides the first two) were ones about taxonomists, and not taxonomies, despite the title of this blog, such as “The Remote Taxonomist” and “Mentoring Taxonomists.” The post on “Multilingual Taxonomies” surprisingly has one of the lowest pageview counts, but I posted it in the blog's first month, November 2011, when the blog was not well known. I would expect it to be found later through searches, though.

Some posts will get high numbers of visits based on their titles, and some will not, such as “Tags and Categories” or “Taxonomies for Multiple Kinds of Users,” even if the topics are of particular interest to taxonomists. It sometimes seems as if I have already posted on all of the leading topics related to taxonomies, and there is not much more to write about. However, there will continue to be interesting topics to write about, but I may simply run out of blog post titles that have high SEO (search engine optimization) value.

Monday, October 31, 2016

Taxonomy Boot Camp London Conference

I was fortunate to attend and present at the inaugural Taxonomy Boot Camp London conference earlier this month. After 11 successful years in the United States (initially in New York in 2005, then for four years in San Jose, CA, and six years in Washington, DC), Taxonomy Boot Camp held its first overseas conference at the Olympia Conference Centre in London, October 17-18, 2016. Although taxonomy-related topics are presented at many other conferences, Taxonomy Boot Camp remains the only conference dedicated to taxonomies.

Conference Format 

The conference was very similar and comparable to Taxonomy Boot Camp in the U.S., with respect to scope and range of topics covered, level of detail, and quality. The only difference was some more UK examples/case studies, rather than US examples/case studies. Sessions were also a similar mix of general topics and case studies. The format was also similar, but not identical. Whereas the U.S. conference has, in recent years, two tracks on the first day and a combined track the second day, Taxonomy Boot Camp London maintained two tracks on both days, except for the keynotes and one plenary session. As a result, I had to make more decisions about which sessions to attend. The number of speakers is about the same at both conferences, so by holding more concurrent sessions, Taxonomy Boot Camp London had slightly longer sessions per speaker on average. At Taxonomy Boot Camp in the U.S., an individual speaker may speak for only 15-20 minutes in many sessions. A half-day afternoon pre-conference workshop on “Taxonomy Fundamentals” was also part of Taxonomy Boot Camp London, whereas Taxonomy Boot Camp in the U.S. has not had half-day pre-conference workshops, shared with KM World, since 2009, as now the conference starts on the Monday of the KM World pre-conference workshops. Instead, Taxonomy Boot Camp in the U.S. has a 1.5-hour taxonomy basics session on the first day, concurrent with other sessions.

Attendance was strong for a first-time specialized conference, with 173 attendees (including 42 speakers). While this was not as many attendees as Taxonomy Boot Camp in Washington, DC, which has about 200, it was more than the U.S. Taxonomy Boot Camp conference had in its earlier years. There was, as expected, greater international participation from throughout Europe. There were probably slightly more attendees whose interest in taxonomies is for internal organization information management, rather than for published content, whether corporate, nongovernmental organization, or government agency. While there were some publishers, there was a noticeable lack of those involved in e-commerce. I led the half-day pre-conference workshop and received the list of its 37 attendees and their affiliations, and I assume they are a representative sample of the conference attendees.

As with Taxonomy Boot Camp in the U.S., the conference is not held by itself, but is co-located with another conference by the same organizer, Information Today Inc. Whereas in the U.S. Taxonomy Boot Camp is currently co-located with KM World, Enterprise Search & Discovery, and SharePoint Symposium, Taxonomy Boot Camp London was co-located with Internet Librarian International (ILI), which has been taking place in London every October since 2008. Taxonomy Boot Camp London and ILI (which now has the tagline “The Library Innovation Conference”) are not as integrated as Taxonomy Boot Camp and KM World are. The attendees were more distinct in their professions and interests. Whereas in the U.S. attendees may register for a “platinum” pass which allows access to any of the co-located conference sessions, in London the registrations for the two conferences were distinct. There were no shared keynotes, and meals and breaks were in slightly different areas. Taxonomy Boot Camp attendees had access to the ILI sponsor booths, but ILI attendees did not have access to the three Taxonomy Boot Camp sponsor booths, which were located within one of the session rooms. I imagine this might change in the future, if the number of Taxonomy Boot Camp London sponsors grows. On the other hand, the relatively contained nature of Taxonomy Boot Camp London made it an excellent networking opportunity.

Taxonomy Boot Camp London also had an association partner, the International Society for Knowledge Organization (ISKO), whose UK chapter is very active. Its chapter for Canada and the United States is not so active. Its membership also tends to be more academic, with variations by chapter, but its vice president, Stella Dextre Clarke, who gave a brief presentation, said that the organization hoped to broaden its membership more beyond academia.

Specific Sessions

The two keynotes, one on each morning, were both excellent and relevant to the audience. Mike Atherton, a content strategist at Facebook and formerly an information architect at the BBC for its websites, spoke on “Designing Future-Friendly Content” as the opening keynote. He presented a case study of designing the website for the IA Summit conference, which is redone every year. Some of his key points were: Agree to the strategy, argue the tactics, stand up for taxonomy for information architecture, and be a teacher.

Patrick Lambe, partner of the knowledge management consultancy Straits Knowledge, and a frequent speaker at Taxonomy Boot Camp in the U.S., presented the second day’s keynote: “Gathering evidence for a taxonomy – knowledge mapping or content modelling.” He spoke of the key issues/decision points as: purpose, constraints, principles, and scope. He said that subject matter experts should only be engaged for feedback on specific questions at the end of a taxonomy project. Design is based on evidence and desired outcomes. Warrant is the evidence behind the design and includes content warrant, user warrant, and standards warrant. There are different approaches for building different kinds of taxonomies. For building an internal/enterprise taxonomy, Patrick recommends undertaking knowledge auditing and knowledge mapping, mapping both activities and assets. For building an external-use taxonomy, or one with both internal and external sources and scope, knowledge mapping does not work. Rather, content modeling is done with use case scenarios and just a sampling of content.

Other informative sessions of note included “How to fast-track taxonomy projects using linked data” by Dave Clarke, CEO of Synaptica. He explained the difference between linked open data and linked enterprise data (behind the firewall), noting that both have their uses and benefits. Mapping to linked open data resources can be done for semantic enrichment, pulling information from outside into an organizational system.

Ben Licciardi, Manager (consultant) at PwC, presented "Taxonomies and the systems in which they reside: Is the technology-agnostic approach right for you?" He presented the benefits of both scenarios. Developing a technology-agnostic taxonomy, in addition to enabling the taxonomy to be used in different systems, also gets you thinking outside the box and helps future-proof the taxonomy. A system-focused taxonomy, on the other hand, keeps you grounded in reality, is designed for the customers, and is budget-conscious.

A panel comprising two consultants and a user experience architect spoke in a session titled “Working within multi-disciplinary teams - taxonomist tales from the trenches.” Among other things, they discussed that more people want to be involved on the team of developing taxonomies, and more people should be talked to, including scrum masters, QA team, tech leaders, user experience people, software people, content strategists, product managers, business analysts, data modelers, enterprise architects, etc.

Congratulations to conference chair Helen Lippell for a successful conference. The date for the next Taxonomy Boot Camp London conference has already been set for October 17-18, 2017, at the same venue.

Friday, September 30, 2016

Directories and Databases of Published Controlled Vocabularies

A source of published controlled vocabularies (taxonomies, thesauri, ontologies, etc.) can be useful for different purposes. Sometimes, finding a vocabulary to license and reuse is the objective, whereas in other cases finding a vocabulary to consult as a source for confirming individual terms and relationships is the goal. Thus, different kinds of directories or databases of controlled vocabularies may be of interest.

In some cases, an individual or organization has a project involving a set of content that would benefit from controlled vocabulary tagging to make it findable/retrievable/discoverable, but lacks the time or resources to build a taxonomy from scratch. Licensing an existing controlled vocabulary may seem like a preferable option. This can be a reasonable solution, depending on the content and scope of the controlled vocabulary in question. In many cases, what is desired is the use of an existing controlled vocabulary as a starting point that can then be edited and expanded to customize it for a specific use. Either case involves the licensing of a controlled vocabulary.

Taxonomists who build taxonomies from scratch or edit proprietary taxonomies like to consult available controlled vocabularies on the same subject to help determine the ideal wording of a term, the inclusion of additional synonyms, and the relationship of the term to others. The results may vary in different controlled vocabularies because they serve different purposes and audiences, but taxonomists know to take that into consideration. In these cases, licensing an entire controlled vocabulary is not needed. Simply viewing a controlled vocabulary and its term relationships is adequate.

If you are interested in licensing a complete controlled vocabulary, you will need to consider both commercial/proprietary controlled vocabularies that require a fee for a license, and public/open source controlled vocabularies that are available for free. Some collections comprise only open source vocabularies. While free is nice, the free license may carry a restriction of no commercial use and/or no modifications in use. So, for commercial or for modified reuse, make sure you consult a controlled vocabulary directory that includes proprietary controlled vocabularies.

If you are interested in merely looking up terms in a controlled vocabulary, it is the public, not proprietary, controlled vocabularies that are fully accessible. Therefore, it’s more convenient to consult a database/directory of controlled vocabularies that includes only public vocabularies and preferably either hosts those vocabularies or directly links to the browsable/searchable vocabulary, rather than simply redirecting to the controlled vocabulary publisher’s website, where you may have to hunt around to find access to the controlled vocabulary, if it is even accessible at all.

Comprehensive database-directories of controlled vocabularies


Comprehensive directories are large, listing hundreds of controlled vocabularies, so they are managed as databases, with database records for each referenced controlled vocabulary and search filters, such as vocabulary name, publisher’s name, and subject. There are database record pages with more details for each named controlled vocabulary, including a link to the publisher’s website. The link may or may not be to a navigable controlled vocabulary. The most comprehensive such databases are the following two:

Basel Register of Thesauri, Ontologies & Classifications (BARTOC)
BARTOC, launched in 2013, is a comprehensive database/registry of controlled vocabularies (“knowledge organization systems”) created and managed by the Basel University Library (Switzerland). The database currently lists 1,948 vocabularies of all kinds, in all languages, in all subject areas, and in various publication formats. The database of vocabularies is hosted on Drupal, and advanced search filters comprise the top 10 Dewey Decimal Classification categories, 568 hierarchical Topics, Language, Location, and Type (categorization scheme, classification scheme, dictionary, gazetteer, glossary, list, name authority list, ontology, semantic network, subject heading scheme, synonym ring, taxonomy, terminology, and thesaurus). Links are to the publisher site and sometimes directly to a navigable thesaurus. Despite its comprehensiveness, BARTOC only has a few commercial, proprietary controlled vocabularies.

Taxonomy Warehouse
Taxonomy Warehouse, launched in 1999, is a comprehensive database of varied controlled vocabularies created and managed by the thesaurus management software vendor Synaptica LLC. The database currently lists 763 vocabularies of all kinds in all subject areas from 330 organizations in various publication formats. The database of vocabularies is hosted on the Synaptica software platform, and it can be searched or alphabetically browsed (one page per letter of the alphabet), or browsed by 225 hierarchical subject categories. Unlike BARTOC, vocabularies included are only in English or multilingual including English. Taxonomy Warehouse is much more comprehensive than BARTOC in its inclusion of entries for proprietary controlled vocabularies, such as those of Gale and WAND, which may be licensed for a fee from their publishers but then do not preclude modification or commercial reuse. While a number of links are dead, an overhaul update is planned in coming months.

Hosted vocabulary registries


A hosted “repository” of vocabularies can be useful, because all the vocabularies are navigable through the same user interface on the same site. You can even search for a vocabulary term across multiple controlled vocabularies at once. As publicly accessible vocabularies, many of these can also be downloaded from the site for noncommercial use. This type of database exists mostly for ontologies, because they conform to Semantic Web standards for exchange of information over the web and thus don’t require a lot of data conversion to be hosted, but Linked Data SKOS vocabulary collections are starting to appear. (Note that ontologies are structured and displayed slightly differently than taxonomies or thesauri, so they may not be as useful as reference sources for editing taxonomies or thesauri.) Publicly accessible ontologies tend to be in the biomedical sciences, so the subject area is also more limited and the ontology databases are aimed at subject matter experts. Vocabulary repositories of this kind include the following, among many others:

Research Vocabularies Australia
Research Vocabularies Australia is a controlled vocabulary "discovery service" of the Australian National Data Service (ANDS), launched in September 2015. It currently comprises 74 vocabularies, mostly in the sciences, and is intended to grow. About half of the vocabularies are hosted on the ANDS website, and their hierarchies can be browsed and terms can be searched upon in a common user interface. These are Linked Data SKOS vocabularies, not ontologies, and include taxonomies, thesauri, and simple term lists. Vocabulary publishers comprise 33 governmental and nongovernmental organizations, Australian and other. The collection of vocabularies can be searched and can be filtered by Subject, Publisher, Language, and License. Although not as large as ontology-only repositories, Research Vocabularies Australia is a significantly large repository of easy-to-access controlled vocabularies all in one place, and thus is a good source for researching terms or for downloading noncommercial-use vocabularies.

Bioportal is a biomedical ontology repository service of the National Center for Biomedical Ontology (NCBO) comprising 516 ontologies, many of which can be downloaded directly from the site. The vocabularies can be searched or browsed, with search filters including controlled fields for Category, Group, and Format. Filters for sorting the list of ontologies are by Popular, Size, Projects, Notes, and Upload date. One can also search for a class (term) within multiple ontologies. A great deal of metadata and summary information is provided for each vocabulary, including history of uploads, a graph of downloads, and a table of metrics, which includes the number of classes, individuals, properties, maximum depth, children, etc.

Ontobee, hosted by the He Group (of Dr. Yongqun “Oliver” He) of the University of Michigan Medical School, provides a sortable tabular list of 181 biomedical ontologies, which can each be individually searched and browsed directly on the Ontobee website. Furthermore, terms can be searched in the Ontobee linked data server across all 181 ontologies. The ontologies (with the OWL file extension) can be downloaded, and lists of terms (more useful references for taxonomists) can be downloaded as Excel or text files.

Vocabularies listed on educational or professional organization sites


Some organizations list a sampling of vocabularies in all subject areas to serve as educational examples of different kinds of vocabularies, aimed more at students and professionals in the area of library and information science than for subject matter experts. These vocabulary collections tend to include only vocabularies that can be accessed and navigated on a public website, so they are a good source when researching individual terms. Examples of vocabulary collections of this type include those on the following sites:

American Society for Indexing
“Online Thesauri and Authority Files” is a webpage alphabetical list of about 25 vocabularies, mostly thesauri, in varied subjects with links directly to the browsable vocabularies. While the number of vocabularies is not large, it is maintained, and it is a practical resource for looking up terms in varied thesauri. They are meant to be examples for professional indexers who are also interested in working in thesaurus construction.

Charles Sturt University - School of Information Studies
“Information Organisation Vocabularies” is a webpage under the section “Links and Resources for Students” on this Australian university site. It comprises an alphabetical, sortable table of 328 vocabularies, although there is no explanatory text. The list can be sorted by column headers: Name, Author, Year, Publisher, and Keywords (uncontrolled). There is also a filter-search feature which aids in finding a desired subset of vocabularies. The links to the vocabularies link to the navigable vocabulary on the publisher site or, in the case of a few older vocabularies, to a PDF print thesaurus. Although there are a number of dead links and nothing has been added since 2014, the number of correct links directly into navigable vocabularies is significantly large, so this is a useful resource.

Vocabularies listed on software vendor sites

Some thesaurus/ontology management software vendors provide a sampling of vocabularies in various subject areas created in their tools, aimed at users or potential users of the tool. These vocabularies tend to be directly browsable, but the qualities of the vocabularies may be inconsistent, so care should be taken in using them as an authoritative source. Vocabulary collections of this type include those created in the following software tools:

PoolParty (The Semantic Web Company)
Has an alphabetical list of about 30 web-browsable Linked Data vocabularies, most of which are in English and almost all of which are hosted by The Semantic Web Company. Some are very small and were built by The Semantic Web Company staff as examples, and some are public thesauri that were imported into PoolParty. In addition to being browsed, almost all of the thesauri can be downloaded, too.

TemaTres
Has a tabular list of "known cases" of over 400 vocabularies managed in TemaTres, some hosted on the TemaTres site and some linking to vocabularies on the owner's server. Table columns are for title, scope (either the number of terms or a description), language, and URL. The vocabularies are in all languages, with a slightly higher proportion in Spanish because TemaTres is developed in Argentina; only a minority are in English. The search feature has limitations due to the inconsistent use of scope descriptions and the fact that titles and descriptions are in different languages.

MultiTes
Has a small sampling of 10 web-browsable thesauri in varied subject areas, of which 7 are in English. Some are hosted on the MultiTes site and some are hosted on the thesaurus publisher sites.

VocBench
Has links on its VocBench “community” page to about a dozen national and international organizations and two higher education institutions with VocBench-created vocabularies. In some cases the links are to the browsable thesauri, but in other cases the links are just to the organization websites, and the thesauri, if available, are not so easily found.

Protégé
Has a wiki page that lists and links to websites of ontology publishers in three categories: 80 OWL ontologies, 19 Frame-based ontologies (those ontologies that were developed using the Protégé-Frames editor), and 8 in other ontology formats. Some of the links are dead, some are to the websites of the ontology owners, and some are directly to the XML file. Since the links are not to the navigable ontology in a browser, this list of ontologies is not useful as a source for checking terms, but it is a good source for downloading ontologies, if you have the right software to read them.

Friday, August 26, 2016

Synonyms, Alternate Labels, and Nonpreferred Terms

"Synonyms, Alternate Labels, and Nonpreferred Terms" is the title of my next conference presentation in October and in a different, briefer co-presented format as "How Many Synonyms Should You Have?" in November. So, now would be a good time to explore the topic in this blog.

"Synonyms" is the simple, nonexpert designation for the different names for the same term or concept in a taxonomy or other kind of controlled vocabulary. This is an over-simplification, for what may be involved is far more than just synonyms. Synonyms are words with the same meaning, but a taxonomy comprises terms that are typically phrases, often of two or three words, not just words. Furthermore, synonyms by definition have identical meaning, but in a taxonomy, we can have multiple names for a concept that are merely "close enough" in meaning to function as desired.

"Alternate labels" is a much better designation and is the nomenclature adopted by SKOS-compliant vocabularies. SKOS, which stands for Simple Knowledge Organization System, is a recommended standard of the World Wide Web Consortium for the application of the RDF (Resource Description Framework) interoperability format. Alternate labels refer to "concepts" which are known by their "preferred labels." You could certainly use the designation of "alternate labels" even if the controlled vocabulary or taxonomy is not SKOS compliant, and I have sometimes seen that done.

"Nonpreferred terms" is the nomenclature of the thesaurus standard described in either ANSI/NISO Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies or ISO 25964 Thesauri and interoperability with other vocabularies, Part 1: Thesauri for Information Retrieval. Trained taxonomists, especially those with a library science background, are most familiar and comfortable with this designation, but its meaning is obviously not as intuitive to non-taxonomists.

Although the aforementioned three designations are the most common, there are others out there. I have run into the use of the following: Aliases, Alternate terms, Cross-references, Entry terms, Equivalent terms, Keywords, Nondescriptors, Non-postable terms, NPTs, See references, Use for terms, Use references, Used for terms, and Variants.

Taxonomy/thesaurus/ontology management software that supports the SKOS standard will typically use the SKOS designation of "alternate label," and software that supports the thesaurus standards will typically use the language of "nonpreferred term." As for software that supports both standards, which is becoming increasingly common, "alternate labels" or "alternate terms" is more common than "nonpreferred terms", and other designations might be used, such as "variants." So, for want of an unambiguous single-word designation, I will refer to these as "variants" for the remainder of this post.

Techniques for creating variants


Since synonyms are for single words, and most taxonomy terms are multi-word phrases, a common technique is to substitute a synonym for one word of a multi-word phrase. For example, Movie reviews and Film reviews.
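The substitution technique above can be sketched in a few lines of Python. This is only an illustration, not a feature of any particular taxonomy tool: the function name substitute_variants and the small word-level synonym table are my own assumptions, and any generated candidates would still need a taxonomist's review.

```python
def substitute_variants(term, word_synonyms):
    """Return candidate variants made by swapping one word of a term for a synonym.

    word_synonyms is a hypothetical lookup of single-word synonyms, keyed in lowercase.
    """
    words = term.split()
    candidates = []
    for i, word in enumerate(words):
        for synonym in word_synonyms.get(word.lower(), []):
            swapped = words[:i] + [synonym] + words[i + 1:]
            candidate = " ".join(swapped)
            # Keep the initial capital that taxonomy terms usually have.
            candidates.append(candidate[0].upper() + candidate[1:])
    return candidates


# Illustrative synonym table containing only the example from the text.
WORD_SYNONYMS = {"movie": ["film"]}

print(substitute_variants("Movie reviews", WORD_SYNONYMS))  # ['Film reviews']
```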

Variants that are not exactly synonyms would also include technical and layperson language, such as Neoplasms and Cancer; older and newer designations, such as Near East and Middle East; and lexical variants, such as Hair loss and Baldness. Experts will tell you that in all of these cases these are not synonyms. They are sufficiently equivalent, though, for most taxonomies.

This brings us to another important point. Variants should be roughly equivalent within the context of the taxonomy and the body of content it is used to index. What serves well as a variant in one taxonomy might not be suitable for the same term in another taxonomy.

The number of variants to create for each taxonomy term/concept depends on the search technology and on the display of the taxonomy in the user interface. While a taxonomy could be browsed, it is more common for a taxonomy to be searched. The user searches for terms within the taxonomy, matching search strings against any variant of a term, if not the preferred term itself. The search does not have to be an exact match and may match to taxonomy terms that have at least the same words (in any order) and grammatically stemmed versions of the words (such as education and educational). With this in mind, taxonomists do not need to create variants for every possible variation of a term, as the search technology will be able to take care of some of that. 
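To make the point about search concrete, here is a minimal sketch in Python of matching that ignores word order and applies a very crude form of stemming. Everything in it is an assumption for illustration: the two-suffix stemmer is a toy (real search software uses far better stemming), the term "Educational policy" is invented, and the names crude_stem and find_concepts do not come from any product.

```python
SUFFIXES = ("al", "s")  # toy suffix list, just enough for the examples below


def crude_stem(word):
    """Lowercase a word and strip one known suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def stem_set(phrase):
    """Reduce a phrase to an unordered set of stemmed words."""
    return frozenset(crude_stem(word) for word in phrase.split())


def find_concepts(query, taxonomy):
    """Return preferred terms having any label that contains all of the query's stemmed words."""
    query_stems = stem_set(query)
    if not query_stems:
        return []
    matches = []
    for preferred, variants in taxonomy.items():
        labels = [preferred] + variants
        if any(query_stems <= stem_set(label) for label in labels):
            matches.append(preferred)
    return matches


# Preferred terms mapped to their variants (illustrative data).
TAXONOMY = {
    "Movie reviews": ["Film reviews"],
    "Educational policy": [],
}

print(find_concepts("reviews film", TAXONOMY))      # ['Movie reviews']
print(find_concepts("education policy", TAXONOMY))  # ['Educational policy']
```

Neither "Reviews of films" nor "Education policy" had to be entered as a variant here, which is the practical reason the taxonomist can skip such variations when the search technology behaves this way.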

As for sources for variants, other than the taxonomist's own knowledge of language, any term variations in sample source documents to be indexed should be considered. If content to be indexed with the taxonomy has already been published, and users have been searching for it on a website or content management system, the user-entered search strings found in the search logs can be an excellent source for variant terms. External reference sources and similar taxonomies can be consulted as a source, but not relied upon as the primary source for variants. 
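As a rough sketch of how search logs might be mined, the following Python counts user-entered search strings and lists the frequent ones that are not yet labels in the taxonomy. The log format (a plain list of strings), the threshold, the function name candidate_variants, and the sample queries are all assumptions for illustration; the output is only a list of candidates for a taxonomist to evaluate.

```python
from collections import Counter


def candidate_variants(search_log, known_labels, min_count=2):
    """Return (search string, count) pairs that occur often but match no existing label."""
    known = {label.lower() for label in known_labels}
    counts = Counter(query.strip().lower() for query in search_log if query.strip())
    return [
        (query, count)
        for query, count in counts.most_common()
        if count >= min_count and query not in known
    ]


# Hypothetical search log entries and the labels already in the taxonomy.
SEARCH_LOG = ["baldness", "Baldness", "hair loss", "alopecia", "alopecia", "alopecia"]
KNOWN_LABELS = ["Hair loss", "Baldness"]

print(candidate_variants(SEARCH_LOG, KNOWN_LABELS))  # [('alopecia', 3)]
```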


Saturday, July 30, 2016

Who Are Accidental Taxonomists?

Turning to the name of this blog, who are the accidental taxonomists? I sought an answer to this question through some of the questions of a survey I conducted of taxonomists to gather information for the opening chapter of my book, and more recently I looked at the job titles of those who had registered for my online taxonomies course.

I conducted the survey twice, in late 2008 and in May 2015, in order to gather information for the first and then the second edition of my book. The participants were solicited from various online discussion groups, such as Taxonomy Community of Practice and those of related subjects of content strategy, information architecture, digital asset management, knowledge management, etc. So, perhaps there was a slight degree of predetermining the participants by choosing where to announce the survey, although it can be assumed that practicing taxonomists of any background could be members of Taxonomy Community of Practice.

The differences in responses to some of the same questions over that time period are presented in my blog of June 2015 “Taxonomist Trends.” Among other findings, trends in the backgrounds of those involved in taxonomy showed an increase in backgrounds of knowledge management, content management, content strategy and digital asset management; and a decrease in those with a background in Software/IT and database design, development, or administration. Other backgrounds did not change much.

Another source of information on the backgrounds and current jobs of accidental taxonomists, which I did not include in my book, comes from the job titles and introductions of students in the online continuing education workshop “Taxonomies & Controlled Vocabularies,” which I taught through Simmons College School of Library and Information Science for the past eight years (2008-2016) and now teach on my own. I estimate I have had a total of about 500 students take the workshop, which has been offered on average five times per year. It’s impressive what varied backgrounds these “students” of taxonomy have.

Some of the continuing education students were already employed as taxonomists, and they wanted to fill in the gaps of their knowledge, especially if they had never taken a course on the subject before. Some were librarians, particularly Simmons School of Library and Information Science alumni, since the continuing education program was marketed towards them. These librarians may not have needed to create taxonomies in their current positions, but they were curious to learn about it and perhaps hoped to get into taxonomy work later in their careers. A few participants were even current library science students.

However, the majority of the continuing education students are indeed accidental taxonomists. As they explain in their introductions, they have found that learning about taxonomy creation and maintenance is important to their current jobs. As for what their current jobs are, many are involved with content management, digital asset management, or archives. Job titles, based on self-introductions, of those in the past 6 months have included: archivist, business analyst, cataloger, chief operating officer, consultant, data coordinator, digital asset administrator, digital asset cataloger, digital asset manager, director of content standards, information manager, linguist, photo editor, product manager, program manager, senior product analyst, and senior records analyst.

In the first chapter of my book, “Who are Taxonomists?”, there are five pages of job titles, which were obtained from (1) 130 taxonomist survey respondents indicating their job titles, and (2) LinkedIn profiles of several hundred people who had “taxonomy” or “taxonomies” in their profile. I did not look at the job titles of my continuing education workshop students for my research for that book chapter. There is a slight difference in who is included, because the survey for my book was specifically of those people already engaged in some degree of taxonomy work. Students of my online workshop, on the other hand, may not have done any taxonomy work yet, but are anticipating doing it. They are potentially accidental taxonomists. Their job titles are thus more varied.

Following is a list of job titles that students of the online workshop “Taxonomies & Controlled Vocabularies” put down on their registration form (although some left the job title field blank), over the years of 2009 - 2014. (After 2014, this information was not included in most of the class lists I received.)
Account Representative
Advance Technical Editor
Archives & Digital Collections Manager
Assistant Professor
Business Analyst
Business Research Specialist
Career Resource Consultant
Cataloging Librarian
Classification Model Developer
Collection Development Librarian
Content Management Assistant
Content Strategist
Corporate Data Steward & Taxonomist
Digital Archivist
Digital Asset Coordinator
Digital Asset Librarian
Digital Content Specialist
Digital Content Strategist
Digital Resources & Metadata Coordinator
Director, Information Architecture
Director, Library & Archives
Electronic Services Supervisor
Engineering Records Specialist
E-Records Manager / Analyst
Graphic Arts Accessioner
Graphic Designer
Graphics Project Archivist
GTA/LIS Student
Head Librarian, Collections Management
Head of Public Services
Human Factors Engineer
Information Analyst
Information Architect
Information Consultant
Information Manager
Information Resources Librarian
Information Scientist
Information Specialist
Information Technology Consultant
Instructional Design Analyst
Instructional Services Librarian
Internal Communications Officer
Knowledge & Information Manager
Knowledge & Learning Specialist
Knowledge Management Analyst
Knowledge Management Associate
Knowledge Management Officer
Knowledge Manager
Lead Library Technician
Legal Editor
Library Director
Library & Research Specialist
Library Assistant
Manager, Knowledge Resource Center
Manager, Library Services
Managing Partner
Marketing Director
Media Content Analyst
Metadata Analyst
Metadata Librarian
Metadata Production Specialist
Metadata Specialist
Monographs Cataloger
Operations Specialist
Program Records Manager
Project Analyst
Rare book cataloger
Recipe Processor
Records Manager
Reference & Electronic Resources Librarian
Reference Librarian
Relationship Manager
Research & Information Management Coordinator
Research Fellow
Research Librarian
Research Publications Manager
Research Specialist
Resource Center Customer & Product Specialist
Science Librarian
Search Specialist
Senior Associate Regulatory Affairs
Senior Business Analyst (Records Management)
Senior Business Systems Analyst
Senior Content Manager
Senior Content Strategist
Senior Data Curator
Senior Information Architect
Senior Information Security Analyst
Senior Knowledge Base Specialist
Senior Management Consultant
Senior Market Analyst
Senior Metator/XML Analyst
Senior Researcher
Senior Specialist, Technology & Metrics
SharePoint Lead Specialist
Social Sciences Liaison Librarian
Staff Writer
Supervising Librarian
Supervisor, Knowledge Management
Systems Librarian
Teacher Librarian
Team Lead Data & Quality
Technical Editor / Taxonomist
Technical Services Librarian
Technical Writer
User Services & Cataloging Librarian
UX Designer
UX Project Manager
Visual Resources Curator
Web Administrator
Web Services Librarian
Worldwide Metadata Coordinator

Finally, the industries in which the taxonomy students work included:
Broadcasting & media
Computer hardware & software
Consumer electronics
Engineering technology
Federal government agencies
Financial services
Health insurance
Healthcare information technology
Information services/publishing
Information technology
International agencies
Law firms
Medical devices
Municipal government
Oil & gas
Religious organizations
Research & development
State/provincial government

Simmons College School of Library and Information Science has put its Continuing Education Program on hiatus for evaluation and restructuring. I hope to be able to offer my online workshop again through Simmons in a future year. In the meantime, I am offering this workshop as an online course directly to individuals or groups. This and other taxonomy training offerings are listed on my website: Online Taxonomy Course.

Wednesday, June 22, 2016

Taxonomies vs. Thesauri: Practical Implementations

The differences between taxonomies and thesauri, and when to implement which, have been the subject of previous presentations of mine and a previous blog post, Taxonomies vs. Thesauri. Most recently, I gave a presentation of a case study of controlled vocabularies at Cengage Learning at the “Taxonomy Café” session at the SLA annual conference this month, and the post-presentation roundtable discussions got me thinking more about the differences in practical implementations.

To summarize the differences, while both taxonomies and thesauri have hierarchical relationships among their terms, in a taxonomy all terms are connected into a few large hierarchies with a limited number of top terms so as to serve top-down navigation or drilling-down of topics. While faceted taxonomies function differently, each facet label can be seen as a top term. Associative relationships (related terms) are a standard feature of thesauri but not of taxonomies. Synonyms/nonpreferred terms/alternate labels are required for thesauri, but could be optional in small taxonomies. Taxonomies serve browsing and drilling down by end users who are exploring topics, whereas thesauri serve users who search for (look up) a specific concept and then may follow “use” (preferred term), broader, narrower, or related term links to find the best term. A taxonomy works well for a controlled vocabulary that is limited in scope and easily categorized into hierarchies, whereas a thesaurus works better for content and a set of terms that is not easily categorizable and does not have a limited scope.
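The thesaurus side of this comparison can be pictured as a small data structure. The Python below is a minimal sketch under my own assumptions: the Term class, its field names, and the sample entries are invented for illustration and do not reflect the data model of any thesaurus software or standard. A taxonomy-only vocabulary would simply leave the related field empty.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    broader: list = field(default_factory=list)   # hierarchical links upward
    narrower: list = field(default_factory=list)  # hierarchical links downward
    related: list = field(default_factory=list)   # associative links (thesaurus feature)
    use: str = ""  # set only on a nonpreferred term; names the preferred term


# Illustrative entries, keyed by label.
VOCABULARY = {
    term.label: term
    for term in [
        Term("Beverages", narrower=["Hot drinks"]),
        Term("Hot drinks", broader=["Beverages"], related=["Esophageal cancer"]),
        Term("Esophageal cancer", related=["Hot drinks"]),
        Term("Hot beverages", use="Hot drinks"),
    ]
}


def look_up(label, vocabulary):
    """Look up a label and follow its "use" link to the preferred term, if it has one."""
    entry = vocabulary[label]
    return vocabulary[entry.use] if entry.use else entry


found = look_up("Hot beverages", VOCABULARY)
print(found.label, found.broader, found.related)
# Hot drinks ['Beverages'] ['Esophageal cancer']
```

From the preferred term, the user can then move to broader, narrower, or related terms, which is the look-up-and-follow-links behavior described above.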

In practice, I have found that taxonomies are useful for classifying products and services (such as in ecommerce), general enterprise document management, implementations in content management systems which support taxonomies, and all faceted or filtering implementations (SharePoint search, Endeca, and other post-search filtering enterprise search software). Thesauri, on the other hand, are more suitable for indexing and retrieving research literature (articles, white papers, conference presentations and proceedings, patents, etc.), whether commercially published or not.

Taxonomies are easier to create and often easier to implement than thesauri. They generally do not have associative (related term) relationships. In the absence of associative relationships between terms and with the emphasis on creating large top-term hierarchies, the thesaurus standard (ANSI/NISO Z39.19) rules for hierarchical relationships do not always have to be strictly followed. The inclusion of synonyms/nonpreferred terms also tends to be less thorough in taxonomies than in thesauri. Thesauri, on the other hand, require greater expertise in the field of information/knowledge organization, particularly to distinguish between hierarchical and associative relationships and to create the optimal number of those relationships and the optimal number of nonpreferred terms. Taxonomies, whether hierarchical or faceted, also tend to be easy to understand and use, accommodated by out-of-the-box content management software, and easier to maintain (and could be maintained by subject matter experts instead of taxonomists). Therefore, if a taxonomy, rather than a thesaurus, will suffice, then it makes more sense to create and maintain a taxonomy.

Thesauri, on the other hand, are more appropriate for indexing repositories of content for research because they do not restrict the inclusion of terms to established hierarchies. Any terms that represent a minimal threshold of content can be added, even if at first glance they may seem out of scope. For example, a term “Hot drinks” would not likely fit into a taxonomy on health/medicine, but the term would be desired for articles on research correlating the drinking of very hot beverages to esophageal cancer. Thesauri allow for inclusion of terms that, in combination with other terms, can achieve a more nuanced meaning, which may be needed in the research and discovery of what is contained in a body of research literature.

Indeed, in practice, the majority of new controlled vocabularies that are being created are taxonomies, not thesauri, and in fact taxonomies are usually all that are needed. The new implementations tend to be of the kind that are suitable for taxonomies. New repositories of documents for research, on the other hand, while highly important to be indexed with thesauri, do not arise as frequently. More often, collections of documents for research are already established and often already have thesauri. These thesauri do require the work of taxonomists to update and maintain them, though.