Friday, August 26, 2016

Synonyms, Alternate Labels, and Nonpreferred Terms

"Synonyms, Alternate Labels, and Nonpreferred Terms" is the title of my next conference presentation in October and in a different, briefer co-presented format as "How Many Synonyms Should You Have?" in November. So, now would be a good time to explore the topic in this blog.



"Synonyms" is the simple, nonexpert designation to the different names for the same term or concept in a taxonomy or other kind of controlled vocabulary. This is an over-simplification, for what may be involved is far more than just synonyms. Synonyms are words with the same meaning, but a taxonomy comprises terms that are typically phrases, often of two or three words, not just words. Furthermore, synonyms by definition have identical meaning, but in a taxonomy, we can have multiple names for a concept that are merely "close enough" in meaning to function as desired.

"Alternate labels" is a much better designation and is the nomenclature adopted by SKOS-compliant vocabularies. SKOS, which stands for Simply Knowledge Organization System, is a recommended standard of the World Wide Web Consortium for the application of the RDF (Resource Description Framework) interoperability format. Alternate labels refer to "concepts" which are known by their "preferred labels." You could certainly use the designation of "alternate labels" even if the controlled vocabulary or taxonomy is not SKOS compliant, and I have seen that sometimes.

"Nonpreferred terms" is the nomenclature of the thesaurus standard described in either ANSI/NISO Z39-19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies or ISO 25964 Thesauri and interoperability with other vocabularies, Part 1: Thesauri for Information Retrieval. Trained taxonomists, especially those with a library science background, are most familiar and comfortable with this designation, but its meaning is obviously not as intuitive to non-taxonomists.

Although the aforementioned three designations are the most common, there are others out there. I have run into the use of the following: Aliases, Alternate terms, Cross-references, Entry terms, Equivalent terms, Keywords, Nondescriptors, Non-postable terms, NPTs, See references, Use for terms, Use references, Used for terms, and Variants.

Taxonomy/thesaurus/ontology management software that supports the SKOS standard will typically use the SKOS designation of "alternate label," and software that supports the thesaurus standards will typically use the language of "nonpreferred term." As for software that supports both standards, which is becoming increasingly common, "alternate labels" or "alternate terms" is more common than "nonpreferred terms", and other designations might be used, such as "variants." So, for want of an unambiguous single-word designation, I will refer to these as "variants" for the remainder of this post.

Techniques for creating variants


Since synonyms are for single words, and most taxonomy terms are multi-word phrases, a common technique is to substitute a synonym for one word of a multi-word phrase. For example, Movie reviews and Film reviews.
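
The substitution technique can be sketched in a few lines of code. This is only an illustration: the substitute_synonyms function and the small SYNONYMS table are hypothetical, and anything it generates is a candidate for a taxonomist to review, not a finished variant.

```python
def substitute_synonyms(term: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Generate candidate variants by swapping one word at a time."""
    words = term.split()
    candidates = []
    for i, word in enumerate(words):
        for alternative in synonyms.get(word.lower(), []):
            replaced = words[:i] + [alternative] + words[i + 1:]
            candidate = " ".join(replaced)
            # Keep the sentence-style capitalization of the original term.
            candidates.append(candidate[0].upper() + candidate[1:])
    return candidates


# Hypothetical single-word synonym table, keyed by lowercase word.
SYNONYMS = {"movie": ["film"]}

variants = substitute_synonyms("Movie reviews", SYNONYMS)
print(variants)
```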

Variants that are not exactly synonyms would also include technical and layperson language, such as Neoplasms and Cancer; older and newer designations, such as Near East and Middle East; and lexical variants, such as Hair loss and Baldness. Experts will tell you that in all of these cases these are not synonyms. They are sufficiently equivalent, though, for most taxonomies.

This brings us to another important point. Variants should be roughly equivalent within the context of the taxonomy and the body of content it is used to index. What serves well as a variant in one taxonomy might not be suitable for the same term in another taxonomy.

The number of variants to create for each taxonomy term/concept depends on the search technology and on the display of the taxonomy in the user interface. While a taxonomy could be browsed, it is more common for a taxonomy to be searched. The user searches for terms within the taxonomy, matching search strings against any variant of a term, if not the preferred term itself. The search does not have to be an exact match and may match to taxonomy terms that have at least the same words (in any order) and grammatically stemmed versions of the words (such as education and educational). With this in mind, taxonomists do not need to create variants for every possible variation of a term, as the search technology will be able to take care of some of that. 
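
To make the point concrete, here is a minimal sketch of a search that matches words in any order and ignores simple word endings. It is an assumption-laden toy: crude_stem is a deliberately naive stand-in for a real stemmer, and the TAXONOMY data and function names are invented for illustration. Actual search software will behave differently.

```python
import re


def crude_stem(word: str) -> str:
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> frozenset[str]:
    """Lowercase, split into words, and stem, discarding word order."""
    return frozenset(crude_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower()))


def find_terms(query: str, taxonomy: dict[str, list[str]]) -> list[str]:
    """Return preferred terms having a label that contains all query words."""
    query_words = normalize(query)
    if not query_words:
        return []
    matches = []
    for preferred, variants in taxonomy.items():
        labels = [preferred, *variants]
        if any(query_words <= normalize(label) for label in labels):
            matches.append(preferred)
    return matches


# Hypothetical taxonomy: preferred term -> list of variants.
TAXONOMY = {
    "Movie reviews": ["Film reviews"],
    "Educational policy": ["School policy"],
}

print(find_terms("policy education", TAXONOMY))
print(find_terms("film review", TAXONOMY))
```

Because the search already handles word order and the education/educational difference, there is no need for variants such as "Policy, educational" or "Education policy" in this toy setup, whereas "Film reviews" still has to be entered by hand.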

As for sources for variants, other than the taxonomist's own knowledge of language, any term variations in sample source documents to be indexed should be considered. If content to be indexed with the taxonomy has already been published, and users have been searching for it on a website or content management system, the user-entered search strings found in the search logs can be an excellent source for variant terms. External reference sources and similar taxonomies can be consulted as a source, but should not be relied upon as the primary source for variants.
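
Mining search logs for candidates could look something like the sketch below. It assumes the log is already available as a plain list of query strings, which is a simplification. The candidate_variants function, the min_count threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def candidate_variants(search_log, known_labels, min_count=2):
    """Return (query, count) pairs for frequent queries that are not yet labels."""
    known = {label.lower() for label in known_labels}
    counts = Counter(q.strip().lower() for q in search_log if q.strip())
    return [
        (query, count)
        for query, count in counts.most_common()
        if count >= min_count and query not in known
    ]


# Made-up sample of user-entered search strings.
log = ["hair loss", "Baldness", "baldness ", "alopecia", "alopecia", "alopecia", "bald"]
labels = ["Hair loss", "Baldness"]  # existing preferred term and variant

suggestions = candidate_variants(log, labels)
print(suggestions)
```

The output is only a list of suggestions. Deciding which concept, if any, a frequent search string belongs to remains a judgment call for the taxonomist.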


Saturday, July 30, 2016

Who Are Accidental Taxonomists?

Turning to the name of this blog, who are the accidental taxonomists? I sought an answer to this question through some of the questions of a survey I conducted of taxonomists to gather information for the opening chapter of my book, and more recently I looked at the job titles of those who had registered for my online taxonomies course.

I conducted the survey twice, in late 2008 and in May 2015, in order to gather information for the first and then the second edition of my book. The participants were solicited from various online discussion groups, such as Taxonomy Community of Practice and those of related subjects of content strategy, information architecture, digital asset management, knowledge management, etc. So, perhaps there was a slight degree of predetermining the participants by choosing where to announce the survey, although it can be assumed that practicing taxonomists of any background could be members of Taxonomy Community of Practice.

The differences in responses to some of the same questions over that time period are presented in my blog of June 2015 “Taxonomist Trends.” Among other findings, trends in the backgrounds of those involved in taxonomy showed an increase in backgrounds of knowledge management, content management, content strategy and digital asset management; and a decrease in those with a background in Software/IT and database design, development, or administration. Other backgrounds did not change much.

Another source of information on the backgrounds and current jobs of accidental taxonomists, which I did not include in my book, comes from the job titles and introductions of students in the online continuing education workshop “Taxonomies & Controlled Vocabularies,” which I have taught through Simmons College School of Library and Information Science for the past eight years (2008-2016). I estimate I have had a total of about 500 students take the workshop, which has been offered on average five times per year. It’s impressive what varied backgrounds these “students” of taxonomy have.

Some of the continuing education students were already employed as taxonomists, and they wanted to fill in the gaps in their knowledge, especially if they had never taken a course on the subject before. Some were librarians, particularly Simmons School of Library and Information Science alumni, since the continuing education program was marketed towards them. These librarians may not need to create taxonomies in their current position, but they are curious to learn about it and perhaps hope to get into taxonomy work later in their careers. A few participants were even current library science students.

However, the majority of the continuing education students are indeed accidental taxonomists. As they explain in their introductions, they have found that learning about taxonomy creation and maintenance is important to their current jobs. As for what their current jobs are, many are involved with content management, digital asset management, or archives. Job titles, based on self-introductions, of those in the past 6 months have included: archivist, business analyst, cataloger, chief operating officer, consultant, data coordinator, digital asset administrator, digital asset cataloger, digital asset manager, director of content standards, information manager, linguist, photo editor, product manager, program manager, senior product analyst, and senior records analyst.

In the first chapter of my book, “Who are Taxonomists?”, there are five pages of job titles, which were obtained from (1) 130 taxonomist survey respondents indicating their job titles, and (2) the LinkedIn profiles of several hundred people who had “taxonomy” or “taxonomies” in their profile. I did not look at the job titles of my continuing education workshop students for my research for that book chapter. There is a slight difference in who is included, because the survey for my book was specifically of those people already engaged in some degree of taxonomy work. Students of my online workshop, on the other hand, may not have done any taxonomy work yet, but are anticipating doing it. They are potentially accidental taxonomists. Their job titles are thus more varied.

Following is a list of job titles that students of the online workshop “Taxonomies & Controlled Vocabularies” put down on their registration form (although some left the job title field blank), over the years of 2009 - 2014. (After 2014, this information was not included in most of the class lists I received.)
Account Representative
Advance Technical Editor
Archives & Digital Collections Manager
Assistant Professor
Business Analyst
Business Research Specialist
Career Resource Consultant
Cataloging Librarian
Classification Model Developer
Collection Development Librarian
Content Management Assistant
Content Strategist
Corporate Data Steward & Taxonomist
Digital Archivist
Digital Asset Coordinator
Digital Asset Librarian
Digital Content Specialist
Digital Content Strategist
Digital Resources & Metadata Coordinator
Director, Information Architecture
Director, Library & Archives
Electronic Services Supervisor
Engineering Records Specialist
E-Records Manager / Analyst
Graphic Arts Accessioner
Graphic Designer
Graphics Project Archivist
GTA/LIS Student
Head Librarian, Collections Management
Head of Public Services
Human Factors Engineer
Information Analyst
Information Architect
Information Consultant
Information Manager
Information Resources Librarian
Information Scientist
Information Specialist
Information Technology Consultant
Instructional Design Analyst
Instructional Services Librarian
Internal Communications Officer
Knowledge & Information Manager
Knowledge & Learning Specialist
Knowledge Management Analyst
Knowledge Management Associate
Knowledge Management Officer
Knowledge Manager
Lead Library Technician
Legal Editor
Library Director
Library & Research Specialist
Library Assistant
Manager, Knowledge Resource Center
Manager, Library Services
Managing Partner
Marketing Director
Media Content Analyst
Metadata Analyst
Metadata Librarian
Metadata Production Specialist
Metadata Specialist
Monographs Cataloger
Operations Specialist
Program Records Manager
Project Analyst
Rare book cataloger
Recipe Processor
Records Manager
Reference & Electronic Resources Librarian
Reference Librarian
Relationship Manager
Research & Information Management Coordinator
Research Fellow
Research Librarian
Research Publications Manager
Research Specialist
Resource Center Customer & Product Specialist
Science Librarian
Search Specialist
Senior Associate Regulatory Affairs
Senior Business Analyst (Records Management)
Senior Business Systems Analyst
Senior Content Manager
Senior Content Strategist
Senior Data Curator
Senior Information Architect
Senior Information Security Analyst
Senior Knowledge Base Specialist
Senior Management Consultant
Senior Market Analyst
Senior Metator/XML Analyst
Senior Researcher
Senior Specialist, Technology & Metrics
SharePoint Lead Specialist
Social Sciences Liaison Librarian
Staff Writer
Supervising Librarian
Supervisor, Knowledge Management
Systems Librarian
Teacher Librarian
Team Lead Data & Quality
Technical Editor / Taxonomist
Technical Services Librarian
Technical Writer
User Services & Cataloging Librarian
UX Designer
UX Project Manager
Visual Resources Curator
Web Administrator
Web Services Librarian
Worldwide Metadata Coordinator

Finally, the industries in which the taxonomy students work included:
Broadcasting & media
Computer hardware & software
Consumer electronics
Engineering technology
Federal government agencies
Financial services
Health insurance
Healthcare information technology
Information services/publishing
Information technology
International agencies
Law firms
Medical devices
Municipal government
Oil & gas
Religious organizations
Research & development
State/provincial government

Currently Simmons College School of Library and Information Science has put its Continuing Education Program on hiatus for evaluation and restructuring. I hope to be able to offer my online workshop again through Simmons in 2017. In the meantime, other workshops and training I offer in taxonomy creation are listed on my website: Courses, Workshops, and Training.

Wednesday, June 22, 2016

Taxonomies vs. Thesauri: Practical Implementations

The differences between taxonomies and thesauri, and when to implement which, have been a subject of previous presentations of mine and a previous blog post, Taxonomies vs. Thesauri. Most recently, after I presented a case study of controlled vocabularies at Cengage Learning at the “Taxonomy Café” session at the SLA annual conference this month, the post-presentation roundtable discussions got me thinking more about the differences in practical implementations.

To summarize the differences, while both taxonomies and thesauri have hierarchical relationships among their terms, in a taxonomy all terms are connected into a few large hierarchies with a limited number of top terms so as to serve top-down navigation or drilling-down of topics. While faceted taxonomies function differently, each facet label can be seen as a top term. Associative relationships (related terms) are a standard feature of thesauri but not of taxonomies. Synonyms/nonpreferred terms/alternate labels are required for thesauri, but could be optional in small taxonomies. Taxonomies serve browsing and drilling down by end users who are exploring topics, whereas thesauri serve users who search for (look up) a specific concept and then may follow “use” (preferred term), broader, narrower, or related term links to find the best term. A taxonomy works well for a controlled vocabulary that is limited in scope and easily categorized into hierarchies, whereas a thesaurus works better for content and a set of terms that is not easily categorizable and does not have a limited scope.
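
For readers who think in data structures, the thesaurus side of this comparison can be sketched as below. The ThesaurusTerm class and build_lookup helper are hypothetical, and the broader and related terms given for Neoplasms are invented for illustration rather than taken from any real thesaurus. A simple taxonomy would typically need only the name and the broader/narrower fields.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusTerm:
    name: str                                          # preferred term
    broader: list[str] = field(default_factory=list)   # BT
    narrower: list[str] = field(default_factory=list)  # NT
    related: list[str] = field(default_factory=list)   # RT, associative
    used_for: list[str] = field(default_factory=list)  # UF, nonpreferred terms


def build_lookup(terms: list[ThesaurusTerm]) -> dict[str, str]:
    """Map every lowercase name or nonpreferred term to its preferred term."""
    lookup = {}
    for term in terms:
        lookup[term.name.lower()] = term.name
        for nonpreferred in term.used_for:
            lookup[nonpreferred.lower()] = term.name  # the "use" reference
    return lookup


# Illustrative record; the relationships are made up for this example.
neoplasms = ThesaurusTerm(
    name="Neoplasms",
    broader=["Diseases"],
    related=["Oncology"],
    used_for=["Cancer"],
)

lookup = build_lookup([neoplasms])
preferred = lookup["cancer"]
print(preferred, neoplasms.related)
```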

In practice, I have found that taxonomies are useful for classifying products and services (such as in ecommerce), general enterprise document management, implementations in content management systems which support taxonomies, and all faceted or filtering implementations (SharePoint search, Endeca, and other post-search filtering enterprise search software). Thesauri, on the other hand, are more suitable for indexing and retrieving research literature (articles, white papers, conference presentations and proceedings, patents, etc.), whether commercially published or not.

Taxonomies are easier to create and often easier to implement than thesauri. They generally do not have associative (related term) relationships. In the absence of associative relationships between terms and with the emphasis on creating large top-term hierarchies, the thesaurus standard (ANSI/NISO Z39.19) rules for hierarchical relationships do not always have to be strictly followed. The inclusion of synonyms/nonpreferred terms also tends to be less thorough in taxonomies than in thesauri. Thesauri, on the other hand, require greater expertise in the field of information/knowledge organization, particularly to distinguish between hierarchical and associative relationships and to create the optimal number of those relationships and the optimal number of nonpreferred terms. Taxonomies, whether hierarchical or faceted, also tend to be easy to understand and use, accommodated by out-of-the-box content management software, and easier to maintain (and could be maintained by subject matter experts instead of taxonomists). Therefore, if a taxonomy, rather than a thesaurus, will suffice, then it makes more sense to create and maintain a taxonomy.

Thesauri, on the other hand, are more appropriate for indexing repositories of content for research because they do not restrict the inclusion of terms to established hierarchies. Any terms that represent a minimal threshold of content can be added, even if at first glance they may seem out of scope. For example, a term “Hot drinks” would not likely fit into a taxonomy on health/medicine, but the term would be desired for articles on research correlating the drinking of very hot beverages to esophageal cancer. Thesauri allow for inclusion of terms that, in combination with other terms, can achieve a more nuanced meaning, which may be needed in the research and discovery of what is contained in a body of research literature.

Indeed, in practice, the majority of new controlled vocabularies that are being created are taxonomies, not thesauri, and in fact taxonomies are usually all that are needed. The new implementations tend to be of the kind that are suitable for taxonomies. New repositories of documents for research, on the other hand, while highly important to be indexed with thesauri, do not arise as frequently. More often, collections of documents for researching are already established and often already have thesauri. Of course, dealing with collections of other data (not merely traditional documents), especially big data, is another story altogether.

Wednesday, May 4, 2016

Taxonomy Design for Content Management Systems

A very common implementation for taxonomies is in content management systems (CMS). The content managed in this kind of software can be diverse: office application files, PDF documents, image files, audio files, video files, and, in the case of web content management systems, also HTML and any kind of file to be published to the Web. The “management” this kind of software supports is also diverse: enhancing, annotating, tagging, categorizing, reviewing, approving, sharing, assigning, publishing, archiving, and deprecating of content. Finally, the users can be diverse: content creators, content managers, and anyone in an organization who needs access to a subset of the content.

Due to the diversity of content types and purposes, the metadata associated with each content item obviously plays a very important role in a CMS. As for taxonomies, in the context of a CMS, it is probably best to consider a taxonomy as a subset of metadata, although the distinction between taxonomy and metadata can get blurred. Metadata about content can be descriptive, structural, or administrative. Descriptive metadata comprises the attributes that help make the content item retrievable or findable, including title, author, source, date, audience, document type, and also metadata for what the content is about (abstract, keywords, subjects, etc.). Many of these metadata fields should be populated with terms that are on controlled vocabulary lists for each field. In some cases, such as the “subject,” the controlled vocabulary may be rather large and thus organized into a hierarchy, and thus constitutes a hierarchical taxonomy of subjects. In other cases, various aspects of what content is about might be categorized into different metadata fields with controlled vocabularies, such as: industry, process, specialty, department, location, etc. As a result, a set of controlled vocabularies for each field could be considered as a faceted taxonomy, with each of these descriptive metadata fields functioning as a facet.
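
The idea of metadata fields acting as facets, each with its own controlled vocabulary, can be sketched as a simple validation check. This is not how any particular CMS does it. The FACETS dictionary, its sample values, and the invalid_tags function are assumptions made up for illustration.

```python
# Hypothetical facets: metadata field name -> controlled vocabulary.
FACETS = {
    "industry": {"Financial services", "Oil & gas"},
    "department": {"Marketing", "Legal"},
    "location": {"Boston", "London"},
}


def invalid_tags(metadata: dict[str, list[str]], facets: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return the tagged values that are not in the field's controlled vocabulary."""
    problems = {}
    for field_name, values in metadata.items():
        allowed = facets.get(field_name)
        if allowed is None:
            # The field itself is not a defined facet, so flag all its values.
            problems[field_name] = list(values)
            continue
        rejected = [value for value in values if value not in allowed]
        if rejected:
            problems[field_name] = rejected
    return problems


# A content item tagged with one valid and one uncontrolled value.
item = {"industry": ["Oil & gas"], "department": ["Sales"]}
problems = invalid_tags(item, FACETS)
print(problems)
```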

With this in mind, the task of actually defining the descriptive metadata fields or taxonomy facets needs to involve various stakeholders, including both users and other experts and managers. Users include the various people who upload content and will tag the content with metadata and taxonomy terms, and the various end-users who will browse and search for the content using the metadata and taxonomy. Other stakeholders to involve from the beginning may include content managers, metadata architects, content strategists, business analysts, and user experience designers.

A CMS tends to offer two methods of classification: folders and tags. Folders (which in a CMS tend to be “virtual” folders, not actual file directory paths) offer an intuitive user interface for users to put content into categories and then find the content. Tags, on the other hand, are appropriate for assigning all kinds of metadata. Typically, if a dominant means of categorizing is identified through conversations with users, such as content type or subject category, this categorization scheme can be used for the folders, and then all other means of categorization and classification can be handled with the tags.

Recently a colleague asked me which method I thought was best for associating subject disciplines with multimedia content stored in a repository where the system offered both options: put them into folders named for each discipline or assign metadata tags for the disciplines. The answer, of course, is “it depends.” It depends on:
  • Workflow: Will the files always stay in this repository or will they “travel” downstream to other applications?  If the content will likely move to other systems, then tags are preferred.
  • Taxonomy size: Is the taxonomy under consideration for folders large? A large set of folders may be cumbersome to browse through and more suitable for type-ahead lookup in a metadata field lookup table or search box.
  • User preference: Do users who upload prefer to use folders or tags only? Do users who need to retrieve the content prefer to browse through folders or only search on tags?
  • Categorization enforcement: Can you require users to assign descriptive tags? If you are concerned that they will not, folders will better enforce the use of the categories.
  • Support for hierarchy: Will the system support a hierarchy of categories within the lookup controlled vocabulary lists for the tag fields, or are hierarchies only supported as folders, or neither? Then consider which fields would benefit most from a hierarchy.
  • Support for synonyms: Do the lookup controlled vocabulary lists for the tag fields include support for synonyms/alternate labels? If so, and if the controlled vocabulary is large, then tags have the advantage over folders, which cannot have synonym labels.

After determining what part of the categorization system, if any, goes into folders, and what goes into tags, the next task is to figure out how many descriptive metadata tag fields to create. Issues include:
  • What metadata can be assigned automatically and what must be done manually? If it can be assigned automatically (such as file format type or language by auto-detect software or maybe even subject category by use of auto-categorization software), that’s great, but manually assigned metadata should be limited so as not to make the task burdensome.
  • What fields are users likely to search on in retrieval? You need to cover the basics, but there is no need for additional fields that users are not likely to use as lookup criteria.
  • What method of classification is important to the users? “Subjects” is a catch-all field, but if users are always thinking of something else too, such as Discipline or Product, then these should be pulled out into separate fields or facets.

Finally, when designing taxonomy and metadata for a CMS, the taxonomist should have use of a test data instance of the system to try out the implementation of the taxonomy in the CMS user interface. A taxonomy that looks good offline (in Excel or a taxonomy management system), might appear awkward within the limitations of a CMS’s user interface.

Sunday, April 17, 2016

"The Accidental Taxonomist," 2nd edition

Recently I was asked what I added to the newly published 2nd edition of my book, The Accidental Taxonomist. The additions and changes are summarized in the book's preface, so I have decided to post the entire preface here, which follows:

When I published the first edition of The Accidental Taxonomist, I knew that changes would be needed within a couple of years, mostly to reflect the changes in thesaurus management software vendors, as software is a volatile industry characterized by new companies, acquisitions, and some vendors going out of business. It was also expected that the website examples, given as screenshots in the book, would change. As it turned out, the changes were more widespread than anticipated. I ended up replacing all screenshots and adding some new ones (totaling 44), since even existing software vendors or websites had updated their user interfaces. More than half of the various website URLs found throughout the book also had to be updated.

In the area of software, what I did not anticipate was that software changes have gone beyond just who the vendors are and what features vendors have added. There have also been some notable trends, such as in the adoption of Semantic Web standards, the convergence of taxonomy and ontology support, and more web-based, cloud/software-as-a-service offerings. Thus, in addition to adding more software vendors (and removing a few), I have also added a short section summarizing all of these software trends.

Also with respect to software, the first edition made no mention of SharePoint, since SharePoint 2010, the first version to support taxonomies, came out the same year my book did. So this new edition includes some discussion of managing taxonomies in SharePoint. There is not the space here to go into all the details, so I explore specific topics, such as managing polyhierarchy in SharePoint, on my blog, also called The Accidental Taxonomist.

The standards have changed too. ANSI/NISO Z39.19-2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies was reaffirmed in 2010, but more significantly ISO 2788 Guidelines for the Establishment and Development of Monolingual Thesauri and ISO 5964 Guidelines for the Establishment and Development of Multilingual Thesauri have been replaced by ISO 25964 Thesauri and Interoperability with Other Vocabularies, Part 1 in 2011 and Part 2 in 2013. This is not merely a reorganization of parts. The changes also comprise new content in the area of interoperability, including the exchange of taxonomy data and mappings between vocabularies. Now ANSI/NISO Z39.19 is coming due for a new version, but it is a long process. With an eye to a wider international audience, in this edition I cite the ISO standard along with the ANSI/NISO standard whenever relevant.

In addition to the change in the ISO thesaurus standard, there is also a change involving the wider adoption of other kinds of standards, most significantly those associated with the Semantic Web. Although development had begun earlier, the World Wide Web Consortium (W3C) formally released the SKOS (Simple Knowledge Organization System) standard only in August 2009, when I was busy finalizing my manuscript for the first edition, before the extent of the eventual adoption of SKOS was known. Now it is quite common for taxonomy management software to follow the SKOS specifications of concept modeling and taxonomy output. So, more attention to SKOS is given in this edition.

Another trend, which was already underway at the time I wrote my first edition, but which I simply did not bother to consider in detail, is the convergence of metadata and taxonomy. So, I have added a short section on the topic. I needed the intervening years to actually work in areas where taxonomies and metadata meet, whether through consulting or in a department called Metadata Standards and Services, before I felt I could say something original on the subject.

As for the people who do taxonomy work, the accidental taxonomists, I conducted a new survey, which has shown that their backgrounds remain as diverse as they were when surveyed six years prior, but there are new stories and examples of how people got involved in this type of work and what they like about it. Meanwhile, the opportunities for taxonomists continue to grow. I executed the exact same search for jobs in fall of 2009 and again in fall of 2015, on the job board aggregator, and found the numbers of currently posted openings had significantly increased.

Although I considered myself quite experienced with various taxonomies at the time I wrote the first edition, I have continued to gain additional taxonomy work experience since, so here and there throughout the book I have added information based on further reflection. Thus, in the chapter on planning and designing a taxonomy, I have added some advice regarding designating facets for enterprise taxonomies, questions to ask during stakeholder interviews, how to conduct stakeholder workshops, and methods of testing taxonomies.

I had also started writing my blog the year after the first edition, but the blog post topics are not the same as the additions to this book. The Accidental Taxonomist blog allows me to explore tangents in more detail, and this book is already longer than it needs to be!

Taxonomies are interesting in that some things about them are fundamental and do not change, such as the notion of a concept, its varied names, and its hierarchical and nonhierarchical relationships with other concepts. But, as with anything related to information technology, there are things about taxonomies that do change, such as how they are managed, implemented, and utilized. Thus, it is not only the varied subject matter that makes taxonomy work interesting, but also the various implementations and opportunities to take advantage of in new technologies, such as those related to the Semantic Web and Linked Open Data. Although this new edition addresses these topics, my ongoing blog will cover further considerations in such areas.