The Accidental Taxonomist: Taxonomy terms

Sunday, July 31, 2022

Taxonomy Challenges Discussed at SLA Conference

When it comes to conferences dealing with the subject of taxonomy creation, implementation, and maintenance, without a doubt Taxonomy Boot Camp and Taxonomy Boot Camp London are by far the best conferences for their content, speakers, and networking opportunities. However, there are other conferences that have sessions on taxonomies.

The annual conference of the Special Libraries Association (SLA) usually has multiple taxonomy-related sessions. This year’s conference, held July 31 - August 2 in Charlotte, NC, and the first in-person conference in three years, was no exception.

Thanks to the volunteer programming efforts of SLA’s Taxonomy Community (one of over 20 specialized topic groups, formerly called “Divisions”), the annual conference is able to include multiple taxonomy sessions, some of which bring together multiple speakers, either co-presenting a single talk or coming together on a panel. Even sessions not organized by the Taxonomy Community may include taxonomy topics, such as those dealing with knowledge management, information architecture, or research that uses a taxonomy. A Taxonomy Community networking event is also regularly part of the SLA conference.

This year’s conference is hybrid, so some of the taxonomy sessions are in-person, and some are pre-recorded and available on-demand. Live-streaming was also done for keynotes and some sessions. The following are the in-person taxonomy sessions at the SLA 2022 conference:

“The Role of DEI in Taxonomy Development, Maintenance, Search, and Retrieval,” presented by Marisa Hughes. (This presentation on a popular topic was additionally live-streamed and pre-recorded for on-demand viewing.)

“Current Challenges and Advanced Taxonomy Topics” panel comprising Marisa Hughes, Heather Kotula, John Bertland, and myself.

“Research Sources and Methodologies for Taxonomy Development,” jointly presented by Marisa Hughes and myself.

The following are pre-recorded, on-demand only taxonomy sessions:

“There ain’t no Sanity Clause: Taxonomy and Data Analysis” presented by Michele Lamorte

“Metadata Governance” presented by John Horodyski

Conference session on diversity, equity, and inclusion in taxonomies

Diversity, Equity & Inclusion (DEI) is a growing area of interest in information management/sharing and content creation. Marisa Hughes, the taxonomist who edits the APA Thesaurus of Psychology Index Terms, explained the challenges of revising the thesaurus terms to reflect DEI, for which she gave the following definitions:

Diversity: “The vast range of differences among individuals and groups.”
Equity: “The quality of being fair and impartial.”
Inclusion: “Welcoming and respecting diverse individuals and groups. Diversity in practice.”

She has been reviewing thousands of terms for accuracy, currency, inclusivity, and avoidance of bias, stereotypes, or discrimination. Areas that this DEI review has focused on are:

Racial, ethnic, and cultural identity
Gender diversity and sexual orientation
Age, disability status, and socioeconomic class bias

In the area of disability status, for example, the term should focus on the person and not the disability. Thus, “Hearing impaired” is changed to “Person with hearing loss,” and “Mentally ill” is changed to “Individual with a mental illness.”

Marisa Hughes presenting “The Role of DEI in Taxonomy”

Additional challenges include hierarchical relationships, term usage, and change management. If users can see hierarchical relationships, even if not the full hierarchy, these relationships need to be appropriate. For example, certain personal conditions and behaviors should not be narrower to the term “Disorders.” Term frequency of usage (also called “literary warrant”) is important, but the larger goal is to have respectful terms. Change management involves taking care that the term changes do not impact search and retrieval. Marisa oversees the large job of reindexing content with new terms and adding change notes or history notes to changed terms.

Conference panel on current taxonomy challenges

In this session, the four panelists each gave brief opening talks, then were asked questions by the moderator, Judith Theodori, and then it was opened up for general Q&A and discussion with the audience.

I presented on the themes of challenges which came from 138 taxonomist survey responses to the question "What are the pain points or challenges in your taxonomy work?" The leading trends in the responses were:

Achieving stakeholder understanding and buy-in
Competing interests, expectations, and requests
Organizational challenges
Tools and technology that are inadequate or not integrated

John Bertland, Digital Librarian and Content Specialist at the Presidio Trust, spoke of the taxonomy challenges in his organization, including governance at a time of organizational change and funding. A specific challenge is expanding and adapting a taxonomy that was originally just for digital asset management to include the content of the intranet.

“Current Taxonomy Challenges” panelists Marisa Hughes, John Bertland, Heather Kotula, and Heather Hedden

Marisa Hughes, Taxonomist at the American Psychological Association, related the challenge of having to quickly come up with all the COVID-related taxonomy terms in time for the usual thesaurus update scheduled in April 2020. This involved a lot of research on literature that was still rather lacking on the subject.

Another challenging project was to determine the role of historical data in the vocabulary of 3500 terms for the period of 1967 to 1973, which involved removing offensive terms. It was a judgment call whether or not to continue to use a potentially offensive term as a non-preferred term (alternative label). Heather Kotula, VP, Marketing and Communications of Access Innovations, Inc., the fourth panelist, also discussed the same subject of excluding pejorative terms, referred to as “semantic censorship.” In the end it was concluded that pejorative terms are often actually not that much in use in the documents being tagged.

Friday, February 4, 2022

Defining a Taxonomy’s Scope

In planning a taxonomy, I have often said that it is important at the beginning to define the taxonomy’s scope, specifically the subject area scope of the taxonomy’s terms, but without going into more detail. Recently I was asked by a client how to define a taxonomy’s scope. This is a good question. The taxonomy should be suited to the subject area scope of the content that will be tagged with the taxonomy and to the scope of the users’ expectations. Terms or topics only marginal to the subject scope, however, could occur in the content, and whether they should also be included in the taxonomy is a question. Ultimately, that should depend on whether user expectations justify it, as the needs of users should also be a factor in creating a taxonomy. A taxonomy should suit both its content and its users.

Sources for Taxonomy Terms

For content as a source of taxonomy terms, a combination of manual and automated approaches is recommended. By manually reviewing sample individual documents or content items, you can discern the main ideas and main topics, which should form the start and basic structure of the taxonomy and also help define its scope. Automated methods of extracting terms, through text analytics technologies, can bring in many additional terms from a much larger corpus of documents more quickly, picking up terms that a limited manual review would miss. Even though automated text analytics extracts terms based on relevancy and frequency of occurrence, such terms could be out of scope of the subject domain. That’s why it’s important to start first with a manual review of content to define the subject scope. Then, when you enrich the taxonomy with automated extraction, you can approve terms that appear to be in scope or at least closely relevant and reject others. But should you reject all that are out of scope, even if they appear with sufficient frequency and relevancy? My advice is to try to assume the role of the user. Ask yourself: Might a user want to search for content on this term in this content collection?

For user needs and expectations as a contributing source of taxonomy terms, obtaining this information can be very direct, such as by creating a user questionnaire (at least for your internal users) that asks what the topics of importance are, how those users would define the scope, and what “marginal” topics would be acceptable for them to include. You could also request sample challenging (not expected, basic, typical) queries that the users would make. Another good way to obtain input from the user side is to look at search query logs that list search strings that users have entered over a period of time, ranked by frequency. If a search phrase that is slightly out of scope of the subject occurs frequently, then the term should still be considered for inclusion in the taxonomy.
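As a rough illustration of the search log review described above, the following Python sketch ranks logged queries by frequency and lists the frequent ones that do not yet match a taxonomy term. It is only a sketch under stated assumptions: the log format (one query string per entry), the sample queries, the `taxonomy_labels` set, and the `min_count` threshold are all hypothetical and are not taken from any particular search platform.

```python
from collections import Counter

def rank_queries(queries):
    """Normalize query strings and rank them by frequency, most frequent first."""
    # Lowercase and collapse whitespace so trivial variants count as one query.
    normalized = (" ".join(q.lower().split()) for q in queries if q.strip())
    return Counter(normalized).most_common()

def candidate_terms(ranked, taxonomy_labels, min_count=2):
    """Return frequent queries that do not yet match an existing taxonomy label."""
    known = {label.lower() for label in taxonomy_labels}
    return [(q, n) for q, n in ranked if n >= min_count and q not in known]

# Hypothetical log entries for a recipe collection (assumed data).
log = [
    "Nutrition facts", "chicken soup", "nutrition  facts",
    "baking", "nutrition facts", "Chicken Soup", "vegan",
]
taxonomy_labels = {"Chicken soup", "Baking"}  # assumed existing terms

ranked = rank_queries(log)
candidates = candidate_terms(ranked, taxonomy_labels)
for query, count in candidates:
    print(f"{count:>3}  {query}")
```

The output is a list of candidates for a taxonomist to review, not a list of automatic additions. Whether a frequent query is in scope remains a human judgment.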

In either case, the scope of the subject gets better defined as the taxonomy is created. For example, a taxonomy for recipes may initially be scoped to comprise terms for the names of dishes, ingredients, and cooking methods. But then a different term shows up with significant frequency, “Nutrition Facts.” If it occurs in both the content and the user research, then it likely should be included. If it shows up in the content only, but is not validated in user research, then it is more questionable.
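The reasoning in this example can be written down as a small triage rule. The sketch below is only one way of recording it. The function name and status labels are made up for illustration, and the handling of a term that appears in user research but not in the content is an assumption based on the earlier advice that frequently searched phrases should still be considered.

```python
def triage(in_content, in_user_research):
    """Suggest a review status for a candidate term (labels are illustrative)."""
    if in_content and in_user_research:
        return "likely include"        # supported by both sources
    if in_content:
        return "questionable"          # extracted, but not validated by users
    if in_user_research:
        return "consider"              # assumption: users ask for it, content is thin
    return "no evidence"

# Hypothetical evidence for candidate terms in a recipe taxonomy.
evidence = {
    "Nutrition Facts": (True, True),
    "Kitchen Remodeling": (True, False),
}
statuses = {term: triage(*flags) for term, flags in evidence.items()}
```

A rule like this only sorts candidates into groups for discussion. The final decision on each term still belongs to the people building the taxonomy.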

Taxonomy Structure

The initial taxonomy structure itself tends to impose limits on scope. Taxonomies tend to be hierarchical with a limited number of top terms. If a candidate term appears in the content that does not seem to belong anywhere in the current taxonomic hierarchy, you might be inclined to exclude it. Factors of user needs (they might want to look up this term in this content), however, should take precedence. For example, the term “COVID-19” might be marginal but still of interest to be included in many taxonomies on varied subjects, but there would exist no broader term for diseases in those taxonomies. Then adjustments need to be made, such as renaming or adding broader terms, or perhaps, more likely, the proposed term should be modified to fit the context of the taxonomy, such as becoming “COVID-19 impacts.”

Another thing to consider is adopting more of a thesaurus structure than a taxonomy structure, at least for the facet or concept scheme of the taxonomy that is for miscellaneous “topics.” One characteristic of thesauri is to not rely so heavily on extensive hierarchical trees. What this means is that you could decide that it is acceptable that not all terms have broader terms and thus it’s OK to have a very large number of top terms, with the more specific terms linked to other terms only by related-term relationships, another feature of thesauri, if not by broader/narrower-term relationships. Abandoning the full hierarchical tree structure should only be considered if this hierarchy is not displayed as navigation to the end users.
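To make the thesaurus-style structure concrete, here is a minimal Python sketch in which a term may have no broader term at all and is connected to the rest of the vocabulary only through related-term links. The `Term` class, the helper functions, and the recipe terms are assumptions for illustration, not the data model of any taxonomy management tool.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    broader: set = field(default_factory=set)   # labels of broader terms, may be empty
    related: set = field(default_factory=set)   # labels of related terms

def add_related(terms, a, b):
    """Related-term links are symmetric, so record them in both directions."""
    terms[a].related.add(b)
    terms[b].related.add(a)

def top_terms(terms):
    """Terms without any broader term are top terms."""
    return sorted(label for label, term in terms.items() if not term.broader)

# Hypothetical recipe vocabulary.
labels = ["Cooking methods", "Baking", "Ingredients", "Nutrition facts"]
terms = {label: Term(label) for label in labels}
terms["Baking"].broader.add("Cooking methods")

# "Nutrition facts" fits under no existing hierarchy, so it stays a top term
# and is linked by a related-term relationship instead.
add_related(terms, "Nutrition facts", "Ingredients")

print(top_terms(terms))
```

In this arrangement the number of top terms grows with every term that has no natural parent, which is acceptable only when the hierarchy is not shown to end users as navigation.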

Documenting Policy

In any case, you need to define policies regarding what kinds of terms can be added and what kinds should not. This will evolve out of the activity of building the taxonomy, especially from evaluating what extracted terms to approve and what search log terms to approve. Whoever is doing this task (hopefully more than one person) should document each instance of uncertainty. While many term approvals and rejections will be obvious, there will be a gray area. These uncertain cases should be collected and discussed together, and then a policy can emerge.

Friday, December 17, 2021

Named Entities in Taxonomies

I have long felt that there is some uncertainty as to where named entities (names of specific people, places, organizations, products, etc.) fit into taxonomies. Standards suggest one way, and practice tends to follow a different way in dealing with these proper nouns. As taxonomy trends evolve, so does the position on these named entities. The fact that taxonomies are not well-defined leaves it open to question as to whether taxonomies should have any named entities in them, or if taxonomies should comprise only topics.

Historical trends

A historical perspective is needed. Modern, digital information retrieval taxonomies evolved out of thesauri. Thesauri, which originally came out in print format, first appeared in the 1960s and then were formalized by various standards published in the 1970s. The thesaurus standards state clearly that the relationship between a named instance and its type is one of the three kinds of hierarchical relationships permitted and supported in thesauri (the other two being generic-specific and whole-part). While taxonomies may omit the associative (related term) relationship of thesauri, they tend to follow the hierarchical standards of thesauri. Thus, named entities could be included in the taxonomy as the narrowest terms, narrower to a term for whatever “type” they are. But should it always be this way?

Then faceted taxonomies started being implemented in the early 2000s, first in ecommerce and then by the end of the decade in intranets, content management systems, digital asset management systems, and various content-rich websites. Once facets became adopted in information retrieval applications (aside from ecommerce), it became obvious from a user design perspective that named entities belonged in a different facet than the subjects. Facets are for refining a complex search query by different aspects. Sometimes these aspects follow the types of questions: What? Who? Where? When? “What” is usually for a subject, but “who,” “where,” and “when” (for taxonomy terms naming events, not date ranges) refer to named entities. Sometimes people start a query about a subject, and sometimes people start a query about a named entity, and facets allow people to start off searching any way they wish.

Then in 2009 the World Wide Web Consortium published the Simple Knowledge Organization System (SKOS) recommendation for taxonomies, thesauri, and other controlled vocabularies, which over the following decade became adopted as the standard model for building machine-readable taxonomies. One of the elements described in SKOS is that of the concept scheme, which is defined merely as “an aggregation of one or more SKOS concepts.” There is nothing comparable in the thesaurus standards. While a taxonomist may choose what to do with an “aggregation” of concepts, it has proven practical to separate out different kinds of named entities into concept schemes separate from concept schemes for topics. Thus, the widespread adoption of SKOS has contributed to the trend of separating different named entity sets, which had already started with faceted taxonomies.
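As a small illustration of keeping named entities in their own concept schemes, the Python sketch below writes a few concepts out as SKOS statements in Turtle syntax, using the SKOS terms `skos:ConceptScheme`, `skos:Concept`, `skos:inScheme`, and `skos:prefLabel`. Everything else is an assumption for the example: the `ex:` namespace, the scheme names, the sample concepts, and the `local_name` helper that turns labels into identifiers. A real project would more likely use a taxonomy management tool or an RDF library than string formatting.

```python
PREFIXES = [
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
    "@prefix ex: <http://example.org/vocab/> .",  # hypothetical namespace
]

# One concept scheme for topics and separate schemes for named entity types.
schemes = {
    "Topics": ["Taxonomy development", "Metadata governance"],
    "Organizations": ["Special Libraries Association", "World Wide Web Consortium"],
    "Places": ["Charlotte"],
}

def local_name(label):
    """Build a simple identifier from a label (illustrative, not a URI policy)."""
    return "".join(ch for ch in label.title() if ch.isalnum())

def to_turtle(schemes):
    """Serialize each scheme and its concepts as SKOS statements in Turtle."""
    lines = PREFIXES + [""]
    for scheme, labels in schemes.items():
        scheme_id = f"ex:{local_name(scheme)}Scheme"
        lines.append(f'{scheme_id} a skos:ConceptScheme ; skos:prefLabel "{scheme}"@en .')
        for label in labels:
            lines.append(
                f'ex:{local_name(label)} a skos:Concept ; '
                f'skos:prefLabel "{label}"@en ; skos:inScheme {scheme_id} .'
            )
    return "\n".join(lines)

turtle = to_turtle(schemes)
print(turtle)
```

Because each concept states which scheme it belongs to, an application can offer “Organizations” and “Places” as separate facets or term lists without any change to the topic hierarchy.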

My initial, and longest, experience in the domain of taxonomies and controlled vocabularies was as a controlled vocabulary editor at the library database vendor Gale. At Gale (and its predecessor company), named entity controlled vocabularies (“name authorities”) have been separate from the subjects, but there were reasons for this. The named entities (named persons, companies, organizations and agencies, named works, products, laws, events, and fictional characters) each have had different sets of attributes and rules for maintenance. Some even have different customized relationships with other controlled vocabularies. Interestingly, it was not always this way. Before I joined in the mid-1990s, some of these named entities (agencies, organizations, works, geographics, and events) were mixed in with the “descriptors” in a Subject MegaFile. But eventually specific attributes and relations, not to mention the growing number of terms and a new vocabulary management system, combined to make it more logical to split off each of the named entity vocabularies. The Events were the last to be split out of the Subjects. So, it’s not because the controlled vocabularies were named entities per se, but rather their growing specialized maintenance needs due to an increase in specific attributes that led to managing them as separate controlled vocabularies. Attributes include, for example, birth date and place for a person, latitude and longitude for a location, and website URL and address for companies and organizations, among many more.

Taxonomies and ontologies

This feature of attributes brings us to the most recent trend in taxonomies, which is the occasional, but growing, convergence of taxonomies and ontologies. Ontologies divide up a knowledge domain into classes, and each class (like the Gale named-entity controlled vocabularies) has its own set of attributes and customized relationships with other classes. Ontologies, according to the Web Ontology Language (OWL) standard, however, have a different perspective on named entities. Ontologies comprise classes and subclasses, in hierarchies, which, in turn, contain “instances” or “individuals,” which are unique named entities. The relationship between an instance and a class (or subclass) is not, however, considered hierarchical, but rather of a “member” type. Thus, while thesauri make no distinction for named entities, and taxonomies separate out named entities when it’s practical, ontologies make a strict distinction.

Furthermore, for ontologies, which originated in the domains of philosophy and computer science, a named entity as a proper noun is not what matters. Rather, it’s the fact that the instance is unique, and there is only one. This is true for people, companies/organizations, and places. It is not true for brand name products, though. A named product is a proper noun, such as MacBook Pro or Honda Accord, but it is not a unique instance, because there are millions of individual MacBook Pros and Honda Accords in existence. It’s a similar matter for named works, such as books, where one title has millions of copies. “Named entities” or “proper nouns” are grammatical or linguistic designations, which are OK for taxonomies and thesauri, but are not a feature of ontologies, with their philosophical origins.

Fortunately, you don’t have to worry about this philosophical problem if you choose to follow the approach of applying a high-level ontology model to an existing taxonomy or set of controlled vocabularies to extend the ontology with specific terms and named entities (or, from the other direction, to extend the taxonomy with semantic relations and attributes). The OWL-based ontology then may comprise only as many classes and subclasses as are needed to designate the usage of distinct custom relations and attributes. With this approach, a different ontology class is mapped to each subset or hierarchy or SKOS concept scheme of a larger taxonomy. Each named entity type would typically correspond to a different ontology class, based on the named entity’s own attributes and relations. So, each named entity type would be in its own controlled vocabulary or SKOS concept scheme.
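The mapping of one ontology class to one concept scheme can be pictured with a small Python sketch. This is a simplified stand-in for an OWL model, not OWL itself: the `OntologyClass` structure, the class and scheme names, and the `missing_attributes` check are hypothetical. The attribute names follow the examples mentioned earlier (birth date and place for a person, latitude and longitude for a location, website URL and address for an organization).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OntologyClass:
    name: str             # ontology class, e.g. "Place"
    concept_scheme: str   # the concept scheme (controlled vocabulary) mapped to it
    attributes: tuple     # attributes that apply to members of this class

# Hypothetical high-level model: one class per concept scheme.
CLASSES = [
    OntologyClass("Person", "People", ("birth date", "birth place")),
    OntologyClass("Place", "Places", ("latitude", "longitude")),
    OntologyClass("Organization", "Organizations", ("website URL", "address")),
    OntologyClass("Topic", "Topics", ()),
]

def class_for_scheme(scheme, classes=CLASSES):
    """Look up the ontology class that a concept scheme is mapped to."""
    return {c.concept_scheme: c for c in classes}[scheme]

def missing_attributes(scheme, record, classes=CLASSES):
    """List the class attributes that a term record has not filled in yet."""
    owl_class = class_for_scheme(scheme, classes)
    return [attr for attr in owl_class.attributes if attr not in record]

# A place record with only one of its two attributes filled in (assumed data).
gaps = missing_attributes("Places", {"latitude": 35.23})
print(gaps)
```

The point of the sketch is that the attributes live on the class, so every term in a given concept scheme inherits the same expectations, while the topic scheme needs no attributes at all.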

The fact that OWL ontologies may include named instances as members of a subclass does not mean you have to set up your knowledge model that way. This is similar to the idea of the thesaurus standard, which permits named entities to be narrower terms to generic subjects, but you don’t have to set it up that way. Omitting an option described in the thesaurus or ontology standards does not mean you are not in compliance with those standards.

So, in conclusion, while some things about taxonomies have remained constant, other things, such as where to put named entities, have changed over time.

Saturday, September 26, 2020

Adjectives as Terms in Taxonomies

Taxonomies need not follow strict standards, but rather best practices. There are standards for thesauri (ANSI/NISO Z39.19 and ISO 25964), and as taxonomies are similar to thesauri, it’s a good idea to follow thesaurus standards for taxonomy design to the extent applicable. According to thesaurus standards, terms should be nouns or noun phrases, not verbs or adjectives. Similarly, taxonomies usually comprise terms of only nouns or noun phrases. An exception would be in a faceted taxonomy, where there is a facet for a kind of attribute or characteristic, such as color, and there the terms could be adjectives. In this sense, taxonomies are more flexible and have more applications than thesauri do.

Product taxonomies tend to have adjective terms for some of their facets/attributes, including color, size, style, type, status, etc. These kinds of adjectives are reasonably straightforward, although there may be nuances among colors and styles that are not generally known among the users of the taxonomy. It is rather other, descriptive adjectives that can be more challenging to include in a taxonomy because their meaning tends to be much more subjective than noun-based terms, and thus it’s difficult to tag/index consistently with them.

I recently did some work on a taxonomy where descriptive adjectives were included in an “attribute descriptor” term set or facet. This was a taxonomy for images, including photographs, illustrations and graphical design components. Adjective terms included Elegant, Formal, Funny, Ornate, Simple, Modern, Vintage, among others. I also filled in the role of tagging for a short period of time and found how subjective it was to tag with such adjective terms. I was not confident that I was tagging with such adjectives in a consistent manner. While the adjectives might have seemed like a good idea originally, they were not that practical compared to other components of the taxonomy. Fortunately, the attribute descriptors were not displayed to the user as a dynamic facet but rather supported search, so insufficiencies in adjective tagging were not so obvious.

A recent article in Vogue Business described how adjectives in fashion product ecommerce taxonomies are used, such as by Nordstrom, Rebag, and The Yes. These include terms such as Bright, Chic, Whimsical, Flowy, Billowy, Comfortable, etc. I wouldn’t want to try to tag with those. However, in these cases, the tagging was not manual but automated, using algorithms, hundreds of examples, and machine learning. While auto-categorization is not necessarily more correct than manual tagging, it is more consistent, and when it comes to the subjectivity of adjectives, the challenges are more around consistency than correctness. So, I can see that auto-categorization can be a solution to dealing with the challenges of adjective terms.

Now that it’s established that taxonomies can, in certain circumstances, contain adjective terms, the inclusion of adjectives in a taxonomy should be done with care. If you will have adjectives as terms, my recommendations are:

Keep them separate from other taxonomy terms, by having them in their own term list, vocabulary, or facet.
Ideally, keep the number of adjective terms limited to a few clearly distinguishable terms.
Expect to spend more time and possibly expertise in developing, editing, and maintaining adjective terms than noun-based terms.
Consider implementing auto-categorization (auto-tagging), if resources permit it.
Whether tagging is manual or automated, prepare multiple examples of assets/content items for each adjective term to demonstrate what is the appropriate content for tagging with each adjective.

A thesaurus is more specific than a taxonomy, as a thesaurus has terms for what content is about. A taxonomy has terms for what content is about but also for other aspects and attributes of content. Thus, a taxonomy may include adjectives, whereas a thesaurus does not. Adjective terms, however, should be created with care and special attention to how they will be used in tagging.

Sunday, August 23, 2020

Taxonomy Terms for Different End-Users

The names of taxonomy terms need to be understood by the taxonomy’s users, and all users need to share the same understanding of what the term means. Typically, a taxonomy has two fundamental sets of users: those who tag content with the taxonomy terms and those who retrieve content with the taxonomy terms, the end-users. The taggers can usually be supported by definitions or scope notes for the terms. The end-users rarely have access to such explanatory notes for terms, and even if they did, it would be in some inconvenient collection of documentation that very few end-users would find and read. Therefore, the terms should represent concepts that are obvious and intuitive to the end-users and need no explanation. To this end, it is important to understand the users’ perspective and the terms that they would likely use to describe concepts. User research is thus an important part of taxonomy design.

Some taxonomies have two different kinds of end-users, and this is where it can get more complicated. Examples include health information whose end-users include both healthcare providers and patients or their family members; published educational content whose tagging producers are publishers, but the end-users include both students and instructors; marketplace websites whose end-users include both sellers and buyers; and job search platforms whose end-users include both employers and job seekers. It is important in these cases that the different kinds of end-users have the same understanding of what a term means, but this is sometimes not the case.

Example: The Problem with “Entry Level”

I recently noticed an example of a taxonomy term in the case of job search platforms (LinkedIn, Glassdoor, Indeed, etc.) that seemed to be understood differently by employers and job seekers. There are several controlled vocabularies that can be used in the “advanced” (or faceted) job search features: Job type (Full-time, Part-time, Contract, Temporary, etc.), Location, Company, Industry, and Experience Level. I took an interest in Experience Level (also called Seniority Level), because I wanted to help identify additional jobs for my daughter, who had just graduated from college. So, I selected the filter for “Entry level.” The full set of options includes Internship, Entry level, Associate (in LinkedIn), Mid Senior Level, Director, and Executive.

Experience level options in Glassdoor, LinkedIn, and Indeed

I was dismayed to see so many jobs classified as “Entry level” requiring at least 2 years and sometimes as many as 5 years of experience. That is certainly not entry-level by the definition of a recent college graduate.

Then one day (after my daughter found a job) I noticed a job posting for a taxonomist that on LinkedIn was classified as Entry level. It required at least 2 years of experience designing and managing taxonomies. It was clearly not entry level for a fresh college graduate. This time, however, I was looking at the job differently. I was familiar with the employer, and it was clear that for the employer this was an entry-level professional position in their firm. Even though prior experience was expected, this was the most junior professional position available. So, apparently the human resources representative of the company considered it entry-level compared to other jobs they might hire for and classified it that way. It became obvious that employers and job-seekers do not use the same terms, such as “Entry level,” to mean the same thing.

One way to make the term Entry level clear, short of creating a definition or scope note that the users/end-users will never read, might be to replace it with two other terms, one for Recent grad and one for Junior associate, but the exact wording may still have drawbacks and requires more research. Simplicity and elegance may have to be sacrificed for clarity. This is just one of the many trade-offs to deal with when creating taxonomies.