Showing posts with label Taxonomy creation. Show all posts

Sunday, August 10, 2025

When to Design a New Taxonomy for a New System

Often organizations determine that a suitable time to adopt a new taxonomy is in conjunction with adopting a new system for its implementation, such as a content management system (CMS) or digital asset management system (DAM). They can budget taxonomy design and development services as part of the consulting services needed for the content migration and system implementation project, and they can improve and optimize the taxonomy for its new implementation and use.

There is the question of timing, though. Recently, a prospective consulting client asked me whether the new taxonomy should be developed prior to the selection and implementation of a new system or afterwards. Ideally, both the taxonomy project and the CMS or DAM adoption can happen simultaneously. However, the design and development of a taxonomy takes less time (typically 3-4 months) than the adoption of a new CMS or DAM. Altogether, a system selection, with a trial or a proof-of-concept project, implementation, data/content migration, and user training, can take 6-18 months.

Benefits of Taxonomy Development Prior to System Adoption

The primary benefit of developing a taxonomy prior to system adoption is that you can make it a system requirement that the new system supports the taxonomy that you have designed to best serve your users, your desired tagging method, and the nature of your content. These criteria should take precedence over designing a taxonomy to fit the requirements (or limitations) of a CMS or DAM.

Over time, your organization will adopt other systems, and the taxonomy should be suitable for multiple systems, rather than being system specific. Especially if you have an enterprise (enterprise-wide) taxonomy as your eventual goal, designing your ideal taxonomy first should be your approach. If one system cannot take advantage of all features of your taxonomy, another system may. There are also usually development work-arounds to get the full use out of your taxonomy.

Benefits of Taxonomy Development After System Adoption

A CMS or DAM has a variety of functions, and tagging and retrieval of content with a taxonomy is only one of those functions. Workflow management, rights management, authoring features (for CMS) and image/video editing features (for DAM) tend to matter more than taxonomy use among the requirements for a system. You can make “good support of taxonomy management and tagging” a requirement for your new CMS or DAM without getting into the specifics.

Adding features to a taxonomy (such as polyhierarchy, related-concept relationships, end-user scope notes, or different sets of synonyms/alternative labels for tagging and for searching) is a waste of time and resources if the system you later adopt does not support them. It’s better to wait until a system is selected and implemented before fully designing a taxonomy.

Iterative Taxonomy Design Approach

When implementing a new taxonomy with a new system, the ideal approach is to spread out the taxonomy design and development tasks over the phases of the system selection and implementation process.

You should consider basic taxonomy requirements early in the system selection process. To do this, you might categorize different taxonomy support features as essential and nice-to-have. The method of tagging (automated, manual, automated with human review, or a mix) needs to be determined as both a system requirement and as a factor in the design of the taxonomy.
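As a rough illustration of sorting taxonomy support features into essential and nice-to-have, here is a minimal Python sketch. The feature names, the split between the two groups, and the candidate systems are all hypothetical, chosen only to show the idea; a real evaluation would use your own requirements.

```python
# Hypothetical requirement sets; the names and the split are assumptions.
ESSENTIAL = {"hierarchy", "alternative_labels", "manual_tagging"}
NICE_TO_HAVE = {"polyhierarchy", "related_concepts", "scope_notes"}


def evaluate(system_features):
    """Return (meets_essentials, missing_essentials, nice_to_have_count)."""
    missing = ESSENTIAL - system_features
    extras = NICE_TO_HAVE & system_features
    return (not missing, sorted(missing), len(extras))


# Hypothetical candidate systems and the features each one supports.
candidates = {
    "System A": {"hierarchy", "alternative_labels", "manual_tagging", "polyhierarchy"},
    "System B": {"hierarchy", "manual_tagging", "related_concepts", "scope_notes"},
}

for name, features in candidates.items():
    ok, missing, extras = evaluate(features)
    status = "meets essentials" if ok else f"missing {missing}"
    print(f"{name}: {status}, {extras} nice-to-have feature(s)")
```

A system that lacks an essential feature is flagged regardless of how many nice-to-have features it offers, which mirrors treating the essentials as hard requirements.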

Then during the lengthy process of system testing and selection, information-gathering work for the taxonomy may take place. This involves stakeholder interviews, user focus groups or brainstorming sessions, content analysis, and review of existing/legacy taxonomies and other controlled vocabularies. Draft versions of portions of the taxonomy, without all features, may be created and reviewed, prior to the system selection decision.

After the CMS or DAM is selected and is in the process of being implemented, the taxonomy design can be refined with features that the new system can support, and then the taxonomy can be fully built out. The new taxonomy can also be tested in the new system for its suitability for tagging and retrieval, with final enhancements made based on the test results. The documentation of the taxonomy, including guidelines for its maintenance (a governance plan), should be started early in the taxonomy design process, but additional system-specific documentation is created after the new system is implemented.

Monday, March 31, 2025

Customizing Taxonomy Hierarchies

Taxonomies need to be custom-created for their purposes to be most effective. Basically, a taxonomy comprises the concepts or terms that reflect the subject domain of the content that will be tagged and retrieved with the aid of that taxonomy. Taxonomies must also be customized to the requirements (or limitations) of the implemented search technology and the user interface, and ideally the taxonomy is also customized to the needs and preferences of the users. This includes taxonomy design aspects of size, degree of detail, use of synonyms/variants, use of hierarchy, and implementation as facets.

Taxonomy customization usually focuses on the concepts/terms/labels and not so much on the exact hierarchy of grouping narrower concepts under broader concepts, other than perhaps limiting the number of hierarchical levels. While the selection and definition of concepts depends on the context of the content, the hierarchical relationships between concepts are typically independent of any specific content and are usually dependent only on the context of the taxonomy itself. Such a context-independent hierarchy is what enables a single taxonomy to be used for multiple different content items of different content creators. This is also the approach used in designing classification systems, which are intended for broad, generic use.
 

Why Customize Hierarchy

However, a customized taxonomy may be designed for a rather specific body of content, and then the hierarchy may depend on the context of that overall body of content, if not the specific content items. For example, the concept “Piano” is often considered narrower to “Musical instruments”, but in certain contexts it may be narrower to “Furniture,” such as for the contexts of interior design, furnishing a bar or restaurant, or for moving and storage services. Furthermore, I would not always recommend that “Piano” be narrower to both broader concepts in the same taxonomy (a taxonomy feature known as “polyhierarchy”), because the same taxonomy might not be used for both contexts. It depends.


When structuring a taxonomy hierarchy, the use and purpose of the hierarchy needs to be considered. A hierarchy is not created simply because it’s a taxonomy and thus traditionally has hierarchy. Possible uses of hierarchy include:

  • Supporting browsing and navigation to guide users to the desired concept.
  • Providing context for concepts to support tagging, whether manual or automated.
  • Enabling “recursive” or “rolled up” retrieval, so that a user’s selection of a concept retrieves not only what has been tagged to that concept but also what has been tagged to all of its narrower concepts.
  • Enabling expansion of a search, so that if there are too few or no results for a specific concept, the retrieval set can be expanded to content tagged with the broader concept and/or other narrower concepts of it.
  • Instructing users on the appropriate classification and organization of information.

Usually, the same hierarchy can support all of the above goals, although occasionally there are conflicting needs.
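The “rolled up” retrieval and search expansion uses listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concepts, the tagged items, and the function names are made up, each concept has at most one broader concept (no polyhierarchy), and a real search system would implement this in its own index rather than in dictionaries.

```python
from collections import defaultdict

# Narrower -> broader links; concept names are made up for illustration.
BROADER = {
    "Piano": "Keyboard instruments",
    "Organ": "Keyboard instruments",
    "Keyboard instruments": "Musical instruments",
    "Guitar": "Musical instruments",
}

# Content item -> concepts it was tagged with (hypothetical data).
TAGS = {
    "doc1": {"Piano"},
    "doc2": {"Guitar"},
    "doc3": {"Keyboard instruments"},
}

# Invert the links so each broader concept knows its narrower concepts.
NARROWER = defaultdict(set)
for child, parent in BROADER.items():
    NARROWER[parent].add(child)


def descendants(concept):
    """Return all narrower concepts of `concept`, at any depth."""
    found = set()
    stack = [concept]
    while stack:
        for child in NARROWER.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def retrieve(concept, rolled_up=True):
    """Items tagged with the concept, plus its narrower concepts if rolled up."""
    wanted = {concept} | (descendants(concept) if rolled_up else set())
    return sorted(item for item, tags in TAGS.items() if tags & wanted)


def retrieve_with_expansion(concept, min_results=1):
    """Move up to the broader concept while there are too few results."""
    results = retrieve(concept)
    while len(results) < min_results and concept in BROADER:
        concept = BROADER[concept]
        results = retrieve(concept)
    return concept, results


print(retrieve("Musical instruments"))                   # rolled-up retrieval
print(retrieve("Musical instruments", rolled_up=False))  # exact concept only
print(retrieve_with_expansion("Organ"))                  # expands to the broader concept
```

Nothing is tagged directly with “Musical instruments” or “Organ” in this sample data, so the exact-concept lookup returns nothing, while the rolled-up and expanded lookups still find related items.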

Customizing Hierarchy Example

The need for customizing hierarchy became especially clear to me in a recent taxonomy consulting project I did for the business of event venue space rentals. Types of spaces (structures, rooms, etc.) were grouped under broader concepts by their potential use, rather than by structural type. To a lesser extent, events or activities for spaces were also sometimes grouped by the type of space that might be suitable. For example, a generic taxonomy might include “Dance class” and “Technical training” both under the same broader concept for “Classes/training,” but because these different types of classes need different kinds of spaces, in this taxonomy they were put in different parts of the taxonomy hierarchy. “Dance class” was made narrower to “Dance event,” and “Technical training” was made narrower to “Training.”

The hierarchy of concepts used in a taxonomy to tag images may also be structured differently than a taxonomy for tagging text content. In this case, for example, broader concepts for grouping others had been created of “Small meeting” and “Large event,” which may not seem logically needed when the range in number of guests was an additional search attribute/filter. However, these concepts are quite useful for tagging images that may depict a small or large event but do not utilize counts of people. Another example is grouping the activities of music rehearsals/practices together with music performance events under the same broader concept of “Music events.” Although the activities of organizing rehearsals and organizing performances are quite different from each other, the venues that are suitable for each and their images are similar.

Despite their similarities in scope and concepts, a taxonomy for venue rentals should not be the same as a taxonomy for real estate of long-term lease or sale of properties (focusing on the space but agnostic to the use), nor for events management (focusing on the details of events and less so on space), nor equipment sales and rentals (focusing on the equipment and less on the use). Even when the concepts are the same, the hierarchy may differ. While the inclusion of concepts and their labels should consider the content, the design of the hierarchy should consider the taxonomy’s use.

Sunday, January 14, 2024

Learning to Create Taxonomies

Knowledge of what taxonomies are, what they are for, and how they are used is quite widespread, even if there are uncertainties and disagreements around the definition of “taxonomy.” People who often look up digital information are familiar with various presentations of taxonomies for selecting terms linked to content. These include hierarchical trees of topic and subtopics to browse, scroll boxes of controlled terms, type-ahead or search-suggest terms that appear below a search box after the first few letters are typed into the box, and terms or named entities grouped by various aspect types (facets) in the left margin to select from in order to limit/refine/filter search results.

Why Learn Taxonomy Creation

There is a big difference, however, between being able to use taxonomies and being able to create taxonomies.

While it is usually best to leave taxonomy creation to the experts, taxonomists are not always available, or the needed taxonomy may be small or apparently “simple,” so it may not be economical to hire a contract taxonomist or a consultant. In other situations, the taxonomy subject may be quite technical, and it would seem preferable to have subject matter experts, rather than an external taxonomist, create the taxonomy.  Thus, people who are not professional taxonomists often create taxonomies.

Generative AI now makes it easier for anyone to “generate” a taxonomy. However, the knowledge of taxonomy principles is needed to make necessary corrections and edit the taxonomy to achieve a decent level of quality. Generative AI should not be used to fully create a taxonomy (which could in fact be extracting published taxonomies, violating their copyright), but rather it may be used as a tool to facilitate parts of the taxonomy creation process. (See my post “Taxonomies and ChatGPT.”) The technology thus makes it easier to create taxonomies for those who are not taxonomists and have limited time for taxonomy creation tasks.

There is also the matter of taxonomy maintenance. After a contract taxonomist or consultant creates a taxonomy and leaves, the taxonomy still needs to be kept up to date, with new concepts added and others changed, and over time expanded. While documentation and guidelines written by a taxonomy consultant are helpful, a good understanding of taxonomy creation principles is also needed by anyone responsible for expanding or maintaining a taxonomy.

Finally, taxonomy creation is a collaborative effort, involving stakeholders in various roles (project management, content management, digital asset management, information technology, tagging, research, user experience, search, etc.) who are invited to contribute their perspectives. Stakeholders can provide better insights into a taxonomy if they have a better understanding of taxonomy principles. Taxonomy project managers in particular need to understand taxonomy creation even if they are not doing the actual taxonomy creation work.

How to Learn Taxonomy Creation

Fortunately, there are many resources to learn the principles and standards of taxonomy design and creation. There is, of course, my book, The Accidental Taxonomist, which, as the name implies, is intended for anyone who finds themselves, perhaps by “accident,” in a position that requires them to create, edit, or manage taxonomies.

Heather Hedden delivering a taxonomy workshop

There are also various half-day and full-day workshops at conferences, virtual short courses through professional associations and other organizations, and asynchronous online training. These usually involve some exercises for practice and provide the appropriate amount of training for getting started with creating taxonomies. I’ve offered various kinds of training, both independently and through other organizations, over the years. My current course offerings are on my website.

Upcoming Taxonomy Courses

The next live (virtual) course I will offer is a new course called “Controlled Vocabularies and Taxonomies,” offered through HS Events on GoToWebinar over four weekly sessions from February 29 through March 27. I will teach this course live (with ample time for Q&A) just once, after which it will become available as a recording for purchase.

HS (Henry Stewart) Events are best known for their dominance in the field of digital asset management (DAM), but the course I will teach is not limited to DAM professionals. Actually, this course is most appropriate for the expanding scope of HS Events, which will introduce a Semantic Data conference event, which includes the subject of taxonomies, co-located with its DAM conferences in London and in New York in 2024.

The first session is an introduction to the definitions, types, uses, benefits, and standards for taxonomies. The second deals with the project management side of planning and researching for creating controlled vocabularies and taxonomies. The third session gets into the details of creating terms and relationships. Finally, the fourth session takes up design and implementation issues. After this course takes place, the recordings will be available for purchase for on-demand viewing.

Then in June, I will be teaching a three-part weekly course, “Taxonomy Creation for Content Tagging,” through the Society for Technical Communication (STC). The focus is on taxonomies to make documents/documentation more findable, but the course is also suitable for anyone interested in learning how to create taxonomies. It will be offered on Zoom on Thursday afternoons, 4:00 – 5:30 pm EDT, June 11, 18, and 25, and the Moodle learning management system will be used for additional asynchronous discussion and access to resources. Interactive exercises and live Q&A are included. I taught this course for the first time last year, but due to my increasingly busy consulting work schedule, I do not plan to teach this course again after this June. More details are on the Interactive Virtual Taxonomy Workshop page of my website.

In the future, check for my current training offerings on the Taxonomy Courses & Workshops page of my Hedden Information Management website.

Sunday, December 31, 2023

IT and Taxonomies

Taxonomies are related to many fields of work, including knowledge management, information architecture, website design, website marketing and SEO, document management, terminology management, publishing, product management (for information products), content management and strategy, digital asset management, machine learning for classification, natural language processing for auto-tagging, data management, library and information management, and information technology. Information technology is relevant to the implementation of all taxonomies.

Why is IT involved in taxonomies?

Taxonomies link users to content (and taxonomies extended into ontologies also link users to data), but this linking relies on technology. The technology could be a kind of software, such as a content management system that supports the tagging and retrieval of content by taxonomies along with the feature of taxonomy management. Often, however, additional technology is needed to link multiple software systems together, with APIs, and to move data across systems, with extract-transform-load (ETL) tools. Taxonomies are increasingly built in the SKOS (Simple Knowledge Organization System) standard/data model, which enables taxonomies and other knowledge organization systems to be machine-readable and not just human readable.
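To make “machine-readable” a little more concrete, here is a minimal Python sketch that writes two concepts as SKOS in the Turtle serialization. The SKOS terms used (skos:Concept, skos:prefLabel, skos:altLabel, skos:broader) are part of the standard; the example.org namespace, the concept identifiers, and the to_turtle helper are hypothetical. The string building is deliberately naive (it does not escape quotation marks in labels), so a real project would more likely use an RDF library or a taxonomy management tool.

```python
# Hypothetical namespace and concepts, for illustration only.
BASE = "http://example.org/taxonomy/"

concepts = [
    {"id": "instruments", "pref": "Musical instruments", "alt": [], "broader": None},
    {"id": "piano", "pref": "Piano", "alt": ["Pianoforte"], "broader": "instruments"},
]


def to_turtle(concepts):
    """Serialize concept dicts as SKOS Turtle (naive: labels are not escaped)."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix ex: <{BASE}> .",
        "",
    ]
    for c in concepts:
        props = ["a skos:Concept", f'skos:prefLabel "{c["pref"]}"@en']
        props += [f'skos:altLabel "{label}"@en' for label in c["alt"]]
        if c["broader"]:
            props.append(f'skos:broader ex:{c["broader"]}')
        # Join the property statements with ";" and close the subject with "."
        lines.append(f'ex:{c["id"]} ' + " ;\n    ".join(props) + " .")
    return "\n".join(lines)


print(to_turtle(concepts))
```

The output is plain text that people can read, but because it follows the SKOS data model, software can also load it and follow the broader/narrower links.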

Taxonomies are a concern of information technology professionals as they are the owners of, and often also the developers of, the systems in which taxonomies are implemented. The systems could be completely internally developed, or they could be licensed software that typically requires some customization or integration with other systems. In my experience as a taxonomy consultant, I have typically engaged in conversations with those in IT as key stakeholders of the taxonomy. However, the degree of the involvement of IT professionals in the taxonomy itself can vary.

In custom taxonomy implementations, such as in an information service/product or in an ecommerce business, IT professionals are usually not involved in the actual design of the taxonomy, but taxonomists or others who create that taxonomy need to collaborate with IT professionals to understand the system’s capabilities and limitations, and the IT professionals may impose requirements. Taxonomists are concerned with how the taxonomy will be displayed to the users, how the users can interact with the taxonomy, how tagging is done, and how the search functions. Custom software development has great flexibility in how it supports a taxonomy.

In implementations of taxonomies in licensed software, there may still be some development work for the IT professionals, but there are limits to what can be done or changed.

Commercial content management systems (CMS) that allow for the custom development of the user interface, referred to as “headless” CMSs, however, are becoming more common. The user interface in particular is very significant to how a taxonomy is designed and how it functions.

Who in IT is involved in taxonomies?

Those who work in IT departments with involvement in taxonomies could be in roles doing development or support for systems that manage and consume taxonomies, or they could be in systems integration roles. Additionally, there are taxonomy/metadata/ontology specialists who work within the IT department of an enterprise, especially if a knowledge/information management department does not exist in the organization.

In a survey of taxonomists I conducted in January 2022 for the 3rd edition of The Accidental Taxonomist, a multiple-choice question asked the 162 respondents who do taxonomy work for their employers (rather than for consultancies creating taxonomies for others) what area they work in. Information technology ranked 4th out of 11 choices, with 17% of the responses, following the areas of knowledge management, content management/strategy, and product development/management, yet ahead of the specialties of library, user experience, marketing, and others.

The survey also asked all respondents to provide their job titles, and some of those working in taxonomies have job titles that are closely associated with information technology. These included titles of IT Data Analyst, Data and Technology Platform Products, SharePoint Product Owner, Senior Solutions Consultant, Implementation Project Manager, Data Architect, Senior Manager - Graph Solutions, Enterprise Architect, Staff Engineer - Systems, Information Governance Engineer, Head of Technical Services, and Director of Solutions Delivery.

What does IT do with taxonomies?

From my experience as a taxonomy consultant, I have observed that those working in IT, in their efforts to facilitate the adoption of new software and features that make use of taxonomies, may include starter taxonomies within the tool, whether selected from the offerings of the software vendor or created by the IT staff themselves. For example, IT professionals might create simple controlled vocabularies in the SharePoint term store, such as for document types, departments, locations, etc., so that users can start using the search refinements right away. This also provides an example of the functionality of a taxonomy, which can be improved upon and expanded by someone else later.

Then there is enterprise taxonomy/ontology management software, which should be connected to search systems, content management systems, and tagging systems (if not using a tagging module of the taxonomy management system). In my experience working for a taxonomy software vendor, the IT department was often involved in the software purchasing process, if not actually leading the decision-making. Representatives from the IT department attend pre-sales demos of the tool, ask questions, and compile and compare system requirements when requesting a proposal.

That taxonomy is actually an area of concern for IT was also made clear when I saw that taxonomies were mentioned in a section within a chapter on knowledge management-related systems in my son’s introductory Management Information Systems textbook for a required course for his B.S. in Information Technology.

In sum, IT professionals who support enterprise knowledge or information management systems need to have a basic understanding of taxonomy principles, standards, benefits, and uses. My website contains various taxonomy resources. Some IT professionals may even want to go further and design and create small taxonomies (lacking the time to create large taxonomies), and they may want to read my book or attend my workshops or online courses.

Thursday, November 30, 2023

Generative AI at Taxonomy Boot Camp Conference

Generative AI and large language models (LLMs), the technology behind ChatGPT, have been topics of presentations, keynotes, and attendees’ conversations at all the varied conferences I had the fortune to attend this year, including the Taxonomy Boot Camp conference held November 6-7, in Washington, DC. Taxonomy Boot Camp is the only conference dedicated to taxonomies.

Opening and Keynotes

 

Right from the beginning in the opening welcome, the conference chair Stephanie Lemieux mentioned uses of ChatGPT for taxonomy creation, such as asking prompts: What is a category for the following list of terms?, What label for a concept might be better for scientists, or better for parents?, and What are alternative labels for a specific content? It has become clear that generative AI is a tool to assist taxonomists with specific tasks of a project but is not appropriate for automating the entire creation of a taxonomy. Thus, the Taxonomy Boot Camp theme this year, “Humans in the Loop,” was quite apt for the new era of generative AI, even if not specific to it.

 

The Taxonomy Boot Camp opening keynote, “Ontologies in the New Age of AI,” by Dean Allemang, was on this subject. Dean is more of an ontologist than a taxonomist, hence the title, but he discussed both taxonomies and ontologies. Allemang made the statement that generative AI “understands” why we need a taxonomy (even if managers do not). He explained that Schema.org has put RDF on many websites, which ChatGPT “reads.” Allemang has found that ChatGPT also performs perfectly on queries in SPARQL, the query language for data in RDF, including taxonomies. Allemang gave ChatGPT query examples, such as “Return all the claims we have by claim number, open date, and close date,” and “What is the total loss of each policy where loss is the sum of loss payment, loss reserve, expense payment, and expense reserve amount?” Allemang advised taxonomists to identify uses for taxonomies that have not been fully delivered on and use generative AI to deliver them, and if people argue that generative AI does not understand their language, taxonomists should build in a link to the taxonomy that makes generative AI understand it.

 

On the second day, Taxonomy Boot Camp registrants attend the same shared keynote presentations as all of the KMWorld co-located conferences, and this year these mostly dealt with generative AI, including the opening keynote by Dion Hinchcliffe, “Tech-Driven Enterprise Thrills & Chills: The Future of Work.”


Regular Sessions

In addition to being mentioned in various talks, generative AI was also the subject of a session, “ChatGPT, Taxonomist: Opportunities & Challenges in AI-Assisted Taxonomy Development,” which comprised two separate presentations.

In this session, Xia Lin presented “Chat GPT and Generative AI for Taxonomy Development,” in which he discussed the steps involved in using ChatGPT in two case studies. In one, a taxonomy for data analytics projects of a small business was developed by providing ChatGPT with the scope of the first level of the taxonomy and then asking ChatGPT to expand individual categories by adding subcategories and then to add definitions of terms and categories. The results were reviewed and revised by experts. But Lin did not stop there. He showed the results of asking ChatGPT to provide stakeholder interview questions around a category, and (for those more technically inclined) how to create a ChatGPT plug-in for various defined functions of taxonomy creation, using ChatGPT’s APIs.

Also in “ChatGPT and Generative AI for Taxonomy Development,” Marjorie Hlava and Heather Kotula jointly presented on issues of using ChatGPT to create taxonomies and in general. They explained the risks of bias, plagiarism, ethics, data quality, matching the generated taxonomy to the content, and the amplification of errors upon repeating a prompt. Regarding plagiarism, for example, if you ask ChatGPT to return a complete taxonomy on a subject domain, it may return a copyrighted taxonomy that cannot be reused without a license.

Generative AI also impacts the topics of other presentations. For example, in the presentation “In Taxonomy We Trust: Building Buy-In for Taxonomy Projects,” Bonnie Griffin mentioned the importance of “continually re-introducing the value of taxonomy, as generative AI captures attention.” It was also the subject of a debate question in the somewhat humorous closing session, “Taxonomy Showdown—Point/Counterpoint With Taxonomy Experts.”

 

More on Taxonomies and AI

Of course, there is more to AI than just generative AI. Other sessions dealt with machine learning for auto-categorization. These included presentations by Bob Kasenchak and Rachael Maddison in the session “Machine Learning Is Coming for Your Taxonomy” (link to Bob’s slides) and Wytze Vlietstra’s presentation “Vision for Modular Taxonomy Product at Elsevier,” in which the program included “shared infrastructure supported by AI-based decision support tools.” In fact, AI has been a theme of Taxonomy Boot Camp in the past, in 2018. It is generative AI based on large language models that is new.

For some more details on how this technology may be used for taxonomy development, see my prior blog post from this spring, “Taxonomies and ChatGPT.” To get another perspective on this conference, check out the recent blog post by Taxonomy Boot Camp speaker Mary Katherine Barnes, “Integrating AI: Insights from KMWorld 2023.”

Monday, May 29, 2023

Taxonomies and ChatGPT

ChatGPT, generative AI, and large language models (LLMs) are hot topics of interest in fields of data, information, and knowledge management. LLMs dominated the keynote presentations and the networking conversations at the Knowledge Graph Conference in New York and were also discussed in presentations and panels of this conference and Data Summit in Boston, both of which I attended this month. The technology is relevant to taxonomies as well.

ChatGPT is the user interface application on top of GPT (Generative Pre-Trained Transformer), a publicly available LLM developed by OpenAI, which is now in version 4. ChatGPT is thus a form of generative AI, in how it generates answers. There are many other LLMs (Neural network-based AI, trained with deep learning on very large volumes of text), including those which are proprietary, restricted, or for non-commercial research, but only some have generative AI user interfaces. Although we may think of generative AI for providing answers to questions, it can do a lot more, including tasks related to taxonomies.

Organizing terms into hierarchies

Building a taxonomy is a combination of top-down design (identifying the top concepts or facets) and bottom-up building (identifying specific concepts from content analysis). The top level of a taxonomy is designed to serve user needs and thus should be based on stakeholder interviews, surveys, and brainstorming workshops, which is not something ChatGPT can do. The bottom-up building of a taxonomy, based on terms extracted from content or on search log terms, may benefit from some AI involvement.

I have made a few test requests of ChatGPT for “Put the following list of terms into a hierarchical taxonomy…,” and the results are bulleted lists with indented narrower concepts. ChatGPT can also generate a taxonomy as machine-readable SKOS in a requested RDF serialization format, as Bob DuCharme explained in his May 20 blog post “Getting ChatGPT to turn a flat vocabulary list into a hierarchical taxonomy.”

As with card sorting exercises, you can specify the top categories/concepts (like a “closed card sort”), or you can let ChatGPT create the top categories (like an “open card sort”). In any case, results are better with context, of course, so you should also tell ChatGPT what the subject domain or context is. Asking for a hierarchical taxonomy results in a third level of hierarchy sometimes, and not just a single level of grouping. Near duplicates usually appear next to each other in the list, and the taxonomist can then decide if and how to merge them into a single concept.

It is particularly for long lists of terms that automated methods can save the taxonomist’s time. If a taxonomist comes up with terms based on manual content analysis, stakeholder interviews, or submitted lists from subject matter experts, the term lists tend not to be very long, and even the process of coming up with the terms tends to include some thoughts toward categorization at the same time. Longer term lists (such as several hundred terms) are derived from automated term extraction (using text analytics technologies) across a corpus of dozens or hundreds of documents and from search log reports. ChatGPT is practical for putting these long lists of terms into draft hierarchies. There are inevitably some taxonomic errors in the results, which should be obvious to any taxonomist. For example, I have seen duplicated terms on different levels of the hierarchy.
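For a draft hierarchy of several hundred terms, a small script can help spot repeated terms before the manual review. The following Python sketch assumes the draft was saved as plain text with “- ” bullets and two spaces of indentation per level; that layout, the sample draft, and the function names are assumptions for illustration, not a description of ChatGPT’s actual output format.

```python
from collections import defaultdict


def parse_outline(text, indent=2):
    """Yield (level, term) for each '- ' bulleted line of an indented outline."""
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        if not stripped.startswith("- "):
            continue  # skip blank lines and anything that is not a bullet
        level = (len(line) - len(stripped)) // indent
        yield level, stripped[2:].strip()


def find_duplicates(text):
    """Map each term that occurs more than once (ignoring case) to its levels."""
    seen = defaultdict(list)
    for level, term in parse_outline(text):
        seen[term.lower()].append(level)
    return {term: levels for term, levels in seen.items() if len(levels) > 1}


# Hypothetical draft hierarchy with "Piano" repeated on two levels.
draft = """\
- Instruments
  - Keyboard instruments
    - Piano
  - Piano
- Furniture
"""

print(find_duplicates(draft))
```

The script only flags candidates. Whether a repeated term is an error or an intentional polyhierarchy is still a decision for the taxonomist.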

In both lists of extracted terms and search log lists, terms occur that are not suitable as concepts for a taxonomy, such as verbs and adjectives or vague words. ChatGPT understands grammatical rules, so my prompt also says “Include in the taxonomy only nouns and noun phrases and omit the other terms.”
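The prompt elements discussed so far (the request itself, the subject domain, optional top categories, and the noun-phrase instruction) can be assembled consistently with a small helper. This is only a sketch: the function name, its parameters, and the sample domain and terms are assumptions for illustration, and the resulting text would still have to be pasted or sent to ChatGPT separately.

```python
def build_taxonomy_prompt(terms, domain, top_categories=None):
    """Assemble a prompt asking an LLM to arrange terms into a draft hierarchy."""
    lines = [
        "Put the following list of terms into a hierarchical taxonomy "
        f"for the subject domain of {domain}.",
        "Include in the taxonomy only nouns and noun phrases and omit the other terms.",
    ]
    if top_categories:
        # Like a "closed card sort": the taxonomist fixes the top level.
        lines.append("Use only these top categories: " + "; ".join(top_categories) + ".")
    else:
        # Like an "open card sort": the model proposes the top level.
        lines.append("Create suitable top categories yourself.")
    lines.append("Terms:")
    lines.extend(f"- {term}" for term in terms)
    return "\n".join(lines)


# Hypothetical extracted terms, including a verb and an adjective to be omitted.
prompt = build_taxonomy_prompt(
    ["solar panels", "wind turbines", "install", "efficient"],
    domain="renewable energy",
    top_categories=["Technologies", "Policies"],
)
print(prompt)
```

Keeping the prompt in a function makes it easier to submit the same request with slight variations and compare the draft hierarchies.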

Generating alternative labels (“synonyms”) for concepts

Asking ChatGPT to “provide a list of synonyms for…” a given term can also be helpful for coming up with alternative labels for taxonomy concepts. Alternative labels should be customized for the context of the content and users, so alternative labels for a concept will vary from one taxonomy to another. An external source, such as ChatGPT, should therefore not be relied upon as the only source for alternative labels, but merely as a supplemental source of suggestions to be considered.

Again, context can help and should be provided. I asked “Provide a list of synonyms for ‘healthcare’” and got 20 terms. But then when I asked “Provide a list of synonyms for health care, meaning the industry,” I received a slightly more focused list of 15 terms. Interestingly, the two-word variant “health care” was not on the list, so “synonyms” is understood by ChatGPT to mean different words with the same meaning and not orthographic variations. Nevertheless, even 15 terms are too many, and the taxonomist should select from the list of suggestions. It might be a good idea to then test-search the suggested alternative labels in the content and system being used.
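Where the content is available as plain text, the test search can be approximated with a short script before trying the labels in the actual system. The sketch below assumes a small in-memory list of documents; the function name, the candidate labels, and the sample sentences are hypothetical, and a real search engine may match differently (stemming, phrase handling, and so on).

```python
import re


def label_hits(candidates, documents):
    """Count how many documents mention each candidate alternative label."""
    hits = {}
    for label in candidates:
        # Whole-phrase, case-insensitive match.
        pattern = re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
        hits[label] = sum(1 for doc in documents if pattern.search(doc))
    return hits


# Hypothetical content sample and suggested alternative labels.
documents = [
    "The health care industry is growing.",
    "Medical services were expanded in rural areas.",
    "Spending on Health Care rose.",
]
candidates = ["health care", "medical services", "health sector"]

print(label_hits(candidates, documents))
```

Suggested labels with no hits in the content are weaker candidates, although they may still be worth keeping if users are known to search for them.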

Although by strict definition a “synonym” is a single word with the same meaning as another word, ChatGPT provides acceptable synonyms for terms which are multi-word phrases, or synonymous multi-word phrases, such as “Chemical manufacturing and distribution” provided as a synonym for “chemical industry.”


Other taxonomy-related uses of ChatGPT

Getting help in designing an ontology (a more complex, yet high-level semantic model with defined classes of concepts, customized relationships, and attributes) is also possible with ChatGPT or other LLMs. Again, submitting the request multiple times with slight variations will yield multiple different responses for the ontologist to consider and select ideas from. Ontologies are not expressed in simple text, though, so the prompt should specify the output format, such as RDF Turtle (TTL). Dean Allemang, author of Semantic Web for the Working Ontologist, has recently written multiple articles (medium.com/@dallemang) on ChatGPT and ontologies/knowledge graphs.

ChatGPT can also be used for comparing lists of terms, data conversion, and basic coding, which may be useful for taxonomists who lack coding skills. It can convert taxonomy or ontology data from one data format to another (although taxonomy/ontology management software also imports/exports in multiple formats). Taxonomies and ontologies in their raw data format are most commonly expressed in the RDF (Resource Description Framework) data model, which has various serialization formats: RDF/XML, JSON, JSON-LD, Turtle (.ttl), etc., and ChatGPT can convert data from one to another. Data extraction can also be done with ChatGPT. For example, knowledge management professional Camille Mathieu recently shared in a LinkedIn post how she used ChatGPT to write a Python script to extract text and metadata from PDFs.

Perhaps what is most intriguing as a future implementation of taxonomies and ChatGPT is to go in the other direction and have knowledge organization systems, such as taxonomies, support the creation and use of queries (also called “prompts”) for generative AI, to obtain better results. This requires some back-end development, though, and is not merely a matter of putting a taxonomy into a prompt. Since a taxonomy is created for a specific subject domain, the questions need to be confined to the domain of the taxonomy. Semantic Web Company has developed a simple publicly accessible demo, “PoolParty Meets Chat GPT,” whereby you can compare the results of questions you ask in the subject area of ESG (Environmental, Social, and Governance) that are submitted directly to ChatGPT with those that are filtered through an ESG taxonomy and knowledge graph (managed in PoolParty software) so that the questions are enriched before being sent to ChatGPT. The semantically enriched questions generate answers that have more detail, better accuracy, and even web links to definitions and other articles.

Conclusions

While it’s arguable whether ChatGPT alone is a good way to obtain “facts,” there is no doubt that it is a good way to get suggestions and ideas. These suggestions can support the work of taxonomists and ontologists, and taxonomies and ontologies in turn can support the results of ChatGPT and other LLMs. Because there will be errors from ChatGPT, it should not be used to generate taxonomies by those who are not already knowledgeable about taxonomy requirements and best practices, nor should it be used as a substitute for the expertise of taxonomists.

I hope to experiment more with ChatGPT for taxonomies and share additional details in future blog posts.

Saturday, February 25, 2023

Related Concepts in Taxonomies

Related concepts in a taxonomy
A and B are related; C and D are related.
Taxonomies and thesauri are characterized by having hierarchical relationships linking their terms. The associative relationship (or related concept, Related Term, or RT), on the other hand, is a fundamental feature of thesauri, but it is merely an optional feature of taxonomies. 

It is an oversimplification to distinguish taxonomies from thesauri merely by the presence of associative relationships, because taxonomies can have associative relationships, and there are other structural design differences between taxonomies and thesauri. (See my past blog posts Taxonomies vs. Thesauri and Taxonomies vs. Thesauri: Practical Implementations.)

The associative (related) relationship is a generic, nonhierarchical, symmetrical (same in both directions), reciprocal relationship between pairs of terms/concepts in a thesaurus or taxonomy. "Related concept" actually refers to a kind of relationship, not a kind of concept. The following figure illustrates that Data protection and Privacy are related.


It is true that many taxonomies do not have associative relationships. This is for various reasons. The function of the taxonomy in the user interface may not require the support of related concepts, such as when the taxonomy is displayed only as facets for refining results or only as type-ahead taxonomy term suggestions when a user enters a search string into a search box. The taxonomy may be implemented in a system (such as a commercial off-the-shelf content management system or SharePoint) that does not support links to, or navigation to, related concepts in the user interface. A taxonomy may be too small to make beneficial use of associative relationships if most of the taxonomy can quickly be browsed and seen. Finally, and perhaps of the greatest potential significance, relationships across different types of concepts can instead be better supported with customized semantic relationships based on custom schema and ontologies, which can be applied to a taxonomy. For example, you could have Physicians practice Medicine and Medicine isPracticedBy Physicians, instead of Physicians related Medicine.

It is not so much the presence but rather the extent of associative relationships that also distinguishes thesauri from taxonomies. In a traditional thesaurus, associative relationships are as prolific as hierarchical relationships, and perhaps even more so, and they occur between terms of all different kinds and different types of relatedness. The thesaurus standards (ANSI/NISO Z39.19 and ISO 25964-1) provide a list of possible types of associative relationships (process and agent, action and target, cause and effect, object and property, object and origins, and discipline and object, among many others). When taxonomies have associative relationships, they tend to be limited to only certain categories, facets, or concept schemes of the taxonomy.

Related Concepts and SKOS Concept Schemes

Most taxonomies these days, if they are of any significant size (hundreds or thousands of concepts) and intended for use in more than one application, are created in the SKOS (Simple Knowledge Organization System) data model. (Smaller taxonomies might be created in a spreadsheet and imported into a content management system.) The highest level of organizational structure in SKOS is the concept scheme. SKOS-based taxonomy management software will group and display multiple concept schemes together in a single “project” or “knowledge model,” which is intended for a single business use, set of content, user audience, or implementation (with some overlap of multiple use cases acceptable). While SKOS does not provide any recommendation on what you should use concept schemes for, it has become common practice to designate a concept scheme for a taxonomy facet or a metadata property/field. Even when concept schemes are not currently implemented as facets, they might be in the future, so it is good practice to create concept schemes to represent facets. The structure of concept schemes representing facets is also a good organizing principle for constructing any taxonomy. Concept schemes also tend to reflect top-level “classes” of ontologies (although not the very esoteric top class of “Thing”).

SKOS permits the creation of related concept relationships both within and between concept schemes. SKOS also has mapping relationships called matching properties, including relatedMatch, for use between concept schemes, whether they are in the same “project” (sharing the same, initial, domain part of a URI) or not. The option to use either related or relatedMatch across concept schemes of the same project can be a source of confusion.

Best Practices for SKOS Related Concepts

If you are implementing concept schemes each as a facet/filter/refinement in a user interface, then it is best practice not to create associative (related) relationships between concepts in different concept schemes. Facets function as mutually exclusive aspects or dimensions of content items and queries. Any “relatedness” is implicit based on the search results, but not from the taxonomy itself, which should be flexible to allow any combination of concepts from facets and not prescribe relatedness. For example, a user may want to filter a search on movies by selected criteria (facets) of a chosen genre, actor, director, topical theme, and country of production, and the result set will implicitly indicate in which movies these aspects are related.

Enriching a taxonomy with the semantics of an ontology, in addition to supporting additional data attributes (such as movie production year, actor nationality and birth date, etc.), supports connections across concept types that can be utilized in a front-end application. The user can search not only for movies, but also for other entities, such as actors (who appear in movies of a certain genre directed by a certain director), or directors (who directed movies on certain themes from certain countries), etc. This involves creating customized, semantic relationships between classes which correspond to the concept schemes: Actor performsIn Movie title and Movie title hasActor Actor, Movie title isProducedIn Country and Country isOriginOf Movie title, etc. These semantic relationships, of course, make any generic SKOS related relationships across the concept schemes unnecessary, redundant, and rather meaningless.
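To illustrate how such paired semantic relationships support searching for entities other than movies, here is a minimal sketch using plain Python tuples as subject-predicate-object statements. The relationship names come from the example above; the function names and the placeholder actors, movies, and country are hypothetical, and a real implementation would normally rely on an ontology and a graph database rather than an in-memory set.

```python
# Each custom relationship is paired with its inverse.
INVERSE_OF = {
    "performsIn": "hasActor",
    "hasActor": "performsIn",
    "isProducedIn": "isOriginOf",
    "isOriginOf": "isProducedIn",
}


def with_inverses(triples):
    """Return the triples plus the inverse statement implied by each one."""
    result = set(triples)
    for subject, predicate, obj in triples:
        result.add((obj, INVERSE_OF[predicate], subject))
    return result


# Placeholder data for illustration only.
triples = with_inverses({
    ("Actor A", "performsIn", "Movie X"),
    ("Actor B", "performsIn", "Movie Z"),
    ("Movie X", "isProducedIn", "Country Y"),
})


def objects(subject, predicate):
    """All objects linked from a subject by a given relationship."""
    return {o for s, p, o in triples if s == subject and p == predicate}


# Actors who appear in movies originating from Country Y.
actors = {
    actor
    for movie in objects("Country Y", "isOriginOf")
    for actor in objects(movie, "hasActor")
}
print(actors)
```

A generic related link between Actor A and Country Y would add nothing here, since the connection is already derivable, and with more meaning, from the two named relationships.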

Thus, regardless of the use of your concept schemes, the related concept relationship is best not used between concepts in different concept schemes. Rather, the related concept relationship is better used between concepts within a concept scheme, especially topical (subject) concepts, for example, relating the concepts Data quality and Quality management. Relatedness between named entities within a concept scheme, on the other hand, such as concept schemes for People, Organizations, and Geographic places, is best left to be implicit from the retrieved content and not prescribed in a taxonomy, because such relatedness may depend on the content, change over time, and be too subjective.
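This recommendation can be checked mechanically when a taxonomy is exported. The following is a minimal sketch, assuming you can obtain a mapping from each concept to its concept scheme and a list of related pairs; the function name, the data structures, and the sample concepts are hypothetical and not tied to any particular taxonomy management tool.

```python
def cross_scheme_related(scheme_of, related_pairs):
    """Return the related pairs whose two concepts sit in different concept schemes."""
    return [
        (a, b)
        for a, b in related_pairs
        if scheme_of[a] != scheme_of[b]
    ]


# Hypothetical export: concept -> concept scheme.
scheme_of = {
    "Data quality": "Topics",
    "Quality management": "Topics",
    "Drama": "Genres",
    "France": "Countries",
}
# Each pair stands for one symmetrical related relationship.
related_pairs = [
    ("Data quality", "Quality management"),
    ("Drama", "France"),
]

print(cross_scheme_related(scheme_of, related_pairs))
```

Any pairs reported are candidates for removal or for replacement with a named semantic relationship.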

Even if the current end-user application of a taxonomy does not support user interaction with related links, associative relationships can support tagging, both manual and automated. Finally, a taxonomy typically has a longer life than a single application, so incorporating related concept relationships while the taxonomy is being built and regularly maintained is a good practice for the future use of the taxonomy.

Friday, February 4, 2022

Defining a Taxonomy’s Scope

In planning a taxonomy, I have often said that it is important at the beginning to define the taxonomy’s scope, specifically the subject area scope of the taxonomy’s terms, but without going into more detail. Recently I was asked by a client how to define a taxonomy’s scope. This is a good question. The taxonomy should be suited to the subject area scope of the content that will be tagged with the taxonomy and to the scope of the user’s expectations. Terms or topics only marginal to the subject scope, however, could occur in the content, and whether they should also be included in the taxonomy is a question. Ultimately, that should depend on whether user expectations justify it, as the needs of users should also be a factor in creating a taxonomy. A taxonomy should suit both its content and its users.

Sources for Taxonomy Terms

For content as a source of taxonomy terms, a combination of manual and automated approaches is recommended. By manually reviewing sample individual documents or content items, you can discern the main ideas and main topics, which should form the start and basic structure of the taxonomy and also help define its scope. Automated methods of extracting terms, through text analytics technologies, can bring in many additional terms from a much larger corpus of documents more quickly, picking up terms that a limited manual review would miss. Even though automated text analytics extracts terms based on relevancy and frequency of occurrence, such terms could be out of scope of the subject domain. That’s why it’s important to start first with a manual review of content to define the subject scope.  Then, when you enrich the taxonomy with automated extraction, you can approve terms that appear to be in scope or at least closely relevant and reject others. But should you reject all that are out of scope, even if they appear with sufficient frequency and relevancy? My advice is to try to assume the role of the user. Ask yourself: Might a user want to search for content on this term in this content collection?
 
For user needs and expectations as a contributing source of taxonomy terms, obtaining this information can be very direct, such as by creating a user questionnaire (at least for your internal users) that asks what the topics of importance are, how those users would define the scope, and what “marginal” topics would be acceptable for them to include. You could also request sample challenging (not expected, basic, typical) queries that the users would make.  Another good way to obtain input from the user side is to look at search query logs that list search strings that users have entered over a period of time, ranked by frequency. If a search phrase that is slightly out of scope of the subject occurs frequently, then the term should still be considered for inclusion in the taxonomy.
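If the search log can be exported as a plain list of query strings, ranking them and comparing them against the existing taxonomy labels takes only a few lines. This is a sketch under that assumption; the function name, the frequency threshold, and the sample queries and labels are hypothetical.

```python
from collections import Counter


def candidate_terms(queries, taxonomy_labels, min_count=2):
    """Rank search strings by frequency; keep frequent ones missing from the taxonomy."""
    known = {label.casefold() for label in taxonomy_labels}
    counts = Counter(q.strip().casefold() for q in queries if q.strip())
    return [
        (query, n)
        for query, n in counts.most_common()
        if n >= min_count and query not in known
    ]


# Hypothetical search log and current taxonomy labels.
queries = [
    "nutrition facts", "Nutrition Facts", "roasting",
    "lasagna", "nutrition facts ", "lasagna",
]
labels = ["Lasagna", "Roasting", "Baking"]

print(candidate_terms(queries, labels))
```

The output is only a list of candidates. Whether a frequent but marginal search phrase belongs in the taxonomy remains a judgment about scope and user expectations.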

In either case, the scope of the subject gets better defined as the taxonomy is created. For example, a taxonomy for recipes may initially be scoped to comprise terms for the names of dishes, ingredients, and cooking methods. But then a different term shows up with significant frequency, “Nutrition Facts.” If it occurs in both the content and the user research, then it likely should be included. If it shows up in the content only, but is not validated in user research, then it is more questionable.

Taxonomy Structure

The initial taxonomy structure itself tends to impose limits on scope. Taxonomies tend to be hierarchical with a limited number of top terms. If a candidate term appears in the content that does not seem to belong anywhere in the current taxonomic hierarchy, you might be inclined to exclude it. Factors of user needs (they might want to look up this term in this content), however, should take precedence. For example, the term “COVID-19” might be marginal but still of enough interest to be included in many taxonomies on varied subjects, yet there would exist no broader term for diseases in those taxonomies. Then adjustments need to be made, such as renaming or adding broader terms, or perhaps, more likely, the proposed term should be modified to fit the context of the taxonomy, such as becoming “COVID-19 impacts.”

Another thing to consider is adopting more of a thesaurus structure than a taxonomy structure, at least for the facet or concept scheme of the taxonomy that is for miscellaneous “topics.” One characteristic of thesauri is to not rely so heavily on extensive hierarchical trees. What this means is that you could decide that it is acceptable that not all terms have broader terms and thus it’s OK to have a very large number of top terms, with the more specific terms linked to other terms only by related-term relationships, another feature of thesauri, if not by broader/narrower-term relationships. Abandoning the full hierarchical tree structure should only be considered if this hierarchy is not displayed as navigation to the end users.

Documenting Policy

In any case, you need to define policies regarding what kinds of terms can be added and what kinds should not. This will evolve out of the activity of building the taxonomy, especially from evaluating which extracted terms and which search log terms to approve. Whoever is doing this task (hopefully more than one person) should document each instance of uncertainty. While many term approvals and rejections will be obvious, there will be a gray area. These uncertain cases should be collected and discussed together, and then a policy can emerge.

Saturday, November 27, 2021

Attributes in Taxonomies

When I did consulting for ecommerce taxonomy clients years ago, and they would refer to the taxonomy facets for products as “attributes,” I felt that might be confusing, because I considered “attributes” something else: a characteristic like metadata of a taxonomy term or a feature of an ontology. I have since come to realize that facets in some cases, especially in ecommerce, can be considered attributes, and they can even be managed in an ontology that is combined with the taxonomy.

Facets in a faceted taxonomy are various taxonomy term “types” that function as refinements or filters in the user interface for limiting search results on content that share similar types of terms or attributes. Users refine or filter their searches by selecting a term or value from each of two or more facets. In a periodical article research database offered by a library, facets might be subject, geographic place, named person, named organization, article type, and publication name. Within an enterprise intranet or enterprise content management system, facets might be topic, department, office location, and document type. In a health information service, facets might be symptom, body part, patient age, and patient gender. In a corporate knowledge base for customer service, facets might be product type, brand name, market, region, issue type, and customer type. In most of these cases, a topical taxonomy is one of the facets, even if that topical taxonomy is hierarchical. The primary taxonomy design challenge in such cases is deciding what kind of information would be useful if separated out in its own facet, and what can remain in the generic topics facet. Using the SKOS (Simple Knowledge Organization System) model, each concept scheme serves as a facet.

In a product, ecommerce or marketplace taxonomy, the hierarchical taxonomy of product types is not one of the facets. This large, hierarchical taxonomy is typically the starting point for user browsing. While not constituting a facet, this hierarchy is linked to the facets. The user navigates or drills down through a hierarchical tree of product categories, until a specific product type is found, and then the facets (attributes) relevant to that product type are made available to the user, allowing the user to refine the search further, by selecting from each of several attributes, such as size, color, material, price, and features. This requires a different approach to the taxonomy design than for the faceted taxonomies described above, and thus these post-hierarchical-browse refinements are better known as and more appropriately called attributes.

Ecommerce taxonomy attributes

Attributes can serve as refinements/filters in taxonomies for purposes other than ecommerce, such as job board taxonomies (attributes for job location, skill level, salary range, job type, employer/company, industry, date posted, etc.), an internal enterprise expert-finder (attributes for job title, department, office/work location, skills, subject areas of interest, academic degree, languages, etc.), or taxonomies of movies (attributes for genre, production company, production country, language, award winner type, release date, etc.).

The attributes generally pertain to specific named entities, such as the name of a product offered by a specific seller, the name of a job title at a specific employer, or the title of a movie. There can be attributes for more than one kind of named entity in the same set of taxonomies, such as for employer name in addition to job title in a job board taxonomy. Subjects, which are not named entities, usually do not have attributes, but some do, especially in the fields of science and medicine, where there would be attributes on the names of species, chemicals, viruses, diseases, etc. I will discuss named entities in more detail in my next blog post.

Issues to consider in creating attributes

In a taxonomy where attributes are important, there are various issues to consider. First, shall there be a hierarchical topical taxonomy presented as an initial browse interface to the users? While this is typical for product/ecommerce taxonomies, it is not usually the case for job board taxonomies nor for a taxonomy for movies. However, it may be desired for a taxonomy of nonfiction books or periodicals, which users more often would categorize by subject. A producer or publisher of educational content will likely have a hierarchical taxonomy of disciplines or subject areas. A research-focused organization would also likely have a hierarchical subject taxonomy in addition to the facet-attributes dealing with location, type, funding source/sponsor, researcher name, etc. Having a hierarchical taxonomy outside of the attributes tends to be a user experience design decision, but it has an important impact on how the overall taxonomy is designed and managed.

More attributes may be created than are usable for filtering/refining results. For example, products will likely have SKU numbers among their attributes, which can be displayed and perhaps even made searchable, but would not be one of the filtering-facets presented in the user interface. In a taxonomy for finding internal experts, contact information, such as an email address and phone number, would be attributes on each person’s profile, but these would not be searchable fields. Rather, they would display when the person profile is selected. Thus, another issue when creating attributes is determining which will display and function as filters/refinements and which will display only as additional metadata on a selected item.

If an initial hierarchical topical taxonomy is presented to the users, there arises the question of at what point the hierarchy of categories should end and further details should be treated instead with attributes. This question often comes up when designing ecommerce taxonomies. For example, to distinguish gas and electric stoves, should each of these types be a narrower term of stoves, or should energy source be an attribute of stoves?

Another issue to resolve is determining which attributes should be generic across all categories, and which should be category specific. For example, on which product categories in an ecommerce taxonomy is it appropriate to have an attribute for gender (as for clothing or a gift for a woman or for a man)? Related to that is the question of which categories should have their own unique attributes. Are some attributes relevant to major (the broadest) categories, other attributes relevant only to the most specific categories, and yet other attributes applicable to various miscellaneous products? For example, color might be relevant for products in different parts of the hierarchy.

Attributes should be managed as belonging to different types based on their values, such as whether they are controlled vocabularies, dates, currency, numbers, or simply a Boolean yes/no, such as Remote being an attribute of jobs. If a hierarchical taxonomy resides outside of the attributes, then controlled vocabulary attributes are an additional part of a larger set of taxonomies/controlled vocabularies. How this is managed varies based on the taxonomy/ontology management tool. For example, such term lists might need to be managed as separate concept schemes in a SKOS taxonomy, even though they are used in ontology-based attributes. It can start getting complicated when an attribute type has different values for different categories in the same hierarchical taxonomy implementation. For example, the attribute of Material could have different values for tables than for clothing, and both categories are offered by the same ecommerce seller.
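One way to think through these distinctions before committing to a tool is to model an attribute as a small data structure that records its value type, whether it functions as a filter, and any category-specific value lists. The class, field names, and sample values below are hypothetical and only a sketch; actual taxonomy/ontology management tools model this in their own ways.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    value_type: str          # "controlled", "date", "currency", "number", or "boolean"
    filterable: bool = True  # False: displayed on the item only, not offered as a filter
    values_by_category: dict = field(default_factory=dict)

    def allowed_values(self, category):
        """Controlled values for a category, falling back to a generic '*' list if present."""
        return self.values_by_category.get(
            category, self.values_by_category.get("*", [])
        )


# The same attribute type with different controlled values per category.
material = Attribute(
    name="Material",
    value_type="controlled",
    values_by_category={
        "Tables": ["Wood", "Glass", "Metal"],
        "Clothing": ["Cotton", "Wool", "Polyester"],
    },
)
# A display-only attribute and a Boolean attribute.
sku = Attribute(name="SKU", value_type="number", filterable=False)
remote = Attribute(name="Remote", value_type="boolean")

print(material.allowed_values("Tables"))
print(material.allowed_values("Clothing"))
```

In a SKOS-based tool, each category-specific value list in this sketch would likely correspond to its own concept scheme or collection.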

Attributes add another level or layer of expressivity to a taxonomy or set of controlled vocabularies, which brings it closer to an ontology. The distinction between taxonomy and ontology is not necessarily clear. It’s fine to have just some ontology-like features, such as attributes, but it is recommended to use a taxonomy/ontology management tool, such as PoolParty, which manages taxonomies and ontologies (and anything in between) according to Semantic Web/World Wide Web Consortium (W3C) standards.

Saturday, July 31, 2021

Taxonomies and Sitemaps

I was recently asked if the sitemap of a company’s website could serve as the start of a taxonomy for an organization. The sitemap, after all, includes all the relevant topics pertaining to an organization’s business offerings, and they are arranged in a hierarchy. I have previously blogged on the subject of why a website’s navigation is not a taxonomy in Navigation Schemes and Taxonomies. A sitemap is similar to a website’s navigation, but it goes deeper by including the titles or topics of web pages which are not included in the website’s menu, and it is not necessarily intended for user browsing. A sitemap may go five or six levels deep, whereas the website’s navigation menus are usually only two levels deep. Therefore, a sitemap may seem as if it’s a taxonomy. However, just because a sitemap is as large and detailed as a taxonomy needs to be does not make it suitable as a taxonomy.

Different purposes

We need to understand what a taxonomy is for. It’s to aid users in locating desired content by topic-terms, which reflect both the terminology use of the users and of the content. Taxonomy terms are tagged/indexed to content that is relevant to the term. The starting point when creating a taxonomy is to identify the topics of the content and identify the topics of user interest or search, and then merge those topics into a taxonomy by bringing together different names for the same concept. The concepts are then structurally arranged to show the relationships between the terms, especially hierarchical relationships. The primary purpose of the hierarchy of terms in a taxonomy is to aid the users in finding the appropriate term. When browsing the taxonomy, they may find a broader term or narrower term that better describes their search goals. Then they can select that term to retrieve content that was tagged with the term.  

A sitemap, on the other hand, lists all or most pages of a website, usually by page title and organized in the hierarchical structure of the website. The hierarchical structure of the website was designed to organize information in a logical manner for users to browse and explore, as considered by the information architect who designed the website. The sitemap thus reflects pages, which are often topics but not always. A page may have multiple topics of interest that a user might want to look up. A page is sometimes for performing a function or activity and not necessarily just a topic of information.

A sitemap is typically automatically generated from the page titles, and its primary purpose is not for users but for machines: sitemaps tell search engines about pages that are available for crawling on websites and can thus support search engine optimization (SEO). Sitemaps are also useful in planning the further development or organizational improvement of a website. Whether a sitemap should even be displayed to end users as a tool to find information on a website is questionable. If automatically generated, it's not designed for that purpose, but users could find it helpful, especially users who understand that it is merely the aggregation of page titles organized in the file structure of the website. Some websites make it available, and some do not. Some websites display a simplified sitemap instead, which is designed to be a guide to the users, but then it does not include all pages.
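The machine-oriented form of a sitemap is usually an XML file following the sitemaps.org protocol, which lists page URLs rather than topics. As a minimal sketch, the following reads such a file with the Python standard library and splits each URL into its path segments, which is about as much "hierarchy" as the file contains. The function name and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_paths(xml_text):
    """Return each <loc> URL in a sitemap as a list of path segments."""
    root = ET.fromstring(xml_text)
    paths = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        path = urlparse(loc.text.strip()).path
        paths.append([segment for segment in path.split("/") if segment])
    return paths


# Hypothetical sitemap content.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products/stoves/</loc></url>
  <url><loc>https://www.example.com/support/pay-a-bill</loc></url>
</urlset>"""

for segments in sitemap_paths(sample):
    indent = "  " * max(len(segments) - 1, 0)
    print(indent + (segments[-1] if segments else "(home)"))
```

The result reflects where pages live on the site, not what subjects they are about, which is why it can inform a taxonomy but not stand in for one.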

Different labels

The titles of pages and thus of sitemap entries often do not correspond to taxonomy terms. They could start out with a verb for an activity, they could be commands or questions, or they could be complete sentences. Taxonomy terms are topics or names represented only by nouns, noun phrases, or proper nouns. Examples of sitemap entries that are not good taxonomy terms may include:

How to use…
Get started with…
Help with…
Pay a bill
Shop for…
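When mining a long sitemap for candidate topics, entries like those above can be flagged with a rough screening rule before manual review. The sketch below is only a crude heuristic: the list of opening phrases is an assumption drawn from these examples, the function name is hypothetical, and it does not perform real grammatical analysis.

```python
# Opening phrases that suggest an activity or command rather than a topic (assumed list).
ACTION_OPENERS = ("how to ", "get started", "help with ", "pay ", "shop for ")


def looks_like_page_title(label):
    """Flag entries that read as actions, commands, or questions rather than noun phrases."""
    text = label.strip().casefold()
    return text.endswith("?") or text.startswith(ACTION_OPENERS)


# Hypothetical sitemap entries.
entries = ["Pay a bill", "Payment methods", "How to use filters", "Billing", "Where is my order?"]

for entry in entries:
    status = "review" if looks_like_page_title(entry) else "candidate topic"
    print(f"{entry}: {status}")
```

Flagged entries may still point to a usable topic. For instance, a page titled "Pay a bill" suggests a concept along the lines of bill payment, which the taxonomist would have to label properly.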

As with navigation, the entries of a sitemap reflect pages in a one-to-one relationship, in contrast to taxonomy terms, each of which may retrieve multiple pages or content sources, and each page or content item can be tagged with multiple taxonomy terms. As such, entries in a sitemap may actually be more specific than would be needed in a taxonomy.  The user’s selection of multiple taxonomy terms in combination, through filters/refinements, achieves the result of obtaining an appropriate list of relevant content.

Conclusions

Sitemaps should not be used as taxonomies, but their topics (not their labels) may be considered a good source for a taxonomy. A sitemap might not even be suitable as a basis or starting point for a taxonomy, but only as one source for developing taxonomy terms. It is recommended that a taxonomy be created separately from a sitemap, based on a review of content, search log data, and stakeholder and user interviews, with the sitemap as yet another source for consideration when developing taxonomy terms. The hierarchy of the sitemap should also not be too closely followed, although parts of its hierarchical structure may be taken into consideration for creating taxonomy relationships.