As a taxonomist, I often write about creating taxonomies from scratch, but in practice, many organizations obtain at least some taxonomies or controlled vocabularies from other sources. Although internal content about an organization’s business, products, or services requires mostly custom taxonomies, some taxonomies, such as those for regions or technologies, may come from other sources. Content that comes from external sources, such as research articles, is also appropriate for tagging with taxonomies from other sources.
These “other sources” could be:
- Governmental agencies or nongovernmental organizations that publish taxonomies, thesauri, and subject heading schemes for their own purposes but make them freely available
- Companies that sell their taxonomies
- AI tools that generate taxonomies
Taxonomies for Re-Use or License
Types of taxonomies available can be categorized in multiple ways that overlap:
- available for free or for a fee
- available for commercial re-use or not available for commercial re-use
- permitted to be modified or not permitted to be modified
- designed and created for a specific content set or intended for broader use
I had previously blogged on taxonomies for license, discussing the issues of fees, availability for re-use, and permission for modification. Now I want to focus on the issue of using a taxonomy created for a specific purpose.
Recently, I worked for a client that had created taxonomies for the life sciences industries, with sections based on branches of the National Library of Medicine’s Medical Subject Headings (MeSH), because MeSH was free. MeSH, however, was designed for indexing medical research literature, and it turned out not to be suitable for my client’s purpose of helping biomedical and pharmaceutical companies find articles relevant to their business and market.
For example, MeSH organizes drug types by their chemical types (Heterocyclic Compounds, Enzymes and Coenzymes, etc.). A biomedical drug discovery company or a pharmaceutical company, however, classifies drugs by the kind of disease they treat (Cancer Drugs, Alzheimer’s Drugs, etc.). Thus, concepts taken from MeSH are not well suited to a pharmaceutical industry taxonomy.
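To make the difference concrete, here is a minimal Python sketch of the same small set of drug concepts grouped two ways. The drug names (Drug A, Drug B, Drug C) and their category assignments are placeholders invented for illustration and are not actual MeSH data; only the category labels come from the examples above.

```python
from collections import defaultdict

# Placeholder concepts: names and assignments are hypothetical, not MeSH data.
drugs = [
    {"label": "Drug A", "chemical_type": "Heterocyclic Compounds", "treats": "Cancer Drugs"},
    {"label": "Drug B", "chemical_type": "Heterocyclic Compounds", "treats": "Alzheimer's Drugs"},
    {"label": "Drug C", "chemical_type": "Enzymes and Coenzymes", "treats": "Cancer Drugs"},
]


def group_by(concepts, attribute):
    """Build a one-level hierarchy: broader category -> sorted narrower labels."""
    hierarchy = defaultdict(list)
    for concept in concepts:
        hierarchy[concept[attribute]].append(concept["label"])
    return {broader: sorted(labels) for broader, labels in hierarchy.items()}


# The same concepts, organized for two different audiences.
chemical_view = group_by(drugs, "chemical_type")
therapeutic_view = group_by(drugs, "treats")
```

Both views contain the same concepts, but a user browsing for cancer drugs finds Drug A and Drug C together only in the second one.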
Previously, I worked at Gale, which developed and managed many controlled vocabularies (or taxonomies) for indexing periodical and reference literature, which it sold to libraries. For a time, Gale also offered for license subject-domain subsets of its subject thesaurus of over 10,000 preferred terms. I realized that the business terms used to index articles in business news sources were not necessarily the same terms that a company would want to use to tag its business documents and intranet pages. Others seemed to realize this too, and Gale didn't sell any stand-alone taxonomy licenses while I worked there.
Taxonomies that are designed purely for sale, and not with a specific content set and user type in mind, are more suitable for licensing and re-use. I’ve seen a few small-scale examples of this in sets of keywords sold for tagging photos. The only commercial business I am aware of that licenses full taxonomies (with multiple hierarchies) in various business and industry domains is WAND. These taxonomies, which are also enriched with alternative labels (synonyms/variants), are a decent way to get started. The taxonomies can then be edited or supplemented as needed. WAND taxonomies, which are manually developed, are particularly useful for product and service categories in various industries.
AI-Generated Taxonomies
When I first explored the use of GenAI to create taxonomies (described in my prior blog post), I felt that the results were quite inadequate, as LLMs were pulling from multiple sources, where the same term could have different meanings in different contexts, different terms could refer to the same thing, and even the hierarchy would vary for different use cases.
More recently, I’ve used ChatGPT and Claude and found that the results, especially in areas of science, technology, and medicine, have improved with respect to specific taxonomy hierarchies. Even when I did not ask for a taxonomy, the LLMs often returned respectable three-level hierarchies of concepts in such topic areas as medical devices, drug types, and cell receptors. I also found AI tools useful for disambiguating similar terms and for providing synonyms for technical terms I was not sure of.
AI-generated taxonomies are a potential competitor to WAND’s taxonomies for sale, but this depends on the size and subject area. The WAND taxonomies are large and detailed in their number of concepts, hierarchical levels, and alternative labels, and they have already been expertly created by humans. Using AI to create taxonomies works better on single hierarchical trees and always requires human editing to refine and complete the taxonomies. Hierarchies and alternative labels are also created in separate steps. For multiple smaller taxonomies or taxonomy facets, AI is likely a more practical option than licensing full taxonomies.
So, it shouldn’t be a surprise that taxonomy management software is starting to integrate GenAI and LLMs to automate taxonomy creation. For example, Graphwise Modeling (formerly PoolParty) introduced a Taxonomy Advisor feature in 2024, which allows users to request suggestions for narrower concepts, alternative labels, and definitions. This month, Graphwise announced the additional Taxonomy Builder feature, which enables the generation of a complete taxonomy hierarchy. It can be used for small or large portions of the taxonomy, as needed, and it’s convenient to have these capabilities within a single tool. It also takes care of the prompt creation, based on the existing hierarchy, the user-entered description of the taxonomy, and any additional instructions. I do not create taxonomy hierarchies with AI tools often enough to become good at writing the best prompts, so I appreciate it when a tool helps with that. There will be more about this later, as I am working on a white paper and will be speaking in a webinar in April on GenAI/LLMs in taxonomy creation.
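For readers curious about what tool-assisted prompt creation might involve, here is a minimal Python sketch that assembles a prompt from an existing hierarchy path, a taxonomy description, and optional instructions. It is only an illustration of the general idea and is not how Graphwise's Taxonomy Builder is implemented; the function name `build_prompt`, the prompt wording, and the "Diagnostic Devices" concept are all hypothetical.

```python
def build_prompt(hierarchy_path, description, instructions=""):
    """Assemble an LLM prompt asking for narrower concepts of the last concept in the path.

    hierarchy_path: concept labels from the top concept down to the one to expand.
    description: the user's description of the taxonomy's purpose.
    instructions: optional extra guidance, such as style rules for labels.
    """
    parent = hierarchy_path[-1]
    lines = [
        f"Taxonomy description: {description}",
        "Existing hierarchy: " + " > ".join(hierarchy_path),
        f"Suggest narrower concepts for: {parent}",
        "Return one concept per line, with no numbering.",
    ]
    if instructions:
        lines.append(f"Additional instructions: {instructions}")
    return "\n".join(lines)


# Hypothetical example values.
prompt = build_prompt(
    ["Medical Devices", "Diagnostic Devices"],
    "Taxonomy for tagging news articles for medical device companies",
    "Use plural noun phrases.",
)
```

Passing the existing hierarchy along with the description gives the LLM the context it needs to keep its suggestions at the right level of specificity.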
When to Use Other Sources
As mentioned previously, taxonomies published by external sources are best used for content from external sources. When it comes to AI-generated taxonomies, though, it’s not necessary to generate an entire taxonomy, hierarchy, or facet. AI methods are quite suitable for smaller components of a taxonomy, such as the narrower concepts of a single concept. As such, AI is more widely applicable in taxonomy development, including for enterprise taxonomies. For example, AI could be useful for generating a list of document types for a document type facet; after review, the AI-suggested document types that are not applicable can be removed. The starter list of terms can get people thinking about what might be missing, which is easier than trying to come up with a list of terms from scratch.
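The review step for such a starter list can be sketched in a few lines of Python. The suggested document types, the reviewer's decisions, and the `review` function below are all hypothetical; in practice the suggestions would come from an LLM and the decisions from the people who know the content.

```python
# Hypothetical AI-suggested document types for a document type facet.
suggested = ["Policies", "Contracts", "Meeting Minutes", "Lab Notebooks", "Press Releases"]

# Hypothetical reviewer decisions: terms that do not apply to this organization.
rejected = {"Lab Notebooks"}
# Terms the reviewer noticed were missing after reading the starter list.
added = ["Invoices"]


def review(suggestions, rejected, added):
    """Drop rejected terms, append reviewer additions, and skip case-insensitive duplicates."""
    seen = set()
    result = []
    for term in list(suggestions) + list(added):
        key = term.casefold()
        if term in rejected or key in seen:
            continue
        seen.add(key)
        result.append(term)
    return result


document_types = review(suggested, rejected, added)
```

The point of the sketch is that the AI output is only the input to the review, and the final list is whatever remains after human decisions.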
In conclusion, an AI-generated taxonomy, after human review and editing, is usually a better solution than a taxonomy from another source that was created for a different purpose, as when MeSH is used for the commercial side of healthcare. A taxonomy that is partially or fully generated by AI, drawing on multiple sources and appropriate prompts (such as those built into Taxonomy Builder), is typically a better source than a taxonomy that was created for a specific and different use case or one whose license prohibits editing or commercial re-use. If you choose to generate taxonomies with AI, I am happy to offer my services to review and edit them!
