Monday, February 23, 2026

Taxonomy Sources: Re-Used, Licensed, or AI-Generated

As a taxonomist, I often write about creating taxonomies from scratch, but in practice, many organizations obtain at least some taxonomies or controlled vocabularies from other sources. Although internal content about an organization’s business, products, or services requires mostly custom taxonomies, some taxonomies, such as for regions or technologies, may come from other sources. Content that comes from external sources, such as research articles, is also appropriate for tagging with taxonomies from other sources.

These “other sources” could be:

  • Governmental agencies or nongovernmental organizations which publish taxonomies, thesauri, and subject heading schemes for their purposes but which are freely available

  • Companies which sell their taxonomies

  • Taxonomies that are generated by AI


Taxonomies for Re-Use or License

Types of taxonomies available can be categorized in multiple ways that overlap:

  •  available for free or for a fee
  •  available for commercial re-use or not available for commercial re-use
  •  permissible for modification or not permitted to modify
  •  designed and created for a specific content set or intended for broader use

I had previously blogged on taxonomies for license, discussing the issues of fees, availability for re-use, and permission for modification. Now I want to focus on the issue of using a taxonomy created for a specific purpose. 


Recently, I worked for a client that had created taxonomies for the life sciences industries with sections based on branches of the National Library of Medicine’s Medical Subject Headings (MeSH), because it was free. MeSH, however, had been designed for indexing medical research literature, and turned out not to be suitable for my client’s purpose of helping biomedical and pharmaceutical companies find articles relevant to their business and market.

For example, MeSH organizes drug types by their chemical types (Heterocyclic Compounds, Enzymes and Coenzymes, etc.). For a biomedical drug discovery company or a pharmaceutical company, however, the focus and classification of drugs is instead based on what kind of disease they treat (Cancer Drugs, Alzheimer’s Drugs, etc.). Thus, concepts from MeSH are not well suited to a pharmaceutical industry taxonomy.


Previously, I worked at Gale, which developed and managed many controlled vocabularies (or taxonomies) for indexing periodical and reference literature, which it sold to libraries. For a time, Gale also offered for license subject-domain subsets of its subject thesaurus of over 10,000 preferred terms. I realized that the business terms to index articles in business news sources were not necessarily the same terms that a company would want to tag its business documents and intranet pages. Others seemed to realize this too, and Gale didn't sell any stand-alone taxonomy licenses as long as I worked there. 


Taxonomies that are designed purely for sale and not designed with specific content and user type in mind are more suitable for licensing and re-use. I’ve seen a few small-scale examples of this with sets of keywords for sale for tagging photos. The only commercial business I am aware of that licenses full taxonomies (with alternative labels and multiple hierarchies) in various business and industry domains is WAND. These taxonomies, enriched with alternative labels (synonyms/variants), are a decent way to get started. The taxonomies can then be edited or supplemented as needed. WAND taxonomies, which are manually developed, are particularly useful for product and services categories in various industries.

AI-Generated Taxonomies

When I first explored the use of GenAI to create taxonomies (described in my prior blog post), I felt that the results were quite inadequate, as LLMs were pulling from multiple sources, where the same term could have different meanings in different contexts, different terms could refer to the same thing, and even the hierarchy would vary for different use cases.


More recently, I’ve used ChatGPT and Claude and found that the results, especially when focused in areas of science, technology, and medicine, have improved with respect to specific taxonomy hierarchies. Even when I did not ask for a taxonomy, the LLMs often return respectable three-level hierarchies of concepts in such topic areas as medical devices, drug types, and cell receptors. I also found AI tools useful for disambiguating similar terms or providing synonyms for technical terms I was not sure of. 


AI-generated taxonomies are a potential competitor to WAND’s taxonomies for sale, but this depends on the size and subject area. The WAND taxonomies are large and detailed in the number of concepts, hierarchical levels, and alternative labels, and they have already been expertly created by humans. Using AI to create taxonomies works better on single hierarchical trees, and always requires human editing to refine and complete the taxonomies. Hierarchies and alternative labels are created in separate steps. For multiple smaller taxonomies or taxonomy facets, AI is likely a more practical option than licensing full taxonomies. 


So, it shouldn’t be a surprise that taxonomy management software is starting to integrate GenAI and LLMs to automate taxonomy creation. For example, Graphwise Modeling (formerly PoolParty) introduced a Taxonomy Advisor feature in 2024, which allows users to request suggestions for narrower concepts, alternative labels, and definitions. This month, Graphwise announced the additional Taxonomy Builder feature, which enables the generation of a complete taxonomy hierarchy. It can be used for small portions or larger portions of the taxonomy, as needed, and it’s convenient to have the capabilities within a single tool. It also takes care of the prompt creation, based on the existing hierarchy and the user-entered description of the taxonomy and any additional instructions. I do not create taxonomy hierarchies with AI tools often enough to become good at writing the best prompts, so I appreciate it when a tool helps with that. There will be more about this later, as I am working on a white paper and will be speaking in a webinar in April on GenAI/LLMs in taxonomy creation. 

When to Use Other Sources

As mentioned previously, taxonomies published from external sources are best used for content from external sources. When it comes to AI-generated taxonomies, though, it’s not necessary to generate an entire taxonomy, hierarchy, or facet. AI methods are quite suitable for smaller components of a taxonomy, such as narrower concepts to a single concept. As such, AI uses in taxonomy development are more widely applicable, including for enterprise taxonomies. For example, AI could be useful for generating a list of document types for a document type facet, and then after review, those AI-suggested document types that are not applicable can be removed. The starter list of terms can get people thinking of what might be missing, which is easier than trying to come up with a list of terms from scratch. 


In conclusion, an AI-generated taxonomy, after human review and editing, is usually a better solution than a licensed taxonomy that was created for a different purpose, such as using MeSH for the commercial side of healthcare. A taxonomy that is partially generated by AI or fully generated by AI that uses multiple sources and appropriate prompts (such as what is built into Taxonomy Builder) is typically a better source than a taxonomy that was created for a specific and different use case or than a taxonomy whose license prohibits editing or commercial re-use. If you choose to generate taxonomies with AI, I am happy to offer my services to review and edit them!

Saturday, January 31, 2026

What a Taxonomy is Not

Although taxonomies have become increasingly common within enterprises and on websites, they are not always well understood. Taxonomies are sometimes confused with other knowledge organization systems, such as classification systems, website navigation schemes, business glossaries, or ontologies.


A taxonomy is a controlled, structurally organized set of unambiguous concepts, which may describe content, information, or data, and which users may be interested in querying. A taxonomy links users to the information they seek by bringing together various users’ terms with the terms that occur in the content or data. Prior to the emergence of modern taxonomies in applications for digital information, indexes at the back of printed books had been serving a similar role (and they still do). I have already written a blog post on Taxonomy Definition, so to further clarify what taxonomies are, it is also useful to explain what taxonomies are not.

 


Taxonomies are not the same as classification systems/schemes (such as industrial classification codes for economic analysis or medical classifications for health data collection or health insurance purposes), as the latter have mutually exclusive classes to which items are assigned for non-redundant analysis. Classification thus allows comparison, analysis, identification, location, and other actions associated with things based on their class. Taxonomies are organized sets of concepts tagged to content or associated with data, where the taxonomy organization serves merely for finding the desired concept or providing context for tagging. Thus, a concept may have more than one broader concept and thus appear in more than one place in the taxonomy hierarchy. 
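
The point about a concept having more than one broader concept can be illustrated with a minimal Python sketch. The concept names and the `broader` mapping are hypothetical examples, not drawn from any published taxonomy.

```python
# Hypothetical mini-taxonomy: each concept maps to its set of broader concepts.
# A classification scheme would allow at most one parent per class;
# a taxonomy may allow several (polyhierarchy).
broader = {
    "Medical Imaging Software": {"Software", "Medical Devices"},
    "Software": {"Technology"},
    "Medical Devices": {"Healthcare"},
    "Technology": set(),
    "Healthcare": set(),
}


def paths_to_top(concept, broader_map):
    """Return every path from a top-level concept down to the given concept."""
    parents = broader_map.get(concept, set())
    if not parents:
        return [[concept]]
    paths = []
    for parent in sorted(parents):
        for path in paths_to_top(parent, broader_map):
            paths.append(path + [concept])
    return paths


# The same concept is reachable by browsing two different branches.
for path in paths_to_top("Medical Imaging Software", broader):
    print(" > ".join(path))
```

In a classification system, the check would instead be that every class has exactly one parent, so that each item is counted only once.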


Taxonomies are not the same as navigation systems, which are common in websites or web applications. A taxonomy is more similar to an index, while a navigation system is more similar to a table of contents. Menu labels in a navigation can link to only one page, whereas concepts in a taxonomy are tagged to multiple pages, content items, or data records. Navigation systems are only used in browsing, but taxonomies may be both browsed and searched for their concepts. Navigation systems reflect paths and established links to content, whereas taxonomies comprise concepts that become metadata when tagged to content. Navigation systems, like classification systems, are not frequently or easily changed, whereas taxonomies can grow and change continuously, as needed.


Taxonomies are not the same as business glossaries, which are lists of terms of relevance to an organization’s business along with their definitions, although there is usually considerable overlap between the terms an organization gathers for its glossary(s) and those in its taxonomies. Not only is there usually the difference of a taxonomy’s hierarchical structure (although categories could be assigned to glossary terms), but the ultimate objectives differ, resulting in differences in the scope of term inclusion. A business glossary includes all terms that are of importance to the business but may not be understood by everyone, so definitions need to be provided. There could be terms of importance that need no definition, such as Marketing, so they are not included in the glossary. Technical terms and acronyms are usually included. A taxonomy, on the other hand, includes only the terms/concepts for which there are sufficient documents, pages, or content items to be tagged for retrieval. Sufficient content on a subject is a leading criterion for including a concept in a taxonomy.


Finally, taxonomies are not the same as ontologies. The confusion between the two may arise because taxonomies and ontologies are increasingly used in combination, and software (now referred to as TOMS for taxonomy-ontology management system) allows you to create a taxonomy and ontology as a single project or knowledge model. An ontology can be an upper-level model of a knowledge domain, but domain-specific ontologies may include multiple hierarchical levels of subclasses, and thus include what are essentially taxonomies. A taxonomy, however, can stand on its own without an ontology and serve the functions of tagging and retrieval via browsing and/or searching without the extension of an ontology. Ontologies support complex, multi-part queries involving relations, and they support reasoning and inference, which taxonomies do not. Each utilizes different data models: SKOS for taxonomies and RDFS and OWL for ontologies. 


Prior blog posts I have written that compare taxonomies to other knowledge organization systems in more detail are: 

Tuesday, December 30, 2025

Taxonomy Benefits Over an Ontology

In a recent conversation based on a LinkedIn post, someone asked “Why choose a taxonomy over an ontology?” This is a good question, since there has been a growing understanding that ontologies build upon taxonomies by adding more semantics, which enable additional benefits. I have presented at conferences on the topic of extending a taxonomy with an ontology. Taxonomies, however, have benefits that ontologies alone cannot provide.

I have compared taxonomies and ontologies in a past blog post (Taxonomies vs. Ontologies). Compared with taxonomies, ontologies support more complex multi-part searches, enable searching on data and not just content or full documents, and can connect across data in different repositories and sources, which leads to creating knowledge graphs or a semantic layer. Additionally, ontologies support modeling and exploration of complex relationships, graph visualizations, and reasoning and inferencing based on logic. Meanwhile, ontologies also include the basic feature of taxonomies of unlimited hierarchies of classes and subclasses. Thus, it may seem as if ontologies are superior to taxonomies and provide greater benefits than taxonomies.

Taxonomies, however, especially those based on the SKOS (Simple Knowledge Organization System) data model, have features and benefits not supported by ontologies alone, which are based only on OWL and RDFS standards. These taxonomy (or more broadly “controlled vocabulary”) features include the incorporation of synonyms to support searching and tagging, the support of multilingual concepts, the inclusion of definitions and notes in a standardized manner, the ability to map and link taxonomies together based on equivalent or related concepts, the alignment of the taxonomy with end-user applications including browsable hierarchies and facets for filtering, and finally the ease of implementation into various content systems.

Taxonomies are richer than ontologies in their linguistic aspects, including both synonyms and labels in other languages. Taxonomies are traditionally based on thesauri, which include the feature of having “equivalence” among multiple terms, whereby a preferred term may be “used for” other nonpreferred terms. The SKOS data model specifies a preferred label and any number of alternative labels and hidden labels for a concept. Furthermore, concepts may have labels in multiple languages, and this supports tagging content in different languages and retrieval by users of different languages.
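
As a minimal sketch of how preferred, alternative, and hidden labels support search, the following Python uses a simplified `Concept` class. The class, the `find_concept` helper, and the sample labels are assumptions for illustration only; this is not the SKOS specification itself or any tool's API.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Simplified stand-in for a SKOS concept, holding labels only."""
    pref_label: dict                                    # language code -> preferred label
    alt_labels: dict = field(default_factory=dict)      # language code -> list of synonyms
    hidden_labels: dict = field(default_factory=dict)   # e.g. common misspellings

    def all_labels(self):
        """Collect every label in every language for matching purposes."""
        labels = list(self.pref_label.values())
        for group in (self.alt_labels, self.hidden_labels):
            for values in group.values():
                labels.extend(values)
        return labels


def find_concept(query, concepts):
    """Return the first concept with any label equal to the query (case-insensitive)."""
    q = query.strip().lower()
    for concept in concepts:
        if q in {label.lower() for label in concept.all_labels()}:
            return concept
    return None


# Hypothetical sample concept with English and German labels.
concepts = [
    Concept(
        pref_label={"en": "Myocardial infarction", "de": "Myokardinfarkt"},
        alt_labels={"en": ["Heart attack", "MI"], "de": ["Herzinfarkt"]},
        hidden_labels={"en": ["hart attack"]},
    ),
]

match = find_concept("heart attack", concepts)
print(match.pref_label["en"])  # prints "Myocardial infarction"
```

A search on the German synonym or on the misspelling resolves to the same concept, which is what allows one tag to serve users with different vocabularies and languages.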

In ontologies, there exists the OWL property of sameAs for equivalence of individuals and equivalentClass for equivalence of classes, but both tend to be used to declare equivalence across different datasets rather than for use within a single ontology, as there is no designation of preferred and alternative names. So, these OWL properties are more like mapping properties than support of synonyms within a controlled vocabulary. As such they do not support the basic purpose of alternative labels in a taxonomy, which is to enable matches to support searching on variant labels and tagging despite different words in texts for the same thing.

The SKOS data model for taxonomies defines properties for scope notes, editorial notes, history notes, examples, and definitions. These are standardized fields and thus the meanings of these notes fields are consistent across taxonomies, supporting interoperability and migration. In OWL ontologies there exists an annotation property, but its use broadly includes labels, definitions, synonyms, attribution, notes, or comments.  With such inconsistent use, annotations are not well supported in importing, exporting, or linking of ontologies.

SKOS also has a set of mapping relationships. While OWL supports equivalence with sameAs and equivalentClass, SKOS taxonomies have not only the equivalence relationship, exactMatch, but also closeMatch, narrowMatch, broadMatch, and relatedMatch, and thus all concepts in two separate taxonomies can be mapped to each other, unlike two ontologies which may share only a few matches. The full mapping of one taxonomy to another supports various uses, including using one taxonomy in the front end and the other in the back end, tagged to content.
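
The front-end/back-end use of mappings can be sketched in Python as follows. The concept names, the `MAPPINGS` table, and the choice of which relationship types to follow for retrieval are all assumptions made for illustration; a real implementation would decide these per use case.

```python
from collections import defaultdict

# Hypothetical mappings: (front-end concept, SKOS mapping relation, back-end concept).
MAPPINGS = [
    ("Sofas", "exactMatch", "Couches"),
    ("Cars", "closeMatch", "Passenger Vehicles"),
    ("Laptops", "broadMatch", "Computers"),      # "Computers" is broader than "Laptops"
    ("Phones", "narrowMatch", "Smartphones"),    # "Smartphones" is narrower than "Phones"
    ("Home Office", "relatedMatch", "Desks"),
]


def build_index(mappings):
    """Group mapping targets by their front-end source concept."""
    index = defaultdict(list)
    for source, relation, target in mappings:
        index[source].append((relation, target))
    return index


def backend_concepts(front_concept, index,
                     relations=("exactMatch", "closeMatch", "narrowMatch")):
    """Back-end concepts whose tagged content should be retrieved for a front-end concept.

    By default, broader and merely related matches are skipped, on the assumption
    that they would return content that is too general or off-topic.
    """
    return [target for relation, target in index.get(front_concept, [])
            if relation in relations]


index = build_index(MAPPINGS)
print(backend_concepts("Sofas", index))    # prints ['Couches']
print(backend_concepts("Laptops", index))  # prints []
```

Passing a wider `relations` tuple, for example one that includes "broadMatch", trades precision for recall.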

Finally, taxonomies are better suited for various content-based implementations and applications, especially with out-of-the-box systems, such as web content management systems, digital asset management systems, SharePoint, etc. A taxonomy modeled as several SKOS concept schemes can designate each concept scheme as a facet in a faceted search/browse system, in which a facet serves as a filter. A taxonomy built as a hierarchy tree can be implemented so that users can expand the tree to browse to narrower concepts and then they can retrieve content tagged with the most specific concept desired. Ontologies, even if they contain hierarchies of classes and subclasses, are typically visualized as graphs, and any hierarchies are not displayed in a front-end application. Furthermore, ontology visualizations are usually not linked to actual content or data as they serve just for visualizing.
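
A minimal Python sketch of concept schemes acting as facets might look like this. The scheme names, concepts, content items, and the filtering rule (any match within a facet, all facets must match) are hypothetical and not taken from any particular system.

```python
# Hypothetical concept schemes, each serving as one facet.
CONCEPT_SCHEMES = {
    "Document Type": {"Policy", "Report", "Form"},
    "Topic": {"Benefits", "Travel", "Security"},
    "Location": {"Boston", "Berlin"},
}

# Hypothetical content items tagged with concepts from the schemes above.
CONTENT = [
    {"title": "Travel reimbursement form", "tags": {"Form", "Travel"}},
    {"title": "Berlin office security policy", "tags": {"Policy", "Security", "Berlin"}},
    {"title": "Annual benefits report", "tags": {"Report", "Benefits"}},
    {"title": "Boston travel policy", "tags": {"Policy", "Travel", "Boston"}},
]


def facet_filter(items, selections, schemes):
    """Filter items by facet selections.

    selections maps a facet name to the set of concepts chosen in that facet.
    An item passes if it has at least one chosen concept in every selected facet.
    """
    for facet, chosen in selections.items():
        unknown = chosen - schemes[facet]
        if unknown:
            raise ValueError(f"Not in facet {facet!r}: {sorted(unknown)}")
    return [
        item for item in items
        if all(item["tags"] & chosen for chosen in selections.values())
    ]


results = facet_filter(
    CONTENT,
    {"Document Type": {"Policy"}, "Topic": {"Travel"}},
    CONCEPT_SCHEMES,
)
print([item["title"] for item in results])  # prints ['Boston travel policy']
```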

In sum, while ontologies add richer semantics/meaning to relationships and attributes, taxonomies have richer semantics/meaning for concepts. Combining a taxonomy and ontology can bring the best of both worlds, and semantic web standards of SKOS, OWL, and RDF-S are all compatible for combining within a single project, since they are all based on the RDF (Resource Description Framework) data model. However, in many cases, a taxonomy with rich meaning for concepts, support for synonyms in search and tagging, along with interactive displays of hierarchies and/or facets, is all that is needed. You can always add an ontology later.

Sunday, November 9, 2025

Schema Vocabularies and Value Vocabularies

There are different types of controlled vocabularies for information and knowledge management. Usually, we think of the various kinds of controlled vocabularies for purposes of tagging and finding information, such as term lists, authority files, thesauri, and taxonomies. In the broader context of information and knowledge management, there also exist higher-level controlled vocabularies called schema vocabularies. In this context, the better known (default) controlled vocabularies comprising specific concepts or terms for tagging content are called value vocabularies, since their terms/concepts are considered values.

This dichotomy of schema and value vocabularies occurs particularly within the context of metadata. Metadata management comprises two components: (1) a list of metadata types, also called elements, properties, or fields; and (2) the terms or values possible for each metadata element. I discussed types of metadata in more detail in my last blog post, "Types of Metadata Schema." Thus, a schema vocabulary comprises the names of metadata elements, and a value vocabulary is a list of terms/concepts for a specific metadata element. For example, a schema vocabulary might include Country, Language, Source, and Topic; and the multiple value vocabularies would be the lists of approved countries, languages, sources, and topics. It should be noted that in some systems, e.g. RDF, OWL, etc., the distinction between metadata elements and metadata values can be fuzzy. Furthermore, not all schema vocabulary elements have a corresponding value vocabulary (a controlled vocabulary), as some metadata elements may be for such values as title, description, and date. 
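
The relationship between the two kinds of vocabularies can be sketched in Python. The element names follow the example above; the specific values, the `validate_record` helper, and the sample record are hypothetical.

```python
# Schema vocabulary: the names of the metadata elements.
SCHEMA_VOCABULARY = ["Country", "Language", "Source", "Topic", "Title", "Date"]

# Value vocabularies: controlled lists for some, but not all, elements.
# Title and Date have no value vocabulary, so any value is accepted for them.
VALUE_VOCABULARIES = {
    "Country": {"Canada", "Germany", "Japan"},
    "Language": {"English", "French", "German", "Japanese"},
    "Source": {"Newswire", "Trade Journal", "Company Blog"},
    "Topic": {"Mergers", "Product Launches", "Regulation"},
}


def validate_record(record):
    """Return a list of problems found in one metadata record."""
    errors = []
    for element, value in record.items():
        if element not in SCHEMA_VOCABULARY:
            errors.append(f"{element}: not an element in the schema vocabulary")
        elif element in VALUE_VOCABULARIES and value not in VALUE_VOCABULARIES[element]:
            errors.append(f"{element}: {value!r} is not in the value vocabulary")
    return errors


record = {
    "Title": "Q3 market update",   # free text, no value vocabulary
    "Country": "Canada",           # valid controlled value
    "Language": "Klingon",         # not in the Language value vocabulary
    "Author": "J. Doe",            # not an element of this schema
}
for problem in validate_record(record):
    print(problem)
```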

In my observation, we speak of “vocabularies” rather than “controlled vocabularies” in this context, especially with respect to schema, for various reasons. Schema vocabularies are referred to simply as “vocabularies,” rather than “controlled vocabularies,” because they are not traditional controlled vocabularies used for tagging, and also because their “control” is different from the control of value vocabularies. Value vocabularies can be changed but through defined policies and procedures, which depend on the implementation and ownership, and changes can be frequent, e.g. weekly, monthly, quarterly, or annually. Schema vocabularies, on the other hand, are intended to be standard, and are updated only very infrequently, such as once every 5-10 years, and usually by a standards body. Schema vocabularies provide control by their very nature. Meanwhile, it is often necessary to call out the controlled feature of value vocabularies, since some metadata properties may have uncontrolled keywords as their values.

Schema vocabularies may be metadata schema, such as Dublin Core (for published resources) or IPTC metadata (for photos), but other kinds of information and content management schema can also be considered as schema vocabularies in that a “vocabulary” defines the various elements. Such other schema vocabularies include SKOS (Simple Knowledge Organization System), DCAT (Data Catalog Vocabulary), and iiRDS (intelligent information Request and Delivery Standard), among others. Our panel “Using Schema and Value Vocabularies to Provide Consistency Across Structured Content” addressed these schema and other data frameworks, which are similar to but not the same as schema, such as OWL and DITA, at the recent DCMI (Dublin Core Metadata Initiative) conference in Barcelona in October.  Other speakers were Joseph Busch, who had the idea of this topic for a conference panel, Lief Erickson, Noz Urbina, and Peter Winstanley.

DCMI 2025 Panel: "Schema and Value Vocabularies for Consistency"

My presentation on the DCMI panel was "Schema and Value Vocabularies for Thesauri and Taxonomies," which explained that SKOS is a schema vocabulary, and specific SKOS-based taxonomies and thesauri are value vocabularies. SKOS (Simple Knowledge Organization System) is the W3C data model schema for knowledge organization systems, especially taxonomies and thesauri. It can also be considered a schema vocabulary, because it has standard elements with defined display names and machine-readable concatenated forms. In fact, the designation “elements” is what is used in the SKOS model. SKOS, however, is a special kind of schema vocabulary, and it’s not a metadata schema. When SKOS-based taxonomies or thesauri serve as the value vocabularies for metadata elements, those metadata elements are managed as specific SKOS Concept Schemes. In a faceted taxonomy, each Concept Scheme serves as a facet.

Taxonomists don’t usually think of vocabularies being classified as either "schema vocabularies" or "value vocabularies." However, as taxonomies have increasingly been integrated with metadata and serve purposes beyond just browsing, searching and retrieving content, it’s important to see the bigger picture of where taxonomies as value vocabularies fit in, and where taxonomies can provide more benefits.

Friday, October 31, 2025

Types of Metadata Schemas

Taxonomies or sets of controlled vocabularies are typically implemented as values for  various metadata elements (also called metadata properties or fields). Metadata elements that contain controlled vocabularies could be Topic, Activity, Location, Organization Name, People/Role Type, Document Type, Content Language, etc. These are often implemented as facets in faceted search, although they do not have to be. There may be additional metadata elements for non-taxonomy values, such as Document Title, Image Caption, Creator/Author, Creation Date, Rights Status, etc. In addition to designing taxonomies, in my consulting projects I often also design such broader metadata schemas.


Custom Metadata Schemas

A custom (use case-specific) metadata schema specifies which metadata elements to include for different purposes. These include content tagging and management, content workflow management, end-user search filters, or merely display on content records for identification.

A custom metadata schema may specify the following:

  • A definition for each metadata element
  • Sample values for each metadata element
  • In what user interfaces the metadata element appears
  • The ownership or authority of a metadata element, whether a department or role 

A custom metadata schema also specifies rules about the application of each metadata element, including:

  • The value type for the metadata element (For example, controlled vocabulary terms, uncontrolled keywords, free text, date, integers, Boolean yes/no, etc.)
  • Whether assignment of a value from the metadata element is required or optional for each content item (or depends on the specific type of content item).
  • Whether the assignment of the values from the metadata element is limited to just one or can be multiple, which is referred to as “cardinality.” (For example, the assignment of only one Document Type but up to four Topics per content item.)
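
The application rules listed above (value type, required or optional, and cardinality) could be recorded and checked along the following lines. This is a minimal Python sketch; the `ElementRule` class, the `check_item` helper, and the sample schema are hypothetical, with the cardinality limits taken from the example of one Document Type and up to four Topics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementRule:
    """Application rules for one metadata element."""
    value_type: str             # e.g. "controlled", "free_text", "date"
    required: bool              # must every content item have a value?
    max_values: Optional[int]   # cardinality limit; None means unlimited


# Hypothetical custom metadata schema.
SCHEMA = {
    "Document Title": ElementRule("free_text", required=True, max_values=1),
    "Document Type": ElementRule("controlled", required=True, max_values=1),
    "Topic": ElementRule("controlled", required=False, max_values=4),
}


def check_item(metadata, schema):
    """Check one content item's metadata (element name -> list of values)."""
    problems = []
    for name, rule in schema.items():
        values = metadata.get(name, [])
        if rule.required and not values:
            problems.append(f"{name}: required but missing")
        if rule.max_values is not None and len(values) > rule.max_values:
            problems.append(
                f"{name}: {len(values)} values, maximum is {rule.max_values}"
            )
    return problems


item = {
    "Document Title": ["Travel policy 2026"],
    "Document Type": ["Policy", "Form"],  # violates the limit of one
    "Topic": ["Travel"],
}
print(check_item(item, SCHEMA))  # prints ['Document Type: 2 values, maximum is 1']
```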

Example of a Custom Metadata Schema
 

Standard Metadata Element Sets and Schemas

In the context of metadata schemas, there exist not only these custom metadata schemas, but also standard metadata sets of elements and their schemas. They provide predefined metadata elements that are intended to be sufficiently generic for various use cases. Perhaps the most widely used standard metadata schema is Dublin Core, which is a set of 15 basic (core) elements intended for published documents. These elements are Title, Subject, Description, Type, Source, Relation, Coverage, Creator, Publisher, Contributor, Rights, Date, Format, Identifier, and Language. There are other standard metadata schemas that are somewhat more specific for a subject domain, such as IPTC (International Press Telecommunications Council) metadata, which is intended for images. When a standard data notation, such as XML or RDF, whose specification may also be part of the standard metadata schema, is used, metadata can then be shared.

Standard metadata schemas include information for each element such as definition and type, but unlike custom metadata schemas, standard metadata schemas do not include any instructions on their application, such as cardinality and implementation, as that depends on each use case. Therefore, if you choose to apply a standard metadata schema, you need to additionally decide and document how it should be applied, especially which elements are to be used for which purposes, in which systems, along with metadata element-specific rules of requirements and cardinality, as described above. This kind of document is referred to as an application profile.

My most recent conference presentation, a panel at the DCMI (Dublin Core Metadata Initiative) conference in Barcelona, October 22-25, addressed application profiles. Panel organizer, Joseph Busch, explained in his presentation: “An application profile defines a specific set of requirements, settings, and metadata for a particular application to ensure compatibility and functionality. The profile adapts general standards or frameworks to meet the needs of a specific use case, for example.”

Taxonomists usually don’t speak to their stakeholders or clients of "application profiles," because such specifications are typically already included within a larger taxonomy governance plan, something taxonomists commonly create and promote. When taxonomists work specifically with metadata experts, however, they should consider the specific needs of an application profile.

Finally, a standard metadata schema, with its predefined labels for metadata elements, can also be considered a kind of (controlled) vocabulary. This is the topic of my next blog post, "Schema Vocabularies and Value Vocabularies."