Showing posts with label Hierarchical taxonomy. Show all posts

Monday, March 31, 2025

Customizing Taxonomy Hierarchies

Taxonomies need to be custom-created for their purposes to be most effective. Basically, a taxonomy comprises the concepts or terms that reflect the subject domain of the content that will be tagged and retrieved with the aid of that taxonomy. Taxonomies must also be customized to the requirements (or limitations) of the implemented search technology and the user interface, and ideally the taxonomy is also customized to the needs and preferences of the users. This includes the taxonomy design aspects of size, degree of detail, use of synonyms/variants, use of hierarchy, and implementation as facets.

Taxonomy customization usually focuses on the concepts/terms/labels and not so much on the exact hierarchy of grouping narrower concepts under broader concepts, other than perhaps limiting the number of hierarchical levels. While the selection and definition of concepts depends on the context of the content, the hierarchical relationships between concepts are typically independent of any specific content and are usually dependent only on the context of the taxonomy itself. Such a context-independent hierarchy is what enables a single taxonomy to be used for multiple different content items of different content creators. This is also the approach used in designing classification systems, which are intended for broad, generic use.
 

Why Customize Hierarchy

However, a customized taxonomy may be designed for a rather specific body of content, and then the hierarchy may depend on the context of that overall body of content, if not the specific content items. For example, the concept “Piano” is often considered narrower to “Musical instruments”, but in certain contexts it may be narrower to “Furniture,” such as for the contexts of interior design, furnishing a bar or restaurant, or for moving and storage services. Furthermore, I would not always recommend that “Piano” be narrower to both broader concepts in the same taxonomy (a taxonomy feature known as “polyhierarchy”), because the same taxonomy might not be used for both contexts. It depends.


When structuring a taxonomy hierarchy, the use and purpose of the hierarchy need to be considered. A hierarchy is not created simply because it’s a taxonomy and thus traditionally has hierarchy. Possible uses of hierarchy include:

  • Supporting browsing and navigation to guide users to the desired concept.
  • Providing context for concepts to support tagging, whether manual or automated.
  • Enabling “recursive” or “rolled up” retrieval, so that a user’s selection of a concept retrieves not only what has been tagged to that concept but also what has been tagged to all of its narrower concepts.
  • Enabling expansion of a search, so that if there are too few or no results for a specific concept, the retrieval set can be expanded to content tagged with the broader concept and/or other narrower concepts of it.
  • Instructing users on the appropriate classification and organization of information.

Usually, the same hierarchy can support all of the above goals, although occasionally there are conflicting needs.
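
As a minimal sketch of the “rolled up” retrieval and search expansion uses listed above, assume a toy taxonomy stored as a mapping from each concept to its narrower concepts, plus a tagging index. The concept names, document IDs, and function names here are all invented for illustration, and the hierarchy is assumed to contain no cycles.

```python
# Hypothetical toy data: concept -> narrower concepts, and concept -> tagged items.
NARROWER = {
    "Music events": ["Music rehearsals", "Music performances"],
    "Music rehearsals": [],
    "Music performances": ["Recitals"],
    "Recitals": [],
}
TAGGED = {
    "Music events": {"doc1"},
    "Music rehearsals": {"doc2"},
    "Music performances": set(),
    "Recitals": {"doc3"},
}


def rolled_up(concept):
    """Return items tagged to the concept or to any of its narrower concepts."""
    items = set(TAGGED.get(concept, set()))
    for child in NARROWER.get(concept, []):
        items |= rolled_up(child)
    return items


def broader_of(concept):
    """Return the broader concepts of a concept (the inverse of NARROWER)."""
    return [parent for parent, children in NARROWER.items() if concept in children]


def expanded(concept, minimum=1):
    """Add the broader concepts' rolled-up results when too few items are found."""
    items = rolled_up(concept)
    if len(items) >= minimum:
        return items
    for parent in broader_of(concept):
        items |= rolled_up(parent)
    return items
```

Selecting “Music events” with rolled_up() returns the items tagged to it and to everything beneath it, while expanded() only widens the result set when the selected concept alone yields fewer than the requested minimum.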

Customizing Hierarchy Example

The need for customizing hierarchy became especially clear to me in a recent taxonomy consulting project I did for the business of event venue space rentals. Types of spaces (structures, rooms, etc.) were grouped under broader concepts by their potential use, rather than by structural type. To a lesser extent, events or activities for spaces were also sometimes grouped by the type of space that might be suitable. For example, a generic taxonomy might include “Dance class” and “Technical training” both under the same broader concept for “Classes/training,” but because these different types of classes need different kinds of spaces, in this taxonomy they were put in different parts of the taxonomy hierarchy. “Dance class” was made narrower to “Dance event,” and “Technical training” was made narrower to “Training.”

The hierarchy of concepts in a taxonomy used to tag images may also be structured differently than in a taxonomy for tagging text content. In this case, for example, the broader grouping concepts “Small meeting” and “Large event” had been created, which may not seem logically needed when the range in number of guests was an additional search attribute/filter. However, these concepts are quite useful for tagging images that may depict a small or large event but do not carry counts of people. Another example is grouping the activities of music rehearsals/practices together with music performance events under the same broader concept of “Music events.” Although the activities of organizing rehearsals and organizing performances are quite different from each other, the venues that are suitable for each, and their images, are similar.

Despite their similarities in scope and concepts, a taxonomy for venue rentals should not be the same as a taxonomy for real estate, meaning the long-term lease or sale of properties (focusing on the space but agnostic to the use), nor for events management (focusing on the details of events and less so on space), nor for equipment sales and rentals (focusing on the equipment and less on the use). Even when the concepts are the same, the hierarchy may differ. While the inclusion of concepts and their labels should consider the content, the design of the hierarchy should consider the taxonomy’s use.

Sunday, March 24, 2024

History of Modern Information Taxonomies

The word “taxonomy” was coined in 1813 by the Swiss botanist A. P. de Candolle, who developed a new method of classifying plants. The word is derived from the combination of Greek words τάξις (taxis), meaning “order” or “arrangement,” and νόμος (nomos), meaning “method” or “law.” The designation of taxonomy was then applied after the fact to Carl Linnaeus’s binomial nomenclature system, which had initially been published under the title Systema Naturae in 1735.

Today’s information taxonomies have their origins in a combination of classification systems, library subject heading schemes, and literature retrieval thesauri, and thus have features that combine all of these. Despite their name, information taxonomies are closer to subject heading schemes and thesauri than they are to classification systems.

Classification systems

Classification systems have a multi-level hierarchy of classes, where a subclass is fully contained in its parent class, and consequently members of a subclass are also members of the parent class. Members (things) can belong to only one class, though. Historic examples include:

  • Linnaean classification of organisms (1735-1758)
  • Paris Bookseller's classification (1842)
  • International Classification of Diseases (originally Bertillon Classification of Causes of Death, 1860)
  • Dewey Decimal Classification (1876) and other library classifications
  • Industry classification systems:
    • Standard Industrial Classification System (U.S.) (1937)
    • International Standard Industrial Classification (U.N.) (1948)

The requirement that a thing (an organism, book, document, medical diagnosis, economic establishment) can go into only one class supports various purposes, which are not for information retrieval:

  • Understanding an organism’s evolutionary background; identifying potential medicinal herbs
  • Locating and reshelving a book on its shelf
  • Performing health data analysis from hospital records; billing health insurance companies appropriately
  • Doing economic analysis of industries by aggregate establishment data

When it comes to information resources, classification systems may be used to determine in what (virtual) file folder a document belongs, or to support machine-learning-based auto-classification.

Classification systems are also useful for data analysis, since content or records are assigned to only one classification, and this prevents any double counting. Large, data-heavy organizations might have developed their own internal classification systems for data tracking purposes. Such classifications do not serve the same purpose as a tagging/information retrieval taxonomy and should not substitute for a taxonomy but rather exist alongside it for separate purposes.

Subject heading schemes

Subject heading schemes were developed to help people find books, and later also articles, on various subjects with more detail and flexibility for growth than classification systems. Subject headings are used for cataloguing and indexing, not for classification. Unlike classification (for shelf location), in which an item has only one class, an item (book, article, other media) can have multiple subjects.

Features of subject heading schemes:

  • Alphabetical arrangement of a very large number of subjects and/or named entities (proper nouns)
  • Cross-references of See (Use) and See also (Related)
  • Headings with large numbers of citations broken down to group the citations by a sub-heading or subdivision, in what is also called pre-coordination. For example, China – Foreign relations.

Back-of-the-book indexes, whose format evolved over the first half of the 20th century, follow a similar style.

Examples of early subject heading schemes:

  • Library of Congress Subject Headings (1898) and other national library systems
  • U.S. National Library of Medicine’s Medical Subject Headings (1954)

Library subject headings were adopted for periodical article indexes early on. The Readers’ Guide to Periodical Literature, published by the H. W. Wilson Company, had been using subject headings, including subdivisions and cross-references, since shortly after its introduction in 1901 (as can be seen in the 1900-1905 cumulative index excerpted in the screenshot below).

(The two-digit years are from the prior century.)

Eventually, subject heading schemes adopted the thesaurus features of Broader term, Narrower term, and Related term relationships, as was the case for Library of Congress Subject Headings, starting in 1985. Thus, subject heading schemes and thesauri have become very similar. The name “heading” in subject headings implies that there also exist some sub-headings/subdivisions, a feature which is not typical of thesauri, though.

Thesauri

Information thesauri (in contrast to a dictionary thesaurus, like Roget’s) emerged in the mid-20th century outside of libraries for the more specialized subject needs of the federal government, scientific publishers, and technology companies. The word “thesaurus” was first used to refer to a controlled vocabulary, as a set of words/terms, not classification codes, for information retrieval in the 1950s.

Early thesauri include:

  • E. I. Dupont de Nemours Company’s thesaurus (1959)
  • Thesaurus of Armed Services Technical Information Agency (ASTIA) Descriptors, U.S. Department of Defense (1960)
  • Chemical Engineering Thesaurus, published by the American Institute of Chemical Engineers (1961)

Additional professional organization publishers of scientific journals created their own thesauri in the 1960s. Dialog, the first online information service for article citations, which also utilized thesauri of information publishers, was launched in 1966.

Soon thereafter, standards for thesauri were developed and published:

  • UNESCO Guidelines for the establishment and development of monolingual thesauri (1970)
  • DIN 1463 (Deutsches Institut für Normung) Guidelines for the establishment and development of monolingual thesauri (1972)
  • ISO 2788 Guidelines for the establishment and development of monolingual thesauri (1974) (superseded by ISO 25964-1:2011)
  • ANSI American National Standard for Thesaurus Structure, Construction, and Use (1974) (superseded by ANSI/NISO Z39.19-1993)

Modern information taxonomies

The word “taxonomy” for a hierarchical structure (like a classification scheme) of terms for tagging and retrieval (like a thesaurus) gradually became popular in the 1990s. These new taxonomy-like thesauri caught on largely due to advancements in software and website user interfaces that enabled interactive displays of hierarchies. Taxonomies had the same primary purpose as thesauri, which is information findability and retrieval, but taxonomy implementations introduced new designs for browsing and expanding hierarchies. It was found that “taxonomy” also tended to resonate with business audiences better than “thesaurus.” A market for business and commercial taxonomies started to be recognized by software vendors and by consultants by the end of the 1990s.

Combining an interactive user interface with a database enabled the introduction of dynamic filters or refinements of searches by selected taxonomy terms based on different aspects, and thus faceted taxonomies emerged and have since become a popular, if not dominant, implementation of taxonomies for many different use cases. Faceted taxonomies, by combining search terms for refinement, do not need to be as large and detailed as thesauri.

As for the next chapter in the history of taxonomies, that involves a convergence with ontologies. You can read more about that in my past blog article “Taxonomies vs. Ontologies.”

 

Monday, May 29, 2023

Taxonomies and ChatGPT

ChatGPT, generative AI, and large language models (LLMs) are hot topics of interest in the fields of data, information, and knowledge management. LLMs dominated the keynote presentations and the networking conversations at the Knowledge Graph Conference in New York and were also discussed in presentations and panels of this conference and of Data Summit in Boston, both of which I attended this month. The technology is relevant to taxonomies as well.

ChatGPT is the user interface application on top of GPT (Generative Pre-Trained Transformer), a publicly available LLM developed by OpenAI, which is now in version 4. ChatGPT is thus a form of generative AI, in how it generates answers. There are many other LLMs (Neural network-based AI, trained with deep learning on very large volumes of text), including those which are proprietary, restricted, or for non-commercial research, but only some have generative AI user interfaces. Although we may think of generative AI for providing answers to questions, it can do a lot more, including tasks related to taxonomies.

Organizing terms into hierarchies

Building a taxonomy is a combination of top-down design (identifying the top concepts or facets) and bottom-up building (identifying specific concepts from content analysis). The top level of a taxonomy is designed to serve user needs and thus should be based on stakeholder interviews, surveys, and brainstorming workshops, which is not something ChatGPT can do. The bottom-up building of a taxonomy, based on terms extracted from content or from search logs, may benefit from some AI involvement.

I have made a few test requests of ChatGPT for “Put the following list of terms into a hierarchical taxonomy…,” and the results are bulleted lists with indented narrower concepts. ChatGPT can also generate a taxonomy as machine-readable SKOS in a requested RDF serialization format, as Bob DuCharme explained in his May 20 blog post “Getting ChatGPT to turn a flat vocabulary list into a hierarchical taxonomy.”
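
To give a sense of what “machine-readable SKOS” means here, the following is a rough sketch of my own (not taken from the blog post mentioned above) that writes a small hierarchy as SKOS concepts in the Turtle serialization. The example.org namespace, the function name, and the sample concepts are placeholders, and the sketch assumes labels contain no quotation marks or other characters that would need escaping.

```python
def to_skos_turtle(hierarchy, base="http://example.org/taxonomy/"):
    """Serialize {concept label: [narrower labels]} as SKOS concepts in Turtle."""

    def iri(label):
        # Build a placeholder IRI from the label; real taxonomies often use opaque IDs.
        return "<" + base + label.lower().replace(" ", "-") + ">"

    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for label, narrower in hierarchy.items():
        statements = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
        statements += [f"skos:narrower {iri(child)}" for child in narrower]
        # Statements about one concept are separated by ";" and closed with ".".
        lines.append(f"{iri(label)} " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines)


example = {"Musical instruments": ["Pianos"], "Pianos": []}
print(to_skos_turtle(example))
```

Output of this kind can be loaded into taxonomy management software or an RDF tool for validation, which is a sensible check to run on any SKOS that an LLM generates as well.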

As with card sorting exercises, you can specify the top categories/concepts (like a “closed card sort”), or you can let ChatGPT create the top categories (like an “open card sort”). In either case, results are better with context, of course, so you should also tell ChatGPT what the subject domain or context is. Asking for a hierarchical taxonomy sometimes results in a third level of hierarchy, and not just a single level of grouping. Near duplicates usually appear next to each other in the list, and the taxonomist can then decide if and how to merge them into a single concept.

It is particularly with long lists of terms that automated methods can save the taxonomist’s time. If a taxonomist comes up with terms based on manual content analysis, stakeholder interviews, or submitted lists from subject matter experts, the term lists tend not to be very long, and even the process of coming up with the terms tends to include some thoughts toward categorization at the same time. Longer term lists (such as several hundred terms) are derived from automated term extraction (using text analytics technologies) across a corpus of dozens or hundreds of documents and from search log reports. ChatGPT is practical for putting these long lists of terms into draft hierarchies. There are inevitably some taxonomic errors in the results, which should be obvious to any taxonomist. For example, I have seen duplicated terms on different levels of the hierarchy.
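
With several hundred terms, duplicates are easier to catch with a small script than by eye. The following is a minimal sketch, assuming the draft hierarchy comes back as a bulleted list indented by two spaces per level; the function names and sample terms are invented for illustration. It only flags candidates, and the taxonomist still decides whether a repeat is an error or an intended polyhierarchy.

```python
from collections import defaultdict


def parse_indented(text, indent=2):
    """Parse an indented bulleted list into (depth, term) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stripped = line.lstrip(" ")
        depth = (len(line) - len(stripped)) // indent
        # Drop the leading bullet character and surrounding spaces.
        term = stripped.lstrip("-*• ").strip()
        pairs.append((depth, term))
    return pairs


def duplicated_terms(pairs):
    """Return terms that occur more than once, mapped to the depths where they occur."""
    depths = defaultdict(list)
    for depth, term in pairs:
        depths[term.lower()].append(depth)
    return {term: found for term, found in depths.items() if len(found) > 1}


draft = """\
- Health care
  - Hospitals
  - Clinics
    - Hospitals
- Insurance
"""
print(duplicated_terms(parse_indented(draft)))
```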

In both lists of extracted terms and search log lists, terms occur that are not suitable as concepts for a taxonomy, such as verbs and adjectives or vague words. ChatGPT understands grammatical rules, so my prompt also says “Include in the taxonomy only nouns and noun phrases and omit the other terms.”

Generating alternative labels (“synonyms”) for concepts

Asking ChatGPT to “provide a list of synonyms for…” a given term can also be helpful for coming up with alternative labels for taxonomy concepts. Alternative labels should be customized for the context of the content and users, so alternative labels for a concept will vary from one taxonomy to another. An external source, such as ChatGPT, should not be relied upon as the only source for alternative labels, but merely as a supplemental source of suggestions to be considered.

Again, context can help and should be provided. I asked “Provide a list of synonyms for ‘healthcare’” and got 20 terms. But then when I asked “Provide a list of synonyms for health care, meaning the industry,” I received a slightly more focused list of 15 terms. Interestingly, the two-word variant “health care” was not on the list, so “synonyms” is understood by ChatGPT to mean different words with the same meaning and not orthographic variations. Nevertheless, even 15 terms are too many, and the taxonomist should select from the list of suggestions. It might be a good idea to then test search the suggested alternative labels in the content and system being used.

Although by strict definition a “synonym” is a single word with the same meaning as another word, ChatGPT provides acceptable synonyms for terms which are multi-word phrases, or synonymous multi-word phrases, such as “Chemical manufacturing and distribution” provided as a synonym for “chemical industry.”


Other taxonomy-related uses of ChatGPT

Getting help in designing an ontology (a more complex, yet high-level semantic model with defined classes of concepts, customized relationships, and attributes) is also possible with ChatGPT or other LLMs. Again, submitting the request multiple times with slight variations will yield multiple different responses for the ontologist to consider and select ideas from. Ontologies are not expressed in simple text, though, so the prompt should specify the output format, such as RDF Turtle (TTL). Dean Allemang, author of Semantic Web for the Working Ontologist, has recently written multiple articles (medium.com/@dallemang) on ChatGPT and ontologies/knowledge graphs.

ChatGPT can also be used for comparing lists of terms, data conversion, and basic coding, which may be useful for taxonomists who lack coding skills. It can convert taxonomy or ontology data from one data format to another (although taxonomy/ontology management software also imports/exports in multiple formats). Taxonomies and ontologies in their raw data format are most commonly expressed in the RDF (Resource Description Framework) data model, which has various serialization formats: RDF/XML, JSON, JSON-LD, Turtle (.ttl), etc., and ChatGPT can convert data from one to another. Data extraction can also be done with ChatGPT. For example, knowledge management professional Camille Mathieu recently shared in a LinkedIn post how she used ChatGPT to write a Python script to extract text and metadata from PDFs.

Perhaps what is most intriguing as a future implementation of taxonomies and ChatGPT is to go in the other direction and have knowledge organization systems, such as taxonomies, support the creation and use of queries (called “prompts”) for generative AI, to obtain better results. This requires some back-end development, though, and is not merely a matter of putting a taxonomy into a prompt. Since a taxonomy is created for a specific subject domain, the questions need to be confined to the domain of the taxonomy. Semantic Web Company has developed a simple, publicly accessible demo, “PoolParty Meets Chat GPT,” whereby you can compare the results of questions you ask in the subject area of ESG (Environmental, Social, and Governance) that are submitted directly to ChatGPT with those which are filtered through an ESG taxonomy and knowledge graph (managed in PoolParty software) so that the questions are enriched before being sent to ChatGPT. The semantically enriched questions generate answers that have more detail, better accuracy, and even web links to definitions and other articles.

Conclusions

While it’s arguable whether ChatGPT alone is a good way to obtain “facts,” there is no doubt that it is a good way to get suggestions and ideas. These suggestions can support the work of taxonomists and ontologists, and taxonomies and ontologies in turn can support the results of ChatGPT and other LLMs. Because there will be errors from ChatGPT, it should not be used to generate taxonomies by those who are not already knowledgeable about taxonomy requirements and best practices, nor should it be used as a substitute for the expertise of taxonomists.

I hope to experiment more with ChatGPT for taxonomies and share additional details in future blog posts.

Friday, December 30, 2022

Taxonomy Definition

I usually explain that a taxonomy is a structured kind of controlled vocabulary, which is a list of terms (or concepts) usually used to tag content to aid in its retrieval. The structure can be hierarchical, faceted, or a combination. Other people have defined taxonomies for a general audience in more simplistic ways as a kind of hierarchical classification system. So, while a taxonomy has two main features (naming and structure), my preferred definition has focused on the controlled vocabulary and naming aspect, whereas other definitions focus on the hierarchical classification aspect of taxonomies. However, a taxonomy and a classification system are not necessarily the same. While it is understandable that a definition is simplified for a general audience, it should not be simplified to the extent of being misleading.

I have blogged previously on the differences between taxonomies and classification systems, so I won’t repeat all the differences again.  The main point is that a classification system is generic and rigid and is intended to be used widely, such as the Dewey Decimal Classification for libraries, whereas a taxonomy tends to be customized for a particular use case and context and is flexible and undergoes changes.

Meanwhile, there are also a few well-known classification systems that are called “taxonomies,” such as the Linnaean taxonomy of organisms and Bloom’s taxonomy of educational objectives. These seem quite different from the information-retrieval type of taxonomy. The Linnaean hierarchical levels have names (Kingdom, Phylum, Class, etc.). The relationships of the hierarchical levels to each other do not span all of the types in the thesaurus standards: generic-specific, generic-instance, or whole-part. Rather, the Linnaean taxonomic relationships are generic-specific only, or more precisely that of member of class or subclass. Bloom's taxonomy has a completely different hierarchical model that does not follow thesaurus standards at all.

How does a taxonomy of concepts for information retrieval relate to a scientific taxonomy? They are similar, and the differences are not so great that they should be treated as different meanings of the word “taxonomy.” If we consider that taxonomies are systems to name and organize things hierarchically, then a taxonomy for information retrieval, comprised of terms for tagging and retrieving content (documents, images, etc.), can be considered a taxonomy of a controlled vocabulary, in contrast to taxonomies of things, such as organisms. This is a slightly different perspective from considering a taxonomy as a kind of controlled vocabulary, as I previously had. The following diagram illustrates a possible way to consider how information-retrieval taxonomies relate to classification systems and controlled vocabularies.

Diagram showing that information taxonomies are at the intersection of classification systems and controlled vocabularies

Several kinds of knowledge organization systems are defined by their published standards. For thesauri, there are ANSI/NISO Z39.19 and ISO 25964. For terminologies, there is ISO/TC 37/SC 3 and other related standards. For ontologies, there is OWL (Web Ontology Language) from the W3C. There is no standard, however, specifically for “taxonomies” or even for “classification systems,” which is a reason why these remain difficult to define. The designations “classification system,” “classification scheme,” and “taxonomy” have been used interchangeably.

Wikipedia provides the definition at the entry for Taxonomy: “A taxonomy (or taxonomical classification) is a scheme of classification, especially a hierarchical classification, in which things are organized into groups or types.” But then it goes on to say, “it may refer to a categorisation of things or concepts.” Thus, an information-retrieval taxonomy is a categorization of concepts (also called terms in a controlled vocabulary). It is not a classification system, since the goal is not to classify things, not even the things tagged with the taxonomy concepts, but rather to organize the set of concepts that have been identified as appropriate for tagging and retrieving a set of content.


Saturday, April 30, 2022

Polyhierarchy in Taxonomies

A defining characteristic of taxonomies is that terms/concepts are arranged in broader-narrower hierarchies, which may resemble tree structures. A limited number of top concepts each have narrower concepts, which in turn may have narrower concepts, etc., and the narrowest concepts at the bottom of the hierarchy are sometimes referred to as leaf nodes, as “leaf” extends the metaphor of “tree.” The tree model has its limits, though, because taxonomies may also have occasional cases of “polyhierarchy,” whereby a concept may have two or more broader concepts, instead of just one.

 

People who are new to taxonomies, however, might not consider polyhierarchies, because they tend to think of taxonomies as classification systems. Hierarchical information taxonomies have their origin in classification systems, such as the Linnaean taxonomy of organisms, library classification systems, and industry classification systems. Classification systems, however, do not allow polyhierarchy within the system. Originally, classification systems were for physical things, such as books, which can belong in only one place, so there could be no polyhierarchy. Standard classification systems, such as industry classification systems, were developed by governmental, international, or nongovernmental organizations with a primary purpose of gathering and organizing statistical data about classes, and thus polyhierarchy is not permitted, as it would lead to double-counting of members of a class.

 

The primary purpose of hierarchy in a taxonomy is to provide guided browsing of topics to end-users, who may start out looking at broad categories and then drill down to find the narrowest concept of interest. Thus, polyhierarchy serves the same purpose. The idea is that different people will start at different points at the top of the hierarchy to arrive at the same concept of interest, which is tagged to the same content set. A polyhierarchy should be implemented if the concept’s relationship is correctly and inherently hierarchical in both of its cases. An example of a polyhierarchy is Educational software, which has both Software and Educational products as broader concepts. Educational software is a kind of software, fully included within Software, and Educational software is a kind of educational product, fully included within Educational products.
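
As a small illustration of how a polyhierarchy gives users more than one browse path to the same concept, here is a sketch that stores each concept with a list of its broader concepts, using the Educational software example above. The data structure and function name are my own simplification, not a particular tool's model.

```python
# Concept -> list of broader concepts; names taken from the example above.
BROADER = {
    "Software": [],
    "Educational products": [],
    "Educational software": ["Software", "Educational products"],
}


def browse_paths(concept):
    """Return every top-down path through the hierarchy that leads to the concept."""
    parents = BROADER.get(concept, [])
    if not parents:
        return [[concept]]
    return [path + [concept] for parent in parents for path in browse_paths(parent)]


for path in browse_paths("Educational software"):
    print(" > ".join(path))
```

Both paths end at the same single concept, so whichever route a user takes, the same set of tagged content is retrieved.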

 



 

Taxonomy standards and polyhierarchy issues

 

Taxonomy/thesaurus standards (ANSI/NISO Z39.19 and ISO 25964) describe three kinds of hierarchical relationships (generic-specific, generic-instance, and whole-part), and polyhierarchy may exist within any of these types. Polyhierarchy that combines different hierarchical types, however, can be problematic, so it is best to avoid mixing hierarchical relationship types. For example, the following polyhierarchy mixes different types:

 

Washington, DC

Broader: United States (whole-part)

Broader: Capital cities (generic-instance)

 

The reason to avoid creating a mixed-type polyhierarchy is simply that the user experience of browsing the hierarchy can be compromised and become confusing. Extensive hierarchies with large numbers of narrower concept relationships would result. A hierarchical taxonomy tree should be designed with a dominant hierarchy design. An exception is a thesaurus, which is not designed so much for top-down browsing but for browsing from term to term. Mixing hierarchical types within a thesaurus is thus acceptable.
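
If the hierarchical relationship type is recorded for each broader/narrower link, mixed-type polyhierarchies can be listed for review. The sketch below assumes a simple list of (narrower, broader, type) rows, which is an invented format for illustration; the rows reuse the Washington, DC, and Educational software examples from this post.

```python
# (narrower, broader, relationship type); rows reuse examples from this post.
RELATIONS = [
    ("Washington, DC", "United States", "whole-part"),
    ("Washington, DC", "Capital cities", "generic-instance"),
    ("Educational software", "Software", "generic-specific"),
    ("Educational software", "Educational products", "generic-specific"),
]


def mixed_type_polyhierarchies(relations):
    """Return concepts whose broader relationships use more than one hierarchical type."""
    types = {}
    for narrower, _broader, rel_type in relations:
        types.setdefault(narrower, set()).add(rel_type)
    return {concept: kinds for concept, kinds in types.items() if len(kinds) > 1}


print(mixed_type_polyhierarchies(RELATIONS))
```

Only Washington, DC, is reported, because Educational software has two broader concepts of the same generic-specific type.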

 

It is also recommended to avoid creating hierarchical relationships across different facets in a faceted taxonomy. This is because facets are designed to be mutually exclusive, so that concepts from multiple facets can be used in combination to limit/filter/refine a search. As such, facets are designed to be distinct aspects. There could be an occasional exception of polyhierarchy, though, but more than 2-3 polyhierarchies across an entire faceted taxonomy should be a cause for review.

 

With the wider adoption of the SKOS (Simple Knowledge Organization System) model for taxonomies and in taxonomy management systems, taxonomies are more commonly organized into concept schemes. A concept scheme can be represented as a facet in a faceted taxonomy, but it is not limited to use as a facet. Utilizing concept schemes, it makes sense to have separate concept schemes with different hierarchical types, some for generic-specific (for types, categories, topics), one or more for whole-part (geography, organizational structures), and some containing lists of instances (named entities). In this model, Washington, DC, would be narrower only to the United States in the whole-part hierarchical concept scheme for geographic places. It could also be linked to Capital cities, which is in a different concept scheme for place types, with a different kind of relationship (“related” or perhaps a semantic relationship from an ontology).

 

Although SKOS permits hierarchical relationships across different concept schemes, it is best practice not to do this but rather to create hierarchical relationships and polyhierarchies confined within a concept scheme, just as it is recommended not to have polyhierarchy across facets.
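
Because SKOS itself will not stop a hierarchical relationship from crossing concept schemes, this practice has to be checked separately. A minimal sketch, assuming each concept belongs to exactly one concept scheme and that the data has been exported into the two simple structures below (scheme names and function name are invented):

```python
# Hypothetical data: concept -> concept scheme, and (narrower, broader) pairs.
IN_SCHEME = {
    "United States": "Places",
    "Washington, DC": "Places",
    "Capital cities": "Place types",
}
BROADER_PAIRS = [
    ("Washington, DC", "United States"),
    ("Washington, DC", "Capital cities"),
]


def cross_scheme_relations(pairs, in_scheme):
    """Return broader/narrower pairs whose two concepts are in different concept schemes."""
    return [(n, b) for n, b in pairs if in_scheme[n] != in_scheme[b]]


print(cross_scheme_relations(BROADER_PAIRS, IN_SCHEME))
```

The pair that is reported is the one that would be better modeled as a “related” or other non-hierarchical relationship, as described above.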

 

Additional polyhierarchy considerations

Polyhierarchy concerns concepts in the taxonomy, and it is not about objects, items, or assets that get tagged with taxonomy concepts, such as an individual publication, document, image, product record, etc. Each of these may get tagged with multiple taxonomy concepts, and as such may have multiple “classifications” and thus can appear as if they are in a polyhierarchy, if a frontend application displays tagged items as if they are leaf nodes in a taxonomy.

A polyhierarchy usually involves only two broader concepts, not more. Having more than two broader concepts is extremely rare. If you find yourself creating polyhierarchies of three or more multiple times in a taxonomy, check to make sure you are not doing something wrong with the hierarchy design.
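
A quick way to run such a check is to count the broader concepts of each concept. The sketch below assumes the hierarchy is available as simple (narrower, broader) pairs. The function name, the threshold parameter, and the “Smart speakers” example are invented for illustration only.

```python
from collections import defaultdict


def concepts_with_many_parents(pairs, threshold=3):
    """Map each concept with at least `threshold` broader concepts to those concepts."""
    parents = defaultdict(set)
    for narrower, broader in pairs:
        parents[narrower].add(broader)
    return {c: sorted(p) for c, p in parents.items() if len(p) >= threshold}


# Hypothetical (narrower, broader) pairs.
pairs = [
    ("Piano", "Musical instruments"),
    ("Piano", "Furniture"),  # ordinary polyhierarchy: two broader concepts
    ("Smart speakers", "Audio equipment"),
    ("Smart speakers", "Home automation"),
    ("Smart speakers", "Consumer electronics"),  # three broader concepts
]

to_review = concepts_with_many_parents(pairs)
print(to_review)
```

Only concepts at or above the threshold are reported, so an ordinary two-parent polyhierarchy such as “Piano” passes without comment.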

Some content management systems, which have built-in taxonomy management and tagging features, do not support polyhierarchy. The best known is SharePoint, with taxonomies managed in its Term Store feature. Taxonomy terms may be “reused” across Term Sets, but reuse is not permitted within a Term Set, where polyhierarchy is most suitable. See my past post, Polyhierarchy in the SharePoint Term Store, for more details.

Saturday, July 31, 2021

Taxonomies and Sitemaps

I was recently asked if a company website’s sitemap could serve as the start of a taxonomy for an organization. The sitemap, after all, includes all the relevant topics pertaining to an organization’s business offerings, and they are arranged in a hierarchy. I have previously blogged on the subject of why a website’s navigation is not a taxonomy in Navigation Schemes and Taxonomies. A sitemap is similar to a website’s navigation, but it goes deeper by including the titles or topics of web pages which are not included in the website’s menu, and it is not necessarily intended for user browsing. A sitemap may go five or six levels deep, whereas website navigation menus are usually only two levels deep. Therefore, a sitemap may seem as if it’s a taxonomy. However, just because a sitemap is as large and detailed as a taxonomy needs to be does not make it suitable as a taxonomy.

Different purposes

We need to understand what a taxonomy is for. It’s to aid users in locating desired content by topic-terms, which reflect the terminology of both the users and the content. Taxonomy terms are tagged/indexed to content that is relevant to the term. The starting point when creating a taxonomy is to identify the topics of the content and identify the topics of user interest or search, and then merge those topics into a taxonomy by bringing together different names for the same concept. The concepts are then structurally arranged to show the relationships between the terms, especially hierarchical relationships. The primary purpose of the hierarchy of terms in a taxonomy is to aid the users in finding the appropriate term. When browsing the taxonomy, they may find a broader term or narrower term that better describes their search goals. Then they can select that term to retrieve content that was tagged with the term.

A sitemap, on the other hand, lists all or most pages of a website, usually by page title and organized in the hierarchical structure of the website. The hierarchical structure of the website was designed to organize information in a logical manner for users to browse and explore, as considered by the information architect who designed the website. The sitemap thus reflects pages, which are often topics but not always. A page may have multiple topics of interest that a user might want to look up. A page is sometimes for performing a function or activity and not necessarily just a topic of information.

A sitemap is typically automatically generated from the page titles, and its primary purpose is not for users but for machines: sitemaps tell search engines about pages that are available for crawling on websites and can thus support search engine optimization (SEO). Sitemaps are also useful in planning the further development or organizational improvement of a website. Whether a sitemap should even be displayed to end users as a tool to find information on a website is questionable. If automatically generated, it's not designed for that purpose, but users could find it helpful, especially users who understand that it is merely the aggregation of page titles organized in the file structure of the website. Some websites make it available, and some do not. Some websites instead display a simplified sitemap that is designed to be a guide for users, but then it does not include all pages.

Different labels

The title names of pages, and thus of sitemap entries, often do not correspond to taxonomy terms. They could start with a verb for an activity, they could be commands or questions, or they could be complete sentences. Taxonomy terms are topics or names, represented only by nouns, noun phrases, or proper nouns. Examples of sitemap entries that are not good taxonomy terms may include:

How to use…
Get started with…
Help with…
Pay a bill
Shop for…
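
When reviewing a long sitemap as a source of candidate terms, entries like these can be flagged for rewording with a simple heuristic. The Python sketch below is only an assumption-laden illustration: the list of opening words is taken from the examples above, the function name needs_rewording is invented, and a real review would still need human judgment (for instance, “Help” can also be a noun).

```python
# Opening words drawn from the example entries above (illustrative, not exhaustive).
NON_TERM_OPENERS = {"how", "get", "help", "pay", "shop"}


def needs_rewording(label):
    """Flag labels that open with an activity/command word or are phrased as questions."""
    text = label.strip()
    words = text.split()
    if not words:
        return False
    return words[0].lower() in NON_TERM_OPENERS or text.endswith("?")


# Hypothetical sitemap labels mixing page-style titles and term-like labels.
labels = ["How to use…", "Pay a bill", "Billing", "Musical instruments"]
flagged = [label for label in labels if needs_rewording(label)]
print(flagged)
```

Flagged entries are not discarded. Their underlying topics may still be good candidates once restated as nouns or noun phrases.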

As with navigation, the entries of a sitemap reflect pages in a one-to-one relationship, in contrast to taxonomy terms, each of which may retrieve multiple pages or content sources, and each page or content item can be tagged with multiple taxonomy terms. As such, entries in a sitemap may actually be more specific than would be needed in a taxonomy.  The user’s selection of multiple taxonomy terms in combination, through filters/refinements, achieves the result of obtaining an appropriate list of relevant content.
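
The difference between one-to-one sitemap entries and many-to-many tagging can be sketched in a few lines of Python. The page paths, the tags, and the function name pages_matching are all hypothetical; the point is only that selecting several terms in combination narrows the results through set intersection.

```python
# Hypothetical tagging data: each page may carry multiple taxonomy terms.
tags_by_page = {
    "/support/pay-a-bill": {"Billing", "Customer support"},
    "/products/pianos": {"Pianos", "Musical instruments"},
    "/support/piano-delivery": {"Pianos", "Customer support"},
}


def pages_matching(selected_terms, tags_by_page):
    """Return pages tagged with every selected term, as filters/refinements would."""
    selected = set(selected_terms)
    return sorted(page for page, tags in tags_by_page.items() if selected <= tags)


one_term = pages_matching(["Pianos"], tags_by_page)
two_terms = pages_matching(["Pianos", "Customer support"], tags_by_page)
print(one_term)
print(two_terms)
```

A single term retrieves several pages, while two terms together retrieve the one page that a very specific sitemap entry would otherwise have to name directly.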

Conclusions

Sitemaps should not be used as taxonomies, but their topics (not their labels) may be considered a good source for a taxonomy. A sitemap might not even be suitable as a basis or starting point for a taxonomy, but it can serve as a source for developing taxonomy terms. It is recommended that a taxonomy be created separately from a sitemap, based on a review of content, search log data, and stakeholder and user interviews, with the sitemap as one more source to consider when developing taxonomy terms. The hierarchy of the sitemap should also not be followed too closely, although parts of its hierarchical structure may be taken into consideration when creating taxonomy relationships.