The Accidental Taxonomist

Friday, June 30, 2023

Taxonomies for Technical Documentation

Taxonomies are primarily for tagging content for what it is about so that precise content can easily be found by users, who browse or search on the taxonomy terms. The types of content tagged and implementations of taxonomies are numerous. One growing area of taxonomy use is technical documentation.

Technical documentation describes and explains the use or design of products or services. We refer to “documentation,” rather than “documents,” because the format can vary, including book-length manuals, multi-page PDF files such as white papers, content for printed product inserts or brochures, public website pages, and internal content management system pages. Technical documentation has existed for a long time. It used to be published only in print, especially as manuals, like books, so the tools of information findability were the table of contents and the index at the back of the manual. Now that technical documentation is most often consumed online and always managed digitally, an alphabetical browsable index is not practical to create, maintain, or use. Furthermore, indexes cannot serve multiple-use (multi-channel) content well.

Taxonomies for content tagging and retrieval

In contrast to creating an alphabetical index of terms referencing page numbers or linked to content sections, tagging content with a taxonomy has several benefits.

Taxonomies provide a better user experience than indexes. While an index requires the user to browse a long alphabetical list of terms until the desired term is found, the browsing of taxonomies does not require the user to already know the name of the desired term. Taxonomies that are arranged in hierarchical trees allow the user to drill down from broad categories to a specific topic. Taxonomies that are arranged as facets allow the user to select displayed terms (often listed by frequency of tagged usage) grouped by various facets (aspects) to limit the search results.

Facets for technical documentation could be:

User audience
Content type
Product (name or module)
Feature or function
Topic
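To make the idea of facets concrete, here is a minimal Python sketch of faceted filtering over tagged content. The facet names follow the list above; the sample records, their values, and the function names are hypothetical and not taken from any particular system.

```python
from collections import Counter

# Hypothetical tagged content components; facet names follow the list above.
DOCS = [
    {"id": "d1", "User audience": "Administrator", "Content type": "How-to",
     "Product": "Widget Server", "Topic": "Installation"},
    {"id": "d2", "User audience": "End user", "Content type": "How-to",
     "Product": "Widget App", "Topic": "Installation"},
    {"id": "d3", "User audience": "Administrator", "Content type": "Reference",
     "Product": "Widget Server", "Topic": "Security"},
]


def filter_by_facets(docs, selections):
    """Return docs whose tags match every selected facet value."""
    return [d for d in docs
            if all(d.get(facet) == value for facet, value in selections.items())]


def facet_counts(docs, facet):
    """Count how many docs carry each value of a facet, most frequent first."""
    return Counter(d[facet] for d in docs if facet in d).most_common()


results = filter_by_facets(DOCS, {"User audience": "Administrator"})
print([d["id"] for d in results])             # ids of administrator content
print(facet_counts(results, "Content type"))  # remaining values to refine by
```

Listing the remaining facet values with their counts after each selection is what lets a user narrow results step by step without knowing the exact term in advance.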

The process of tagging with a taxonomy or other controlled vocabulary is also simpler than creating an index. Creating a back-of-the-book index involves not only determining important concepts, but also giving them names as terms, determining subentries if any, and creating cross-references. Only trained indexers can do this well. Tagging with a taxonomy, especially if the taxonomy is already well-designed, is not so challenging. Since the terms and their synonyms or cross-references have already been established, it’s just a matter of looking up the term that describes the concept. Technical content now tends to be managed in component content management systems (CCMSs), so the unit of content to be tagged is already designated as a component. (See my April blog post.) Thus, content managers, editors, and writers can competently do tagging themselves. Tagging with a taxonomy can also be automated.

An index is tied to a specific document or collection. The same taxonomy, on the other hand, can be used not just for technical documentation but across the enterprise, such as for website and other marketing content, product information, and research and development. Consistent terms support more efficient and comprehensive information gathering, sharing, and analysis.

Taxonomies to serve technical documentation’s diverse users

Taxonomies are a useful information finding tool when content is being used by different kinds of users. The same, or parts of the same, technical documentation often have diverse users: product customers, prospective customers, technical support agents, consultant staff, product managers, engineers, etc.

Taxonomy concepts have synonyms or alternative labels to reflect the preferred wording of different groups of users. Matches to even these synonyms can be displayed after a search string is entered into a search box.
https://help.poolparty.biz documentation search on taxonomy concepts
The same taxonomy can be adapted to different user groups with different user interfaces. For example, an “advanced search” can expose more metadata, or an interface can display just a subset of a larger set of facets.
Taxonomy concepts can be managed with labels in multiple languages, supporting the tagging and retrieval of multilingual content for users of different languages.
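The role of alternative labels and multilingual labels in search can be sketched as follows. This is a minimal illustration, assuming a simple in-memory structure; the concepts, labels, and function names are hypothetical, and a real search front end would typically query a taxonomy management system instead.

```python
# Hypothetical concepts with a preferred label, alternative labels,
# and labels in more than one language.
CONCEPTS = {
    "c1": {"pref": {"en": "User account", "de": "Benutzerkonto"},
           "alt": {"en": ["login", "user profile"], "de": ["Nutzerkonto"]}},
    "c2": {"pref": {"en": "Installation", "de": "Installation"},
           "alt": {"en": ["setup", "deployment"], "de": ["Einrichtung"]}},
}


def build_label_index(concepts):
    """Map every lower-cased label (preferred or alternative) to its concept id."""
    index = {}
    for cid, concept in concepts.items():
        for label in concept["pref"].values():
            index[label.lower()] = cid
        for labels in concept["alt"].values():
            for label in labels:
                index[label.lower()] = cid
    return index


def suggest(index, typed):
    """Return ids of concepts having any label that starts with the typed text."""
    typed = typed.strip().lower()
    return sorted({cid for label, cid in index.items() if label.startswith(typed)})


INDEX = build_label_index(CONCEPTS)
print(suggest(INDEX, "set"))     # matches the alternative label "setup"
print(suggest(INDEX, "Nutzer"))  # matches a German alternative label
```

Because every label resolves to the same concept id, content tagged once with the concept is retrievable under any of its labels.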

Events on taxonomies in technical documentation

I have found increasing interest in taxonomies at technical documentation events. While I have been writing and speaking about taxonomies for a long time, in the past year I have been invited to talk about taxonomies at several events and programs more focused on technical documentation.

Recent past events focusing on technical documentation, at which I spoke, with recordings available:

“Indexes, Search, and Taxonomies: Paths to Findability” Society for Technical Communication webinar, June 2023 (recording available for purchase in late July)
“Taxonomy For Delivering Targeted Technical Content” BrightTALK webinar, April 2023
“From Document Search to Document Understanding” presented by Helmut Nagy, ConVEx, April 2023 (The recording of my presentation on knowledge hubs is only available for conference registrants.)

Upcoming presentations of mine focusing on taxonomies and technical documentation:

“Taxonomy Creation for Content Tagging” online workshop, Society for Technical Communication Tuesdays, July 18, July 25, and August 1, 4:00 – 5:30 EDT (Registration is still open.)
Taxonomy panel, ConVEx Ideas online conference, July 19, 12:00 – 1:30 pm EDT
“Leveraging Semantics to Provide Targeted Training Content: A Case Study” LavaCon content strategy conference, San Diego and hybrid online, October 16, 1:30 – 3:00 pm PDT

Monday, May 29, 2023

Taxonomies and ChatGPT

ChatGPT, generative AI, and large language models (LLMs) are hot topics of interest in the fields of data, information, and knowledge management. LLMs dominated the keynote presentations and the networking conversations at the Knowledge Graph Conference in New York and were also discussed in presentations and panels of this conference and Data Summit in Boston, both of which I attended this month. The technology is relevant to taxonomies as well.

ChatGPT is the user interface application on top of GPT (Generative Pre-Trained Transformer), a publicly available LLM developed by OpenAI, which is now in version 4. ChatGPT is thus a form of generative AI, in how it generates answers. There are many other LLMs (neural network-based AI, trained with deep learning on very large volumes of text), including those which are proprietary, restricted, or for non-commercial research, but only some have generative AI user interfaces. Although we may think of generative AI as providing answers to questions, it can do a lot more, including tasks related to taxonomies.

Organizing terms into hierarchies

Building a taxonomy is a combination of top-down design (identifying the top concepts or facets) and bottom-up building (identifying specific concepts from content analysis). The top level of a taxonomy is designed to serve user needs and thus should be based on stakeholder interviews, surveys, and brainstorming workshops, which is not something ChatGPT can do. The bottom-up building of a taxonomy, based on terms extracted from content or search log terms, may benefit from some AI involvement.

I have made a few test requests of ChatGPT for “Put the following list of terms into a hierarchical taxonomy…,” and the results are bulleted lists with indented narrower concepts. ChatGPT can also generate a taxonomy in machine-readable SKOS in a requested RDF serialization format, as Bob DuCharme explained in his May 20 blog post “Getting ChatGPT to turn a flat vocabulary list into a hierarchical taxonomy.”
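If the result comes back only as an indented bulleted list, it can also be converted to SKOS locally rather than by asking ChatGPT. The following is a minimal sketch of that alternative, assuming two-space indentation and unique labels; the sample list, the `ex:` namespace, and the function names are hypothetical. It is not the method described in the blog post cited above.

```python
import re

# Hypothetical bulleted output with indented narrower concepts.
BULLETS = """\
- Beverages
  - Coffee
    - Espresso
  - Tea
- Snacks
"""


def parse_indented(text):
    """Turn an indented bullet list into (label, parent_label_or_None) pairs."""
    pairs, stack = [], []  # stack holds (indent, label) of the current ancestors
    for line in text.splitlines():
        if not line.strip():
            continue
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        label = stripped.lstrip("-* ").strip()
        while stack and stack[-1][0] >= indent:
            stack.pop()  # leave only entries that are less indented
        parent = stack[-1][1] if stack else None
        pairs.append((label, parent))
        stack.append((indent, label))
    return pairs


def slug(label):
    """Make a simple local name from a label (assumes labels are unique)."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def to_turtle(pairs, base="http://example.org/taxonomy/"):
    """Emit SKOS concepts with prefLabel and broader statements as Turtle text."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
             f"@prefix ex: <{base}> .", ""]
    for label, parent in pairs:
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f"ex:{slug(label)} a skos:Concept ;")
        if parent is None:
            lines.append(f'    skos:prefLabel "{escaped}"@en .')
        else:
            lines.append(f'    skos:prefLabel "{escaped}"@en ;')
            lines.append(f"    skos:broader ex:{slug(parent)} .")
        lines.append("")
    return "\n".join(lines)


pairs = parse_indented(BULLETS)
print(to_turtle(pairs))
```

The output should still be loaded into taxonomy management software or an RDF validator before use, since this sketch does no checking beyond the indentation.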

Like card sorting exercises, you can specify the top categories/concepts (like a “closed card sort”), or you can let ChatGPT create the top categories (like an “open card sort”). In any case, results are better with context, of course, so you should also tell ChatGPT what the subject domain or context is. Asking for a hierarchical taxonomy results in a third level of hierarchy sometimes, and not just a single level of grouping. Near duplicates usually appear next to each other in the list, and the taxonomist can then decide if and how to merge them into a single concept.

It is particularly for long lists of terms that automated methods can save the taxonomist’s time. If a taxonomist comes up with terms based on manual content analysis, stakeholder interviews, or submitted lists from subject matter experts, the term lists tend not to be very long, and even the process of coming up with the terms tends to include some thoughts toward categorization at the same time. Longer term lists (such as several hundred terms) are derived from automated term extraction (using text analytics technologies) across a corpus of dozens or hundreds of documents and from search log reports. ChatGPT is practical for putting these long lists of terms into draft hierarchies. There are inevitably some taxonomic errors in the results, which should be obvious to any taxonomist. For example, I have seen duplicated terms on different levels of the hierarchy.
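With a draft hierarchy of several hundred terms, a quick automated check can point the taxonomist to duplicated terms. Below is a minimal sketch, assuming the draft has already been read into rows of depth and label; the sample data and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical draft hierarchy as (depth, label) rows, e.g. read from a bulleted list.
DRAFT = [
    (0, "Energy"),
    (1, "Renewable energy"),
    (2, "Solar power"),
    (1, "Solar power"),
    (0, "Emissions"),
    (1, "Carbon emissions"),
    (1, "carbon emissions "),
]


def find_duplicates(rows):
    """Group rows by normalized label and return labels that appear more than once."""
    seen = defaultdict(list)
    for position, (depth, label) in enumerate(rows):
        key = " ".join(label.lower().split())  # ignore case and extra spaces
        seen[key].append((position, depth))
    return {key: hits for key, hits in seen.items() if len(hits) > 1}


for label, hits in find_duplicates(DRAFT).items():
    depths = sorted({depth for _, depth in hits})
    print(f"{label!r} appears {len(hits)} times at depth(s) {depths}")
```

The check only flags candidates. Deciding which occurrence to keep, or whether the two are really distinct concepts, remains the taxonomist's call.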

In both lists of extracted terms and search log lists, terms occur that are not suitable as concepts for a taxonomy, such as verbs and adjectives or vague words. ChatGPT understands grammatical rules, so my prompt also says “Include in the taxonomy only nouns and noun phrases and omit the other terms.”
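The prompt ingredients discussed in this section (the term list, the subject domain, optional top categories, and the noun-phrase restriction) can be assembled programmatically when the term list is long. This is only a sketch of building the prompt text; the wording beyond the two quoted instructions, the sample terms, and the function name are assumptions, and no LLM API is called.

```python
def build_taxonomy_prompt(terms, domain, top_categories=None):
    """Assemble a prompt asking an LLM to draft a hierarchy from a flat term list."""
    parts = [
        "Put the following list of terms into a hierarchical taxonomy.",
        f"The subject domain is: {domain}.",
    ]
    if top_categories:  # like a "closed card sort": the top level is fixed
        parts.append("Use only these top-level categories: "
                     + ", ".join(top_categories) + ".")
    else:               # like an "open card sort": the model proposes the top level
        parts.append("Create suitable top-level categories.")
    parts.append("Include in the taxonomy only nouns and noun phrases "
                 "and omit the other terms.")
    parts.append("Terms:")
    parts.extend(f"- {term}" for term in terms)
    return "\n".join(parts)


prompt = build_taxonomy_prompt(
    ["solar panels", "install", "wind turbines", "efficient"],
    domain="renewable energy",
    top_categories=["Technologies", "Processes"],
)
print(prompt)
```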

Generating alternative labels (“synonyms”) for concepts

Asking ChatGPT to “provide a list of synonyms for…” a given term can also be helpful for coming up with alternative labels for taxonomy concepts. Alternative labels should be customized for the context of the content and users, so alternative labels for a concept will vary from one taxonomy to another. An external source, such as ChatGPT, should not be relied upon as the only source for alternative labels, but merely as a supplemental source of suggestions to be considered.

Again, context can help and should be provided. I asked “Provide a list of synonyms for ‘healthcare’” and got 20 terms. But then when I asked “Provide a list of synonyms for health care, meaning the industry,” I received a slightly more focused list of 15 terms. Interestingly, the two-word variant “health care” was not on the list, so “synonyms” is understood by ChatGPT to mean different words with the same meaning and not orthographic variations. Nevertheless, even 15 terms are too many, and the taxonomist should select from the list of suggestions. It might be a good idea to then test search the suggested alternative labels in the content and system being used.
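One simple way to test suggested alternative labels is to count how often each appears in a sample of the content. The sketch below assumes plain-text samples held in memory; the suggested labels, the sample sentences, and the function name are hypothetical, and a real test would search the actual system.

```python
import re

# Hypothetical suggestions and content; real use would query the target system.
SUGGESTED = ["medical care", "health services", "health care", "medicare"]
CONTENT = [
    "Health care providers use the portal to submit claims.",
    "Our health services division expanded in 2022.",
    "Access to health-care data is restricted.",
]


def label_hits(labels, texts):
    """Count texts containing each label as a whole phrase, ignoring case
    and treating spaces and hyphens between words as equivalent."""
    counts = {}
    for label in labels:
        words = [re.escape(w) for w in label.split()]
        pattern = re.compile(r"\b" + r"[\s-]+".join(words) + r"\b", re.IGNORECASE)
        counts[label] = sum(1 for text in texts if pattern.search(text))
    return counts


hits = label_hits(SUGGESTED, CONTENT)
keep = [label for label, n in hits.items() if n > 0]
print(hits)
print(keep)
```

Labels that never occur in the content are candidates to drop, although a label may still be worth keeping if users are known to search on it.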

Although by strict definition a “synonym” is a single word with the same meaning as another word, ChatGPT provides acceptable synonyms for terms which are multi-word phrases, or synonymous multi-word phrases, such as “Chemical manufacturing and distribution” provided as a synonym for “chemical industry.”

Other taxonomy-related uses of ChatGPT

Getting help in designing an ontology (a more complex, yet high-level semantic model with defined classes of concepts, customized relationships, and attributes) is also possible with ChatGPT or other LLMs. Again, submitting the request multiple times with slight variations will yield multiple different responses for the ontologist to consider and select ideas from. Ontologies are not expressed in simple text, though, so the prompt should specify the output format, such as RDF Turtle (TTL). Dean Allemang, author of Semantic Web for the Working Ontologist, has written multiple articles (medium.com/@dallemang) recently on ChatGPT and ontologies/knowledge graphs.

ChatGPT can also be used for comparing lists of terms, data conversion, and basic coding, which may be useful for taxonomists who lack coding skills. It can convert taxonomy or ontology data from one data format to another (although taxonomy/ontology management software also imports/exports in multiple formats). Taxonomies and ontologies in their raw data format are most commonly expressed in the RDF (Resource Description Framework) data model, which has various serialization formats: RDF/XML, JSON, JSON-LD, .ttl (Turtle), etc., and ChatGPT can convert data from one to another. Data extraction can also be done with ChatGPT. For example, knowledge management professional Camille Mathieu recently shared in a LinkedIn post how she used ChatGPT to write a Python script to extract text & metadata from PDFs.
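To give a sense of what such a data conversion involves, here is a minimal sketch that turns a flat spreadsheet-style export into JSON-LD using SKOS terms. It uses only the Python standard library; the column names, sample rows, and base URI are hypothetical. Note that CSV is not itself an RDF serialization, and converting between true RDF serializations would normally be done with an RDF library or taxonomy management software.

```python
import csv
import io
import json

# Hypothetical flat export: one row per concept with its broader concept (if any).
CSV_TEXT = """id,label,broader
energy,Energy,
solar,Solar power,energy
wind,Wind power,energy
"""


def csv_to_jsonld(text, base="http://example.org/taxonomy/"):
    """Convert concept rows into a JSON-LD document that uses SKOS terms."""
    graph = []
    for row in csv.DictReader(io.StringIO(text)):
        node = {
            "@id": base + row["id"],
            "@type": "skos:Concept",
            "skos:prefLabel": {"@value": row["label"], "@language": "en"},
        }
        if row["broader"]:
            node["skos:broader"] = {"@id": base + row["broader"]}
        graph.append(node)
    return {
        "@context": {"skos": "http://www.w3.org/2004/02/skos/core#"},
        "@graph": graph,
    }


doc = csv_to_jsonld(CSV_TEXT)
print(json.dumps(doc, indent=2))
```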

Perhaps what is most intriguing as a future implementation of taxonomies and ChatGPT is to go in the other direction and have knowledge organization systems, such as taxonomies, support the creation and use of queries (also called “prompts”) for generative AI, to obtain better results. This requires some back-end development, though, and is not merely a matter of putting a taxonomy into a prompt. Since a taxonomy is created for a specific subject domain, the questions need to be confined to the domain of the taxonomy. Semantic Web Company has developed a simple publicly accessible demo “PoolParty Meets Chat GPT,” whereby you can compare the results of questions you ask in the subject area of ESG (Environmental, Social, and Governance) that are submitted directly to ChatGPT with those which are filtered through an ESG taxonomy and knowledge graph (managed in PoolParty software) so that the questions are enriched before being sent to ChatGPT. The semantically enriched questions generate answers that have more detail, better accuracy, and even web links to definitions and other articles.
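The general idea of enriching a question before it is sent to an LLM can be illustrated with a deliberately naive sketch. It is not how the demo mentioned above works; the taxonomy fragment, definitions, and function name are hypothetical, and simple string matching falls well short of the back-end development a real implementation requires.

```python
# Hypothetical ESG taxonomy fragment: preferred label -> definition and broader concept.
TAXONOMY = {
    "carbon footprint": {
        "definition": "Total greenhouse gas emissions caused by an entity.",
        "broader": "emissions",
    },
    "board diversity": {
        "definition": "Variety of backgrounds among a company's directors.",
        "broader": "corporate governance",
    },
}


def enrich_question(question, taxonomy):
    """Append definitions of taxonomy concepts mentioned in the question."""
    lowered = question.lower()
    matched = [label for label in taxonomy if label in lowered]
    if not matched:
        return question  # nothing recognized, send the question unchanged
    context = [
        f"- {label}: {taxonomy[label]['definition']} "
        f"(broader concept: {taxonomy[label]['broader']})"
        for label in matched
    ]
    return question + "\n\nUse these definitions as context:\n" + "\n".join(context)


enriched = enrich_question("How can a retailer reduce its carbon footprint?", TAXONOMY)
print(enriched)
```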

Conclusions

While it’s arguable whether ChatGPT alone is a good way to obtain “facts,” there is no doubt that it is a good way to get suggestions and ideas. These suggestions can support the work of taxonomists and ontologists, and taxonomies and ontologies in turn can support the results of ChatGPT and other LLMs. Because there will be errors from ChatGPT, it should not be used to generate taxonomies by those who are not already knowledgeable about taxonomy requirements and best practices, nor should it be used as a substitute for the expertise of taxonomists.

I hope to experiment more with ChatGPT for taxonomies and share additional details in future blog posts.

Sunday, April 30, 2023

Taxonomies for Content Components

The primary purpose of taxonomies is to support consistent topical tagging (indexing) of content and full and accurate content retrieval based on the tagged taxonomy concepts that the end-user selects. The unit of content that is tagged makes a difference in the retrieval results and user experience. Users want to find specific content, such as a paragraph, a captioned image, or a timestamped section within an audio or video file. This is not always possible. The traditional method of tagging is to tag the entire file, document, or web page, even if the specific topic with the desired information is only part of the larger file, such as a few sentences within a web page or document of multiple paragraphs. The user then spends time (or wastes time) trying to find the desired information in the larger file.

Content components

Fortunately, there are methods to tag and retrieve content at smaller units, such as a text section identified with a heading, within a longer document. These methods depend on having “structured” content, where sections are marked off using a markup language, most commonly Extensible Markup Language (XML). As XML is rather generic, there have emerged standards specifically for XML-based component-based content management, including DITA (Darwin Information Typing Architecture).
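A small sketch may help show why marked-up sections make component-level tagging and retrieval possible. The XML below is a simplified, hypothetical structure that only loosely resembles DITA topics and is not validated against any standard; the element names, ids, and function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical XML; real DITA topics have a richer, validated structure.
XML_TEXT = """
<manual>
  <topic id="install-server">
    <title>Installing the server</title>
    <keywords><keyword>Installation</keyword><keyword>Server</keyword></keywords>
    <body>Run the installer and follow the prompts.</body>
  </topic>
  <topic id="reset-password">
    <title>Resetting a password</title>
    <keywords><keyword>User account</keyword></keywords>
    <body>Open Settings and choose Reset password.</body>
  </topic>
</manual>
"""


def index_components(xml_text):
    """Map each tagged concept to the ids of the components that carry it."""
    index = {}
    root = ET.fromstring(xml_text)
    for topic in root.iter("topic"):
        for keyword in topic.iter("keyword"):
            index.setdefault(keyword.text.strip(), []).append(topic.get("id"))
    return index


component_index = index_components(XML_TEXT)
print(component_index.get("Installation"))  # components tagged with this concept
```

Because each section has its own id and its own tags, a search on a concept can return the specific component rather than the whole manual.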

www.dita-ot.org

Structuring content was not originally developed for the purpose of detailed topical tagging/indexing and retrieval, though, but rather for the purpose of creating (authoring) and publishing content, especially to the web, more efficiently. Originally, the focus of structured content was on marking up the document style and supporting keyword tags for the entire document. The first content management systems (CMSs) were developed shortly after the web in the 1990s to facilitate the publishing of web pages, although later a distinction emerged between web content management systems and enterprise content management systems.

By the early 2000s, component content management systems (CCMSs) emerged, whereby content is managed in units (components) smaller and more specific than an entire document. CCMSs enable content publishing to be more modular and flexible, supporting content reuse, and making it easier to update content, by updating only the relevant components, instead of the entire document. CCMSs are especially used for creating technical documentation, but they are not limited to that use. Examples of CCMSs include Adobe FrameMaker, Documentum, Heretto, Kontent.ai, Quark, Paligo, Sanity, and Tridion Docs. While more precise tagging was not the original goal of CCMSs, it is a beneficial outcome.

Taxonomies and component content management

CCMSs, along with all CMSs, have come to support taxonomies and tagging better over the years. This includes both support for more taxonomy features, such as hierarchies and synonyms (alternative labels), and support for importing and exporting taxonomies in standard interoperable formats. With respect to CCMSs, taxonomies can be built out to a greater level of detail, with concepts specific to the component topics of a CCMS. However, whoever is creating the taxonomy should remember not to create concepts that are so specific that a concept is applicable to only a single component topic. A single taxonomy concept should retrieve multiple results.
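The guideline that a concept should retrieve multiple results can be checked against existing tagging data. The following is a minimal sketch, assuming the tagging can be exported as a mapping of component ids to concepts; the sample data, the threshold, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical tagging: component id -> taxonomy concepts applied to it.
TAGS = {
    "install-server": ["Installation", "Server"],
    "install-client": ["Installation", "Client"],
    "upgrade-server": ["Upgrades", "Server"],
    "reset-password": ["User accounts"],
}


def concept_usage(tags):
    """Count how many components each concept is applied to."""
    return Counter(concept for concepts in tags.values() for concept in set(concepts))


def too_specific(tags, minimum=2):
    """List concepts applied to fewer than `minimum` components."""
    usage = concept_usage(tags)
    return sorted(c for c, n in usage.items() if n < minimum)


print(too_specific(TAGS))
```

In a small or new content set, low counts may simply mean that more content is still to come, so the list is a prompt for review rather than a rule for deletion.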

CCMSs, along with all CMSs, can also connect to or integrate with taxonomies managed in dedicated taxonomy management systems, such as PoolParty. Since organizations tend to have multiple CMSs, each for different kinds of content and purposes, they are likely to end up creating multiple, separate (siloed) taxonomies with similar or overlapping concepts. Therefore, the best strategy for enterprise taxonomy management is to manage taxonomies centrally, either as a single master taxonomy or with multiple taxonomies linked together in dedicated taxonomy management software, which can connect to CMSs with APIs (application programming interfaces) to push the taxonomy out to the CMSs, including CCMSs. Additionally, prebuilt integrations of taxonomy management systems and CCMSs, such as PoolParty and Tridion Docs, are becoming more common.

There is also a growing interest in taxonomies at conferences dealing with component content management. Last October I attended the LavaCon conference for content strategy for the first time, where my pre-conference workshop on taxonomies was well attended. Two weeks ago, I participated in the ConVEx conference, where there is more focus on component content management than at LavaCon. (ConVEx was formerly the DITA North America conference.) In contrast to LavaCon’s two presentations on taxonomies, ConVEx had a track with the “taxonomy” theme and five presentations focused on taxonomies and another three presentations with topics related to taxonomies.

Component content management enables more targeted topic tagging and opens up more possibilities for rich taxonomies. Thus, as a taxonomist, I look forward to learning more about CCMSs and how taxonomies can best be applied in these systems.

Friday, March 31, 2023

Taxonomy and Information Architecture Compared

There is considerable overlap between the fields of information taxonomies and information architecture. Both involve information organization, labeling, search, and findability. In some organizations the job roles and titles are combined. I previously blogged on “Information Architecture and Taxonomies,” observing that “information architecture” in name seemed to be declining while aspects of its practice continued to be strong, since it was an underlying theme in several of the talks at the major taxonomy conference, Taxonomy Boot Camp, in 2013.

Information Architecture Conference opening. Photo Marisela Meskus

This week, for the first time, I am attending in person the Information Architecture Conference, being held in New Orleans March 28 - April 1, so it’s been interesting to hear how information architects consider taxonomies.

How Information Architecture and Taxonomy Overlap

The fields of information architecture and taxonomy are related beyond the stated shared practices of information organization, labeling, search, and findability.

When I give an introduction to taxonomies, I explain that a taxonomy is an intermediary between users and content to connect users to content by means of terms that the users understand and by the display of the terms in hierarchies, facet-filters, or type-ahead suggestions, which enable users to explore and interact with the taxonomy. This is clearly an aspect of information architecture.

In my own career path, I discovered taxonomy and information architecture at the same time. I had been working as a “controlled vocabulary editor” and had the opportunity to work on an interdisciplinary team for a newly designed information product. The user interface for a school library research database included a hierarchical taxonomy that was designed to fit with that particular user interface.

At the Information Architecture Conference, I asked for a raise of hands of my session audience of how many had worked with taxonomies, and it seemed to be over 80%. At the conference, I met information architects who specialized in taxonomies, and taxonomists who had an interest in, and had done some work in, information architecture. Even though I identify as a taxonomist, I already knew a number of speakers at the Information Architecture Conference due to the overlapping communities.

How Information Architecture and Taxonomy Differ

Information architecture is a discipline and a profession that is larger and more established than that of taxonomies. Although taxonomy work is growing, there are still more college courses on information architecture than on taxonomies, more books on information architecture than on taxonomies, and more people with “information architect” than “taxonomist” as a job title (based on LinkedIn searches).

Listening to sessions at the Information Architecture Conference and having discussions with participants, I began to see a clearer picture on how the fields of information architecture and taxonomies differ.

The Information Architecture Conference brings together a community of professionals who share ideas and experiences. There is no comparable taxonomist community, as taxonomy work, compared to information architecture work, tends to be done by those with different professional backgrounds: information architects, librarians, content managers, metadata architects, indexers, ontologists, etc. It’s telling that there is not just one conference at which I present about taxonomies but multiple. (Knowledge management, content strategy, knowledge graphs, and data science are the fields of conferences at which I have spoken about taxonomies in the past year.) The only conference about taxonomies, Taxonomy Boot Camp, is more of a specialized track within the KM World conference, and aims to provide taxonomy best practices and case studies to managers and directors of content, product, or knowledge management. It is not really a forum for taxonomists to discuss topics of their profession, as the Information Architecture Conference is.

It seems that information architecture is more of a discipline and a field, whereas taxonomy is more of a tool or system (although a very important one). In addition to information architects in organizations in various industries and consultants, the Information Architecture Conference includes professors and students in the field. By contrast, taxonomy is not a field of study, research, or focus in academia. It is a focus area only in industry and consulting. Information architecture seems to allow more room for theory than does the taxonomy field.

How Information Architecture and Taxonomy Are Related

From a "taxonomic" perspective, which is broader? For information architects, taxonomy is narrower than information architecture. There is no doubt that information architecture is broader in various ways, including content/information organization, design, user experience, and even organization of non-digital information spaces. For example, information architects are concerned not only with taxonomies to support searching and browsing for information, but also with content organization and navigation menu structuring in websites and in software user interfaces.

Taxonomists, on the other hand, do not consider taxonomies as a sub-field of information architecture, but rather consider the two fields as adjacent and closely related. This is because the taxonomies that information architects create tend to be small, such as term lists for metadata properties or facets or as hierarchies to model menu navigation or site maps. Professional taxonomists tend to work on large dynamic taxonomies or thesauri that are used to tag/index and retrieve content or data in one or more systems, often where the user interface is already prescribed.

The related fields or disciplines are also different. Information architecture has a closer relationship with fields of design, user experience, sociology, and psychology. Taxonomy has a closer relationship with indexing/tagging, natural language processing, ontologies, Semantic Web technologies, and knowledge management. One related field shared by both information architecture and taxonomy is structured content, which was also a subject of presentations at this year's Information Architecture conference and the field of my next conference.