Tuesday, August 31, 2021

Knowledge Engineering and Taxonomies

My next conference workshop (at SEMANTiCS September 7) on taxonomies and ontologies has in its title “knowledge engineering.” I figured this may resonate more with the audience of computer scientists, data scientists, and semantic technology and AI experts. People come (often accidentally) to the field of designing taxonomies, ontologies, and knowledge organization systems in general from different backgrounds, and may work in different disciplines or departments. They may have very different training, job titles, and job descriptions.

Also, my job title now, at Semantic Web Company, is knowledge engineer, although there is not much agreement on what that job title means. I had once before, over 10 years ago, applied for a position with the job title knowledge engineer, and the role focused on writing rules for rules-based auto-classification. This involves using a taxonomy with logic rules and regular expressions for each taxonomy term to support automated indexing, rather than using training sets and machine learning. My current job, however, involves designing taxonomies and ontologies, often in combination.
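As an illustration of rules-based auto-classification, here is a minimal Python sketch in which each taxonomy term has one or more regular expressions, and a term is assigned when any of its patterns matches the text. The terms, patterns, and function name are hypothetical examples, not the rules of any real system, and real rule sets usually add logic operators (AND, NOT, proximity) that are left out here.

```python
import re

# Hypothetical rules: each taxonomy term maps to one or more regular expressions.
RULES = {
    "Mergers and acquisitions": [
        r"\bmerg(?:er|ers|ed|ing)\b",
        r"\bacqui(?:re[sd]?|sitions?)\b",
    ],
    "Interest rates": [
        r"\binterest rates?\b",
        r"\brate (?:hike|cut)s?\b",
    ],
}


def classify(text, rules=RULES):
    """Return the sorted taxonomy terms that have at least one matching pattern."""
    matched = []
    for term, patterns in rules.items():
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            matched.append(term)
    return sorted(matched)


tags = classify("The central bank announced a rate cut after the merger was approved.")
no_tags = classify("Quarterly earnings rose.")
```

No training set is involved: the quality of the tagging depends entirely on how well the rules for each term are written and maintained.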

Creating a taxonomy or thesaurus alone is not knowledge engineering. This is because a taxonomy does not describe all aspects of a knowledge domain, just the concepts and their hierarchical relationships, or in the case of a thesaurus, some additional nonspecific related (See also) relationships. Furthermore, there already exist published guidelines/standards for taxonomies and thesauri, such as ANSI/NISO Z39.19 and ISO 25964-1, that specify best practices for design.

It makes more sense to call ontology design a form of knowledge engineering. Ontologies have a much higher level of semantics or expressiveness, which needs to be defined by the ontologist or knowledge engineer. There are customized, semantic relationships (such as “is located in” and “contains”), which are to be applied between designated classes (such as organizations and places), and any number of customized attributes (such as address or latitude/longitude) that can be specified for a class. Standards for ontologies, such as OWL, which are from the World Wide Web Consortium (W3C), are only for machine readability and interoperability, but not for best practices, so there is more room for interpretation and innovation when it comes to designing an ontology than there is for a taxonomy or thesaurus.
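To make the difference from a plain hierarchy concrete, here is a minimal Python sketch of an ontology-like model in which a custom relationship may only link instances of its designated classes. All class, attribute, and instance names are hypothetical, and this is not OWL or any ontology tool's data model, only an illustration of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    name: str
    domain_class: str  # class allowed as the subject of the relationship
    range_class: str   # class allowed as the object of the relationship


@dataclass
class Ontology:
    classes: dict = field(default_factory=dict)    # class name -> set of attribute names
    relations: dict = field(default_factory=dict)  # relation name -> Relation
    instances: dict = field(default_factory=dict)  # instance name -> class name
    facts: list = field(default_factory=list)      # (subject, relation, object) statements

    def assert_fact(self, subject, relation, obj):
        """Record a statement only if it links the designated classes."""
        rel = self.relations[relation]
        if (self.instances[subject] != rel.domain_class
                or self.instances[obj] != rel.range_class):
            raise ValueError(
                f"{relation!r} must link {rel.domain_class} to {rel.range_class}"
            )
        self.facts.append((subject, relation, obj))


onto = Ontology()
onto.classes = {"Organization": {"address"}, "Place": {"latitude", "longitude"}}
onto.relations["is located in"] = Relation("is located in", "Organization", "Place")
onto.instances = {"Acme Corp": "Organization", "Springfield": "Place"}
onto.assert_fact("Acme Corp", "is located in", "Springfield")
```

Deciding which classes, relationships, and attributes exist is the design work that a standard such as OWL leaves to the ontologist.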

Knowledge engineering may involve more than designing an ontology; it may include all the various kinds of controlled vocabularies for the content and data of an organization. This includes determining what kinds of vocabularies are needed and how they are related to each other.

Knowledge engineering is also very similar to knowledge modeling, which I blogged about before in the post "Knowledge Modeling." Knowledge engineering is a more general function, whereas knowledge modeling is a more specific activity.

Knowledge engineering also goes beyond taxonomy/ontology design and creation to include the follow-through application, which is namely the management of tagging or classification of content with the taxonomy. This is, after all, how a useful knowledge base is created, with content tagged and available for retrieval. Definitions of knowledge engineering sometimes refer to it as a field within artificial intelligence (AI) to build knowledge bases. While I might not agree that this is always part of the definition of knowledge engineering, AI is used for automated tagging of content with a taxonomy. 


It's probably better to define knowledge engineering more broadly as methods to support the development and transmission of knowledge, specifically by transforming data to information and information to knowledge, as the frequently depicted pyramid on the right suggests. This transformation is done by designing and creating links between data, which is supported by taxonomies and ontologies.


Saturday, July 31, 2021

Taxonomies and Sitemaps

I was recently asked if a company website’s sitemap could serve as the start of a taxonomy for an organization. The sitemap, after all, includes all the relevant topics pertaining to an organization’s business offerings, and they are arranged in a hierarchy. I have previously blogged on the subject of why a website’s navigation is not a taxonomy in Navigation Schemes and Taxonomies. A sitemap is similar to a website’s navigation, but it goes deeper by including the titles or topics of web pages which are not included in the website’s menu, and it is not necessarily intended for user browsing. A sitemap may go five or six levels deep, whereas website navigation menus are usually only two levels deep. Therefore, a sitemap may seem as if it’s a taxonomy. However, just because a sitemap is as large and detailed as a taxonomy needs to be does not make it suitable as a taxonomy.

Different purposes

We need to understand what a taxonomy is for. It’s to aid users in locating desired content by topic-terms, which reflect both the terminology use of the users and of the content. Taxonomy terms are tagged/indexed to content that is relevant to the term. The starting point when creating a taxonomy is to identify the topics of the content and identify the topics of user interest or search, and then merge those topics into a taxonomy by bringing together different names for the same concept. The concepts are then structurally arranged to show the relationships between the terms, especially hierarchical relationships. The primary purpose of the hierarchy of terms in a taxonomy is to aid the users in finding the appropriate term. When browsing the taxonomy, they may find a broader term or narrower term that better describes their search goals. Then they can select that term to retrieve content that was tagged with the term.  

A sitemap, on the other hand, lists all or most pages of a website, usually by page title and organized in the hierarchical structure of the website. The hierarchical structure of the website was designed to organize information in a logical manner for users to browse and explore, as considered by the information architect who designed the website. The sitemap thus reflects pages, which are often topics but not always. A page may have multiple topics of interest that a user might want to look up. A page is sometimes for performing a function or activity and not necessarily just a topic of information.

A sitemap is typically automatically generated from the page titles, and its primary purpose is not for users but for machines: sitemaps tell search engines about pages that are available for crawling on websites and can thus support search engine optimization (SEO). Sitemaps are also useful in planning the further development or organizational improvement of a website. Whether a sitemap should even be displayed to end users as a tool to find information on a website is questionable. If automatically generated, it's not designed for that purpose, but users could find it helpful, especially users who understand that it is merely the aggregation of page titles organized in the file structure of the website. Some websites make it available, and some do not. Some websites instead display a simplified sitemap that is designed to be a guide for users, but then it does not include all pages.

Different labels

The title names of pages and thus of sitemap entries often do not correspond to taxonomy terms. They could start out with a verb for an activity, they could be commands or questions, or they could be complete sentences. Taxonomy terms are topics or names represented only by nouns, noun phrases, or proper nouns. Examples of sitemap entries that are not good taxonomy terms may include:

How to use…
Get started with…
Help with…
Pay a bill
Shop for…

As with navigation, the entries of a sitemap reflect pages in a one-to-one relationship, in contrast to taxonomy terms, each of which may retrieve multiple pages or content sources, and each page or content item can be tagged with multiple taxonomy terms. As such, entries in a sitemap may actually be more specific than would be needed in a taxonomy.  The user’s selection of multiple taxonomy terms in combination, through filters/refinements, achieves the result of obtaining an appropriate list of relevant content.
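The many-to-many relationship between taxonomy terms and content can be sketched in a few lines of Python. The page identifiers and terms below are hypothetical, and the filter logic (every selected term must be present) is one common assumption about how filters/refinements combine, not a description of any particular search tool.

```python
# Hypothetical tagging data: each content item carries several taxonomy terms.
TAGGED_CONTENT = {
    "page-1": {"Billing", "Small business"},
    "page-2": {"Billing", "Residential"},
    "page-3": {"Outages", "Residential"},
}


def retrieve(selected_terms, tagged=TAGGED_CONTENT):
    """Return the items tagged with every selected term (AND across filters)."""
    wanted = set(selected_terms)
    return sorted(item for item, terms in tagged.items() if wanted <= terms)


billing_pages = retrieve(["Billing"])
residential_billing = retrieve(["Billing", "Residential"])
```

A single broad term retrieves several pages, and combining two broad terms narrows the result to one, which is why taxonomy terms do not need to be as specific as individual sitemap entries.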

Conclusions

Sitemaps should not be used as taxonomies, but their topics (not their labels) may be considered a good source of taxonomy terms. A sitemap might not even be suitable as a basis or starting point for a taxonomy. Rather, it is recommended that a taxonomy be created separately from a sitemap, based on a review of content, search log data, and stakeholder and user interviews, with the sitemap as one more source to consider when developing taxonomy terms. The hierarchy of the sitemap should also not be too closely followed, although parts of its hierarchical structure may be taken into consideration for creating taxonomy relationships.

Wednesday, June 30, 2021

Taxonomy Management

As taxonomies become more common for information management and retrieval in all kinds of organizations and in various applications, the task of creating new taxonomies from scratch is less needed than the task of managing existing taxonomies. What is required for taxonomy management, however, might not be completely clear. I’ve written several posts on this blog which I tagged with the topic “Taxonomy maintenance,” but none tagged with “Taxonomy management.” That needs to be corrected. Taxonomy maintenance is part of the larger responsibility of taxonomy management.

Taxonomy management includes the following:

Taxonomy in PoolParty software

Taxonomy maintenance: adding concepts, merging concepts, editing select labels, adding alternative labels, adding relationships, etc. on an individual concept basis, to keep the taxonomy up to date, as new content and new concepts are introduced and terminology changes. These changes may arise from suggestions from those doing tagging, proactive review of new content and new trends, periodic review of search logs, and periodic text analytics of content. This is an ongoing task that can be done by one or more taxonomy editors, including those who are subject matter experts. In such cases, the taxonomy-editing work of non-taxonomists should be reviewed by a taxonomist.

Taxonomy governance: developing taxonomy maintenance policies and documentation. This comprises documenting the taxonomy type, features, purpose, ownership, use, etc., and documenting how the taxonomy should be updated to keep its style consistent, including the criteria for adding new concepts to the taxonomy. Taxonomies should be documented when they are created, but sometimes they are not and need to be. Documentation may need to be updated from time to time.

Taxonomy tagging management: developing and updating tagging rules or policies, ensuring tagging quality (comprehensiveness and correctness), and updating or improving the taxonomy if tagging issues indicate it. Tagging can be manual, automated, or automated with human review. Periodic review of the tagging is a necessary task. Even when managing tagging is another individual’s responsibility, managing taxonomies is not completely separate from managing tagging, and this is an ongoing responsibility of the taxonomist who manages the taxonomy.

Taxonomy integration with end-user applications: including websites and web content management systems (CMSs), enterprise content management systems, digital asset management systems, search software, and other custom applications such as recommendation, personalization, and question answering. A taxonomy may be managed within an application, such as a specific CMS or SharePoint, but then it is usable only for that single application. As organizations increase the number of their information management systems, it eventually becomes clear that separate siloed taxonomies are not a good idea, and a single taxonomy should be centrally managed and ported or synced with the taxonomy management components of each tool. Taxonomy application integration involves both technical aspects, such as integrations with APIs, and nontechnical aspects related to user experience, such as considering how the taxonomy displays to the end-users and how they interact with it. Often, an existing taxonomy needs to be adapted to a new application.

Taxonomy review and revision: reviewing a taxonomy for quality standards and against best practices guidelines and checklists, and making general widespread improvements, such as: ensuring that concepts and their labels are clear and unambiguous and that concepts are sufficiently distinct in their meaning, adding sufficient alternative labels (synonyms), ensuring that hierarchical relationships always follow the standards, adding polyhierarchy and associative relationships, changing the capitalization and plural style, and ensuring that the hierarchy is not too detailed and deep in some areas. This task is undertaken by a taxonomist or taxonomy consultant only occasionally, especially if the taxonomy will undergo an extension or will be migrated to a new system.

Taxonomy extension: merging redundant taxonomies, integrating complementary taxonomies, mapping/linking taxonomies or other vocabularies in the same domain to extend their use, or translating taxonomies to add additional languages. This could include merging or linking a taxonomy and a glossary or terminology, or linking the custom taxonomy to an industry standard classification scheme that is familiar to users. Taxonomy extension could also involve adding semantics of an ontology model with custom relationships and attributes. This task is also undertaken by a taxonomist or taxonomy consultant only occasionally.

The inclusion of all of these tasks of taxonomy management requires a dedicated taxonomy/thesaurus management tool, as spreadsheets are insufficient, and the taxonomy editing module of a single application not only tends to lack certain taxonomy management features but will not serve the needs of enterprise-wide taxonomy management.

I will discuss this all in more detail in an upcoming PoolParty webinar “Taxonomy Management 101” on August 4.

Sunday, May 30, 2021

Taxonomy Design Research


I recently wrote an article “Taxonomies: Connecting Users to Content” for an online publication, Boxes and Arrows, on information architecture (IA) and user experience (UX). As I was working with the editors on the section of gathering information from users, I realized that IA and UX have very formalized researcher roles. There is a job title for “UX Researcher” with career guides and resources on what skills are needed, and many more jobs on job board sites posted for “UX researcher” than for “taxonomist.” Meanwhile, there is no such job as a “taxonomy researcher.” But designing and developing taxonomies, which are often part of information architecture or UX, does require research, including user research.

Taxonomy research is not as formalized and does not involve standard tools, as UX research does, but it is still important. There is not nearly as much published about taxonomy research as there is for UX research. However, certain research practices, I have found, are common in the taxonomy consulting industry. It’s a matter of best practices. Even when taxonomies are designed internally and not with an external taxonomy consultant’s assistance, research is still part of the process. The type of research may vary based on the background and experience of the person leading the effort.

Taxonomy design research includes:
  • Interviewing sample users and other stakeholders
  • Gathering input from brainstorming sessions
  • Analyzing content to be tagged
  • Analyzing existing vocabularies of all kinds
  • Analyzing any search log reports
  • Taxonomy testing

While UX research is a form of user research, taxonomy research involves both user research and content research (or content analysis), because a taxonomy needs to consider both user needs and content suitability.

Interviewing stakeholders

The primary method of gaining user input on a taxonomy is through interviews and questionnaires, ideally both in combination, where a conversation follows up on a list of questions sent to the person being interviewed. It’s important to ask different kinds of questions tailored to the different kinds of users: questions dealing with tagging vs. questions dealing with retrieval of content. The input gathered from users in these interviews and questionnaires can be used to better design the taxonomy and its user interface, to obtain use cases to later test the taxonomy, to identify possible facets for a faceted taxonomy, and also to collect some concepts for the taxonomy.

Brainstorming sessions

Another method of obtaining input from users is through a brainstorming session. This method is particularly useful for internal enterprise taxonomies. Representative users from different departments can contribute their ideas by suggesting sample terms, which are written down on a white board, flipchart, or sticky notes, and then working with a facilitator, the brainstorming group can remove outliers, bring together synonyms and similar terms, and come up with categories or facets to group the terms. PoolParty is the only taxonomy management software that has an integrated brainstorming module called CardSorting.

Analyzing content

After determining the scope of content inclusion, content analysis should be performed on a representative sample of content of each of the different types and subject areas of content that will be tagged and retrieved, to identify topics and named entities relevant to the content. This form of content analysis is similar to indexing without a controlled vocabulary.  The taxonomist assumes the role of an indexer or someone tagging the content and notes what index terms or tags would best describe the content.

Automatic term extraction involves using text analytics software (which may be incorporated into taxonomy management software, such as in PoolParty) to extract candidate taxonomy terms based on their frequency and relevancy within a body (corpus) of text content. The suggested terms need to be analyzed for the context of their usage before determining whether they should be added to the taxonomy.  
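The frequency part of term extraction can be sketched in Python as follows. This is a deliberately simplified sketch that counts one- and two-word phrases containing no stopwords; the stopword list, function name, and sample sentences are hypothetical, and commercial text analytics software uses far more sophisticated linguistic and relevancy measures.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "are", "on", "with"}


def candidate_terms(documents, max_words=2, min_count=2):
    """Count word n-grams (up to max_words long) that contain no stopwords."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if not any(w in STOPWORDS for w in gram):
                    counts[" ".join(gram)] += 1
    # Keep only phrases that occur at least min_count times in the corpus.
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


corpus = [
    "Solar panels reduce energy costs.",
    "Installing solar panels is simple.",
    "Energy costs are rising.",
]
candidates = dict(candidate_terms(corpus))
```

The output is only a list of candidates. A taxonomist still has to look at how each phrase is used in context before deciding whether it becomes a concept, an alternative label, or nothing at all.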

Analyzing existing vocabularies

If an organization already has some controlled vocabularies (taxonomies, thesauri, term lists, terminologies, glossaries, etc.), whether currently in use or not, these should be analyzed as sources of terms for incorporation into the new taxonomy. Assuming the project is to create a new taxonomy, any existing controlled vocabularies may have been for a different purpose, so only some of the terms would be relevant. Glossaries tend to have too many detailed terms that are not needed for information retrieval, but these and any other vocabularies are good sources for synonyms/alternative labels.

Analyzing any search log reports

When creating or editing a taxonomy, it’s always useful to look at search logs, which indicate what users have been typing into the search box. Search log reports can be sorted by search string frequency, so that the most frequently used search strings are considered for inclusion into the taxonomy. The search strings should be edited to conform with taxonomy style and policy, but the exact search strings should be included as synonyms/alternative labels to support future searches.
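If the search tool only provides raw log lines rather than a sorted report, the frequency ranking is easy to sketch in Python. The sample search strings, the preferred label, and the function names are hypothetical, and the normalization here (lowercasing and collapsing whitespace) is just one reasonable assumption.

```python
from collections import Counter


def top_search_strings(log_lines, limit=3):
    """Normalize case and whitespace, then rank search strings by frequency."""
    normalized = (" ".join(line.lower().split()) for line in log_lines)
    counts = Counter(q for q in normalized if q)
    return counts.most_common(limit)


def add_alt_label(taxonomy, preferred_label, search_string):
    """Keep the exact search string as an alternative label of a concept."""
    taxonomy.setdefault(preferred_label, set()).add(search_string)


log = ["PTO", "pto ", "vacation days", "Expense  report", "pto", "expense report"]
ranked = top_search_strings(log, limit=2)

# The taxonomist decides on the styled preferred label; the raw string is kept as a synonym.
taxonomy = {}
add_alt_label(taxonomy, "Paid time off", ranked[0][0])
```

The ranking only suggests candidates; choosing the preferred label and deciding which concept a search string belongs to remains an editorial decision.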

Taxonomy testing

Near the completion of a taxonomy project, there should be some activity of taxonomy testing. Taxonomy use testing should test a taxonomy’s suitability for tagging content by manually test-tagging sample documents and determining if the desired terms are available in the taxonomy. Taxonomy use testing should also test the retrieval capabilities of the taxonomy. This is done by attempting to retrieve pre-identified documents with searches conducted by sample users with the search terms of their choice.
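The test-tagging half of this can be summarized with a simple coverage figure. The following Python sketch assumes the testers have written down the terms they wanted for each sample document; the document names, terms, and function name are hypothetical, and matching is by exact label only (alternative labels are ignored for brevity).

```python
def tagging_coverage(desired_tags_by_doc, taxonomy_labels):
    """Return (coverage ratio, sorted missing terms) for a test-tagging exercise."""
    labels = {label.lower() for label in taxonomy_labels}
    desired = {t.lower() for tags in desired_tags_by_doc.values() for t in tags}
    missing = sorted(desired - labels)
    coverage = 1.0 if not desired else (len(desired) - len(missing)) / len(desired)
    return coverage, missing


# Hypothetical results of manually test-tagging two sample documents.
desired = {
    "doc-1": ["Invoices", "Refunds"],
    "doc-2": ["Refunds", "Chargebacks", "Payments"],
}
coverage, missing = tagging_coverage(desired, ["Invoices", "Refunds", "Payments"])
```

The list of missing terms is the more useful output: each one is either a gap in the taxonomy or a candidate alternative label for an existing concept.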

Other tests on taxonomies, such as card sorting and A/B testing, which are also used in UX navigation testing, may be used in taxonomy development to test the preferences of the top two levels of a hierarchical taxonomy, but such tests are less suitable for multiple-level hierarchical taxonomies or for faceted taxonomies. More details are in my previous blog post on Testing Taxonomies.

Conclusions

Creating a taxonomy involves many research-related tasks, which can take up as much time or more than actually creating terms in a taxonomy. While there is a creative aspect to developing a taxonomy, a taxonomy also has to be based on research and analysis, with the emphasis on analysis. The research is more qualitative than quantitative, though.

Friday, April 30, 2021

Taxonomy Trends

Last fall I gave an 8-minute video presentation as part of the SEMANTiCS Video Forum 2020 on the subject of taxonomy trends, but the short talk allowed time to discuss only two of the past year’s trends. More recently, I reflected on longer-term trends in taxonomies when Jane Dysart, the chair of the Computers in Libraries conference, suggested that my pre-conference taxonomy workshop last month also include what’s new with taxonomies and assigned me the workshop title “Taxo Update: Latest in Designing & Maintaining Taxonomies.” While, by their nature and purpose, taxonomies should remain somewhat consistent in their design, I came up with some ideas in various sections of the workshop presentation. Now that the event is past, I’ve collected my observations of the taxonomy trends that I included in that workshop.
 

Convergence of types

A trend in the broader realm of knowledge organization systems is the convergence of different types. We are seeing a convergence of taxonomies and thesauri, which is due to factors including the widespread adoption of SKOS (Simple Knowledge Organization System), which supports both taxonomies and thesauri fully. Vocabulary management software, which is becoming more widely adopted than just using spreadsheets or the basic taxonomy editing feature of a content management system, supports both taxonomies and thesauri with no distinction. There may also be a growing preference to have the features of both: a dominating hierarchical structure as in taxonomies, and the benefit of additional associative (non-hierarchical) relations as supported in thesauri.

There is also a convergence of taxonomies and ontologies. This is also partly due to software tools, such as PoolParty, that support both taxonomies and ontologies in an integrated manner. There is a growing interest in ontology features, such as semantic relations and custom attributes, without having a large complex ontology, so a simple ontology can be applied as a semantic layer to existing taxonomies. This brings up the fact that there is a growing number of taxonomies in existence that can be utilized within an ontology, rather than being replaced by an ontology. Finally, there is an increasing interest in ontologies as they form a basis of knowledge graphs, which are becoming more popular.

Interest in standards

In the past, the focus on taxonomy-related standards was mostly on ANSI/NISO Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies and its related ISO standard, ISO 25964 Thesauri and interoperability with other vocabularies. Their emphasis on best practices for thesauri has perhaps limited these standards somewhat in their broader application to taxonomies. More recently, however, there has emerged a greater interest in interoperability of taxonomies and other controlled vocabularies, which is recognized and addressed in the second part of ISO 25964.

Even more significant is the increased adoption of SKOS and other W3C (World Wide Web Consortium) guidelines and recommendations, which directly support interoperability and exchange and sharing of vocabularies. As the number of taxonomies and other controlled vocabularies grows, there is a greater interest in re-using parts of them, sharing them, and linking them, which is enabled by representing the data in a standard format. The SKOS model can also be expressed in RDF (Resource Description Framework) triples, which makes it suitable for general Semantic Web sharing and linking, whether on the web or behind the firewall with Semantic Web standards. SKOS has also become the standard supported by most taxonomy management software.

Besides supporting interoperability, another trend coming out of SKOS is a shift from thinking of terms to thinking of concepts. Terms are strings of text, but concepts are ideas that may have various labels. Thus, people talk about “things, not strings.”
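The “things, not strings” idea can be illustrated with a minimal Python sketch that describes one concept as subject-predicate-object triples. The SKOS property names (prefLabel, altLabel, broader) and the SKOS and RDF namespace URIs are the real ones; the example.org namespace, the concept identifiers, the labels, and the helper functions are hypothetical, and language tags on labels are omitted for brevity.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/taxonomy/"  # hypothetical namespace for this sketch


def concept_triples(concept_id, pref_label, alt_labels=(), broader=None):
    """Describe one concept as a list of (subject, predicate, object) triples."""
    subject = EX + concept_id
    triples = [
        (subject, RDF_TYPE, SKOS + "Concept"),
        (subject, SKOS + "prefLabel", pref_label),
    ]
    triples += [(subject, SKOS + "altLabel", label) for label in alt_labels]
    if broader:
        triples.append((subject, SKOS + "broader", EX + broader))
    return triples


def find_concepts(triples, text):
    """Return the concept URIs that carry the given label, preferred or alternative."""
    label_predicates = {SKOS + "prefLabel", SKOS + "altLabel"}
    return sorted(
        {s for s, p, o in triples if p in label_predicates and o.lower() == text.lower()}
    )


triples = concept_triples("c42", "Automobiles", ["Cars", "Autos"], broader="c7")
by_alt = find_concepts(triples, "cars")
by_pref = find_concepts(triples, "Automobiles")
```

Whichever label a user types, the lookup resolves to the same concept URI. The URI is the “thing,” and the labels are merely strings attached to it.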

As for the standards for taxonomy and thesaurus best practices design, ANSI/NISO Z39.19 is not forgotten but rather there is sufficient interest in the taxonomy community to review and revise this standard again soon. I expect work on that to start in late 2021 or early 2022, and I hope to be involved. I will report more on that in a future blog post.

Trends in taxonomy structural design

In hierarchical taxonomies, there is the trend that hierarchies are created increasingly for purposes other than being fully displayed for end-user browsing. Traditionally, the hierarchical design structure of taxonomies was solely for the purpose of serving end-users who would be browsing and need guidance in going from broad categories to narrower topics. The associative (related term or see also) relationship also guides users who are browsing and those who are doing manual indexing/tagging to identify related concepts of interest. As fully browsable taxonomies are becoming less common (due to their growing size and the availability of alternative methods of search and findability), and more indexing is automated, hierarchical and associative relationships between concepts are less often implemented to support browsing, and are more often used to support auto-tagging, providing context for a concept’s meaning by the presence of broader and related concepts.

When the relationships between concepts are not displayed to end users, the taxonomy structure does not necessarily need to be as consistent, such as always having a set number of hierarchical levels in all places of the taxonomy. A taxonomy does not have to appear as complete and comprehensive, either, but rather it merely represents the content. Associative relationships between concepts may also be implemented more inconsistently. This is another factor that contributes to the convergence of taxonomies and thesauri, since by definition thesauri have associative relationships and taxonomies do not. But you may end up creating a taxonomy/thesaurus with just a few associative relationships.

Despite the trend toward fewer fully displayed hierarchical taxonomies, there are still many taxonomies that are fully displayed, such as in ecommerce applications. A growing trend is to combine different methods of expanding from one level to the next between different levels of the same taxonomy. There is also more sophistication in integrating both common and custom facets into different levels of a hierarchy.

Trends in uses

Last, but certainly not least, is the trend in wider adoption of taxonomies for various uses. This was the topic of my prior blog post, Industry Uses for Taxonomies.