The Accidental Taxonomist: Search


Monday, September 30, 2024

Topical Taxonomies for Filtering Searches

PoolParty GraphSearch

We taxonomists have long argued that a taxonomy of disambiguated concepts tagged to content retrieves more accurate results than search algorithms alone. But if users prefer simply entering text strings into a search box rather than browsing taxonomies, how best to support them with a taxonomy can be a challenge.

A faceted taxonomy with taxonomy aspects as filters for refining search results has become a common taxonomy solution, especially for intranets, partner portals, and knowledge bases. For these purposes, certain facets, such as Content type, Product/Service, Location, and Department, are common and logical. When it comes to designating “Topics,” however, it’s not so easy.

Specific Terms Gathered from Analysis

When gathering information and sources for terms, most sources will yield highly specific terms. These include terms arising from search log analysis, brainstorming sessions with sample users, automated text analytics term extraction from a large corpus of content, and manual review of a representative sample of documents/pages. These are all standard methods for taxonomy design, which I conduct as a consultant.

The difficulty is that there are often so many specific topics that the new topical taxonomy could potentially have many hundreds of terms. Some may be relevant to only one or two documents or may have occurred in only a couple of searches out of thousands. Such terms would not serve the purpose of refining searches.
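As a rough illustration of weeding out rarely searched terms, the following minimal Python sketch counts normalized search strings and keeps only those above a threshold. The function name, the sample log, and the `min_count` value are assumptions made for illustration, not part of any particular search platform.

```python
from collections import Counter


def frequent_search_terms(search_log, min_count=3):
    """Return normalized search strings seen at least min_count times."""
    counts = Counter(
        query.strip().lower() for query in search_log if query.strip()
    )
    return {term: n for term, n in counts.items() if n >= min_count}


# Hypothetical sample of raw search strings from a search log.
log = [
    "expense report", "Expense Report", "vpn", "expense report ",
    "parental leave", "vpn", "vpn",
]
candidates = frequent_search_terms(log, min_count=3)
print(candidates)
```

Terms that survive such a cut are still only candidates. They tend to be specific, so they would usually be grouped under broader categories rather than used directly as filter values.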

Another problem is that many of the terms suggested from these methods are not even topical. Often, the top searches found in search logs of enterprise/intranet searches are for commonly used named tools, platforms, or services.

The main issue, however, in deriving terms for a topical facet/filter based on search terms is that the objective of the topical facet, like all facets, is to limit searches, not to duplicate searches. What is really needed in the topical facet are topical categories that are broader than the search terms. How to identify these broader topical categories can be more challenging.

Identifying Broader Topical Categories

Identifying broader terms or categories for topic filters is not as simple as identifying specific search terms, nor as straightforward as identifying the set of facets. The typical methods of obtaining candidate terms from both users and the content still need to be applied, but with a focus on identifying broader terms or categories.

Categories from Stakeholder Engagement

Engaging stakeholders or other sample users in activities to brainstorm taxonomy terms will result in a mix of specific and broad terms. It is then the task of the taxonomist-facilitator to help guide the participants to identify which terms are broader and which are narrower within the same topical facet. Involving stakeholders/sample users is important, because if a single taxonomist or an external consulting team tries to do this on their own, their designated broader terms, while hierarchically correct, might not suit the intended users. The taxonomist-facilitator may suggest broader terms and then obtain immediate validation from the participants of the appropriateness of those suggestions.

Categories from Content Analysis

Analyzing content for broad topics is more effectively done manually than with automated methods. Manual content analysis will yield both specific and potentially broader concepts. A taxonomist or content strategist experienced in content analysis for identifying meaning will be able to determine the main concept for a piece of content.

Automated methods, based on text analytics technologies, tend to focus on term extraction, and will extract terms even more specific and less useful than search log results. However, if a list of derived search terms is large enough (as many search logs or automated term extraction lists tend to be), another, newer option is to make use of LLM and generative AI technologies to categorize the specific terms and thus generate broader terms. The LLMs should be trained on the same or similar content, which is internal enterprise content, not the public web, to provide the correct context. Even then, the identified broader terms or categories will not always be correct and will require an experienced taxonomist to review.

Tuesday, April 30, 2024

Synonym Rings (or Search Thesaurus)

A synonym ring is a simple kind of controlled vocabulary that, as the name suggests, has controlled synonyms for concepts and nothing more. I have long included mention of synonym rings in presentations I’ve given with sections listing and describing controlled vocabulary types, and the synonym ring has appeared on diagrams illustrating comparative complexity and included features of the various controlled vocabularies, progressing from the simplest term lists to synonym rings, name authorities, taxonomies, thesauri, and finally ontologies.

However, until now, I have not gone into detail about synonym ring use and design.

The name “synonym ring” is generally known only by taxonomists and other information professionals. It is called a “ring” because all synonyms point to each other, as in a circle or ring, rather than to a preferred term/label. Another name for it is a “search thesaurus,” although it should be clear that “thesaurus” is meant to be the Roget’s type and not the information retrieval type (similar to a taxonomy). I have also read the name “synset” but have not heard it in practice.

What we are talking about is a managed set of concepts, each with one or more synonyms, created specifically for supporting search, matching end-user search strings to text strings in the content being searched, for commonly searched concepts. The synonyms also match to variant names of the concept throughout the body of text that is being searched. Because the synonym ring’s purpose is to support search, it is not browsed and thus not displayed to the end users. Therefore, a preferred term or preferred label for each concept is not needed and thus not included.
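To make the structure concrete, here is a minimal Python sketch of a synonym ring used for query expansion. It assumes a simple in-memory model in which every member of a ring points to the whole ring and no member is preferred. The function names and the sample rings are illustrative assumptions, not the data model of any specific search product.

```python
def build_ring_index(rings):
    """Map every normalized synonym to the full ring it belongs to."""
    index = {}
    for ring in rings:
        members = frozenset(term.lower() for term in ring)
        if len(members) < 2:
            continue  # a lone term has no synonyms, so it is left out
        for term in members:
            index[term] = members  # each member points to all the others
    return index


def expand_query(query, index):
    """Return the ring for the query, or just the query if no ring matches."""
    key = query.strip().lower()
    return sorted(index.get(key, {key}))


# Hypothetical rings; note that none of them has a preferred label.
rings = [
    ["Cars", "Automobiles"],
    ["Email", "E-mail"],
    ["GDP", "Gross domestic product"],
]
index = build_ring_index(rings)
print(expand_query("e-mail", index))
```

A search for any member of a ring is expanded to all of its members, which is what lets a user's search string match variant wording in the content.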

Whether in a synonym ring or in another controlled vocabulary or taxonomy, “synonyms” refer to concept variants and not literal grammatical synonyms. In a controlled vocabulary, they are often phrases, not single words, and they are for things/concepts, and not all kinds of words (different parts of speech) found in a dictionary. They also don’t have to be exact synonyms, but rather sufficiently synonymous for the context of the content being searched.

Features of a synonym ring (search thesaurus)

It includes only concepts for which there are “synonyms.” Each concept must have at least two synonyms. If there are no synonyms for the concept, then the concept is not included in the synonym ring (in contrast to a regular controlled vocabulary). So, important concepts may be absent.
Synonyms are not displayed to the users, so slang, deprecated, potentially offensive terms, etc. may be included.
It supports searching only and not tagging. People doing manual tagging or systems doing auto-tagging will not be able to make use of the synonyms to identify the best concept to tag with. (They could utilize another taxonomy implemented in another system for tagging.)

Implementation of synonym rings

Typically, when taxonomists are called upon to design a taxonomy, they design it with synonyms (aka alternative labels, nonpreferred terms, variants, etc.) included. Thus, creating a dedicated synonym ring type of controlled vocabulary is not common, since the necessary synonyms are already included in the taxonomy. Small taxonomies may not have synonyms, though.

Search that is built into content/record management systems may support search synonyms, but this tends to be more ad hoc than a managed controlled vocabulary. Recently I looked into the synonym support in controlled vocabularies and taxonomies in Salesforce Service Cloud. It supports the creation of “custom synonym groups,” where each group is a synonym ring of up to six synonyms per concept, but these have to be entered individually in the user interface, rather than imported as a list. As such, it’s not really a “controlled vocabulary” set.

Some content management systems with included taxonomies only enable synonyms as part of their standard displayed taxonomies and not as non-displayed search synonyms. Other systems, such as SharePoint, support the use of synonyms for their taxonomies (managed in the Term Store) for tagging but not for searching.

Systems that support search synonyms often provide them as a systems administrator feature, which is something that the technical systems administrators may use, while taxonomists, information architects, and knowledge managers may not know about it. After all, a set of synonyms is not a “taxonomy,” so taxonomist involvement may not even be considered. Thus, communication is necessary between those who advocate the need for comprehensive search synonyms and know how best to create them and those who are in a technical role for implementing them in a system.

Advantages of synonym rings

A synonym ring is relatively easy to develop. While there are nuances to creating synonyms (described below), it’s easier than creating other controlled vocabularies or taxonomies, since there is no need to worry about which term should be preferred and how to best create a hierarchy. Since it is not displayed, getting input from users is not required.

Focusing on supporting only searching, and not also tagging, makes the task of coming up with synonyms simpler as well. Sometimes you want synonyms to support search and not tagging, and sometimes to support tagging and not searching (such as when the synonyms display to users), and trying to design for both scenarios in the same taxonomy is not easy.

When searching is the primary way that users access content, rather than browsing and filtering, a synonym ring may be an ideal solution. It might not make sense to go to the effort to design and create a hierarchical taxonomy for terms that users are searching on, if the goal is to simply enhance search.

A taxonomy runs the risk of being too broad or too specific, but a synonym ring never has that issue. The size of a synonym ring type of controlled vocabulary is flexible, and it can be built out gradually over time with no detriment.

Disadvantages of synonym rings

A synonym ring is not a standard controlled vocabulary type and is not supported in the SKOS (Simple Knowledge Organization System) data model standard of the World Wide Web Consortium. This is because a SKOS controlled vocabulary (including taxonomies) needs to have preferred labels for its concepts. Thus, synonym rings are not interoperable in the same way that other controlled vocabularies are. You cannot link to external synonym rings, and you cannot even import or export them easily. They are managed within a siloed system.

Since synonym rings do not support tagging, an additional controlled vocabulary with synonyms for tagging, which is somewhat redundant in its subject scope, may need to be created.

Creating synonyms for a synonym ring

“Synonyms” can include dictionary synonyms, synonyms for individual words within multi-word phrases (e.g. political protests / political demonstrations), formal and colloquial names, acronyms, etc. Following is a list of example types:

synonyms: Cars / Automobiles
quasi-synonyms: Learning / Training
variant spellings: Email / E-mail
lexical variants: Selling / Sales
foreign language names: München / Munich
acronyms/spelled out: GDP / Gross domestic product
scientific/popular names: Neoplasms / Cancer
older/current names: Near East / Middle East

Care should be taken not to include synonyms that are not sufficiently equivalent or may be vague and have other usages, such as “development” (which could refer to software development, nonprofit fundraising, or something else). It depends on context. For example, “tools” as a synonym for “software” would be acceptable if the content were only about technology and did not include manufacturing, construction, etc.

Synonyms can be identified when doing research for concepts to include, including manual content analysis, automatic term extraction, lists of uncontrolled keyword tags, and search log reports. Search logs are especially suitable for synonym rings, since their usage is the same: user search strings. However, searches are often on single words, whose meaning is vague. For example, a search string word of “application” is too vague and should not be used as a synonym. You should only take search log search strings if their meaning is clear.
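One way to apply this check when harvesting candidates from a search log is sketched below in Python. It assumes a hand-maintained set of vague single words. The set shown is hypothetical and would depend entirely on the context of the content, and the final decision on each candidate still belongs to a person.

```python
# Hypothetical, context-dependent list of single words too vague to use alone.
AMBIGUOUS_WORDS = {"application", "development", "tools"}


def usable_candidates(search_strings, ambiguous=AMBIGUOUS_WORDS):
    """Keep search strings whose meaning is not a known vague single word."""
    kept = []
    for raw in search_strings:
        text = " ".join(raw.lower().split())  # normalize case and spacing
        if text and text not in ambiguous:
            kept.append(text)
    return kept


sample = ["application", "job application", "Software  Development", "development"]
print(usable_candidates(sample))
```

Multi-word strings that contain a vague word, such as "job application," pass through, since the added words are what make the meaning clear.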

Finally, developing synonyms for a synonym ring implemented in an internal content management system is not the same as developing synonyms for a public website to support web search engine optimization (SEO), for which they are also called “search synonyms.” For SEO, web search engine algorithms need to be considered, and obtaining the greatest number of visitors is the goal, even if those site visitors did not intend to come to the website. In such cases, more specific concepts (e.g. “iPhone” as a synonym for “cell phone”) as “synonyms” would be fine. If website visitors do not find what they are looking for, that’s OK. By contrast, users of an enterprise CMS or search system would consider it a waste of their time if they retrieved additional content that did not match their search. Although sample user testing is not needed, search testing to check the accuracy of results should be performed.

Friday, November 24, 2017

Auto-categorization and Taxonomies

Taxonomies and thesauri are only truly useful if their terms are appropriately indexed or tagged to content. My path to taxonomist had been as an indexer, so I always value the importance of human indexers. Nevertheless, I must acknowledge that automated indexing, also called auto-categorization, is becoming increasingly common and important.

At the most recent Taxonomy Boot Camp conference (November 6-7, in Washington, DC), a trend I discerned was the increasingly commonplace use of auto-categorization (or at least machine-aided indexing) with taxonomies. Conference presentations didn’t present auto-categorization as something new but rather as something matter-of-fact: by the way, the software vendor used in this case is so-and-so. There were also sessions on artificial intelligence and taxonomy and on leveraging taxonomy management with machine learning. There is also a lot of interest in text analytics, a field broader than auto-categorization, which justified the first Text Analytics Forum conference co-located with and immediately following Taxonomy Boot Camp (which I, unfortunately, did not have time for).

Conference speakers and others state that automated indexing has been proven repeatedly in test comparisons to be more “reliable” and more “consistent” than human/manual indexing. While true, that does not mean it is better. Human indexing is certainly not as consistent, as two trained indexers will not index exactly the same way, but the way they differ is rarely so substantial. One indexer may add an additional index term. Another indexer may index with a slightly different, but related, term. Automated indexing, on the other hand, while consistent, is not as correct. Depending on the method, it can be approximately 20% inaccurate, indexing with completely wrong terms or completely missing the most appropriate terms. That’s where “machine-aided indexing” comes in, where indexing is initially automated, but a human quickly reviews the suggested terms, adding or deleting terms as appropriate.

The primary reason for implementing automated indexing is not so much to achieve consistent indexing, but rather to achieve efficient indexing. This is because the amount of content to be indexed in many organizations is growing too fast to be kept up with by manual indexing. Publishers of external content for subscribers have also transitioned to partial automated indexes or machine-aided indexing.

While enterprise search engines do not utilize taxonomies by default (but can be configured to make use of them), auto-categorization software generally uses some form of taxonomies. Search engines can function out-of-the-box without any taxonomies or controlled vocabularies, although a search thesaurus (a.k.a. synonym ring) can significantly improve search precision and recall. Auto-categorization software, on the other hand, relies on “categories,” which can be simple controlled vocabularies or hierarchical or faceted taxonomies. Thus, as auto-categorization is gaining wider adoption, the need for taxonomies to support it is also growing.

Automated indexing technologies have not advanced significantly in recent years, but there have been improvements in auto-categorization software by effectively combining more than one technology method within the same software product. The main technology methods are (1) rules-based and (2) machine-learning. Regardless of the method, automated indexing is still not fully automated. Humans are required to put in time and effort beforehand to either write or edit rules for each taxonomy term, or to provide and test training sets of sample documents to index for machine learning. These could be dedicated roles or additional tasks to be performed by the taxonomist.
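To illustrate what writing rules for each taxonomy term can involve, here is a minimal Python sketch of the rules-based method. It assumes each rule is simply a list of regular expressions per term. The terms and patterns are hypothetical, and commercial auto-categorization products use their own, richer rule languages.

```python
import re

# Hypothetical rules: each taxonomy term maps to patterns written by a person.
RULES = {
    "Expense reporting": [r"\bexpense reports?\b", r"\breimbursements?\b"],
    "Remote access": [r"\bvpn\b", r"\bremote (access|login)\b"],
}


def categorize(text, rules=RULES):
    """Return the taxonomy terms whose rules match the text."""
    lowered = text.lower()
    return sorted(
        term
        for term, patterns in rules.items()
        if any(re.search(pattern, lowered) for pattern in patterns)
    )


suggested = categorize("Submit your expense report before connecting to the VPN.")
print(suggested)
```

Even in this toy form, the human effort is visible: someone has to write, test, and maintain the patterns for every term, and the output is best treated as suggested terms for review.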

Auto-categorization is also becoming more common because software products that effectively combine taxonomy management with auto-categorization have become more established and better integrated. Although there are many organizations which continue to use distinctly separate software for taxonomy management and for auto-categorization, organizations newer to taxonomy adoption prefer to have a single solution. Synaptica is the one major taxonomy management vendor which does not yet include fully integrated auto-categorization, and they are very actively working on incorporating the technology. I have separate chapters in my book, The Accidental Taxonomist, for software for taxonomy management and software for auto-categorization, but in my second edition I ended up repeating more vendors in both sections.