The Accidental Taxonomist: Tagging

Monday, May 20, 2024

Tagging with a New Taxonomy

The benefits to information users of having content tagged with a taxonomy are great. They include increased accuracy and comprehensiveness of search results, speed and efficiency in obtaining results, the ability to filter search results, the opportunity to explore and discover related information, greater confidence in the completeness of results, and an overall better user experience. The benefits are worth the challenges of creating a taxonomy, and the benefits should be worth the challenges of properly tagging with a taxonomy as well.

Often the greatest challenge to taxonomy adoption is the ability to tag all of the content with the taxonomy terms as intended. Issues include allocating resources for tagging, implementing a new content management workflow, establishing criteria and quality control for tagging, and tagging a large volume of legacy untagged content.

Tagging Resources

While taxonomy development has one-time project expenses (such as the hours of a consultant or contractor), ongoing tagging with a taxonomy requires an annual budget on top of some startup expenses, whether tagging is manual or automated. Manual tagging requires budgeting for the working hours, while auto-tagging typically requires an annual software license. Automated tagging also requires some human involvement for quality checks and refinements of tagging parameters.

Which method to choose, manual or automated, depends on the volume and speed of tagging required, the nature of the content, and the need for accuracy. Automated methods are more cost-effective for tagging large volumes of content and can tag more quickly. Automated (AI) methods can tag text or images, but the same tool/technology does not do both, so for mixed content, manual tagging may be a more practical and affordable option. Automated methods are also better for content of a consistent type (e.g. all resumes, all news, all technical support articles), whereas a diversity of content (e.g. everything on the intranet or on the public website) can be tagged more accurately if done manually. Manual tagging may not be as consistent as automated methods, but unlike automated tagging, it is rarely wrong. If a 10-15% rate of mis-tagged content cannot be tolerated, then manual tagging may be preferred.

Automated tagging is not free from manual labor. If tagging is done by machine learning, then the machine needs to learn from examples, and sample tagged content may need to be prepared and submitted to the system as such examples. If tagging is done by rules, then rules need to be written for most of the taxonomy concepts. Prebuilt starter taxonomies may be pre-trained or have tagging rules included, but they will likely need refinement. In fact, any auto-tagging needs to be tuned and refined as the content and the taxonomy evolve.
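To give a rough sense of what "rules written for taxonomy concepts" can look like, here is a minimal sketch in Python. It is not how any particular auto-tagging product works; the concept names, the trigger phrases, and the function name tag_by_rules are all made up for illustration. Commercial tools typically support far richer rules (proximity, weighting, exclusions).

```python
import re

# Hypothetical rules: each taxonomy concept maps to phrases that trigger it.
RULES = {
    "Taxonomies": ["taxonomy", "taxonomies", "controlled vocabulary"],
    "Auto-tagging": ["auto-tagging", "automated tagging", "auto-categorization"],
    "Indexing": ["indexing", "indexer"],
}

def tag_by_rules(text, rules=RULES):
    """Return the sorted list of concepts whose trigger phrases occur in text."""
    lowered = text.lower()
    matched = set()
    for concept, phrases in rules.items():
        for phrase in phrases:
            # Word boundaries avoid matching a phrase inside a longer word.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                matched.add(concept)
                break  # one matching phrase is enough for this concept
    return sorted(matched)

print(tag_by_rules("Automated tagging with a taxonomy saves time."))
```

Even in a toy like this, the maintenance burden is visible: every new concept needs its own phrases, and the phrases need revisiting as the content changes.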

Tagging Workflow

Whether manual or automated, tagging content requires setting up new content management workflows. It needs to be determined who does the tagging: the author, the editor, or someone else. Unless trained professional indexers tag the content, tagging review by an editor may be desired.

While manual tagging can be done within the same system (some kind of content management system) where the content is stored, these systems usually don’t have auto-tagging functionality built in. Automated tagging is typically done by establishing an integration between the auto-tagging tool (which may be a module of a taxonomy management system) and the content management system and setting up a data “pipeline” for the tagging tool. Setting this up may require some additionally billed services of the software vendor.

The tagging workflow should also include a method for taggers, or those who review automated tagging, to suggest new terms to add to the taxonomy as they see new concepts in the content.

Tagging Standards

Establishing criteria and quality control for tagging begins with setting tagging policy and guidelines. This includes setting the policy regarding to what detail to tag, how many terms of each type may be tagged to a single piece of content, whether a certain taxonomy term type is required or not for tagging, and whether the tagging of certain terms should trigger the additional tagging of another term (such as a broader term). These policies can be set as parameters for auto-tagging. For manual tagging, some of the tagging policies can be system enforced, but other policies cannot be.

Tagging has both policies (rules) and guidelines (best practices/recommendations). A policy, for example, would be the minimum and maximum number of tags permitted, whereas a guideline would be a suggested narrower range of tags.
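The difference between a policy and a guideline can be sketched as the difference between an error and a warning. The Python below is only an illustration under assumed values: the limits, the required term type "Topic", the broader-term lookup, and the function name check_tags are hypothetical, and a real system might add the broader term automatically rather than report it.

```python
# Hypothetical policy (enforced) and guideline (advisory) values.
POLICY = {"min_tags": 1, "max_tags": 10, "required_types": {"Topic"}}
GUIDELINE = {"suggested_min": 3, "suggested_max": 6}

# Hypothetical broader-term lookup used for the "trigger" rule.
BROADER = {"Auto-tagging": "Tagging"}

def check_tags(tags, policy=POLICY, guideline=GUIDELINE, broader=BROADER):
    """tags is a list of (term, term_type) pairs. Returns (errors, warnings)."""
    errors, warnings = [], []
    terms = {term for term, _ in tags}
    types = {term_type for _, term_type in tags}
    # Policy: hard limits on the number of tags.
    if not policy["min_tags"] <= len(tags) <= policy["max_tags"]:
        errors.append("tag count outside policy limits")
    # Policy: certain term types must be present.
    for missing in sorted(policy["required_types"] - types):
        errors.append(f"missing required term type: {missing}")
    # Policy: some terms trigger the tagging of their broader term.
    for term in sorted(terms):
        parent = broader.get(term)
        if parent and parent not in terms:
            errors.append(f"'{term}' also requires broader term '{parent}'")
    # Guideline: a narrower suggested range, reported only as a warning.
    if not guideline["suggested_min"] <= len(tags) <= guideline["suggested_max"]:
        warnings.append("tag count outside suggested range")
    return errors, warnings

print(check_tags([("Auto-tagging", "Topic")]))
```

In this sketch a single tag passes the policy minimum but still draws a guideline warning, which is the kind of check a system can flag without blocking the tagger.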

Whether manual or automated, tagging should be occasionally checked for accuracy, as a periodic quality control function. Based on the results, revisions may be needed for the taxonomy, and/or the tagging guidelines/policies may need to be revised.
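A periodic accuracy check can be as simple as having a reviewer go over a sample of tagged items and counting how many assigned tags were rejected. The sketch below assumes that setup; the function name mis_tag_rate and the sample data are hypothetical, and it measures only wrong tags, not tags that were missed.

```python
def mis_tag_rate(assigned, reviewed):
    """Share of assigned tags that the reviewer rejected, across a sample.

    assigned and reviewed each map a content ID to a set of tags; reviewed
    holds the tags a human reviewer judged correct for that item.
    """
    total = wrong = 0
    for item_id, tags in assigned.items():
        accepted = reviewed.get(item_id, set())
        total += len(tags)
        wrong += len(tags - accepted)  # assigned but not accepted
    return wrong / total if total else 0.0

# Hypothetical sample of two reviewed items.
assigned = {"doc-1": {"Taxonomies", "Tagging"}, "doc-2": {"Indexing", "SharePoint"}}
reviewed = {"doc-1": {"Taxonomies", "Tagging"}, "doc-2": {"Indexing"}}
print(mis_tag_rate(assigned, reviewed))
```

The resulting rate can then be compared against whatever level of mis-tagging the organization has decided it can tolerate.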

Legacy Content Tagging

Even if there is an established workflow for tagging newly added content, there is the challenge of tagging all the legacy content that is already in the system. It’s rare that a taxonomy is implemented before any content is already collected and made available for searching.

Automated tagging may be a good way to handle the backlog of untagged content. However, software is intended to be licensed for at least a year and be a part of the regular workflow, rather than for a one-time backlog tagging project. So, the long-term use of auto-tagging software needs to be considered.

If manual tagging alone will be the selected method for the long term, then you should consider the tagging services of a freelancer, contractor, temp, or intern (library science student) to take care of tagging the initial backlog of content. Freelance indexers can be found through the American Society for Indexing and indexing societies in other countries. They prefer to call the activity “indexing,” rather than “tagging.”

While taxonomy creation is a project, taxonomy management and maintenance are an ongoing program, and it’s the same with tagging. Backlog tagging will be a project, but ongoing tagging is a program, and it should be tied to taxonomy management and maintenance. Tagging should be an important part of an information and content management strategy and not an afterthought.

Thursday, October 31, 2019

Managing Tagging with a Taxonomy

A lot of work can be put into designing and creating a taxonomy, but if it’s not implemented or used properly for tagging or indexing, then that work can be wasted. As the volume of content has grown, many organizations have invested in auto-tagging/auto-categorization solutions utilizing text analytics technologies. However, there remain many situations where manual tagging is still more practical. So, support for correct and efficient manual tagging needs to be considered. This is the topic of my upcoming presentation at the Taxonomy Boot Camp conference, in Washington, DC, on November 4.

A taxonomy can be designed to support manual tagging by including alternative labels (synonyms), hierarchical and associative relationships between terms, and term notes, to guide those doing the tagging to the most appropriate terms, even if these taxonomy features are not fully available to end-users in their user interface. It may be easier to have these features available in a customized manual tagging/indexing tool than it is to make them available in the end-user application. A taxonomy has more than one set of users, and the tagging-users need the full benefits a taxonomy can offer.
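As an illustration of how alternative labels can guide a tagger to the preferred term, here is a small type-ahead sketch in Python. The taxonomy fragment, its alternative labels and note, and the function names build_lookup and suggest are all hypothetical; a real tagging tool would draw these from the taxonomy management system.

```python
# Hypothetical taxonomy fragment: preferred label -> alternative labels and a note.
TAXONOMY = {
    "Controlled vocabularies": {
        "alt_labels": ["authority files", "term lists"],
        "note": "Use for vocabularies in general; prefer a narrower term if one fits.",
    },
    "Thesauri": {"alt_labels": ["thesaurus"], "note": ""},
}

def build_lookup(taxonomy):
    """Map every preferred and alternative label (lowercased) to its preferred label."""
    lookup = {}
    for preferred, entry in taxonomy.items():
        lookup[preferred.lower()] = preferred
        for alt in entry["alt_labels"]:
            lookup[alt.lower()] = preferred
    return lookup

def suggest(typed, lookup):
    """Return preferred labels that have any label starting with the typed text."""
    prefix = typed.strip().lower()
    if not prefix:
        return []
    return sorted({pref for label, pref in lookup.items() if label.startswith(prefix)})

lookup = build_lookup(TAXONOMY)
print(suggest("authority", lookup))
```

The tagger who types an alternative label is led to the preferred term, so the content ends up tagged consistently regardless of which wording came to mind first.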

It’s very important to develop a customized policy for tagging with a taxonomy, so that it is used correctly and consistently. Any policy for tagging or indexing should include both rules and recommended guidelines. Examples of policy topics include:

Criteria for determining topic or name relevancy for tagging
Depth and level of detail of tagging
Comprehensiveness of aspects (what, who, where, when, how, why, etc.)
Required term types/facets (and any dependencies)
Number of terms (of each type) to tag
Tagging of certain terms in combination (e.g.: a parent/broader term in addition to its narrower/child term)
Other types of metadata that must be entered

It’s often not enough to just provide people with a policy document. Some degree of training on proper tagging can be very beneficial. In a current SharePoint taxonomy project, one of the users who tags uploaded documents said to me, “The problem is that we have not been trained. We are guessing.” Policy and guidelines should initially be delivered as a presentation (live or web meeting) to allow for questions and answers.

With large volume tagging, the initial tagging should be reviewed and feedback should be provided. This is the case for both new and experienced indexers. Even experienced indexers need to become familiar with the content and learn the policies and guidelines that are particular to the organization and project. In a recent taxonomy project that involved indexing hundreds of articles by a professional indexer, even the professional indexer’s initial indexing was reviewed to make sure it was as thorough and accurate as required.

Finally, there needs to be a method of communication and feedback between those doing the tagging and the person (taxonomist) who is managing the taxonomy, which is a controlled vocabulary, after all. The taxonomist should inform those tagging of new terms and changed terms, especially if they are high-profile terms, and may also provide tips for tagging new and trending topics. Meanwhile, those doing tagging need a method to contact the taxonomist to request clarifications or the addition of new terms. This could be by email, but collaboration workspaces may also work well. While I, as a consultant, do not stay on as tagging continues, I like to be available at the start of tagging with a new taxonomy, to answer indexing questions, something I did just this past month on my most recent consulting project.

Thursday, September 6, 2018

An Open Vocabulary Tagging Experiment for Discoverability

Does tagging content with terms from a shared, publicly available controlled vocabulary make a difference in increasing content discoverability on the web? A colleague of mine proposed finding out by experimenting with tagging the same content, such as two identical blog posts, differently: one with terms typical for posts on the blog and one with terms from a publicly available controlled vocabulary. Then, after a few weeks, the statistics of visitor traffic to the two post versions would be compared.

Wikidata and VIAF were chosen as the sources of publicly available controlled vocabulary terms. Since VIAF contains only name authorities (proper nouns), I used terms just from Wikidata in my blog tagging experiment, whereas my colleague used terms from both Wikidata and VIAF in his blog post tagging experiment (The Open Web Tagging Experiment on the Ol' Patio Boat Blog).

The preceding blog post on The Accidental Taxonomist blog, "Using Linked and Other Open Vocabularies," had been posted twice identically, except that one version was tagged with terms from Wikidata, linking to them, and one was tagged with terms that have been created and used just for The Accidental Taxonomist blog. I did not link to either blog post from other social media, as I usually do. (Now that the experiment is over, I have deleted the duplicate blog post with the lower number of visitors recorded.)

After 18 days, I checked the statistics for the number of visitors to each blog post. The version with the blog's own tags (the tagging feature supported by Blogger.com) had 72 visitors, and the version without blog tags but with links to Wikidata tags had 104 visitors. (By contrast, this post "An Open Vocabulary Tagging Experiment for Discoverability" had in the same period attracted 119 visitors, without any tags or links to Wikidata terms during this period.)

The conclusions are not certain, but it appears that links out to Wikidata may have helped that post's discoverability, since the post with those links had more visitors. It also appears that blog tags do not help significantly in discoverability, since of the three posts, the one with those tags had the fewest visitors, although the tags are useful for finding specific posts once you are on the blog's home page. The results of my colleague's test of two identical posts with and without tagging were different, though. He concluded the opposite: that copying Wikidata and VIAF headings into a post with incoming URLs had no effect, but putting metadata into the Blogger tagging field did increase visibility. However, his visitor traffic in both cases was very low, so the difference was perhaps not statistically significant.

As for this post, which had no tags but the highest number of visitors, that could be attributed to a post title with more frequently searched keywords and phrases in it.

Search engine optimization is a big and ever-changing field. Rather than try to game the search, I will return to my method of posting about my blog posts on social media and hope my connections will share and repost.

Friday, February 28, 2014

Tagging vs. Indexing

I have blogged before on the difference between tags and categories, but recently someone asked me about the difference between tagging and indexing (the manual kind). It's not a simple answer.

One important way in which tagging and indexing differ is that tagging involves any kind of designation about a piece of content, what it is or what it is about, whereas indexing is restricted to descriptive labels for what content is about. Tagging can include content type, date, creator, source, audience, location, rights, keywords, etc., whereas indexing is for the subjects of the content. In this sense, tagging is sort of the modern word for cataloging or the assignment of metadata.

But what if we are concerned with just the descriptive labeling of content and not other metadata? That might be called tagging or it might be called indexing. In this case, the difference is more nuanced, and to a certain extent it is historical.

When I first entered this field in the early 1990s, the notion of "tagging" was not really known. Indexing, on the other hand, was a recognized activity. There are two kinds of indexing:
1) Closed indexing or back-of-the-book indexing, where the index is created based solely on concepts found in a single monograph, and the index is created for that one monograph and is then finished ("closed").
2) Open indexing, or what was then called database indexing, whereby index terms taken from a controlled vocabulary or thesaurus are assigned to multiple individual documents or digital assets, with the content growing over time and the same index terms pointing to more and more documents.

Then, with the rise of social media, "tagging" became popular in the form of assigning keywords and names to photos or blogposts or other digital content. Initially, tagging was clearly different from indexing, because:
1) Tagging did not use a controlled vocabulary (aka thesaurus or taxonomy)
2) Tagging was done by creators and consumers of content, and not trained indexers. "Indexer" is a profession; "tagger" is not.

Indexing is also different from tagging in what results from it. If we look to the origin of the word "index", it means to indicate or to point (as with your index finger). So, the result of indexing is an "index" that the user can browse to locate referenced (if in print) or linked (if electronic) content. A thesaurus/taxonomy and an index (a structured list of the terms that had been used for indexing) could be essentially the same thing. Sometimes the entire index is not browsable, but rather just a section via a type-ahead scroll-box feature. Tagging, on the other hand, with the lack of controlled vocabulary, does not result in any created work, just a folksonomy, which, with its multiple terms with the same or overlapping meaning, is not suitable for browsing. If displayed, tagging terms are shown by popularity instead, such as in a tag cloud, which is interesting, but not an accurate method for content findability and retrieval.

In time, enterprise software adopted social media methods, user interfaces, and features. As a consequence, tagging became more formalized as an employee task, and folksonomies got edited into controlled vocabularies or taxonomies, if not at least becoming sources for taxonomy terms. So, now tagging may be done with or without a controlled vocabulary, and both consumers and professional editors/content managers (if not “taggers”) do tagging.

"Tags" and "tagging" are now also designated features of content management and digital asset management software, and content editors "tag" with terms from a controlled list. As such, the distinctions between "indexing" and "tagging" have become blurred, and what this activity is called may depend on what the software vendor, the industry (publishing may prefer to call it indexing, whereas ecommerce calls it tagging), and the corporate culture prefer to call it.

The designation of “indexing”, as open index creation, is also becoming less common as the full display of indexes has become less common. Search boxes (even if what the user enters into them is matched against a thesaurus) have often replaced long alphabetized lists of subject entries and subentries. We continue to find indexes at the back of books, but online, for electronic content, the displayed browsable index is less common than it used to be.

Wednesday, July 31, 2013

Tags and Categories

What does a taxonomy comprise and how does it work? Professional taxonomists may speak of “terms,” “nodes,” or “labels,” whereas most other people with a basic understanding of taxonomy might refer to “tags” or “categories.” A category is a well understood concept, and social media sites have made the notion of “tag” well known.

In addition to the different professional level of such jargon, there is also a distinction in meaning. Ironically, it’s the professional terminology that is vague and the layman terminology that is more specific. Taxonomy “terms,” “nodes,” or “labels,” are all pretty generic and can all have various applications for different kinds of taxonomies, both for broad categorization and for specific indexing. “Tags” and “categories,” on the other hand, each tend to have distinct meanings. It’s not so much what they are, or even how they are organized, but rather how they are used.

Tags are for tagging.
That seems obvious. As for what is meant by “tagging,” that implies you put a tag on something. In fact, you can put more than one tag on something, and that’s typically encouraged in tagging. “Something” is typically an electronic file of some form of content, a document, image, video, database record, blog post, etc. Tags tend to be a brief label indicating what something is about. Tags can be very specific or relatively broad. Information professionals might prefer to call them “index terms.” An organized, alphabetized list of tags could serve as an index.

Categories are for categorizing.
This can also be called grouping or classifying. It implies putting something into a category, often represented as a file folder, whether an actual electronic folder path, or just a depiction of a folder icon. While categories have different levels of specificity, the name category implies a collection of things, so there is an implicit understanding that categories don’t get too specific. An organized structure of categories typically constitutes a hierarchical taxonomy.

Can something go into more than one category? In physical folders, no (unless you make a photocopy of the document for each folder), but in the digital world, often the answer is yes, but not always (again requiring the copying of files). It depends on the system, and it may involve some workaround. Even when it is possible to put a content item into more than one category, unlike with tags, it is still preferable to have most content items assigned to only one category, with a smaller number of them belonging in two categories. For example, there may be a breadcrumb trail for the hierarchy of categories, and the breadcrumb trail may only take a single path. The idea is that the categories retain distinct meaning and usage through mostly distinct content.

Tags and categories together
Because tags and categories are different, it is possible to have both at the same time, especially if the categories are deliberately kept broad and the tags are relatively specific. Content management systems and digital asset management systems increasingly offer features of both categories and tags for managing content. In these cases, the challenge is to decide to what degree of classification to use the categories and to what degree to use the tags. That's exactly what I have done as a taxonomist on two recent consulting projects.

For the amateur taxonomist and indexer, one of the most common exposures to tags and categories is through blogs. Blogging software may permit the blog author to assign a tag or category to a blog post. Whether the tags and categories are appropriately named and used is another issue, though. Blogger.com provides only one option, which it calls "Labels" and utilizes an icon for a tag in the blogging interface, but then displays them when published in the right margin under a heading called "Categories." No wonder my "categories" don't look good; I had created them as if they were tags. Furthermore, the very specific subject matter of "The Accidental Taxonomist" blog makes its posts more suited for tagging than for categorizing. WordPress, on the other hand, gives the blogger both tools: tags and categories. If “The Accidental Taxonomist” blog eventually moves, you’ll know why.