The Accidental Taxonomist: Indexing

Thursday, October 31, 2019

Managing Tagging with a Taxonomy

A lot of work can be put into designing and creating a taxonomy, but if it’s not implemented or used properly for tagging or indexing, then that work can be wasted. As the volume of content has grown, many organizations have invested in auto-tagging/auto-categorization solutions utilizing text analytics technologies. However, there remain many situations where manual tagging is still more practical. So, support for correct and efficient manual tagging needs to be considered. This is the topic of my upcoming presentation at the Taxonomy Boot Camp conference, in Washington, DC, on November 4.

A taxonomy can be designed to support manual tagging by including alternative labels (synonyms), hierarchical and associative relationships between terms, and term notes, to guide those doing the tagging to the most appropriate terms, even if these taxonomy features are not fully available to end-users in their user interface. It may be easier to have these features available in a customized manual tagging/indexing tool than it is to make them available in the end-user application. A taxonomy has more than one set of users, and the tagging-users need the full benefits a taxonomy can offer.
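As a rough illustration of how these features might look to a tagging tool, here is a minimal sketch in Python. The `Term` structure, the `build_lookup` helper, and the sample terms are all hypothetical and not taken from any particular taxonomy product; the point is only that alternative labels, relationships, and notes can steer a tagger from the word they typed to the preferred term.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    """One taxonomy concept with the features that help manual taggers."""
    label: str                                            # preferred label
    alt_labels: list[str] = field(default_factory=list)   # synonyms
    broader: list[str] = field(default_factory=list)      # hierarchical parents
    related: list[str] = field(default_factory=list)      # associative links
    note: str = ""                                        # term/scope note

def build_lookup(terms):
    """Map every preferred and alternative label (lowercased) to its term."""
    lookup = {}
    for term in terms:
        for name in [term.label, *term.alt_labels]:
            lookup[name.lower()] = term
    return lookup

# Hypothetical sample terms, for illustration only.
terms = [
    Term(
        "Automobiles",
        alt_labels=["Cars", "Autos"],
        broader=["Vehicles"],
        related=["Auto industry"],
        note="Use for passenger vehicles; for trucks use Trucks.",
    ),
    Term("Vehicles"),
]

lookup = build_lookup(terms)
match = lookup.get("cars")  # the tagger typed a synonym
```

Here `match` resolves to the Automobiles term, so the tagging tool could show its note and its broader and related terms before the tagger commits to it.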

It’s very important to develop a customized policy for tagging with a taxonomy, so that it is used correctly and consistently. Any policy for tagging or indexing should include both rules and recommended guidelines. Examples of policy topics include:

Criteria for determining topic or name relevancy for tagging
Depth and level of detail of tagging
Comprehensiveness of aspects (what, who, where, when, how, why, etc.)
Required term types/facets (and any dependencies)
Number of terms (of each type) to tag
Tagging of certain terms in combination (e.g., a parent/broader term in addition to its narrower/child term)
Other types of metadata that must be entered
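Some of the policy topics listed above are rules that could, in principle, be checked automatically at tagging time. The following is a minimal sketch under that assumption; the `TaggingPolicy` class, the `check_tags` function, and the facet and term names are hypothetical examples, and judgment-based guidelines such as relevancy criteria would still need a human reviewer.

```python
from dataclasses import dataclass, field

@dataclass
class TaggingPolicy:
    # Facets that every content item must be tagged with.
    required_facets: set[str]
    # Maximum number of terms allowed per facet (facets not listed are unlimited).
    max_terms: dict[str, int]
    # Narrower term -> broader term that must be tagged alongside it.
    require_parent: dict[str, str] = field(default_factory=dict)

def check_tags(tags: dict[str, list[str]], policy: TaggingPolicy) -> list[str]:
    """Return a list of policy violations; an empty list means the tags pass."""
    problems = []
    for facet in sorted(policy.required_facets):
        if not tags.get(facet):
            problems.append(f"missing required facet: {facet}")
    for facet, terms in tags.items():
        limit = policy.max_terms.get(facet)
        if limit is not None and len(terms) > limit:
            problems.append(f"too many terms in {facet}: {len(terms)} > {limit}")
    all_terms = {t for terms in tags.values() for t in terms}
    for child, parent in policy.require_parent.items():
        if child in all_terms and parent not in all_terms:
            problems.append(f"{child} requires parent term {parent}")
    return problems

# Hypothetical policy and tags, for illustration only.
policy = TaggingPolicy(
    required_facets={"Document Type", "Subject"},
    max_terms={"Subject": 3},
    require_parent={"Saxophonists": "Musicians"},
)
tags = {"Subject": ["Saxophonists", "Jazz"]}
violations = check_tags(tags, policy)
```

With these sample tags, `violations` reports the missing Document Type facet and the missing parent term, which is the kind of feedback a tagging form could show before a document is saved.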

It’s often not enough to just provide people with a policy document. Some degree of training on proper tagging can be very beneficial. In a current SharePoint taxonomy project, one of the users who tags uploaded documents said to me, “The problem is that we have not been trained. We are guessing.” Policy and guidelines should initially be delivered as a presentation (live or web meeting) to allow for questions and answers.

With large-volume tagging, the initial tagging should be reviewed and feedback should be provided. This is the case for both new and experienced indexers. Even experienced indexers need to become familiar with the content and learn the policies and guidelines that are particular to the organization and project. In a recent taxonomy project in which a professional indexer indexed hundreds of articles, even that indexer’s initial indexing was reviewed to make sure it was as thorough and accurate as required.

Finally, there needs to be a method of communication and feedback between those doing the tagging and the person (taxonomist) who is managing the taxonomy, which is a controlled vocabulary, after all. The taxonomist should inform those tagging of new terms and changed terms, especially if they are high-profile terms, and may also provide tips for tagging new and trending topics. Meanwhile, those doing tagging need a method to contact the taxonomist to request clarifications or the addition of new terms. This could be by email, but collaboration workspaces may also work well. While I, as a consultant, do not stay on as tagging continues, I like to be available at the start of tagging with a new taxonomy to answer indexing questions, something I did just this past month on my most recent consulting project.

Thursday, January 31, 2019

Indexes and Faceted Taxonomies

I recently completed a project of creating an index for a book. I had done quite a bit of freelance back-of-the-book indexing from 2005 to 2013 but had not indexed a book in over four years. Since I also do taxonomy work, whenever I do indexing, I draw comparisons between index creation and taxonomy creation. This time I drew some new comparisons.

It is back-of-the-book indexing, rather than the kind of indexing of content items that is done with a taxonomy, that has some similarities with taxonomy creation. That is because they both involve creating terms, naming them, coming up with variant names, and relating them to each other. I have written a detailed article, “Creating Indexes and Thesauri: Similarities and Differences,” published in the journal The Indexer.

During my most recent index project, I thought of comparisons not with thesauri, but with faceted taxonomies. Faceted taxonomies are an increasingly common form of taxonomy or controlled vocabulary. Different aspects/dimensions/refinements/filter types of a content item, and of a query to find it, are considered in creating a set of facets from which terms are used in combination. There can be a facet for each of such things as named persons, places, person types, events, activities, things, etc. The set of facets, ideally around 4-7, is customized to the set of content. Each facet may contain just a few terms or hundreds of terms.

An index, of course, is quite unlike a faceted taxonomy, because a single index includes all kinds of terms: named persons, places, person types, events, activities, things, etc. Some books, however, have separate Name and Subject indexes, so that could be like having two facets. Whether it’s a single index or a set of two, however, the user is only looking up one term at a time, unlike a faceted taxonomy, which allows the user to select multiple terms from multiple facets and combine them to limit the search results.
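The difference between looking up one term at a time and combining terms from several facets can be sketched in a few lines of Python. The content items, facet names, and the two helper functions below are hypothetical; the sketch assumes that selections within a facet are alternatives (OR) and that different facets narrow the results (AND), which is a common but not universal convention.

```python
# Hypothetical content items, each tagged with terms grouped by facet.
items = {
    "article-1": {"Person Type": {"Saxophonists"}, "Genre": {"Bebop"}, "Place": {"Chicago"}},
    "article-2": {"Person Type": {"Saxophonists"}, "Genre": {"Swing"}, "Place": {"Chicago"}},
    "article-3": {"Person Type": {"Producers"}, "Genre": {"Bebop"}, "Place": {"New York"}},
}

def index_lookup(items, term):
    """Index-style lookup: find everything listed under a single term."""
    return sorted(
        item_id
        for item_id, tags in items.items()
        if any(term in terms for terms in tags.values())
    )

def facet_filter(items, selections):
    """Faceted search: an item must match at least one selected term
    in every facet for which the user made a selection."""
    return sorted(
        item_id
        for item_id, tags in items.items()
        if all(tags.get(facet, set()) & wanted for facet, wanted in selections.items())
    )

single = index_lookup(items, "Saxophonists")
combined = facet_filter(items, {"Person Type": {"Saxophonists"}, "Genre": {"Bebop"}})
```

In this sample data, the single-term lookup returns two articles, while combining the Person Type and Genre selections narrows the results to one.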

What is significant is that a good index should include all the aspects/dimensions/types of terms. Thus, the intellectual activity of creating a good back-of-the-book index is similar to creating a good faceted taxonomy, because a full set of aspects needs to be considered and created.

The book I recently indexed was a biography of a jazz saxophonist. As I indexed, focusing on the content at the level of a paragraph or a couple of consecutive paragraphs, I found myself making sure I created index terms that covered the different aspects or term types. In this case they tended to be: named persons, named places, person types (different kinds of musicians, music producers, etc.), place types, activities, music groups, music genres, record label companies, names of songs or albums, and music-related topics.

Of course, it is rare that a single paragraph would have more than a couple of distinct index term concepts (not counting synonyms, what in indexes is called “double posts”); a full set of facets is not expected. Still, as I was indexing, after I selected an initial, obvious index term for the paragraph(s), I would pause to consider whether a different aspect, from among the potential facet-like categories listed above, could also apply as an index term. I felt that by being “facet aware,” I was able to create a very comprehensive index.

The resulting index is simply an alphabetical arrangement of terms, with the larger concepts further broken down with subentries. It does not appear faceted. However, all the potential facets are included. The variants or synonyms, as “double posts” in the index, help guide different users who think of different words for the same thing to find the text passage of the desired topic. Additionally, the terms of the different aspects, like facets, help guide different users in another way, by serving those who are thinking about different aspects of the book’s content and narrative.

Sunday, May 13, 2018

Creating Subject Terms for a Faceted Taxonomy

Faceted taxonomies—those that allow users to limit or filter search results by selecting terms or attributes from each of several types/aspects—are becoming increasingly common. They are easy and effective for end-users with various abilities in searching. When it comes to designing facets, some of the facets and their terms for a content collection may be obvious: Document or Content Type, Location, Audience, Purpose, etc. Creating a facet for Subject, for tagging topics the content is about, however, can be quite daunting.

Some faceted taxonomies do not have a Subject facet. Product taxonomies, such as for ecommerce, don’t have Subjects, but rather product categories. Enterprise taxonomies, such as those used in enterprise content or document management systems, also typically don’t have a Subject facet, but rather they have detailed terms in the Document Type facet and may have facets for Business Activity/Function, Department, Line of Business, or even something for Life Cycle/Phase/Stage.

The Subject facet is important, and can be quite large, for taxonomies for tagging and retrieving of content in a collection, library, or repository of published articles, research studies or reports, manuals, presentations, speeches, educational/training materials, images, videos, etc. If a large number of terms are needed to adequately cover the breadth and depth of the content, the Subject facet may comprise its own internal hierarchical taxonomy or thesaurus.

Coming up with the numerous Subject terms is more work and may require a different approach than for the terms of the other facets, which may be based on user needs and expectations. The terms in the Subject facet need to be based primarily on the subject of the content items being tagged. Other techniques for developing taxonomy terms, such as stakeholder interviews and search query logs, are helpful for other facets, but not so much for Subjects.

A taxonomy is built in a combination of top-down (identifying facets and top terms) and bottom-up (identifying the individual terms needed for indexing) tasks. The Subjects are developed a little bit top-down but more bottom-up. The top-down approach for Subjects starts with identifying the subject domain and scope and then any primary divisions in that domain, based on familiarity with the subject area and the content collection. The bottom-up approach involves looking at numerous individual content items to determine the main topics they are about and developing terms for these topics.

Determining what content items are about and what terms describe them is the activity of descriptive indexing. I prefer to use the word indexing rather than tagging here, especially in the absence of a taxonomy/controlled vocabulary, which has yet to be created, because it is an analytical task. (See my earlier blog post “Tagging vs. Indexing.”) So, at this stage it may help to have someone who has experience as an indexer do test-indexing of a rather large, representative sample of the content. Guidelines should be established at the start, for example, that each document is to be assigned index terms for the document as a whole, not for each section, and that a document should be assigned no more than three Subject index terms.

The terms will need to be individually reviewed so that similar terms can be considered for merging into a single concept (and alternative labels/synonyms might be created). To keep the number of terms more manageable for review, it’s best to review and edit the terms periodically, before completing the test-indexing of the entire sample set of content. Thus, developing the taxonomy of Subjects by means of test-indexing is an iterative process. You will probably see trends, patterns, and possibly subcategories emerge from the terms as you collect them. The initial terms that come out of test-indexing can be quite specific and then made broader later. It’s easy to edit specific terms into broader terms, while it is not possible to go the other direction without reviewing the content again. In some cases, you can identify the key terms from the title of a document or the caption/description of an image, but often for text you need to read headings/subheadings and skim the text, and for an image you need to look at the image itself.
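The periodic review step can be supported by a small script. The sketch below assumes that the reviewer has already decided which variant terms should merge into which preferred terms (the `merges` mapping); the function name and the sample terms are hypothetical. The code only applies those decisions, keeps the merged variants as candidate alternative labels, and counts usage so that frequent and rare terms stand out.

```python
from collections import Counter

def merge_terms(raw_terms, merges):
    """Collapse variant terms into preferred terms.

    raw_terms: terms assigned during test-indexing, one entry per use.
    merges: reviewer decisions, mapping a variant term to its preferred term.
    Returns (usage counts per preferred term, alternative labels per preferred term).
    """
    counts = Counter()
    alt_labels = {}
    for term in raw_terms:
        preferred = merges.get(term, term)
        counts[preferred] += 1
        if preferred != term:
            # Keep the variant as a candidate alternative label (synonym).
            alt_labels.setdefault(preferred, set()).add(term)
    return counts, alt_labels

# Hypothetical test-indexing output and reviewer decisions.
raw_terms = [
    "Sax players",
    "Saxophonists",
    "Record labels",
    "Saxophonists",
    "Recording companies",
]
merges = {
    "Sax players": "Saxophonists",
    "Recording companies": "Record labels",
}

counts, alt_labels = merge_terms(raw_terms, merges)
```

Running this on each batch of test-indexed content, rather than once at the end, keeps the list of terms to review short, in line with the iterative approach described above.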

Ideally, this test-indexing can be saved, so when actual indexing is done with the final taxonomy, the indexing work does not have to be repeated. But this is often not possible. So, test-indexing should not be too thorough or laborious. Before I became a taxonomist, I was an indexer, so I am quite efficient at this task, and I enjoy it.

Keep in mind that a taxonomy should continue to get updated even after it is implemented. This is especially the case for the Subjects, as new content will introduce new topics not yet included within the Subject terms. Thus, the test-indexing need not be completely comprehensive. It is understood that more Subject terms will be added as needed later. What is important is that new terms are added only in accordance with established policy.