Thursday, January 31, 2019

Indexes and Faceted Taxonomies

I recently completed a project of creating an index for a book. I had done quite a bit of freelance back-of-the-book indexing from 2005 to 2013 but had not indexed a book in over four years. Since I also do taxonomy work, whenever I do indexing, I draw comparisons between index creation and taxonomy creation. This time I drew some new comparisons.

It is back-of-the-book indexing, rather than the kind of indexing of content items that is done with a taxonomy, that has some similarities with taxonomy creation. That is because they both involve creating terms, naming them, coming up with variant names, and relating them to each other. I have written a detailed article, “Creating Indexes and Thesauri: Similarities and Differences,” published in the journal The Indexer.

During my most recent index project, I thought of comparisons not with thesauri, but with faceted taxonomies. Faceted taxonomies are an increasingly common form of taxonomy or controlled vocabulary. Different aspects/dimensions/refinements/filter types of a content item, and of a query to find it, are considered in creating a set of facets from which terms are used in combination. There can be a facet for each of such things as named persons, places, person types, events, activities, things, etc. The set of facets, ideally around 4-7, is customized to the set of content. Each facet may contain just a few or hundreds of terms.

An index, of course, is quite unlike a faceted taxonomy, because a single index includes all kinds of terms: named persons, places, person types, events, activities, things, etc. Some books, however, have separate Name and Subject indexes, so that could be like having two facets. Whether it’s a single index or a set of two, however, the user is only looking up one term at a time, unlike a faceted taxonomy, which allows the user to select multiple terms from multiple facets and combine them to limit the search results.

What is significant is that a good index should include all the aspects/dimensions/types of terms. Thus, the intellectual activity of creating a good back-of-the-book index is similar to creating a good faceted taxonomy, because a full set of aspects needs to be considered and created.

The book I recently indexed was a biography of a jazz saxophonist. As I indexed, focusing on the content at the level of a paragraph or a couple of consecutive paragraphs, I found myself making sure I created index terms that covered the different aspects or term types. In this case they tended to be: named persons, named places, person types (different kinds of musicians, music producers, etc.), place types, activities, music groups, music genres, record label companies, names of songs or albums, and music-related topics.

Of course, it is rare that a single paragraph would have more than a couple of distinct index term concepts (not counting synonyms, what in indexes is called “double posts”); a full set of facets is not expected. Rather, though, as I was indexing, after I selected an initial, obvious index term for the paragraph(s), I would then pause to think if there was a different aspect that could also apply as an index term from among potential facet-like categories, as listed above. I felt that being “facet aware” I was able to create a very comprehensive index.

The resulting index is simply an alphabetical arrangement of terms, with  the larger concepts further broken down with subentries. It does not appear faceted. However, all the potential facets are included.  The variants or synonyms, as “double posts” in the index, help guide different users who think of different words for the same thing to find the text passage of the desired topic. Additionally, the terms of the different aspects, like facets, help guide different users in another way, by serving those who are thinking about different aspects of the book’s content and narrative.

Tuesday, December 4, 2018

Taxonomy Licensing

As a taxonomist who designs and creates taxonomies, I have always advocated creating a customized taxonomy for each implementation, which takes into consideration the particular set of content and type of users. Nevertheless, there are situations when licensing a taxonomy (or any kind of controlled vocabulary) created by a third party may be desirable, such as for the start of a taxonomy that is then modified, for a single facet of a faceted taxonomy, or for tagging multi-source research content.

Taking an existing taxonomy created by a third party, without modification, can have several problems. Its scope may be narrower than needed, or it might not be as detailed, so needed concepts would be missing. Its scope may be broader than needed, or it may be more detailed than needed, so it’s cumbersome and not user friendly, and indexing with it would be inconsistent. Its language style might not suit the new users, so users cannot find what they are looking for. Its terms, and even their alternative labels (synonyms), may not match the language of the content, so content may not get indexed properly. Finally, it might not even have the desired structure, such as the difference between a thesaurus and a hierarchical taxonomy.

Taxonomy Licensing Uses

Licensing a taxonomy can be done as a starting point, whereby the taxonomy can then be sufficiently modified for its new use. Modifications include removing concepts out of scope and not needed, adding missing concepts and their relationships, creating additional alternative labels to existing or new concepts, and changing the wording of selected preferred labels to conform with the preference of the users. If only a fraction of concepts need changing, and it’s more a matter of adding new concepts, then licensing can be a good way to get a taxonomy up and running more quickly than starting from scratch.

Licensing a controlled vocabulary to serve for just one or two facets or metadata properties of a larger taxonomy set may also be a practical option. A faceted taxonomy enables users to filter or limit search results by a combination of concepts selected from multiple facets/filters. For example, for images these could be: geographic place, location type, occasion, person type, time of year, activity, and object. It might be desirable to license a vocabulary for geographic place or person type and create the other vocabularies. Other examples of a single-facet taxonomy that might be of interest for licensing include product types and industries. A facet may contain a hierarchical structure or a flat list.
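To make the idea of combining concepts from multiple facets more concrete, here is a minimal Python sketch of faceted filtering. The `Image` class, the `filter_by_facets` function, and the sample images and terms are all hypothetical, made up for illustration, and a real search application would do this in its search engine rather than in a loop. The sketch assumes the common convention that terms selected within one facet are alternatives (OR), while the different facets all have to match (AND).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Image:
    """A content item tagged with terms from several facets (hypothetical)."""
    title: str
    # Maps a facet name to the set of terms assigned from that facet.
    tags: dict[str, set[str]] = field(default_factory=dict)


def filter_by_facets(items, selections):
    """Return the items that match every facet the user selected from.

    Within one facet, any selected term may match (OR);
    across facets, every selected facet must match (AND).
    """
    results = []
    for item in items:
        if all(item.tags.get(facet, set()) & terms
               for facet, terms in selections.items()):
            results.append(item)
    return results


# Made-up sample data, for illustration only.
images = [
    Image("Beach wedding", {
        "geographic place": {"Hawaii"},
        "location type": {"Beach"},
        "occasion": {"Wedding"},
    }),
    Image("City birthday party", {
        "geographic place": {"Boston"},
        "location type": {"Restaurant"},
        "occasion": {"Birthday"},
    }),
    Image("Beach picnic", {
        "geographic place": {"Cape Cod"},
        "location type": {"Beach"},
        "activity": {"Picnicking"},
    }),
]

# The user picks "Beach" in one facet and two occasions in another.
selected = {"location type": {"Beach"}, "occasion": {"Wedding", "Birthday"}}
for image in filter_by_facets(images, selected):
    print(image.title)
```

Because each facet is just a named set of terms, a licensed vocabulary could supply the terms for one facet (such as geographic place) while the other facets are created in-house.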

Licensing a taxonomy as is, with little or no modification, is sometimes appropriate if the original purpose and the new purpose are the same and the type of user is the same. This would not be the case for internally created content, but if the content comes from multiple external sources, such as published articles, and the users are conducting external research, then a third-party created taxonomy in the desired discipline or industry might be appropriate. Fields such as medicine, pharmaceuticals, engineering, and the sciences in general may be suitable for licensing a taxonomy with little modification.

Taxonomy Licensing Issues

The licensed taxonomy not only needs to be in the appropriate subject area but needs to have been initially created for a similar audience and purpose, which can be determined by contacting the original creator/publisher of the taxonomy. For example, a subject area of “finance” will have somewhat different concepts depending on whether it was created for academic/research use or for internal enterprise content management use.

The licensed controlled vocabulary should be of the desired type: classification system, taxonomy, thesaurus, ontology, etc. This is not always obvious, since the distinctions between taxonomies, thesauri, and ontologies can be blurred, and the term “taxonomy” is sometimes used for many different kinds. So, it’s important to ask the taxonomy publisher specific questions, such as how many top terms there are, what kinds of relationships there are between concepts, and whether there are classes or categories assigned to concepts.

If modification is going to be done, which is often the case, the license needs to permit modification. An open source and free taxonomy may restrict modification and require attribution to the source of the unaltered taxonomy. An open source and free taxonomy usually prohibits commercial reuse as well. A paid license, on the other hand, typically permits modification, the use of the terms to create a new taxonomy (as a “derivative work”), and commercial use.

A taxonomy that is available for license typically comes in a standard interchange format, such as CSV, XML, RDF, SKOS, etc., so it can be imported into taxonomy/thesaurus/ontology management software, where it can be further modified. An understanding of the formats is needed to select the most desirable one, when multiple formats are supported.
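As a rough illustration of what such an import involves, the following Python sketch reads a small taxonomy from CSV. The column names (`preferred_label`, `alt_labels`, `broader`), the pipe-separated alternative labels, and the sample terms are assumptions made up for this example. A real licensed file will have its own layout, and taxonomy management software normally handles the import itself.

```python
import csv
import io

# Made-up sample file; the column layout is an assumption, not a standard.
SAMPLE_CSV = """preferred_label,alt_labels,broader
Musicians,,
Saxophonists,Sax players|Saxophone players,Musicians
Record labels,Record companies,
"""


def load_concepts(csv_text):
    """Parse CSV text into a dict keyed by preferred label."""
    concepts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row["preferred_label"].strip()
        concepts[label] = {
            # Alternative labels (synonyms) are pipe-separated in this sketch.
            "alt_labels": [a.strip() for a in row["alt_labels"].split("|") if a.strip()],
            # An empty "broader" cell marks a top term.
            "broader": row["broader"].strip() or None,
        }
    return concepts


def narrower_terms(concepts, label):
    """List the concepts whose broader term is the given label."""
    return sorted(name for name, c in concepts.items() if c["broader"] == label)


concepts = load_concepts(SAMPLE_CSV)
print(narrower_terms(concepts, "Musicians"))
print(concepts["Saxophonists"]["alt_labels"])
```

A flat CSV like this can only carry simple hierarchy and synonyms, which is one reason to understand the formats before choosing: richer relationships are generally better served by SKOS or another RDF-based format.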

Taxonomy Licensing Sources

Finding the right taxonomy is important. A good source of taxonomies and other vocabularies for license is Taxonomy Warehouse, where you can search or browse for taxonomies by subject. Taxonomy Warehouse contains over 760 vocabularies of all kinds in all subject areas in various formats from 330 organizations. It’s the largest listing of proprietary vocabularies available for commercial-use licenses.

There is also a larger, more international resource, developed and maintained by the University of Basel Library, the Basel Register of Thesauri, Ontologies & Classifications (BARTOC). Because it is a “register,” not all of the 2,878 indexed vocabularies are available for license. Each vocabulary is classified and assigned metadata for subject, category, vocabulary type, file format, language, and license type, among other classifications. It’s quite comprehensive for open source/free vocabularies. Its coverage of commercially licensed vocabularies is not yet as inclusive, but it’s growing.

Some major information publishers who have developed extensive thesauri or taxonomies to index their published content do offer the vocabularies for license, but they do not promote it, so this is little known, and they reserve the right not to license vocabularies to a party considered a competitor. Examples include the Gale Subject Thesaurus and the Associated Press’ News Taxonomy.

Taxonomy Licensing Trends: A Survey

So, to what extent do organizations seek to license a taxonomy as part of their knowledge or content management strategy? That’s a good question. Thus, I have created a short multiple-choice questionnaire, the results of which will be posted in a future blog post and may perhaps become a conference presentation topic as well. Please take a few minutes (estimated 4 minutes) to fill out my short Taxonomy Licensing Interest Survey.

Tuesday, November 13, 2018

Taxonomy Boot Camp, 2018: AI and Taxonomies

Artificial intelligence (AI) is not new, but it is becoming more ubiquitous, and its applications are growing within other specializations in information management, knowledge management, and content management, including taxonomies. Hence the theme for this year’s Taxonomy Boot Camp conference (November 5-6, 2018, Washington DC) was “Bridging Human Thinking and Machine Learning.”

This was the 14th Taxonomy Boot Camp conference and its 9th year in Washington, DC, which (along with the newer Taxonomy Boot Camp London) is the only conference dedicated to taxonomies. As usual, it was held along with several other co-located conferences of Information Today Inc., which overlap or are consecutive. The format, as in past years, involved an opening keynote, after which the conference breaks into two tracks of sessions the first day, one more basic and one more advanced, then on the second day a joint keynote with the KMWorld conference, and a single track for the rest of the second day. By a show of hands, it appeared that 75% of the Taxonomy Boot Camp attendees were first-timers, even more than before. There were 235 attendees, including speakers and sponsors.

While the conference has two tracks the first day, a more basic and a more advanced track, presentations on machine learning and AI were in both tracks. These included “Taxonomy & Machine Learning at the Knot,” “Sandwiches, Categories, Ethics & Machine Learning,” “Taxonomy Skills in the World of AI” (a panel), “Semantic AI: Fusing Machine Learning with Knowledge Graphs,” “Semantic Search Enrichment,” “Taxonomies and AI Chat Boxes,” “Taxonomy in the Age of Amazon Echo,” and “Applying Taxonomy Skills to Cognitive Computing” (a project involving IBM Watson data privacy research product of Thomson Reuters).

In “Semantic AI: Fusing Machine Learning with Knowledge Graphs,” presenter Andreas Blumauer of the Semantic Web Company said that increasingly companies are adopting knowledge graphs as their IT infrastructure, and leading players are trying to fuse knowledge graphs with machine learning. A knowledge graph has to be stored in a graph database. There are two types of graph database models: property graphs and RDF graphs. RDF graphs are more important for knowledge graphs.

Semantic AI core principles include the following.
      It’s about things not strings.
      It’s more than metadata: it describes the meaning of metadata as an additional, semantic layer.
      The knowledge graph establishes the semantic layer.
      Knowledge graphs can be seen as an input for machine learning.
      AI isn’t always good at understanding questions so a taxonomy/ontology is needed to support it.
      AI should be built upon data quality, data as a service, no black box, a hybrid approach, as structured data meeting text, aiming towards self-optimizing machines (a vision, as we are not there yet).

Use cases of knowledge graphs include a recommendation engine. A knowledge graph is the basis behind the recommendation engine providing content, taking into consideration users.

In “Taxonomy & Machine Learning at the Knot,” the presenters, from the web media company the XO Group, started with a good introduction to machine learning. They began by explaining the problems it can solve (predicting behavior, automating tedious steps, and classifying) and that there are two types: supervised and unsupervised. Common applications include clustering, recommendations, and classification, and each of these can involve taxonomies. Specific implementation examples were provided.

As with last year, there was also a lot of talk of auto-categorization (automated or machine-aided indexing) across various sessions. Three were dedicated to the subject: “Driving Discovery: Combining Taxonomy & Textual AI at Sage” (a case study using Expert System auto-categorization), “Testing for Auto-tagging Success,” and “Classification Relevance at Associated Press.” AP has an automated rules-based classification system for Subjects, Geography, and Organizations. Rules-based auto-classification was chosen over the statistical method because it offers transparency and control; breaking news and low-frequency terms can be dealt with (no existing training set is needed); terms can be scoped and disambiguated better, such as incident-type terms (Violent crime) vs. issue terms (Domestic violence); and semantic rules ensure that a term is not assigned for just a passing mention. Entity extraction with disambiguation rules is used for person names and publicly traded companies.
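For readers unfamiliar with the rules-based approach, here is a very small Python sketch of the general idea. It is not AP’s system, whose rules were not shown in detail. The `Rule` class, the phrase lists, and the threshold are all assumptions made up for illustration. Each taxonomy term gets a rule listing phrases that count as evidence, phrases that rule the term out (a crude form of disambiguation), and a minimum number of hits so that a passing mention does not trigger the term.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Rule:
    """A hand-written rule for assigning one taxonomy term (hypothetical)."""
    term: str
    any_of: tuple[str, ...]        # phrases that count as evidence
    none_of: tuple[str, ...] = ()  # phrases that rule the term out
    min_hits: int = 2              # below this, treat as a passing mention


def count_phrase(text, phrase):
    """Count whole-word, case-insensitive occurrences of a phrase."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def classify(text, rules):
    """Return the terms whose rules are satisfied by the text."""
    assigned = []
    for rule in rules:
        if any(count_phrase(text, p) for p in rule.none_of):
            continue  # an excluding phrase is present, so skip this term
        hits = sum(count_phrase(text, p) for p in rule.any_of)
        if hits >= rule.min_hits:
            assigned.append(rule.term)
    return assigned


# Made-up rules for the two example terms mentioned above.
RULES = [
    Rule("Violent crime",
         any_of=("shooting", "assault", "armed robbery"),
         none_of=("domestic violence",)),
    Rule("Domestic violence",
         any_of=("domestic violence", "domestic abuse")),
]

story = "Police reported a shooting downtown. A second shooting followed an armed robbery."
print(classify(story, RULES))
```

Even in this toy form, the appeal described in the session is visible: anyone can read a rule to see why a term was or was not assigned, and a rule for a new or rare term can be written without waiting for training examples.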

Knowledge graphs are getting more attention both here and at Taxonomy Boot Camp London. This was, of course, the main topic of Andreas Blumauer’s talk “Semantic AI: Fusing Machine Learning with Knowledge Graphs,” and Mike Doane, in the introduction of his talk on “Taxonomy in the Age of Amazon Echo,” said that the information industry analysis firm Gartner reports that knowledge graphs are on the rise and are discussed more than taxonomies. Gartner is tracking knowledge graphs instead of taxonomies and ontologies.

While the opening keynote did not focus on AI or machine learning, it was a presentation by a computational linguist, Deborah McGuinness, a professor of Computer, Cognitive, and Web Sciences at Rensselaer Polytechnic Institute. Among other things, she spoke of the data life cycle, whereby a computer-understandable specification of meaning (semantics) supports enhanced lifespan and impact of data. She went on to include specific ontology case examples.

Nearly all session slides are available to download, except the keynotes, without any login credentials at:

Tuesday, October 30, 2018

Taxonomy Boot Camp London, 2018

This October, for the third year in a row, I have enjoyed the opportunity to attend and present at Taxonomy Boot Camp London (TBCL).

TBCL is similar in subject area scope to its parent conference Taxonomy Boot Camp (TBC), usually held in Washington, DC, in November, but it has unique presentations, so I find it worth my time to attend both conferences. Despite what might be considered a niche topic for a select audience, TBCL remains a strong conference with consistent attendance (about 170 participants), comparable to TBC in its earlier years. The size is large enough to offer a choice of two tracks but small enough to easily network with others. The conference speakers and attendees are quite international, representing 22 countries this year.

Conference Format

TBCL continues to differ from TBC by having two tracks on both days, instead of just on the first day as TBC does. It also has a pre-conference workshop day, which TBC lacks, with a full-day Taxonomy Fundamentals workshop (which I lead) and two half-day workshops on more specialized or advanced taxonomy topics, which are not the same each year. This year the half-day workshops were on text analytics and taxonomies in SharePoint.

For the first time, Taxonomy Boot Camp London presented two awards (which Taxonomy Boot Camp in Washington, DC, does not do). The winner of the Taxonomy Practitioner of the Year award was Tom Alexander, Taxonomy Manager, Cancer Research UK. The winner of the Taxonomy Success of the Year award was SAGE Research Methods Thesaurus, led by Alan Maloney & Martha Sedgwick, SAGE Publishing.


The exhibit/sponsor showcase is very different at TBCL from TBC. TBC has a small dedicated exhibit on its first day, but then shares the much larger KM World exhibit with the four other co-located conferences. TBCL’s exhibit space is similar to that of TBC’s first day, with just three software vendor sponsor-exhibitors (Synaptica, Access Innovations, and Semantic Web Company/PoolParty). However, there was a larger number of organizational supporter-exhibitors: Association for Independent Information Professionals, the Information Retrieval Specialist Group of the British Computer Society, the Danish Union of Librarians, the Knowledge & Information Management Special Interest Group of CILIP (Chartered Institute of Library and Information Professionals) of the UK, the Information and Records Management Society of the UK, the UK Chapter of the International Society for Knowledge Organization (ISKO), the Network for Information & Knowledge Exchange of the UK, the SLA (Special Libraries Association) Europe chapter, and the SLA Taxonomy Division. This was a greater number of organizations than last year. The significant involvement of professional associations in TBCL contrasts with the relative lack of professional associations involved in TBC.

TBCL continues to be co-located with another Information Today conference, Internet Librarian International, but their exhibit areas are somewhat separate (although attendees of both conferences can visit booths of either conference), since their audiences and markets are different. Other than the drinks reception the first day, the two conferences do not share anything, such as keynotes.


There were three keynote presentations, two consecutive contrasting keynotes the first day and one the second day.  

The opening keynote was indeed a keynote-style talk, which was on the broader subject of information on the web, rather than on the specifics of taxonomies. “This is the Bad Place: 13 Rules for Designing Better Information Environments” was presented by Paul Rissen, Product Manager at Springer Nature UK and previously at BBC. In his thought-provoking presentation he aimed at establishing “ground rules” for using the web (especially social media) and for public discourse in general.

This was followed by a more down-to-earth, state-of-the-profession talk by Dave Clarke, CEO of Synaptica, titled “Catching the Wave: What Tools do Taxonomists Need to do Their Job.” Although Synaptica was the lead sponsor of the conference, this was not a promotional talk. Dave started out by summarizing what taxonomists do and enable, namely to organize, categorize, and discover, and explained the different tools for each. More of Dave’s presentation was about what taxonomists are doing, based on the results of a survey of taxonomists he has been conducting. Then Dave turned to what he considered to be the future trends and issues. Artificial intelligence (AI) is relevant to what we do, but it will not replace the need for human-curated taxonomies or ontologies. Rather, taxonomies and ontologies will empower AI with the semantics and logic to improve search and categorization and perform machine learning. Ontologies and linked data can help build smarter search and discovery applications by leveraging the logical dependencies. Linked open data is shared openly, and linked enterprise data is behind the firewall where the linked data model also works well.

The second day’s keynote addressed an important topic. “Selling the Benefits of Taxonomy: Numbers and Stories” was presented by taxonomy and text analytics consultant Tom Reamy. Tom’s argument was that return-on-investment (ROI) studies, with their numerical data on time spent, are not sufficient to convince decision-makers of the benefits of taxonomies, and that use case stories and internal advocacy are also needed. Stories can describe the increased richness of knowledge discovery, better decisions, and analysis of complex issues. He also suggested selling the vision of a taxonomy by means of a mini demo. Tom then turned to text analytics as the important means to make taxonomies usable, as he is rather dismissive of manual indexing. He explained that text analytics is often called auto-categorization, because that was the first use of it, but that text analytics can be used for other things, too.

Conference Sessions

The more basic track had sessions on taxonomy development, user validation, taxonomy resources, taxonomy development approaches, information architecture, enterprise information management, tagging, and taxonomy standards and architecture. I attended mostly sessions of the more advanced track, though.

A theme of the conference, as stated in the program, was “Making taxonomies go further,” and conference chair Helen Lippell stated in her welcome the opportunity to “push your practice further.” This was especially true of several of the advanced track sessions I attended. “Using Ontologies for more than Information Categorization,” presented by Ahren Lehnert and Jim Sweeney of Synaptica, suggested using ontologies for project and product management and in support of various other business functions in sales, marketing, partner and competitor information management, etc. “Beyond Taxonomy Classification: Using Knowledge Models and Linked Data to Unlock New Business Models” was presented by Ben Miller of Wiley. He spoke of knowledge models as comprising content acquisition and content enrichment. Jim Sweeney also presented “Taking Your Show on the Road: Publishing Taxonomies and Ontologies as Linked Data,” which was a good introduction to Linked Data. In this presentation, he also introduced graph databases and their benefits. While not explicitly discussing taxonomies, Rahel Anne Bailie’s talk, “Introduction to Information 4.0,” suggested another application for taxonomies, in which content is in “molecules and objects,” rather than in documents or based on pre-determined topics. Multilingual taxonomies and taxonomy implementation in SharePoint were the topics of other presentations.

I am looking forward to Taxonomy Boot Camp in Washington, DC, next week, and to Taxonomy Boot Camp London again next year, which has been scheduled for the same venue on October 15-16, 2019, with preconference workshops on October 14.