Tuesday, December 4, 2018

Taxonomy Licensing


As a taxonomist who designs and creates taxonomies, I have always advocated creating a customized taxonomy for each implementation, which takes into consideration the particular set of content and type of users. Nevertheless, there are situations when licensing a taxonomy (or any kind of controlled vocabulary) created by a third party may be desirable, such as for the start of a taxonomy that is then modified, for a single facet of a faceted taxonomy, or for tagging multi-source research content.
Taking an existing taxonomy created by a third party, without modification, can cause several problems. Its scope may be narrower than needed, or it might not be detailed enough, so needed concepts would be missing. Its scope may be broader than needed, or it may be more detailed than needed, so it’s cumbersome and not user friendly, and indexing with it would be inconsistent. Its language style might not suit the new users, so users cannot find what they are looking for. Its terms, and even their alternative labels (synonyms), may not match the language of the content, so content may not get indexed properly. Finally, it might not even have the desired structure, given the difference between a thesaurus and a hierarchical taxonomy.

Taxonomy Licensing Uses


Licensing a taxonomy can be done as a starting point, whereby the taxonomy can then be sufficiently modified for its new use. Modifications include removing concepts out of scope and not needed, adding missing concepts and their relationships, creating additional alternative labels to existing or new concepts, and changing the wording of selected preferred labels to conform with the preference of the users. If only a fraction of concepts need changing, and it’s more a matter of adding new concepts, then licensing can be a good way to get a taxonomy up and running more quickly than starting from scratch.

Licensing a controlled vocabulary to serve for just one or two facets or metadata properties of a larger taxonomy set may also be a practical option. A faceted taxonomy enables users to filter or limit search results by a combination of concepts selected from multiple facets/filters. For example, for images these could be: geographic place, location type, occasion, person type, time of year, activity, and object. It might be desirable to license a vocabulary for geographic place or person type and create the other vocabularies. Other examples of a single-facet taxonomy that might be of interest for licensing include product types and industries. A facet may contain a hierarchical structure or a flat list.
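
To illustrate how facets combine, here is a minimal sketch in Python. The Image class, the filter_by_facets function, and the sample images and concepts are all hypothetical, made up for this example. It assumes the common convention that concepts selected within one facet are alternatives (OR), while selections in different facets must all be satisfied (AND).

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    """A content item tagged with concepts from several facets."""
    title: str
    # Maps a facet name to the set of concepts assigned from that facet.
    tags: dict[str, set[str]] = field(default_factory=dict)


def filter_by_facets(items, selections):
    """Return the items that match every selected facet.

    Within a facet, any selected concept may match (OR);
    across facets, every facet must match (AND).
    """
    results = []
    for item in items:
        if all(item.tags.get(facet, set()) & wanted
               for facet, wanted in selections.items()):
            results.append(item)
    return results


# Hypothetical sample content, tagged with three of the facets named above.
images = [
    Image("Beach wedding", {"geographic place": {"Hawaii"},
                            "occasion": {"Wedding"},
                            "location type": {"Beach"}}),
    Image("City marathon", {"geographic place": {"Boston"},
                            "activity": {"Running"},
                            "location type": {"Street"}}),
    Image("Beach volleyball", {"geographic place": {"Hawaii"},
                               "activity": {"Volleyball"},
                               "location type": {"Beach"}}),
]

selected = {"geographic place": {"Hawaii"}, "location type": {"Beach"}}
matches = [img.title for img in filter_by_facets(images, selected)]
print(matches)
```

In this arrangement, the geographic place facet could be filled from a licensed vocabulary while the other facets are created in-house, since each facet is just a separate set of concepts.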

Licensing a taxonomy as is, with little or no modification, is sometimes appropriate if the original purpose and the new purpose are the same and the type of user is the same. This would not be the case for internally created content, but if the content comes from multiple external sources, such as published articles, and the users are conducting external research, then a third-party created taxonomy in the desired discipline or industry might be appropriate. Fields such as medicine, pharmaceuticals, engineering, and the sciences in general may be suitable for licensing a taxonomy with little modification.

Taxonomy Licensing Issues


The licensed taxonomy not only needs to be in the appropriate subject area but needs to have been initially created for a similar audience and purpose, which can be determined by contacting the original creator/publisher of the taxonomy. For example, a subject area of “finance” will have somewhat different concepts depending on whether it was created for academic/research use or for internal enterprise content management use.

The licensed controlled vocabulary should be of the desired type: classification system, taxonomy, thesaurus, ontology, etc. This is not always obvious, since the distinctions between taxonomies, thesauri, and ontologies can be blurred, and the term “taxonomy” is sometimes used for many different kinds. So, it’s important to ask the taxonomy publisher specific questions, such as how many top terms there are, what kinds of relationships there are between concepts, and whether there are classes or categories assigned to concepts.

If modification is going to be done, which is often the case, the license needs to permit modification. An open source and free taxonomy may restrict modification and require attribution to the source of the unaltered taxonomy. An open source and free taxonomy usually prohibits commercial reuse as well. A paid license, on the other hand, typically permits modification, the use of the terms to create a new taxonomy (as a “derivative work”), and commercial use.

A taxonomy that is available for license typically comes in a standard interchange format, such as CSV, XML, RDF, SKOS, etc., so it can be imported into taxonomy/thesaurus/ontology management software, where it can be further modified. An understanding of the formats is needed to select the most desirable one, when multiple formats are supported.
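
As a rough idea of what a CSV delivery might involve, the following Python sketch reads a small taxonomy file and rebuilds its hierarchy. The column names (id, pref_label, alt_labels, broader_id), the "|" separator for synonyms, and the sample concepts are assumptions for illustration only. Each publisher defines its own layout, which is one reason to ask about formats before licensing.

```python
import csv
import io

# Hypothetical CSV export: one row per concept, with the broader
# (parent) concept given in its own column and synonyms separated by "|".
SAMPLE_CSV = """id,pref_label,alt_labels,broader_id
1,Vehicles,,
2,Cars,Automobiles|Autos,1
3,Electric cars,EVs,2
4,Bicycles,Bikes,1
"""


def load_taxonomy(text):
    """Parse the CSV text into a dict of concepts keyed by id."""
    concepts = {}
    for row in csv.DictReader(io.StringIO(text)):
        concepts[row["id"]] = {
            "label": row["pref_label"],
            "alt_labels": [a for a in row["alt_labels"].split("|") if a],
            "broader": row["broader_id"] or None,  # None marks a top term
        }
    return concepts


def outline(concepts, parent=None, depth=0, out=None):
    """Return an indented outline of the hierarchy, top terms first."""
    out = [] if out is None else out
    for concept_id, concept in concepts.items():
        if concept["broader"] == parent:
            out.append("  " * depth + concept["label"])
            outline(concepts, concept_id, depth + 1, out)
    return out


taxonomy = load_taxonomy(SAMPLE_CSV)
print("\n".join(outline(taxonomy)))
```

A flat CSV like this carries only what its columns can hold, whereas an RDF/SKOS delivery can express more relationship types directly.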

Taxonomy Licensing Sources


Finding the right taxonomy is important. A good source of taxonomies and other vocabularies for license is Taxonomy Warehouse, where you can search or browse for taxonomies by subject. Taxonomy Warehouse contains over 760 vocabularies of all kinds in all subject areas in various formats from 330 organizations. It’s the largest listing available of proprietary vocabularies available for commercial-use licenses.

There is also a larger, more international resource, developed and maintained by the University of Basel Library, the Basel Register of Thesauri, Ontologies & Classifications (BARTOC). As it is a “register,” not all of the 2,878 indexed vocabularies are available for license. Each vocabulary is classified and assigned metadata for subject, category, vocabulary type, file format, language, and license type, among other classifications. It’s quite comprehensive for open source/free vocabularies. It includes some commercially licensed vocabularies, although its coverage of these is not yet as complete, but it’s growing.

Some major information publishers who have developed extensive thesauri or taxonomies to index their published content do offer the vocabularies for license, but they do not promote it, so this is little known, and they reserve the right not to license vocabularies to a party considered a competitor. Examples include the Gale Subject Thesaurus and the Associated Press’ News Taxonomy.

Taxonomy Licensing Trends: A Survey


So, to what extent do organizations seek to license a taxonomy as part of their knowledge or content management strategy? That’s a good question. Thus, I have created a short multiple-choice questionnaire, the results of which will be posted in a future blog post and may perhaps become a conference presentation topic as well. Please take a few minutes (estimated 4 minutes) to fill out my short Taxonomy Licensing Interest Survey.

Tuesday, November 13, 2018

Taxonomy Boot Camp, 2018: AI and Taxonomies


Artificial intelligence (AI) is not new, but it is becoming more ubiquitous, and its applications are growing within other specializations in information management, knowledge management, and content management, including taxonomies. Hence the theme for this year’s Taxonomy Boot Camp conference (November 5-6, 2018, Washington DC) was “Bridging Human Thinking and Machine Learning.”

This was the 14th Taxonomy Boot Camp conference and its 9th year in Washington, DC, which (along with the newer Taxonomy Boot Camp London) is the only conference dedicated to taxonomies. As usual, it was held along with several other co-located conferences of Information Today Inc., which overlap or are consecutive. The format, as in past years, involved an opening keynote, after which the conference broke into two tracks of sessions the first day, one more basic and one more advanced, then on the second day a joint keynote with the KMWorld conference, and a single track for the rest of the second day. By a show of hands, it appeared that 75% of the Taxonomy Boot Camp attendees were first-timers, even more than before. There were 235 attendees, including speakers and sponsors.

While the conference has two tracks the first day, a more basic and a more advanced track, presentations on machine learning and AI were in both tracks. These included “Taxonomy & Machine Learning at the Knot,” “Sandwiches, Categories, Ethics & Machine Learning,” “Taxonomy Skills in the World of AI” (a panel), “Semantic AI: Fusing Machine Learning with Knowledge Graphs,” “Semantic Search Enrichment,” “Taxonomies and AI Chat Boxes,” “Taxonomy in the Age of Amazon Echo,” and “Applying Taxonomy Skills to Cognitive Computing” (a project involving an IBM Watson data privacy research product of Thomson Reuters).
In “Semantic AI: Fusing Machine Learning with Knowledge Graphs,” presenter Andreas Blumauer of the Semantic Web Company said that increasingly companies are adopting knowledge graphs as their IT infrastructure, and leading players are trying to fuse knowledge graphs with machine learning. A knowledge graph has to be stored in a graph database. There are two types of graph database models: property graphs and RDF graphs. RDF graphs are more important for knowledge graphs.

Semantic AI core principles include the following.
      It’s about things not strings.
      It’s more than metadata: it describes the meaning of metadata as an additional, semantic layer.
      The knowledge graph establishes the semantic layer.
      Knowledge graphs can be seen as an input for machine learning.
      AI isn’t always good at understanding questions so a taxonomy/ontology is needed to support it.
      AI should be built upon data quality, data as a service, no black box, a hybrid approach, as structured data meeting text, aiming towards self optimizing machines (a vision, as we are not there yet).

Use cases of knowledge graphs include a recommendation engine. A knowledge graph is the basis behind the recommendation engine providing content, taking into consideration users.
In “Taxonomy & Machine Learning at the Knot,” the presenters, from the web media company XO Group, began with a good introduction to machine learning. They explained the problems it can solve (predicting behavior, automating tedious steps, and classifying) and its two types, supervised and unsupervised. Common applications include clustering, recommendations, and classification, and each of these can involve taxonomies. Specific implementation examples were provided.

As with last year, there was also a lot of talk of auto-categorization (automated or machine-aided indexing) across various sessions. Three were dedicated to the subject: “Driving Discovery: Combining Taxonomy & Textual AI at Sage” (a case study using Expert System auto-categorization), “Testing for Auto-tagging Success,” and “Classification Relevance at Associated Press.” AP has an automated rules-based classification system for Subjects, Geography, and Organizations. Rules-based auto-classification was chosen over the statistical method because it offers transparency and control; breaking news and low-frequency terms can be dealt with (no existing training set is needed); you can better scope and disambiguate between terms, such as incident-type terms (Violent crime) vs. issue terms (Domestic violence); and semantic rules ensure that there is not just a passing mention. Entity extraction with disambiguation rules is used for person names and publicly traded companies.
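
To make the idea of rules-based classification more concrete, here is a toy sketch in Python. It is not AP's system, whose rules were not shown in detail. The Rule class, the phrases, and the threshold are all hypothetical. It shows the three properties mentioned above: the rules are readable (transparency), a veto phrase helps disambiguate, and a minimum number of hits filters out a passing mention.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """A hand-written rule for assigning one taxonomy term."""
    term: str
    any_of: tuple          # phrases that count as evidence for the term
    none_of: tuple = ()    # phrases that veto the term (disambiguation)
    min_hits: int = 2      # require more than a passing mention


def count_phrase(phrase, text):
    """Count whole-word, case-insensitive occurrences of a phrase."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def classify(text, rules):
    """Return the terms whose rules fire on the text."""
    assigned = []
    for rule in rules:
        # Any veto phrase blocks the term, whatever the evidence.
        if any(count_phrase(p, text) for p in rule.none_of):
            continue
        hits = sum(count_phrase(p, text) for p in rule.any_of)
        if hits >= rule.min_hits:
            assigned.append(rule.term)
    return assigned


# Hypothetical rules for an incident-type term and an issue term.
rules = [
    Rule("Violent crime",
         any_of=("shooting", "assault", "stabbing"),
         none_of=("film crew",)),
    Rule("Domestic violence",
         any_of=("domestic violence", "domestic abuse")),
]

incident = ("Police said the shooting occurred downtown. "
            "A second shooting was reported nearby.")
passing = "The mayor's transit speech briefly mentioned a shooting last year."

print(classify(incident, rules))
print(classify(passing, rules))
```

Because no training set is involved, a rule for a new or rarely used term can be added the same day the term is needed, which is the advantage cited for breaking news.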

Knowledge graphs are getting more attention both here and at Taxonomy Boot Camp London. This was, of course, the main topic of Andreas Blumauer’s talk “Semantic AI: Fusing Machine Learning with Knowledge Graphs,” and Mike Doane, in the introduction of his talk on “Taxonomy in the Age of Amazon Echo,” said that the information industry analysis firm Gartner reports that knowledge graphs are on the rise and are discussed more than taxonomies. Gartner is tracking knowledge graphs instead of taxonomies and ontologies.

While the opening keynote did not focus on AI or machine learning, it was a presentation by a computational linguist, Deborah McGuinness, a professor of Computer, Cognitive, and Web Sciences at Rensselaer Polytechnic Institute. Among other things, she spoke of the data life cycle, whereby a computer-understandable specification of meaning (semantics) supports enhanced lifespan and impact of data. She went on to include specific ontology case examples.

Nearly all session slides are available to download, except the keynotes, without any login credentials at: http://www.taxonomybootcamp.com/2018/Presentations.aspx

Tuesday, October 30, 2018

Taxonomy Boot Camp London, 2018


This October, for the third year in a row, I have enjoyed the opportunity to attend and present at Taxonomy Boot Camp London (TBCL).

TBCL is similar in subject area scope to its parent conference, Taxonomy Boot Camp (TBC), usually held in Washington, DC, in November, but its presentations are unique, so I find it worth my time to attend both conferences. Despite what might be considered a niche topic for a select audience, TBCL remains a strong conference with consistent attendance (about 170 participants), comparable to TBC in its earlier years. The size is large enough to offer a choice of two tracks but small enough to easily network with others. The conference speakers and attendees are quite international, representing 22 countries this year.

Conference Format


TBCL continues to differ from TBC by having two tracks on both days, instead of just on the first day as TBC does. It also has a pre-conference workshop day, which TBC lacks, with a full-day Taxonomy Fundamentals workshop (which I lead) and two half-day workshops on more specialized or advanced taxonomy topics, which are not the same each year. This year the half-day workshops were on text analytics and taxonomies in SharePoint.

For the first time, Taxonomy Boot Camp London presented two awards (which Taxonomy Boot Camp in Washington, DC, does not do). The winner of the Taxonomy Practitioner of the Year award was Tom Alexander, Taxonomy Manager, Cancer Research UK. The winner of the Taxonomy Success of the Year award was SAGE Research Methods Thesaurus, led by Alan Maloney & Martha Sedgwick, SAGE Publishing.

Exhibits


The exhibit/sponsor showcase is very different at TBCL from TBC. TBC has a small dedicated exhibit on its first day, but then shares the much larger KM World exhibit with the four other co-located conferences. TBCL’s exhibit space is similar to that of TBC’s first day, with just three software vendor sponsor-exhibitors (Synaptica, Access Innovations, and Semantic Web Company/PoolParty). However, there was a larger number of organizational supporter-exhibitors: Association for Independent Information Professionals, the Information Retrieval Specialist Group of the British Computer Society, the Danish Union of Librarians, the Knowledge & Information Management Special Interest Group of CILIP (Chartered Institute of Library and Information Professionals) of the UK, the Information and Records Management Society of the UK, the UK Chapter of the International Society for Knowledge Organization (ISKO), the Network for Information & Knowledge Exchange of the UK, the SLA (Special Libraries Association) Europe chapter, and the SLA Taxonomy Division. This was a greater number of organizations than last year. The significant involvement of professional associations in TBCL contrasts with the relative lack of professional associations involved in TBC.

TBCL continues to be co-located with another Information Today conference, Internet Librarian International, but their exhibit areas are somewhat separate (although attendees of both conferences can visit booths of either conference), since their audiences and markets are different. Other than the drinks reception the first day, the two conferences do not share anything, such as keynotes.



Keynotes


There were three keynote presentations, two consecutive contrasting keynotes the first day and one the second day.  

The opening keynote was indeed a keynote-style talk, which was on the broader subject of information on the web, rather than on the specifics of taxonomies. “This is the Bad Place: 13 Rules for Designing Better Information Environments” was presented by Paul Rissen, Product Manager at Springer Nature UK and previously at BBC. In his thought-provoking presentation he aimed at establishing “ground rules” for using the web (especially social media) and for public discourse in general.

This was followed by a more down-to-earth state-of-the-profession talk by Dave Clarke, CEO of Synaptica, titled “Catching the Wave: What Tools do Taxonomists Need to do Their Job.” Although Synaptica was the lead sponsor of the conference, this was not a promotional talk. Dave started out by summarizing what taxonomists do and enable (organize, categorize, and discover) and explained the different tools for each. More of Dave’s presentation was about what taxonomists are doing, based on the results of a survey of taxonomists he has been conducting (https://twitter.com/DavidClarkeBlog). Then Dave turned to what he considered to be the future trends and issues. Artificial intelligence (AI) is relevant to what we do, but it will not replace the need for human-curated taxonomies or ontologies. Rather, taxonomies and ontologies will empower AI with the semantics and logic to improve search and categorization and perform machine learning. Ontologies and linked data can help build smarter search and discovery applications by leveraging the logical dependencies. Linked open data is shared openly, and linked enterprise data is behind the firewall, where the linked data model also works well.

The second day’s keynote addressed an important topic. “Selling the Benefits of Taxonomy: Numbers and Stories” was presented by taxonomy and text analytics consultant Tom Reamy. Tom’s argument was that return-on-investment (ROI) studies, with their numerical data on time spent, are not sufficient to convince decision-makers of the benefits of taxonomies, and that use case stories and internal advocacy are also needed. Stories can describe the increased richness of knowledge discovery, better decisions, and analysis of complex issues. He also suggested selling the vision of a taxonomy by means of a mini demo. Tom then turned to text analytics as the important means to make taxonomies usable, as he is rather dismissive of manual indexing. He explained that text analytics is often called auto-categorization, because that was the first use of it, but that text analytics can be used for other things, too.

Conference Sessions


The more basic track had sessions on taxonomy development, user validation, taxonomy resources, taxonomy development approaches, information architecture, enterprise information management, tagging, and taxonomy standards and architecture. I attended mostly sessions of the more advanced track, though.

A theme of the conference, as stated in the program, was “Making taxonomies go further,” and conference chair Helen Lippell stated in her welcome the opportunity to “push your practice further.” This was especially true of several of the advanced track sessions I attended. “Using Ontologies for more than Information Categorization,” presented by Ahren Lehnart and Jim Sweeney of Synaptica, suggested using ontologies for project and product management and in support of various other business functions in sales, marketing, partner and competitor information management, etc. “Beyond Taxonomy Classification: Using Knowledge Models and Linked Data to Unlock New Business Models” was presented by Ben Miller of Wiley. He spoke of knowledge models as comprising content acquisition and content enrichment. Jim Sweeney also presented “Taking Your Show on the Road: Publishing Taxonomies and Ontologies as Linked Data,” which was a good introduction to Linked Data. In this presentation, he also introduced graph databases and their benefits. While not explicitly discussing taxonomies, Rahel Anne Bailie’s talk, “Introduction to Information 4.0,” suggested another application for taxonomies, in which content is in “molecules and objects,” rather than in documents or based on pre-determined topics. Multilingual taxonomies and taxonomy implementation in SharePoint were the topics of other presentations.

 
I am looking forward to Taxonomy Boot Camp in Washington, DC, next week, and Taxonomy Boot Camp London again next year, which has been scheduled for the same venue on October 15-16, 2019, with preconference workshops on October 14.

Thursday, September 6, 2018

An Open Vocabulary Tagging Experiment for Discoverability


Does tagging content with terms from a shared, publicly available controlled vocabulary make a difference in increasing content discoverability on the web? A colleague of mine proposed finding out by experimenting with tagging the same content, such as two identical blog posts, differently: one with terms typical for posts on the blog and one with terms from a publicly available controlled vocabulary. Then after a few weeks the statistics of visitor traffic to the two post versions would be compared.

Wikidata and VIAF were chosen as the sources of publicly available controlled vocabulary terms. Since VIAF contains only name authorities (proper nouns), I used terms just from Wikidata in my blog tagging experiment, whereas my colleague used terms from both Wikidata and VIAF in his blog post tagging experiment (The Open Web Tagging Experiment on the Ol' Patio Boat Blog).

The preceding blog post on The Accidental Taxonomist blog, "Using Linked and Other Open Vocabularies," had been posted twice identically, except that one version was tagged with terms from Wikidata, linking to them, and one was tagged with terms that have been created and used just for The Accidental Taxonomist blog. I did not link to either blog post from other social media, as I usually do. (Now that the experiment is over, I have deleted the duplicate blog post with the lower number of visitors recorded.)

After 18 days, I checked the statistics for the number of visitors to each blog post. The version with the blog's own tags (the tagging feature supported by Blogger.com) had 72 visitors, and the version without blog tags but with links to Wikidata tags had 104 visitors. (By contrast, this post "An Open Vocabulary Tagging Experiment for Discoverability" had in the same period attracted 119 visitors, without any tags or links to Wikidata terms during this period.)

The conclusions are not certain, but it appears as if links out to Wikidata may have helped in that post's discoverability, since the post with those links had more visitors. It also appears that blog tags do not help significantly in discoverability, since of the three posts, the one with those tags had the fewest visitors, although the tags are useful for finding specific posts once you are on the blog's home page. The results of my colleague's test of two identical posts with and without tagging were different, though. He concluded the opposite, that copying Wikidata and VIAF headings into a post with incoming URLs had no effect, but putting metadata into the Blogger tagging field did increase visibility. However, his visitor traffic in both cases was very low, so the difference was perhaps not statistically significant.
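
For readers curious how far the difference between 72 and 104 visitors can be trusted, here is a rough check in Python. It treats each of the 176 visits as an independent coin flip between the two post versions and asks how often a split at least this uneven would arise by chance. That model is an assumption and a generous one: real visits are not independent (a single share or crawler can bring several at once), so this is a back-of-the-envelope sketch, not a proper analysis. The function name is made up for this example.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k successes in n fair coin flips.

    Doubles the smaller tail probability, capped at 1.0.
    """
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * min(lower, upper))


tagged_visitors = 72    # version with the blog's own tags
linked_visitors = 104   # version with links to Wikidata terms
total = tagged_visitors + linked_visitors

p_value = two_sided_binomial_p(tagged_visitors, total)
print(f"p = {p_value:.3f}")
```

Under these simplifying assumptions the result falls below the conventional 0.05 threshold, which is consistent with "may have helped" but, given how crude the model is and that this was a single trial, does not make the conclusion certain.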

As for this post, which had no tags, but the highest number of visitors, that could be attributed to a post title with more searched key words and phrases in it.

Search engine optimization is a big and ever-changing field. Rather than try to game the search, I will return to my method of posting about my blog posts on social media and hope my connections will share and repost. 



Using Linked and Other Open Vocabularies

Taxonomy terms assigned to content items make the content easier to find, whether in an internal system, on the web, or both. To make content easier to find or discover on the web, the use of taxonomy terms or tags is part of the broader application of search engine optimization (SEO). A lot has already been written by others regarding tips for creating and adding terms/labels/tags to web content to support SEO, such as how many and how specific they should be. For the taxonomist, who is interested not only in the terms alone but also in the larger taxonomy to which they belong, another question is whether using terms from shared, publicly available controlled vocabularies makes a difference in increasing content discoverability on the web.

Linked open data and linked open vocabularies


Shared, publicly available controlled vocabularies may or may not be linked or linkable, as linked open vocabularies. So, just because a controlled vocabulary is publicly available does not mean that it inherently supports linked data on the web.

“Linked data,” which usually is linked open data, refers to methods to interlink structured content in a way that can be read automatically by computers to enable the discovery of content on the web. It is described in a set of W3C specifications for web publishing that makes the data or content part of the Semantic Web. This means that instead of manually following individually created hyperlinks, semantic links and computer-readable formats support automated relevant linkages among content. Linked data requires the use of named URIs to identify things, HTTP URIs for web lookup, and structured data using controlled vocabulary terms and dataset definitions expressed in an RDF standard framework. “Linked open data” additionally includes open use in accordance with an open license.

Terms in taxonomies can serve as labels to linked content as part of linked data. Additionally, although less common, taxonomy terms themselves can be the content that is linked to, if the taxonomy concepts are individually assigned URIs and HTTP addresses, and are in an RDF format.
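
As a small illustration of what it means for a taxonomy concept to have a URI and be in an RDF format, the following Python sketch writes one concept as N-Triples using the W3C SKOS vocabulary. The SKOS and RDF namespace URIs are the real ones. The example.org concept URIs, the labels, and the function names are placeholders for illustration. A real project would more likely use an RDF library or a taxonomy management tool's export than hand-built strings.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(text, lang="en"):
    """Format a language-tagged RDF literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def concept_to_ntriples(uri, pref_label, alt_labels=(), broader=None):
    """Return N-Triples lines describing one SKOS concept."""
    lines = [
        f"<{uri}> <{RDF_TYPE}> <{SKOS}Concept> .",
        f"<{uri}> <{SKOS}prefLabel> {literal(pref_label)} .",
    ]
    for alt in alt_labels:
        lines.append(f"<{uri}> <{SKOS}altLabel> {literal(alt)} .")
    if broader:
        lines.append(f"<{uri}> <{SKOS}broader> <{broader}> .")
    return lines


# Placeholder URIs under example.org; a published taxonomy would use
# HTTP URIs on a domain its owner controls.
triples = concept_to_ntriples(
    "https://example.org/taxonomy/cars",
    "Cars",
    alt_labels=["Automobiles"],
    broader="https://example.org/taxonomy/vehicles",
)
print("\n".join(triples))
```

Each line is one statement (subject, predicate, object), so the preferred label, the alternative label, and the broader relationship all become data that a machine can follow from one URI to the next.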

Limitations to designating content as linked open data


If you have a document on the web that you want to have discovered as part of the Semantic Web, designating it as linked data is not so simple, because you need to include the machine-readable instructions, such as through a SPARQL endpoint or an API (application programming interface), in addition to the RDF designation. Not only is this technically outside the skills of most individual web content creators and taxonomists, but depending on how the content is managed, standard web content management systems or blog posting software may not even support editing the HTML of the page to insert such instructions.

Institutions may register their content with a linked open data repository. The main repository of linked open vocabularies is Linked Open Vocabularies (LOV), hosted by the Ontology Engineering Group of the Computer Science School at Universidad Politécnica de Madrid. An individual blogger, however, who would like to make an individual blog post linked open data, cannot easily achieve that status.

Simply linking to shared, open vocabularies


Thus, if linked data instructions cannot easily be included and traditional manual links back to the page (as by means of agreed-upon link exchanges) cannot be established for practical reasons, tagging could be done with terms from a publicly available controlled vocabulary that is not part of linked open data and linked open vocabularies. Two good examples are the labels of Wikidata and the Virtual International Authority File (VIAF).

Wikidata is a free, open, collaborative, multilingual collection of structured data. Its purpose is to support Wikipedia, Wikimedia Commons and other wikis of the Wikimedia movement, as well as anyone who wants to search, use, edit or consume its data. The data contained in the Wikidata repository consists of items, each with a unique name and ID. Currently there are 50,116,886 data items. Each item has a brief glossary definition, equivalent names in other languages, relationships ("statements") to other data items (such as "subclass of" and "designed by"), and identifiers in other vocabularies (such as Freebase, Library of Congress authorities, and Quora topic).
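
Simple tagging with links to Wikidata items could be sketched in Python as follows. The http://www.wikidata.org/entity/ prefix followed by an item ID (a "Q" and a number) is Wikidata's own URI pattern, and Q42 is the well-known item for Douglas Adams. The function name and the choice of an HTML link with rel="tag" are just one assumed way to do it, and the ID for any other term would need to be looked up on Wikidata rather than guessed.

```python
from html import escape

WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"


def wikidata_tag_links(tags):
    """Build an HTML snippet of tag links from (label, item ID) pairs."""
    links = []
    for label, qid in tags:
        # Item IDs look like "Q42": the letter Q followed by digits.
        if not (qid.startswith("Q") and qid[1:].isdigit()):
            raise ValueError(f"Not a Wikidata item ID: {qid!r}")
        links.append(
            f'<a rel="tag" href="{WIKIDATA_ENTITY}{qid}">{escape(label)}</a>'
        )
    return ", ".join(links)


snippet = wikidata_tag_links([("Douglas Adams", "Q42")])
print(snippet)
```

The resulting snippet can be pasted at the end of a post in any blogging tool that accepts ordinary links, which sidesteps the need to edit the page's underlying HTML for RDF markup.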

VIAF, hosted by OCLC, contains just named entities (proper nouns). But it uniquely brings together and displays as a group the headings that are the authority used by each contributor for that term. So, it’s not exactly a controlled vocabulary. VIAF has over 40 international member-contributors, most of which are national libraries.

Is there any benefit in tagging with and linking to terms that are part of a controlled vocabulary which is publicly available but is not a linked open vocabulary, such as Wikidata or VIAF? A colleague of mine proposed finding out by experimenting with tagging the same content with terms from different sources. Results will be shared in a later blog post.