The Accidental Taxonomist: September 2018

Thursday, September 6, 2018

An Open Vocabulary Tagging Experiment for Discoverability

Does tagging content with terms from a shared, publicly available controlled vocabulary make a difference in increasing content discoverability on the web? A colleague of mine proposed finding out by experimenting with tagging the same content, such as two identical blog posts, differently: one with terms typical for posts on the blog and one with terms from a publicly available controlled vocabulary. Then, after a few weeks, the visitor traffic statistics for the two post versions would be compared.

Wikidata and VIAF were chosen as the sources of publicly available controlled vocabulary terms. Since VIAF contains only name authorities (proper nouns), I used terms just from Wikidata in my blog tagging experiment, whereas my colleague used terms from both Wikidata and VIAF in his blog post tagging experiment (The Open Web Tagging Experiment on the Ol' Patio Boat Blog).

The preceding blog post on The Accidental Taxonomist blog, "Using Linked and Other Open Vocabularies," had been posted twice in identical form, except that one version was tagged with terms from Wikidata, linking to them, and one was tagged with terms that have been created and used just for The Accidental Taxonomist blog. I did not link to either blog post from social media, as I usually do. (Now that the experiment is over, I have deleted the duplicate blog post with the lower number of visitors recorded.)

After 18 days, I checked the statistics for the number of visitors to each blog post. The version with the blog's own tags (the tagging feature supported by Blogger.com) had 72 visitors, and the version without blog tags but with links to Wikidata tags had 104 visitors. (By contrast, this post "An Open Vocabulary Tagging Experiment for Discoverability" had in the same period attracted 119 visitors, without any tags or links to Wikidata terms during this period.)

The conclusions are not certain, but it appears that links out to Wikidata may have helped that post's discoverability, since the post with those links had more visitors. It also appears that blog tags do not help significantly in discoverability, since of the three posts, the one with those tags had the fewest visitors, although the tags are useful for finding specific posts once you are on the blog's home page. The results of my colleague's test of two identical posts with and without tagging were different, though. He concluded the opposite: that copying Wikidata and VIAF headings into a post with incoming URLs had no effect, but putting metadata into the Blogger tagging field did increase visibility. However, his visitor traffic in both cases was very low, so the difference was perhaps not statistically significant.
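For readers who want to check whether a difference such as 72 versus 104 visitors could plausibly be chance, here is a minimal sketch in Python. It assumes that every visit is independent and that, if tagging made no difference, each visitor would be equally likely to land on either version. Neither assumption is guaranteed for real blog traffic, and the function names are my own illustrations, not part of any analytics tool.

```python
from math import comb


def relative_increase(baseline: int, other: int) -> float:
    """Fractional increase of `other` over `baseline` (0.25 means 25% more)."""
    return (other - baseline) / baseline


def two_sided_binomial_p(count_a: int, count_b: int) -> float:
    """Exact two-sided binomial test against an even 50/50 split.

    Assumes independent visits and equal exposure for both versions.
    """
    n = count_a + count_b
    k = max(count_a, count_b)
    # Probability of k or more visits on one version under a fair split.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The fair-split distribution is symmetric, so double the tail (capped at 1).
    return min(1.0, 2 * upper_tail)


if __name__ == "__main__":
    own_tags, wikidata_links = 72, 104  # visitor counts reported above
    print(f"Relative increase: {relative_increase(own_tags, wikidata_links):.1%}")
    print(f"Two-sided p-value: {two_sided_binomial_p(own_tags, wikidata_links):.4f}")
```

Even a small p-value here would only say that an even split is unlikely under these assumptions. It would not show that the Wikidata links caused the difference, and with very low totals a test like this has little power to detect anything.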

As for this post, which had no tags but the highest number of visitors, that could be attributed to a post title containing more frequently searched keywords and phrases.

Search engine optimization is a big and ever-changing field. Rather than try to game the search, I will return to my method of posting about my blog posts on social media and hope my connections will share and repost.

Using Linked and Other Open Vocabularies

Taxonomy terms assigned to content items make the content easier to find, whether in an internal system, on the web, or both. To make content easier to find or discover on the web, the use of taxonomy terms or tags is part of the broader application of search engine optimization (SEO). A lot has already been written by others regarding tips for creating and adding terms/labels/tags to web content to support SEO, such as how many and how specific they should be. For the taxonomist, who is interested not only in the terms alone but also in the larger taxonomy to which they belong, another question is whether using terms from shared, publicly available controlled vocabularies makes a difference in increasing content discoverability on the web.

Linked open data and linked open vocabularies

Shared, publicly available controlled vocabularies may or may not be linked or linkable, as linked open vocabularies. So, just because a controlled vocabulary is publicly available does not mean that it inherently supports linked data on the web.

“Linked data,” which usually is linked open data, refers to methods to interlink structured content in a way that can be read automatically by computers to enable the discovery of content on the web. It is described in a set of W3C specifications for web publishing that makes the data or content part of the Semantic Web. This means that instead of manually following individually created hyperlinks, semantic links and computer readable formats support automated relevant linkages among content. Linked data requires the use of named URIs to identify things, HTTP URIs for web lookup, and structured data using controlled vocabulary terms and dataset definitions expressed in an RDF standard framework. “Linked open data” additionally includes open use in accordance with an open license.
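To make the idea of URIs plus RDF statements more concrete, here is a minimal sketch in Python that writes a few statements about a blog post in N-Triples, one of the plain-text RDF serializations. The post URL is a made-up placeholder, the function names are hypothetical, and the choice of schema.org properties (`name`, `about`) is my assumption for illustration, not a requirement of linked data. The Wikidata item Q42 is used only as a familiar example identifier.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "http://schema.org/"


def uri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe_post(post_url: str, title: str, subject_uris: list[str]) -> str:
    """Return N-Triples stating the post's type, its title, and its subjects."""
    triples = [
        (uri(post_url), uri(RDF_TYPE), uri(SCHEMA + "BlogPosting")),
        (uri(post_url), uri(SCHEMA + "name"), literal(title)),
    ]
    for subject in subject_uris:
        # Each subject is itself a URI, so a machine can follow it for more data.
        triples.append((uri(post_url), uri(SCHEMA + "about"), uri(subject)))
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    print(describe_post(
        "https://example.org/blog/open-vocabularies",  # placeholder URL
        "Using Linked and Other Open Vocabularies",
        ["http://www.wikidata.org/entity/Q42"],  # example item only
    ))
```

Each output line is one subject-predicate-object statement. Producing such statements is only part of the picture: they still have to be published where machines can retrieve them.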

Terms in taxonomies can serve as labels to linked content as part of linked data. Additionally, although less common, taxonomy terms themselves can be the content that is linked to, if the taxonomy concepts are individually assigned URIs and HTTP addresses, and are in an RDF format.

Limitations to designating content as linked open data

If you have a document on the web that you want to have discovered as part of the Semantic Web, designating it as linked data is not so simple, because you need to include the machine-readable instructions, such as through a SPARQL endpoint or an API (application programming interface), in addition to the RDF designation. Not only is this technically outside the skills of most individual web content creators and taxonomists, but depending on how the content is managed, standard web content management systems or blog posting software may not even support editing the HTML of the page to insert such instructions.

Institutions may register their content with a linked open data repository. The main repository of linked open vocabularies is Linked Open Vocabularies (LOV), hosted by the Ontology Engineering Group of the Computer Science School at Universidad Politécnica de Madrid. An individual blogger, however, who would like to make an individual blog post linked open data cannot easily achieve that status.

Simply linking to shared, open vocabularies

Thus, if linked data instructions cannot easily be included and traditional manual links back to the page (as by means of agreed-upon link exchanges) cannot be established for practical reasons, tagging could be done with terms from a publicly available controlled vocabulary that is not part of linked open data and linked open vocabularies. Two good examples are the labels of Wikidata and the Virtual International Authority File (VIAF).

Wikidata is a free, open, collaborative, multilingual collection of structured data. Its purpose is to support Wikipedia, Wikimedia Commons and other wikis of the Wikimedia movement, as well as anyone who wants to search, use, edit or consume its data. The data contained in the Wikidata repository consists of items, each with a unique name and ID. Currently there are 50,116,886 data items. Each item has a brief glossary definition, equivalent names in other languages, relationships ("statements") to other data items (such as "subclass of" and "designed by"), and identifiers in other vocabularies (such as Freebase, Library of Congress authorities, and Quora topic).
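The parts of a Wikidata item described above can be pictured as a simple record. The following Python sketch is only an illustration: the class and field names are hypothetical and do not reproduce Wikidata's actual JSON format. The example uses item Q42 (Douglas Adams), with the property P31 ("instance of") pointing to Q5 ("human").

```python
from dataclasses import dataclass, field


@dataclass
class WikidataItem:
    """Illustrative model of one item; not Wikidata's real schema."""
    item_id: str                      # unique ID, e.g. "Q42"
    labels: dict[str, str]            # language code -> name in that language
    description: str = ""             # brief glossary definition
    statements: list[tuple[str, str]] = field(default_factory=list)  # (property ID, target item ID)
    external_ids: dict[str, str] = field(default_factory=dict)       # other vocabulary -> identifier

    def concept_uri(self) -> str:
        """URI that identifies the concept itself."""
        return f"http://www.wikidata.org/entity/{self.item_id}"

    def page_url(self) -> str:
        """Human-readable page that a blog tag could link to."""
        return f"https://www.wikidata.org/wiki/{self.item_id}"

    def label(self, lang: str = "en") -> str:
        """Name in the requested language, falling back to the ID."""
        return self.labels.get(lang, self.item_id)


if __name__ == "__main__":
    item = WikidataItem(
        item_id="Q42",
        labels={"en": "Douglas Adams"},
        statements=[("P31", "Q5")],  # instance of -> human
    )
    print(item.label(), item.page_url())
```

For tagging a blog post, the label serves as the visible tag text and the page URL as the link target.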

VIAF, hosted by OCLC, contains just named entities (proper nouns). But it uniquely brings together and displays as a group the headings that are the authority used by each contributor for that term. So, it’s not exactly a controlled vocabulary. VIAF has over 40 international member-contributors, most of which are national libraries.

Is there any benefit in tagging with and linking to terms that are part of a controlled vocabulary that is publicly available but is not a linked open vocabulary, such as Wikidata or VIAF? A colleague of mine proposed finding out by experimenting with tagging the same content with terms from different sources. Results will be shared in a later blog post.