Thursday, September 6, 2018

An Open Vocabulary Tagging Experiment for Discoverability


Does tagging content with terms from a shared, publicly available controlled vocabulary make a difference in increasing content discoverability on the web? A colleague of mine proposed finding out by experimenting with tagging the same content, such as two identical blog posts, differently: one with terms typical for posts on the blog and one with terms from a publicly available controlled vocabulary. Then, after a few weeks, the visitor traffic statistics for the two post versions would be compared.

Wikidata and VIAF were chosen as the sources of publicly available controlled vocabulary terms. Since VIAF contains only name authorities (proper nouns), I used terms just from Wikidata in my blog tagging experiment, whereas my colleague used terms from both Wikidata and VIAF in his blog post tagging experiment (The Open Web Tagging Experiment on the Ol' Patio Boat Blog).

The two preceding blog posts on The Accidental Taxonomist blog, "Using Linked and Other Open Vocabularies," are identical, except that one was tagged with terms from Wikidata, linking to them, and one was tagged with terms that have been created and used just for The Accidental Taxonomist blog. I have not linked to either blog post from other social media, as I usually do. If you are going to click on the links for either of these blog posts (rather than just scrolling down the home page), please click on the links for both posts, not just one, so as not to impact the blog post visitor statistics.

Results will be posted on an updated version of this blog post in a few weeks.

Using Linked and Other Open Vocabularies

Taxonomy terms assigned to content items make the content easier to find, whether in an internal system, on the web, or both. To make content easier to find or discover on the web, the use of taxonomy terms or tags is part of the broader application of search engine optimization (SEO). A lot has already been written by others regarding tips for creating and adding terms/labels/tags to web content to support SEO, such as how many and how specific they should be. For the taxonomist, who is interested not only in the terms alone but also in the larger taxonomy to which they belong, another question is whether using terms from shared, publicly available controlled vocabularies makes a difference in increasing content discoverability on the web.

Linked open data and linked open vocabularies


Shared, publicly available controlled vocabularies may or may not be linked or linkable, as linked open vocabularies. So, just because a controlled vocabulary is publicly available does not mean that it inherently supports linked data on the web.

“Linked data,” which usually is linked open data, refers to methods to interlink structured content in a way that can be read automatically by computers to enable the discovery of content on the web. It is described in a set of W3C specifications for web publishing that makes the data or content part of the Semantic Web. This means that instead of manually following individually created hyperlinks, semantic links and computer-readable formats support automated relevant linkages among content. Linked data requires the use of named URIs to identify things, HTTP URIs for web lookup, and structured data using controlled vocabulary terms and dataset definitions expressed in an RDF standard framework. “Linked open data” additionally includes open use in accordance with an open license.

Terms in taxonomies can serve as labels to linked content as part of linked data. Additionally, although less common, taxonomy terms themselves can be the content that is linked to, if the taxonomy concepts are individually assigned URIs and HTTP addresses, and are in an RDF format.
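
To make the idea of taxonomy concepts with their own URIs in an RDF format a little more concrete, here is a minimal sketch in Python that writes a concept out as N-Triples using the W3C SKOS vocabulary. The example.org namespace, the concept IDs, and the helper names are all assumptions made up for illustration; a real vocabulary would use its own published URIs and most likely a dedicated RDF library.

```python
# Minimal sketch: describe a taxonomy concept as RDF triples (N-Triples syntax).
# BASE and the concept IDs are hypothetical; the SKOS and RDF URIs are the W3C ones.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "https://example.org/taxonomy/"  # hypothetical namespace for this sketch


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang="en"):
    """Format a language-tagged string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def concept_triples(concept_id, label, broader_id=None):
    """Return (subject, predicate, object) triples for one concept."""
    subject = uri(BASE + concept_id)
    triples = [
        (subject, uri(RDF_TYPE), uri(SKOS + "Concept")),
        (subject, uri(SKOS + "prefLabel"), literal(label)),
    ]
    if broader_id is not None:
        # Link the concept to its broader concept by URI, not by label.
        triples.append((subject, uri(SKOS + "broader"), uri(BASE + broader_id)))
    return triples


def to_ntriples(triples):
    """Serialize triples, one statement per line, each ending in ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(concept_triples("thesauri", "Thesauri", "controlled-vocabularies")))
```

Each printed line is one statement about the concept. Because the subject and the broader concept are both identified by URIs, another system could in principle link to them directly.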

Limitations to designating content as linked open data


If you have a document on the web that you want to have discovered as part of the Semantic Web, designating it as linked data is not so simple, because you need to include the machine-readable instructions, such as through a SPARQL endpoint or an API (application programming interface), in addition to the RDF designation. Not only is this technically outside the skills of most individual web content creators and taxonomists, but depending on how the content is managed, standard web content management systems or blog posting software may not even support editing the HTML of the page to insert such instructions.

Institutions may register their content with a linked open data repository. The main repository of linked open vocabularies is Linked Open Vocabularies (LOV), hosted by the Ontology Engineering Group of the Computer Science School at Universidad Politécnica de Madrid. An individual blogger, however, who would like to make an individual blog post linked open data, cannot easily achieve that status.

Simply linking to shared, open vocabularies


Thus, if linked data instructions cannot easily be included and traditional manual links back to the page (as by means of agreed-upon link exchanges) cannot be established for practical reasons, tagging could be done with terms from a publicly available controlled vocabulary that is not part of linked open data and linked open vocabularies. Two good examples are the labels of Wikidata and the Virtual International Authority File (VIAF).

Wikidata is a free, open, collaborative, multilingual collection of structured data. Its purpose is to support Wikipedia, Wikimedia Commons and other wikis of the Wikimedia movement, as well as anyone who wants to search, use, edit or consume its data. The data contained in the Wikidata repository consists of items, each with a unique name and ID. Currently there are 50,116,886 data items. Each item has a brief glossary definition, equivalent names in other languages, relationships (“statements”) to other data items (such as “subclass of” and “designed by”), and identifiers in other vocabularies (such as Freebase, Library of Congress authorities, and Quora topic).
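
Since each Wikidata item has its own ID, a tag can simply point at the item's page. The following is a minimal sketch, not the method used for this experiment: the function names are made up, and the use of rel="tag" on the link is my assumption about how a blog template might mark up a tag. Q42 is the Wikidata item for Douglas Adams.

```python
# Minimal sketch: turn a label plus a Wikidata item ID into an HTML tag link.
# Function names are hypothetical; the URL pattern is Wikidata's item page address.
import re
from html import escape

WIKIDATA_ITEM_BASE = "https://www.wikidata.org/wiki/"
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")  # item IDs look like Q42


def wikidata_url(qid):
    """Return the item page URL, rejecting anything that is not an item ID."""
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return WIKIDATA_ITEM_BASE + qid


def tag_link(label, qid):
    """Build an anchor element whose visible text is the tag label."""
    return f'<a rel="tag" href="{wikidata_url(qid)}">{escape(label)}</a>'


print(tag_link("Douglas Adams", "Q42"))
```

Validating the ID before building the link keeps a mistyped tag from silently pointing at a page that does not exist as an item.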

VIAF, hosted by OCLC, contains just named entities (proper nouns). But it uniquely brings together and displays as a group the headings that are the authority used by each contributor for that term. So, it’s not exactly a controlled vocabulary. VIAF has over 40 international member-contributors, most of which are national libraries.

Is there any benefit in tagging with and linking to terms that are part of a controlled vocabulary which is publicly available but is not a linked open vocabulary, such as Wikidata or VIAF? A colleague of mine proposed finding out by experimenting with tagging the same content with terms from different sources. Results will be shared in a later blog post.

Thursday, August 30, 2018

Taxonomy Hierarchical Relationship Issues


A common feature of taxonomies is the hierarchical relationship between terms. Terms are linked to each other in a relationship that indicates that one is the broader term (BT) of the other, and in the other direction, one is the narrower term (NT) of the other. You don’t need to be a taxonomist to understand this basic principle. However, even taxonomists can be challenged sometimes in determining whether it’s correct to put two terms in a hierarchical relationship.

Standards for Hierarchical Relationships


There are guidelines for the hierarchical relationship provided by the standards of ANSI/NISO Z39.19-2005 (R2010) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies and ISO 25964-1: Information and Documentation — Thesauri and Interoperability with other Vocabularies — Part 1: Thesauri for Information Retrieval. The standards say that in a correct hierarchical relationship the term that is narrower to the broader term may be a specific type of the generic broader term, a named instance of the generic broader term, or an integral part of the whole broader term.
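
As a rough illustration of the three permitted kinds of hierarchical relationship, here is a minimal Python sketch of a structure that stores each BT/NT pair reciprocally and refuses any other kind. The class, its method names, and the kind labels ("generic", "instance", "whole-part") are my own assumptions for the sketch, not terminology taken from the standards, and the sample terms are only illustrative.

```python
# Minimal sketch: store reciprocal broader/narrower links, each with one of
# three kinds. Class name, method names and kind labels are hypothetical.
from collections import defaultdict

ALLOWED_KINDS = {"generic", "instance", "whole-part"}


class Taxonomy:
    def __init__(self):
        self.broader = defaultdict(dict)   # term -> {broader term: kind}
        self.narrower = defaultdict(dict)  # term -> {narrower term: kind}

    def add_relationship(self, narrower_term, broader_term, kind):
        """Record NT/BT in both directions so the pair can never get out of sync."""
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"{kind!r} is not a hierarchical relationship kind")
        if narrower_term == broader_term:
            raise ValueError("a term cannot be its own broader term")
        self.broader[narrower_term][broader_term] = kind
        self.narrower[broader_term][narrower_term] = kind

    def bt(self, term):
        """Broader terms of a term, alphabetically."""
        return sorted(self.broader.get(term, {}))

    def nt(self, term):
        """Narrower terms of a term, alphabetically."""
        return sorted(self.narrower.get(term, {}))


taxonomy = Taxonomy()
# A specific type of the broader term.
taxonomy.add_relationship("Marketing directors", "Managers", "generic")
# A named instance of the broader term.
taxonomy.add_relationship("Universidad Politécnica de Madrid", "Universities", "instance")
# An integral part of the broader term.
taxonomy.add_relationship("Spain", "Europe", "whole-part")

print(taxonomy.nt("Managers"))
print(taxonomy.bt("Spain"))
```

A pair such as Stress management and Stress would be rejected here unless one of the three kinds were asserted for it, which is exactly the judgment the taxonomist has to make.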

These standards, however, are for thesauri, not taxonomies. Thesauri additionally have a non-hierarchical associative relationship between terms, known as “related term” (RT). In taxonomies that lack related-term relationships, the conditions under which the hierarchical relationship is permitted need not be followed quite as strictly. Nevertheless, the thesaurus standards for creating the hierarchical relationship should be the starting point and the default for hierarchical relationships in taxonomies.

Challenges in Coming up with Broader Terms


Hierarchical taxonomies may be created from the top down, the bottom up, or a combination of both approaches. The top-down approach involves creating broadest categories first, then adding narrower terms and then adding narrower terms to narrower terms. This approach makes it easier to create good hierarchical relationships. In reality, though, we don’t always create terms based purely on their broader terms. Rather, analysis of content yields specific terms that are needed, so some degree of bottom-up taxonomy creation takes place. In the bottom-up approach there may be the challenge of determining and creating the appropriate broader term.

When I have been completely challenged in coming up with a broader term, I admit I have looked up the term in Wikipedia to see what are named as “Categories” for that term, listed at the bottom of the page. “Categories” implies a broader term, but these are not necessarily good or correct broader terms. An example of Categories that are not exactly broader terms is for the term Stress management: Stress, Management by type, Psychotherapy, and Psychiatric treatments. Stress management is not exclusively done as (is a part of) Psychotherapy or Psychiatric treatments, so those are not suitable broader terms. “Management by type” is definitely not a good taxonomy term, and the term Management alone has a different meaning of its own. As for the term “Stress,” this is more complicated. Technically, Stress management is not a kind of Stress or a part of Stress, so Stress should not be its broader term. If this were in a thesaurus, the two would definitely be related terms. If your controlled vocabulary is not a thesaurus, and the related-term relationship is not supported, then you may ignore the thesaurus rule in this case, and make Stress the broader term of Stress management. This relationship is likely to be expected and accepted by users.

Challenges in Special Circumstances


Even when creating a taxonomy from the top down, taxonomists may encounter challenges or confusions with the hierarchical relationships. One challenging case is the concept of membership. Things and their members could be industries and their companies or international organizations and their member countries. It may seem logical to list the affiliate members “under” the industry or organization of which they are a part, but this is based too much on context and time. Companies can change their industries, and countries can change their international organization affiliation. More significantly, the whole-part hierarchical relationship is about integral parts, not participatory taking “part.” Finally, it may be more practical to put each type (companies, industries, countries, organizations) in a separate facet and not establish any relationship between them in a taxonomy (in contrast to a thesaurus or ontology).

Another potentially confusing case involves occupations and job titles. The subordinate nature of narrower terms should not be confused with the subordinate role of one job title to another. Thus, while a marketing specialist reports to a marketing manager, Marketing managers is not a broader term of Marketing specialists. Furthermore, while a marketing manager reports to a marketing director, we might make the hierarchical relationship in the other direction, with Marketing Directors as a narrower term to Marketing Managers, because directors are a kind of manager. Managers include directors.

Perhaps the most confusing case involves specificity which is not taxonomical specificity. For example, Syllabi (plural of syllabus), as instructional outlines, in a certain sense are more specific than Curricula (plural of curriculum), which are also a kind of instructional outline. Syllabi are for individual courses, and curricula are for a series of courses, such as an entire program of study or degree. Thus, it might seem logical that Syllabi would have the broader term of Curricula. But a syllabus is neither a specific type of curriculum, nor is it part of a curriculum. It is something different. So, it would be better not to have Curricula as a broader term of Syllabi, even in a taxonomy that is lacking related-term relationships.

Parent-Child Confusions


Sometimes the hierarchical relationship is referred to as “parent-child.” While it’s correct that a subsidiary company is a narrower term of its parent company, because it is part of the parent company, a biological child is not a narrower term of its parent, because it is not a part of the parent, but rather an offspring. To avoid confusion, it’s better to describe the relationship as broader/narrower, rather than as parent/child.

Monday, July 30, 2018

Taxonomy Hierarchy Levels


A taxonomy comprises a hierarchy of concepts (terms), and those hierarchies can be considered to be in different levels. In actuality, levels are somewhat artificial, and it’s important not to think of levels too strictly. In some taxonomies the levels are even named (for example: Domain, Category, Subcategory, Topic), but I would caution against such a practice.


Why we may tend to name levels


The most famous taxonomy, the Linnaean taxonomy of organisms, has well-known names for each of its hierarchical levels: Domain, Kingdom, Phylum, Class, Order, Family, Genus, and Species. There are issues with this named-level system, however. In some cases, a Family may contain only a single Genus, and/or a Genus may contain only a single Species (such as Homo sapiens). In some cases, a Species may have such variety within it, which we wish to describe, that we have created names for subspecies or other deeper levels (such as for dog breeds). For a digital navigation or information taxonomy of concepts, it would be considered bad style for a term to have only a single narrower term (as with Homo sapiens). A term should have no narrower terms or at least two narrower terms, but not just one.
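
The “none or at least two” guideline is easy to check mechanically. As a minimal sketch, assuming the taxonomy is available as a simple mapping from each term to its list of narrower terms (the function name and the sample data are made up for illustration):

```python
# Minimal sketch: flag terms that have exactly one narrower term.
# The mapping format and the sample terms are assumptions for illustration.

def single_child_terms(narrower):
    """Return, alphabetically, the terms with exactly one narrower term."""
    return sorted(term for term, children in narrower.items() if len(children) == 1)


sample = {
    "Homo": ["Homo sapiens"],                         # one narrower term: flagged
    "Instructional outlines": ["Curricula", "Syllabi"],
    "Syllabi": [],                                    # no narrower terms: fine
}

print(single_child_terms(sample))
```

Terms that get flagged are not necessarily wrong, but each is worth a second look: either more narrower terms are needed or the single narrower term could be merged into its broader term.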

Besides the legacy of the Linnaean taxonomy, we may think in terms of designated levels of a taxonomy because the most common tool for developing taxonomies is MS Excel. In Excel, each column is used to designate a deeper hierarchical level, broader to more specific, from left to right. People may feel compelled to designate column headers (a typical thing to do in spreadsheets), whether as names or merely as Level 1, Level 2, etc. Excel is not intended to be taxonomy management software, and dedicated taxonomy management software tools do not name or number hierarchical levels by default, since there is no need for it in a taxonomy.
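
The column position in such a spreadsheet really only encodes which term is broader than which, and that can be recovered without naming any level. Here is a minimal sketch that reads a spreadsheet exported as CSV, assuming one term per row placed in the column for its depth and no header row. The function name and the sample terms are hypothetical.

```python
# Minimal sketch: turn an indented, one-term-per-row spreadsheet export into
# (broader term, narrower term) pairs. Layout and sample data are assumptions.
import csv
import io

SAMPLE = """\
Business,,
,Marketing,
,,Market research
,Accounting,
Computing,,
,Computer Science,
"""


def rows_to_relationships(rows):
    """Return (broader, narrower) pairs from rows whose column index is the depth."""
    pairs = []
    path = []  # path[i] is the most recent term seen in column i
    for row in rows:
        for level, cell in enumerate(row):
            term = cell.strip()
            if not term:
                continue
            if level > len(path):
                raise ValueError(f"{term!r} skips a level")
            del path[level:]          # drop the previous branch at this depth and deeper
            if path:
                pairs.append((path[-1], term))
            path.append(term)
            break                     # only one term per row in this layout
    return pairs


pairs = rows_to_relationships(csv.reader(io.StringIO(SAMPLE)))
for broader, narrower in pairs:
    print(f"{narrower} BT {broader}")
```

Nothing in the result says “Level 2” or “Subcategory”; only the broader/narrower relationships survive, which is all a taxonomy needs.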


Why we should not name levels


Unlike the Linnaean taxonomy, the goal of a digital navigation or information taxonomy of concepts is not necessarily to classify concepts, but rather to arrange concepts (terms) in logical hierarchical relationships, so as to help guide the user to find the desired concept (which in turn is linked to content). A classification system (such as industry classification codes or the Dewey Decimal system), which also has enumerated levels, is often considered a different kind of controlled vocabulary from a taxonomy.

A distinction needs to be made between hierarchical relationships and hierarchies. A good taxonomy or thesaurus design practice is to create hierarchical relationships between terms where they are logical: when one term is a specific type or an integral part of another term, so users find narrower terms where they expect them. The extension of multiple hierarchical relationships, particularly when terms have both broader-term and narrower-term relationships, naturally results in the manifestation of hierarchies. But the resulting “natural” hierarchies are not consistent. They may be many levels deep in some places and only two levels deep in other places. Terms that are on the same “level” may have relatively different degrees of specificity. I recently created a taxonomy for a discipline in which terms that were the equivalent of textbook courses ranged everywhere from the top to the fourth level. Fortunately, I was not constrained to have course as the first level.
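
How uneven a “natural” hierarchy is can be seen by computing how deep each end term sits. The following is a minimal sketch under stated assumptions: the taxonomy is a mapping from each term to its narrower terms, it contains no cycles, and the sample terms are invented for illustration.

```python
# Minimal sketch: report the depth of every term that has no narrower terms.
# Assumes an acyclic hierarchy; the sample data is hypothetical.

def leaf_depths(narrower, top_terms):
    """Map each term without narrower terms to its depth (top terms are depth 1)."""
    depths = {}
    stack = [(term, 1) for term in top_terms]
    while stack:
        term, depth = stack.pop()
        children = narrower.get(term, [])
        if not children:
            # With more than one broader term, keep the deepest position found.
            depths[term] = max(depth, depths.get(term, 0))
        for child in children:
            stack.append((child, depth + 1))
    return depths


sample = {
    "Business": ["Marketing", "Accounting"],
    "Marketing": ["Market research", "Advertising"],
    "Advertising": ["Online advertising", "Outdoor advertising"],
    "Law": [],
}

for term, depth in sorted(leaf_depths(sample, ["Business", "Law"]).items()):
    print(depth, term)
```

In this small sample the end terms already sit at four different depths, which is why a single name for “level 3” would describe very different things in different branches.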

Sometimes a taxonomy owner wants to set a policy as to how many levels deep the taxonomy should be.  It is understandable to limit the depth of a taxonomy in some cases: a hierarchy of navigation for public site visitors who want to get to content in the fewest clicks, lest they leave the site; a hierarchy of categories whose labels are to be picked up by search engines (supporting search engine optimization); or a hierarchy within a facet with limitations on browsing.  But there is a difference between limiting the total levels of depth and designating what the levels are called and are supposed to represent.


Examples of problems from named levels


Designating the names or types of levels inevitably results in the inaccurate application of level names or terms at inappropriate or inconsistent levels. For example, for a taxonomy of job titles I worked on, the project owner proposed that the top level be called Occupations and the narrower terms to those be called Specializations. This often works, but not always. Take, for example, the term Electrician and its narrower term Electrician Apprentice. Electrician was called an Occupation, and Electrician Apprentice was called a Specialization. Although an Electrician Apprentice can be a kind of (narrower term of) Electrician, it is not actually a “specialization” of Electrician. Also, a unique specialized job title may not have a broader term type of job title, so it would have to be called an Occupation. For example, Endoscopy Technician was designated as an Occupation, as it lacked a broader term, whereas Nurse Practitioner was a Specialization, since it had the broader term of Registered Nurse.

In another example of a taxonomy of academic areas of study I worked on, I was told that the taxonomy could have only two levels and the top level would be called Discipline and the second level be called Subdiscipline. The levels and designations were based on content management and business needs.  Thus, while Marketing would normally be considered a narrower term to Business, both were Disciplines at the same level. Some of the Disciplines were very specific, such as Real Estate Law (since Law did not exist as a discipline in this case), and some of the Subdisciplines were very broad, such as Computer Science (because it had a broader term of Computing). I resolved that this was not actually a taxonomy, but rather a metadata property with its values structured into two levels.

Taxonomies naturally have hierarchies, but do not naturally have levels, which are an artificial layer that sometimes gets imposed.