The Accidental Taxonomist: Linked data

Saturday, October 30, 2021

Taxonomies for Data

Coming from an editorial content background, I have always valued taxonomies for making content findable, but more recently I have come to appreciate how taxonomies can also play a role in making data accessible and useful.

Taxonomies have successfully aided people in finding and retrieving desired content since the 1990s and even decades earlier, if we consider thesauri within the scope of taxonomies. Nevertheless, the focus had always been on content: originally printed content such as periodical articles, then web pages, intranet or CMS pages, and attached documents, and later multimedia content such as images, animation or video clips, and audio files. Each content item gets tagged with taxonomy terms of different types for what it is about and for what kind of content it is. Taxonomies have become increasingly important as content volume and types have grown, especially as more people in varied roles create content.

Meanwhile, data has grown even faster in its volume and potential value. We are hearing more and more about big data, data warehouses, data lakes, data fabrics, data catalogs, data analytics, master data management, data governance, FAIR data, data-centric architecture, data-driven enterprises, and data science in general. Tools and technologies to make use of the data have included programming/scripting, machine learning, algorithms, natural language processing, and other forms of artificial intelligence.

These tools and technologies for data do not replace taxonomies and other controlled vocabularies, though, which still have an important role to play in connecting people to the desired data and information, and ultimately knowledge. I see two ways in which taxonomies are linked to data:

1. Managing and understanding the data in a standardized way with better metadata, which depends on controlled vocabularies.

2. Connecting the data with graph databases, knowledge graphs, ontologies, and ultimately taxonomies.

Taxonomies and metadata

Metadata refers to the standardized data types, properties, fields, or elements, and the specific individual values that populate those types or properties. From a content perspective, we think of metadata as serving content management and retrieval, such as the content’s format type, title, source, creator, date, language, subjects, category, audience, etc. But metadata exists in databases and spreadsheets, too, where column headers are the metadata properties. For example, contact metadata would include name, phone number, email address, city/state, country, contact type, initial contact date, contact owner, etc. Product metadata would include SKU number, product name, product type/category, price, color, features, supply source, retail availability, etc. Transactional metadata would include purchased product name, purchaser, purchase date, purchase price, purchase location.

Data can be better managed and analyzed if the metadata properties and values are standardized and controlled. Controlled vocabularies should be used to standardize the metadata for many of the properties: format type, source, subjects, category, purpose, country, contact type, product name, product category, color, features, availability, etc. Hierarchical taxonomies serve some of this metadata, such as product categories.

As an example, I’m planning to attend a conference in Austin, TX, and I wanted to look up contacts in the Austin area in my CRM (customer relationship management) system. Filtering results by city, I found some with the city of Austin, but others had the city of Round Rock. Filtering on Austin, I would have missed those, had I not known that Round Rock was a suburb of Austin. What was needed was a metadata property for “Metropolitan area,” rather than “City,” a controlled list of metropolitan areas, and Round Rock as an alternative label for Austin area in that controlled vocabulary.
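A minimal sketch of that fix in Python follows. It is an illustration only, not how any particular CRM works: the vocabulary contents, the sample contact records, and the function names are all invented for the example.

```python
# Hypothetical controlled vocabulary: each preferred label (a metropolitan
# area) lists the city names that should resolve to it as alternative labels.
METRO_AREAS = {
    "Austin area": ["Austin", "Round Rock"],
    "Boston area": ["Boston", "Cambridge"],
}

# Invert the vocabulary into a lookup from city name to preferred label.
CITY_TO_METRO = {
    city.lower(): metro
    for metro, cities in METRO_AREAS.items()
    for city in cities
}


def metro_area(city):
    """Return the metropolitan area for a city, or None if it is not listed."""
    return CITY_TO_METRO.get(city.strip().lower())


def contacts_in_metro(contacts, metro):
    """Filter contact records by metropolitan area instead of by raw city."""
    return [c for c in contacts if metro_area(c["city"]) == metro]


# Invented sample records standing in for CRM contacts.
contacts = [
    {"name": "Contact A", "city": "Austin"},
    {"name": "Contact B", "city": "Round Rock"},
    {"name": "Contact C", "city": "Boston"},
]

austin_contacts = contacts_in_metro(contacts, "Austin area")
print([c["name"] for c in austin_contacts])
```

Filtering on the controlled "Austin area" value picks up the Round Rock contact that a plain filter on the city of Austin would have missed.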

Taxonomies and ontologies

Taxonomies, controlled vocabularies, and metadata alone are good for filtering or queries to find content that meets a set of criteria (based on metadata properties or faceted taxonomy selections). But what if you want to discover and explore relationships across the data? Instead of merely looking for all the contacts in the Austin area that have the customer or sales-qualified-lead status and have a contact owner, I want to limit that further to contacts whose employers in turn meet certain criteria, such as belonging to specific industries or meeting an annual revenue minimum. Another query example would be to find the locations in the past 10 years of industry events in which a specific organization has participated. These connections across different metadata types, vocabularies, or categories, are made with an ontology.

An ontology has, besides any hierarchical relationships characteristic of a taxonomy, additional semantic relationships that connect across types or classes of entities. Classes may be for metropolitan area, company name, person name, industry event name, etc. Semantic relationships across these classes may include is-employed-by-company/employs-employee, sponsors-event/has-sponsor, is-located-in/is-location-of. Attributes are additional metadata for the entities of each class, such as address. “Ontology” typically refers to just the knowledge model of classes, relationships and attribute types. But to become useful in information retrieval and data analysis, an ontology is connected to a taxonomy or other controlled vocabulary to extend those semantic relationships and attributes to all the concepts/terms.
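To make the idea of relationships across classes concrete, here is a small sketch in Python that stores facts as subject-relationship-object triples and answers the kind of query described earlier: contacts in an area whose employer belongs to a given industry. The entities are invented, and the relationship name "belongs-to-industry" is an assumption added for the example; the other relationship names come from the text.

```python
# Each fact is a (subject, relationship, object) triple. The entities are
# invented; "belongs-to-industry" is an assumed relationship name.
TRIPLES = [
    ("Contact A", "is-employed-by-company", "Company X"),
    ("Contact B", "is-employed-by-company", "Company Y"),
    ("Contact A", "is-located-in", "Austin area"),
    ("Contact B", "is-located-in", "Austin area"),
    ("Company X", "belongs-to-industry", "Software"),
    ("Company Y", "belongs-to-industry", "Retail"),
]


def objects(subject, relationship):
    """Return every object linked from the subject by the relationship."""
    return {o for s, r, o in TRIPLES if s == subject and r == relationship}


def subjects(relationship, obj):
    """Return every subject linked to the object by the relationship."""
    return {s for s, r, o in TRIPLES if r == relationship and o == obj}


def contacts_in_area_by_industry(area, industry):
    """Contacts located in an area whose employer belongs to an industry."""
    matches = []
    for contact in sorted(subjects("is-located-in", area)):
        employers = objects(contact, "is-employed-by-company")
        if any(industry in objects(e, "belongs-to-industry") for e in employers):
            matches.append(contact)
    return matches


print(contacts_in_area_by_industry("Austin area", "Software"))
```

The query hops from a contact to its employer and then to the employer's industry, which is the step that a flat metadata filter on contact records alone cannot take.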

Taxonomies and knowledge graphs

A growing use of ontologies is in knowledge graphs. Knowledge graphs extend the ontology+taxonomy knowledge organization system further by integrating instance data that is of a set too large to fit into controlled vocabularies and tends to reside in databases or spreadsheet cells. This could be the tens of thousands of contacts in a CRM or products and product parts in a PIM (product information management) system. The knowledge graph brings, actually or virtually, the data from these different systems into a graph database. A graph database is structured as nodes and edges (connections between nodes), rather than as the tables of rows and columns characteristic of a relational database. Data entities are at the nodes, and relations or property types are designated along the connecting edges. The graph structure thus supports the model of the applied ontology, which has classes and individuals at the nodes and semantic relations or attribute types describing the edges.
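The difference between rows and columns on one hand and nodes and edges on the other can be sketched in plain Python. This is a toy in-memory structure, not a graph database; the sample rows and the mapping from column names to edge labels are assumptions made for the illustration.

```python
from collections import defaultdict

# Rows as they might sit in a relational table (invented sample data).
rows = [
    {"contact": "Contact A", "employer": "Company X", "metro_area": "Austin area"},
    {"contact": "Contact B", "employer": "Company X", "metro_area": "Dallas area"},
]

# Assumed mapping from column name to the edge label used in the graph.
EDGE_LABELS = {
    "employer": "is-employed-by-company",
    "metro_area": "is-located-in",
}

nodes = set()
edges = []  # each edge is (source node, edge label, target node)

for row in rows:
    nodes.add(row["contact"])
    for column, label in EDGE_LABELS.items():
        nodes.add(row[column])
        edges.append((row["contact"], label, row[column]))

# Index incoming edges so the graph can be walked from either end.
incoming = defaultdict(list)
for source, label, target in edges:
    incoming[target].append((label, source))

print(sorted(nodes))
print(sorted(incoming["Company X"]))
```

In the table, "Company X" is a repeated cell value. In the graph it is a single node, so both contacts are reachable from it by following the employment edges backward.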

Why knowledge graphs? Taxonomies, controlled vocabularies, and metadata alone are good for finding information in a single content/data repository, database, or content management system. But often the same, similar, or related information exists in multiple different sources or systems, as data or as content “silos,” such as product information residing in the PIM, the web ecommerce platform, the marketing content management system, and the sales management system. By extracting the data from these different sources and storing it in a single graph database, the connections between the data from all sources can be made.
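As a rough sketch of what bringing siloed data together means, the Python below merges invented extracts from two systems into one list of edges. It assumes both systems identify the product by the same SKU; in practice, reconciling identifiers across systems is usually a substantial part of the work, and the storage would be a graph database rather than a list.

```python
# Invented extracts from two systems that describe the same product.
pim_records = [
    {"sku": "SKU-001", "product_name": "Desk lamp", "color": "Black"},
]
ecommerce_records = [
    {"sku": "SKU-001", "price": "39.00", "availability": "In stock"},
]


def to_edges(records, source_system):
    """Turn each record into (node, property, value, source) edges keyed by SKU."""
    edges = []
    for record in records:
        node = record["sku"]
        for prop, value in record.items():
            if prop != "sku":
                edges.append((node, prop, value, source_system))
    return edges


# One combined graph, with each edge remembering which system it came from.
graph = to_edges(pim_records, "PIM") + to_edges(ecommerce_records, "ecommerce")


def describe(node):
    """Collect every property known about a node, regardless of source."""
    return {prop: value for n, prop, value, _ in graph if n == node}


print(describe("SKU-001"))
```

Keeping the source system on each edge preserves provenance, so the unified view does not hide where a given value originated.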

Knowledge graphs link data that is in different repositories and systems, both structured and unstructured data, and as such provide a unified view of the data. Furthermore, with taxonomies tagged additionally to content, relevant data and content can be linked to each other.

Opportunities for taxonomies and data together

In conclusion, taxonomies alone are focused on content, but if you combine taxonomies with ontologies and/or diverse metadata, you extend the use of taxonomies to data. I am also seeing the connections of taxonomies and data in more places.

My current job title is Data and Knowledge Engineer, which reflects the combination of the knowledge management and data science realms. Actually, I am not a data engineer at all, but my department at Semantic Web Company has standardized the job titles, as we knowledge engineers and data engineers work very closely together on the same teams. This is to provide combined services and solutions to our customers.

Data and taxonomy are combined in jobs in other ways, too. Last year I had a contract taxonomy job that was heavily into data (managed in spreadsheets). In the other direction, data-related job postings have taxonomies in their job descriptions. A search today on “taxonomy” in descriptions of LinkedIn jobs brought up Data Governance Consultant, Data Analyst II - Taxonomy, Taxonomy Data Architect, Data Custodian, Data Governance Lead in the top 25 results, and on Indeed it brought up Data Analyst, Junior Data Analyst, Data Annotator, and Data Entry Specialist in the top 15 results.

I have most keenly noticed this combination of taxonomies and data by participating in more data-related conferences recently. In 2021, among other conferences, I have spoken on taxonomies at Data-Centric Architecture Forum in February, the European Data Conference on Reference Data and Semantics (ENDORSE) in March, the Knowledge Graph Conference in May, and Data Con LA in September. Others include my masterclass “Foundation for a Knowledge Graph: Taxonomy Design Best Practices” at the virtual Connected Data World conference on December 2, and a tutorial “Introduction to Taxonomies for Data Scientists” and presentation “The Future of Taxonomies – Linking Data to Knowledge” both at Data Day Texas in Austin, TX, in late spring 2022 (postponed from January 22, 2022).

Thursday, September 6, 2018

Using Linked and Other Open Vocabularies

Taxonomy terms assigned to content items make the content easier to find, whether in an internal system, on the web, or both. To make content easier to find or discover on the web, the use of taxonomy terms or tags is part of the broader application of search engine optimization (SEO). A lot has already been written by others regarding tips for creating and adding terms/labels/tags to web content to support SEO, such as how many and how specific they should be. For the taxonomist, who is interested not only in the terms alone but also in the larger taxonomy to which they belong, another question is whether using terms from shared, publicly available controlled vocabularies makes a difference in increasing content discoverability on the web.

Linked open data and linked open vocabularies

Shared, publicly available controlled vocabularies may or may not be linked or linkable, as linked open vocabularies. So, just because a controlled vocabulary is publicly available does not mean that it inherently supports linked data on the web.

“Linked data,” which usually is linked open data, refers to methods to interlink structured content in a way that can be read automatically by computers to enable the discovery of content on the web. It is described in a set of W3C specifications for web publishing that makes the data or content part of the Semantic Web. This means that instead of manually following individually created hyperlinks, semantic links and computer readable formats support automated relevant linkages among content. Linked data requires the use of named URIs to identify things, HTTP URIs for web lookup, and structured data using controlled vocabulary terms and dataset definitions expressed in an RDF standard framework. “Linked open data” additionally includes open use in accordance with an open license.

Terms in taxonomies can serve as labels to linked content as part of linked data. Additionally, although less common, taxonomy terms themselves can be the content that is linked to, if the taxonomy concepts are individually assigned URIs and HTTP addresses, and are in an RDF format.
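As an illustration of taxonomy concepts that carry their own URIs and are expressed in RDF, the sketch below writes two concepts as N-Triples lines. The post only says "an RDF format"; the choice of SKOS (a W3C vocabulary commonly used for taxonomies and thesauri) is an assumption made for the example, and the example.org base URI and the concept identifiers are placeholders, not a real published vocabulary.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "https://example.org/taxonomy/"  # placeholder, not a real vocabulary


def concept_triples(concept_id, label, broader_id=None):
    """Return N-Triples lines for one concept.

    The label is assumed to contain no quotes or backslashes, so no
    escaping is applied in this sketch.
    """
    subject = f"<{BASE}{concept_id}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .",
        f'{subject} <{SKOS}prefLabel> "{label}"@en .',
    ]
    if broader_id is not None:
        lines.append(f"{subject} <{SKOS}broader> <{BASE}{broader_id}> .")
    return lines


triples = concept_triples("managers", "Managers") + concept_triples(
    "marketing-managers", "Marketing managers", broader_id="managers"
)
print("\n".join(triples))
```

Each concept is identified by a URI rather than by its label, which is what allows other datasets to point at it. Publishing the concept so that its URI can actually be looked up over HTTP is a separate step that this sketch does not cover.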

Limitations to designating content as linked open data

If you have a document on the web that you want to have discovered as part of the Semantic Web, designating it as linked data is not so simple, because you need to include the machine-readable instructions, such as through a SPARQL endpoint or an API (application programming interface), in addition to the RDF designation. Not only is this technically outside the skills of most individual web content creators and taxonomists, but depending on how the content is managed, standard web content management systems or blog posting software may not even support editing the HTML of the page to insert such instructions.

Institutions may register their content with a linked open data repository. The main repository of linked open vocabularies is Linked Open Vocabularies (LOV), hosted by the Ontology Engineering Group of the Computer Science School at Universidad Politécnica de Madrid. An individual blogger, however, who would like to make an individual blog post linked open data, cannot easily achieve that status.

Simply linking to shared, open vocabularies

Thus, if linked data instructions cannot easily be included and traditional manual links back to the page (as by means of agreed-upon link exchanges) cannot be established for practical reasons, tagging could be done with terms from a publicly available controlled vocabulary that is not part of linked open data and linked open vocabularies. Two good examples are the labels of Wikidata and the Virtual International Authority File (VIAF).

Wikidata is a free, open, collaborative, multilingual collection of structured data. Its purpose is to support Wikipedia, Wikimedia Commons and other wikis of the Wikimedia movement, as well as anyone who wants to search, use, edit or consume its data. The data contained in the Wikidata repository consists of items, each with a unique name and ID. Currently there are 50,116,886 data items. Each item has a brief glossary definition, equivalent names in other languages, relationships (“statements”) to other data items (such as “subclass of” and “designed by”), and identifiers in other vocabularies (such as Freebase, Library of Congress authorities, and Quora topic).

VIAF, hosted by OCLC, contains just named entities (proper nouns). But it uniquely brings together and displays as a group the headings that are the authority used by each contributor for that term. So, it’s not exactly a controlled vocabulary. VIAF has over 40 international member-contributors, most of which are national libraries.

Is there any benefit in tagging with and linking to terms that are part of a controlled vocabulary which is publicly available but is not a linked open vocabulary, such as Wikidata or VIAF? A colleague of mine proposed finding out by experimenting with tagging the same content with terms from different sources. Results will be shared in a later blog post.

Thursday, August 30, 2018

Taxonomy Hierarchical Relationship Issues

A common feature of taxonomies is the hierarchical relationship between terms. Terms are linked to each other in a relationship that indicates that one is the broader term (BT) of the other, and in the other direction, one is the narrower term (NT) of the other. You don’t need to be a taxonomist to understand this basic principle. However, even taxonomists can be challenged sometimes in determining whether it’s correct to put two terms in a hierarchical relationship.

Standards for Hierarchical Relationships

There are guidelines for the hierarchical relationship provided by the standards of ANSI/NISO Z39.19-2005 (R2010) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies and ISO 25964-1: Information and Documentation — Thesauri and Interoperability with other Vocabularies — Part 1: Thesauri for Information Retrieval. The standards say that in a correct hierarchical relationship the term that is narrower to the broader term may be a specific type of the generic broader term, a named instance of the generic broader term, or an integral part of the whole broader term.
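The reciprocal broader/narrower relationship and the three permitted kinds can be sketched as a small Python data structure. The kind names ("generic", "instance", "whole-part") are shorthand chosen for this example rather than wording quoted from the standards, and the sample terms are invented.

```python
from collections import defaultdict

# Shorthand names for the three kinds of hierarchical relationship.
KINDS = {"generic", "instance", "whole-part"}

broader = defaultdict(dict)   # narrower term -> {broader term: kind}
narrower = defaultdict(dict)  # broader term -> {narrower term: kind}


def add_hierarchical(narrower_term, broader_term, kind):
    """Record a BT/NT pair in both directions, rejecting unknown kinds."""
    if kind not in KINDS:
        raise ValueError(f"unknown hierarchical relationship kind: {kind}")
    if narrower_term == broader_term:
        raise ValueError("a term cannot be its own broader term")
    broader[narrower_term][broader_term] = kind
    narrower[broader_term][narrower_term] = kind


# Invented examples, one per kind.
add_hierarchical("Desk lamps", "Lamps", "generic")        # a type of lamp
add_hierarchical("Austin", "Cities", "instance")          # a named instance
add_hierarchical("Round Rock", "Austin area", "whole-part")  # a part of the area

print(dict(narrower["Lamps"]))
print(dict(broader["Round Rock"]))
```

Storing every pair in both directions keeps BT and NT consistent by construction, and rejecting any other kind is one simple way to force the question of whether a proposed relationship is really hierarchical.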

These standards, however, are for thesauri, not taxonomies. Thesauri have additionally a non-hierarchical associative relationship between terms, known as “related term” (RT). In taxonomies which lack related-term relationships, the conditions under which the hierarchical relationship is permitted need not be followed quite as strictly. Nevertheless, the thesaurus standards for creating the hierarchical relationship should be the starting point and the default for hierarchical relationships in taxonomies.

Challenges in Coming up with Broader Terms

Hierarchical taxonomies may be created from the top down, the bottom up, or a combination of both approaches. The top-down approach involves creating broadest categories first, then adding narrower terms and then adding narrower terms to narrower terms. This approach makes it easier to create good hierarchical relationships. In reality, though, we don’t always create terms based purely on their broader terms. Rather, analysis of content yields specific terms that are needed, so some degree of bottom-up taxonomy creation takes place. In the bottom-up approach there may be the challenge of determining and creating the appropriate broader term.

When I have been completely challenged in coming up with a broader term, I admit I have looked up the term in Wikipedia to see what is listed under “Categories” for that term at the bottom of the page. “Categories” implies a broader term, but these are not necessarily good or correct broader terms. An example of Categories that are not exactly broader terms is for the term Stress management: Stress, Management by type, Psychotherapy, and Psychiatric treatments. Stress management is not exclusively done as (is a part of) Psychotherapy or Psychiatric treatments, so those are not suitable broader terms. “Management by type” is definitely not a good taxonomy term, and the term Management alone has a different meaning of its own. As for the term “Stress,” this is more complicated. Technically, Stress management is not a kind of Stress or a part of Stress, so Stress should not be its broader term. If this were in a thesaurus, they would definitely be related terms. If your controlled vocabulary is not a thesaurus, and the related-term relationship is not supported, then you may ignore the thesaurus rule in this case, and make Stress the broader term of Stress management. This relationship is likely to be expected and accepted by users.

Challenges in Special Circumstances

Even when creating a taxonomy from the top down, taxonomists may encounter challenges or confusion with the hierarchical relationships. One challenging case is the concept of membership. Things and their members could be industries and their companies or international organizations and their member countries. It may seem logical to list the affiliate members “under” the industry or organization of which they are a part, but this is based too much on context and time. Companies can change their industries, and countries can change their international organization affiliation. More significantly, the whole-part hierarchical relationship is about integral parts, not participatory taking “part.” Finally, it may be more practical to put each type (companies, industries, countries, organizations) in a separate facet and not establish any relationship between them in a taxonomy (in contrast to a thesaurus or ontology).

Another potentially confusing case involves occupations and job titles. The subordinate nature of narrower terms should not be confused with the subordinate role of one job title to another. Thus, while a marketing specialist reports to a marketing manager, Marketing managers is not a broader term of Marketing specialists. Furthermore, while a marketing manager reports to a marketing director, we might make the hierarchical relationship in the other direction, with Marketing directors as a narrower term to Marketing managers, because directors are a kind of manager. Managers include directors.

Perhaps the most confusing case involves specificity which is not taxonomical specificity. For example, Syllabi (plural of syllabus), as instructional outlines, in a certain sense are more specific than Curricula (plural of curriculum), which are also a kind of instructional outline. Syllabi are for individual courses, and curricula are for a series of courses, such as an entire program of study or degree. Thus, it might seem logical that Syllabi would have the broader term of Curricula. But a syllabus is neither a specific type of curriculum, nor is it part of a curriculum. It is something different. So, it would be better not to have Curricula as a broader term of Syllabi, even in a taxonomy that is lacking related-term relationships.

Parent-Child Confusions

Sometimes the hierarchical relationship is referred to as “parent-child.” While it’s correct that a subsidiary company is a narrower term of its parent company, because it is part of the parent company, a biological child is not a narrower term of its parent, because it is not a part of the parent, but rather an offspring. To avoid confusion, it’s better to describe the relationship as broader/narrower, rather than as parent/child.