Saturday, October 30, 2021

Taxonomies for Data

Coming from an editorial content background, I have always valued taxonomies for making content findable, but more recently I have come to appreciate how taxonomies can also play a role in making data accessible and useful.

Taxonomies have successfully aided people in finding and retrieving desired content since the 1990s and even decades earlier, if we consider thesauri within the scope of taxonomies. Nevertheless, the focus had always been on content: originally printed content such as periodical articles, web pages, intranet or CMS pages and attached documents, etc., and then multimedia content, such as images, animation or video clips, audio files. Each content item gets tagged with taxonomy terms of different types for what is about and for kind of content it is. Taxonomies have become increasingly important as content volume and types have grown, especially as more people in varied roles create content.

Data dashboard on computer screen
Meanwhile, data has grown even faster in its volume and potential value. We are hearing more and more about big data, data warehouses, data lakes, data fabrics, data catalogs, data analytics, master data management, data governance, FAIR data, data-centric architecture, data-driven enterprises, and data science in general. Tools and technologies to make use of the data have included programming/scripting, machine learning, algorithms, natural language processing, and other forms of artificial intelligence.

These tools and technologies for data do not replace taxonomies and other controlled vocabularies, though, which still have an important role to play in connecting people to the desired data and information, and ultimately knowledge.  I see two ways in which taxonomies are linked to data:

1.  Managing and understanding the data in a standardized way with better metadata, which depends on controlled vocabularies.

2.  Connecting the data with graph databases, knowledge graphs, ontologies, and ultimately taxonomies.


Taxonomies and metadata


Metadata refers to the standardized data types, properties, fields, or elements, and the specific individual values that populate those types or properties. From a content perspective, we think of metadata as serving content management and retrieval, such as the content’s format type, title, source, creator, date, language, subjects, category, audience, etc. But metadata exists in databases and spreadsheets, too, where column headers are the metadata properties. For example, contact metadata would include name, phone number, email address, city/state, country, contact type, initial contact date, contact owner, etc. Product metadata would include SKU number, product name, product type/category, price, color, features, supply source, retail availability, etc. Transactional metadata would include purchased product name, purchaser, purchase date, purchase price, purchase location.

Data can be better managed and analyzed if the metadata properties and values are standardized and controlled. Controlled vocabularies should be used to standardize the metadata for many of the properties: format type source, subjects, category, purpose, country, contact type, product name, product category, color, features, availability, etc. Hierarchical taxonomies serve some of this metadata, such as product categories.

As an example, I’m planning to attend a conference in Austin, TX, and I wanted to look up contacts in the Austin area in my CRM (customer relationship management) system. Filtering results by city, I found some with the city of Austin, but others had the city of Round Rock. Filtering on Austin, I would have missed those, had I not known that Round Rock was a suburb of Austin. What was needed was a metadata property for “Metropolitan area,” rather than “City,” a controlled list of metropolitan areas, and Round Rock as an alternative label for Austin area in that controlled vocabulary.

Taxonomies and ontologies


Taxonomies, controlled vocabularies, and metadata alone are good for filtering or queries to find content that meets a set of criteria (based on metadata properties or faceted taxonomy selections). But what if you want to discover and explore relationships across the data? Instead of merely looking for all the contacts in the Austin area that have the customer or sales-qualified-lead status and have a contact owner, I want to limit that further to contacts whose employers in turn meet certain criteria, such as belonging to specific industries or meeting an annual revenue minimum. Another query example would be to find the locations in the past 10 years of industry events in which a specific organization has participated. These connections across different metadata types, vocabularies, or categories, are made with an ontology.

An ontology has, besides any hierarchical relationships characteristic of a taxonomy, additional semantic relationships that connect across types or classes of entities. Classes may be for metropolitan area, company name, person name, industry event name, etc. Semantic relationships across these classes may include is-employed-by-company/employs-employee, sponsors-event/has-sponsor, is-located-in/is-location-of. Attributes are additional metadata for the entities of each class, such as address. “Ontology” typically refers to just the knowledge model of classes, relationships and attribute types. But to become useful in information retrieval and data analysis, an ontology is connected to a taxonomy or other controlled vocabulary to extend those semantic relationships and attributes to all the concepts/terms.  

Taxonomies and knowledge graphs


A growing use of ontologies is in knowledge graphs. Knowledge graphs extend the ontology+taxonomy knowledge organization system further by integrating instance data that is a of set too large to fit into controlled vocabularies and tends to reside in databases or spreadsheet cells. This could be the 10,000s of contacts in a CRM or products and product parts in a PIM (product information management) system. The knowledge graph brings, actually or virtually, the data from these different systems into a graph database. A graph database is structured of nodes and edges (connections between nodes), rather than of tables of rows and columns characteristic of a relational database. Data entities are at the nodes and connections of relations or property types are designated along the connecting edges. The graph structure thus supports the model of the applied ontology, which has classes and individuals at the nodes and semantic relations or attribute types describing the edges.

Why knowledge graphs? Taxonomies, controlled vocabularies, and metadata alone are good for finding information in a single content/data repository, database, or content management system. But often the same, similar, or related information exists in multiple different sources or systems, as data or as content “silos,” such as product information residing in the PIM, the web ecommerce platform, the marketing content management system, and the sales management system. By extracting the data from these different sources and storing it in a single graph database, the connections between the data from all sources can be made.

Knowledge graphs link data that is in different repositories and systems, both structured and unstructured data and as such provide a unified view of the data. Furthermore, with taxonomies tagged additionally to content, relevant data and content and be linked to each other.

Opportunities for taxonomies and data together

In conclusion, taxonomies alone are focused on content, but if you combine taxonomies with ontologies and/or diverse metadata, you extend the use of taxonomies to data. I am also seeing the connections of taxonomies and data in more places.

My current job title is Data and Knowledge Engineer, which reflects the combination of the knowledge management and data science realms. Actually, I am not a data engineer at all, but my department at Semantic Web Company has standardized the job titles, as we knowledge engineers and data engineers work very closely together on the same teams. This is to provide combined services and solutions to our customers.

In other ways data and taxonomy are combined in jobs. Last year I had a contract taxonomy job that was heavily into data (managed in spreadsheets). In the other direction, data related job postings have taxonomies in their job descriptions. A search today on “taxonomy” in descriptions of LinkedIn jobs brought up Data Governance Consultant, Data Analyst II - Taxonomy, Taxonomy Data Architect, Data Custodian, Data Governance Lead in the top 25 results, and on Indeed it brought up Data Analyst, Junior Data Analyst, Data Annotator, and Data Entry Specialist in the top 15 results.

I have most keenly noticed this combination of taxonomies and data by participating in more data-related conferences recently. In 2021, among other conferences, I have spoken on taxonomies at Data-Centric Architecture Forum in February, the European Data Conference on Reference Data and Semantics (ENDORSE) in March, the Knowledge Graph Conference in May, and Data Con LA in September. Others include my masterclass “Foundation for a Knowledge Graph: Taxonomy Design Best Practices” at the virtual Connected Data World conference on December 2, and a tutorial “Introduction to Taxonomies for Data Scientists”  and presentation “The Future of Taxonomies – Linking Data to Knowledge” both at Data Day Texas in Austin, TX, in late spring 2022 (postponed from January 22, 2022).