The Accidental Taxonomist: Trends in taxonomies

Saturday, October 30, 2021

Taxonomies for Data

Coming from an editorial content background, I have always valued taxonomies for making content findable, but more recently I have come to appreciate how taxonomies can also play a role in making data accessible and useful.

Taxonomies have successfully aided people in finding and retrieving desired content since the 1990s and even decades earlier, if we consider thesauri within the scope of taxonomies. Nevertheless, the focus had always been on content: originally printed content such as periodical articles, then web pages, intranet or CMS pages and attached documents, and later multimedia content such as images, animation or video clips, and audio files. Each content item gets tagged with taxonomy terms of different types, for what it is about and for what kind of content it is. Taxonomies have become increasingly important as content volume and types have grown, especially as more people in varied roles create content.

Meanwhile, data has grown even faster in its volume and potential value. We are hearing more and more about big data, data warehouses, data lakes, data fabrics, data catalogs, data analytics, master data management, data governance, FAIR data, data-centric architecture, data-driven enterprises, and data science in general. Tools and technologies to make use of the data have included programming/scripting, machine learning, algorithms, natural language processing, and other forms of artificial intelligence.

These tools and technologies for data do not replace taxonomies and other controlled vocabularies, though, which still have an important role to play in connecting people to the desired data and information, and ultimately knowledge. I see two ways in which taxonomies are linked to data:

1. Managing and understanding the data in a standardized way with better metadata, which depends on controlled vocabularies.

2. Connecting the data with graph databases, knowledge graphs, ontologies, and ultimately taxonomies.

Taxonomies and metadata

Metadata refers to the standardized data types, properties, fields, or elements, and the specific individual values that populate those types or properties. From a content perspective, we think of metadata as serving content management and retrieval, such as the content’s format type, title, source, creator, date, language, subjects, category, audience, etc. But metadata exists in databases and spreadsheets, too, where column headers are the metadata properties. For example, contact metadata would include name, phone number, email address, city/state, country, contact type, initial contact date, contact owner, etc. Product metadata would include SKU number, product name, product type/category, price, color, features, supply source, retail availability, etc. Transactional metadata would include purchased product name, purchaser, purchase date, purchase price, purchase location.

Data can be better managed and analyzed if the metadata properties and values are standardized and controlled. Controlled vocabularies should be used to standardize the metadata for many of the properties: format type, source, subjects, category, purpose, country, contact type, product name, product category, color, features, availability, etc. Hierarchical taxonomies serve some of this metadata, such as product categories.

As an example, I’m planning to attend a conference in Austin, TX, and I wanted to look up contacts in the Austin area in my CRM (customer relationship management) system. Filtering results by city, I found some with the city of Austin, but others had the city of Round Rock. Filtering on Austin, I would have missed those, had I not known that Round Rock was a suburb of Austin. What was needed was a metadata property for “Metropolitan area,” rather than “City,” a controlled list of metropolitan areas, and Round Rock as an alternative label for Austin area in that controlled vocabulary.
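The Austin example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary structure, the function names, and the sample contacts are all hypothetical and do not reflect any particular CRM system.

```python
# Hypothetical controlled vocabulary of metropolitan areas:
# preferred label -> alternative labels (cities that belong to the area).
METRO_AREAS = {
    "Austin area": ["Austin", "Round Rock"],
}


def build_lookup(vocabulary):
    """Map every label (preferred or alternative), lowercased, to its preferred label."""
    lookup = {}
    for preferred, alternatives in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for alt in alternatives:
            lookup[alt.lower()] = preferred
    return lookup


def contacts_in_metro(contacts, metro, lookup):
    """Return contacts whose city resolves to the given metropolitan area."""
    return [c for c in contacts if lookup.get(c["city"].lower()) == metro]


# Hypothetical contact records, as they might be exported from a CRM.
contacts = [
    {"name": "Contact A", "city": "Austin"},
    {"name": "Contact B", "city": "Round Rock"},
    {"name": "Contact C", "city": "Dallas"},
]

LOOKUP = build_lookup(METRO_AREAS)
austin_area = contacts_in_metro(contacts, "Austin area", LOOKUP)
print([c["name"] for c in austin_area])
```

Filtering on the controlled "Austin area" value finds the Round Rock contact, which a plain filter on the city of Austin would miss.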

Taxonomies and ontologies

Taxonomies, controlled vocabularies, and metadata alone are good for filtering or queries to find content that meets a set of criteria (based on metadata properties or faceted taxonomy selections). But what if you want to discover and explore relationships across the data? Instead of merely looking for all the contacts in the Austin area that have the customer or sales-qualified-lead status and have a contact owner, I want to limit that further to contacts whose employers in turn meet certain criteria, such as belonging to specific industries or meeting an annual revenue minimum. Another query example would be to find the locations in the past 10 years of industry events in which a specific organization has participated. These connections across different metadata types, vocabularies, or categories are made with an ontology.

An ontology has, besides any hierarchical relationships characteristic of a taxonomy, additional semantic relationships that connect across types or classes of entities. Classes may be for metropolitan area, company name, person name, industry event name, etc. Semantic relationships across these classes may include is-employed-by-company/employs-employee, sponsors-event/has-sponsor, is-located-in/is-location-of. Attributes are additional metadata for the entities of each class, such as address. “Ontology” typically refers to just the knowledge model of classes, relationships and attribute types. But to become useful in information retrieval and data analysis, an ontology is connected to a taxonomy or other controlled vocabulary to extend those semantic relationships and attributes to all the concepts/terms.
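As a rough sketch of how semantic relationships across classes support such a query, the relationships can be stored as subject-relation-object statements and followed from people to their employers. All entity names, attribute values, and function names below are hypothetical; a real implementation would more likely use a graph database and a query language rather than Python lists.

```python
# Hypothetical statements using relation names like those described above.
TRIPLES = [
    ("Person A", "is-located-in", "Austin area"),
    ("Person A", "is-employed-by-company", "Company X"),
    ("Person B", "is-located-in", "Austin area"),
    ("Person B", "is-employed-by-company", "Company Y"),
    ("Person C", "is-located-in", "Dallas area"),
    ("Person C", "is-employed-by-company", "Company X"),
]

# Hypothetical attributes for entities of the company class.
COMPANY_ATTRIBUTES = {
    "Company X": {"industry": "Software", "annual_revenue": 50_000_000},
    "Company Y": {"industry": "Retail", "annual_revenue": 2_000_000},
}


def objects(subject, relation):
    """Entities that the subject points to through the given relation."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation, obj):
    """Entities that point to the object through the given relation."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


def contacts_by_employer(metro, industries, min_revenue):
    """People in a metro area whose employer meets industry and revenue criteria."""
    matches = []
    for person in subjects("is-located-in", metro):
        for company in objects(person, "is-employed-by-company"):
            attrs = COMPANY_ATTRIBUTES[company]
            if attrs["industry"] in industries and attrs["annual_revenue"] >= min_revenue:
                matches.append(person)
    return matches


result = contacts_by_employer("Austin area", {"Software"}, 10_000_000)
print(result)
```

The query crosses two classes (person and company), which a filter on the contact metadata alone could not do.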

Taxonomies and knowledge graphs

A growing use of ontologies is in knowledge graphs. Knowledge graphs extend the ontology+taxonomy knowledge organization system further by integrating instance data that is of a set too large to fit into controlled vocabularies and tends to reside in databases or spreadsheet cells. This could be the tens of thousands of contacts in a CRM or products and product parts in a PIM (product information management) system. The knowledge graph brings, actually or virtually, the data from these different systems into a graph database. A graph database is structured as nodes and edges (connections between nodes), rather than as the tables of rows and columns characteristic of a relational database. Data entities are at the nodes, and relations or property types are designated along the connecting edges. The graph structure thus supports the model of the applied ontology, which has classes and individuals at the nodes and semantic relations or attribute types describing the edges.

Why knowledge graphs? Taxonomies, controlled vocabularies, and metadata alone are good for finding information in a single content/data repository, database, or content management system. But often the same, similar, or related information exists in multiple different sources or systems, as data or as content “silos,” such as product information residing in the PIM, the web ecommerce platform, the marketing content management system, and the sales management system. By extracting the data from these different sources and storing it in a single graph database, the connections between the data from all sources can be made.
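A minimal sketch of the idea, assuming two hypothetical "silos" (a CRM export and a PIM export) that share a SKU identifier: rows from both are loaded as nodes and labeled edges into one in-memory graph, so a question can be answered across the two sources. The class, relation names, and records are invented for illustration and stand in for a real graph database.

```python
from collections import defaultdict

# Hypothetical rows from two separate systems.
crm_rows = [{"contact": "Contact A", "purchased_sku": "SKU-1"}]
pim_rows = [{"sku": "SKU-1", "product_type": "Widgets"}]


class Graph:
    """A tiny in-memory graph: nodes connected by labeled edges."""

    def __init__(self):
        # node -> list of (relation, target node)
        self.edges = defaultdict(list)

    def add_edge(self, source, relation, target):
        self.edges[source].append((relation, target))

    def neighbors(self, node, relation):
        return [t for r, t in self.edges[node] if r == relation]


graph = Graph()
for row in crm_rows:
    graph.add_edge(row["contact"], "purchased", row["purchased_sku"])
for row in pim_rows:
    graph.add_edge(row["sku"], "has-product-type", row["product_type"])


def product_types_purchased(contact):
    """Follow edges from a CRM contact through a SKU to PIM product types."""
    types = []
    for sku in graph.neighbors(contact, "purchased"):
        types.extend(graph.neighbors(sku, "has-product-type"))
    return types


print(product_types_purchased("Contact A"))
```

Neither source alone knows which product types a contact has bought; the connection appears only once both are in the same graph.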

Knowledge graphs link data that is in different repositories and systems, both structured and unstructured, and as such provide a unified view of the data. Furthermore, with taxonomies additionally tagged to content, relevant data and content can be linked to each other.

Opportunities for taxonomies and data together

In conclusion, taxonomies alone are focused on content, but if you combine taxonomies with ontologies and/or diverse metadata, you extend the use of taxonomies to data. I am also seeing the connections of taxonomies and data in more places.

My current job title is Data and Knowledge Engineer, which reflects the combination of the knowledge management and data science realms. Actually, I am not a data engineer at all, but my department at Semantic Web Company has standardized the job titles, as we knowledge engineers and data engineers work very closely together on the same teams. This is to provide combined services and solutions to our customers.

Data and taxonomy are combined in jobs in other ways, too. Last year I had a contract taxonomy job that was heavily into data (managed in spreadsheets). In the other direction, data-related job postings mention taxonomies in their job descriptions. A search today on “taxonomy” in descriptions of LinkedIn jobs brought up Data Governance Consultant, Data Analyst II - Taxonomy, Taxonomy Data Architect, Data Custodian, Data Governance Lead in the top 25 results, and on Indeed it brought up Data Analyst, Junior Data Analyst, Data Annotator, and Data Entry Specialist in the top 15 results.

I have most keenly noticed this combination of taxonomies and data by participating in more data-related conferences recently. In 2021, among other conferences, I have spoken on taxonomies at Data-Centric Architecture Forum in February, the European Data Conference on Reference Data and Semantics (ENDORSE) in March, the Knowledge Graph Conference in May, and Data Con LA in September. Others include my masterclass “Foundation for a Knowledge Graph: Taxonomy Design Best Practices” at the virtual Connected Data World conference on December 2, and a tutorial “Introduction to Taxonomies for Data Scientists” and presentation “The Future of Taxonomies – Linking Data to Knowledge” both at Data Day Texas in Austin, TX, in late spring 2022 (postponed from January 22, 2022).

Friday, April 30, 2021

Taxonomy Trends

Last fall I gave an 8-minute video presentation as part of the SEMANTiCS Video Forum 2020 on the subject of taxonomy trends, but the short talk allowed time to discuss only two of the past year’s trends. More recently, I reflected on longer-term trends in taxonomies when Jane Dysart, the chair of the Computers in Libraries conference, suggested that my pre-conference taxonomy workshop last month also cover what’s new with taxonomies and assigned me the workshop title “Taxo Update: Latest in Designing & Maintaining Taxonomies.” While, by their nature and purpose, taxonomies should remain somewhat consistent in their design, I came up with some ideas in various sections of the workshop presentation. Now that the event is past, I’ve collected my observations of the taxonomy trends that I included in that workshop.

Convergence of types

A trend in the broader realm of knowledge organization is the convergence of different types. We are seeing a convergence of taxonomies and thesauri, which is due to factors including the widespread adoption of SKOS (Simple Knowledge Organization System), which supports both taxonomies and thesauri fully. Vocabulary management software, which is becoming more widely adopted than just using spreadsheets or the basic taxonomy editing feature of a content management system, supports both taxonomies and thesauri with no distinction. There may also be a growing preference to have the features of both: a dominating hierarchical structure as in taxonomies, and the benefit of additional associative (non-hierarchical) relations as supported in thesauri.

There is also a convergence of taxonomies and ontologies. This is also partly due to software tools, such as PoolParty, that support both taxonomies and ontologies in an integrated manner. There is a growing interest in ontology features, such as semantic relations and custom attributes, without having a large complex ontology, so a simple ontology can be applied as a semantic layer to existing taxonomies. This brings up the fact that there is a growing number of taxonomies in existence that can be utilized within an ontology, rather than being replaced by an ontology. Finally, there is an increasing interest in ontologies as they form a basis of knowledge graphs, which are becoming more popular.

Interest in standards

In the past, the focus on taxonomy-related standards was mostly on ANSI/NISO Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies and its related ISO standard, ISO 25964 Thesauri and interoperability with other vocabularies. Their emphasis on best practices for thesauri has perhaps limited these standards somewhat in their broader application to taxonomies. More recently, however, there has emerged a greater interest in interoperability of taxonomies and other controlled vocabularies, which is recognized and addressed in the second part of ISO 25964.

Even more significant is the increased adoption of SKOS and other W3C (World Wide Web Consortium) guidelines and recommendations, which directly support interoperability and exchange and sharing of vocabularies. As the number of taxonomies and other controlled vocabularies grows, there is a greater interest in re-using parts of them, sharing them, and linking them, which is enabled by representing the data in a standard format. The SKOS model can also be expressed in RDF (Resource Description Framework) triples, which makes it suitable for general Semantic Web sharing and linking, whether on the web or behind the firewall with Semantic Web standards. SKOS has also become the standard supported by most taxonomy management software.

Besides supporting interoperability, another trend coming out of SKOS is a shift in thinking of terms to that of concepts. Terms are strings of text, but concepts are ideas that may have various labels. Thus, people talk about “things, not strings.”
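To illustrate the concept-versus-term distinction, here is a small sketch of a concept as a "thing" identified by a URI, with labels attached to it, and of how it could be expressed as subject-predicate-object triples using the SKOS property names. The example.org URIs, the Concept class, and the to_triples function are hypothetical; the SKOS and RDF namespace URIs are the published ones. A real project would more likely use an RDF library or taxonomy management software than hand-built tuples.

```python
from dataclasses import dataclass, field

# Published namespace URIs for SKOS and for rdf:type.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class Concept:
    """A concept is identified by its URI, not by any one of its labels."""
    uri: str
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: list = field(default_factory=list)  # URIs of broader concepts


def to_triples(concept):
    """Express a concept as (subject, predicate, object) triples."""
    triples = [
        (concept.uri, RDF_TYPE, SKOS + "Concept"),
        (concept.uri, SKOS + "prefLabel", concept.pref_label),
    ]
    triples += [(concept.uri, SKOS + "altLabel", label) for label in concept.alt_labels]
    triples += [(concept.uri, SKOS + "broader", uri) for uri in concept.broader]
    return triples


# Hypothetical concept with made-up example.org URIs.
austin = Concept(
    uri="https://example.org/metro/austin-area",
    pref_label="Austin area",
    alt_labels=["Round Rock"],
    broader=["https://example.org/region/texas"],
)

triples = to_triples(austin)
for triple in triples:
    print(triple)
```

Changing or adding a label changes only a label triple; the concept's URI, and everything linked to it, stays the same.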

As for the standards for taxonomy and thesaurus best practices design, ANSI/NISO Z39.19 is not forgotten but rather there is sufficient interest in the taxonomy community to review and revise this standard again soon. I expect work on that to start in late 2021 or early 2022, and I hope to be involved. I will report more on that in a future blog post.

Trends in taxonomy structural design

In hierarchical taxonomies, there is a trend that hierarchies are increasingly created for purposes other than full display for end-user browsing. Traditionally, the hierarchical design structure of taxonomies was solely for the purpose of serving end-users who would be browsing and need guidance in going from broad categories to narrower topics. The associative (related term or see also) relationship also guides users who are browsing and those who are doing manual indexing/tagging to identify related concepts of interest. As fully browsable taxonomies are becoming less common (due to their growing size and the availability of alternative methods of search and findability), and more indexing is automated, hierarchical and associative relationships between concepts are less often implemented to support browsing, and are more often used to support auto-tagging, providing context for a concept’s meaning by the presence of broader and related concepts.
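As a simplified illustration of how broader and related concepts can give context for auto-tagging, the sketch below disambiguates two concepts that share the same label by counting how many of their broader or related labels also appear in the text. The concepts, identifiers, and scoring rule are hypothetical and far cruder than what actual auto-tagging tools do.

```python
# Hypothetical concepts that share a label but have different context
# (labels of their broader and related concepts).
CONCEPTS = {
    "mercury-planet": {"label": "Mercury", "context": ["Planets", "Solar system"]},
    "mercury-element": {"label": "Mercury", "context": ["Chemical elements", "Metals"]},
}


def auto_tag(text, concepts):
    """Pick the concept whose label appears in the text and whose
    broader/related labels are best supported by the text."""
    lowered = text.lower()
    candidates = [cid for cid, c in concepts.items() if c["label"].lower() in lowered]

    def score(cid):
        return sum(1 for ctx in concepts[cid]["context"] if ctx.lower() in lowered)

    best = max(candidates, key=score, default=None)
    # Without any supporting context, leave the text untagged.
    if best is None or score(best) == 0:
        return None
    return best


tag = auto_tag("Mercury is the smallest of the planets in the solar system.", CONCEPTS)
print(tag)
```

The hierarchy here is never shown to an end user; it serves only to tell the tagger which "Mercury" is meant.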

When the relationships between concepts are not displayed to end users, the taxonomy structure does not necessarily need to be as consistent, such as always having a set number of hierarchical levels in all places of the taxonomy. A taxonomy does not have to appear as complete and comprehensive, either, but rather it merely represents the content. Associative relationships between concepts may also be implemented more inconsistently. This is another factor that contributes to the convergence of taxonomies and thesauri, since by definition thesauri have associative relationships and taxonomies do not. But you may end up creating a taxonomy/thesaurus with just a few associative relationships.

Despite the trend of fewer fully displayed hierarchical taxonomies, there are still many taxonomies that are fully displayed, such as in ecommerce applications. A growing trend is to combine different methods of expanding from one level to the next between different levels of the same taxonomy. There is also more sophistication in integrating both common and custom facets into different levels of a hierarchy.

Trends in uses

Last, but certainly not least, is the trend in wider adoption of taxonomies for various uses. This was the topic of my prior blog post, Industry Uses for Taxonomies.

Saturday, April 25, 2015

Trends in Hierarchical Taxonomy Displays

Taxonomies connect users to content. So, how a taxonomy is displayed to users is very important in its effectiveness. This is a topic about which I gave a conference presentation back in 2011 and will present again next week. As I update my previous presentation, looking at some of the same public websites with taxonomies, I have observed some changes that might be considered as trends.

While faceted taxonomies (used to filter/refine/limit results by certain criteria with choices of taxonomy terms) have become more common on ecommerce or other database websites, they are not suitable in all circumstances, and when a taxonomy has a large number of topical terms, a hierarchical arrangement of those topics might be better.

Fully displayed hierarchical taxonomies, however, are more difficult to find. They are not as often the default. Some have disappeared entirely, such as the Yahoo directory, which was discontinued in December 2014 after 20 years. (Admittedly, trying to classify as many websites as possible into a hierarchy, as the web keeps growing, is a never-ending task.) In other cases, the search box is more prominent on the page, and the link to browse categories needs to be hunted for.

In the past, I had observed two main kinds of hierarchical displays: one-level-per-page and expandable hierarchies with plus signs. The first has evolved, the second has become rare, and a third method has emerged.

One level of taxonomy hierarchy per page was the design of the former Yahoo directory and early on was the style followed on other sites. An example that closely follows the Yahoo Directory is the dmoz/Open Directory Project. A list of category labels or topics at each level takes up the entire screen/page display, without the display of other content. Displaying additional content on every page has become important, so hierarchical taxonomy categories now tend to be confined to more compact lists to free up space on the web page for content. This works for some taxonomies, not all. Meanwhile, a list of terms at the same level that takes up the entire page is a style that is rarely followed anymore.

Expandable hierarchy “trees,” typically with plus signs next to topics to expand a topic’s subcategories, have become quite rare, at least on public websites. An example is the USA Today topics. This hierarchical taxonomy design had been developed based on the recognizable desktop file folder structure, such as in Windows. In the meantime, users have become familiar with different representations of topic hierarchies on the web, so mimicking expandable file menus is no longer the only way to engage users. Expandable topic hierarchies are not as easy to update and change on websites, and they can take a long time to load on the web page. Expandable hierarchies allow the users to have more than one hierarchical level expanded at once, which facilitates exploring the taxonomy. As much as we taxonomists might enjoy browsing a taxonomy, the goal is to get users to content rather than have them spend time exploring the taxonomy.

A third method of displaying multiple levels of a hierarchical taxonomy is through “fly-out” subcategory lists. Examples include Lynda.com (under "Browse the Library") and Books & Authors. I had not noticed this method before, so it seems to be a new trend. They are similar to submenus in website navigation, but rather than for website navigation, the topics are linked to indexed content items, which are listed in a result set for each subtopic. Fly-out subcategories allow the users to still see the parent category list, if the user wanted to back out to it, like in an expandable tree hierarchy. But unlike an expandable tree hierarchy, you cannot have multiple parent categories expanded at the same time, which is not that important anyway. The fly-out subcategory style is thus a positive trend in hierarchical taxonomy displays.