Saturday, November 27, 2021

Attributes in Taxonomies

When I had done consulting for ecommerce taxonomy clients years ago, and they would refer to the taxonomy facets for products as “attributes,” I felt that might be confusing, because I considered “attributes” something else: a characteristic like metadata of a taxonomy term or a feature of an ontology. I have since come to realize that facets in some cases, especially in ecommerce, can be considered attributes, and they can even be managed in an ontology that is combined with the taxonomy.

Facets in a faceted taxonomy are various taxonomy term “types” that function as refinements or filters in the user interface for limiting search results on content that share similar types of terms or attributes. Users refine or filter their searches by selecting a term or value from each of two or more facets. In a periodical article research database offered by a library, facets might be subject, geographic place, named person, named organization, article type, publication name. Within an enterprise intranet of enterprise content management system facets might be topic, department, office location, and document type. In a health information service, facets might be symptom, body part, patient age, and  patient gender. In a corporate knowledge base for customer service, facets might be product type, brand name, market, region, issue type, and customer type. In most of these cases, a topical taxonomy is one of the facets, even if that topical taxonomy is hierarchical. The primary taxonomy design challenge in such cases is deciding what kind of information would be useful if separated out in its own facet, and what can remain in the generic topics facet. Using the SKOS (Simple Knowledge Organization System) model, each concept scheme serves as a facet.

In a product, ecommerce or marketplace taxonomy, the hierarchical taxonomy of product types is not one of the facets. This large, hierarchical taxonomy is typically the starting point for user browsing. While not constituting a facet, this hierarchy is linked to the facets. The user navigates or drills down through a hierarchical tree of product categories, until a specific product type is found, and then the facets (attributes) relevant to that product type are made available to the user, allowing the user to refine the search further, by selecting from each of several attributes, such as size, color, material, price, and features. This requires a different approach to the taxonomy design than for the faceted taxonomies described above, and t us these post-hierarchical-browse refinements that are better known as and more appropriately called attributes.

Ecommerce taxonomy attributes

Attributes can serve as refinements/filters in taxonomies for purposes other than ecommerce, such as job board taxonomies (attributes for job location, skill level, salary range, job type, employer/company, industry, date posted, etc.), an internal enterprise expert-finder (attributes for job title, department, office/work location, skills, subject areas of interest, academic degree, languages, etc.) or taxonomies of movies (attributes for genre, production company, production country, language, award winner type, release date, etc.)

The attributes generally pertain to specific named entities, such as the name of a product offered by a specific seller, the name of a job title at a specific employer, or the title of a movie. There can be attributes for more than one kind of named entity in the same set of taxonomies, such as for employer name in addition to job title in a job board taxonomy. Subjects, which are not named entities, usually do not have attributes, but some do, especially, in the fields of science and medicine, where they would be attributes on the names of species, chemicals, viruses, diseases, etc. I will discuss named entities in more detail in my next blog post.

Issues to consider in creating attributes

In a taxonomy where attributes are important, there are various issues to consider. First, shall there be a hierarchical topical taxonomy presented as an initial browse interface to the users? While this is typical for product/ecommerce taxonomies, it is not usually the case for job board taxonomies nor for a taxonomy for movies. However, it may be desired for a taxonomy of nonfiction books or periodicals, which users more often would categorize my subject. A producer or publisher of educational content will likely have a hierarchical taxonomy of disciplines or subject areas. A research-focused organization would also likely have a hierarchical subject taxonomy in addition to the facet-attributes dealing with location, type, funding source/sponsor, researcher name, etc. Having a hierarchical taxonomy outside of the attributes tends to be a user experience design decision, but it has an important impact on how the overall taxonomy is designed and managed.

More attributes may be created than usable for filtering/refining results. For example, products will likely have SKU numbers among their attributes, which can be displayed and perhaps even made searchable, but would not be one of the filtering-facets presented in the user interface. In a taxonomy for finding internal experts, contact information, such as an email address and phone number, would be attributes on each person’s profile, but these would not be searchable fields. Rather, they would display when the person profile is selected. Thus, another issue when creating attributes is determining which will display and function as filters/refinements and which will display only as additional metadata on a selected item.

If an initial hierarchical topical taxonomy is presented to the users, there arises the question of at what point in the hierarchy should the hierarchy of categories should end and further details should be treated instead with attributes? This question often comes up when designing ecommerce taxonomies. For example, to distinguish gas and electric stoves, should each of these types be a narrower term of stoves, or should energy source be an attribute of stoves?

Another issue to resolve is determining which attributes should be generic across all categories, and which should be category specific. For example, on which product categories in an ecommerce taxonomy is it appropriate to have an attribute for gender (as for clothing or a gift for a woman or for a man)? Related to that is the question of which categories should have their own unique attributes. Are some attributes relevant to major (the broadest) categories, and other attributes relevant only the most specific categories, and yet other attributes apply to various miscellaneous products? For example, color might be relevant for products in different parts of the hierarchy.

Attributes should be managed as belonging to different types based on their values, such as whether they are of controlled vocabularies, dates, currency, numbers, or a simply a Boolean yes/no, such as Remote being an attribute of jobs. If a hierarchical taxonomy resides outside of the attributes, then controlled vocabulary attributes are an additional part of a larger set of taxonomies/controlled vocabularies. How this is managed varies based on the taxonomy/ontology management tool. For example, such term lists might need to be managed as separate concept schemes in a SKOS taxonomy, even though they are used in ontology-based attributes. It can start getting complicated when an attribute type has different values for different categories in the same hierarchical taxonomy implementation. For example, the attribute of Material could have different values for tables than for clothing, and both categories are offered by the same ecommerce seller.

Attributes add another level or layer of expressivity to a taxonomy or set of controlled vocabularies, which brings it closer to an ontology. The distinction between taxonomy and ontology is not necessarily clear. It’s fine to have just some ontology-like features, such as attributes, but it is recommended to use a taxonomy/ontology management tool, such as PoolParty, which manages taxonomies and ontologies (and anything in between) according to Semantic Web/World Wide Web Consortium (W3C) standards.

Saturday, October 30, 2021

Taxonomies for Data

Coming from an editorial content background, I have always valued taxonomies for making content findable, but more recently I have come to appreciate how taxonomies can also play a role in making data accessible and useful.

Taxonomies have successfully aided people in finding and retrieving desired content since the 1990s and even decades earlier, if we consider thesauri within the scope of taxonomies. Nevertheless, the focus had always been on content: originally printed content such as periodical articles, web pages, intranet or CMS pages and attached documents, etc., and then multimedia content, such as images, animation or video clips, audio files. Each content item gets tagged with taxonomy terms of different types for what is about and for kind of content it is. Taxonomies have become increasingly important as content volume and types have grown, especially as more people in varied roles create content.

Data dashboard on computer screen
Meanwhile, data has grown even faster in its volume and potential value. We are hearing more and more about big data, data warehouses, data lakes, data fabrics, data catalogs, data analytics, master data management, data governance, FAIR data, data-centric architecture, data-driven enterprises, and data science in general. Tools and technologies to make use of the data have included programming/scripting, machine learning, algorithms, natural language processing, and other forms of artificial intelligence.

These tools and technologies for data do not replace taxonomies and other controlled vocabularies, though, which still have an important role to play in connecting people to the desired data and information, and ultimately knowledge.  I see two ways in which taxonomies are linked to data:

1.  Managing and understanding the data in a standardized way with better metadata, which depends on controlled vocabularies.

2.  Connecting the data with graph databases, knowledge graphs, ontologies, and ultimately taxonomies.

Taxonomies and metadata

Metadata refers to the standardized data types, properties, fields, or elements, and the specific individual values that populate those types or properties. From a content perspective, we think of metadata as serving content management and retrieval, such as the content’s format type, title, source, creator, date, language, subjects, category, audience, etc. But metadata exists in databases and spreadsheets, too, where column headers are the metadata properties. For example, contact metadata would include name, phone number, email address, city/state, country, contact type, initial contact date, contact owner, etc. Product metadata would include SKU number, product name, product type/category, price, color, features, supply source, retail availability, etc. Transactional metadata would include purchased product name, purchaser, purchase date, purchase price, purchase location.

Data can be better managed and analyzed if the metadata properties and values are standardized and controlled. Controlled vocabularies should be used to standardize the metadata for many of the properties: format type source, subjects, category, purpose, country, contact type, product name, product category, color, features, availability, etc. Hierarchical taxonomies serve some of this metadata, such as product categories.

As an example, I’m planning to attend a conference in Austin, TX, and I wanted to look up contacts in the Austin area in my CRM (customer relationship management) system. Filtering results by city, I found some with the city of Austin, but others had the city of Round Rock. Filtering on Austin, I would have missed those, had I not known that Round Rock was a suburb of Austin. What was needed was a metadata property for “Metropolitan area,” rather than “City,” a controlled list of metropolitan areas, and Round Rock as an alternative label for Austin area in that controlled vocabulary.

Taxonomies and ontologies

Taxonomies, controlled vocabularies, and metadata alone are good for filtering or queries to find content that meets a set of criteria (based on metadata properties or faceted taxonomy selections). But what if you want to discover and explore relationships across the data? Instead of merely looking for all the contacts in the Austin area that have the customer or sales-qualified-lead status and have a contact owner, I want to limit that further to contacts whose employers in turn meet certain criteria, such as belonging to specific industries or meeting an annual revenue minimum. Another query example would be to find the locations in the past 10 years of industry events in which a specific organization has participated. These connections across different metadata types, vocabularies, or categories, are made with an ontology.

An ontology has, besides any hierarchical relationships characteristic of a taxonomy, additional semantic relationships that connect across types or classes of entities. Classes may be for metropolitan area, company name, person name, industry event name, etc. Semantic relationships across these classes may include is-employed-by-company/employs-employee, sponsors-event/has-sponsor, is-located-in/is-location-of. Attributes are additional metadata for the entities of each class, such as address. “Ontology” typically refers to just the knowledge model of classes, relationships and attribute types. But to become useful in information retrieval and data analysis, an ontology is connected to a taxonomy or other controlled vocabulary to extend those semantic relationships and attributes to all the concepts/terms.  

Taxonomies and knowledge graphs

A growing use of ontologies is in knowledge graphs. Knowledge graphs extend the ontology+taxonomy knowledge organization system further by integrating instance data that is a of set too large to fit into controlled vocabularies and tends to reside in databases or spreadsheet cells. This could be the 10,000s of contacts in a CRM or products and product parts in a PIM (product information management) system. The knowledge graph brings, actually or virtually, the data from these different systems into a graph database. A graph database is structured of nodes and edges (connections between nodes), rather than of tables of rows and columns characteristic of a relational database. Data entities are at the nodes and connections of relations or property types are designated along the connecting edges. The graph structure thus supports the model of the applied ontology, which has classes and individuals at the nodes and semantic relations or attribute types describing the edges.

Why knowledge graphs? Taxonomies, controlled vocabularies, and metadata alone are good for finding information in a single content/data repository, database, or content management system. But often the same, similar, or related information exists in multiple different sources or systems, as data or as content “silos,” such as product information residing in the PIM, the web ecommerce platform, the marketing content management system, and the sales management system. By extracting the data from these different sources and storing it in a single graph database, the connections between the data from all sources can be made.

Knowledge graphs link data that is in different repositories and systems, both structured and unstructured data and as such provide a unified view of the data. Furthermore, with taxonomies tagged additionally to content, relevant data and content and be linked to each other.

Opportunities for taxonomies and data together

In conclusion, taxonomies alone are focused on content, but if you combine taxonomies with ontologies and/or diverse metadata, you extend the use of taxonomies to data. I am also seeing the connections of taxonomies and data in more places.

My current job title is Data and Knowledge Engineer, which reflects the combination of the knowledge management and data science realms. Actually, I am not a data engineer at all, but my department at Semantic Web Company has standardized the job titles, as we knowledge engineers and data engineers work very closely together on the same teams. This is to provide combined services and solutions to our customers.

In other ways data and taxonomy are combined in jobs. Last year I had a contract taxonomy job that was heavily into data (managed in spreadsheets). In the other direction, data related job postings have taxonomies in their job descriptions. A search today on “taxonomy” in descriptions of LinkedIn jobs brought up Data Governance Consultant, Data Analyst II - Taxonomy, Taxonomy Data Architect, Data Custodian, Data Governance Lead in the top 25 results, and on Indeed it brought up Data Analyst, Junior Data Analyst, Data Annotator, and Data Entry Specialist in the top 15 results.

I have most keenly noticed this combination of taxonomies and data by participating in more data-related conferences recently. In 2021, among other conferences, I have spoken on taxonomies at Data-Centric Architecture Forum in February, the European Data Conference on Reference Data and Semantics (ENDORSE) in March, the Knowledge Graph Conference in May, and Data Con LA in September. Upcoming are my masterclass “Foundation for a Knowledge Graph: Taxonomy Design Best Practices” at the virtual Connected Data World conference on December 2, and a tutorial “Introduction to Taxonomies for Data Scientists”  and presentation “The Future of Taxonomies – Linking Data to Knowledge” both at Data Day Texas on January 22 in Austin, TX.

Friday, October 1, 2021

Taxonomies for Human Resources

I just attended HRHR Technology conference opening Technology Conference this week, my first time at an industry or functional specialty conference, so it was interesting to learn how taxonomies could be positioned within this specialized sector. I usually speak or write about taxonomies as useful in general knowledge and information management, with the only specialization discussed in ecommerce. 

Human resources technology is a broad category, which includes software for such functions as benefits, compensation, engagement and recognition, learning management, onboarding, payroll, recruitment, screening, time and attendance, wellness, etc. Taxonomies are not particularly relevant for most of these areas, but are for some, such as talent management systems, job boards, and intranets. Performance management and training management may also benefit from taxonomies. 

In the conference opening keynote “HR Technology Reinvented: The Big Shift Towards Work Tech” presented by Josh Bersin, I was pleased to hear that this HR technology industry analyst had as #3 among his industry trends: “Skills taxonomies are the next big thing,” and he had a slide illustrating how a “taxonomy is more complex than you think.” Reasons Bersin gave for the complexity: a skill is not well defined, skills differ even in the same industry, and companies cannot trust black box skills.  
Taxonomies may also be implemented as part of a knowledge graph solution that links data in multiple applications, systems, and repositories, which is a typical scenario for HR technology, despite the existence of some degree of integration of functions within HR management systems (HRMS) or human capital management (HCM) software.

Another point that Bersin made was that the talent marketplace has become a category. It’s become more important to recruit and hire internally, so an internal marketplace for employees and jobs can be created. I find this also an interesting application for taxonomies. Taxonomies in business and industry are well established and known for ecommerce, which is B2C, but more recently taxonomies have been implemented in B2B and C2C marketplaces, such as Etsy. In an employee-job marketplace, taxonomies can be used to tag employee skills, interests, and locations, along with the job openings.


The talent marketplace was also discussed by the second day’s keynote spaker Ravin Jesuthan, who additionally explained how the internal talent marketplace can connect workers to projects, assignments, and tasks, rather than simply job openings.  He also referred to a market relationship and to matchmaking. On the subject of matchmaking, I found a vendor of a platform to match employees to coaches or mentors an interesting use case for a taxonomy. 

Another trend is that employee learning or training has a more important role in the flow of people to work. There is also a potential for taxonomies to support this endeavor. Depending on the volume, the findability of training materials could benefit by being tagged with terms from a taxonomy. A taxonomy can also support the recommendation of appropriate training courses to employees.

Finally, there is a lot of emphasis placed on employee experience, which was the number one trend in Bersin’s keynote.  One way to improve the employee experience, which was not mentioned in the keynote, is to have a single user-interface that, with a single, consistent taxonomy, links content and data in different systems. So, the users have only a single place to go to find answers to all of their employment-related questions. 

Tuesday, August 31, 2021

Knowledge Engineering and Taxonomies

My next conference workshop (at SEMANTiCS September 7) on taxonomies and ontologies has in its title “knowledge engineering.” I figured this may resonate more with the audience of computer scientists, data scientists, and Semantic technology and AI experts. People come (often accidentally) to the field of designing taxonomies, ontologies, and knowledge organization systems in general from different backgrounds, and may work in different disciplines or departments. They may have very different training, job titles, job descriptions. 

Also, my job title now, at Semantic Web Company, is knowledge engineer, although there is not much agreement on what that job title means. I had once before, over 10 years ago, applied for a position with the job title knowledge engineer, and the role focused on writing rules for rules-based auto-classification. This involves using a taxonomy with logic rules and regular expressions for each taxonomy term to support automated indexing, rather than using training sets and machine learning. My current job, however, involves designing taxonomies and ontologies, often in combination.

Creating a taxonomy or thesaurus alone is not knowledge engineering. This is because  a taxonomy does not describe all aspects of a knowledge domain, just the concepts and their hierarchical relationships, or in the case of a thesaurus, some additional nonspecific related (See also) relationships. Furthermore, there already exist published guidelines/standards for taxonomies and thesauri, as ANSI/NISO Z39.19 and ISO 25964-1 that specify best practices for design. 

It makes more sense to call ontology design a form of knowledge engineering. Ontologies have a much higher level of semantics or expressiveness, which needs to be defined by the ontologist or knowledge engineer. There are customized, semantic relationships (such as “is located in” and “contains”), which are to be applied between designated classes (such as organizations and places), any number of customized attributes (such as address or latitude/longitude) that can be specified for a class. Standards for ontologies, such as OWL, which are from the World Wide Web Consortium (W3C), are only for machine readability and interoperability, but not for best practices, so there is more room for interpretation and innovation when it comes to designing an ontology, than there is for a taxonomy or thesaurus

Knowledge engineering may involve more than designing on ontology but may include all the various kinds of controlled vocabularies for the content and data of an organization. This includes determining what kind of vocabularies are needed and how they are related to each other.

Knowledge engineering is also very similar to knowledge modeling, which I blogged about before in the post "Knowledge Modeling." Knowledge engineering is a more general function, whereas knowledge modeling is a more specific activity. 

Knowledge engineering also goes beyond taxonomy/ontology design and creation to include the follow-through application, which is namely the management of tagging or classification of content with the taxonomy. This is, after all, how a useful knowledge base is created, with content tagged and available for retrieval. Definitions of knowledge engineering sometimes refer to it as a field within artificial intelligence (AI) to build knowledge bases. While I might not agree that this is always part of the definition of knowledge engineering, AI is used for automated tagging of content with a taxonomy. 

It's probably better to define knowledge engineering more broadly as methods to support the development and transmission of knowledge, specifically by by transforming data to information and information to knowledge, as the frequently depicted pyramid on the right suggests. This transformation is specifically done by designing and creating links between data, which is supported by taxonomies and ontologies.

Saturday, July 31, 2021

Taxonomies and Sitemaps

I was recently asked if a website’s sitemap of company’s website could serve as the start of a taxonomy for an organization. The sitemap, after all, includes all the relevant topics pertaining to an organization’s business offerings, and they are arranged in a hierarchy.  I have previously blogged on the subject of why a website’s navigation is not a taxonomy in Navigation Schemes and Taxonomies. A sitemap is similar to a website’s navigation, but it goes deeper by including the titles or topics of web pages which are not included in the website’s menu, and it is not necessarily intended for user browsing. A sitemap may go five or six levels deep, whereas the website menu navigation menus are usually only two levels. Therefore, a sitemap may seem as if it’s a taxonomy. However, just because a sitemap is as large and detailed as a taxonomy needs to be does not make it suitable as a taxonomy.

Different purposes

We need to understand what a taxonomy is for. It’s to aid users in locating desired content by topic-terms, which reflect both the terminology use of the users and of the content. Taxonomy terms are tagged/indexed to content that is relevant to the term. The starting point when creating a taxonomy is to identify the topics of the content and identify the topics of user interest or search, and then merge those topics into a taxonomy by bringing together different names for the same concept. The concepts are then structurally arranged to show the relationships between the terms, especially hierarchical relationships. The primary purpose of the hierarchy of terms in a taxonomy is to aid the users in finding the appropriate term. When browsing the taxonomy, they may find a broader term or narrower term that better describes their search goals. Then they can select that term to retrieve content that was tagged with the term.  

A sitemap, on the other hand, lists all or most pages of a website, usually by page title and organized in the hierarchical structure of the website. The hierarchical structure of the website was designed to organize information in a logical manner for users to browse and explore, as considered by the information architect who designed the website. The sitemap thus reflects pages, which are often topics but not always. A page may have multiple topics of interest that a user might want to look up. A page is sometimes for performing a function or activity and not necessarily just a topic of information.

A sitemap is typically automatically generated from the page titles, and its primary purpose is not for user but for machines: they tell search engines about pages that are available for crawling on websites and can thus support search engine optimization (SEO). Sitemap are useful in planning the further development or organizational improvement of a website. Whether a sitemap should even be displayed to end users as a tool to find information on a website is questionable. If automatically generated, it's not designed for that purpose, but users could find it helpful, especially users who understand that it is merely the aggregation of page titles organized in the file structure of the website. Some website make it available, and some do not. Some websites have displayed a simplified sitemap instead  that is designed to be a guide to the users, but then it do not include all pages.

Different labels

The title names of pages and thus of sitemap entries often do not correspond to taxonomy terms. They could start out with verb for an activity, they could be commands or questions, or they could be complete sentences. Taxonomy terms are topics or names only represented by nouns or noun phrases, or proper nouns. Examples of sitemap entries that are not good taxonomy terms may include:

How to use…
Get started with…
Help with…
Pay a bill
Shop for…

As with navigation, the entries of a sitemap reflect pages in a one-to-one relationship, in contrast to taxonomy terms, each of which may retrieve multiple pages or content sources, and each page or content item can be tagged with multiple taxonomy terms. As such, entries in a sitemap may actually be more specific than would be needed in a taxonomy.  The user’s selection of multiple taxonomy terms in combination, through filters/refinements, achieves the result of obtaining an appropriate list of relevant content.


Sitemaps should not be used as taxonomies, but their topics (not their labels) may be considered as a good source for a taxonomy. Sitemaps might not even be suitable as a basis or starting point for a taxonomy, but rather as a source for developing taxonomy terms. Rather, it is recommended that a taxonomy be created separately from a sitemap based on a review of content, search log data, and stakeholder and user interviews, and the sitemap is yet one other source for consideration when taxonomy terms. The hierarchy of the sitemap should also not be too closely followed, although parts of its hierarchical structure may be taken into consideration for creating taxonomy relationships.