Friday, December 17, 2021

Named Entities in Taxonomies

I have long felt that there is some uncertainty as to where named entities (names of specific people, places, organizations, products, etc.) fit into taxonomies. Standards suggest one way, and practice tends to follow a different way in dealing with these proper nouns. As taxonomy trends evolve, so does the position on these named entities. The fact that taxonomies are not well-defined leaves it open to question whether taxonomies should have any named entities in them, or whether taxonomies should comprise only topics.

Historical trends

A historical perspective is needed. Modern, digital information retrieval taxonomies evolved out of thesauri. Thesauri, which originally came out in print format, first appeared in the 1960s and then were formalized by various standards published in the 1970s. The thesaurus standards state clearly that the relationship between a named instance and its type is one of the three kinds of hierarchical relationships permitted and supported in thesauri (the other two being generic-specific and whole-part). While taxonomies may omit the associative (related term) relationship of thesauri, they tend to follow the hierarchical standards of thesauri. Thus, named entities could be included in the taxonomy as the narrowest terms, narrower to a term for whatever “type” they are. But should it always be this way?

Then faceted taxonomies started being implemented in the early 2000s, first in ecommerce and then by the end of the decade in intranets, content management systems, digital asset management systems, and various content-rich websites. Once facets were adopted in information retrieval applications (aside from ecommerce), it became obvious from a user design perspective that named entities belonged in a different facet than the subjects. Facets are for refining a complex search query by different aspects. Sometimes these aspects follow the types of questions: What? Who? Where? When? “What” is usually for a subject, but “who,” “where,” and “when” (for taxonomy terms naming events, not date ranges) refer to named entities. Sometimes people start a query about a subject, and sometimes people start a query about a named entity, and facets allow people to start off searching any way they wish.

Then in 2009 the World Wide Web Consortium published the Simple Knowledge Organization System (SKOS) recommendation for taxonomies, thesauri, and other controlled vocabularies, which over the following decade became adopted as the standard model for building machine-readable taxonomies. One of the elements described in SKOS is that of the concept scheme, which is defined merely as “an aggregation of one or more SKOS concepts.” There is nothing comparable in the thesaurus standards. While a taxonomist may choose what to do with an “aggregation” of concepts, it has proven practical to separate out different kinds of named entities into concept schemes separate from concept schemes for topics. Thus, the widespread adoption of SKOS has contributed to the trend of separating different named entity sets, which had already started with faceted taxonomies.
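As a rough illustration of this separation, here is a minimal sketch that models concepts and their concept schemes as plain Python data rather than as RDF. The scheme names, the concept labels, and the helper labels_in_scheme are all made up for illustration; the only idea taken from SKOS is that each concept is assigned to a concept scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    pref_label: str
    scheme: str  # plays the role of a concept scheme; names are hypothetical


# Hypothetical concepts: topics and named entities kept in separate schemes.
CONCEPTS = [
    Concept("Renewable energy", "Topics"),
    Concept("Mergers and acquisitions", "Topics"),
    Concept("Example Motors Inc.", "Organizations"),
    Concept("Austin, Texas", "Places"),
]


def labels_in_scheme(concepts, scheme):
    """Return the sorted preferred labels of all concepts in one scheme."""
    return sorted(c.pref_label for c in concepts if c.scheme == scheme)


print(labels_in_scheme(CONCEPTS, "Topics"))
print(labels_in_scheme(CONCEPTS, "Organizations"))
```

Keeping the scheme on each concept means a named-entity list can later receive its own attributes or maintenance rules without touching the topics.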

My initial, and longest, experience in the domain of taxonomies and controlled vocabularies was as a controlled vocabulary editor at the library database vendor Gale. At Gale (and its predecessor company), named entity controlled vocabularies ("name authorities") have been separate from the subjects, but there were reasons for this. The named entities (named persons, companies, organizations and agencies, named works, products, laws, events, and fictional characters) have each had different sets of attributes and rules for maintenance. Some even have different customized relationships with other controlled vocabularies. Interestingly, it was not always this way. Before I joined in the mid-1990s, some of these named entities (agencies, organizations, works, geographics, and events) were mixed in with the “descriptors” in a Subject MegaFile. But eventually specific attributes and relations, not to mention the growing number of terms and a new vocabulary management system, combined to make it more logical to split off each of the named entity vocabularies. The Events were the last to be split out of the Subjects. So, it was not the fact that the controlled vocabularies contained named entities per se, but rather their growing specialized maintenance needs due to an increase in specific attributes, that led to managing them as separate controlled vocabularies. Attributes include, for example, birth date and place for a person, latitude and longitude for a location, and website URL and address for companies and organizations, among many more.

Taxonomies and ontologies

This feature of attributes brings us to the most recent trend in taxonomies, which is the occasional, but growing, convergence of taxonomies and ontologies. Ontologies divide up a knowledge domain into classes, and each class (like the Gale named-entity controlled vocabularies) has its own set of attributes and customized relationships with other classes. Ontologies, according to the Web Ontology Language (OWL) standard, however, have a different perspective on named entities. Ontologies comprise classes and subclasses, in hierarchies, which in turn contain “instances” or “individuals,” which are unique named entities. The relationship between an instance and a class (or subclass) is not, however, considered hierarchical, but rather of a “member” type. Thus, while thesauri make no distinction for named entities, and taxonomies separate out named entities when it’s practical, ontologies make a strict distinction.
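The difference between the thesaurus view and the ontology view can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the terms, the dictionary names, and the helper functions are hypothetical, and a real ontology would be expressed in OWL rather than in dictionaries.

```python
# Thesaurus style: one hierarchical (broader-term) relation covers both
# generic-specific links and instance links.
BROADER = {
    "Mount Everest": "Mountains",
    "Mountains": "Landforms",
}

# Ontology style: the class hierarchy and instance membership are kept apart.
SUBCLASS_OF = {"Mountain": "Landform"}
MEMBER_OF = {"Mount Everest": "Mountain"}


def chain(start, relation):
    """Follow a single-parent relation upward and list what is reached."""
    found = []
    current = start
    while current in relation:
        current = relation[current]
        found.append(current)
    return found


def classes_of(individual):
    """Direct class of an individual plus that class's superclasses."""
    direct = MEMBER_OF[individual]
    return [direct] + chain(direct, SUBCLASS_OF)


print(chain("Mount Everest", BROADER))
print(classes_of("Mount Everest"))
```

In the thesaurus-style data the named entity is just the narrowest term in one chain, while in the ontology-style data it never appears in the subclass hierarchy at all and is reached only through membership.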

Furthermore, for ontologies, which originated in the domains of philosophy and computer science, a named entity as a proper noun is not what matters. Rather, it’s the fact that the instance is unique, and there is only one. This is true for people, companies/organizations, and places. It is not true for brand name products, though. A named product is a proper noun, such as MacBook Pro or Honda Accord, but it is not a unique instance, because there are millions of individual MacBook Pros and Honda Accords in existence. It’s a similar matter for named works, such as books, where one title has millions of copies. “Named entities” or “proper nouns” are grammatical or linguistic designations, which are OK for taxonomies and thesauri, but are not a feature of ontologies, with their philosophical origins.

Fortunately, you don’t have to worry about this philosophical problem if you choose to follow the approach of applying a high-level ontology model to an existing taxonomy or set of controlled vocabularies to extend the ontology with specific terms and named entities (or, from the other direction, to extend the taxonomy with semantic relations and attributes). The OWL-based ontology then may comprise only as many classes and subclasses as needed to designate the usage of distinct custom relations and attributes. With this approach, a different ontology class is mapped to each subset or hierarchy or SKOS concept scheme of a larger taxonomy. Each named entity type would typically correspond to a different ontology class, based on the named entity’s own attributes and relations. So, each named entity type would be in its own controlled vocabulary or SKOS concept scheme.

Just because OWL ontologies may include named instances as members of a subclass does not mean you have to set up your knowledge model that way. This is similar to the idea of the thesaurus standard, which permits named entities to be narrower terms to generic subjects, but you don’t have to set it up that way. Omitting an option described in the thesaurus or ontology standards does not mean you are not in compliance with those standards.

So, in conclusion, while some things about taxonomies have remained constant, other things, such as where to put named entities, have changed over time.

Saturday, November 27, 2021

Attributes in Taxonomies

When I had done consulting for ecommerce taxonomy clients years ago, and they would refer to the taxonomy facets for products as “attributes,” I felt that might be confusing, because I considered “attributes” something else: a characteristic like metadata of a taxonomy term or a feature of an ontology. I have since come to realize that facets in some cases, especially in ecommerce, can be considered attributes, and they can even be managed in an ontology that is combined with the taxonomy.

Facets in a faceted taxonomy are various taxonomy term “types” that function as refinements or filters in the user interface for limiting search results on content that share similar types of terms or attributes. Users refine or filter their searches by selecting a term or value from each of two or more facets. In a periodical article research database offered by a library, facets might be subject, geographic place, named person, named organization, article type, and publication name. Within an enterprise intranet or enterprise content management system, facets might be topic, department, office location, and document type. In a health information service, facets might be symptom, body part, patient age, and patient gender. In a corporate knowledge base for customer service, facets might be product type, brand name, market, region, issue type, and customer type. In most of these cases, a topical taxonomy is one of the facets, even if that topical taxonomy is hierarchical. The primary taxonomy design challenge in such cases is deciding what kind of information would be useful if separated out in its own facet, and what can remain in the generic topics facet. Using the SKOS (Simple Knowledge Organization System) model, each concept scheme serves as a facet.
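To make the refinement behavior concrete, here is a minimal sketch of faceted filtering over a few tagged articles. The article titles, the facet names used as dictionary keys, and the refine function are hypothetical, and the sketch assumes that selections in different facets are combined with AND.

```python
# Hypothetical articles, each tagged with terms from several facets.
ARTICLES = [
    {"title": "Wind power expands", "subject": {"Renewable energy"},
     "place": {"Texas"}, "article_type": {"News"}},
    {"title": "Solar subsidies reviewed", "subject": {"Renewable energy"},
     "place": {"Germany"}, "article_type": {"Editorial"}},
    {"title": "Port traffic grows", "subject": {"Shipping"},
     "place": {"Texas"}, "article_type": {"News"}},
]


def refine(items, selections):
    """Keep items matching every selected facet value (AND across facets)."""
    return [
        item for item in items
        if all(value in item.get(facet, set())
               for facet, value in selections.items())
    ]


def titles(items):
    """Return just the titles, in their original order."""
    return [item["title"] for item in items]


print(titles(refine(ARTICLES, {"subject": "Renewable energy"})))
print(titles(refine(ARTICLES, {"subject": "Renewable energy", "place": "Texas"})))
```

The user can begin with either the subject or the place, and the result of combining both selections is the same.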

In a product, ecommerce or marketplace taxonomy, the hierarchical taxonomy of product types is not one of the facets. This large, hierarchical taxonomy is typically the starting point for user browsing. While not constituting a facet, this hierarchy is linked to the facets. The user navigates or drills down through a hierarchical tree of product categories, until a specific product type is found, and then the facets (attributes) relevant to that product type are made available to the user, allowing the user to refine the search further, by selecting from each of several attributes, such as size, color, material, price, and features. This requires a different approach to the taxonomy design than for the faceted taxonomies described above, and thus these post-hierarchical-browse refinements are better known as and more appropriately called attributes.

Ecommerce taxonomy attributes

Attributes can serve as refinements/filters in taxonomies for purposes other than ecommerce, such as job board taxonomies (attributes for job location, skill level, salary range, job type, employer/company, industry, date posted, etc.), an internal enterprise expert-finder (attributes for job title, department, office/work location, skills, subject areas of interest, academic degree, languages, etc.) or taxonomies of movies (attributes for genre, production company, production country, language, award winner type, release date, etc.).

The attributes generally pertain to specific named entities, such as the name of a product offered by a specific seller, the name of a job title at a specific employer, or the title of a movie. There can be attributes for more than one kind of named entity in the same set of taxonomies, such as for employer name in addition to job title in a job board taxonomy. Subjects, which are not named entities, usually do not have attributes, but some do, especially in the fields of science and medicine, where there would be attributes on the names of species, chemicals, viruses, diseases, etc. I will discuss named entities in more detail in my next blog post.

Issues to consider in creating attributes

In a taxonomy where attributes are important, there are various issues to consider. First, shall there be a hierarchical topical taxonomy presented as an initial browse interface to the users? While this is typical for product/ecommerce taxonomies, it is not usually the case for job board taxonomies nor for a taxonomy for movies. However, it may be desired for a taxonomy of nonfiction books or periodicals, which users more often would categorize by subject. A producer or publisher of educational content will likely have a hierarchical taxonomy of disciplines or subject areas. A research-focused organization would also likely have a hierarchical subject taxonomy in addition to the facet-attributes dealing with location, type, funding source/sponsor, researcher name, etc. Having a hierarchical taxonomy outside of the attributes tends to be a user experience design decision, but it has an important impact on how the overall taxonomy is designed and managed.

More attributes may be created than usable for filtering/refining results. For example, products will likely have SKU numbers among their attributes, which can be displayed and perhaps even made searchable, but would not be one of the filtering-facets presented in the user interface. In a taxonomy for finding internal experts, contact information, such as an email address and phone number, would be attributes on each person’s profile, but these would not be searchable fields. Rather, they would display when the person profile is selected. Thus, another issue when creating attributes is determining which will display and function as filters/refinements and which will display only as additional metadata on a selected item.

If an initial hierarchical topical taxonomy is presented to the users, the question arises: at what point should the hierarchy of categories end and further details be treated instead as attributes? This question often comes up when designing ecommerce taxonomies. For example, to distinguish gas and electric stoves, should each of these types be a narrower term of stoves, or should energy source be an attribute of stoves?

Another issue to resolve is determining which attributes should be generic across all categories, and which should be category specific. For example, on which product categories in an ecommerce taxonomy is it appropriate to have an attribute for gender (as for clothing or a gift for a woman or for a man)? Related to that is the question of which categories should have their own unique attributes. Are some attributes relevant to major (the broadest) categories, other attributes relevant only to the most specific categories, and yet other attributes applicable to various miscellaneous products? For example, color might be relevant for products in different parts of the hierarchy.
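One possible way to handle generic and category-specific attributes together is to declare each attribute at the level of the hierarchy where it first becomes relevant and let narrower categories inherit it. The sketch below assumes this inheritance design, which is my own illustration and not the only option; the categories, attribute names, and the attributes_for function are hypothetical.

```python
# Hypothetical product hierarchy: child category -> parent category.
PARENT = {
    "Stoves": "Kitchen appliances",
    "Kitchen appliances": "Home",
    "Dresses": "Clothing",
}

# Attributes declared at the level where they first become relevant.
DECLARED = {
    "Home": ["Color"],
    "Kitchen appliances": ["Brand"],
    "Stoves": ["Energy source"],
    "Clothing": ["Color", "Gender", "Size"],
}


def attributes_for(category):
    """Collect attributes from the category and all of its ancestors."""
    collected = []
    current = category
    while current is not None:
        for name in DECLARED.get(current, []):
            if name not in collected:
                collected.append(name)
        current = PARENT.get(current)
    return collected


print(attributes_for("Stoves"))
print(attributes_for("Dresses"))
```

Here energy source is an attribute of stoves instead of a pair of narrower terms, and color is declared separately in two different parts of the hierarchy.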

Attributes should be managed as belonging to different types based on their values, such as whether they are controlled vocabularies, dates, currency, numbers, or simply a Boolean yes/no, such as Remote being an attribute of jobs. If a hierarchical taxonomy resides outside of the attributes, then controlled vocabulary attributes are an additional part of a larger set of taxonomies/controlled vocabularies. How this is managed varies based on the taxonomy/ontology management tool. For example, such term lists might need to be managed as separate concept schemes in a SKOS taxonomy, even though they are used in ontology-based attributes. It can start getting complicated when an attribute type has different values for different categories in the same hierarchical taxonomy implementation. For example, the attribute of Material could have different values for tables than for clothing, and both categories are offered by the same ecommerce seller.
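A small sketch may help show what typed attributes with category-dependent values could look like as data. The type names, the value lists, and the is_valid_material function are assumptions made for illustration and do not reflect how any particular taxonomy management tool stores them.

```python
# Hypothetical attribute definitions with their value types.
ATTRIBUTE_TYPES = {
    "Material": "controlled_vocabulary",
    "Price": "currency",
    "Date posted": "date",
    "Remote": "boolean",
}

# One attribute type, with different permitted values per category.
MATERIAL_VALUES = {
    "Tables": {"Wood", "Glass", "Metal"},
    "Clothing": {"Cotton", "Wool", "Polyester"},
}


def is_valid_material(category, value):
    """Check a Material value against the list for the given category."""
    return value in MATERIAL_VALUES.get(category, set())


print(ATTRIBUTE_TYPES["Remote"])
print(is_valid_material("Tables", "Wood"))
print(is_valid_material("Clothing", "Wood"))
```

Each value list here could be maintained as its own concept scheme, with the category deciding which list applies.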

Attributes add another level or layer of expressivity to a taxonomy or set of controlled vocabularies, which brings it closer to an ontology. The distinction between taxonomy and ontology is not necessarily clear. It’s fine to have just some ontology-like features, such as attributes, but it is recommended to use a taxonomy/ontology management tool, such as PoolParty, which manages taxonomies and ontologies (and anything in between) according to Semantic Web/World Wide Web Consortium (W3C) standards.

Saturday, October 30, 2021

Taxonomies for Data

Coming from an editorial content background, I have always valued taxonomies for making content findable, but more recently I have come to appreciate how taxonomies can also play a role in making data accessible and useful.

Taxonomies have successfully aided people in finding and retrieving desired content since the 1990s and even decades earlier, if we consider thesauri within the scope of taxonomies. Nevertheless, the focus had always been on content: originally printed content such as periodical articles, then web pages, intranet or CMS pages and attached documents, etc., and then multimedia content, such as images, animation or video clips, and audio files. Each content item gets tagged with taxonomy terms of different types for what it is about and for what kind of content it is. Taxonomies have become increasingly important as content volume and types have grown, especially as more people in varied roles create content.

Meanwhile, data has grown even faster in its volume and potential value. We are hearing more and more about big data, data warehouses, data lakes, data fabrics, data catalogs, data analytics, master data management, data governance, FAIR data, data-centric architecture, data-driven enterprises, and data science in general. Tools and technologies to make use of the data have included programming/scripting, machine learning, algorithms, natural language processing, and other forms of artificial intelligence.

These tools and technologies for data do not replace taxonomies and other controlled vocabularies, though, which still have an important role to play in connecting people to the desired data and information, and ultimately knowledge.  I see two ways in which taxonomies are linked to data:

1.  Managing and understanding the data in a standardized way with better metadata, which depends on controlled vocabularies.

2.  Connecting the data with graph databases, knowledge graphs, ontologies, and ultimately taxonomies.


Taxonomies and metadata


Metadata refers to the standardized data types, properties, fields, or elements, and the specific individual values that populate those types or properties. From a content perspective, we think of metadata as serving content management and retrieval, such as the content’s format type, title, source, creator, date, language, subjects, category, audience, etc. But metadata exists in databases and spreadsheets, too, where column headers are the metadata properties. For example, contact metadata would include name, phone number, email address, city/state, country, contact type, initial contact date, contact owner, etc. Product metadata would include SKU number, product name, product type/category, price, color, features, supply source, retail availability, etc. Transactional metadata would include purchased product name, purchaser, purchase date, purchase price, purchase location.

Data can be better managed and analyzed if the metadata properties and values are standardized and controlled. Controlled vocabularies should be used to standardize the metadata for many of the properties: format type, source, subjects, category, purpose, country, contact type, product name, product category, color, features, availability, etc. Hierarchical taxonomies serve some of this metadata, such as product categories.

As an example, I’m planning to attend a conference in Austin, TX, and I wanted to look up contacts in the Austin area in my CRM (customer relationship management) system. Filtering results by city, I found some with the city of Austin, but others had the city of Round Rock. Filtering on Austin, I would have missed those, had I not known that Round Rock was a suburb of Austin. What was needed was a metadata property for “Metropolitan area,” rather than “City,” a controlled list of metropolitan areas, and Round Rock as an alternative label for Austin area in that controlled vocabulary.
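The Austin example can be sketched as a controlled list with alternative labels. The contact records are invented, and the structure (a preferred label mapped to a set of alternative labels) is just one simple way to represent such a list.

```python
# Hypothetical controlled list: preferred label -> alternative labels.
METRO_AREAS = {
    "Austin area": {"Austin", "Round Rock"},
}

# Hypothetical CRM rows with a free-text city field.
CONTACTS = [
    {"name": "Contact A", "city": "Austin"},
    {"name": "Contact B", "city": "Round Rock"},
    {"name": "Contact C", "city": "Dallas"},
]


def metro_area(city):
    """Map a city name to its metropolitan area, or None if unlisted."""
    for preferred, alternatives in METRO_AREAS.items():
        if city == preferred or city in alternatives:
            return preferred
    return None


def contacts_in(metro, contacts):
    """Names of contacts whose city resolves to the given metro area."""
    return [c["name"] for c in contacts if metro_area(c["city"]) == metro]


print(contacts_in("Austin area", CONTACTS))
```

Filtering on the metropolitan area now returns the Round Rock contact without the user needing to know the local geography.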

Taxonomies and ontologies


Taxonomies, controlled vocabularies, and metadata alone are good for filtering or queries to find content that meets a set of criteria (based on metadata properties or faceted taxonomy selections). But what if you want to discover and explore relationships across the data? Instead of merely looking for all the contacts in the Austin area that have the customer or sales-qualified-lead status and have a contact owner, I want to limit that further to contacts whose employers in turn meet certain criteria, such as belonging to specific industries or meeting an annual revenue minimum. Another query example would be to find the locations in the past 10 years of industry events in which a specific organization has participated. These connections across different metadata types, vocabularies, or categories, are made with an ontology.

An ontology has, besides any hierarchical relationships characteristic of a taxonomy, additional semantic relationships that connect across types or classes of entities. Classes may be for metropolitan area, company name, person name, industry event name, etc. Semantic relationships across these classes may include is-employed-by-company/employs-employee, sponsors-event/has-sponsor, is-located-in/is-location-of. Attributes are additional metadata for the entities of each class, such as address. “Ontology” typically refers to just the knowledge model of classes, relationships and attribute types. But to become useful in information retrieval and data analysis, an ontology is connected to a taxonomy or other controlled vocabulary to extend those semantic relationships and attributes to all the concepts/terms.  
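Here is a minimal sketch of the kind of cross-class query described above: contacts in one metropolitan area, limited to those whose employers meet industry and revenue criteria. All entities and figures are invented, and plain dictionaries stand in for what would normally be an ontology applied to controlled vocabularies.

```python
# Hypothetical entities in two classes, each with its own attributes.
COMPANIES = {
    "Company X": {"industry": "Software", "annual_revenue": 50_000_000},
    "Company Y": {"industry": "Retail", "annual_revenue": 5_000_000},
}
PERSONS = {
    "Person A": {"metro_area": "Austin area", "is_employed_by": "Company X"},
    "Person B": {"metro_area": "Austin area", "is_employed_by": "Company Y"},
    "Person C": {"metro_area": "Boston area", "is_employed_by": "Company X"},
}


def contacts_by_employer(metro, industries, min_revenue):
    """Persons in a metro area whose employer meets both criteria."""
    matches = []
    for name, person in PERSONS.items():
        # Follow the is-employed-by relationship to the company entity.
        employer = COMPANIES[person["is_employed_by"]]
        if (person["metro_area"] == metro
                and employer["industry"] in industries
                and employer["annual_revenue"] >= min_revenue):
            matches.append(name)
    return matches


print(contacts_by_employer("Austin area", {"Software"}, 10_000_000))
```

The filter on the person and the filter on the company are joined by the semantic relationship between the two classes, which metadata on the contact alone could not express.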

Taxonomies and knowledge graphs


A growing use of ontologies is in knowledge graphs. Knowledge graphs extend the ontology+taxonomy knowledge organization system further by integrating instance data that is of a set too large to fit into controlled vocabularies and that tends to reside in database or spreadsheet cells. This could be the tens of thousands of contacts in a CRM or products and product parts in a PIM (product information management) system. The knowledge graph brings, actually or virtually, the data from these different systems into a graph database. A graph database is structured as nodes and edges (connections between nodes), rather than as the tables of rows and columns characteristic of a relational database. Data entities are at the nodes, and connections of relations or property types are designated along the connecting edges. The graph structure thus supports the model of the applied ontology, which has classes and individuals at the nodes and semantic relations or attribute types describing the edges.
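The node-and-edge structure can be illustrated with a toy graph held as subject-predicate-object triples in Python. This is only a sketch and not a real graph database: the organizations, events, and helper functions are invented, and the relation names reuse the examples given earlier.

```python
# A tiny graph held as (subject, predicate, object) triples.
# Nodes are entities; each triple is one labeled edge. All names are made up.
TRIPLES = [
    ("Organization X", "sponsors-event", "Event 2019"),
    ("Organization X", "sponsors-event", "Event 2021"),
    ("Organization Y", "sponsors-event", "Event 2021"),
    ("Event 2019", "is-located-in", "Austin area"),
    ("Event 2021", "is-located-in", "Boston area"),
]


def objects(subject, predicate):
    """Follow edges with the given label out of a node."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def event_locations(organization):
    """Two-hop traversal: organization -> sponsored events -> locations."""
    locations = []
    for event in objects(organization, "sponsors-event"):
        locations.extend(objects(event, "is-located-in"))
    return locations


print(event_locations("Organization X"))
```

Answering the question about where an organization's events took place is a matter of following two edges, with no table joins to define in advance.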

Why knowledge graphs? Taxonomies, controlled vocabularies, and metadata alone are good for finding information in a single content/data repository, database, or content management system. But often the same, similar, or related information exists in multiple different sources or systems, as data or as content “silos,” such as product information residing in the PIM, the web ecommerce platform, the marketing content management system, and the sales management system. By extracting the data from these different sources and storing it in a single graph database, the connections between the data from all sources can be made.

Knowledge graphs link data that is in different repositories and systems, both structured and unstructured data, and as such provide a unified view of the data. Furthermore, with taxonomies additionally tagged to content, relevant data and content can be linked to each other.

Opportunities for taxonomies and data together

In conclusion, taxonomies alone are focused on content, but if you combine taxonomies with ontologies and/or diverse metadata, you extend the use of taxonomies to data. I am also seeing the connections of taxonomies and data in more places.

My current job title is Data and Knowledge Engineer, which reflects the combination of the knowledge management and data science realms. Actually, I am not a data engineer at all, but my department at Semantic Web Company has standardized the job titles, as we knowledge engineers and data engineers work very closely together on the same teams. This is to provide combined services and solutions to our customers.

Data and taxonomy are combined in jobs in other ways, too. Last year I had a contract taxonomy job that was heavily into data (managed in spreadsheets). In the other direction, data-related job postings have taxonomies in their job descriptions. A search today on “taxonomy” in descriptions of LinkedIn jobs brought up Data Governance Consultant, Data Analyst II - Taxonomy, Taxonomy Data Architect, Data Custodian, and Data Governance Lead in the top 25 results, and on Indeed it brought up Data Analyst, Junior Data Analyst, Data Annotator, and Data Entry Specialist in the top 15 results.

I have most keenly noticed this combination of taxonomies and data by participating in more data-related conferences recently. In 2021, among other conferences, I have spoken on taxonomies at Data-Centric Architecture Forum in February, the European Data Conference on Reference Data and Semantics (ENDORSE) in March, the Knowledge Graph Conference in May, and Data Con LA in September. Others include my masterclass “Foundation for a Knowledge Graph: Taxonomy Design Best Practices” at the virtual Connected Data World conference on December 2, and a tutorial “Introduction to Taxonomies for Data Scientists”  and presentation “The Future of Taxonomies – Linking Data to Knowledge” both at Data Day Texas in Austin, TX, in late spring 2022 (postponed from January 22, 2022).

Thursday, September 30, 2021

Taxonomies for Human Resources

I just attended the HR Technology Conference this week, my first time at an industry or functional specialty conference, so it was interesting to learn how taxonomies could be positioned within this specialized sector. I usually speak or write about taxonomies as useful in general knowledge and information management, with the only specialization discussed being ecommerce.


Human resources technology is a broad category, which includes software for such functions as benefits, compensation, engagement and recognition, learning management, onboarding, payroll, recruitment, screening, time and attendance, wellness, etc. Taxonomies are not particularly relevant for most of these areas, but are for some, such as talent management systems, job boards, and intranets. Performance management and training management may also benefit from taxonomies. 

In the conference opening keynote “HR Technology Reinvented: The Big Shift Towards Work Tech” presented by Josh Bersin, I was pleased to hear that this HR technology industry analyst had as #3 among his industry trends: “Skills taxonomies are the next big thing,” and he had a slide illustrating how a “taxonomy is more complex than you think.” Reasons Bersin gave for the complexity: a skill is not well defined, skills differ even in the same industry, and companies cannot trust black box skills.  
 
Taxonomies may also be implemented as part of a knowledge graph solution that links data in multiple applications, systems, and repositories, which is a typical scenario for HR technology, despite the existence of some degree of integration of functions within HR management systems (HRMS) or human capital management (HCM) software.


Another point that Bersin made was that the talent marketplace has become a category. It’s become more important to recruit and hire internally, so an internal marketplace for employees and jobs can be created. I find this also an interesting application for taxonomies. Taxonomies in business and industry are well established and known for ecommerce, which is B2C, but more recently taxonomies have been implemented in B2B and C2C marketplaces, such as Etsy. In an employee-job marketplace, taxonomies can be used to tag employee skills, interests, and locations, along with the job openings.

 

The talent marketplace was also discussed by the second day’s keynote speaker Ravin Jesuthasan, who additionally explained how the internal talent marketplace can connect workers to projects, assignments, and tasks, rather than simply job openings. He also referred to a market relationship and to matchmaking. On the subject of matchmaking, I found a vendor of a platform to match employees to coaches or mentors an interesting use case for a taxonomy.


Another trend is that employee learning or training has a more important role in the flow of people to work. There is also a potential for taxonomies to support this endeavor. Depending on the volume, the findability of training materials could benefit by being tagged with terms from a taxonomy. A taxonomy can also support the recommendation of appropriate training courses to employees.


Finally, there is a lot of emphasis placed on employee experience, which was the number one trend in Bersin’s keynote.  One way to improve the employee experience, which was not mentioned in the keynote, is to have a single user-interface that, with a single, consistent taxonomy, links content and data in different systems. So, the users have only a single place to go to find answers to all of their employment-related questions.