Friday, December 17, 2021

Named Entities in Taxonomies

I have long felt that there is some uncertainty as to where named entities (names of specific people, places, organizations, products, etc.) fit into taxonomies. Standards suggest one way, and practice tends to follow different way in dealing with these proper nouns. As taxonomy trends evolve so does the position on these named entities. The fact that taxonomies are not well-defined leaves it open to question as whether to taxonomies should have any named entities in them, or if taxonomies should comprise only topics."Hello my Name Is" badge

Historical trends

A historical perspective is needed. Modern, digital information retrieval taxonomies evolved out of thesauri. Thesauri, which originally came out in print format, first appeared in the 1960s and then were formalized by various standards published in the 1970s. The thesaurus standards state clearly that the relationships between a named instance and its type is one of the three kinds of hierarchical relationships permitted and supported in thesauri (the other two being generic-specific and whole-part). While taxonomies may omit the associative (related term) relationship of thesauri, they tend to follow the hierarchical standards of thesauri. Thus, named entities could be included in the taxonomy as the narrowest terms, narrower to a term for whatever “type” they are. But should it always be this way?

Then faceted taxonomies started being implemented in the early 2000s, first in ecommerce and then by the end of the decade in intranets, content management systems, digital asset management systems, and various content-rich websites. Once facets became adopted in information retrieval applications (aside from ecommerce), it became obvious from a user design perspective that named entities belonged in a different facet than the subjects. Facets are for refining a complex search query by different aspects. Sometimes these aspects follow the types of questions: What? Who? Where? When? “What” is usually for a subject,” but “who,” “where,” and “when” (for taxonomy terms naming events, not date ranges) refer to named entities. Sometimes people start a query about a subject, and sometimes  people start a query about a named entity, and facets allow people to start off searching any way they wish.

Then in 2009 the World Wide Web Consortium published the Simple Knowledge Organization System (SKOS) recommendation for taxonomies, thesauri, and other controlled vocabularies, which over the following decade became adopted as the standard model for building machine-readable taxonomies. One of the elements described in SKOS is that of the concept scheme, which is defined merely as “an aggregation of one or more SKOS concepts.” There is nothing comparable in the thesaurus standards. While a taxonomist may choose what to do with an “aggregation” of concepts, it has proven practical to separate out different kinds of named entities into concept schemes separate from concept schemes for topics. Thus, the widespread adoption of SKOS has contributed to the trend of separating different named entity sets, which had already started with faceted taxonomies.

My initial, and longest, experience in the domain of taxonomies and controlled vocabularies was as a controlled vocabulary editor at the library database vendor Gale. At Gale (and its predecessor company), named entity controlled vocabularies ("name authorities") have been separate from the subjects, but there were reasons for this. The named entities (named persons, companies, organizations and agencies, named works, products, laws, events, and fictional characters), each have had different sets of attributes and rules for maintenance.  Some even have different customized relationships with other controlled vocabularies. Interestingly, it was not always this way. Before I joined in the mid-1990s, some of these named entities (agencies, organizations, works, geographics, and events) were mixed in with the “descriptors” in a Subject MegaFile. But eventually specific attributes and relations, not to mention the growing number of terms and a new vocabulary management system, combined to make it more logical to split off each of the named entity vocabularies. The Events were the last to be split out of the Subjects.  So, it’s not because the controlled vocabularies were named entities per se, but rather their growing specialized maintenance needs due to an increase in specific attributes that led to managing them as separate controlled vocabularies. Attributes include, for example, birth date and place for a person, latitude and longitude for a location, and website URL and address for companies and organizations, among many more.

Taxonomies and ontologies

This feature of attributes brings us to the most recent trend in taxonomies, which is the occasional, but growing, convergence of taxonomies and ontologies. Ontologies divide up a knowledge domain into classes, and each class (like the Gale named-entity controlled vocabularies) has its own set of attributes and customized relationships with other classes. Ontologies, according to the Web Ontology Language (OWL) standard, however, have a different perspective on named entities. Ontologies are comprised of classes and subclasses, in hierarchies, which, in turn contain “instances” or “individuals,” which are unique named entities. The relationships between an instance and a class (or subclass) is not, however, considered hierarchical, but rather of a “member” type. Thus, while thesauri make no distinction for named entities, and taxonomies separate out name entities when it’s practical, ontologies make a strict distinction.

Furthermore, for ontologies, which originated in the domains of philosophy and computer science, a named entity as a proper noun is not what matters. Rather, it’s the fact that the instance is unique, and there is only one. This is true for people, companies/organizations, and places. It is not true for brand name products, though. A named product is a proper noun, such as MacBook Pro or Honda Accord, but it is not a unique instance, because there are millions of individual MacBook Pros and Honda Accords in existence. It’s a similar matter for named works, such as books, where one title has millions of copies. “Named entities” or “proper nouns” are grammatical or linguistic designations, which are OK for taxonomies and thesauri, but are not a feature of ontologies, with their philosophical origins.

Fortunately, you don’t have to worry about this philosophical problem if you choose to follow the approach of applying a high-level ontology model to an existing taxonomy or set of controlled vocabularies to extend the ontology with specific terms and named entities (or, from the other direction, to extend the taxonomy with semantic relations and attributes). The OWL-based ontology then may comprise only as many classes and subclasses needed to designate the usage of distinct custom relations and attributes.  With this approach, a different ontology class is mapped to each subset or hierarchy or SKOS concept scheme of a larger taxonomy. Each named entity type would typically correspond to a different ontology class, based on the named entity’s own attributes and relations. So, each named entity type would be in its own controlled vocabulary or SKOS concept scheme.

Just because OWL ontologies may include named instances as members of a subclass, does not mean you have to set up your knowledge model that way. This is similar to the idea of the thesaurus standard, which permits named entities to be narrower terms to generic subjects, but you don’t have to set it up that way. Omitting an option described in the thesaurus or ontology standards does not mean you are not in compliance with those standards.  

So, in conclusion, while some things about taxonomies have remained constant, other things, such as where to put named entities, have changed over time.

No comments:

Post a Comment