Thursday, December 19, 2024

Ontologies vs. Knowledge Graphs

At the Connected Data London (CDL) conference I attended last week, ontologies were humorously referred to as the “O” word. The thought was that, until recently, experts preferred not to mention “ontology,” lest they alienate their audience, customers, or stakeholders. The word comes across as too technical. It is a term from philosophy, after all, and it does not help that it sounds very similar to “oncology” (as “taxonomy” has been confused with “taxidermy”). The term “knowledge graph,” on the other hand, is more user-friendly, and even if it is not perfectly understood, its general meaning can be guessed. Thus, people would refer to knowledge graphs regardless of whether they meant a knowledge graph or an ontology.

At the conference, however, there was discussion of a growing acceptance of the word “ontology,” not just among experts but also among the varied stakeholders who need to implement ontologies. This was noted by several conference speakers, especially in the wrap-up panel session for the Data Modeling track, which was titled “The ‘O’ Word: How Ontologies Drive Interoperable Data and Business Innovation.” The panel moderator, Katariina Kari, attributed this recent shift to LLMs, explaining: “We need a reliable natural language repository. LLMs works on a network of mimicking language, LLMs are primed for language.” So now, use of the word “ontology” can even help a startup get funding from venture capitalists, she observed.

However, there remains some confusion over what an ontology is. At one end there is the difference between ontologies and taxonomies, and at the other end the difference between ontologies and knowledge graphs. I clarified the distinction between taxonomies and ontologies in a prior blog post, “Taxonomies vs. Ontologies” (January 2023). While knowledge graphs are a relatively new concept, and ontologies have existed for much longer, it is the varied understanding of ontologies that has given rise to confusion.

An ontology is defined as a model of a domain of knowledge, which comprises classes (sets of things), attributes (types of characteristics of things) and relationships between classes. According to this definition, an ontology is a somewhat generic model of a domain, and it does not include all of the individual members or instances of each class (such as the names of individual companies in the class called Company) nor the specific attributes of each attribute type (such as the address of each specific company for the attribute type called Address).

However, the W3C recommendation for ontologies, OWL (Web Ontology Language), includes the designation “individuals,” and ontology software tools, such as Protégé, support the inclusion of individuals and their specific attributes. Thus, it is easy to think that an ontology, by definition, includes all specific individuals. But the fact that OWL specifies how to include instances of a class, and that software supports including them, does not necessarily mean that the instances or individuals are actually a component of an ontology. The ontology experts on this CDL conference panel confirmed that an ontology is the upper-level semantic model.

Then, what do we call an ontology plus all of the individual members (instances) of classes and their specific attributes? That is essentially what a knowledge graph is. This is especially true when individuals are specific to an organization or enterprise, such as names of individual customers, products, employees, etc., and we call that an “enterprise knowledge graph.”
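The distinction can be sketched in a few lines of code. This is only an illustration under my own assumptions, not a standard representation: the names `Ontology`, `Individual`, and `KnowledgeGraph`, and the sample company and product, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Ontology:
    """The generic model: classes, attribute types, and relationship types only."""
    classes: set[str]
    attribute_types: dict[str, set[str]]  # class name -> allowed attribute types
    relationship_types: set[tuple[str, str, str]]  # (subject class, relation, object class)


@dataclass
class Individual:
    """A specific member (instance) of a class, with its specific attribute values."""
    name: str
    class_name: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class KnowledgeGraph:
    """An ontology plus the individuals and the links between them."""
    ontology: Ontology
    individuals: dict[str, Individual] = field(default_factory=dict)
    links: set[tuple[str, str, str]] = field(default_factory=set)

    def add_individual(self, individual: Individual) -> None:
        # An individual must belong to a class that the ontology defines and
        # may only carry attribute types that the ontology allows for that class.
        if individual.class_name not in self.ontology.classes:
            raise ValueError(f"Unknown class: {individual.class_name}")
        allowed = self.ontology.attribute_types.get(individual.class_name, set())
        unknown = set(individual.attributes) - allowed
        if unknown:
            raise ValueError(f"Unknown attribute types: {sorted(unknown)}")
        self.individuals[individual.name] = individual

    def add_link(self, subject: str, relation: str, obj: str) -> None:
        # A link between two known individuals must instantiate a
        # relationship type that the ontology defines between their classes.
        pattern = (
            self.individuals[subject].class_name,
            relation,
            self.individuals[obj].class_name,
        )
        if pattern not in self.ontology.relationship_types:
            raise ValueError(f"Relationship not in ontology: {pattern}")
        self.links.add((subject, relation, obj))


# The ontology names no specific company or product.
ontology = Ontology(
    classes={"Company", "Product"},
    attribute_types={"Company": {"Address"}, "Product": set()},
    relationship_types={("Company", "sells product", "Product")},
)

# The knowledge graph instantiates it with (made-up) individuals.
kg = KnowledgeGraph(ontology)
kg.add_individual(Individual("Example Corp", "Company", {"Address": "1 Sample Street"}))
kg.add_individual(Individual("Example Widget", "Product"))
kg.add_link("Example Corp", "sells product", "Example Widget")
```

In this sketch the ontology stays small and stable, while the knowledge graph can grow to any number of individuals and links, each checked against the model.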

The first applications of ontologies in information/data science were in biomedicine, in which individuals included such things as names of organisms (including bacteria and viruses) and chemicals, etc. Thus, the notion of an individual in science is not quite the same as in business, which has also been a source of confusion over what an individual is and the inclusion of individuals in an ontology. In enterprise knowledge graphs, the instances can be very numerous and specific, including individual “events,” such as interactions or transactions.

In conclusion, an ontology is typically a defining feature and component of a knowledge graph, but it is not all of what goes into a knowledge graph. A knowledge graph also includes individuals, which may be named entity instances or they may be specific taxonomy concepts (abstract things that are not unique named entities, such as the concepts “Data ethics” or “Performance measurement”), and a knowledge graph also includes specific attributes of individuals. It may be said that a knowledge graph is the instantiation of an ontology, and an ontology is the knowledge model. Katariina further explained: “knowledge graphs that actually follow an ontology will have an LLM perform better than just a KG that is unharmonized, not yet adhering to a clear ontology.”

Thursday, October 31, 2024

The Semantic Data Conference

I was honored to be accepted to speak at the first “Semantic Data” conference in New York, a one-day event held on October 23, following the inaugural event held in London on June 27. Semantic Data, organized by Henry Stewart (HS) Events, is co-located with its better-known DAM (Digital Asset Management) conference, which has been running for over 20 years in New York, London, and Los Angeles.

The full name of the conference was “Semantic Data: Taxonomy, Ontology, and Knowledge Graphs,” so the conference was less focused on data than on what you can do with data and content when combined with the semantics of taxonomies and ontologies. There was no presentation dedicated to knowledge graphs this time, with only sessions in the single-day one-track event. Less of a focus on knowledge graphs was fine, since the Knowledge Graph Conference, held in New York in May, covers that topic very thoroughly over multiple days. The emphasis on “semantics,” though, is welcome, since there is no conference dedicated to that subject in the United States. (There is the SEMANTiCS conference in Europe, but it is semi-academic.)

 

Presentations at Semantic Data, New York

The topics of the sessions at Semantic Data included: securing taxonomy and ontology strategy buy-in, why and how to connect taxonomies and ontologies, use of MS Copilot in taxonomy development, a use case in leveraging an LLM for content integration and a consumer-based semantic layer, and how to apply semantic models (taxonomies and ontologies) that reduce biases, especially for machine learning models. The opening keynote by Lulit Tesfaye was on realizing the semantic layer, and the closing keynote by Gary Carlson and Bram Wessel of the lead sponsor, Factor, was on building an organizational semantic mindset. Additional sponsored talks were on how ontologies accelerate innovation in the life sciences, as done by the sponsor SciBite, and how semantics enhances modern data platforms, such as that of the sponsor Datavid.

I presented “Taxonomies to Ontologies: How When and Why to Connect or Extend.” I summarized the benefits of taxonomies and ontologies, including what you could or could not do with each alone, but what you could do with both combined. The fact that both taxonomies and ontologies are now based on compatible Semantic Web standards, which are supported by many tools, makes it easy to combine or extend them. Whether you are “combining” a taxonomy with an ontology or “extending” a taxonomy into an ontology depends merely on your starting point and definition of ontology. Now that I am again vendor neutral, I included screenshots from four different commercial tools for combined taxonomy/ontology management.

About the Semantic Data Conference 2024

Semantic Data New York was similar to Semantic Data Europe (London) in its format and organization. Both provided a combination of session types: instructional talks, industry use cases, round table participant discussions, and thought leadership panels. Both events were chaired by Madi Weland Solomon and featured the same keynote presentation by Lulit Tesfaye on the subject of the semantic layer. The rest of the speakers differed between the two events, and each event had different sponsors, based on geographic location. While there were only three sponsors of Semantic Data in New York and only two in London, they shared the same exhibit hall with the main DAM (digital asset management) conference and thus reached a wider audience.

The London and New York events had a similar number of registrants, about 50. Although the larger co-located DAM conference had separate registration, some registrants of the DAM conference were also seen in Semantic Data sessions. Registrants of Semantic Data represented diverse industries, including financial services, healthcare, software/technology, media, entertainment, publishing, travel and tourism, education, government, and consulting. Roles were also diverse, including company leadership, project and program managers, IT, and content/DAM/taxonomy/information architecture practitioner roles.

I find that the roles and activities of taxonomists, ontologists, information architects, digital asset managers, etc. overlap, so a conference dedicated to semantics brings them together for knowledge sharing. This way, their projects can also be broadened and shared within their organizations. I hope the Semantic Data conference can grow in the future to fill this need, and I look forward to next year.

Monday, September 30, 2024

Topical Taxonomies for Filtering Searches

PoolParty GraphSearch
We taxonomists have long been advocating how a taxonomy of disambiguated concepts tagged to content retrieves more accurate results than search algorithms alone. But if users prefer simply entering text strings into a search box and not browsing taxonomies, how best to support users with a taxonomy can be a challenge.

A faceted taxonomy with taxonomy aspects as filters for refining search results has become a common taxonomy solution, especially for intranets, partner portals, and knowledge bases. For these purposes, certain facets, such as Content type, Product/Service, Location, and Department, are common and logical. When it comes to designating “Topics,” however, it’s not so easy.

Specific Terms Gathered from Analysis

When gathering information and sources for terms, most sources will yield highly specific terms. These include terms arising from search log analysis, brainstorming sessions with sample users, automated text analytics term extraction from a large corpus of content, and manual review of a representative sample of documents/pages. These are all standard methods for taxonomy design, which I conduct as a consultant.

The difficulty is that there are often so many specific topics that the new topical taxonomy could potentially have many hundreds of terms. Some may be relevant to only one or two documents or may occur in only a couple of searches out of thousands. They would not serve the purpose of refining searches.

Another problem is that many of the terms suggested from these methods are not even topical. Often, the top searches found in search logs of enterprise/intranet searches are for commonly used named tools, platforms, or services.

The main issue, however, in deriving terms for a topical facet/filter based on search terms is that the objective of the topical facet, like all facets, is to limit searches, not to duplicate searches. What is really needed in the topical facet are topical categories that are broader than the search terms. How to identify these broader topical categories can be more challenging.
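To make the limiting role of facets concrete, here is a minimal sketch of a text search followed by facet filtering. The documents, facet names, and topical categories are all made up for illustration, and the substring search merely stands in for a real search engine.

```python
from collections import Counter

# Hypothetical intranet documents, each tagged with terms from three facets.
DOCUMENTS = [
    {"title": "Travel expense policy", "Content type": {"Policy"},
     "Department": {"Finance"}, "Topic": {"Travel and expenses"}},
    {"title": "Expense report form", "Content type": {"Form"},
     "Department": {"Finance"}, "Topic": {"Travel and expenses"}},
    {"title": "Expense software training guide", "Content type": {"Guide"},
     "Department": {"IT"}, "Topic": {"Tools and systems"}},
    {"title": "Holiday schedule", "Content type": {"Policy"},
     "Department": {"Human Resources"}, "Topic": {"Time off"}},
]


def search(query, documents):
    """Naive text search: case-insensitive substring match on the title."""
    return [doc for doc in documents if query.lower() in doc["title"].lower()]


def facet_counts(results, facet):
    """Count how many results carry each term of a facet (for display next to filters)."""
    return Counter(term for doc in results for term in doc[facet])


def apply_filters(results, selected):
    """Keep results tagged with at least one selected term in every filtered facet."""
    return [
        doc for doc in results
        if all(doc[facet] & terms for facet, terms in selected.items())
    ]


hits = search("expense", DOCUMENTS)
topic_counts = facet_counts(hits, "Topic")
narrowed = apply_filters(
    hits, {"Topic": {"Travel and expenses"}, "Content type": {"Policy"}}
)
```

Here the search term “expense” is specific, while the broader topical category “Travel and expenses” limits the three hits to two, and a second facet limits them to one. A “Topic” term that merely repeated the search term would not have narrowed anything.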

Identifying Broader Topical Categories

Identifying broader terms or categories for topic filters is not as simple as identifying specific search terms, nor as straightforward as identifying the set of facets. Typical methods of obtaining candidate terms from both users and from the content need to be done, but with a focus on identifying broader terms or categories.

Categories from Stakeholder Engagement

Engaging stakeholders or other sample users in activities to brainstorm taxonomy terms will result in a mix of specific and broad terms. It is then the task of the taxonomist-facilitator to help guide the participants to identify which terms are broader and which are narrower within the same topical facet. Involving stakeholders/sample users is important, because if a single taxonomist or an external consulting team tries to do this on their own, their designated broader terms, while hierarchically correct, might not suit the intended users. The taxonomist-facilitator may suggest broader terms and then obtain immediate validation from the participants of the appropriateness of those suggestions.

Categories from Content Analysis

Analyzing content for broad topics is more effectively done manually than with automated methods. Manual content analysis will yield both specific and potentially broader concepts. A taxonomist or content strategist experienced in content analysis for identifying meaning will be able to determine the main concept for a piece of content.

Automated methods, based on text analytics technologies, tend to focus on term extraction, and will extract terms even more specific and less useful than search log results. However, if a list of derived search terms is large enough (as many search logs or automated term extraction lists tend to be), another, newer option is to make use of LLM and generative AI technologies to categorize the specific terms and thus generate broader terms. The LLMs should be trained on the same or similar content, which is internal enterprise content, not the public web, to provide the correct context. Even then, the identified broader terms or categories will not always be correct and will require an experienced taxonomist to review.

Other Topical Facets

Topical terms, however, do not all have to be in a single “Topics” facet. Depending on the use case, there could be other topical facets, which are not the usual named entities, departments, locations, or product/service types. These could be for Function, Activity, Issue Type, Technology, Research Field/Discipline, etc. If and how to break out these facets can be a challenge and should involve extensive discussions or other research with stakeholders and user representatives.

Finally, a topical facet for filtering search results could even be based on the existing navigation menu’s top levels, especially on an intranet or an enterprise content management system. Facets as filters are available to refine searches only, but if users choose instead to navigate the site menu, then they have no options to use other facets/aspects to help restrict what they are looking for. By duplicating the navigation menu’s one or two top levels into a facet, perhaps called “Topic Area,” users can limit a search with the categories for the areas with which they are familiar, and they can also restrict the search further by filtering on terms selected from any of the other facets.

I will be discussing the wider activity of coming up with terms for a taxonomy in my upcoming Taxonomy Boot Camp presentation, “The Complete Guide to Sourcing Terms,” on November 18, in Washington, DC.


Sunday, August 18, 2024

Taxonomies and Ontologies as Semantic Models

In describing what taxonomies and ontologies are and what they can do, we are hearing the word “semantics” more often. “Semantics” means “meaning,” which is nothing new, and taxonomies and ontologies are not new. What is new is that taxonomies and ontologies are now combined more, and we need a way to describe them together, which is where the descriptor “semantic” comes in. Furthermore, taxonomies and ontologies are being implemented in new and expanded applications, where the word semantic(s) has significance.

Semantics in Taxonomies and Ontologies

Taxonomies have semantics in their concepts. A taxonomy is not just a term base or a term list, but rather is an organized set of concepts, each with its own unambiguous meaning. The concepts bring together different labels, like “synonyms” for the same thing, and their meaning and usage are further clarified by their arrangement in a hierarchy. It’s often said that a taxonomy comprises “things” (concepts), not mere “strings” (of text).

Ontologies have a higher level of semantics than taxonomies. Even if they don’t contain synonyms, the relationships between concepts (entities) and sets (classes) of entities have additional semantics. The relationships in an ontology convey meanings beyond mere hierarchy or a generic “related term.” For example, relationships between entities may be “is located in,” “has customer,” and “sells product.” Furthermore, entities in an ontology may have various types of attributes, such as contact information for offices and people, which is another application of semantic data.

Bringing Together Taxonomies and Ontologies

Taxonomies and ontologies have different origins, but now they are increasingly based on shared Semantic Web data models and guidelines, which enables them to be integrated seamlessly. Taxonomies have their origins in library science structures, including thesauri, subject headings, and classification schemes. Ontologies have their origins in computer science and data science with a focus on data models.

Combining them brings the benefits of both: the linguistic aspect of controlled terminology and its synonyms with hierarchical structure in taxonomies and the custom semantic relationships and other additional properties provided by ontologies. This allows users to search for concepts/things, not just text strings, while also linking to other things related in a specific way and being able to create complex multi-step queries.

Taxonomies are considered a kind of “controlled vocabulary” or “knowledge organization system.” Ontologies are considered a kind of “knowledge model” and a knowledge representation system, rather than a knowledge organization system. When we combine taxonomies and ontologies or speak of them collectively, it’s logical to use the word “semantic,” whether as semantic structures or semantic models, because they both involve semantics and both are usually based on Semantic Web guidelines.

Taxonomies are increasingly based on the Semantic Web recommendation (published by the World Wide Web Consortium) of SKOS (Simple Knowledge Organization System), which is based on RDF (Resource Description Framework). Most ontologies are based on RDF-Schema, an extension of RDF, and OWL (Web Ontology Language), another Semantic Web recommendation. The data models of SKOS, RDF, RDF-S, and OWL may all be integrated into the same knowledge model for a combined taxonomy-ontology. Most software for dedicated taxonomy-ontology management uses these data models.
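The sketch below shows, in simplified form, how terms from these vocabularies can sit side by side in one model. It uses plain tuples with prefixed names standing in for full IRIs; the `ex:` resources are hypothetical, and typing a SKOS concept as a member of an OWL class is just one possible way to bridge the two. A real project would use an RDF library or a dedicated taxonomy-ontology management tool.

```python
# One graph mixing RDF, RDF-S, OWL, and SKOS terms (prefixed names as strings).
GRAPH = {
    # Ontology part (OWL and RDF-S): classes and a relationship type
    ("ex:Company", "rdf:type", "owl:Class"),
    ("ex:Product", "rdf:type", "owl:Class"),
    ("ex:sellsProduct", "rdf:type", "owl:ObjectProperty"),
    ("ex:sellsProduct", "rdfs:domain", "ex:Company"),
    ("ex:sellsProduct", "rdfs:range", "ex:Product"),
    # Taxonomy part (SKOS): concepts, labels, and hierarchy
    ("ex:computers", "rdf:type", "skos:Concept"),
    ("ex:computers", "skos:prefLabel", "Computers"),
    ("ex:laptops", "rdf:type", "skos:Concept"),
    ("ex:laptops", "skos:prefLabel", "Laptops"),
    ("ex:laptops", "skos:altLabel", "Notebook computers"),
    ("ex:laptops", "skos:broader", "ex:computers"),
    # Bridge (an assumption of this sketch): the concept is also a member of a class
    ("ex:laptops", "rdf:type", "ex:Product"),
}


def values(subject, predicate):
    """All objects for a given subject and predicate."""
    return {o for s, p, o in GRAPH if s == subject and p == predicate}


def find_concept(label):
    """Look up a concept by preferred or alternative label, ignoring case."""
    for s, p, o in GRAPH:
        if p in ("skos:prefLabel", "skos:altLabel") and o.lower() == label.lower():
            return s
    return None


def incoming_properties(concept):
    """Ontology properties whose range is a class the concept belongs to."""
    classes = values(concept, "rdf:type")
    return {s for s, p, o in GRAPH if p == "rdfs:range" and o in classes}


concept = find_concept("notebook computers")
broader = values(concept, "skos:broader")
usable_in = incoming_properties(concept)
```

The taxonomy side resolves a synonym to a concept and its broader concept, while the ontology side says which relationship types can point to it.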

Semantic Search and Semantic Tagging


Taxonomies support semantic search and tagging. “Semantic search” is the third-ranked autocomplete suggested search phrase in a Google search I did recently on “semantic,” so this is clearly a popular application of semantics. Semantic search refers to search that focuses on concepts and meaning rather than just strings of text. This is not new, but since search that is based on text strings and statistical algorithms is so common, improving search results with the focus on semantics is getting more attention.

Semantic search is best enabled with the tagging of taxonomy concepts, which we may call “semantic tagging” (which I first heard of when asked to write an article on it in 2008). Advanced text analytics technologies, going beyond entity recognition and natural language processing to include natural language understanding so as to analyze sentence structure, syntax, and sentiment, can also yield search results based somewhat on meaning and not just words.

Semantic Data

Taxonomies are traditionally for tagging and retrieving content, whereas ontologies are traditionally for exploring and retrieving data. The combination of a taxonomy and an ontology enables users to retrieve both content and data that are related to each other. Semantics for content is a given, because content (whether text, image, or other media), by its very nature, has meaning. Data by itself may not have much meaning, unless it is related to other data and that relationship has meaning, too. Thus, “semantic data” is significant. We hear reference to “semantic data” much more often than to “semantic content.”

You don’t need to add a taxonomy to content to make it “semantic” and understood (rather a taxonomy helps you find the content). However, depending on how data is presented, you may need to add an ontology or at least a semantic data model (a method to describe objects in a database and their relationship to one another) to make data “semantic.” Experts can analyze raw data, but the data is more valuable if non-experts can understand it, too, and that’s why “semantic data” is important. There is also a lot of attention on “semantic data models.”

Semantic Layer

The idea of a “semantic layer” as a framework or approach to make an organization’s information, both data and content, more structured, findable, and actionable, has been gaining popularity recently. Whether the “semantic layer” is new or just a new way of describing something is arguable.

A semantic layer is a standardized framework that organizes and abstracts organizational data and serves as a connector for all knowledge assets. It’s a method to bridge content and data silos through a structured and consistent approach that connects data instead of consolidating it, as data warehouses do. The idea of a “layer” is that it is part of an enterprise-wide architecture of information, data and content, that connects horizontally across siloed content and data repositories. Taxonomies and ontologies, in addition to potentially other knowledge organization systems, such as a business glossary, are key components of a semantic layer.

More Talk of Semantics with Taxonomies and Ontologies

I’ve definitely been hearing of “semantics” more in the world of taxonomies and ontologies, and now I am bringing the word more into my own presentations. Following are some past and future examples.

Wednesday, July 31, 2024

Subject Headings vs. Taxonomies

When I spoke about taxonomies at the recent SLA (Special Libraries Association) annual conference, I was asked how a taxonomy differs from a subject heading scheme. Librarians are very familiar with subject headings, which are used to catalog books and other library materials. This is an interesting question, which I answered briefly in my presentation session, but I’d like to explain further.

I have previously written about how a taxonomy differs from a classification in “Classification Systems vs. Taxonomies.” Taxonomies are more similar to subject heading schemes. Libraries use both classification systems (such as the Dewey Decimal Classification), which are for determining the physical location of books and other library materials on shelves based on their codes, and subject heading schemes (such as Library of Congress Subject Headings), which are used to identify books and other materials by their specific subject matter. The same subject could be used to catalog books and materials of different types (nonfiction, fiction, sound recordings, children’s) with very different classifications.

How Taxonomies and Subject Heading Schemes are Similar

Taxonomies and subject heading schemes are both considered types of controlled vocabularies, and they share similar uses and features. They both serve users who are looking up subjects to find information or resources available on the subject, rather than (or not yet) for identifying the physical location of the resource. In addition, they both:

  • have structures, but their focus is on the concepts
  • can be both searched and browsed
  • exist for both general and specific subject domains (Medical Subject Headings (MeSH) published by the National Library of Medicine is an example of a specific subject-domain subject heading scheme.)
  • have some structured, thesaurus-type relationships between terms, including broader/narrower and related
  • bring together different names, as synonyms/alternative labels/nonpreferred terms/used for terms
  • may include named entities (proper nouns for people, organizations, or geographic places) alongside topical subjects
  • may have scope notes on select terms

How Taxonomies and Subject Heading Schemes Differ

With so many similarities, one might wonder if there are any differences between subject heading schemes and taxonomies.

Subject heading schemes and taxonomies have different histories and originally different formats. Subject heading schemes were designed for the print format and have been adapted to digital environments, whereas information “taxonomies” as we know them have existed only since the emergence of digital navigation and search systems.

Structural Differences with Subdivisions

The name “subject headings” refers to the traditional browsable display of headings in an index, and under headings may appear sub-headings or subdivisions to further refine multiple references/citations/linked results. This structure is the main difference between subject heading schemes and taxonomies. The heading-subheading/subdivision structure is characteristic of back-of-the-book indexes and indexes to articles when such indexes previously appeared in print, although it is still used online.

A subject heading may be subdivided by the addition of different types of subdivisions: topical, geographical (such as a country name), chronological (such as century, decade, or war time), and form (for the content type, such as Periodicals). Some topical subdivisions are rather generic and can be applied to many headings, such as “Management,” “Research,” or “Law and legislation,” but most are specific to only a limited number of headings. For example, the subdivision “Lighting” is to be used under headings for structures, rooms, vehicles, installations, etc. See the full list of Library of Congress subdivisions.

The way that subdivisions refine a heading can be compared to the function of facets in a faceted taxonomy, which was noted by someone in the audience of my conference session. (See also the post “Faceted Classification and Faceted Taxonomies.”) Subdivisions and facets are both aspects of something. That does not mean, however, that a faceted taxonomy and a subject heading scheme are the same.

  • The structure of a faceted taxonomy has facets at the top-level, and the facets are relevant to a specific set of content, so they are aspects of the content, rather than aspects of a heading term.

  • There can be hierarchies of terms within a facet of a faceted taxonomy, but subdivisions do not have internal hierarchy. Instead, subdivisions may subdivide each other, but this is more like a prescribed navigation path, and they must follow a standard sequence. For example:
    English literature—20th century—History and criticism

Application Differences of Subdivisions vs. Attributes

Another facet-like implementation of taxonomies is to have attributes to refine the search results of a specific term within a hierarchical taxonomy. Attributes are common in e-commerce taxonomies, which involve a hierarchical taxonomy for product categories and attributes for product features. Attributes are more like subdivisions, in the way that they refine topics from the hierarchical taxonomy, but they are applied (tagged) differently than subdivisions.

The combination of a subject heading and a subdivision is done at the time of indexing an article or cataloging a book, and there are rules about which combinations are permitted. The combinations are indexed as if they were a single compound concept. Catalogers are required to use established heading-subdivision combinations and cannot just make up their own. Any string of multiple subdivisions must be applied in a prescribed order, such as geographic-topical-chronological-form for Library of Congress Subject Headings that are topics authorized for geographic subdivision.
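The rules described above can be sketched as a small function. This is a simplified illustration, not actual cataloging logic: it assumes the geographic-topical-chronological-form order mentioned above, uses a double hyphen as the separator, and checks topical subdivisions against a made-up authorization list (`AUTHORIZED_TOPICAL`) that has not been verified against the Library of Congress Subject Headings.

```python
# Assumed prescribed order for a topic authorized for geographic subdivision.
SUBDIVISION_ORDER = ("geographic", "topical", "chronological", "form")

# Hypothetical list of topical subdivisions permitted under each heading.
AUTHORIZED_TOPICAL = {
    "Museums": {"Lighting", "Management"},
}


def build_heading(heading, subdivisions):
    """Combine a heading with its subdivisions into one compound index string.

    `subdivisions` maps a subdivision type to its value, in any order.
    """
    unknown_types = set(subdivisions) - set(SUBDIVISION_ORDER)
    if unknown_types:
        raise ValueError(f"Unknown subdivision types: {sorted(unknown_types)}")

    # Catalogers cannot make up their own heading-subdivision combinations.
    topical = subdivisions.get("topical")
    if topical is not None and topical not in AUTHORIZED_TOPICAL.get(heading, set()):
        raise ValueError(f"'{topical}' is not authorized under '{heading}'")

    # Whatever order the input came in, the output follows the prescribed order.
    ordered = [subdivisions[kind] for kind in SUBDIVISION_ORDER if kind in subdivisions]
    return "--".join([heading, *ordered])


compound = build_heading(
    "Museums",
    {"form": "Periodicals", "topical": "Management", "geographic": "France"},
)
```

The result is a single string, which reflects how the combination is indexed as if it were one compound concept rather than as separate tags.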

Unlike subject headings and subdivisions in cataloging or indexing, taxonomy terms and their attributes for refinement:

  • are assigned more independently of each other, although the type of taxonomy term may restrict which attributes are available

  • have a greater number of attribute types available, and a piece of content is typically tagged with values from most or all of the attribute types

  • may have more than one attribute value of the same type applied (such as an item having two colors)

  • have no ranked order in which attributes are applied or searched
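A minimal sketch of attribute tagging, with invented products, categories, and attribute names, shows these points in practice: attributes are stored separately from the category, one attribute type can hold several values, and the order of the selected filters makes no difference.

```python
# Hypothetical e-commerce items: one hierarchical category plus independent attributes.
PRODUCTS = [
    {"name": "Striped scarf", "category": "Accessories > Scarves",
     "attributes": {"Color": {"Red", "White"}, "Material": {"Wool"}}},
    {"name": "Plain scarf", "category": "Accessories > Scarves",
     "attributes": {"Color": {"Blue"}, "Material": {"Cotton"}}},
]


def matches(product, selected):
    """True if the product has at least one selected value for every selected attribute."""
    return all(
        product["attributes"].get(attribute, set()) & chosen
        for attribute, chosen in selected.items()
    )


def refine(products, category, selected):
    """Restrict a category's products by attribute values, in no particular order."""
    return [
        p["name"] for p in products
        if p["category"] == category and matches(p, selected)
    ]


# The same two filters, selected in opposite order, give the same result.
color_first = refine(PRODUCTS, "Accessories > Scarves",
                     {"Color": {"Red"}, "Material": {"Wool"}})
material_first = refine(PRODUCTS, "Accessories > Scarves",
                        {"Material": {"Wool"}, "Color": {"Red"}})
```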

Convergence of Subject Headings Schemes and Taxonomies

While subject heading schemes and taxonomies have traditionally had different styles, they have become more similar in more recent decades.

Many subject heading schemes and taxonomies have both adopted thesaurus features. Originally, the Library of Congress Subject Headings had only See (Use) and See also relationships (like in an index), but in 1987 it adopted thesaurus relationships of broader term/narrower term, and related term in place of See also. Meanwhile the differences between taxonomies and thesauri have also been blurred, as taxonomies may have related-term relationships, and thesauri may have an over-arching hierarchical structure. The leading reason taxonomies and thesauri are difficult to distinguish, in my opinion, is that the same software tools are used to develop and manage both, and the software makes no distinction between “taxonomy” and “thesaurus.”

Another way in which subject headings have become more like taxonomies is that subject headings may be used without subdivisions. This is increasingly common as subject headings get reused in search and retrieval systems which do not support the complexity of subdivisions. For example, newer online publishers of medical information have adopted Medical Subject Headings without their subdivisions, which are still used by the National Library of Medicine. Additionally, auto-tagging is not easily done with multiple-level indexing. Without subdivisions, subject heading schemes are essentially the same as taxonomies, as long as they have a hierarchical structure.

Conclusions

Taxonomies have similarities to and differences from both classification systems and subject heading schemes. In fact, I would say that modern information taxonomies have inherited features of both. Taxonomies are not always well defined, but they are flexible and adaptable to business needs.

Controlled vocabularies have existed for a long time, but their applications are becoming more varied. This has led to differences and also convergences of their features. Nevertheless, certain controlled vocabularies are more common in certain implementations. Subject heading schemes remain common in libraries, whereas taxonomies are more common in business and commercial implementations.