The Accidental Taxonomist: Knowledge graphs


Thursday, December 19, 2024

Ontologies vs. Knowledge Graphs

At the Connected Data London (CDL) conference I attended last week, ontologies were humorously referred to as the “O” word. The thought was that, until recently, experts preferred not to mention “ontology,” lest they alienate their audience, customers, or stakeholders. The word comes across as too technical. It is a term from philosophy, after all, and it does not help that it sounds very similar to “oncology” (just as “taxonomy” has been confused with “taxidermy”). The term “knowledge graph,” on the other hand, is more user-friendly, and even if it is not perfectly understood, its general meaning can be guessed. Thus, people would refer to knowledge graphs regardless of whether they meant a knowledge graph or an ontology.

At the conference, however, there was talk of a growing acceptance of the word “ontology,” not just among experts but also among the varied stakeholders who need to implement ontologies. This was noted by several conference speakers, especially in the wrap-up panel session for the Data Modeling track, which was titled “The ‘O’ Word: How Ontologies Drive Interoperable Data and Business Innovation.” The panel moderator, Katariina Kari, attributed this recent shift to LLMs, explaining: “We need a reliable natural language repository. LLMs works on a network of mimicking language, LLMs are primed for language.” So now, use of the word “ontology” can even help a startup get funding from venture capitalists, she observed.

However, there remains some confusion over what an ontology is. At one end there is the difference between ontologies and taxonomies, and at the other end the difference between ontologies and knowledge graphs. I clarified the distinction between taxonomies and ontologies in a prior blog post, “Taxonomies vs. Ontologies” (January 2023). While knowledge graphs are a relatively new concept, and ontologies have existed for much longer, it is the varied understanding of ontologies that has given rise to confusion.

An ontology is defined as a model of a domain of knowledge, which comprises classes (sets of things), attributes (types of characteristics of things) and relationships between classes. According to this definition, an ontology is a somewhat generic model of a domain, and it does not include all of the individual members or instances of each class (such as the names of individual companies in the class called Company) nor the specific attributes of each attribute type (such as the address of each specific company for the attribute type called Address).

However, the W3C recommendation for ontologies, OWL (Web Ontology Language), includes the designation “individuals,” and ontology software tools, such as Protégé, support the inclusion of individuals and their specific attributes. Thus, it is easy to think that an ontology, by definition, includes all specific individuals. But the fact that OWL specifies how to include instances of a class, and that software supports the inclusion of instances, does not necessarily mean that the instances or individuals are actually a component of an ontology. The ontology experts on this CDL conference panel confirmed that an ontology is the upper-level semantic model.

Then, what do we call an ontology plus all of the individual members (instances) of classes and their specific attributes? That is essentially what a knowledge graph is. This is especially true when individuals are specific to an organization or enterprise, such as names of individual customers, products, employees, etc., and we call that an “enterprise knowledge graph.”
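The distinction can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions, not how OWL or any ontology tool represents things: the `Ontology` and `KnowledgeGraph` classes, the `add_individual` method, and the example company and address are all hypothetical names made up for the sketch. The ontology holds only classes, attribute types, and relationship types, while the knowledge graph wraps an ontology and adds the individuals with their specific attribute values.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """The knowledge model only: classes, attribute types, relationship types."""
    classes: set
    attribute_types: dict   # class name -> set of attribute type names
    relation_types: dict    # relation name -> (subject class, object class)


@dataclass
class KnowledgeGraph:
    """An ontology plus individuals (instances) and their specific attribute values."""
    ontology: Ontology
    individuals: dict = field(default_factory=dict)   # individual name -> class name
    attributes: dict = field(default_factory=dict)    # (individual, attribute type) -> value

    def add_individual(self, name, class_name, **attribute_values):
        """Add an instance after checking it against the ontology."""
        if class_name not in self.ontology.classes:
            raise ValueError(f"Unknown class: {class_name}")
        allowed = self.ontology.attribute_types.get(class_name, set())
        for attribute_type in attribute_values:
            if attribute_type not in allowed:
                raise ValueError(f"{class_name} has no attribute type {attribute_type}")
        self.individuals[name] = class_name
        for attribute_type, value in attribute_values.items():
            self.attributes[(name, attribute_type)] = value


# The ontology names the class Company and the attribute type Address,
# but contains no individual companies or addresses.
ontology = Ontology(
    classes={"Company", "Person"},
    attribute_types={"Company": {"Address"}, "Person": set()},
    relation_types={"is-employed-by-company": ("Person", "Company")},
)

# The knowledge graph adds the instance data (hypothetical values).
kg = KnowledgeGraph(ontology)
kg.add_individual("Example Company Inc.", "Company", Address="1 Example Street")
```

In this sketch the individual and its address live only in `kg`, never in `ontology`, which mirrors the view that the ontology is the model and the knowledge graph is its instantiation.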

The first applications of ontologies in information/data science were in biomedicine, in which individuals included such things as names of organisms (including bacteria and viruses) and chemicals, etc. Thus, the notion of an individual in science is not quite the same as in business, which has also been a source of confusion over what an individual is and the inclusion of individuals in an ontology. In enterprise knowledge graphs, the instances can be very numerous and specific, including individual “events,” such as interactions or transactions.

In conclusion, an ontology is typically a defining feature and component of a knowledge graph, but it is not all of what goes into a knowledge graph. A knowledge graph also includes individuals, which may be named entity instances or they may be specific taxonomy concepts (abstract things that are not unique named entities, such as the concepts “Data ethics” or “Performance measurement”), and a knowledge graph also includes specific attributes of individuals. It may be said that a knowledge graph is the instantiation of an ontology, and an ontology is the knowledge model. Katariina further explained: “knowledge graphs that actually follow an ontology will have an LLM perform better than just a KG that is unharmonized, not yet adhering to a clear ontology.”

Monday, July 31, 2023

Knowledge Graphs and Taxonomies

Knowledge graphs have recently emerged as an additional and growing use of taxonomies. A knowledge graph comprises data extracted and stored typically in a graph database with an ontology to semantically link types of data, but usually a knowledge graph also includes a taxonomy, thesaurus, or set of controlled vocabularies to provide consistent labeling. As a result of this combination, people involved in knowledge graphs are taking an interest in taxonomies, and people involved in taxonomies are taking an interest in knowledge graphs.

The traditional and still primary use of taxonomies is to consistently and comprehensively tag and retrieve content, whereas the focus of knowledge graphs is to access and make connections among disparate data. Content tagged and retrieved with taxonomies includes pages in websites, intranets, content management systems; documents in document management systems; and images and video files in digital asset management systems. Knowledge graphs link together data which includes records in databases, customer relationship management systems, product information management systems, and other enterprise systems, and the values in cells in spreadsheets, referenced by their row and column headers. By integrating a taxonomy into a knowledge graph, users can then retrieve both content and data on the same subject together.

What is a knowledge graph? The first definition that comes up today in a Google search that is neither sponsored nor from a vendor is from the Alan Turing Institute, the U.K. national institute for data science and artificial intelligence, which provides the following explanation on its Knowledge graphs interest group page:

Knowledge graphs (KGs) organise data from multiple sources, capture information about entities of interest in a given domain or task (like people, places or events), and forge connections between them. In data science and AI, knowledge graphs are commonly used to:
Facilitate access to and integration of data sources;
Add context and depth to other, more data-driven AI techniques such as machine learning; and
Serve as bridges between humans and systems, such as generating human-readable explanations, or, on a bigger scale, enabling intelligent systems for scientists and engineers.

From the taxonomy perspective, a knowledge graph is a combination of controlled vocabularies or a taxonomy with the semantic layer of an ontology, which adds custom semantic relations and attributes, plus specific instance data, which is stored in a graph database. A knowledge graph thus extends the use of a taxonomy beyond content to also include data. From the graph data perspective, a knowledge graph is the gathering of disparate data, which has been extracted, transformed, and loaded (ETL) into a graph database, where it is linked with semantic relations provided by an ontology and described by terms in a taxonomy, and it can be queried and analyzed all in one place.

GraphViews of SWC ESG Knowledge Graph

It is important for the definition of a knowledge graph to include its purpose and not just its components. The purposes include providing a unified view of data, easy availability of information, easy integration of new data, secure interoperability, visualization of entities and relations, the possibility of discovery and insights through semantic relations, and the support for complex multi-part queries with quick results. With the inclusion of a taxonomy, a knowledge graph can bring together both data and content in an organization.

With such lofty goals, knowledge graphs should be an area of interest not just of data scientists and ontologists, but also of information professionals (including taxonomists) and knowledge managers. This is gradually becoming the case. Knowledge graphs emerged in the 2010s and became popularized with the Google Knowledge Graph introduced in 2012. Knowledge graphs were first introduced at the KMWorld (Knowledge Management) conferences in 2017 as “semantic knowledge graphs,” and were also first mentioned at the Taxonomy Boot Camp conference that year. This November, the KMWorld conference has more talks on knowledge graphs than before. When I proposed multiple topics for this spring’s Information Architecture Conference, the conference chair chose the presentation introducing knowledge graphs. I also delivered a similar presentation this year to the joint Special Libraries Association and Medical Libraries Association conference.

I will be giving an updated version of those talks, “Knowledge Graphs for Information Professionals” as a free PoolParty webinar on Thursday, August 17, 11:00 – 12:00 EDT, after which the recording will also be available.

Saturday, October 30, 2021

Taxonomies for Data

Coming from an editorial content background, I have always valued taxonomies for making content findable, but more recently I have come to appreciate how taxonomies can also play a role in making data accessible and useful.

Taxonomies have successfully aided people in finding and retrieving desired content since the 1990s and even decades earlier, if we consider thesauri within the scope of taxonomies. Nevertheless, the focus had always been on content: originally printed content such as periodical articles, then web pages, intranet or CMS pages and attached documents, etc., and then multimedia content, such as images, animation or video clips, and audio files. Each content item gets tagged with taxonomy terms of different types for what it is about and for what kind of content it is. Taxonomies have become increasingly important as content volume and types have grown, especially as more people in varied roles create content.

Meanwhile, data has grown even faster in its volume and potential value. We are hearing more and more about big data, data warehouses, data lakes, data fabrics, data catalogs, data analytics, master data management, data governance, FAIR data, data-centric architecture, data-driven enterprises, and data science in general. Tools and technologies to make use of the data have included programming/scripting, machine learning, algorithms, natural language processing, and other forms of artificial intelligence.

These tools and technologies for data do not replace taxonomies and other controlled vocabularies, though, which still have an important role to play in connecting people to the desired data and information, and ultimately knowledge. I see two ways in which taxonomies are linked to data:

1. Managing and understanding the data in a standardized way with better metadata, which depends on controlled vocabularies.

2. Connecting the data with graph databases, knowledge graphs, ontologies, and ultimately taxonomies.

Taxonomies and metadata

Metadata refers to the standardized data types, properties, fields, or elements, and the specific individual values that populate those types or properties. From a content perspective, we think of metadata as serving content management and retrieval, such as the content’s format type, title, source, creator, date, language, subjects, category, audience, etc. But metadata exists in databases and spreadsheets, too, where column headers are the metadata properties. For example, contact metadata would include name, phone number, email address, city/state, country, contact type, initial contact date, contact owner, etc. Product metadata would include SKU number, product name, product type/category, price, color, features, supply source, retail availability, etc. Transactional metadata would include purchased product name, purchaser, purchase date, purchase price, purchase location.

Data can be better managed and analyzed if the metadata properties and values are standardized and controlled. Controlled vocabularies should be used to standardize the metadata for many of the properties: format type, source, subjects, category, purpose, country, contact type, product name, product category, color, features, availability, etc. Hierarchical taxonomies serve some of this metadata, such as product categories.

As an example, I’m planning to attend a conference in Austin, TX, and I wanted to look up contacts in the Austin area in my CRM (customer relationship management) system. Filtering results by city, I found some with the city of Austin, but others had the city of Round Rock. Filtering on Austin, I would have missed those, had I not known that Round Rock was a suburb of Austin. What was needed was a metadata property for “Metropolitan area,” rather than “City,” a controlled list of metropolitan areas, and Round Rock as an alternative label for Austin area in that controlled vocabulary.
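The fix can be sketched as a small lookup, shown here only as an illustration. The vocabulary structure, the function names `build_label_index` and `filter_by_metro_area`, and the contact records are hypothetical, and treating “Austin” itself as an alternative label for “Austin area” is my assumption for the sketch. A real CRM or taxonomy tool would manage this differently.

```python
# Hypothetical controlled vocabulary of metropolitan areas:
# preferred label -> set of alternative labels.
METRO_AREAS = {
    "Austin area": {"Austin", "Round Rock"},
}


def build_label_index(vocabulary):
    """Map every label (preferred or alternative, case-insensitive) to its preferred label."""
    index = {}
    for preferred, alt_labels in vocabulary.items():
        index[preferred.casefold()] = preferred
        for alt in alt_labels:
            index[alt.casefold()] = preferred
    return index


def filter_by_metro_area(contacts, metro_area, index):
    """Return contacts whose city value resolves to the given metropolitan area."""
    return [c for c in contacts if index.get(c["city"].casefold()) == metro_area]


# Hypothetical CRM records with an uncontrolled "city" field.
contacts = [
    {"name": "Contact A", "city": "Austin"},
    {"name": "Contact B", "city": "Round Rock"},
    {"name": "Contact C", "city": "Dallas"},
]

index = build_label_index(METRO_AREAS)
austin_contacts = filter_by_metro_area(contacts, "Austin area", index)
names = [c["name"] for c in austin_contacts]
```

Filtering on the metropolitan area now returns both the Austin and the Round Rock contacts, without the searcher needing to know which suburbs belong to the area.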

Taxonomies and ontologies

Taxonomies, controlled vocabularies, and metadata alone are good for filtering or queries to find content that meets a set of criteria (based on metadata properties or faceted taxonomy selections). But what if you want to discover and explore relationships across the data? Instead of merely looking for all the contacts in the Austin area that have the customer or sales-qualified-lead status and have a contact owner, I want to limit that further to contacts whose employers in turn meet certain criteria, such as belonging to specific industries or meeting an annual revenue minimum. Another query example would be to find the locations in the past 10 years of industry events in which a specific organization has participated. These connections across different metadata types, vocabularies, or categories, are made with an ontology.

An ontology has, besides any hierarchical relationships characteristic of a taxonomy, additional semantic relationships that connect across types or classes of entities. Classes may be for metropolitan area, company name, person name, industry event name, etc. Semantic relationships across these classes may include is-employed-by-company/employs-employee, sponsors-event/has-sponsor, is-located-in/is-location-of. Attributes are additional metadata for the entities of each class, such as address. “Ontology” typically refers to just the knowledge model of classes, relationships and attribute types. But to become useful in information retrieval and data analysis, an ontology is connected to a taxonomy or other controlled vocabulary to extend those semantic relationships and attributes to all the concepts/terms.
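To make the cross-class query idea concrete, here is a rough sketch that stores relationships as (subject, relationship, object) statements and follows them across classes. The relationship names `is-employed-by-company` and `is-located-in` come from the examples above, while `belongs-to-industry`, the helper functions, and all of the contacts and companies are hypothetical and invented for the illustration.

```python
# Hypothetical statements linking individuals of different classes.
STATEMENTS = {
    ("Contact A", "is-located-in", "Austin area"),
    ("Contact B", "is-located-in", "Austin area"),
    ("Contact C", "is-located-in", "Boston area"),
    ("Contact A", "is-employed-by-company", "Company X"),
    ("Contact B", "is-employed-by-company", "Company Y"),
    ("Contact C", "is-employed-by-company", "Company X"),
    ("Company X", "belongs-to-industry", "Software"),
    ("Company Y", "belongs-to-industry", "Retail"),
}


def objects_of(subject, relationship):
    """Everything the subject points to through the given relationship."""
    return {o for s, r, o in STATEMENTS if s == subject and r == relationship}


def subjects_of(relationship, obj):
    """Everything that points to the object through the given relationship."""
    return {s for s, r, o in STATEMENTS if r == relationship and o == obj}


def contacts_in_area_by_industry(area, industries):
    """Contacts located in the area whose employer belongs to one of the industries."""
    results = set()
    for contact in subjects_of("is-located-in", area):
        for employer in objects_of(contact, "is-employed-by-company"):
            if objects_of(employer, "belongs-to-industry") & set(industries):
                results.add(contact)
    return sorted(results)


matches = contacts_in_area_by_industry("Austin area", {"Software"})
```

The query crosses three classes (metropolitan area, person, company) in one pass, which a filter on a single metadata property cannot do.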

Taxonomies and knowledge graphs

A growing use of ontologies is in knowledge graphs. Knowledge graphs extend the ontology+taxonomy knowledge organization system further by integrating instance data that is a set too large to fit into controlled vocabularies and tends to reside in databases or spreadsheet cells. This could be the tens of thousands of contacts in a CRM or products and product parts in a PIM (product information management) system. The knowledge graph brings, actually or virtually, the data from these different systems into a graph database. A graph database is structured as nodes and edges (connections between nodes), rather than as the tables of rows and columns characteristic of a relational database. Data entities are at the nodes, and the relations or property types are designated along the connecting edges. The graph structure thus supports the model of the applied ontology, which has classes and individuals at the nodes and semantic relations or attribute types describing the edges.
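As a simplified illustration of the difference between the two structures, the sketch below turns table rows into nodes and labeled edges. It uses plain Python sets rather than an actual graph database, and the rows, the column-to-relationship mapping, and the function names are hypothetical assumptions made for the example.

```python
# Hypothetical rows as they might be exported from a CRM table.
crm_rows = [
    {"contact": "Contact A", "company": "Company X", "city": "Austin"},
    {"contact": "Contact B", "company": "Company X", "city": "Round Rock"},
]

# Assumed mapping from column header to the relationship that labels the edge.
COLUMN_TO_RELATIONSHIP = {
    "company": "is-employed-by-company",
    "city": "is-located-in",
}


def rows_to_graph(rows):
    """Convert rows into a set of nodes and a set of (source, relationship, target) edges."""
    nodes = set()
    edges = set()
    for row in rows:
        source = row["contact"]
        nodes.add(source)
        for column, relationship in COLUMN_TO_RELATIONSHIP.items():
            target = row[column]
            nodes.add(target)                      # a repeated cell value becomes one shared node
            edges.add((source, relationship, target))
    return nodes, edges


def neighbors(edges, node):
    """All nodes directly connected to the given node, in either direction."""
    linked = set()
    for source, _, target in edges:
        if source == node:
            linked.add(target)
        elif target == node:
            linked.add(source)
    return linked


nodes, edges = rows_to_graph(crm_rows)
colleagues = neighbors(edges, "Company X")
```

Because “Company X” appears once as a node instead of once per row, everything connected to it can be found by following edges rather than by joining tables.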

Why knowledge graphs? Taxonomies, controlled vocabularies, and metadata alone are good for finding information in a single content/data repository, database, or content management system. But often the same, similar, or related information exists in multiple different sources or systems, as data or as content “silos,” such as product information residing in the PIM, the web ecommerce platform, the marketing content management system, and the sales management system. By extracting the data from these different sources and storing it in a single graph database, the connections between the data from all sources can be made.

Knowledge graphs link data that is in different repositories and systems, both structured and unstructured, and as such provide a unified view of the data. Furthermore, with taxonomy terms additionally tagged to content, relevant data and content can be linked to each other.

Opportunities for taxonomies and data together

In conclusion, taxonomies alone are focused on content, but if you combine taxonomies with ontologies and/or diverse metadata, you extend the use of taxonomies to data. I am also seeing the connections of taxonomies and data in more places.

My current job title is Data and Knowledge Engineer, which reflects the combination of the knowledge management and data science realms. Actually, I am not a data engineer at all, but my department at Semantic Web Company has standardized the job titles, as we knowledge engineers and data engineers work very closely together on the same teams. This is to provide combined services and solutions to our customers.

Data and taxonomy are combined in jobs in other ways, too. Last year I had a contract taxonomy job that was heavily into data (managed in spreadsheets). In the other direction, data-related job postings have taxonomies in their job descriptions. A search today on “taxonomy” in descriptions of LinkedIn jobs brought up Data Governance Consultant, Data Analyst II - Taxonomy, Taxonomy Data Architect, Data Custodian, Data Governance Lead in the top 25 results, and on Indeed it brought up Data Analyst, Junior Data Analyst, Data Annotator, and Data Entry Specialist in the top 15 results.

I have most keenly noticed this combination of taxonomies and data by participating in more data-related conferences recently. In 2021, among other conferences, I have spoken on taxonomies at Data-Centric Architecture Forum in February, the European Data Conference on Reference Data and Semantics (ENDORSE) in March, the Knowledge Graph Conference in May, and Data Con LA in September. Others include my masterclass “Foundation for a Knowledge Graph: Taxonomy Design Best Practices” at the virtual Connected Data World conference on December 2, and a tutorial “Introduction to Taxonomies for Data Scientists” and presentation “The Future of Taxonomies – Linking Data to Knowledge” both at Data Day Texas in Austin, TX, in late spring 2022 (postponed from January 22, 2022).

Thursday, May 30, 2019

Knowledge Graphs and Ontologies

Schema DBpedia 2010 from Wikimedia Commons attributed to Charles Sturt University (Creative Commons license)

I’ve been hearing a lot about knowledge graphs recently. Corporate and academic implementations have been increasing in recent years, and now the taxonomy community is also taking an interest. Taxonomy software vendors are talking about knowledge graphs in webinars, blogs, and conferences, and knowledge graphs was on the list of suggested presentation proposal topics for this fall’s Taxonomy Boot Camp London conference.

Knowledge graph purposes and definitions

A knowledge graph is the organization and representation of a knowledge base as a graph, with a network of nodes and links, not as tables of rows and columns. As such, it is generally based on data in a graph database, rather than on a relational database, and graph databases are becoming more popular. A knowledge graph usually includes (but is not limited to) visualizations of data, such as of an output of graph analytics, a display of interconnected nodes and links, or a display of linked data in a “fact box.”

Knowledge graphs can serve various roles and provide many benefits. They support search, recommendation engines, e-commerce, and enterprise knowledge management. They can integrate knowledge, serve data governance, provide semantic enrichment to content, bring structured and unstructured data together, provide a unified view of varied unconnected data sources, provide a semantic layer on top of the metadata layer, improve search results beyond mere algorithms, and answer complex user queries instead of merely returning content on a specified topic. An example of a complex query, which can easily be handled by a knowledge graph linked to the right data, but would be very time-consuming if not impossible by traditional search and query methods would be: “Which of the top 10 scholarly journals (by most often cited), published in Europe in the past 3 years discuss knowledge graphs in the context of knowledge organization systems.”

Google Knowledge Graph example

Like “taxonomy” or “ontology,” the definition of “knowledge graph” is not clear or agreed upon. Knowledge graphs have different meanings from different perspectives, such as those of people with computer science vs. information management backgrounds. Sometimes a knowledge base, or at least a knowledge base that is represented as a graph, is considered the same as a knowledge graph. There was even a conference presentation, turned into an article, dedicated to this topic of defining knowledge graphs: “Towards a Definition of Knowledge Graphs,” by Lisa Ehrlinger and Wolfram Wöß, CEUR Workshop Proceedings.

A Google search with Wikipedia results at the top returns the article describing Google’s own “Knowledge Graph” (introduced in 2012 and displayed as fact boxes, as in the example screenshot here for Boston) and a “see also” link to “Knowledge graph” (lower case), which redirects to the Wikipedia article “Ontology (information science).”

Knowledge graphs and taxonomies, ontologies, and other knowledge organization systems

Knowledge graphs, like taxonomies, comprise things/nodes/concepts and relationships between them. Knowledge graphs may comprise multiple domains and thus contain multiple taxonomies, thesauri, ontologies, or other knowledge organization systems. Knowledge graphs can link together disparate sources of controlled vocabularies and data.

RDF Triple example

Knowledge graphs resemble ontologies (a kind of knowledge organization system that is based on taxonomies, but is more complex), but, despite what Wikipedia claims, they are not the same. Knowledge graphs and ontologies both are represented by nodes (things, concepts) and have customized semantic relationships between them. As they both can be visually represented in the same way of nodes and relationships, they may look the same in visualizations. They are both based on RDF (Resource Description Framework) triples (comprising subject-predicate-object), and are usually also based on the Semantic Web standard OWL. All nodes must have their own unique URIs. Specialized software tools are available to create knowledge graphs and ontologies.
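For readers who have not seen triples before, the following is a bare-bones sketch of the subject-predicate-object pattern using plain Python tuples instead of an RDF library. The `https://example.org/` namespace, the resource names, and the `match` helper are all hypothetical, and the sketch ignores parts of RDF such as literal values and serialization formats.

```python
EX = "https://example.org/"  # hypothetical namespace for the illustration


def uri(local_name):
    """Build a unique URI for a node or relationship in the example namespace."""
    return EX + local_name


# Each statement is one triple: (subject, predicate, object).
triples = [
    (uri("ContactA"), uri("isEmployedByCompany"), uri("CompanyX")),
    (uri("CompanyX"), uri("isLocatedIn"), uri("AustinArea")),
    (uri("ContactA"), uri("isLocatedIn"), uri("AustinArea")),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern, where None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


located = match(triples, predicate=uri("isLocatedIn"))
about_company = match(triples, subject=uri("CompanyX"))
```

The same triple structure can hold either the model-level statements of an ontology or the instance-level statements of a knowledge graph, which is one reason the two look alike in visualizations.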

Knowledge graphs can be considered ontologies and more. According to the authors, Ehrlinger and Wöß, “A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge.” A knowledge graph may comprise multiple domain ontologies, or an ontology and another vocabulary/knowledge organization system. A certain kind of very general ontology called an upper ontology or foundation ontology can also serve as the data model for a knowledge graph.
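What “applies a reasoner to derive new knowledge” can mean is easiest to see with a toy example. The sketch below is my own simplification and not the approach described by the quoted authors or the behavior of any actual OWL reasoner: it applies just two hand-written rules (inverse relationships and subclass membership) repeatedly until nothing new appears. The rule tables, the `is-a` relationship name, and the data are hypothetical.

```python
# Hypothetical model-level knowledge that the rules rely on.
INVERSE_OF = {"is-employed-by-company": "employs-employee"}
SUBCLASS_OF = {"Startup": "Company", "Company": "Organization"}


def infer(statements):
    """Return only the new statements derivable from the given ones."""
    known = set(statements)
    changed = True
    while changed:                       # repeat until no rule adds anything new
        changed = False
        for subject, relationship, obj in list(known):
            derived = set()
            if relationship in INVERSE_OF:
                # Rule 1: a relationship implies its inverse.
                derived.add((obj, INVERSE_OF[relationship], subject))
            if relationship == "is-a" and obj in SUBCLASS_OF:
                # Rule 2: a member of a class is also a member of its parent class.
                derived.add((subject, "is-a", SUBCLASS_OF[obj]))
            if not derived <= known:
                known |= derived
                changed = True
    return known - set(statements)


asserted = {
    ("Contact A", "is-employed-by-company", "Company X"),
    ("Company X", "is-a", "Startup"),
}
new_knowledge = infer(asserted)
```

None of the derived statements were entered by hand; they follow from combining the instance data with the model, which is the part that a plain ontology or a plain data set alone does not provide.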

Conferences including knowledge graphs

There are many conferences that now have sessions on knowledge graphs. I cannot explore all of them, but I have attended and will attend several conferences this year that include knowledge graphs in their programs. VOGIN-IP-lezing 2019 “Search and Findability,” at which I spoke in Amsterdam in March, had a session on a fashion retailer's knowledge graph and a 2-hour workshop, “Enterprise Knowledge Graphs.” Data Summit, which I attended earlier this month in Boston, had several sessions that mentioned knowledge graphs and one focused on the topic, “From Structured Text to Knowledge Graphs,” which treated knowledge graphs not as something new to be defined but rather as an accepted technology. I'm excited to be co-presenting (presenting the first part on taxonomies and ontologies) in a pre-conference full-day workshop, “Fast Track to Knowledge Graphs and Semantic AI,” at the SEMANTiCS conference in Karlsruhe, Germany, on September 9. Then I will be presenting “A Brief Introduction to Knowledge Graphs,” among other presentations, at Taxonomy Boot Camp London in October.