The Accidental Taxonomist: Artificial intelligence


Thursday, November 30, 2023

Generative AI at Taxonomy Boot Camp Conference

Generative AI and large language models (LLMs), the technology behind ChatGPT, have been topics of presentations, keynotes, and attendees’ conversations at all the varied conferences I had the good fortune to attend this year, including the Taxonomy Boot Camp conference held November 6-7 in Washington, DC. Taxonomy Boot Camp is the only conference dedicated to taxonomies.

Opening and Keynotes

Right from the beginning, in the opening welcome, the conference chair Stephanie Lemieux mentioned uses of ChatGPT for taxonomy creation, such as asking it: What is a category for the following list of terms?, What label for a concept might be better for scientists, or better for parents?, and What are alternative labels for a specific concept? It has become clear that generative AI is a tool to assist taxonomists with specific tasks of a project but is not appropriate for automating the entire creation of a taxonomy. Thus, the Taxonomy Boot Camp theme this year, “Humans in the Loop,” was quite apt for the new era of generative AI, even if not specific to it.

The Taxonomy Boot Camp opening keynote, “Ontologies in the New Age of AI” by Dean Allemang, was on this subject. Allemang is more of an ontologist than a taxonomist, hence the title, but he discussed both taxonomies and ontologies. He stated that generative AI “understands” why we need a taxonomy (even if managers do not). He explained that Schema.org has put RDF on many websites, which ChatGPT “reads.” Allemang has also found that ChatGPT performs perfectly with SPARQL, the query language for data in RDF, which includes taxonomies. He gave examples of query requests for ChatGPT, such as “Return all the claims we have by claim number, open date, and close date,” and “What is the total loss of each policy where loss is the sum of loss payment, loss reserve, expense, payment, and expense reserve amount?” Allemang advised taxonomists to identify uses for taxonomies that have not been fully delivered on and to use generative AI to deliver them. If people argue that generative AI does not understand their language, taxonomists should build in a link to the taxonomy that makes generative AI understand it.

On the second day, Taxonomy Boot Camp registrants attend the same shared keynote presentations with all of the KMWorld co-located conferences, and this year these mostly dealt with generative AI, including the opening keynote by Dion Hinchcliffe “Tech-Driven Enterprise Thrills & Chills: The Future of Work.”

Regular Sessions

In addition to being mentioned in various talks, generative AI was also the subject of a session, “ChatGPT, Taxonomist: Opportunities & Challenges in AI-Assisted Taxonomy Development,” which comprised two separate presentations.

In this session, Xia Lin presented “ChatGPT and Generative AI for Taxonomy Development,” in which he discussed the steps involved in using ChatGPT in two case studies. In one, a taxonomy for data analytics projects of a small business was developed by providing ChatGPT with the scope of the first level of the taxonomy and then asking ChatGPT to expand individual categories by adding subcategories and then to add definitions of terms and categories. The results were reviewed and revised by experts. But Lin did not stop there. He showed the results of asking ChatGPT to provide stakeholder interview questions around a category, and (for those more technically inclined) how to create a ChatGPT plug-in for various defined functions of taxonomy creation, using ChatGPT’s APIs.

Also in “ChatGPT and Generative AI for Taxonomy Development,” Marjorie Hlava and Heather Kotula jointly presented on issues with using ChatGPT, both to create taxonomies and in general. They explained the risks of bias, plagiarism, ethics, data quality, matching the generated taxonomy to the content, and the amplification of errors upon repeating a prompt. Regarding plagiarism, for example, if you ask ChatGPT to return a complete taxonomy on a subject domain, it may return a copyrighted taxonomy that cannot be reused without a license.

Generative AI also affected the topics of other presentations. For example, in the presentation “In Taxonomy We Trust: Building Buy-In for Taxonomy Projects,” Bonnie Griffin mentioned the importance of “continually re-introducing the value of taxonomy, as generative AI captures attention.” It was also the subject of a debate question in the somewhat humorous closing session, “Taxonomy Showdown—Point/Counterpoint With Taxonomy Experts.”

More on Taxonomies and AI

Of course, there is more to AI than just generative AI. Other sessions dealt with machine learning for auto-categorization. These included presentations by both Bob Kasenchak and Rachael Maddison in the session “Machine Learning Is Coming for Your Taxonomy” (link to Bob’s slides) and Wytze Vlietstra’s presentation of “Vision for Modular Taxonomy Product at Elsevier,” in which the program included “shared infrastructure supported by AI-based decision support tools.” In fact, AI was a theme of Taxonomy Boot Camp once before, in 2018. It is generative AI based on large language models that is new.

For more details on how this technology may be used for taxonomy development, see my blog post from this spring, “Taxonomies and ChatGPT.” To get another perspective on this conference, check out the recent blog post by Taxonomy Boot Camp speaker Mary Katherine Barnes, “Integrating AI: Insights from KMWorld 2023.”

Monday, May 29, 2023

Taxonomies and ChatGPT

ChatGPT, generative AI, and large language models (LLMs) are hot topics of interest in the fields of data, information, and knowledge management. LLMs dominated the keynote presentations and the networking conversations at the Knowledge Graph Conference in New York and were also discussed in presentations and panels of this conference and Data Summit in Boston, both of which I attended this month. The technology is relevant to taxonomies as well.

ChatGPT is the user interface application on top of GPT (Generative Pre-Trained Transformer), a publicly available LLM developed by OpenAI, which is now in version 4. ChatGPT is thus a form of generative AI, in that it generates answers. There are many other LLMs (neural network-based AI, trained with deep learning on very large volumes of text), including those which are proprietary, restricted, or for non-commercial research, but only some have generative AI user interfaces. Although we may think of generative AI as providing answers to questions, it can do a lot more, including tasks related to taxonomies.

Organizing terms into hierarchies

Building a taxonomy is a combination of top-down design (identifying the top concepts or facets) and bottom-up building (identifying specific concepts from content analysis). The top level of a taxonomy is designed to serve user needs and thus should be based on stakeholder interviews, surveys, and brainstorming workshops, which is not something ChatGPT can do. The bottom-up building of a taxonomy, based on terms extracted from content or on search log terms, may benefit from some AI involvement.

I have made a few test requests of ChatGPT for “Put the following list of terms into a hierarchical taxonomy…,” and the results are bulleted lists with indented narrower concepts. ChatGPT can also generate a taxonomy in machine-readable SKOS in a requested RDF serialization format, as Bob DuCharme explained in his May 20 blog post “Getting ChatGPT to turn a flat vocabulary list into a hierarchical taxonomy.”

As with card sorting exercises, you can specify the top categories/concepts (like a “closed card sort”), or you can let ChatGPT create the top categories (like an “open card sort”). In either case, results are better with context, of course, so you should also tell ChatGPT what the subject domain or context is. Asking for a hierarchical taxonomy sometimes results in a third level of hierarchy, and not just a single level of grouping. Near duplicates usually appear next to each other in the list, and the taxonomist can then decide if and how to merge them into a single concept.

It is particularly with long lists of terms that automated methods can save the taxonomist time. If a taxonomist comes up with terms based on manual content analysis, stakeholder interviews, or submitted lists from subject matter experts, the term lists tend not to be very long, and even the process of coming up with the terms tends to include some thoughts toward categorization at the same time. Longer term lists (such as several hundred terms) are derived from automated term extraction (using text analytics technologies) across a corpus of dozens or hundreds of documents and from search log reports. ChatGPT is practical for putting these long lists of terms into draft hierarchies. There are inevitably some taxonomic errors in the results, which should be obvious to any taxonomist. For example, I have seen duplicated terms on different levels of the hierarchy.
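With a draft of several hundred terms, a check for duplicated terms can be scripted rather than done by eye. The following is a minimal sketch in Python, assuming the reply is a hyphen-bulleted list indented two spaces per level; the sample reply is made up for illustration and is not actual ChatGPT output.

```python
from collections import defaultdict


def parse_indented_bullets(text, indent_width=2):
    """Return (depth, label) pairs from an indented bulleted list.

    Assumes space indentation of `indent_width` per level (tabs are not handled).
    """
    entries = []
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        if not stripped.strip():
            continue  # skip blank lines
        indent = len(line) - len(stripped)
        label = stripped.lstrip("-*• ").strip()  # drop the bullet marker
        entries.append((indent // indent_width, label))
    return entries


def duplicated_labels(entries):
    """Map each label occurring more than once to the depths where it occurs."""
    depths = defaultdict(list)
    for depth, label in entries:
        depths[label.casefold()].append(depth)
    return {label: found for label, found in depths.items() if len(found) > 1}


# Hypothetical draft hierarchy, for illustration only.
reply = """\
- Energy
  - Renewable energy
    - Solar power
    - Wind power
  - Solar power
- Transportation
  - Electric vehicles
"""

entries = parse_indented_bullets(reply)
print(duplicated_labels(entries))  # {'solar power': [2, 1]}
```

The script only flags candidates. Whether a repeated term is an error or a deliberate polyhierarchy is still a decision for the taxonomist.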

In both lists of extracted terms and search log lists, terms occur that are not suitable as concepts for a taxonomy, such as verbs and adjectives or vague words. ChatGPT understands grammatical rules, so my prompt also says “Include in the taxonomy only nouns and noun phrases and omit the other terms.”
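The pieces of such a prompt (the subject domain, an open or closed set of top categories, and the nouns-only instruction) can be assembled by a small helper so that repeated requests stay consistent. This is only a sketch: the function name and parameters are hypothetical, and apart from the two sentences quoted in this post, the prompt wording is an assumption rather than a tested recipe.

```python
def build_taxonomy_prompt(terms, domain, top_categories=None, nouns_only=True):
    """Assemble a prompt asking for a hierarchical taxonomy of `terms`.

    top_categories given -> like a "closed card sort";
    top_categories None  -> like an "open card sort".
    """
    parts = [
        "Put the following list of terms into a hierarchical taxonomy "
        f"for the subject domain of {domain}."
    ]
    if top_categories:
        parts.append(
            "Use only these top-level categories: " + "; ".join(top_categories) + "."
        )
    else:
        parts.append("Create suitable top-level categories yourself.")
    if nouns_only:
        parts.append(
            "Include in the taxonomy only nouns and noun phrases "
            "and omit the other terms."
        )
    parts.append("Terms:\n" + "\n".join(f"- {term}" for term in terms))
    return "\n".join(parts)


# Hypothetical extracted terms, including one that is not a noun phrase.
prompt = build_taxonomy_prompt(
    ["solar power", "wind turbines", "quickly", "electric vehicles"],
    domain="clean energy",
    top_categories=["Energy sources", "Transportation"],
)
print(prompt)
```

The resulting text can be pasted into ChatGPT or sent through whatever interface is in use; the helper itself makes no API calls.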

Generating alternative labels (“synonyms”) for concepts

Asking ChatGPT to “provide a list of synonyms for…” a given term can also be helpful for coming up with alternative labels for taxonomy concepts. Alternative labels should be customized for the context of the content and users, so alternative labels for a concept will vary from one taxonomy to another. An external source, such as ChatGPT, should not be relied upon as the only source for alternative labels, but merely as a supplemental source of suggestions to be considered.

Again, context can help and should be provided. I asked “Provide a list of synonyms for ‘healthcare’” and got 20 terms. But then when I asked “Provide a list of synonyms for health care, meaning the industry,” I received a slightly more focused list of 15 terms. Interestingly, the two-word variant “health care” was not on the list, so “synonyms” is understood by ChatGPT to mean different words with the same meaning and not orthographic variations. Nevertheless, even 15 terms are too many, and the taxonomist should select from the list of suggestions. It might be a good idea to then test search the suggested alternative labels in the content and system being used.
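Both points, orthographic variants and test searching, lend themselves to a quick script. The sketch below is an illustration under stated assumptions: the suggestion list and documents are invented, the function names are hypothetical, and a plain whole-phrase match stands in for whatever search the real content system provides.

```python
import re


def normalize(label):
    """Collapse case, spaces, and hyphens so spelling variants compare equal."""
    return re.sub(r"[\s\-]+", "", label.casefold())


def drop_orthographic_variants(preferred, suggestions):
    """Keep suggestions that differ from the preferred label, and from each
    other, by more than spacing, hyphenation, or case."""
    seen = {normalize(preferred)}
    kept = []
    for suggestion in suggestions:
        key = normalize(suggestion)
        if key not in seen:
            seen.add(key)
            kept.append(suggestion)
    return kept


def count_hits(label, documents):
    """Count documents containing the label as a whole phrase, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
    return sum(1 for doc in documents if pattern.search(doc))


# Hypothetical suggestions and content, for illustration only.
suggestions = ["health care", "medical care", "Health-care", "health services"]
documents = [
    "Rising costs in medical care affect everyone.",
    "Medical care providers met on Monday.",
    "Public health services were expanded.",
]

candidates = drop_orthographic_variants("healthcare", suggestions)
print(candidates)  # ['medical care', 'health services']
for label in candidates:
    print(label, count_hits(label, documents))
```

Variants such as “health care” are set aside here only to separate them from true synonyms; depending on how the search system handles spacing and hyphenation, they may still be worth adding as alternative labels.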

Although by strict definition a “synonym” is a single word with the same meaning as another word, ChatGPT provides acceptable synonyms for terms which are multi-word phrases, or synonymous multi-word phrases, such as “Chemical manufacturing and distribution” provided as a synonym for “chemical industry.”

Other taxonomy-related uses of ChatGPT

Getting help in designing an ontology (a more complex, yet high-level semantic model with defined classes of concepts, customized relationships, and attributes) is also possible with ChatGPT or other LLMs. Again, submitting the request multiple times with slight variations will yield multiple different responses for the ontologist to consider and select ideas from. Ontologies are not expressed in simple text, though, so the prompt should specify the output format, such as RDF Turtle (TTL). Dean Allemang, author of Semantic Web for the Working Ontologist, has recently written multiple articles (medium.com/@dallemang) on ChatGPT and ontologies/knowledge graphs.

ChatGPT can also be used for comparing lists of terms, data conversion, and basic coding, which may be useful for taxonomists who lack coding skills. It can convert taxonomy or ontology data from one data format to another (although taxonomy/ontology management software also imports/exports in multiple formats). Taxonomies and ontologies in their raw data format are most commonly expressed in the RDF (Resource Description Framework) data model, which has various serialization formats: RDF/XML, JSON, JSON-LD, Turtle (.ttl), etc., and ChatGPT can convert data from one to another. Data extraction can also be done with ChatGPT. For example, knowledge management professional Camille Mathieu recently shared in a LinkedIn post how she used ChatGPT to write a Python script to extract text & metadata from PDFs.
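To make the idea of a serialization concrete, here is a minimal sketch that writes a small hierarchy out as SKOS in Turtle using only string formatting. The namespace `http://example.org/taxonomy/`, the function names, and the sample hierarchy are assumptions for illustration; it emits only `skos:prefLabel` and `skos:broader`, with no concept scheme. For real conversion between serializations, an RDF library (rdflib is a common choice in Python) or the taxonomy management software would be the more reliable route.

```python
import re

BASE = "http://example.org/taxonomy/"  # hypothetical namespace for this sketch


def slug(label):
    """Turn a label into a URI-friendly local name, e.g. 'Solar power' -> 'solar-power'."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_skos_turtle(tree, lang="en"):
    """Serialize a nested dict {label: {narrower label: {...}}} as SKOS Turtle."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix ex: <{BASE}> .",
        "",
    ]

    def walk(node, parent=None):
        for label, children in node.items():
            subject = f"ex:{slug(label)}"
            lines.append(f"{subject} a skos:Concept ;")
            if parent is None:
                lines.append(f'    skos:prefLabel "{escape(label)}"@{lang} .')
            else:
                lines.append(f'    skos:prefLabel "{escape(label)}"@{lang} ;')
                lines.append(f"    skos:broader {parent} .")
            walk(children, subject)

    walk(tree)
    return "\n".join(lines) + "\n"


# Hypothetical three-level hierarchy.
tree = {"Energy": {"Renewable energy": {"Solar power": {}, "Wind power": {}}}}
turtle = to_skos_turtle(tree)
print(turtle)
```

Output like this is also a useful reference point when reviewing SKOS that ChatGPT generates: each concept should have one preferred label per language, and each narrower concept should point to its parent.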

Perhaps what is most intriguing as a future implementation of taxonomies and ChatGPT is to go in the other direction and have knowledge organization systems, such as taxonomies, support the creation and use of queries (also called “prompts”) for generative AI, to obtain better results. This requires some back-end development, though, and is not merely a matter of putting a taxonomy into a prompt. Since a taxonomy is created for a specific subject domain, the questions need to be confined to the domain of the taxonomy. Semantic Web Company has developed a simple publicly accessible demo, “PoolParty Meets Chat GPT,” whereby you can compare the results of questions in the subject area of ESG (Environmental, Social, and Governance) that are submitted directly to ChatGPT with those that are first filtered through an ESG taxonomy and knowledge graph (managed in PoolParty software), so that the questions are enriched before being sent to ChatGPT. The semantically enriched questions generate answers that have more detail, better accuracy, and even web links to definitions and other articles.
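The general idea of enriching a question before it reaches the LLM can be illustrated in a few lines. This sketch does not reproduce the PoolParty demo, whose internals are not described here; it is a deliberately simplified stand-in in which the tiny ESG taxonomy, the function names, and the naive substring matching are all assumptions, and a real implementation would need proper entity linking and a knowledge graph behind it.

```python
# Hypothetical two-concept ESG taxonomy, for illustration only.
TAXONOMY = {
    "carbon footprint": {
        "alt_labels": ["carbon emissions", "greenhouse gas emissions"],
        "broader": "Environmental impact",
        "definition": "Total greenhouse gases emitted directly or indirectly "
                      "by an organization.",
    },
    "board diversity": {
        "alt_labels": ["board composition"],
        "broader": "Corporate governance",
        "definition": "The mix of backgrounds and skills among a company's "
                      "directors.",
    },
}


def find_concepts(question, taxonomy):
    """Return preferred labels whose preferred or alternative label appears
    in the question (naive, case-insensitive substring match)."""
    text = question.casefold()
    found = []
    for pref_label, data in taxonomy.items():
        labels = [pref_label] + data["alt_labels"]
        if any(label.casefold() in text for label in labels):
            found.append(pref_label)
    return found


def enrich_question(question, taxonomy):
    """Prepend taxonomy context for matched concepts; leave other questions as-is."""
    concepts = find_concepts(question, taxonomy)
    if not concepts:
        return question
    context = []
    for pref_label in concepts:
        data = taxonomy[pref_label]
        alt = ", ".join(data["alt_labels"])
        context.append(
            f"- {pref_label} (also known as: {alt}; "
            f"broader concept: {data['broader']}): {data['definition']}"
        )
    return (
        "Use these definitions from our taxonomy when answering:\n"
        + "\n".join(context)
        + "\n\nQuestion: "
        + question
    )


print(enrich_question("How can we reduce our carbon emissions?", TAXONOMY))
```

Here the alternative label “carbon emissions” links the question to the preferred concept, which is the same job alternative labels do in search.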

Conclusions

While it’s arguable whether ChatGPT alone is a good way to obtain “facts,” there is no doubt that it is a good way to get suggestions and ideas. These suggestions can support the work of taxonomists and ontologists, and taxonomies and ontologies in turn can support the results of ChatGPT and other LLMs. Because there will be errors from ChatGPT, it should not be used to generate taxonomies by those who are not already knowledgeable about taxonomy requirements and best practices, nor should it be used as a substitute for the expertise of taxonomists.

I hope to experiment more with ChatGPT for taxonomies and share additional details in future blog posts.

Tuesday, November 13, 2018

Taxonomy Boot Camp, 2018: AI and Taxonomies

Artificial intelligence (AI) is not new, but it is becoming more ubiquitous, and its applications are growing within other specializations in information management, knowledge management, and content management, including taxonomies. Hence the theme for this year’s Taxonomy Boot Camp conference (November 5-6, 2018, Washington DC) was “Bridging Human Thinking and Machine Learning.”

This was the 14th Taxonomy Boot Camp conference and its 9th year in Washington, DC. Along with the newer Taxonomy Boot Camp London, it is the only conference dedicated to taxonomies. As usual, it was held along with several other co-located conferences of Information Today Inc., which overlap or are consecutive. The format, as in past years, involved an opening keynote, after which the conference breaks into two tracks of sessions the first day, one more basic and one more advanced, then on the second day a joint keynote with the KMWorld conference, and a single track for the rest of the second day. By a show of hands, it appeared that 75% of the Taxonomy Boot Camp attendees were first-timers, even more than before. There were 235 attendees, including speakers and sponsors.

While the conference has two tracks the first day, a more basic and a more advanced track, presentations on machine learning and AI were in both tracks. These included “Taxonomy & Machine Learning at the Knot,” “Sandwiches, Categories, Ethics & Machine Learning,” “Taxonomy Skills in the World of AI” (a panel), “Semantic AI: Fusing Machine Learning with Knowledge Graphs,” “Semantic Search Enrichment,” “Taxonomies and AI Chat Boxes,” “Taxonomy in the Age of Amazon Echo,” and “Applying Taxonomy Skills to Cognitive Computing” (a project involving IBM Watson data privacy research product of Thomson Reuters).

In “Semantic AI: Fusing Machine Learning with Knowledge Graphs,” presenter Andreas Blumauer of the Semantic Web Company said that increasingly companies are adopting knowledge graphs as their IT infrastructure, and leading players are trying to fuse knowledge graphs with machine learning. A knowledge graph has to be stored in a graph database. There are two types of graph database models: property graphs and RDF graphs. RDF graphs are more important for knowledge graphs.

Semantic AI core principles include the following.

• It’s about things not strings.

• It’s more than metadata: it describes the meaning of metadata as an additional, semantic layer.

• The knowledge graph establishes the semantic layer.

• Knowledge graphs can be seen as an input for machine learning.

• AI isn’t always good at understanding questions so a taxonomy/ontology is needed to support it.

• AI should be built upon data quality, data as a service, no black box, a hybrid approach, and structured data meeting text, aiming towards self-optimizing machines (a vision, as we are not there yet).

Use cases of knowledge graphs include a recommendation engine. A knowledge graph is the basis behind the recommendation engine providing content, taking into consideration users.

In “Taxonomy & Machine Learning at the Knot,” the presenters, from the web media company XO Group, began with a good introduction to machine learning. They first explained the problems it can solve (predicting behavior, automating tedious steps, and classifying) and noted that there are two types: supervised and unsupervised. Common applications include clustering, recommendations, and classification, and each of these can involve taxonomies. Specific implementation examples were provided.

As with last year, there was also a lot of talk of auto-categorization (automated or machine-aided indexing) across various sessions. Three were dedicated to the subject: “Driving Discovery: Combining Taxonomy & Textual AI at Sage” (a case study using Expert System auto-categorization), “Testing for Auto-tagging Success,” and “Classification Relevance at Associated Press.” AP has an automated rules-based classification system for Subjects, Geography, and Organizations. Rules-based auto-classification was chosen over the statistical method because it offers transparency and control; breaking news and low-frequency terms can be dealt with (there is no need for an existing training set); terms can be scoped and disambiguated better, such as incident-type terms (Violent crime) vs. issue terms (Domestic violence); and semantic rules ensure that a term is more than just a passing mention. Entity extraction with disambiguation rules is used for person names and publicly-traded companies.
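A toy example shows why rules offer transparency and how a rule can require more than a passing mention. The rules, trigger phrases, and threshold below are hypothetical and are not AP's; the sketch only assumes that each subject has a list of trigger phrases and a minimum number of mentions.

```python
import re

# Hypothetical rules for illustration; not AP's actual rules.
RULES = [
    {
        "subject": "Domestic violence",  # issue term
        "phrases": ["domestic violence", "domestic abuse"],
        "min_mentions": 2,
    },
    {
        "subject": "Violent crime",  # incident-type term
        "phrases": ["assault", "homicide", "armed robbery"],
        "min_mentions": 2,
    },
]


def count_mentions(text, phrases):
    """Count whole-phrase, case-insensitive occurrences of any phrase."""
    total = 0
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def classify(text, rules=RULES):
    """Return {subject: mention count} for every rule whose threshold is met.

    The count is returned so that each assignment can be explained.
    """
    assigned = {}
    for rule in rules:
        mentions = count_mentions(text, rule["phrases"])
        if mentions >= rule["min_mentions"]:
            assigned[rule["subject"]] = mentions
    return assigned


story = (
    "Police reported an armed robbery downtown on Tuesday. "
    "The assault left two people injured. "
    "A spokesperson for a domestic violence shelter nearby said staff were unharmed."
)
print(classify(story))  # {'Violent crime': 2}
```

The single mention of “domestic violence” does not meet the threshold, so the story is tagged only as Violent crime. Because no training set is involved, a rule for a newly emerging term can be added the moment the term appears.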

Knowledge graphs are getting more attention both here and at Taxonomy Boot Camp London. This was, of course, the main topic of Andreas Blumauer’s talk “Semantic AI: Fusing Machine Learning with Knowledge Graphs,” and Mike Doane, in the introduction of his talk on “Taxonomy in the Age of Amazon Echo,” said that the information industry analysis firm Gartner reports that knowledge graphs are on the rise and are discussed more than taxonomies. Gartner is tracking knowledge graphs instead of taxonomies and ontologies.

While the opening keynote did not focus on AI or machine learning, it was a presentation by a computational linguist, Deborah McGuinness, a professor of Computer, Cognitive, and Web Sciences at Rensselaer Polytechnic Institute. Among other things, she spoke of the data life cycle, whereby a computer-understandable specification of meaning (semantics) supports enhanced lifespan and impact of data. She went on to present specific ontology case examples.

Nearly all session slides are available to download, except the keynotes, without any login credentials at: http://www.taxonomybootcamp.com/2018/Presentations.aspx