The Accidental Taxonomist: Metadata

Showing posts with label Metadata. Show all posts

Monday, May 5, 2025

Taxonomies and Attribute Data

In the past (such as my 2021 blog post "Attributes in Taxonomies"), I have explained that “attributes” serve as filters to refine search results on content, results that have already been narrowed by a hierarchical taxonomy concept or category. As such, the attributes available for filtering can vary based on a taxonomy concept or category that had been selected. To the end user, high-level taxonomy facets and attributes both function similarly as filters, and the distinction between facets and attributes may not be apparent. If the distinction is not noticeable to end users, then then facets and attributes may be confused. It’s best to describe attributes for what they are, and not merely by what they can do. That’s that this blog post aims to do.

Attributes

Data is information in the form of specific values that are relevant to something such as an asset, object, product, person, event, or transaction. Since data is relevant to something else, we can refer to data as an “attribute “of something. When attributes are standardized and used in information/data management, then attributes are metadata. Metadata schema are structures to organize data.

Examples of attribute metadata are:

for people: birth date, gender, occupation, nationality, phone number
for products: brand, price, color, size, SKU number
for documents: title, author, publication date, language, word count, publication status, file type

Almost all metadata, both descriptive and administrative, are attributes of something. (Only structural metadata, that which is used to mark up text, would not be an attribute.) Attributes, as metadata, can serve various purposes, including identification, comparison, sorting, filtering, and finding something based on its attributes.

Attribute values may be of different types: text, numbers, dates, or yes/no (also called “Boolean”). As text strings, attribute values may be uncontrolled free text or terms from a controlled list.

Taxonomies

Taxonomies are structures of concepts, which are used primarily for tagging and retrieval of content, although there are secondary uses. The concepts include subjects and named entities. In all cases, the concepts are of controlled vocabularies. The structures may be primarily hierarchical or primarily faceted, although a combination, such as limited hierarchies within a facet, is also possible. The structure of the taxonomy provides context for tagging and supports interaction by users.

When a taxonomy is structured into facets, typically each facet serves also as a metadata property. A hierarchical topical taxonomy can also provide values for a metadata property. Taxonomies are structures to organize controlled vocabulary concepts.

Examples of taxonomy facets include:

Topics
Activities
Industries
Product/service types
Brand names
Companies
Organizations
Names of people
Types of people/Roles
Events/Occasions

Thus, the types of things that are facets are usually not the same types of things that are considered attributes.

Metadata schema are structures to organize data, whereas taxonomies are structures to organize controlled vocabulary concepts that can populate metadata properties.

Where Attributes and Taxonomies Overlap

Considering again the examples of different types of attributes for different things, there are some attributes that could be managed in a “taxonomy” instead of merely as “attributes”:

For people: Name
For products: Product type/category
For documents: Subject/topic

Technically, each of these characteristics is also an attribute, but it is usually more practical to manage them as taxonomies so that they can support the implemented benefits of a taxonomy, such as semantic tagging, searching (including type-ahead search suggest), and browsing.

Thus, when we talk about “attributes” in the context of taxonomies, we mean those characteristics of something that are better managed as attributes and not managed as taxonomies. The decision is one of knowledge modeling.

For example, to support the refinement of searches, a taxonomy of expert people for an organization may have the following taxonomy facets:

Name
Subject of expertise
Organizational unit
Location

Then in addition to the facets, the taxonomy may have the following attributes associated with each record of a person:

Job title
Academic degree
Email address
Phone number
URL of headshot image

This is selected data of interest, but not values that are used in initial search or browsing for finding and retrieving content. Attributes are metadata, and taxonomy facets are also metadata, but that does not mean that they are the same, because different metadata can have different functions or purposes.

Ontologies: Bridging Taxonomies and Attributes

When we enrich a taxonomy with features of an ontology, not only can we add semantic relationships, but we can also add attributes to taxonomy concepts. Usually, when taxonomists first learn about ontologies, they think primarily of the addition of customized relationships between concepts, and they might not be aware of the importance of the addition of attributes.

In ontologies, semantic relationships are formally called “object properties,” and attributes are called “datatype properties.” Both are equally important. Meanwhile, the feature of “classes” in an ontology typically corresponds to taxonomy concept schemes or facets.

To add attributes to a taxonomy, the best way to do it is through adding an ontology, which can be very simple and not even include semantic relationships. As the availability of different attributes may vary based on a hierarchy branch of concepts, this can be managed by creating classes, which are assigned to hierarchical branches, facets, or concept schemes. Then, attributes (datatype properties) are applied and used with concepts based on the class the concept belongs to.

Conclusion

The following table summarized the differences between taxonomy facets and attributes.

Taxonomy Facets	Attributes
Basic structure of many taxonomies	Additional data added to taxonomies
Controlled vocabularies	Controlled or uncontrolled terms, text, numbers, dates, Boolean options, etc.
Concepts as nouns or noun phrases	If text, any kind of text string
Top organizational level of a taxonomy	Values relevant to any taxonomy concept
Concept Schemes in SKOS, or Classes in an OWL ontology	Metadata on a concept, or datatype properties in an OWL ontology

Thursday, August 24, 2023

Taxonomies for Digital Asset Management (DAM)

Taxonomies, with their origin in thesauri and library subject heading systems, have traditionally been associated with the tagging and retrieving of text content. The management and retrieval of multimedia content (images, video, audio, or other graphics files), on the other hand, has traditionally been served by metadata schema, reflecting the various attributes of the content, including digital rights.
Metadata for text content has become increasingly important to make it “structured” and easier to manage. Meanwhile, taxonomies, with their richness in topical detail, hierarchical structure, and synonyms, have become increasingly important in making multimedia content, especially digital assets, easier to identify and retrieve.

However, the features and uses of taxonomies and descriptive metadata have somewhat converged, now that faceted taxonomies have become common. A facet is an aspect or attribute, by which the user may limit, filter, or refine a search or initiate a search selection. (Several of my past blog posts discuss facets, including "Customizing Taxonomy Facets.")

Why taxonomies for multimedia content and digital assets

There is considerable overlap between multimedia content and digital assets, although they are not identical. A digital asset is something that is created and stored in a digital form that has value. The word “asset” implies it has value. So, not everything that is in digital form is an asset. Creative works in digital form, whether by in-house producers or licensed, are considered digital assets. Multimedia content tends to have value, so it tends to be considered as digital assets. If it needs to be managed and made available for retrieval and reuse, it can probably be considered a digital asset. If it needs to be managed and made available for retrieval and reuse, then assigning metadata and taxonomy terms is probably important.

1. Growing volume of digital assets

The main reason to move beyond simple controlled lists of terms/values in metadata properties (such as Type, Location name, Location type, Event/Occasion, Person type, Season, etc.) and include relatively large topical taxonomies for digital assets is to provide the ability to better limit search results in large volumes of content. The number of digital assets owned or managed by organizations has grown immensely, as varied media sources have become more common, not just for brand content but also for marketing, instructional, and technical content. Limiting search results from only a few broad topic categories is often not sufficient, and too many digital assets are retrieved.

A taxonomy provides further granularity of subjects which a digital asset depicts or describes. A granular hierarchical taxonomy could provide the terms for a single metadata property, such as “Subject,” or there could detailed taxonomies in more than one metadata property, to also include “Activity,” “Product category,” or “Occasion,” depending on the use case.

2. Varied audience for digital assets and the use of synonyms

Another reason to use taxonomies for digital assets is to better suit a varied audience of users. While it is digital asset managers who rely on metadata to manage the digit assets, various other users need to find the same assets: product and brand managers, web content editors, art designers, partnership and licensing specialists, and perhaps even customers. Assets are most valuable when they have wider uses, but in order to be reused by different people and departments, a detailed taxonomy helps.

A taxonomy is not only more detailed than a list of a few categories, but it is also usually enriched with synonyms (also called alternative labels or variant terms). This way, different people who may describe the same thing by different names will find the same concept and its tagged content. For example, synonyms could be “Bridal” and “Wedding”; “Infant” and “Baby”; “Botanical” and “Plants”; “DIY” and “How to.” Internal users and external users often have different preferred names for things.

3. Connecting both text and multimedia content across the enterprise

Applying a taxonomy to tag digital assets can also allow digital assets to be retrieved along with other content, text content, in other content management systems (CSMs). This would require that the taxonomy be a centrally managed enterprise taxonomy, and not just a siloed taxonomy within a single DAM system, and that more than one system are connected to each other (such as through APIs or integrations) or that a dedicated front-end enterprise search application is linked to content in their source repositories.

While users often look only for digital assets that they know are located within a specific DAM system, other times users want to conduct a more exhaustive search on a subject. While most images and videos are expected to be in the DAM, along with some PDF files, other PDF files, presentations, and documents, and even some images and videos from other sources may be located in other systems. Taxonomies that can be linked to each other or a single master taxonomy managed centrally in a dedicated taxonomy management system, such as PoolParty, serving as "middleware," connected to the content in each of the systems, can enable comprehensive search and retrieval across the organization, especially if all the data is managed in a knowledge graph (explained in my last blog post "Knowledge Graphs and Taxonomies").

Tagging or keywording multimedia content and digital assets

Finally, there is the tagging component of taxonomies, which is often called keywording with respect to images. Digital asset managers must assign descriptive metadata to the assets they manage, which is not difficult if the controlled lists of available values are short. A taxonomy, however, may be large, so it can be a challenge to determine which subject terms to tag.

For text-only content, the technologies of text analytics, including entity extraction and natural language processing, can be applied to enable auto-tagging. Image, video, and audio content had previously been considered unsuitable for auto-tagging, and thus less suitable for large taxonomies, but this is no longer the case.

There are new technologies and methods to enable auto-tagging of digital assets. Audio-to-text technologies enable transcripts to be created from audio and video files, and these texts can automatically analyze and tagged. Improvements in image recognition technology can enable images to be auto-tagged for their subjects. Human review of auto-tagging is still recommended, but that’s easier than tagging from scratch.

Taxonomy is what powers DAM

DAM systems do support taxonomies, so you should not hold back from creating a suitable taxonomy for your DAM content. To learn more about creating taxonomies for digital assets, attend the session “Taxonomy is What Powers DAM” on September 14, 2023, at the HS Events DAM New York conference. I will join three other panelists to discuss taxonomies for digital asset management: what taxonomies are, how to develop a taxonomy, how to do research for a taxonomy, and how to manage a taxonomy, especially for DAM applications. Register with the code SPEAKER100 for $100 off.

Saturday, November 27, 2021

Attributes in Taxonomies

When I had done consulting for ecommerce taxonomy clients years ago, and they would refer to the taxonomy facets for products as “attributes,” I felt that might be confusing, because I considered “attributes” something else: a characteristic like metadata of a taxonomy term or a feature of an ontology. I have since come to realize that facets in some cases, especially in ecommerce, can be considered attributes, and they can even be managed in an ontology that is combined with the taxonomy.

Facets in a faceted taxonomy are various taxonomy term “types” that function as refinements or filters in the user interface for limiting search results on content that share similar types of terms or attributes. Users refine or filter their searches by selecting a term or value from each of two or more facets. In a periodical article research database offered by a library, facets might be subject, geographic place, named person, named organization, article type, publication name. Within an enterprise intranet of enterprise content management system facets might be topic, department, office location, and document type. In a health information service, facets might be symptom, body part, patient age, and patient gender. In a corporate knowledge base for customer service, facets might be product type, brand name, market, region, issue type, and customer type. In most of these cases, a topical taxonomy is one of the facets, even if that topical taxonomy is hierarchical. The primary taxonomy design challenge in such cases is deciding what kind of information would be useful if separated out in its own facet, and what can remain in the generic topics facet. Using the SKOS (Simple Knowledge Organization System) model, each concept scheme serves as a facet.

In a product, ecommerce or marketplace taxonomy, the hierarchical taxonomy of product types is not one of the facets. This large, hierarchical taxonomy is typically the starting point for user browsing. While not constituting a facet, this hierarchy is linked to the facets. The user navigates or drills down through a hierarchical tree of product categories, until a specific product type is found, and then the facets (attributes) relevant to that product type are made available to the user, allowing the user to refine the search further, by selecting from each of several attributes, such as size, color, material, price, and features. This requires a different approach to the taxonomy design than for the faceted taxonomies described above, and thus these post-hierarchical-browse refinements are better known as and more appropriately called attributes.

Attributes can serve as refinements/filters in taxonomies for purposes other than ecommerce, such as job board taxonomies (attributes for job location, skill level, salary range, job type, employer/company, industry, date posted, etc.), an internal enterprise expert-finder (attributes for job title, department, office/work location, skills, subject areas of interest, academic degree, languages, etc.) or taxonomies of movies (attributes for genre, production company, production country, language, award winner type, release date, etc.)

The attributes generally pertain to specific named entities, such as the name of a product offered by a specific seller, the name of a job title at a specific employer, or the title of a movie. There can be attributes for more than one kind of named entity in the same set of taxonomies, such as for employer name in addition to job title in a job board taxonomy. Subjects, which are not named entities, usually do not have attributes, but some do, especially, in the fields of science and medicine, where they would be attributes on the names of species, chemicals, viruses, diseases, etc. I will discuss named entities in more detail in my next blog post.

Issues to consider in creating attributes

In a taxonomy where attributes are important, there are various issues to consider. First, shall there be a hierarchical topical taxonomy presented as an initial browse interface to the users? While this is typical for product/ecommerce taxonomies, it is not usually the case for job board taxonomies nor for a taxonomy for movies. However, it may be desired for a taxonomy of nonfiction books or periodicals, which users more often would categorize my subject. A producer or publisher of educational content will likely have a hierarchical taxonomy of disciplines or subject areas. A research-focused organization would also likely have a hierarchical subject taxonomy in addition to the facet-attributes dealing with location, type, funding source/sponsor, researcher name, etc. Having a hierarchical taxonomy outside of the attributes tends to be a user experience design decision, but it has an important impact on how the overall taxonomy is designed and managed.

More attributes may be created than usable for filtering/refining results. For example, products will likely have SKU numbers among their attributes, which can be displayed and perhaps even made searchable, but would not be one of the filtering-facets presented in the user interface. In a taxonomy for finding internal experts, contact information, such as an email address and phone number, would be attributes on each person’s profile, but these would not be searchable fields. Rather, they would display when the person profile is selected. Thus, another issue when creating attributes is determining which will display and function as filters/refinements and which will display only as additional metadata on a selected item.

If an initial hierarchical topical taxonomy is presented to the users, there arises the question of at what point in the hierarchy should the hierarchy of categories should end and further details should be treated instead with attributes? This question often comes up when designing ecommerce taxonomies. For example, to distinguish gas and electric stoves, should each of these types be a narrower term of stoves, or should energy source be an attribute of stoves?

Another issue to resolve is determining which attributes should be generic across all categories, and which should be category specific. For example, on which product categories in an ecommerce taxonomy is it appropriate to have an attribute for gender (as for clothing or a gift for a woman or for a man)? Related to that is the question of which categories should have their own unique attributes. Are some attributes relevant to major (the broadest) categories, and other attributes relevant only the most specific categories, and yet other attributes apply to various miscellaneous products? For example, color might be relevant for products in different parts of the hierarchy.

Attributes should be managed as belonging to different types based on their values, such as whether they are of controlled vocabularies, dates, currency, numbers, or a simply a Boolean yes/no, such as Remote being an attribute of jobs. If a hierarchical taxonomy resides outside of the attributes, then controlled vocabulary attributes are an additional part of a larger set of taxonomies/controlled vocabularies. How this is managed varies based on the taxonomy/ontology management tool. For example, such term lists might need to be managed as separate concept schemes in a SKOS taxonomy, even though they are used in ontology-based attributes. It can start getting complicated when an attribute type has different values for different categories in the same hierarchical taxonomy implementation. For example, the attribute of Material could have different values for tables than for clothing, and both categories are offered by the same ecommerce seller.

Attributes add another level or layer of expressivity to a taxonomy or set of controlled vocabularies, which brings it closer to an ontology. The distinction between taxonomy and ontology is not necessarily clear. It’s fine to have just some ontology-like features, such as attributes, but it is recommended to use a taxonomy/ontology management tool, such as PoolParty, which manages taxonomies and ontologies (and anything in between) according to Semantic Web/World Wide Web Consortium (W3C) standards.

Saturday, October 30, 2021

Taxonomies for Data

Coming from an editorial content background, I have always valued taxonomies for making content findable, but more recently I have come to appreciate how taxonomies can also play a role in making data accessible and useful.

Taxonomies have successfully aided people in finding and retrieving desired content since the 1990s and even decades earlier, if we consider thesauri within the scope of taxonomies. Nevertheless, the focus had always been on content: originally printed content such as periodical articles, web pages, intranet or CMS pages and attached documents, etc., and then multimedia content, such as images, animation or video clips, audio files. Each content item gets tagged with taxonomy terms of different types for what is about and for kind of content it is. Taxonomies have become increasingly important as content volume and types have grown, especially as more people in varied roles create content.

Meanwhile, data has grown even faster in its volume and potential value. We are hearing more and more about big data, data warehouses, data lakes, data fabrics, data catalogs, data analytics, master data management, data governance, FAIR data, data-centric architecture, data-driven enterprises, and data science in general. Tools and technologies to make use of the data have included programming/scripting, machine learning, algorithms, natural language processing, and other forms of artificial intelligence.

These tools and technologies for data do not replace taxonomies and other controlled vocabularies, though, which still have an important role to play in connecting people to the desired data and information, and ultimately knowledge. I see two ways in which taxonomies are linked to data:

1. Managing and understanding the data in a standardized way with better metadata, which depends on controlled vocabularies.

2. Connecting the data with graph databases, knowledge graphs, ontologies, and ultimately taxonomies.

Taxonomies and metadata

Metadata refers to the standardized data types, properties, fields, or elements, and the specific individual values that populate those types or properties. From a content perspective, we think of metadata as serving content management and retrieval, such as the content’s format type, title, source, creator, date, language, subjects, category, audience, etc. But metadata exists in databases and spreadsheets, too, where column headers are the metadata properties. For example, contact metadata would include name, phone number, email address, city/state, country, contact type, initial contact date, contact owner, etc. Product metadata would include SKU number, product name, product type/category, price, color, features, supply source, retail availability, etc. Transactional metadata would include purchased product name, purchaser, purchase date, purchase price, purchase location.

Data can be better managed and analyzed if the metadata properties and values are standardized and controlled. Controlled vocabularies should be used to standardize the metadata for many of the properties: format type source, subjects, category, purpose, country, contact type, product name, product category, color, features, availability, etc. Hierarchical taxonomies serve some of this metadata, such as product categories.

As an example, I’m planning to attend a conference in Austin, TX, and I wanted to look up contacts in the Austin area in my CRM (customer relationship management) system. Filtering results by city, I found some with the city of Austin, but others had the city of Round Rock. Filtering on Austin, I would have missed those, had I not known that Round Rock was a suburb of Austin. What was needed was a metadata property for “Metropolitan area,” rather than “City,” a controlled list of metropolitan areas, and Round Rock as an alternative label for Austin area in that controlled vocabulary.

Taxonomies and ontologies

Taxonomies, controlled vocabularies, and metadata alone are good for filtering or queries to find content that meets a set of criteria (based on metadata properties or faceted taxonomy selections). But what if you want to discover and explore relationships across the data? Instead of merely looking for all the contacts in the Austin area that have the customer or sales-qualified-lead status and have a contact owner, I want to limit that further to contacts whose employers in turn meet certain criteria, such as belonging to specific industries or meeting an annual revenue minimum. Another query example would be to find the locations in the past 10 years of industry events in which a specific organization has participated. These connections across different metadata types, vocabularies, or categories, are made with an ontology.

An ontology has, besides any hierarchical relationships characteristic of a taxonomy, additional semantic relationships that connect across types or classes of entities. Classes may be for metropolitan area, company name, person name, industry event name, etc. Semantic relationships across these classes may include is-employed-by-company/employs-employee, sponsors-event/has-sponsor, is-located-in/is-location-of. Attributes are additional metadata for the entities of each class, such as address. “Ontology” typically refers to just the knowledge model of classes, relationships and attribute types. But to become useful in information retrieval and data analysis, an ontology is connected to a taxonomy or other controlled vocabulary to extend those semantic relationships and attributes to all the concepts/terms.

Taxonomies and knowledge graphs

A growing use of ontologies is in knowledge graphs. Knowledge graphs extend the ontology+taxonomy knowledge organization system further by integrating instance data that is a of set too large to fit into controlled vocabularies and tends to reside in databases or spreadsheet cells. This could be the 10,000s of contacts in a CRM or products and product parts in a PIM (product information management) system. The knowledge graph brings, actually or virtually, the data from these different systems into a graph database. A graph database is structured of nodes and edges (connections between nodes), rather than of tables of rows and columns characteristic of a relational database. Data entities are at the nodes and connections of relations or property types are designated along the connecting edges. The graph structure thus supports the model of the applied ontology, which has classes and individuals at the nodes and semantic relations or attribute types describing the edges.

Why knowledge graphs? Taxonomies, controlled vocabularies, and metadata alone are good for finding information in a single content/data repository, database, or content management system. But often the same, similar, or related information exists in multiple different sources or systems, as data or as content “silos,” such as product information residing in the PIM, the web ecommerce platform, the marketing content management system, and the sales management system. By extracting the data from these different sources and storing it in a single graph database, the connections between the data from all sources can be made.

Knowledge graphs link data that is in different repositories and systems, both structured and unstructured data and as such provide a unified view of the data. Furthermore, with taxonomies tagged additionally to content, relevant data and content and be linked to each other.

Opportunities for taxonomies and data together

In conclusion, taxonomies alone are focused on content, but if you combine taxonomies with ontologies and/or diverse metadata, you extend the use of taxonomies to data. I am also seeing the connections of taxonomies and data in more places.

My current job title is Data and Knowledge Engineer, which reflects the combination of the knowledge management and data science realms. Actually, I am not a data engineer at all, but my department at Semantic Web Company has standardized the job titles, as we knowledge engineers and data engineers work very closely together on the same teams. This is to provide combined services and solutions to our customers.

In other ways data and taxonomy are combined in jobs. Last year I had a contract taxonomy job that was heavily into data (managed in spreadsheets). In the other direction, data related job postings have taxonomies in their job descriptions. A search today on “taxonomy” in descriptions of LinkedIn jobs brought up Data Governance Consultant, Data Analyst II - Taxonomy, Taxonomy Data Architect, Data Custodian, Data Governance Lead in the top 25 results, and on Indeed it brought up Data Analyst, Junior Data Analyst, Data Annotator, and Data Entry Specialist in the top 15 results.

I have most keenly noticed this combination of taxonomies and data by participating in more data-related conferences recently. In 2021, among other conferences, I have spoken on taxonomies at Data-Centric Architecture Forum in February, the European Data Conference on Reference Data and Semantics (ENDORSE) in March, the Knowledge Graph Conference in May, and Data Con LA in September. Others include my masterclass “Foundation for a Knowledge Graph: Taxonomy Design Best Practices” at the virtual Connected Data World conference on December 2, and a tutorial “Introduction to Taxonomies for Data Scientists” and presentation “The Future of Taxonomies – Linking Data to Knowledge” both at Data Day Texas in Austin, TX, in late spring 2022 (postponed from January 22, 2022).

Tuesday, August 29, 2017

Taxonomies in SharePoint

Controlled vocabulary metadata, including hierarchical taxonomies, has been supported in SharePoint since its 2010 version, and its use and features have been enhanced is succeeding versions of SharePoint. While it’s not technically difficult for users to create taxonomies and apply their terms to content items in SharePoint, developing a metadata/taxonomy design and application strategy is definitely a challenge.

The distinction and overlap between metadata and taxonomies was the topic of my previous blog post, "Metadata and Taxonomies," and it is very relevant to SharePoint. Controlled vocabularies or taxonomies used to tag content are referred to in SharePoint as “managed metadata.” This designation is indeed accurate and fitting. Some, but not all, metadata is in taxonomy form (hierarchical structures), and in SharePoint it is managed/controlled in a central way, where permissions on who can change or add to the metadata terms may be limited to a smaller set of users than those who may tag content with the metadata. “Managed Metadata” is something you will hear about in documentation, but in the SharePoint application itself, what you want to work with its “Term store management,” grouped with other Site Administration settings under “Site Settings” (under the gear symbol in Office 365 SharePoint). Terms are grouped into “Term Sets” (top-term hierarchies or facets).

Questions to consider in taxonomy design in SharePoint include:

To what extent will document libraries (virtual folders) be used to categorize content within a site, and would proposed subfolder names be better suited as metadata terms for tagging?
Will the primary use be for filtering lists of documents in place, within an open document library, based on metadata selected for the various “columns,” or will the primary use be for refining search results, based on metadata selected in the left-hand margin refinement panel after executing a search?
How many Term Sets should be created and how many and which metadata fields in total should display to the users, either in columns or as search refinements.
When should a Term Set be a flat list and when should it be created as a hierarchy, and how deep should the hierarchy be?

Use of document libraries vs. metadata tagging

SharePoint supports the creation of a hierarchy of nested folders within libraries within sites. So, it may be tempting to start of creating such a “taxonomy” of categories for content, especially if migrating content over from a shared drive where such folders had been used. However, tagged metadata has many advantages over categories of folders for finding and retrieving content.

A content item may be tagged with more than one term from the same Term Set if it deals with more than one topic or if it falls into more than one category type, whereas putting a document in more than one folder can lead to version control issues. (It’s true that you can put a document in one folder and a link to it from within another folder, but this is not easily remembered to do, nor does it look as “clean.”)

You can create Term Sets (as facets) each for multiple ways to categorize, such as by document type, function, audience, topic, etc., serving as facets, and then tag a content item with terms from each, whereas folders don’t deal well with mixed methods of categorization, and you are forced to choose one method of categorization.

Tagged metadata allows you to filter a large set of content in place to quickly narrow results to what you want, whereas folders require clicking down through multiple paths, taking more time to find desired content, which is in different places.
Tagged metadata can also be implemented as search refinement filters, also known as faceted search.
Tagged metadata terms can have synonyms, helping users find what they want by different names, whereas folder names cannot have synonyms.

Thus, what had been labels for folders on a shared drive should most likely be changed to terms in taxonomy Term Set. Whether you should have any document libraries, or just a few without subfolders, depends on the preferences of your users, but I don’t recommend the creation of subfolders.

Filtering on columns vs. refining searches

The same Term Sets may be used for both column filters and search refinements. But typically, the implementation of managed metadata in SharePoint is either primarily for one or the other purpose, and the other use may not even be set up. Generally, if the managed metadata is going to be applied to documents within a single library or one site, on the order of tens or hundreds, then column filters are desired; if the managed metadata is going to be applied across multiple sites on thousands of documents, then search refinements would be used.

If unsure whether to promote filtering on columns or refinements on search, consider that filtering columns will always get more accurate results, but metadata has to be consistently applied. Out-of-the-box search in SharePoint will retrieve documents with the word or phrase anywhere in the document. The idea behind this is to get search set that is larger than needed and not miss anything, and then the user can refine the search result with the various refinements. So, the results of search are not as accurate, but there will be results, even if metadata tagging is incomplete.

Knowing how the Term Sets will be used can have an impact on the wording of terms and the extent of use of hierarchy. Both columns and refiners have limited width for term names to display. The user can easily adjust column widths to accommodate long names, but the refinement panel width cannot be widened by the user. The use of columns also makes it desirable to keep terms to a limited length within a given Term Set. Refiners indicate hierarchies of terms by default to the user who is searching content, whereas columns do not indicate any hierarchy in the default view.

Number of terms sets and metadata fields

There is no point in creating a Term Set if it’s not going to be displayed to the users for filtering or refining, and too many metadata fields take up horizontal or vertical space, are a burden to tag, and make the user experience of searching or filtering too complicated. So, you need to consider what would be truly useful, and not merely possibly nice to have. Just two Term Sets, such as Document Type and Topic, may be sufficient. In addition to the managed metadata that you create, there will be other metadata fields desired for filtering or refining, such as date, author, and format type, and perhaps uncontrolled keyword tags applied by users. In the case of columns, there will always be the document title taking up a column and considerable horizontal space as well.

The default columns in SharePoint are “Type” (file format) “Name” (filename), “Modified” (the date the file was uploaded or the last time any of its properties were updated, not the date the file itself was modified) and “Modified By” (the person who uploaded the file or last updated its properties, but not necessarily who actually modified the file). The default search refinements are the same, excluding title: “Result type” (file format), “Author” (an even worse misnomer for Modified by”), and “Modified date” (often displayed in a graph form). If you believe such information is not that valuable, you can remove these columns/refinements, especially when you plan to add other columns/refinements, which will take up horizontal or vertical space.

I would recommend no more than 4-7 total metadata fields, including those that are not based on managed metadata. You should avoid having more metadata as columns, along with the document titles, than can fully display horizontally, so as not to require horizontal scrolling. Search refinements, on the other hand, by default display sample high-use terms under each refiner, so typically no more than three refiners display in the left margin without vertical scrolling. Vertical scrolling is expected and acceptable to a limited degree.

Term Sets as flat lists or hierarchical taxonomies

SharePoint makes it easy to create hierarchies within Term Sets by simply right-clicking on a selected term and selecting “Create Term” from the context menu. Some people might thing that since the Term Store is for taxonomies, and taxonomies are hierarchical, hierarchies should be created if applicable. However, hierarchies are only helpful for navigating the taxonomy if the taxonomy is sufficiently large. If you set up multiple Term Sets, each used as a facet in combination with others, they each don’t need to be very large. Furthermore, the types of content most people store in SharePoint tends not to need extensively large and detailed taxonomies as might be needed in a content management system or digital asset management system

My rule of thumb is up to 12 terms should be on the same level before considering creating any hierarchy, but it could go up to 20 or so, and even more if the list of named entities/proper nouns. Also, if you do have hierarchies, consider keeping them relatively shallow, such as to only two levels, instead of three. Even if a hierarchy is technically correct, it does not mean you have to set it up that way.

If you need only a short flat list of terms, you might consider not using the Term Store at all, but rather create the list as "Choice" type of column. This is easier to implement, but the terms would limited in their use to filtering and sorting the column, and could not also be applied to search and navigation.

Monday, July 17, 2017

Metadata and Taxonomies

Metadata and taxonomies are related. In The Accidental Taxonomist, 2^nd edition (pp. 15-18), I explain that most, but not all, taxonomies (not purely navigational taxonomies) serve to populate terms/values in metadata fields/elements; and some, but definitely not all, metadata fields are populated by terms/values from controlled vocabularies or, more specifically, taxonomies (in contrast to free text or key words).

The question remains whether to start with creating the overall metadata strategy and schema and then build taxonomies as part of it as needed, or to start with creating a taxonomy and then, in the process, identify the various descriptive metadata. Ideally the two are developed for implementation combination, as part of an integrated strategy. However, an expert in taxonomy development (a taxonomist) and an expert in metadata design (a metadata architect) are usually not the same person.

A metadata architect can become an accidental taxonomist, and a taxonomist can become an accidental metadata architect, or the two experts can work together on the same project, although it is not so common for an organization to have both such experts on staff. Whether an organization has a metadata architect or taxonomist depends on the nature of the organization’s content and content organization needs.

Organizations that start with the metadata expertise and approach to information management tend to be those with significant needs in digital asset management (with image or other media collections), records management (in highly regulated industries), publishing, or cultural preservation (museums or libraries). Organizations that start with the taxonomy expertise and approach include product or service providers, distributors and retailers (especially in ecommerce), and organizations focused on providing information resources.

A hierarchical taxonomy can be integrated with metadata, when one of the metadata fields is for “Topic” or “Subject,” and there is a hierarchical taxonomy of subject terms associated with that field. However, it is the faceted type of taxonomy in particular that unites the tasks of taxonomy creation and metadata design.

Faceted Taxonomies and Metadata

A faceted taxonomy comprises a set of facets, each an individual controlled vocabulary, whose terms are generally not linked/related to terms in the other controlled vocabulary facets, but the combination of terms from a combination of facets are used to tag the same set of content, and users search/filter on terms in combination from various facets. Examples of facets may be Product/Service, Market Segment, Location, Document Type, Supplier, etc. A faceted taxonomy is a common type for both enterprise taxonomies and ecommerce or product review taxonomies, and it’s a type of taxonomy that taxonomists are familiar with creating. It’s called a “taxonomy” even though it differs from the classical hierarchical “tree” type of taxonomy, because it involves controlled vocabulary and classification. The name for each facet and the terms within the facet constitutes a simply two-level hierarchy.

Each facet is also a metadata field/element. The taxonomist designing a faceted taxonomy is thus also designing metadata, at least some of it. There are usually more metadata fields to describe the content beyond those which comprise the taxonomy facets. For a faceted taxonomy to best serve the user who is trying to find/discover content based on what it is and what it is about, the number of facets should be limited. (See my earlier post "How Many Facets.") Metadata, however, can serve additional purposes beyond helping users find content. Metadata may describe content for purposes of full identification, source citation, or information on how the content can be used, including rights data. The taxonomist or metadata architect needs to decide which metadata fields will constitute a displayed faceted taxonomy for the end-user to utilize in search/discovery, and which metadata fields will not but will rather display on a selected content record.

On the other hand, there may be additional metadata fields beyond the scope and definition of “taxonomy” that are nevertheless made available to the end-user to filter/refine results alongside the other, taxonomy facets. These could be for author/creator, date, title keyword, text keyword, file format, etc. Sometimes the distinction between taxonomy facet and other metadata in this case is not so clear, such as for Document/Content Type, Audience, or Language, when these fields utilize controlled vocabularies. Due to this overlap and blurred distinction between taxonomy facets and displayed metadata for filtering, it is a good idea to design the taxonomy and metadata together as an integrated strategy.

Sunday, June 18, 2017

Standards for Taxonomies

Since “taxonomies” are rather loosely defined, standards specifically for taxonomies do not exist, but there are standards that are relevant to taxonomies. A taxonomy is a kind of controlled vocabulary, and there are standards for controlled vocabularies. There are also standards specifically for thesauri, a kind of controlled vocabulary with which taxonomies typically share many features.

Standards serve various purposes. Two leading purposes for standards are:

To ensure consistency and ease of use across different products or systems used by different users.
To ensure interoperability, the sharing or exchange of products/services/information.

Standards for Consistency

Standards aimed at ensuring consistency and ease of use would include buttons on devices, menus in user interfaces, pedals in cars. With such standards, users can expect the same experience from manufacturers or service providers and thus they are able to easily use products or systems from different manufacturers/providers/vendors. In the case of information systems, this kind of standard includes those for the design and style of book indexes and thesauri. These “standards” tend to be guidelines, recommendations, or accepted conventions, and not exactly strict standards, even if issued from a standards body. For thesauri, the “standard” is issued by the NationalInformation Standards Institute (NISO), but it is called a "guideline”: ANSI/NISOZ.39.19 Guidelines for the Construction of Monolingual Controlled Vocabularies. The corresponding ISO standard is ISO 25964 Part 1: Thesauri for Information Retrieval.

These guidelines cover style and form of terms, circumstances for creating the various kinds of relationships between terms, use of notes on terms, etc. They are all about how to create well-formed thesauri with consistent design features that are then easy and intuitive to use. For example, when a user sees that two terms are in a hierarchical relationship, the user understands that the narrower term is a kind of, instance of, or integral part of the broader term, and not merely an aspect of or some other related concept of the broader term. In fact, the end-user of a thesaurus does not even need to know and understand thesaurus principles to be able to make use of a thesaurus to find desired concepts and content.

Standards for Interoperability

The other kind of standards, those aimed at ensuring interoperability, would include standards for size and units of measure, data exchange, and communications protocols. Interoperability standards are important for those controlled vocabularies which are intended to be shared or reused. Thus, the content to which controlled vocabularies link can be accessed by third parties or made publicly accessible over the Web. Controlled vocabularies may be “reused”, if the original creator of a controlled vocabulary decides to license the vocabulary (without linked content) to other publishers to use on their own content, so that the second publisher does not have to reinvent a controlled vocabulary that already exists in same subject area.

Interoperability standards for controlled vocabularies include ZThes (a thesaurus schema for XML, which is has since gone out of style), World Wide Web Consortium (W3C) specifications for the Semantic Web including SKOS (Simple KnowledgeOrganization System) and the Web Ontology Language (OWL) for ontologies, and ISO 25964 Part 2: Interoperability with other vocabularies. Indeed, ISO 25964 covers consistency standards in its first part and interoperability standards in its second part.

Metadata Schema

Since taxonomies or other controlled vocabularies may be used to provide terms that fill a certain metadata element/property/field within a larger set of metadata, the use of a standard metadata schema or model is yet another way in which interoperable standards involve taxonomies. If structured content is to be shared or exchanged, the metadata fields need to be standardized with the same names, abbreviations, and purposes.

Examples of standard metadata schema include MARC for library materials, Dublin Core (DCMI) for generic online networked resources, IPTC (International Press Telecommunications Council) for photographs and other media, DDI (Data Documentation Initiative) for describing data from the social sciences, and PREMIS (Preservation Metadata: Implementation Strategies) for repositories of digital objects. Adopting such a metadata schema would be another way to enable sharing of content tagged with the metadata.

I was pleased to have the opportunity to learn more about information and publishing standards recently at the Society for Scholarly Publishing conference in attending pre-meeting seminar “All About Standards.”