The Accidental Taxonomist

Wednesday, May 4, 2016

Taxonomy Design for Content Management Systems

A very common implementation for taxonomies is in content management systems (CMS). The content managed in this kind of software can be diverse: office application files, PDF documents, image files, audio files, video files, and, in the case of web content management systems, also HTML and any kind of file to be published to the Web. The “management” this kind of software supports is also diverse: enhancing, annotating, tagging, categorizing, reviewing, approving, sharing, assigning, publishing, archiving, and deprecating of content. Finally, the users can be diverse: content creators, content managers, and anyone in an organization who needs access to a subset of the content.

Due to the diversity of content types and purposes, the metadata associated with each content item plays a very important role in a CMS. As for taxonomies, in the context of a CMS, it is probably best to consider a taxonomy as a subset of metadata, although the distinction between taxonomy and metadata can get blurred. Metadata about content can be descriptive, structural, or administrative. Descriptive metadata comprises the attributes that help make the content item retrievable or findable, including title, author, source, date, audience, document type, and also metadata for what the content is about (abstract, keywords, subjects, etc.). Many of these metadata fields should be populated with terms that are on controlled vocabulary lists for each field. In some cases, such as the “subject,” the controlled vocabulary may be rather large and thus organized into a hierarchy, in which case it constitutes a hierarchical taxonomy of subjects. In other cases, various aspects of what content is about might be categorized into different metadata fields with controlled vocabularies, such as industry, process, specialty, department, location, etc. As a result, a set of controlled vocabularies, one for each field, could be considered a faceted taxonomy, with each of these descriptive metadata fields functioning as a facet.
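To make the idea of facets as controlled metadata fields more concrete, here is a minimal sketch in Python. The facet names, terms, and file names are all made up for illustration, and no particular CMS stores its metadata this way; the point is only that each facet is a field with its own controlled vocabulary, that tags can be checked against it, and that retrieval combines selections across facets.

```python
# Hypothetical facets and terms, for illustration only.
FACETS = {
    "document_type": {"Report", "Policy", "Presentation"},
    "department": {"Finance", "Marketing", "Legal"},
    "location": {"Boston", "London"},
}


def invalid_tags(item_metadata, facets=FACETS):
    """Return (field, value) pairs that are not in the field's controlled vocabulary."""
    problems = []
    for field, values in item_metadata.items():
        allowed = facets.get(field)
        if allowed is None:
            # Not a facet (for example a free-text title), so it is not checked.
            continue
        for value in values:
            if value not in allowed:
                problems.append((field, value))
    return problems


def filter_items(items, **selected):
    """Faceted retrieval: keep the items that match every selected facet value."""
    return [
        name
        for name, meta in items.items()
        if all(value in meta.get(field, []) for field, value in selected.items())
    ]


ITEMS = {
    "q1-results.pdf": {"document_type": ["Report"], "department": ["Finance"]},
    "brand-guide.pptx": {"document_type": ["Presentation"], "department": ["Marketing"]},
    "q1-campaign.pdf": {"document_type": ["Report"], "department": ["Marketing"]},
}
```

Calling filter_items(ITEMS, document_type="Report", department="Marketing") narrows the set by two facets at once, which is what a faceted search interface does behind the scenes.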

With this in mind, the task of actually defining the descriptive metadata fields or taxonomy facets needs to involve various stakeholders, including both users and other experts and managers. Users include the various people who upload content and will tag the content with metadata and taxonomy terms, and the various end-users who will browse and search for the content using the metadata and taxonomy. Other stakeholders to involve from the beginning may include content managers, metadata architects, content strategists, business analysts, and user experience designers.

A CMS tends to offer two methods of classification: folders and tags. Folders (which in a CMS tend to be “virtual” folders, not actual file directory paths) offer an intuitive user interface for users to put content into categories and then find the content. Tags, on the other hand, are appropriate for assigning all kinds of metadata. Typically, if a dominant means of categorizing is identified through conversations with users, such as content type or subject category, this categorization scheme can be used for the folders, and then all other means of categorization and classification can be handled with the tags.

Recently a colleague asked me which method I thought was best for associating subject disciplines with multimedia content stored in a repository where the system offered both options: put them into folders named for each discipline or assign metadata tags for the disciplines. The answer, of course, is “it depends.” It depends on:

Workflow: Will the files always stay in this repository or will they “travel” downstream to other applications? If the content will likely move to other systems, then tags are preferred.
Taxonomy size: Is the taxonomy under consideration for folders large? A large set of folders may be cumbersome to browse through and more suitable for type-ahead lookup in a metadata field lookup table or search box.
User preference: Do users who upload prefer to use folders or tags only? Do users who need to retrieve the content prefer to browse through folders or only search on tags?
Categorization enforcement: Can you require users to assign descriptive tags? If you are concerned that they will not, folders will better enforce the use of the categories.
Support for hierarchy: Will the system support a hierarchy of categories within the lookup controlled vocabulary lists for the tag fields, or are hierarchies only supported as folders, or neither? Then consider which fields would benefit most from a hierarchy.
Support for synonyms: Do the lookup controlled vocabulary lists for the tag fields include support for synonyms/alternate labels? If so, and if the controlled vocabulary is large, then tags have the advantage over folders, which cannot have synonym labels.

After determining what part of the categorization system, if any, goes into folders, and what goes into tags, the next task is to figure out how many descriptive metadata tag fields to create. Issues include:

What metadata can be assigned automatically and what must be done manually? If it can be assigned automatically (such as file format type or language by auto-detect software or maybe even subject category by use of auto-categorization software), that’s great, but manually assigned metadata should be limited so as not to make the task burdensome.
What fields are users likely to search on in retrieval? You need to cover the basics, but there is no need for additional fields that users are not likely to use as lookup criteria.
What method of classification is important to the users? “Subjects” is a catch-all field, but if users are always thinking of something else too, such as Discipline or Product, then these should be pulled out into separate fields or facets.

Finally, when designing taxonomy and metadata for a CMS, the taxonomist should have use of a test data instance of the system to try out the implementation of the taxonomy in the CMS user interface. A taxonomy that looks good offline (in Excel or a taxonomy management system) might appear awkward within the limitations of a CMS’s user interface.

Sunday, April 17, 2016

"The Accidental Taxonomist," 2nd edition

Recently I was asked what I added to the newly published 2nd edition of my book, The Accidental Taxonomist. The additions and changes are summarized in the book's preface, so I have decided to post the entire preface here, which follows:

When I published the first edition of The Accidental Taxonomist, I knew that changes would be needed within a couple of years, mostly to reflect the changes in thesaurus management software vendors, as software is a volatile industry characterized by new companies, acquisitions, and some vendors going out of business. It was also expected that the website examples, given as screenshots in the book, would change. As it turned out, the changes were more widespread than anticipated. I ended up replacing all screenshots and adding some new ones (totaling 44), since even existing software vendors or websites had updated their user interfaces. More than half of the various website URLs found throughout the book also had to be updated.

In the area of software, what I did not anticipate was that software changes have gone beyond just who the vendors are and what features vendors have added. There have also been some notable trends, such as in the adoption of Semantic Web standards, the convergence of taxonomy and ontology support, and more web-based, cloud/software-as-a-service offerings. Thus, in addition to adding more software vendors (and removing a few), I have also added a short section summarizing all of these software trends.

Also with respect to software, the first edition made no mention of SharePoint, since SharePoint 2010, the first version to support taxonomies, came out the same year my book did. So this new edition includes some discussion of managing taxonomies in SharePoint. There is not the space here to go into all the details, so I explore specific topics, such as managing polyhierarchy in SharePoint, on my blog, also called The Accidental Taxonomist.

The standards have changed too. ANSI/NISO Z39.19 2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies was reaffirmed in 2010, but more significantly ISO 2788 Guidelines for the Establishment and Development of Monolingual Thesauri and 5964 Guidelines for the Establishment and Development of Multilingual Thesauri have been replaced by ISO 25964 Thesauri and Interoperability with Other Vocabularies, Part 1 in 2011 and Part 2 in 2013. This is not merely a reorganization of parts. The changes also comprise new content in the area of interoperability, including the exchange of taxonomy data and mappings between vocabularies. Now ANSI/NISO Z39.19 is coming due for a new version, but it is a long process. With an eye to a wider international audience, in this edition I cite the ISO standard along with the ANSI/NISO standard whenever relevant.

In addition to the change in the ISO thesaurus standard, there is also a change involving the wider adoption of other kinds of standards, most significantly those associated with the Semantic Web. Although development had begun earlier, the World Wide Web Consortium (W3C) formally released the SKOS (Simple Knowledge Organization System) standard only in August 2009, when I was busy finalizing my manuscript for the first edition, before the extent of the eventual adoption of SKOS was known. Now it is quite common for taxonomy management software to follow the SKOS specifications of concept modeling and taxonomy output. So, more attention to SKOS is given in this edition.

Another trend, which was already underway at the time I wrote my first edition, but which I simply did not bother to consider in detail, is the convergence of metadata and taxonomy. So, I have added a short section on the topic. I needed the intervening years to actually work in areas where taxonomies and metadata meet, whether through consulting or in a department called Metadata Standards and Services, before I felt I could say something original on the subject.

As for the people who do taxonomy work, the accidental taxonomists, I conducted a new survey, which has shown that their backgrounds remain as diverse as they were when surveyed six years prior, but there are new stories and examples of how people got involved in this type of work and what they like about it. Meanwhile, the opportunities for taxonomists continue to grow. I executed the exact same search for jobs in fall of 2009 and again in fall of 2015, on the job board aggregator Indeed.com, and found the numbers of currently posted openings had significantly increased.

Although I considered myself quite experienced with various taxonomies at the time I wrote the first edition, I have continued to gain additional taxonomy work experience since, so here and there throughout the book I have added information based on further reflection. Thus, in the chapter on planning and designing a taxonomy, I have added some advice regarding designating facets for enterprise taxonomies, questions to ask during stakeholder interviews, how to conduct stakeholder workshops, and methods of testing taxonomies.

I had also started writing my blog the year after the first edition, but the blog post topics are not the same as the additions to this book. The Accidental Taxonomist blog allows me to explore tangents in more detail, and this book is already longer than it needs to be!

Taxonomies are interesting in that some things about them are fundamental and do not change, such as the notion of a concept, its varied names, and its hierarchical and nonhierarchical relationships with other concepts. But, as with anything related to information technology, there are things about taxonomies that do change, such as how they are managed, implemented, and utilized. Thus, it is not only the varied subject matter that makes taxonomy work interesting, but also the various implementations and opportunities to take advantage of in new technologies, such as those related to the Semantic Web and Linked Open Data. Although this new edition addresses these topics, my ongoing blog will cover further considerations in such areas.

Friday, March 25, 2016

Taxonomy Books

I am pleased to announce the 2nd edition of The Accidental Taxonomist. The print edition is available to order from the publisher, Information Today Inc., now, and will be available from various online retailers by early June. Ebook versions will follow. So, this is a good time to survey other books on taxonomy creation.

When I wrote the first edition in 2009, I had looked into other books about taxonomies that were published at the time. Following is what I had written in my proposal to the publisher regarding the “competition,” including my comments at the time and how my book would fill a gap. (The only change I have made below is updating the prices, which are the list prices, but lower prices can usually be found through online retailers.)

Lambe, Patrick. Organising Knowledge: Taxonomies, Knowledge and Organisational Effectiveness. Oxford, England: Chandos Publishing. (2007)
This book is well-reviewed, but published in the UK and somewhat expensive. It takes a somewhat broader approach, looking at knowledge management and not just taxonomies. It is aimed more at the business professional, manager, consultant, rather than the practicing taxonomist. It can be a bit overwhelming to the new MLIS grad or the indexer curious about getting into taxonomy construction. [$70.00]

Stewart, Darin L. Building Enterprise Taxonomies. Portland, Oregon: Mokita Press. (2008)
This book is self-published and not well marketed. It was created as a book for an online course taught by the author at the University of Oregon Applied Information Management Master’s degree program. Its reviews are generally good. The book is focused on enterprise taxonomies only, though. Its index is horrible. [$39.99]

Jagerman, Evert J. Creating, Maintaining and Applying Quality Taxonomies. Zoetermeer, Netherlands: E.J. Jagerman. (2006)
This book is self-published (Lulu.com) and not well marketed, published in the Netherlands, and only 152 pages. I have not found any reviews of it. [$43.62]

King, Brandy E. and Kathy Reinold. Finding the Concept, Not Just the Word: A Librarian's Guide to Ontologies and Semantics. Oxford, England: Chandos Publishing. (2008)
Like Lambe’s book Organising Knowledge, this book is also published by Chandos Publishing in the UK and is rather expensive. It is focused on only ontologies and not other kinds of taxonomies, and its audience is research librarians. The inclusion of four case studies is interesting, though. [$61.81]

Aitchison, J., A. Gilchrist, and D. Bawden. Thesaurus Construction and Use: A Practical Manual (4th ed.). Chicago: Fitzroy Dearborn. (2000)
This book is limited to traditional information retrieval thesauri, is somewhat out of date (based on a first edition published in 1972), published in the UK and rather expensive. [$125.95]

Broughton, Vanda. Essential Thesaurus Construction. London: Facet Publishing. (2006)
This book is limited to traditional information retrieval thesauri, scholarly and not well marketed, published in the UK and rather expensive. [$85.00]

Bailey, Kenneth D. Typologies and Taxonomies: An Introduction to Classification Techniques (Quantitative Applications in the Social Sciences series) Thousand Oaks, California: Sage Publications. (1994)
This is a short monograph of under 100 pages and focuses on cluster analysis (whatever that is) written by a professor of sociology with a focus on research methods. It is mathematically too technical for most readers. [$15.46]

Of the aforementioned books, the one that I would recommend, and recommend highly, is Patrick Lambe’s Organising Knowledge. The book is indeed worth its price, but is probably more suited for readers who are serious about taxonomies and not merely curious about them.

In the intervening six years a few more books about taxonomies or controlled vocabularies have been published, and I have looked at the following:

Harpring, Patricia. Introduction to Controlled Vocabularies: Terminology for Art, Architecture, and Other Cultural Works. Los Angeles: Getty Publications. (2010)
This book focuses on classification schemes, taxonomies, thesauri, or other controlled vocabularies for indexing information within a limited subject area and especially for museum works. It is an excellent book for that purpose, but less relevant for taxonomies or controlled vocabularies in general or for other purposes. [$50.00]

Abbas, June. Structures for Organizing Knowledge: Exploring Taxonomies, Ontologies, and Other Schemas. New York: Neal-Schuman Publishers. (2010)
This book gives a broad treatment of knowledge organization systems. While it does not provide detailed instructions on how to create a taxonomy or another type of system, the comprehensive and thorough coverage puts taxonomies, controlled vocabularies, and other classification schemes into perspective and context, and thus it is a very informative book. [$93.00]

Hlava, Marjorie H.K. The Taxobook. (series) Morgan & Claypool Publishers. (2015)
This book is published as a series of three separate volumes: Part 1: History, Theories and Concepts of Knowledge Organization (54 pages); Part 2: Principles and Practices of Taxonomy Construction (117 pages); Part 3: Applications, Implementations, and Integration in Search (128 pages). The first volume has some interesting insights, but is not information that is needed for practical purposes. The second volume explains the basics of taxonomies and how to create them. The third volume presents more unique information on taxonomy implementation, but may be aimed at others than those who create taxonomies. [$150 for the series, or $50 for each volume]

For more in-depth information on some of these books, I have published book reviews in Key Words, the journal of the American Society for Indexing, and have PDF copies on my website:

Lambe, Patrick. Organising Knowledge

Stewart. Building Enterprise Taxonomies

Harpring. Introduction to Controlled Vocabularies

Abbas. Structures for Organizing Knowledge

Hlava. The Taxobook

As for where The Accidental Taxonomist fits in, I would consider it the most practical and instructional book on how to create taxonomies and thesauri for all kinds of uses. It is clearly aimed at the practicing taxonomist, as the title implies, especially since the core of the book came out of an online course I had developed a year and a half prior. It also has unique information about the field of taxonomy work. It’s also not expensive ($39.50 list price, with discounts periodically available). So, if you already own the first edition, you should consider buying a copy of the second edition as well, which has new information on taxonomy software, SharePoint, metadata, taxonomy testing, etc., a new taxonomist survey, and all new screenshot graphics. Thus, another benefit of The Accidental Taxonomist, 2nd edition, is that it's the most up-to-date book on the subject.

Monday, February 29, 2016

Free Taxonomy Management Software

There is always an interest in free taxonomy or thesaurus management software. Many people who create taxonomies try to save money on purchasing taxonomy management software by simply not using any taxonomy management software but something else they already have, such as Excel. Those who are developing either very large taxonomies or more complex thesauri, however, realize that a dedicated taxonomy/thesaurus management system will save a lot of time and headache in the long term.

Various free thesaurus management software offerings have been available since the early 1990s. They tend to have their origins in academic projects in computer science, information science, or library science at universities, and others have been government projects. Some free software of the previous decade is no longer available, though. Discontinued software is still listed for posterity on the web directory of "Software for building and editing thesauri," started by Leonard Will and now managed on the Taxobank website. For example, two free software products listed were for MS-DOS and one no later than Windows 3.1.

The first free thesaurus software I was familiar with was TheW, a simple thesaurus management program developed by Tim Craven, a professor of information science at the University of Western Ontario, since retired. I actually ran across it because I was at the time exploring another software program of Prof. Craven’s for creating website indexes. TheW32, which is available for Windows XP, Vista, and 8 and for Java, is no longer maintained. It was last updated for Windows in 2007 and for Java in 2009. At this point, I would no longer recommend it.

Protégé Ontology Editor is an established free and open-source ontology editor from Stanford University. It is quite robust, has an active user community and support groups, and continues to be upgraded (with version 5.0.0 recently released in beta). The issue with Protégé is that it is a native ontology management tool, not a thesaurus management program (or even ontology “lite” as some thesaurus management software can manage semantic relationships and classes). Thus, it takes a very different approach to modeling and building vocabularies, which is not intuitive to taxonomists, such as myself, and, although I downloaded it, I never found it worth the difficulty to learn. If you can truly consider yourself an ontologist, though, then great, this might just be the solution for you.

I had explored some other free software offerings when writing my book, The Accidental Taxonomist, six years ago and came across TemaTres and ThManager. At the time I did not find them adequately enforcing valid relationships between terms, so I was somewhat dismissive about the software. Recently I revisited these products.

TemaTres, which has its origins in the Library and the University of Buenos Aires, Argentina, still does allow creating duplicate terms, which was my initial cause for concern, but the user interface of the latest version (2.1) now offers a configuration option for quality policies to enable or disallow duplicate terms. Thus, TemaTres is a suitable free thesaurus software product if used by a knowledgeable and experienced taxonomist who knows to set the options and understands the alerts. TemaTres is being supported, and its latest version was released just this winter, 2016. The software is web-based, which means that it requires a PHP, MySQL, and HTTP web server, so it may not be the configuration that any independent taxonomist would set up and install in a small/home office. Otherwise, TemaTres is worth looking into.

ThManager is from the University of Zaragoza and GeoSpatiumLab S.L., both in Zaragoza, Spain. ThManager supports the SKOS standard rather than ANSI/NISO Z39.19 or ISO 25964, which means it does not by default enforce all rules of the latter standards. But I have since found this to be a trend of new vocabulary management software: compliance with SKOS and support for ANSI/NISO Z39.19 or ISO 25964, as configurable rather than by default. Thus, I am no longer complaining if it does not support ANSI/NISO Z39.19 by default. The main problem with ThManager, though, is that it is not kept so well up to date. It was last significantly updated in 2006. The installation for even Windows 7 requires a “portable” version due to an installation bug.

More recently I discovered another free thesaurus management software, VocBench. It was developed originally for the management of the AGROVOC thesaurus of the Food and Agriculture Organization (FAO) of the United Nations as a joint project of FAO, which is based in Rome, Italy, and the Artificial Intelligence Research group at the University of Rome Tor Vergata. VocBench, like TemaTres, is SKOS-compliant, rather than ANSI/NISO Z39.19 compliant. VocBench is web based, with web server requirements of Apache Tomcat, MySQL, and OWLIM installed on a Sesame2 server.

In addition to being free, these applications tend to have the advantage of being able to run on multiple platforms and yet can be installed and used by a single user. The editing features may be a little less standard and thus less intuitive, and documentation and support tend to be less than with commercial software. Yet, they are worth considering for long-term experimentation (with no time limit as in commercial demo software), for use in nonprofit or low-budget projects, or by anyone with a strong interest in working with open source software.

Saturday, January 30, 2016

Polyhierarchy in the SharePoint Term Store

Last year I had the opportunity to create some taxonomy in the SharePoint Term Store (also called Managed Metadata), and while I am pleased that hierarchical taxonomies are supported in this widely used platform, I had some concerns about the support of polyhierarchy, as information about this capability is inconsistent. So I experimented further.

Polyhierarchy means a taxonomy term has more than one broader term or parent term. In a traditional hierarchical taxonomy structure, a term has one broader term (unless it is the top term, in which case it has no broader term) and multiple narrower terms. Occasionally, though, the logic of the hierarchy and the practical need to guide users down different possible paths make it beneficial to give a term two or more broader terms. It may appear to the user that the term is duplicated in different locations in the taxonomy, but this duplication is in appearance only, because it is the same term and thus linked/indexed to the same content, no matter which broader term path the user clicked down through.
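For readers who think in data structures, a polyhierarchy can be sketched as a single term record that lists more than one broader term. The following Python sketch uses made-up term IDs and file names and is not how any particular taxonomy tool stores its data; it only illustrates that content is tagged with the term's ID, so both browsing paths arrive at the same content.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str
    label: str
    broader: set = field(default_factory=set)  # IDs of broader (parent) terms


# Hypothetical IDs; one term record has two broader terms.
TERMS = {
    "t1": Term("t1", "Financial documents"),
    "t2": Term("t2", "Reports"),
    "t3": Term("t3", "Financial report", broader={"t1", "t2"}),
}

# Content is tagged with term IDs, not with a position in the hierarchy.
TAGGED = {"annual-2015.pdf": {"t3"}, "budget-memo.docx": {"t1"}}


def narrower_of(parent_label):
    """Map label -> ID for every term that has the given parent as a broader term."""
    parent_ids = {t.term_id for t in TERMS.values() if t.label == parent_label}
    return {t.label: t.term_id for t in TERMS.values() if t.broader & parent_ids}


def browse(parent_label, child_label):
    """Click down from a parent to a child and return the content tagged with the child."""
    term_id = narrower_of(parent_label)[child_label]
    return sorted(doc for doc, ids in TAGGED.items() if term_id in ids)
```

Here browse("Financial documents", "Financial report") and browse("Reports", "Financial report") return the same list, because both paths end at the term with ID t3.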

An example would be the term Financial report, which is shown in the Figure 1 screenshot from the SharePoint Term Store.

Fig. 1 Financial report as a narrower term to the term Financial documents.

It would be practical to have a broader term of Financial documents and another broader term of Reports. Some users will look for the term under Financial documents, and other users will look for it under Reports.

The SharePoint 2010 or 2013 Term Store claims to support the creation of polyhierarchy, but it has significant limitations.

Polyhierarchy permitted only across different hierarchies

The support of polyhierarchy in the SharePoint Term Store takes the notion of “polyhierarchy” too literally by insisting that the two broader terms of a term in a polyhierarchy actually belong to different hierarchies. This means that the polyhierarchy can only be created across different Term Sets in SharePoint. A Term Set is a hierarchy or a facet with a single top term. It is prohibited to create a polyhierarchy within the same Term Set. This is quite problematic, because I find that the vast majority of the time that I want to create a polyhierarchy, it is within the same top-level hierarchy or facet.

In the example of Financial report, it is logical to have two broader terms of Financial documents and Reports. Both of these broader terms, however, are within the same Term Set or facet, which I might call Document type, so the SharePoint Term Store will not permit this polyhierarchy. Having the term Financial documents appear under a second broader term within any other Term Set or facet, on the other hand, such as the Department or Location facet, is permitted by SharePoint, but this would not be a correct hierarchical structure by taxonomy standards.

Only one method to create polyhierarchy

In the SharePoint Term Store, you cannot create a broader term relationship; you can create only narrower term relationships. Thus, you can only create hierarchies from the top down. The normal way to create a polyhierarchy, however, is to add a second broader term relationship, but this is not possible in SharePoint. Instead, the same term has to be made as a narrower term to a second term.

So, if you have the term Financial report as narrower to Financial documents, and you want to make Reports also a broader term (and Reports exists in another Term Set), you would go to the second term that will be the new broader term (Reports), click on Create Term, and type in the name of an existing term (Financial report). SharePoint, however, does not enforce taxonomy standards and permits you to create a new term with the same name as another term (Financial report), but it will not be the same term. You can see at the bottom of the General Information pane that the duplicate Financial report term’s unique identifier is different from that of the original Financial report term, as shown in Figure 2.

Fig. 2 General Information for a selected term

This matters, because terms are used for indexing/tagging. The term with one ID in one location may be indexed to some of the content, and the term with the other ID in the other location will be indexed to other content, and neither term will be indexed to all the content. This would be bad for retrieval. So, this method should not be used to create polyhierarchy.
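The retrieval problem described above can be illustrated with a small Python sketch. The IDs and file names are invented (SharePoint assigns its own unique identifiers), and the data layout is only an assumption for illustration. Two terms share a label but have different IDs, so a search on either one finds only part of the content; the second function is the kind of check that would catch such same-name duplicates.

```python
from collections import defaultdict

# Hypothetical IDs: the same label was typed in twice, creating two terms.
TERMS = [
    {"id": "id-001", "label": "Financial report", "parent": "Financial documents"},
    {"id": "id-002", "label": "Financial report", "parent": "Reports"},
]

# Each document was tagged with whichever copy of the term the user happened to pick.
TAGGED = {"q1.pdf": "id-001", "q2.pdf": "id-002", "q3.pdf": "id-001"}


def docs_tagged_with(term_id):
    """Return the documents indexed to one specific term ID."""
    return sorted(doc for doc, tid in TAGGED.items() if tid == term_id)


def duplicate_labels(terms):
    """Find labels that are shared by more than one term ID."""
    ids_by_label = defaultdict(list)
    for term in terms:
        ids_by_label[term["label"].casefold()].append(term["id"])
    return {label: ids for label, ids in ids_by_label.items() if len(ids) > 1}
```

Neither docs_tagged_with("id-001") nor docs_tagged_with("id-002") returns all three documents, which is exactly the split in indexing that a true polyhierarchy avoids.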

To create polyhierarchy in SharePoint, go to a second term that is intended to be the additional broader term (Reports), click on Create Term, and start typing the name of an existing term (Financial report). At the bottom of the screen you will see “Suggestions,” with yellow-highlighted type-ahead matches to existing terms in another Term Set or even another taxonomy group. If you select one of these suggested terms, rather than simply finishing typing the name, then you will indeed be creating a polyhierarchy. After doing so, you will notice that the tag icon preceding the term becomes the “reused tag” icon, as shown in Figure 3, in both locations, under the new broader term and under the existing broader term. You will also notice that when you select the term and view its General details, the data in the box under Member Of shows that the term is a member of both hierarchies.

Fig. 3 Reused tag example for the term Marketing

Importing a taxonomy with polyhierarchy

If you import an externally created taxonomy in CSV format as a Term Set via the Term Store’s import feature and that taxonomy has polyhierarchy, the Term Store will not recognize the polyhierarchy, but rather will treat the polyhierarchy terms as distinct terms with duplicate names, assigning them unique IDs. Thus, they could be used inconsistently in indexing/tagging. Therefore, you should ensure that imported CSV taxonomies do not have any polyhierarchy.
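One way to check a CSV file for polyhierarchy before importing it is to look for term names that occur under more than one parent path. The Python sketch below assumes a simplified file in which each row gives one term's full path in columns named "Level 1 Term", "Level 2 Term", and so on; the column names and sample rows are assumptions for illustration, so compare them against the actual import template before relying on a check like this.

```python
import csv
import io
from collections import defaultdict

# Assumed, simplified layout: one row per term, full path spread across level columns.
SAMPLE = """Level 1 Term,Level 2 Term
Financial documents,
Reports,
Financial documents,Financial report
Reports,Financial report
"""


def find_polyhierarchy(csv_text):
    """Return {term label: [parent paths]} for labels found under more than one parent path."""
    reader = csv.DictReader(io.StringIO(csv_text))
    level_cols = sorted(
        (c for c in reader.fieldnames if c.startswith("Level ")),
        key=lambda c: int(c.split()[1]),
    )
    parents = defaultdict(set)
    for row in reader:
        # Keep only the filled-in levels; the last one is the term itself.
        path = [row[c].strip() for c in level_cols if row[c] and row[c].strip()]
        if path:
            parents[path[-1]].add(tuple(path[:-1]))
    return {label: sorted(paths) for label, paths in parents.items() if len(paths) > 1}
```

Running find_polyhierarchy(SAMPLE) flags Financial report, which appears under both Financial documents and Reports, so one of the two rows would need to be removed before the import.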

If you import a taxonomy created in an external taxonomy/thesaurus/ontology management system which permits polyhierarchy, and that software has a feature or connector to import to SharePoint Term Store, there are different methods of dealing with the polyhierarchy issue. The default of some software, such as Semaphore Ontology Editor and TopBraid Enterprise Vocabulary Net, is to retain only one of the pair of broader term relationships upon export. For example, in Semaphore, the first hierarchical relationship encountered for a term is retained and any others are not, but the user gets an alert. Wordmap also provides a validation error if there is a polyhierarchy for import into the same Term Set. Rather than maintaining a random one of more than one broader term relationship, Synaptica strips out all broader term relationships if there is more than one, and then the former polyhierarchy terms show up on the orphan term list for review. In some software, such as TopBraid EVN, the user can define quality/validation rules that would identify polyhierarchy, so the user can remove any before importing into SharePoint. Other software vendors, such as Data Harmony and PoolParty, say they have work-arounds for the SharePoint import to sort of support polyhierarchy, but I have not tested these.

In conclusion, the Term Store’s support of polyhierarchy only across Term Sets (hierarchies or facets) is not very useful, since the majority of time that we would want to create a polyhierarchy, it is within the same Term Set, especially if the Term Set is to be used as a facet. A term with the same name in more than one facet typically would have a slightly different meaning and usage.