Tuesday, October 6, 2015

Taxonomies and Tables of Contents

A table of contents and a hierarchical taxonomy appear to be quite similar. In my last blog post I looked at taxonomies and indexes, and in the end concluded: “A taxonomy serves a purpose that is both, or something in-between, that of a table of contents and a back-of-the-book index. It’s for searching (like in an index) and also for navigating (like in a table of contents), but it points to the subsection level (as in a detailed table of contents), not to a page (as in an index).” Taxonomies, especially the thesaurus kind, have many similarities to indexes when it comes to looking up a topic. Taxonomies, especially the hierarchical kind, are also similar to a table of contents or the navigation aid to a set of content.

Despite the apparent similarities in hierarchical structure and the purpose of supporting browse navigation, the differences between a table of contents and a hierarchical taxonomy are far greater than the differences between a displayed index and a search-supporting thesaurus.

A table of contents provides navigation, whether for a printed book or large document or for an electronic document or collection. In fact, in an MS Word document with headings, the table of contents that is generated in the left margin pane from those headings is called “Navigation.” Labels in a table of contents or navigation system are arranged like a taxonomy but are not exactly a kind of taxonomy.

Navigation is not a taxonomy

 

Navigation or a table of contents has to perfectly reflect the content that it belongs to. It is completely customized. Two books on the same subject cannot have the same table of contents.  The same taxonomy, however, may be used for more than one content source and typically is. In a table of contents or navigation, each navigation entry, menu label, or heading matches one-to-one to a single, specific section or web page.  Terms in a taxonomy are intended to be used more than once, so each term in a taxonomy is linked to multiple documents or content items.  As such, taxonomy terms need to be somewhat generic, whereas labels or headings in a table of contents or navigation can be specific. Taxonomy terms also need to be created with the anticipation of serving not only current content but also future content, whereas navigation or table of contents entries need only reflect the current content.
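The one-to-one versus one-to-many distinction can be sketched as two small data structures. This is only an illustration under my own assumptions: the headings, term names, document IDs, and the index_item helper are all hypothetical and not taken from any real system.

```python
# Hypothetical illustration: a table of contents maps each heading to exactly
# one section, while a taxonomy term may be linked to many content items.

# Table of contents: heading -> the single section it labels.
table_of_contents = {
    "Introduction to Identity Theft": "section-1",
    "Special Concerns": "section-2",
}

# Taxonomy: term -> set of content items indexed with that term.
taxonomy_links = {
    "Identity theft": {"doc-a", "doc-b"},
    "Information accuracy": {"doc-c"},
}


def index_item(links, term, item_id):
    """Link a content item to an existing taxonomy term."""
    if term not in links:
        raise KeyError(f"'{term}' is not a term in the taxonomy")
    links[term].add(item_id)


# Future content reuses an existing term; the taxonomy itself is unchanged.
index_item(taxonomy_links, "Identity theft", "doc-d")
print(sorted(taxonomy_links["Identity theft"]))
```

Adding new content changes only the links, not the list of terms, which is why taxonomy terms have to be generic enough to serve content that does not exist yet.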

Different label wording 

In addition to being more generic, taxonomy terms differ from table of contents entries or navigation labels in other ways.

  • The names of chapters and headings may be longer descriptions (such as “Procedures to Enhance the Accuracy and Integrity of Information Furnished”), whereas taxonomy terms should be concise to aid skimming. A complex topic with a complex heading can be covered with a combination of taxonomy terms instead of a single complex term, because taxonomy terms do not need to match all content one-to-one (such as the combination of terms: Information accuracy, Information integrity, and Information-gathering procedures).
  • The names of chapters and headings might be question phrases (such as “Why study statistics?”), whereas taxonomy terms should be nouns or adjective-noun phrases and start off with a “keyword” likely to be looked up (not “Why”) to support alphabetical lookup options. Even in a hierarchical taxonomy display, a list of terms at the same hierarchical level tends to be arranged alphabetically.
  • Table of contents entries may be context-specific based on the parent/broader level (such as “Identification and General Terms” or “Special Concerns”), and, in fact, the same sub-heading could repeat under different broader headings. In a taxonomy, each term should be independently unambiguous.
  • Tables of contents often start off naming introductory information (such as “Introduction to Identity Theft”) or have sections for Conclusions, neither of which should be terms in a taxonomy. If the same topic is covered three times, in an introduction, body, and conclusions, it will be indexed with the same single taxonomy term, and the end-user will retrieve all indexed results on that topic grouped together.
  • Table of contents or navigation headings can be like titles, which may be “catchy” or enticing to the reader, especially at the top level. Taxonomy terms, by contrast, are clear, concise, and common (based on what most users would call the concept), and not especially creative.

Different structure

 

Tables of contents and taxonomies also differ in their structure. Tables of contents or navigation schemes reflect the organization of content, which may be chronological, pedagogical, from fundamental to detailed, from most important to least important, or the order of perceived user interest. In a taxonomy, the terms at each hierarchical level are arranged alphabetically by default. In navigation there are no “related terms,” so what appear as subtopics might not be taxonomical narrower terms, but just related terms. Taxonomies, on the other hand, must follow the ANSI/NISO Z39.19 guidelines or ISO 25964 with respect to structuring hierarchical relationships: narrower terms must be specific types, instances, or integral parts of their broader terms. By having this standard format, a taxonomy provides organizational predictability for all kinds of users and all kinds of content.

There are certain editorial conventions for content, such as having units of a roughly standard length, which then impact the table of contents or navigation. While there are some variations, one chapter or section is typically not twice as long as another. To achieve balance, a large topic may be spread out over two or more sections, whereas several small topics are grouped together under a heading that is a serial list (such as “Poverty, Inequality, and Mobility”), or under “Other.” Thus, the topics in a table of contents are based on the amount of material presented. Taxonomy structure, on the other hand, looks at the terms/concepts only, and does not take into consideration the amount of content per term. There is one concept per term, not a list. Rare occurrences of two concepts combined into a single term, such as “Author voice and tone,” are the consequence of two topics being very closely related with overlapping meaning and usage.

Conclusions


While a table of contents or navigation system is not a taxonomy, nor should it be used as a taxonomy, when a legacy print source is converted to units of digital content, a table of contents is still an excellent source for creating a taxonomy.




Monday, August 31, 2015

Taxonomies and Indexes



Taxonomies and indexes are similar in that they both help guide people to find desired information on a selected topic. While they could be searched, they are designed specifically to be browsed. The obvious difference is that taxonomies for end-users are arranged hierarchically (or by facets), and indexes are arranged alphabetically. I have blogged previously on a comparison of index creation and taxonomy/thesaurus creation, but for those who are not already skilled at creating one or the other, let’s step back and further compare taxonomies and indexes themselves.

Taxonomy and Index Similarities and Differences


Taxonomies and indexes were developed for different kinds of media. Modern taxonomies are designed to function well in online implementations (through clicking on hyperlinks to narrower topics or plus signs to expand hierarchical trees), although taxonomies have existed in print as well. Indexes, specifically the back-of-the-book style, are designed to function well in print (through scanning a large number of entries and subentries on a page), although displayed indexes occasionally exist online as site A-Z indexes on small, static websites. Hyperlinked indexes at the end of ebooks are also possible, but the inadequate application of ebook standards has hindered such indexes from becoming commonplace.

Taxonomies and indexes serve different kinds of content. Taxonomies work well for content in a subject area that is easy or logical to categorize: products or product types, industries, geographic areas, occupational areas, media or document types, etc. Indexes work well for content in a subject area that is more abstract and does not lend itself to hierarchical categories: management concepts, history, news, etc. Indexes, since they are arranged alphabetically, are also excellent for browsing names/proper nouns. Taxonomies work well for a defined scope, such as collections of documents of the same type (all resumes, all marketing materials, all legal documents, etc.). Indexes, on the other hand, tend to serve better for content with a less defined scope, such as general encyclopedic information or detailed user manuals. Not surprisingly, book-like content continues to be best served by indexes.

The differences in structure are not as simple as taxonomies being hierarchical and indexes being alphabetical. Taxonomies also have alphabetical aspects, as terms at the same level of a hierarchy are typically (or by default) arranged alphabetically. Indexes, meanwhile, also have hierarchical aspects, as there are main entries with subentries under them. Some large indexes even have a third level of sub-subentries. Then there are kinds of taxonomies, called thesauri, which are structured more around terms and relationships than hierarchical trees, and such thesauri may be arranged alphabetically. In fact, the same thesaurus can be arranged either hierarchically or alphabetically, with the click of a toggle button in a thesaurus management system. But re-sorting a thesaurus alphabetically does not change it into an index. It will still lack the subentry features of an index.
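To make the two arrangements concrete, here is a minimal sketch of one thesaurus shown both ways. It is an assumption-laden toy: the terms, the narrower mapping, and both function names are hypothetical, and real thesaurus management systems store far more (related terms, non-preferred terms, scope notes).

```python
# Hypothetical thesaurus: each term maps to its list of narrower terms.
narrower = {
    "Vehicles": ["Trucks", "Cars"],
    "Cars": ["Sedans", "Convertibles"],
    "Trucks": [],
    "Sedans": [],
    "Convertibles": [],
}


def hierarchical_display(narrower, top_terms):
    """Return indented lines, with siblings sorted alphabetically at each level."""
    lines = []

    def walk(term, depth):
        lines.append("  " * depth + term)
        for child in sorted(narrower.get(term, [])):
            walk(child, depth + 1)

    for term in sorted(top_terms):
        walk(term, 0)
    return lines


def alphabetical_display(narrower):
    """Return every term in one flat alphabetical list."""
    return sorted(narrower)


print("\n".join(hierarchical_display(narrower, ["Vehicles"])))
print("\n".join(alphabetical_display(narrower)))
```

The alphabetical view is just a flat, sorted list of the same terms. It has no subentries and no links to content, which is why the toggle does not produce an index.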

The defining difference between a taxonomy and an index is that an index is not an index unless it is linked to content, as the word “index” means “to indicate” or “to point,” as in to point to content. A taxonomy is still a taxonomy whether or not it is linked to content. (But it is not really useful, unless it is linked to content.)

Where Taxonomies and Indexes Meet


In addition to back-of-the-book indexes, there also exist periodical article indexes, such as the green-bound printed volumes of the Readers’ Guide to Periodical Literature and subsequent online periodical and reference databases accessed through libraries (InfoTrac, ProQuest, EBSCOhost, etc.). What happens is that indexers index the articles with terms from the taxonomy (or thesaurus or controlled vocabulary). The result of the indexing, an alphabetical arrangement of taxonomy terms that were used in the indexing with their links to content, constitutes an index. So, the index comprises terms in the taxonomy that are linked to content and arranged alphabetically. Displayed browsable alphabetical indexes, however, have become less common in online services, as they have been replaced by features that search on the index terms instead.
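How an index falls out of taxonomy-based indexing can be sketched in a few lines. This is a simplified illustration, not how any of the named databases works: the article IDs, the assigned terms, and the build_index function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical indexing records: article -> taxonomy terms assigned by an indexer.
indexed_articles = {
    "article-1": ["Taxonomies", "Indexing"],
    "article-2": ["Indexing"],
    "article-3": ["Thesauri", "Taxonomies"],
}


def build_index(indexed_articles):
    """Invert article->terms into an alphabetical list of (term, articles)."""
    postings = defaultdict(list)
    for article, terms in indexed_articles.items():
        for term in terms:
            postings[term].append(article)
    # Only terms actually used in indexing appear, sorted alphabetically.
    return [(term, sorted(postings[term])) for term in sorted(postings)]


for term, articles in build_index(indexed_articles):
    print(term, "->", ", ".join(articles))
```

Only terms that were actually assigned to some article show up in the result, so the index is a subset of the taxonomy plus its links to content.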

The trend toward “multi-channel publishing” means that the same original content may appear in different formats and media, such as print and online. Online, however, may mean more than just a PDF or other ebook format of the printed version. Rather, digital text content gets chunked into units of the size or length that could be indexed as a whole with taxonomy terms, and images and new multimedia exist as separate assets that can also be indexed with taxonomy terms.  What this means is that a manual, user guide, or textbook that in print had a back-of-the-book index, in the digital or online medium consists of multiple files for each section or unit and for each media asset, which are indexed and thus retrieved by taxonomy terms instead of using the back-of-the-book index.

Index Entries for Taxonomy Terms?


I have worked on projects where printed content (books, manuals, etc.) was digitized and put into small chunks or files to be indexed with a taxonomy, and the original printed volume had a back-of-the-book index. So, the issue arose: to what extent should the legacy back-of-the-book index be utilized when developing the new digital retrieval taxonomy? I had access to the index for candidate taxonomy terms and was encouraged to utilize it.

My conclusion has been that the back-of-the-book index serves a slightly different purpose for users than does an indexed taxonomy. A back-of-the-book index serves to locate the page where something was mentioned on a specific topic. Users of a reference work, however, may at other times consult the table of contents to navigate and find the relevant sections and sub-sections. A taxonomy serves a purpose that is both, or something in-between, that of a table of contents and a back-of-the-book index. It’s for searching (like in an index) and also for navigating (like in a table of contents), but it points to the subsection level (as in a detailed table of contents), not to a page (as in an index). Also, more content is expected to be linked to a taxonomy term (a section unit, and often multiple such units) than content indicated by an index entry (as little as one sentence). So, it would not be right to use all or most of the main entries of a back-of-the-book index to create a taxonomy for the same content.



Wednesday, July 1, 2015

Taxonomies for Indexing Images

It’s becoming more common to index images with taxonomy terms, instead of just text documents or instead of just keyword-tagging of images. A taxonomy for the subject-indexing of images need not be significantly different than a taxonomy for indexing textual documents, but other metadata differs, and the indexing activity is also quite different.

A dedicated taxonomy for images might be needed for various reasons:
1.    There is no subject-indexing of text documents by an organization.
2.    Different software systems are used by the same organization to manage images and to manage text documents.
3.    Text documents of the same organization are large and thus indexed or cataloged at a broader level.

1.    No text indexing
Some organizations have a large image collection, and that is what they focus their indexing efforts on. They thus design or adapt a taxonomy specific to their image collection. They likely did not have any taxonomy for indexing text. They either don’t find the need for text document search and retrieval, or if they do, they will simply use the search engine instead, since, after all, search engines can search on text, unlike images.

2.    Different systems
Large image collections are increasingly managed in dedicated digital asset management systems, which are designed to support the various metadata associated with images and other nontext media files. Text documents, on the other hand, may be managed in document management systems, record management systems, or collaboration systems such as SharePoint. Each of these kinds of system supports some form of controlled vocabulary for tagging content. But if the images are in one system and the text documents are in another system, different controlled vocabularies are likely to be developed. Of course, a generic “content management system” may be used for both images and text documents, but many organizations don’t manage all their content in a single system.

3.    Different levels of indexing detail
The classic example of different levels of detail is for materials at the Library of Congress, which had developed Subject Headings for descriptive cataloging for library materials, which are generally monographs, such as books, or video-recordings of films, or sound recordings of music collections. While the subjects of these works might be quite specific, they are often not as specific as an individual graphic material. (An entire book may have numerous specific images.) But over the years, individual images also became part of its collection, and the LC Subject Headings were not specific enough, so the Library of Congress developed the Thesaurus for Graphic Materials, which is freely available. The fact that the Thesaurus for Graphic Materials exists does not mean that a dedicated thesaurus for images is always needed, but that it was needed in the context of the Library of Congress collections and the shortcomings of the Library of Congress Subject Headings for indexing images.

If you already have a detailed taxonomy for documents, it certainly can be used for images, as well. Some terms, such as for abstract concepts (such as “Beliefs”), will simply not be needed in the image indexing, whereas new terms might need to be added (such as the name of a specific type of flower).

There is definitely unique metadata for images, of which subjects for indexing are just a part. Examples of other possible image metadata include Creator/photographer, Location shown, Location of creation (camera location), Collection name, Time or part of day (especially if outdoors), Date taken (in contrast to date the image was digitized or edited), Number of people depicted, Copyright, Intended purpose, etc. The Thesaurus for Graphic Materials has had a separate “genre” facet that is very specific for types of graphical works (such as terms for Abstract paintings, Family trees, HVAC drawings, and Magazine covers). Image metadata standards include the IPTC (International Press Telecommunications Council)’s Photo Metadata for photojournalism. Different metadata may be needed for different kinds of images (news, commercial/advertising, art, etc.).

Indexing images is different from indexing text documents. First of all, it’s mostly manual because automation is very limited in image detection (but may be able to detect people’s faces). It’s more subjective as to what is of key importance in an image versus a document. An indexer may also tend to index for what is not actually depicted but for what is implied, which often, but not always, should be avoided.

I recently attended a conference presentation on this subject, “Get the Picture: Use Your Taxonomy to Classify Images” at the SLA conference in Boston earlier this month. The presenter, Ann Poole from Corbis, mentioned various challenges of image indexing, including over-indexing by photographer-submitters, indexing for emotions depicted or implied, and indexing for the backstory of an image in a known place.

Thursday, June 4, 2015

Taxonomist Trends



Last month I conducted an online survey of 150 taxonomists (described in my last blog post). Although the results will be used in another publication, it is interesting to note at this time a few comparisons between the results of this survey and those of a similar one I had conducted in late 2008 for my book, The Accidental Taxonomist. While I added further questions this time, some of the questions stayed the same for comparison.

We would expect over time that more taxonomists have been doing the work for longer. While this is the case for those in the field for 8-15 years, for those involved in the longest period, over 15 years, surprisingly, the survey results did not indicate this. Those who have done taxonomy work for 15 years or more were 26.2% in 2008 but only 17.6% now. The raw numbers, however, for over 15 years did, in fact, increase. So, the survey percentage indicates that there are proportionally more people who have been involved in taxonomies for an intermediate period of time. At the most beginner level, the numbers and percentage of respondents with less than a year of experience in taxonomies declined, from 9.2% to 3.4%. Those with 1-4 years of experience are about the same, and those with 4-15 years of experience increased from 32.4% to 41.2%. So, these numbers could indicate a maturing of the taxonomist profession, but not a graying of the field.

The taxonomist work situation has not changed much with respect to it being a primary vs. secondary job responsibility or with respect to freelance vs. full-time employment. There was a noticeable difference, though, among those who are freelancers (totaling 17% before and 16% now): more of them are now doing freelance taxonomy work only “occasionally” (8% now compared with 4.7% in 2008), and not as many are doing it “often” (8% compared with 12.5%). The fact that there is work for those who want to do freelance taxonomy work only occasionally, whether on top of another job or in combination with other kinds of freelance work, is encouraging for those individuals who want to gradually break into taxonomy work.

Regarding the professional and educational background, the leading degree and prior profession of taxonomists today remains that of librarian, and the percentage has, in fact, increased slightly. Meanwhile, those with a technical background have proportionally decreased. The percentage with an MLS/MLIS degree increased from 48.4% to 54.4% of respondents, and for the options of prior work experience, “librarian” increased from 27.7% to 28.3%. Those with an M.S. or M. Eng. degree decreased from 14.1% to 8.7%. Those with a background in Software/IT decreased from 12.3% to 8.3%, and those with a background in database design, development, or administration decreased from 6.2% to 1.5%. While the taxonomy field can certainly benefit from those with a technical background, it is not a necessary skill, and we might assume that the decline in IT people in taxonomy work since 2008 is due to an improvement in the economy, whereupon more of those people have found work in IT again.

In other areas, knowledge management, content management, and content strategy are backgrounds that have become more common, whereas “document management” has decreased. This is likely due to the fact that “content” of various formats is becoming more common than mere “documents.” Digital asset management was not even presented as an option, but three respondents wrote it in the blank under “Other.”

Despite the preponderance of MLS/MLIS graduates, still only a minority of respondents had training in taxonomies/classification in college courses, and only a few percentage points more than before, merely reflecting that there were more MLS/MLIS graduates. Those having taken continuing education courses or workshops on taxonomies increased from 13.8% to 20.1%, but there are more such courses that did not exist before (including mine). On-the-job training remains the primary means of learning how to create taxonomies. There has been a slight increase in on-the-job “formal” training over “informal” learning and experience, with the percentage with formal on-the-job training having increased from 21.5% to 28.9%. Since this particular survey question permitted multiple responses, the leading response of informal on-the-job learning was 71.1%, but this was the only response option with a decrease (down from 83.1%). This is a good sign that taxonomists seem to be learning the skill through more varied means than the dominant on-the-job experience.
 

Monday, May 11, 2015

Taxonomist Survey

I had created a survey of taxonomists to gather some information for writing my book, The Accidental Taxonomist. It was mainly for Chapter 2: Who Are Taxonomists?  With the word “taxonomist” in the title, I had to write something about taxonomists, and not just about taxonomies, and this was the best way I could get more information than some anecdotes from colleagues.

But that was in late 2008, 6½ years ago. Has there been change in the industry since? In most fields, 6-7 years is not long at all, but in the field of taxonomies, there could be changes. First of all, there have been significant changes in the economy over that particular period (recession and partial recovery), and, at least for internal, enterprise taxonomies, the role of the taxonomist could be considered something expendable in tight economic times. (I know, as I was laid off in 2008 and again in 2010.) More significantly, the field of information science is evolving very rapidly. So, I released a new survey this month.

My previous survey had 9 multiple-choice questions and one open response. I chose to keep those questions with no changes or only minor wording changes, in order to compare the changes over time. I also decided to add a few more questions. To help me come up with the questions, I asked for input from the audience of a presentation I gave last month (“Taxonomy Displays: Bridging UX & Taxonomy Design” at the Content Strategy Seattle Meetup). Suggestions from that group included questions on the size of taxonomies, job titles, and taxonomy work pain points. The current survey now has 14 multiple-choice questions, one very short answer (job title), and three open responses, although all questions are optional, and it is permitted to skip questions.

Where to find taxonomists to survey


In 2008, I could think of only one logical channel to find taxonomists, the Yahoo group called Taxonomy Community of Practice. But it is no longer the only group and no longer the most active. The Taxonomy Community of Practice Yahoo group averaged only 5 messages per month in the last 6 months. In contrast, in the 6 months around the time of my last survey, this group averaged 39 messages per month. This is most likely because the LinkedIn group of the same name, Taxonomy Community of Practice, which was created in September 2007, has taken over most of the taxonomy discussions. Furthermore, there are additional LinkedIn groups, such as “Controlled Vocabularies” and “Thesaurus Professionals.” The American Society for Indexing started a Taxonomies & Controlled Vocabularies Special Interest Group in late 2007, and SLA (Special Libraries Association) started a Taxonomy Division in 2009, both of which have member discussion lists.

I have announced the current survey in all of these groups and more. However, I do not expect to reach significantly more taxonomists than before. That’s because, whereas the single Yahoo group back in 2008 tended to be subscribed to by email (individual or digest), the proliferation of groups and lists of similar or overlapping subjects has led subscribers/members to opt out of direct emails. Additionally, email software, such as Gmail, can filter messages from lists to a category/tab that users may choose to overlook. So, my email announcements of the survey to groups may go unnoticed by many group members. It would be tempting to individually contact everyone I know personally who is involved in taxonomy work, but that could introduce a personal bias that would skew the pool of respondents.

Taxonomist tendencies


There have already been enough respondents to the current survey that I can safely say that the largest number do taxonomy work as their primary responsibility, as with the previous survey, and that, like before, the majority are employees, rather than contractors, freelancers, or independent consultants. The most common educational or professional background (although not the majority) is library/information science. What is striking, though, is that despite the fact that 48% of respondents in 2008 had an MLS/MLIS degree (and from the early survey returns, the percentage is even slightly higher), only a small percentage of taxonomists learned taxonomy skills through formal educational institution coursework. Self-teaching through reading, on-the-job experience, on-the-job training, and conference workshops or seminars are each methods of learning taxonomies that are more prevalent than college courses. Additional, more specific comparisons will be the subject of a future blog post.