In planning a taxonomy, I have often said that it is important at the beginning to define the taxonomy’s scope, specifically the subject area scope of the taxonomy’s terms, but without going into more detail. Recently I was asked by a client how to define a taxonomy’s scope. This is a good question. The taxonomy should be suited to the subject area scope of the content that will be tagged with the taxonomy and to the scope of the user’s expectations. Terms or topics only marginal to the subject scope, however, could occur in the content, and whether they should also be included in the taxonomy is a question. Ultimately, that should depend on whether user expectations justify it, as the needs of users should also be a factor in creating a taxonomy. A taxonomy should suit both its content and its users.
Sources for Taxonomy Terms
For content as a source of taxonomy terms, a combination of manual and automated approaches is recommended. By manually reviewing sample individual documents or content items, you can discern the main ideas and main topics, which should form the start and basic structure of the taxonomy and also help define its scope. Automated methods of extracting terms, through text analytics technologies, can bring in many additional terms from a much larger corpus of documents more quickly, picking up terms that a limited manual review would miss. Even though automated text analytics extracts terms based on relevancy and frequency of occurrence, such terms could be out of scope of the subject domain. That’s why it’s important to start first with a manual review of content to define the subject scope. Then, when you enrich the taxonomy with automated extraction, you can approve terms that appear to be in scope or at least closely relevant and reject others. But should you reject all that are out of scope, even if they appear with sufficient frequency and relevancy? My advice is to try to assume the role of the user. Ask yourself: Might a user want to search for content on this term in this content collection?
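As a rough illustration of the automated step, the sketch below counts how many documents each word appears in and lists the frequent ones that are not yet in the taxonomy, so that a person can approve or reject them. It is a minimal stand-in for a text analytics tool, not a description of any particular product: the function name `candidate_terms`, the tiny stopword list, the `min_docs` threshold, and the sample recipe texts are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "with"}

def candidate_terms(documents, existing_terms, min_docs=2):
    """Return (term, document_count) pairs not yet in the taxonomy, most frequent first."""
    counts = Counter()
    for text in documents:
        words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
        counts.update(set(words))  # count each word once per document
    existing = {t.lower() for t in existing_terms}
    candidates = [
        (term, n) for term, n in counts.items()
        if n >= min_docs and term not in existing
    ]
    # Sort by descending document count, then alphabetically for a stable order.
    return sorted(candidates, key=lambda item: (-item[1], item[0]))

# Invented sample content for illustration only.
docs = [
    "Roast the chicken with garlic and thyme.",
    "Grill the chicken and serve with garlic butter.",
    "Nutrition facts for roast chicken.",
]
review_queue = candidate_terms(docs, existing_terms=["Chicken", "Roast"])
print(review_queue)
```

The output is only a review queue. Whether each candidate is in scope remains a human decision, made against the scope established in the manual review.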
For user needs and expectations as a contributing source of taxonomy terms, obtaining this information can be very direct, such as by creating a user questionnaire (at least for your internal users) that asks what the topics of importance are, how those users would define the scope, and what “marginal” topics would be acceptable for them to include. You could also request samples of challenging queries that the users would make, rather than the expected, basic, or typical ones. Another good way to obtain input from the user side is to look at search query logs, which list the search strings that users have entered over a period of time, ranked by frequency. If a search phrase that is slightly out of scope of the subject occurs frequently, then the term should still be considered for inclusion in the taxonomy.
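The search log review can be sketched in a few lines, assuming the log is available as a plain list of query strings. The function name `frequent_unmatched_queries`, the `min_count` threshold, and the sample queries are hypothetical. Real logs usually need more cleanup than the simple case and whitespace normalization shown here.

```python
from collections import Counter

def frequent_unmatched_queries(query_log, taxonomy_terms, min_count=2):
    """Rank queries by frequency and keep those that match no existing taxonomy term."""
    # Normalize case and collapse repeated whitespace before counting.
    counts = Counter(" ".join(q.lower().split()) for q in query_log)
    known = {t.lower() for t in taxonomy_terms}
    return [
        (query, n) for query, n in counts.most_common()
        if n >= min_count and query not in known
    ]

# Invented sample log for illustration only.
log = [
    "nutrition facts", "Chicken soup", "Nutrition  Facts",
    "gluten free", "nutrition facts", "gluten free", "chicken soup",
]
to_consider = frequent_unmatched_queries(log, taxonomy_terms=["Chicken soup"])
print(to_consider)
```

Queries that surface this way are candidates to consider, not automatic additions. Exact matching against term labels is a simplification, since a query may correspond to an existing term under a different wording.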
In either case, the scope of the subject gets better defined as the taxonomy is created. For example, a taxonomy for recipes may initially be scoped to comprise terms for the names of dishes, ingredients, and cooking methods. But then a different term shows up with significant frequency: “Nutrition Facts.” If it occurs in both the content and the user research, then it likely should be included. If it shows up in the content only, but is not validated in user research, then it is more questionable.
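The reasoning about evidence from the two sources can be written down as a small decision rule. This is only a sketch: the function name and the labels it returns are invented here, and the outcome for a term with no evidence from either source is an assumption, since the discussion above covers only terms that show up somewhere.

```python
def inclusion_recommendation(in_content: bool, in_user_research: bool) -> str:
    """Suggest how to treat a candidate term, based on where evidence for it was found."""
    if in_content and in_user_research:
        return "likely include"        # supported by both content and users
    if in_content:
        return "questionable"          # in the content, but not validated by users
    if in_user_research:
        return "consider"              # users ask for it, e.g. frequent search queries
    return "no evidence"               # assumption: nothing supports adding the term

# "Nutrition Facts" appears in the recipe content and in the user research.
print(inclusion_recommendation(in_content=True, in_user_research=True))
```

A rule like this does not replace judgment. It mainly makes the basis for each recommendation explicit.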
Taxonomy Structure
The initial taxonomy structure itself tends to impose limits on scope. Taxonomies tend to be hierarchical with a limited number of top terms. If a candidate term appears in the content that does not seem to belong anywhere in the current taxonomic hierarchy, you might be inclined to exclude it. Factors of user needs (they might want to look up this term in this content), however, should take precedence. For example, the term “COVID-19” might be marginal but still of enough interest to be included in many taxonomies on varied subjects, yet there would exist no broader term for diseases in those taxonomies. In that case, adjustments need to be made, such as renaming or adding broader terms, or perhaps, more likely, the proposed term should be modified to fit the context of the taxonomy, such as becoming “COVID-19 impacts.”
Another thing to consider is adopting more of a thesaurus structure than a taxonomy structure, at least for the facet or concept scheme of the taxonomy that is for miscellaneous “topics.” One characteristic of thesauri is that they do not rely so heavily on extensive hierarchical trees. What this means is that you could decide that it is acceptable for some terms to have no broader terms, and thus it’s OK to have a very large number of top terms, with the more specific terms linked to other terms by related-term relationships, another feature of thesauri, where they are not linked by broader/narrower-term relationships. Abandoning the full hierarchical tree structure should only be considered if this hierarchy is not displayed as navigation to the end users.
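A thesaurus-like structure of this kind can be modeled with terms that carry optional broader-term and related-term links. The sketch below is one possible representation, not a standard data model: the `Concept` class, the helper functions, and the sample terms are assumptions for illustration. It lists the top terms (those with no broader term) and flags terms that have no links of any kind, which may deserve a second look.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    label: str
    broader: set[str] = field(default_factory=set)   # labels of broader terms
    related: set[str] = field(default_factory=set)   # labels of related terms

def top_terms(concepts):
    """Terms with no broader term; a thesaurus-style scheme may have many of these."""
    return sorted(c.label for c in concepts.values() if not c.broader)

def unlinked_terms(concepts):
    """Terms with no broader, narrower, or related links at all."""
    linked = set()
    for c in concepts.values():
        if c.broader or c.related:
            linked.add(c.label)
        # A term pointed to by another term counts as linked too.
        linked.update(c.broader)
        linked.update(c.related)
    return sorted(label for label in concepts if label not in linked)

# Invented sample scheme for illustration only.
concepts = {
    "Dishes": Concept("Dishes"),
    "Soups": Concept("Soups", broader={"Dishes"}),
    "Nutrition facts": Concept("Nutrition facts", related={"Dishes"}),
    "COVID-19 impacts": Concept("COVID-19 impacts"),
}
print(top_terms(concepts))
print(unlinked_terms(concepts))
```

In this sample, "Nutrition facts" is a top term that is still connected through a related-term link, while "COVID-19 impacts" has no links yet and would be a candidate for a related-term relationship.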
Documenting Policy
In any case, you need to define policies regarding what kinds of terms can be added and what kinds should not. These policies will evolve out of the activity of building the taxonomy, especially from evaluating which extracted terms and which search log terms to approve. Whoever is doing this task (hopefully more than one person) should document each instance of uncertainty. While many term approvals and rejections will be obvious, there will be a gray area. These gray-area cases should be collected and discussed together, and then a policy can emerge.