iDigBio 2014: Data Modeling Workshop

On March 24, 2014, the first day of the 2014 iDigBio conference, I attended a Data Modeling workshop where I learned about developing aggregators (a fancy word for a website or program that collects related items of content and displays them or links to them).

Aggregators store and curate information, pulling primary data and images from the data source while making content easier for users to find. The ideal aggregator anticipates the types of questions users will ask when visiting the site.

The key role of aggregators is to provide sustained service over long periods of time (sustainability) and to allow the data to be transferred easily to a new aggregator if needed. For example, if the website goes away, is there a way to pull the main source of information and metadata so that people can rebuild the aggregator at a later date using just that archived information? Finally, the quality of the data is improved by its movability.


In this workshop, we also discussed the process of sustainability. In particular, we asked: what is the archival possibility for an aggregator and its primary data? For example, aggregators need to be ongoing, because re-creating an aggregator costs work and money that the source could otherwise use to improve the source data.

While we have become reliant on aggregators to find information quickly, they raise some issues. For example, data curators may quickly become overwhelmed by the volume of results returned; data may be pushed back into the system from the archived metadata source, creating a knowledge store that has to be mineable; and the knowledge store may be left in a cloud where the manager has access but can still run algorithms.

I learned that when creating an aggregator, it is necessary to ask the client the following questions:

  1. Do you have the necessary resources to support the data management staff?
  2. Is there a model where we could stabilize and standardize the data so it can go back into a repository and be shared?
  3. Many locations do not have a discoverable website for their collections, so would this be beneficial for them?

Also, one must consider what other communities/users are looking for with regard to this scientific data (not just taxonomic scholars/scientists):

  1. Most of the taxonomic community wants images/objects;
  2. Users care about the record being reliable and well curated;
  3. You must determine who the actual users and audience are;
  4. To make the aggregator effective, you must first determine what each audience is looking for.


What are some of the issues, from the provider’s point of view, in dealing with data that comes back (for example, crowdsourced label transcription):

  • Deciphering handwritten labels;
  • Allowing items to be added to a gazetteer, which can assess credibility or attach information, such as a link or reference stating where the information was found;
  • Noisy data, duplicates and errors; and
  • A deluge of information.

Solutions to crowdsourcing issues:

  • Create pre-populated items that can be found on a map, taken either from US GIS data or Google Maps;
  • Partition noisy data out of a curated clean bucket, which allows users to canvass the clean bucket first; and
  • Expert feedback from scholars.
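The clean-bucket idea above can be sketched in a few lines. This is a hypothetical illustration only; the field names (`transcription`, `gazetteer_match`) and the cleanliness rule are assumptions, not part of any real iDigBio schema.

```python
# Partition crowdsourced transcription records into a curated "clean"
# bucket and a "noisy" bucket, so users can search the clean bucket first.

def is_clean(record):
    """A record counts as 'clean' (illustrative rule) if it has a
    non-empty transcription and its locality matched a gazetteer."""
    return bool(record.get("transcription")) and record.get("gazetteer_match", False)

def partition(records):
    clean, noisy = [], []
    for rec in records:
        (clean if is_clean(rec) else noisy).append(rec)
    return clean, noisy

records = [
    {"id": 1, "transcription": "Alachua Co., FL", "gazetteer_match": True},
    {"id": 2, "transcription": "", "gazetteer_match": False},
    {"id": 3, "transcription": "unknown locality", "gazetteer_match": False},
]

clean, noisy = partition(records)
print([r["id"] for r in clean])   # clean bucket, searched first -> [1]
print([r["id"] for r in noisy])   # held back for expert review  -> [2, 3]
```

In practice the noisy bucket would then feed the expert-feedback loop mentioned above.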

Another important topic was the idea of legacy tracking of taxonomic classification. This is a process that archives the previously used terms for classifying a species (animal or plant) alongside any new classification. Some things to consider when creating or maintaining legacy-tracking data would be:

  • Is the structure of the historical records organized chronologically?
  • Synonyms and misspellings should also be tracked for search-ability.
  • How is it represented on the site and what items are available for searching?
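One way to picture the synonym and misspelling tracking described above is a reverse index from every legacy term to the current accepted name. This is a minimal sketch under assumed data; the species names and the dictionary layout are illustrative, not a real taxonomic database schema.

```python
# Map each current accepted name to its historical synonyms and common
# misspellings, so a search on any old term resolves to the current name.
legacy = {
    "Sciurus carolinensis": {                     # accepted name (illustrative)
        "synonyms": ["Sciurus leucotis"],         # historical synonym
        "misspellings": ["Scuirus carolinensis"], # common misspelling
    },
}

# Build a reverse index: any legacy term -> accepted name.
index = {}
for accepted, terms in legacy.items():
    index[accepted.lower()] = accepted
    for term in terms["synonyms"] + terms["misspellings"]:
        index[term.lower()] = accepted

def resolve(query):
    """Return the accepted name for a query, or None if unknown."""
    return index.get(query.lower())

print(resolve("Sciurus leucotis"))      # -> Sciurus carolinensis
print(resolve("Scuirus carolinensis"))  # -> Sciurus carolinensis
```

Tracking misspellings in the same index is what makes the search-ability point above work: the user never needs to know which spelling the site stores.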

Finally, since taxonomic data in a digital format can quickly become overwhelming, it is important to attach a globally unique identifier to each specimen so that it can easily be traced back to its source institution or scholar. To make this process easier, some programs will assign the identifiers for you.
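As a rough sketch of what such automatic identifier assignment could look like, here is one approach using Python's standard `uuid` module. The record fields are made up for illustration; real aggregators may use other GUID schemes entirely (such as resolvable URIs).

```python
import uuid

def assign_guid(record):
    """Attach a random (version 4) UUID to a specimen record if it
    does not already have one; an existing GUID is never overwritten."""
    record.setdefault("guid", str(uuid.uuid4()))
    return record

specimen = {"institution": "FLMNH", "catalog_number": "12345"}  # hypothetical
assign_guid(specimen)
print(specimen["guid"])  # a random UUID, e.g. "3f2b8c1e-..."
```

Because the GUID is minted once and then preserved, the specimen can be cited, re-aggregated, and traced back to its source even if the original website disappears.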