Collaborative biodiversity data and Flickr

How to ID this stuff?

How to ID this stuff?

[ Jump to: Existing biodiversity platforms, Flickr, Plant photos, iNaturalist ]

So I’ve had a ‘big data dream’ in the back of my mind for a while now: a vast, integrated, dynamic, internet collaboration of creators of plant biodiversity images and data (plots, dets, collections, taxonomy). Imagine, for instance:

...an attractive, well-designed site that draws upon plant images from a wide range of different sources and permits the user to easily scroll through thumbnails organized by taxonomy, to find a match for his/her own images. Finding a suitable match, he/she can then easily make a (machine-generated, RDF) public statement that, “My plant, KalBar-SungaiDalam-PlotB-34, which I have given a morphotype taxon label of KalBar-SungaiDalam-POLYBLA4 is a good match (using images) to Jones’s Brunei-HutanKota-12 which he has determined to be Maasia sumatrana (Miq.) Mols, Kessler & Rogstad (Plantlist ID: kew-2607796).”

Nice idea hey! However, at the same time I’ve become increasingly realistic about how busy people are, and what we can and cannot expect collaborators to learn and do, and it is clear to me now that the only way to realize this dream is for someone other than the data creators to do the bulk of work needed to integrate these data. I’ve also noted the many barriers to collecting data into a single new biodiversity portal: most people will generally want to keep control of their data, publishing it in the way that best fits with their own needs and capacity. So the tasks of the ‘integrator’ are:

  1. Make a list of useful biodiversity data resources (e.g., for images: personal websites, Flickr, herbaria collections websites, etc),
  2. Contact the creators and i) explain the goals of this integration project, the benefits to the creator, and the methods to be used to harvest the data (page scraping, API calls, etc), ii) solicit the permission of the creator to use their data, and iii) to determine the (CC) license that the creator wishes to use,
  3. Write a set of instructions and documentation for each collaborator,
  4. Build a unique set of automated scripts for each resource,
  5. Run the scripts and store a cache of the data in a database,
  6. Develop the APIs to access the database,
  7. Build a cool website on top, allowing users to make statements about plant-plant and plant-taxon matches, and finally
  8. Re-publish old and new data as RDF.

Phew! Lots to do! But... I really think this ‘bespoke’ approach will in the end be most productive, because we are not talking about hundreds of key collaborators/resources, but only a few tens. This integration is not like launching a new social media platform that has to be scalable and generic.

Existing platforms

In the process of moving towards this and discussing the idea with people, I’ve met many researchers who would like to be part of this effort, but don’t have the tech skills to build their own online database. In the model above, there is no central repository for data and images, so where do I turn people to? There are a number of existing user-focused biodiversity platforms (e.g., iSpot, iNaturalist, Project Noah), and I just discovered that iNaturalist has an API to get the data back out; I need to look into this more as an option to recommend in future. However, so far I’ve found myself most often recommending Flickr as a great place to store the data about biodiversity observations, when those observations are represented by sets of photographs of individual plants, as they often are. The primary reasons for using Flickr are:

  1. Of all the main social networking sites it is most optimized for handling lots of large images, including image editing,
  2. It is oriented in the same way that we deal with physical collections, with a set of tools for organizing and curating objects, and,
  3. It has an excellent API.

More about Flickr, below. There also exist a lot of identified images of plants on Facebook, particularly in groups (e.g., the amazing Plants Community for Kalimantan Barat), and it may be possible to extract images and determinations via the Facebook API (e.g., for each image posted to a group, seek the last taxonomic name in comments on that photo). However, the information in Facebook is less useful than Flickr for building a networked plant data resource, because Facebook is fundamentally a social interaction platform, as opposed to Flickr’s object orientation: it is harder on Facebook to associate metadata (tags, descriptions, locations) with each ‘object.’

Flickr

The awesome Flickr platform

The awesome Flickr platform

I’ve been developing some ideas about using Flickr as a biodiversity image platform, particularly during i) discussions about the adoption of Flickr as a repository for staff-generated images of plants in the Arnold Arboretum, and ii) with Louise Neo, who has been taking amazing images of the plants of Singapore for a while now, storing them on Flickr (she also pointed me towards the amazing photos of Yeoh Yi Shuen and Ng Xin Yi). I’ll work towards writing a comprehensive set of guidelines for users (see a start I made here), but for the moment here are some pointers:

The Flickr API is awesome. See this small project for an example of using the API. The generic steps are:

  1. Get an OAuth access token for your script (‘app’)
  2. Get a list of albums for a user (e.g. method=flickr.photosets.getList&user_id=102148157%40N08; try it!)
  3. Get a list of photos in the album (e.g., method=flickr.photosets.getPhotos&photoset_id=72157642193981464; try it!)
  4. Get the metadata for each image (e.g., method=flickr.photos.getInfo&photo_id=13078067683; try it!)
  5. Get the direct URL of the image (e.g., method=flickr.photos.getSizes&photo_id=13078067683; try it!).

With these calls you can easily and automatically build a database of a user’s images, with metadata. As discussed above, the particular solution for each committed contributor will be unique, allowing for a compromise between their choice of customization (e.g., of metadata codes) and of the ‘integrator’s’ needs, so each contributor script will probably look slightly different. Perhaps a good hybrid solution for many users would be to use Flickr to store the images and then post a simple (flat Darwin Core) CSV file of the associated metadata on a web-page that they can write to.

Plant photos

A selection of images for a single plant

A selection of images for a single plant

As for the photos for each individual plant, the more the better! While photos are (usually) inferior to a physical collection for the purpose of making a good determination, if they are high resolution, and of the appropriate plant parts, accurate determinations can often be made. The ‘appropriate plant parts’ for diagnosis will vary from taxon to taxon, and the general botanist will usually not know if he/she is photographing the key morphological and even anatomical characters, but, if time permits, a thorough, destructive, photographic recording of the plant will increase the chances that key parts are included. As a minimum, I recommend:

This full set is 21 images for each album. The images should be cropped to highlight the intended part of the plant. Here’s an example from our Xmalesia project database. See Baskauf & Kirchoff 2008 for a more comprehensive discussion of standard views of plants.


Edit: 2015-02-21

iNaturalist

I’ve just been looking into the latest from iNaturalist. Wow! If contributors can be enticed to use it, iNaturalist and its API provide a fantastic platform for collaborative biodiversity work. Definitely the best ‘out of the box’ platform I’ve seen. The image tools are less developed than Flickr, and people who already use and love Flickr are unlikely to want to put their pics in two places, but for people who primarily want to contribute biodiversity data, iNaturalist is definitely the better choice:

The API allows very easy, unauthenticated access to user data, returned as XML (Darwin Core Simple Records) or JSON. For example, the query (try it):

  http://www.inaturalist.org/observations.dwc?&taxon_id=47126
   &swlat=-11&swlng=93&nelat=8&nelng=141

returns all plant observations for Indonesia (NB: cool bounding box tool!), including image URLs, XML. Simple as that! The same API seems to be used for the website’s search (try same query) showing a pretty map:

  http://www.inaturalist.org/observations?&taxon_id=47126
   &swlat=-11&swlng=93&nelat=8&nelng=141&view=map

I.e., iNaturalist already has most of the cool features I’ve wanted (and sometimes tried) to incorporate into my own biodiversity platforms. So maybe... I might even use it for my own data ;-)