Collaborative biodiversity data and Flickr

How to ID this stuff?

[ Jump to: Existing biodiversity platforms, Flickr, Plant photos, iNaturalist ]

So I’ve had a ‘big data dream’ in the back of my mind for a while now: a vast, integrated, dynamic, internet collaboration of creators of plant biodiversity images and data (plots, dets, collections, taxonomy). Imagine, for instance:

...an attractive, well-designed site that draws upon plant images from a wide range of different sources and permits the user to easily scroll through thumbnails organized by taxonomy, to find a match for his/her own images. Finding a suitable match, he/she can then easily make a (machine-generated, RDF) public statement that, “My plant, KalBar-SungaiDalam-PlotB-34, which I have given a morphotype taxon label of KalBar-SungaiDalam-POLYBLA4 is a good match (using images) to Jones’s Brunei-HutanKota-12 which he has determined to be Maasia sumatrana (Miq.) Mols, Kessler & Rogstad (Plantlist ID: kew-2607796).”

Nice idea hey! However, at the same time I’ve become increasingly realistic about how busy people are, and what we can and cannot expect collaborators to learn and do, and it is clear to me now that the only way to realize this dream is for someone other than the data creators to do the bulk of work needed to integrate these data. I’ve also noted the many barriers to collecting data into a single new biodiversity portal: most people will generally want to keep control of their data, publishing it in the way that best fits with their own needs and capacity. So the tasks of the ‘integrator’ are:

Make a list of useful biodiversity data resources (e.g., for images: personal websites, Flickr, herbaria collections websites, etc),
Contact the creators and i) explain the goals of this integration project, the benefits to the creator, and the methods to be used to harvest the data (page scraping, API calls, etc), ii) solicit the permission of the creator to use their data, and iii) to determine the (CC) license that the creator wishes to use,
Write a set of instructions and documentation for each collaborator,
Build a unique set of automated scripts for each resource,
Run the scripts and store a cache of the data in a database,
Develop the APIs to access the database,
Build a cool website on top, allowing users to make statements about plant-plant and plant-taxon matches, and finally
Re-publish old and new data as RDF.

Phew! Lots to do! But... I really think this ‘bespoke’ approach will in the end be most productive, because we are not talking about hundreds of key collaborators/resources, but only a few tens. This integration is not like launching a new social media platform that has to be scalable and generic.

Existing platforms

In the process of moving towards this and discussing the idea with people, I’ve met many researchers who would like to be part of this effort, but don’t have the tech skills to build their own online database. In the model above, there is no central repository for data and images, so where do I turn people to? There are a number of existing user-focused biodiversity platforms (e.g., iSpot, iNaturalist, Project Noah), and I just discovered that iNaturalist has an API to get the data back out; I need to look into this more as an option to recommend in future. However, so far I’ve found myself most often recommending Flickr as a great place to store the data about biodiversity observations, when those observations are represented by sets of photographs of individual plants, as they often are. The primary reasons for using Flickr are:

Of all the main social networking sites it is most optimized for handling lots of large images, including image editing,
It is oriented in the same way that we deal with physical collections, with a set of tools for organizing and curating objects, and,
It has an excellent API.

More about Flickr, below. There also exist a lot of identified images of plants on Facebook, particularly in groups (e.g., the amazing Plants Community for Kalimantan Barat), and it may be possible to extract images and determinations via the Facebook API (e.g., for each image posted to a group, seek the last taxonomic name in comments on that photo). However, the information in Facebook is less useful than Flickr for building a networked plant data resource, because Facebook is fundamentally a social interaction platform, as opposed to Flickr’s object orientation: it is harder on Facebook to associate metadata (tags, descriptions, locations) with each ‘object.’

Flickr

The awesome Flickr platform

I’ve been developing some ideas about using Flickr as a biodiversity image platform, particularly during i) discussions about the adoption of Flickr as a repository for staff-generated images of plants in the Arnold Arboretum, and ii) with Louise Neo, who has been taking amazing images of the plants of Singapore for a while now, storing them on Flickr (she also pointed me towards the amazing photos of Yeoh Yi Shuen and Ng Xin Yi). I’ll work towards writing a comprehensive set of guidelines for users (see a start I made here), but for the moment here are some pointers:

Flickr permits all levels of access permission, and so the data ‘integrator’ may need to be accepted as a friend, and sharing permissions set accordingly, to access full-size images.
The user should set the default license for all uploaded images in the settings (CC BY-SA is generally recommended for biodiversity data, the ‘NC’ can be added if that is a concern).
For organizing the images, each album should hold images from a single plant in the field. Thus, the album (e.g., this one: https://www.flickr.com/photos/loupok/sets/72157642193981464/) is a virtual representation of a plant, complete with a permanent URI: it represents a digital collection object, from which (multiple) determinations can be made. These determinations can be offered by another users in the comments section.
Albums can be assembled into collections, e.g., by taxonomic family, or by plot or place.
Metadata can be deduced automatically, or added as text in either the album’s title or description fields.
- If any photo is georeferenced in an album, that long/lat can be taken to be the location of the plant.
- The date of the ‘digital collection’ can be taken to be the time-stamp in the photos’ EXIF data. Note that it is important the users set their cameras correctly!
- As a default, we can assume that the images were taken by the Flickr user, and that the determination was also made by the same user, unless otherwise indicated in the album’s description field.
- The album can be titled in any way that the user desires, but it is common to name the album to be the same as the user’s taxonomic determination (see above example). Different albums can have the same name, so several plants can be identified to the same taxon. For a more useful and standardized method of giving determinations, a text string of the form tpl:kew-2607792 can be added in the album’s description field; this gives an unambiguous, machine linkable determination to a name from The Plant List.
- I’m a big proponent of the need to explicitly give a confidence measure when we give a taxonomic match either to another individual plant, or to an external name. For the former I use high | mid | low, and for the latter, s+ | s- | g+ | g- | f+ | f-: high confidence to species, low confidence to species, high confidence to genus, etc...; note that a specific epithet can still be given with a g+, to indicate a very low quality det to a particular specific epithet. A text string can be included in the album’s description field, e.g., confidence="s-".
- The individual code of the plant, if it exists (e.g., tree tag number, with plot code).
- Other notes that might be included on a plant’s collection label can also be added to the album’s description field:
  - Locality name
  - Ecosystem type, habitat and microhabitat
  - Plant habit
  - Tree diameter and plant height
  - Resources used to make determination
  - Local names and uses
  - Smells
Ideally, where data are added in the album’s description field, they are machine parseable, and use an externally referenced ‘key’ (as in key-value pair), where the subject of the key-value pair is assumed to be the individual plant of which this digital collection was made. I.e., using Darwin Core fields where possible. An example might be dwc:island="Borneo" eco:habitat="Freshwater swamp" dsw:toTaxon="tpl:kew-2607792". Note that Flickr tags can also be used, in place of or in addition to the album’s description.

The Flickr API is awesome. See this small project for an example of using the API. The generic steps are:

Get an OAuth access token for your script (‘app’)
Get a list of albums for a user (e.g. method=flickr.photosets.getList&user_id=102148157%40N08; try it!)
Get a list of photos in the album (e.g., method=flickr.photosets.getPhotos&photoset_id=72157642193981464; try it!)
Get the metadata for each image (e.g., method=flickr.photos.getInfo&photo_id=13078067683; try it!)
Get the direct URL of the image (e.g., method=flickr.photos.getSizes&photo_id=13078067683; try it!).

With these calls you can easily and automatically build a database of a user’s images, with metadata. As discussed above, the particular solution for each committed contributor will be unique, allowing for a compromise between their choice of customization (e.g., of metadata codes) and of the ‘integrator’s’ needs, so each contributor script will probably look slightly different. Perhaps a good hybrid solution for many users would be to use Flickr to store the images and then post a simple (flat Darwin Core) CSV file of the associated metadata on a web-page that they can write to.

Plant photos

A selection of images for a single plant

As for the photos for each individual plant, the more the better! While photos are (usually) inferior to a physical collection for the purpose of making a good determination, if they are high resolution, and of the appropriate plant parts, accurate determinations can often be made. The ‘appropriate plant parts’ for diagnosis will vary from taxon to taxon, and the general botanist will usually not know if he/she is photographing the key morphological and even anatomical characters, but, if time permits, a thorough, destructive, photographic recording of the plant will increase the chances that key parts are included. As a minimum, I recommend:

whole plant
buttresses, trunk, slash
whole twig, twig bark, twig cross section
whole compound leaf, leaf upper surface, leaf lower surface, leafbase lower side
leaf axil, terminal bud
inflorescence, flower or fruit basal view, flower or fruit side view, flower or fruit apical view, flower or fruit cross section, flower or fruit longitudinal section, seed

This full set is 21 images for each album. The images should be cropped to highlight the intended part of the plant. Here’s an example from our Xmalesia project database. See Baskauf & Kirchoff 2008 for a more comprehensive discussion of standard views of plants.

Edit: 2015-02-21

iNaturalist

I’ve just been looking into the latest from iNaturalist. Wow! If contributors can be enticed to use it, iNaturalist and its API provide a fantastic platform for collaborative biodiversity work. Definitely the best ‘out of the box’ platform I’ve seen. The image tools are less developed than Flickr, and people who already use and love Flickr are unlikely to want to put their pics in two places, but for people who primarily want to contribute biodiversity data, iNaturalist is definitely the better choice:

The metadata are pre-structured as Darwin Core fields,
The taxonomy engine is well-designed and based on the Catalogue of Life and uBio,
The community can easily add new determinations (this process partially determines if the record is ‘research grade’ or not). You can easily ‘subscribe’ to a taxon, so that you are notified if new observations are posted to this taxon (and its descendants).

The API allows very easy, unauthenticated access to user data, returned as XML (Darwin Core Simple Records) or JSON. For example, the query (try it):

  http://www.inaturalist.org/observations.dwc?&taxon_id=47126
   &swlat=-11&swlng=93&nelat=8&nelng=141

returns all plant observations for Indonesia (NB: cool bounding box tool!), including image URLs, XML. Simple as that! The same API seems to be used for the website’s search (try same query) showing a pretty map:

  http://www.inaturalist.org/observations?&taxon_id=47126
   &swlat=-11&swlng=93&nelat=8&nelng=141&view=map

I.e., iNaturalist already has most of the cool features I’ve wanted (and sometimes tried) to incorporate into my own biodiversity platforms. So maybe... I might even use it for my own data ;-)

[ home | blog ]