OpenTrials beta launch – Features & technical background

Hello fans of evidence-based medicine and open data! Following the recent launch of the OpenTrials public beta, Open Knowledge International’s Head of Technical Product Paul Walsh shares more information on the features that are available in the current beta release, and what is in the pipeline on the road to a proper v1 release. 

Features

OpenTrials is designed as a modular platform, employing a set of loosely coupled components that handle distinct tasks, from collecting data from external sources through to exposing it for consumption via an API (application programming interface) or a UI (user interface).

In the announcement post for the platform, we presented two diagrams that describe the architecture of the platform, and the data model that OpenTrials employs. The design presented then has not significantly changed: you can see those diagrams below for convenience.

What those diagrams do not reveal is the actual feature set exposed to users, so we’ll highlight the major features here.

API

The API is currently at v1. The base endpoint for the API is here, and interactive documentation for the API can be found here.

The API exposes several RESTful endpoints that allow querying the primary entities of the OpenTrials data model. This is particularly useful for creating applications that paginate through related data, or where the application has identifiers for particular objects and wants to query all information for those objects.

The API also exposes a search endpoint, which is backed by Elasticsearch. This allows for deep queries into the entire database, and does not necessarily require any knowledge of the data model, or the relations between entities, to yield useful results. We’ve found that, given the nature of the data itself, the search endpoint is the most useful endpoint for regular use.
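
As a quick illustration of the search endpoint, here is a minimal sketch in Python. The base URL matches the beta deployment, but the parameter names (`q`, `page`, `per_page`) are assumptions based on common REST conventions; check the interactive documentation for the exact ones.

```python
from urllib.parse import urlencode

# Base URL of the beta API (may change on the road to v1).
BASE_URL = "https://api.opentrials.net/v1"

def build_search_url(query, page=1, per_page=20):
    """Build a URL for the Elasticsearch-backed /search endpoint.

    A free-text query needs no knowledge of the data model or the
    relations between entities to yield useful results.
    Parameter names here are assumptions; see the API docs.
    """
    params = urlencode({"q": query, "page": page, "per_page": per_page})
    return "{}/search?{}".format(BASE_URL, params)

# Fetching results is then a single GET, e.g. with the requests library:
#   import requests
#   results = requests.get(build_search_url("tamoxifen")).json()
print(build_search_url("tamoxifen"))
```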

Our API implements the OpenAPI (formerly Swagger) specification. One of the many great benefits of implementing against OpenAPI is that client libraries can be auto-generated from the spec itself. So, fire up a REPL in Python, JavaScript, or any other language that has an OpenAPI implementation, and start playing with the data.
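
To see why a machine-readable spec makes auto-generated clients possible, here is a toy sketch: the inline spec fragment below is hypothetical, but generators such as swagger-codegen walk the real spec in essentially this way before emitting a typed method per operation.

```python
# A toy fragment of an OpenAPI/Swagger spec. The real spec is served by
# the API itself; this inline dict is purely for illustration.
SPEC = {
    "basePath": "/v1",
    "paths": {
        "/trials/{id}": {"get": {"operationId": "getTrial"}},
        "/search": {"get": {"operationId": "search"}},
    },
}

def list_operations(spec):
    """Enumerate (operationId, method, path) triples from a spec.

    A client generator does essentially this, then emits one
    callable per operation.
    """
    ops = []
    for path, methods in spec["paths"].items():
        for method, details in methods.items():
            ops.append((details["operationId"], method.upper(), path))
    return sorted(ops)

print(list_operations(SPEC))
```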

Explorer UI

The Explorer UI is the main portal into the OpenTrials data. It is based on a search-driven user experience, and once a user reaches a trial page, opens up further navigation into the related entities for that trial.

The explorer makes it easy for users to navigate back to the data sources used to populate the database, and it prominently features prompts for user contributions to help enhance the database, either by identifying errors or by contributing new data.

We are very happy with the usability of the Explorer UI, and we hope you will be too.

Query UI

The Query UI is a way to perform ad hoc SQL queries over the entire OpenTrials data warehouse, and then do interesting visualisations with the results of those queries.

We leverage the excellent Re:dash for this, and we’ve had great success in the lead up to the beta in using this to extract interesting insights out of the data.
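
To give a flavour of the kind of ad hoc query this enables, here is a sketch using an in-memory SQLite database with a hypothetical, heavily simplified schema; the real warehouse schema differs, and Re:dash runs the SQL for you:

```python
import sqlite3

# Hypothetical, heavily simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (id TEXT, source TEXT, has_results INTEGER)")
conn.executemany(
    "INSERT INTO trials VALUES (?, ?, ?)",
    [("t1", "nct", 1), ("t2", "nct", 0), ("t3", "isrctn", 1)],
)

# e.g. "how many trials per register, and how many report results?"
rows = conn.execute(
    """SELECT source, COUNT(*) AS total, SUM(has_results) AS with_results
       FROM trials GROUP BY source ORDER BY source"""
).fetchall()
print(rows)  # [('isrctn', 1, 1), ('nct', 2, 1)]
```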

Crowdsourcing UI

The Crowdsourcing UI is a way to allow users to help us improve the OpenTrials database via an interactive process, either by validating data matches that have been made, or by making new manual matches across different records in the database.

We leverage the excellent Crowdcrafting for this. At beta, we are launching with a preview of this feature by exposing a “task” to validate matches we’ve automatically made in processing clinical trial records.

Other features

There are many “behind the scenes” features related to what we call the “data warehouse”. It is especially important to understand these if you plan on contributing code to OpenTrials. The major features to note here are what we call collectors and processors.

As the naming suggests, collectors are responsible for bringing external data into the database, and processors work on that data for matching, cleaning, normalisation, and the generation of the public API database.

Data

Of course, the code we are writing is all directed to a single purpose – building a centralised database for information related to clinical trials. To this end, the majority of the work done over the last 6-8 months has been around data collection and processing. Likewise, going forward, this is really the area on which most effort will be focused.

Our focus so far has been on building out the spine of the database using data from clinical trial registers. This data provides the essential information that we need to thread together additional data from a range of other sources.

We’ve also integrated a data donation related to schizophrenia from Cochrane, systematic review data from the UK Health Research Authority, references to clinical trials in academic literature from PubMed, and more.

For more information on the data in the OpenTrials database at present, read our introduction to the data.

Coming up

In terms of the technical platform, we will not be adding many more features. We'll expand the crowdsourcing tasks available to facilitate user contributions to the database, and we will open up some more specific API endpoints. Of course, there are always bugs and small enhancements to work on.

Most of the work remaining is closely tied to the data. In terms of data cleaning, entity normalisation, and record matching, we've barely scratched the surface. We'll be particularly focused on this over the next 6 months, and this is a great area to contribute to if you have an interest or expertise in data wrangling and data science.

We’ll also integrate a few more data sources. These are sources we’ve been working on over the last months, but are not yet at a stage to expose over the public API. Of note here are an integration with Epistemonikos, which is a rich source of quality systematic reviews, and OpenAIRE, which has an impressive text and data mining feature set that is helping us uncover more clinical trial references in academic publications.

Contributing

As with all Open Knowledge International projects, we welcome and encourage contributions. For the technical work on OpenTrials, contributions can mean any or all of code, documentation, testing, etc. See OpenTrials on GitHub for all the repositories with interesting tasks to take up. You can also see all the different ways you can contribute here or email us on [email protected].

OpenTrials API information

Since our beta launch, we’ve had a steady stream of technical questions about how to use and interact with the data that powers OpenTrials.

Good news! We have a beta public API available that will allow you to retrieve structured data about clinical trials listed in the OpenTrials database to use in your own projects, analyses, and visualisations. While we don't enforce rate limits yet, we ask you to be considerate of our server resources when using the API, keeping to no more than one request every few seconds.
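
A simple way to honour the one-request-every-few-seconds guideline is a small client-side throttle. The `Throttle` class below is a hypothetical helper, not part of the API; the injectable clock and sleep functions just make it testable:

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds between calls."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        # Sleep only if the previous call was too recent.
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

# Usage: call throttle.wait() before each API request, e.g.
#   throttle = Throttle(min_interval=3.0)
#   for url in urls:
#       throttle.wait()
#       response = requests.get(url)
```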

We aim to keep adding to this post with new information, demonstrations, and tutorials relating to the OpenTrials API, so bookmark it and come back. If you’ve written something or created an interesting app, visualisation, or analysis, let us know and we’ll add it to the examples below.

 

Here are some links to help you understand the OpenTrials data and API:

 

Code from the OpenTrials Hack Day in Berlin (photo by benmeg / CC BY)

Projects and documentation

Want some inspiration? Here are some projects and documentation which others have created – many thanks to you!

 

OpenTrials API record structure (via Darko Bergant)


If you’ve created a project using the OpenTrials API, or used another language to talk to the API, and you’ve documented your efforts, we’d love to showcase and link to them – get in touch!

For those of you asking for a database dump, you can find one linked from this GitHub issue.

We’re also happy to discuss potential collaborations, and ask that if you use our data for academic research, you cite our paper – thanks!

If you have any feedback, bug reports, or feature requests for OpenTrials or the API, please either email us or file an issue on our GitHub repo.

Hack Day summary + API information

It’s been quite a week – we were speaking at the International Open Data Conference in Madrid on Fri 7th, running a Hack Day in Berlin the following day, and then launching OpenTrials beta two days later at the World Health Summit.

Now that the dust has settled, I wanted to give you a brief summary of what we got up to at the Hack Day and also give you some information on the API for those of you who want to play around with the data that powers OpenTrials.

Hack Day

We had a range of people attend our event in Berlin, from researchers and developers to artists and industry consultants. Some were just interested in the concept of OpenTrials and its potential to improve medicine but were new to coding, whereas others were seasoned developers who got their teeth straight into our API.

 

The OpenTrials Hack Day in discussion (photo by benmeg / CC BY)

 

A range of ideas for projects, integrations, and improvements were discussed, including:

  • Integrating OpenTrials information into WikiData
  • Adding information on R&D costs to intervention pages
  • Showing additional summary information in the search results (to avoid clicking through to a trial’s page)
  • Linking more academic publications to clinical trials listed in OpenTrials by searching the full text of journal articles for mentions of clinical trial IDs (we currently only look at the abstracts)
  • Creating a two-way lookup table, showing what fields/variables can be expected from each clinical trial registry source
  • Adding early FDA approval letters (not available from Drugs@FDA) to drug/intervention pages (subject to an ongoing, lengthy FOIA request)
  • Displaying links between conditions, interventions, and trials (see diagram below)
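
The full-text linking idea above hinges on recognising registry identifiers in text. ClinicalTrials.gov IDs are "NCT" plus 8 digits, and ISRCTN IDs follow a similar pattern; the set of registries OpenTrials actually matches is broader, so treat this as a sketch:

```python
import re

# Registry ID patterns (illustrative subset): ClinicalTrials.gov uses
# "NCT" + 8 digits; the ISRCTN registry uses "ISRCTN" + 8 digits.
TRIAL_ID_RE = re.compile(r"\b(?:NCT|ISRCTN)\d{8}\b")

def find_trial_ids(text):
    """Return the unique trial IDs mentioned in a block of text."""
    return sorted(set(TRIAL_ID_RE.findall(text)))

abstract = ("We report outcomes of the trial registered as NCT01234567 "
            "(also listed as ISRCTN12345678).")
print(find_trial_ids(abstract))  # ['ISRCTN12345678', 'NCT01234567']
```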

 

Mapping conditions, interventions, and trials (photo by benmeg / CC BY)

 

Of particular mention is Matthias Koenig’s hack (see above) – more screenshots and code here.

The conversations of the day generally revolved around medicine, but covered everything from text and data mining, improving research quality, and clinical trial transparency through to open access and open science. We’d love to keep the conversations going, so if you’re inspired to work on something or have already started, please let us know and we’ll make sure to spread the word.

Whether you attended or are interested in working with the OpenTrials data, please join the conversation in our chatroom and view the code behind OpenTrials.

 

API information

If you’re a developer or data scientist who’s interested in playing with our underlying data, we have an API available (caveat emptor: early version, may change) – here are some documents to get you up to speed with the data:

Update: If you use R, be sure to check out Darko Bergant’s great tutorial – Using OpenTrials API with R

NB. For those of you asking for a database dump, we’re planning to release one – please subscribe to this GitHub issue to track our progress.

We’re also happy to discuss potential collaborations, and ask that if you use our data for academic research, you cite our paper – thanks!


If you have any feedback, bug reports, or feature requests for OpenTrials, please either email us or file an issue on our GitHub repo.

How we collect data in OpenTrials

A vital part of the OpenTrials project is data collection. We gather data from various sources, then process and link it in the later stages of our data pipeline. Let’s take a look at our data collection system:

Overview

For data acquisition we use components called collectors:

(diagram: collectors in the OpenTrials data pipeline)

Key facts about collectors:

  • Written in Python
  • Executed in isolated Docker containers (running independently)
  • Deployed to Docker Cloud using Travis CI/CD

Each collector runs as a daemon on Docker Cloud and always restarts after exiting – this is to ensure we provide continuous data collection, staying in sync with the specific data source. As we have to deal with very inconsistent sets of data sources, there are no strict rules about how collectors should be implemented, except that they must comply with the collect(conf, conn, *args) function interface. However, the usual flow is:

  1. Check the most recent update date of collected data records
  2. Iteratively collect data, starting from this date
  3. Log any iteration errors without exiting
  4. Sleep and exit after finishing
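
The four steps above can be sketched as a `collect(conf, conn, *args)` function. The function interface is the real one described above, but `conn.last_update_date()`, `conn.write_record()`, and `conf.source.records_since()` are hypothetical helpers for illustration:

```python
import logging
import time

logger = logging.getLogger("collector")

def collect(conf, conn, *args):
    """Sketch of the usual collector flow (hypothetical helpers)."""
    # 1. Check the most recent update date of collected data records.
    since = conn.last_update_date()

    # 2. Iteratively collect data starting from that date,
    # 3. logging any per-record errors without exiting.
    for raw in conf.source.records_since(since):
        try:
            conn.write_record(raw)
        except Exception:
            logger.exception("failed to write record, continuing")

    # 4. Sleep, then exit; Docker Cloud restarts the container,
    # so collection effectively runs continuously.
    time.sleep(conf.sleep_seconds)
```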

After the collector exits, its Docker container is automatically restarted, and the process begins again. This flow keeps our collectors running continuously, fetching and updating data from external sources.

 

Collectors in Detail

Even though each collector is written specifically for a different source, they share a lot of common functionality. To avoid repeating ourselves, we extracted it into a base collectors library, which includes:

  • Record – data model
  • fields – data model fields
  • helpers – various helpers

The most important concept here is a Record defined using Fields:

Record

A record is a representation of an atomic data unit to collect. The class reflects the source’s data model and so differs for each source.

Collectors use the Record.create(url, data) factory method to create a record. All passed data is parsed using the corresponding field types. Records are then written to the warehouse using the record.write(conf, conn) method.
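
A minimal sketch of how this might look: `Record.create(url, data)` and `record.write(conf, conn)` are the real interface described above, while the field classes and the `conn.upsert` call are hypothetical illustrations:

```python
class Field:
    """Base field: parses a raw value into its typed form."""
    def parse(self, value):
        return value

class Text(Field):
    def parse(self, value):
        return str(value).strip()

class Integer(Field):
    def parse(self, value):
        return int(value)

class Record:
    """Sketch of a record: one atomic data unit from a source."""
    # Subclasses declare fields that mirror the source's data model.
    fields = {}

    def __init__(self, url, data):
        self.url = url
        self.data = data

    @classmethod
    def create(cls, url, data):
        # Parse every passed value with its corresponding field type
        # (unknown keys are dropped in this sketch).
        parsed = {name: cls.fields[name].parse(value)
                  for name, value in data.items() if name in cls.fields}
        return cls(url, parsed)

    def write(self, conf, conn):
        # In the real library this writes to the warehouse; here `conn`
        # is anything exposing a hypothetical `upsert` method.
        conn.upsert(self.url, self.data)

class TrialRecord(Record):
    fields = {"title": Text(), "enrollment": Integer()}
```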

 

Using Scrapy in Collectors

Collectors can run any Python code to get and write records. However, in many cases the most convenient approach is to use the Scrapy framework.

Scraper

Scrapy is the most popular Python framework for web scraping. It manages many tasks, including request scheduling, request deduplication, timeouts, and auto-throttling (see the Scrapy documentation for more details). In our case, we are interested in the following features:

  • spider – to get HTTP responses for the data we’re interested in
  • parser – to get a Record from an HTTP response from the spider

We use Scrapy’s programming interface to implement this functionality in our collectors. Our base collectors library provides a Scrapy pipeline to write the scraped records to our data warehouse.
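
Independent of Scrapy's machinery, the parser's job of turning an HTTP response into record data can be sketched with the standard library alone. The HTML shape and the `public_title` field name here are hypothetical:

```python
from html.parser import HTMLParser

class TrialPageParser(HTMLParser):
    """Extract a title from a (hypothetical) registry trial page.

    In the real collectors this role is played by a Scrapy parser
    that turns the spider's HTTP response into a Record.
    """
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.title = data.strip()

def parse_trial_page(html):
    parser = TrialPageParser()
    parser.feed(html)
    return {"public_title": parser.title}

html = "<html><body><h1>A randomised trial of X</h1></body></html>"
print(parse_trial_page(html))  # {'public_title': 'A randomised trial of X'}
```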

 

Conclusion

This was a high-level introduction to the OpenTrials data collection framework. In the future we will write about specific aspects in more detail, such as using Scrapy, or about the later stages of the data pipeline (e.g. data processing and linking).

All our code is open-sourced. Feel free to explore, contribute, and use it:

https://github.com/opentrials/collectors

We really appreciate any feedback, ideas and comments.

Thanks!

 

Evgeny Karev

Senior Developer, Open Knowledge International

[email protected]

Twitter: @opentrials