End of OpenTrials Phase 1

TL;DR – OpenTrials has finished Phase 1 funding – this means that the team is having a break, with the project on hold until we secure Phase 2 funding (we’re currently working on it). In the meantime the OpenTrials explorer, our API, and database downloads will remain freely accessible. Thanks to everyone who’s contributed their time and ideas to this phase of OpenTrials, and to our funder, the Laura and John Arnold Foundation, for their generous and insightful support of the project.

It’s been a very interesting journey over the past couple of years and we’re really proud of what we’ve achieved with a small team in a short amount of time. This project couldn’t have got as far as it has without the many of you who’ve helped in a number of ways: pledging assistance, helping with user testing, giving us feedback at conferences, meeting us to discuss aspects of the project, and, of course, offering or donating data. So, thank you – we’re making OpenTrials for you, and want to make it as useful as possible, so all your contributions have been invaluable.

What follows is a summary of what we’ve built, events we’ve been involved in, some challenges and successes of the project so far, how to get involved in the meantime, and lastly some interesting articles we’ve been reading recently.

What have we built?

OpenTrials

Based on Ben Goldacre and Jonathan Gray’s 2016 paper outlining the need for a linked database of clinical trials (and related documents and data), we’ve built an early (beta) version of OpenTrials. It brings together data from multiple clinical trial registries, deduplicates those trials, automatically matches them to publications, and integrates third-party datasets. We’ve also implemented a contribution feature, meaning that users can submit a wide range of documents and data relating to trials, further enhancing the potential information available.

Here’s a summary of what data we currently have in OpenTrials:

351,851 trials from five clinical trial registries:

ClinicalTrials.gov
EMA EUCTR
WHO ICTRP
ISRCTN (new)
GSK (new)

Data integrations/linkages:

PubMed – journal articles
Cochrane Schizophrenia Group – Risk of Bias data
Food and Drug Administration (FDA) – drug approval documents
Health Research Authority – lay summaries

From a technical perspective, we’ve created an Application Programming Interface (API) which allows programmers, data scientists, and others who want to use the data in their own tools, research, or analyses to make live queries against the site using code. Alternatively, for those who want to play with the entire database, we’ve just made our database dumps available for download.

OpenTrialsFDA

Working with Drs. Erick Turner and Ben Goldacre, we were selected as one of six finalists of the Open Science Prize (ultimately being placed as a runner-up) to build a prototype solution to make the information contained in the FDA’s Drug Approval Packages more easily accessible and searchable. The result can be found at fda.opentrials.net and enables full text searching of over 55,000 FDA drug approval documents – something that was not possible before. For more information on OpenTrialsFDA here’s a more detailed blog post + 3min video.

OpenTrials walkthrough

If you’ve not seen it yet, there’s a 10min walkthrough of the site from Jan 2017 – you’ll notice we’ve added some more features since then, but it’s a helpful overview.

Events

Over the past year the OpenTrials team has presented at Evidence Live in Oxford (June 2016), the International Open Data Conference in Madrid (October 2016), the OpenTrials hack day in Berlin (October 2016), the World Health Summit in Berlin (October 2016), the Cochrane Colloquium in Seoul (October 2016), Bioinformatics Meetup in London (December 2016), Clinical Innovation & Partnering World in London (March 2017), and the Cochrane UK & Ireland Symposium in Oxford (March 2017). It’s been great to meet so many people who are as excited as we are about what OpenTrials is trying to achieve; we’ve had some great conversations with users and potential data partners, leading to some interesting ideas for collaborations and development in the next phase of the project.

We’ve also run a number of user testing sessions with domain experts, including medical librarians – this has been invaluable. Thank you again to all those who volunteered their time to help us understand what works, what doesn’t, what’s not obvious, and what features would be useful – we’ve added everything you’ve pointed out to our GitHub issues – have a look, and feel free to get involved in the conversations (or contribute code if you’re more technical).

Challenges and successes

Website layout/content changing, causing data scrapers to fail

We currently import the majority of our data using scrapers, which grab the data directly from a source’s website. This relies on two things:

that the order/location of the data stays in the same place (i.e. the layout/design of the website does not change)
the structure of the data itself stays the same (e.g. one data source changed the text it used to represent both sexes from ‘both’ to ‘all’)

If either or both of these change, our scraper can no longer retrieve data until it is rewritten to accommodate the changes. An example which has impacted us is the Drugs@FDA website which we’ve used as the data source for our OpenTrialsFDA project. The site was redesigned after our initial scrape, meaning that to keep the documents on OpenTrialsFDA up-to-date our scraper needed rewriting.
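To illustrate the second kind of change, here is a minimal Python sketch of one way a scraper could cope with a source renaming a value (such as ‘both’ becoming ‘all’) while still failing loudly on anything it has never seen. It is not taken from the OpenTrials codebase: the names `SEX_SYNONYMS`, `UnknownValueError` and `normalise_sex`, and the choice of internal vocabulary, are assumptions made for the example.

```python
# Hypothetical vocabulary: maps values seen on a source website onto a
# small internal set. These names and values are illustrative only.
SEX_SYNONYMS = {
    "both": "both",
    "all": "both",   # the renamed value described above
    "male": "male",
    "female": "female",
}


class UnknownValueError(ValueError):
    """Raised when a source uses a value the scraper has not seen before."""


def normalise_sex(raw):
    """Map a scraped free-text value onto the internal vocabulary."""
    key = raw.strip().lower()
    try:
        return SEX_SYNONYMS[key]
    except KeyError:
        # Fail loudly so a changed source is noticed, rather than
        # silently storing an unrecognised value.
        raise UnknownValueError(f"unrecognised sex value: {raw!r}") from None


print(normalise_sex("Both"))
print(normalise_sex(" ALL "))
```

Raising an error on unknown values trades convenience for visibility: the import stops, but the change on the source website is noticed straight away. This only helps with changes to values; a redesigned page layout still needs the scraper itself to be rewritten.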

Suggestion: we encourage those offering searchable databases on their website to also provide the option to retrieve the data via either an API or bulk download (preferred).

Data heterogeneity

When combining or grouping data from multiple sources, we’ve encountered issues where the same elements are represented in different ways. This is due to a combination of sources allowing free-text input and not using standards.

A good example of this is geographical location – for instance ‘United Kingdom’ may be entered as ‘United Kingdom’, ‘Great Britain’, ‘UK’, or ‘GB’.

The effect of this on the project is that we’ve spent a lot of time processing and normalising data to make it more usable, and the task is ongoing and potentially unending.

Suggestion: trial registries should use known standards for metadata – in the case of countries, ISO country codes, and for fields such as condition names, a controlled vocabulary such as MeSH.
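As a small illustration of the kind of normalisation involved, the Python sketch below maps the free-text variants from the example above onto the ISO 3166-1 alpha-2 code for the United Kingdom (‘GB’). The lookup table and the function name `to_iso_country` are hypothetical and are not the project’s actual processing code; a real table would need to cover every country and many more spellings.

```python
# Illustrative alias table: free-text country names as they might appear
# in registry records, mapped to ISO 3166-1 alpha-2 codes.
COUNTRY_ALIASES = {
    "united kingdom": "GB",
    "great britain": "GB",
    "uk": "GB",
    "gb": "GB",
}


def to_iso_country(raw):
    """Return the ISO alpha-2 code for a free-text country, or None."""
    # Collapse runs of whitespace and ignore case before the lookup.
    key = " ".join(raw.split()).lower()
    return COUNTRY_ALIASES.get(key)


for value in ["United Kingdom", "Great  Britain", "uk", "GB", "Narnia"]:
    print(repr(value), "->", to_iso_country(value))
```

Returning `None` for unrecognised input, rather than guessing, keeps unmatched values visible so they can be reviewed and added to the table by hand.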

N.B. In Phase 2 we plan to deploy a controlled vocabulary/ontology such as MeSH or SNOMED CT.

Licensing

In order for us to use a dataset in OpenTrials it must be offered with a suitably permissive licence. Ideally, a dataset would be licensed as open data, meaning that the data can be “freely used, modified, and shared by anyone for any purpose”, for example under a Creative Commons licence such as Attribution 4.0 International (CC BY 4.0) or even better as a Public Domain Dedication (CC0).

Currently, the majority of datasets we see are not usable due to restrictive terms and conditions on their websites. This may be due to organisations using boilerplate terms and conditions with built-in restrictions, erring on the side of perceived risk, or wanting to protect information they perceive as having value as intellectual property to the organisation.

Over the past six months, with the help of an intellectual property lawyer, we’ve been in discussions with a number of organisations which have restrictive data licences. We’ve explained how we’d like to use their data on OpenTrials, how their current licence prevents that use, and how different parts of the licence (e.g. non-commercial, personal use only, no redistribution) are problematic/ambiguous, and have suggested more open, permissive alternatives.

Suggestion: if you’re a data provider, we’d encourage you to follow the example of one of the organisations below and make your data licences more open – we’re happy to talk to you about the issues: [email protected]

N.B. We’re planning a detailed blog post about licensing – watch this space!

Licensing successes

We’re pleased to announce that two organisations (ISRCTN and GSK) have already changed their terms and conditions to allow far greater use of their data.

In the case of ISRCTN, this covers their trial metadata (e.g. condition, intervention, trial title, phase, etc.) under a CC BY 4.0 licence. In the case of GSK, this covers both their trial metadata and their collection of documents relating to trials (Protocol Summaries, Scientific Results Summaries, Protocols, and Clinical Study Reports).

We’ve just added these new sources to OpenTrials, meaning that even more trials and documents are now listed – many thanks to both organisations for showing great leadership on this issue!

Commenting on this progress, Andrew Freeman, Head of Medical Policy at GSK said:

“We can and do publicly disclose information about our clinical trials on our register. Disclosure is important but it’s not enough. The value of disclosing information can be significantly limited if the information is not readily accessible and usable. To that end, we recently clarified that the use of information on our register is unrestricted provided that it may not be used in applications by others for regulatory approval of a product.”

How to get involved in the meantime

If you’re keen to get involved in discussions with other OpenTrials users and open data fans, there are a number of ways you can do that while our core team is having a rest.

If you’re interested in looking at the bugs and feature requests we have (or want to file new ones!), take a look at our GitHub issues and feel free to comment on any with your insights. We also have a forum where you can discuss issues relating to OpenTrials, for instance if you know of a data source we might be interested in using, or a way of improving matches or cleaning the data. For the more technical amongst you, feel free to also contribute code and get involved in our Gitter chat room.

We’d also like to hear about any problems you have with the OpenTrials explorer and OpenTrialsFDA – use the ‘Flag an error’ link at the bottom of any page.

And lastly, use our API and/or database downloads to create tools, visualisations, and analyses + let us know what you’ve made!

What we’ve been reading

And lastly, as it’s going to be a while until we’re in touch again, here’s a bumper crop of articles from the last few months:

Trial data transparency

Technical

Reporting bias

Other

As always, for the latest updates make sure you’re subscribed to our newsletter and follow us on Twitter @opentrials

Signing out for now from the OpenTrials team at Open Knowledge International – see you on the other side!