End of OpenTrials Phase 1

TL;DR – OpenTrials has finished Phase 1 funding – this means that the team is having a break, with the project on hold until we secure Phase 2 funding (we’re currently working on it). In the meantime the OpenTrials explorer, our API, and database downloads will remain freely accessible. Thanks to everyone who’s contributed their time and ideas to this phase of OpenTrials, and to our funder, the Laura and John Arnold Foundation, for their generous and insightful support of the project.

It’s been a very interesting journey over the past couple of years and we’re really proud of what we’ve achieved with a small team in a short amount of time. This project couldn’t have got as far as it has without many of you, who’ve helped in a number of ways: pledging assistance, helping with user testing, giving us feedback at conferences, meeting us to discuss aspects of the project, and of course offering or donating data. So, thank you – we’re making OpenTrials for you, and want to make it as useful as possible, so all your contributions have been invaluable.

What follows is a summary of what we’ve built, events we’ve been involved in, some challenges and successes of the project so far, how to get involved in the meantime, and lastly some interesting articles we’ve been reading recently.

What have we built?

OpenTrials

Based on Ben Goldacre and Jonathan Gray’s 2016 paper outlining the need for a linked database of clinical trials (and related documents and data), we’ve built an early (beta) version of OpenTrials. It brings together data from multiple clinical trial registries, deduplicates those trials, automatically matches them to publications, and integrates third-party datasets. We’ve also implemented a contribution feature, meaning that users can submit a wide range of documents and data relating to trials, further enhancing the potential information available.

Here’s a summary of what data we currently have in OpenTrials:

351,851 trials from five clinical trial registries:

  • ClinicalTrials.gov
  • EMA EUCTR
  • WHO ICTRP
  • ISRCTN (new)
  • GSK (new)

Data integrations/linkages:

  • PubMed – journal articles (example)
  • Cochrane Schizophrenia Group – Risk of Bias data (example)
  • Food and Drug Administration (FDA) – drug approval documents (example)
  • Health Research Authority – lay summaries (example)

From a technical perspective, we’ve created an Application Programming Interface (API) which allows programmers, data scientists, and others who want to use the data in their own tools, research, or analyses to make live queries against the site using code. Alternatively, for those who want to play with the entire database, we’ve just made our database dumps available for download.

 

OpenTrialsFDA

Working with Drs. Erick Turner and Ben Goldacre, we were selected as one of six finalists of the Open Science Prize (ultimately being placed as a runner-up) to build a prototype solution to make the information contained in the FDA’s Drug Approval Packages more easily accessible and searchable. The result can be found at fda.opentrials.net and enables full text searching of over 55,000 FDA drug approval documents – something that was not possible before. For more information on OpenTrialsFDA here’s a more detailed blog post + 3min video.

 

OpenTrials walkthrough

If you’ve not seen it yet, here’s a 10min walkthrough of the site from Jan 2017 – you’ll notice we’ve added some more features since then, but it’s a helpful overview:

Events

Over the past year the OpenTrials team has presented at Evidence Live in Oxford (June 2016), the International Open Data Conference in Madrid (October 2016), the OpenTrials hack day in Berlin (October 2016), the World Health Summit in Berlin (October 2016), the Cochrane Colloquium in Seoul (October 2016), Bioinformatics Meetup in London (December 2016), Clinical Innovation & Partnering World in London (March 2017), and the Cochrane UK & Ireland Symposium in Oxford (March 2017). It’s been great to meet so many people who are as excited as we are about what OpenTrials is trying to achieve; we’ve had some great conversations with users and potential data partners, leading to some interesting ideas for collaborations and development in the next phase of the project.

We’ve also run a number of user testing sessions with domain experts, including medical librarians – this has been invaluable. Thank you again to all those who volunteered their time to help us understand what works, what doesn’t, what’s not obvious, and what features would be useful – we’ve added everything you’ve pointed out to our GitHub issues – have a look, and feel free to get involved in the conversations (or contribute code if you’re more technical).

 

Challenges and successes

Website layout/content changing, causing data scrapers to fail

Most of the data we currently import comes from scrapers, which grab the data directly from a source’s website. This relies on two things:

  • the order/location of the data staying the same (i.e. the layout/design of the website does not change)
  • the structure of the data itself staying the same (e.g. one data source changed the text it used to represent both sexes from ‘both’ to ‘all’)

If either or both of these change, our scraper can no longer retrieve data until it is rewritten to accommodate the changes. An example which has affected us is the Drugs@FDA website, which we’ve used as the data source for our OpenTrialsFDA project. The site was redesigned after our initial scrape, meaning that to keep the documents on OpenTrialsFDA up-to-date our scraper needed rewriting.
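One way to soften this kind of breakage is to check every scraped record before storing it, so that a layout or wording change fails loudly instead of silently writing bad data. The sketch below is only an illustration and is not taken from the OpenTrials code: the field names, the alias table, and the ScrapeFormatError class are all assumptions.

```python
# Hypothetical vocabulary: the 'both' -> 'all' entry mirrors the change described above.
SEX_ALIASES = {"both": "all", "all": "all", "male": "male", "female": "female"}

# Hypothetical set of fields a scraper expects to find on every page.
REQUIRED_FIELDS = ("identifier", "title", "sex")


class ScrapeFormatError(ValueError):
    """Raised when a scraped record no longer matches what the scraper expects."""


def validate_scraped(raw):
    """Return a cleaned copy of a scraped record, or raise ScrapeFormatError."""
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    if missing:
        # An empty required field usually means the page layout has moved.
        raise ScrapeFormatError("missing fields: " + ", ".join(missing))
    sex = raw["sex"].strip().lower()
    if sex not in SEX_ALIASES:
        # An unknown value usually means the source changed its wording.
        raise ScrapeFormatError("unexpected value for sex: " + repr(raw["sex"]))
    return {**raw, "sex": SEX_ALIASES[sex]}
```

A record that passes comes back with the old and new wording mapped to the same value; one that fails points straight at the field that changed.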

Suggestion: we encourage those offering searchable databases on their website to also provide the option to retrieve the data via either an API or bulk download (preferred).

 

Data heterogeneity

When combining or grouping data from multiple sources, we’ve encountered issues where the same elements are represented in different ways. This is due to a combination of sources allowing free-text input and not using standards.

A good example of this is geographical location – for instance ‘United Kingdom’ may be entered as ‘United Kingdom’, ‘Great Britain’, ‘UK’, or ‘GB’.

The effect of this on the project is that we’ve spent a lot of time processing and normalising data to make it more usable, and the task is ongoing and potentially unending.
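To give a flavour of what this normalisation involves, here is a minimal sketch that maps the free-text variants from the example above onto one ISO 3166-1 alpha-2 code. The alias table and the normalise_country function are illustrative assumptions, not the project’s actual cleaning code, and a real table would need many more entries.

```python
# Hypothetical alias table: every variant from the example maps to the
# ISO 3166-1 alpha-2 code for the United Kingdom.
COUNTRY_ALIASES = {
    "united kingdom": "GB",
    "great britain": "GB",
    "uk": "GB",
    "gb": "GB",
}


def normalise_country(value):
    """Return an ISO country code for a free-text value, or None if unknown."""
    # Collapse repeated whitespace and ignore case before the lookup.
    key = " ".join(value.split()).lower()
    return COUNTRY_ALIASES.get(key)
```

Returning None for unknown values, rather than guessing, lets unmatched entries be collected for manual review, which is where much of the ongoing effort goes.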

Suggestion: trial registries should use known standards for metadata – in the case of countries, ISO country codes, and for fields such as condition names, a controlled vocabulary such as MeSH.

N.B. In Phase 2 we plan to deploy a controlled vocabulary/ontology such as MeSH or SNOMED CT.

 

Licensing

In order for us to use a dataset in OpenTrials it must be offered with a suitably permissive licence. Ideally, a dataset would be licensed as open data, meaning that the data can be “freely used, modified, and shared by anyone for any purpose”, for example under a Creative Commons licence such as Attribution 4.0 International (CC BY 4.0) or even better as a Public Domain Dedication (CC0).

Currently, the majority of datasets we see are not usable due to restrictive terms and conditions on their websites. This may be due to organisations using boilerplate terms and conditions with built-in restrictions, erring on the side of perceived risk, or wanting to protect information they perceive as having value as intellectual property to the organisation.

Over the past six months, with the help of an intellectual property lawyer, we’ve been in discussions with a number of organisations which have restrictive data licences. We’ve explained how we’d like to use their data on OpenTrials, how their current licence prevents that use, and how different parts of the licence (e.g. non-commercial, personal use only, no redistribution) are problematic/ambiguous, and have suggested more open, permissive alternatives.

Suggestion: if you’re a data provider, we’d encourage you to follow the example of one of the organisations below and make your data licences more open – we’re happy to talk to you about the issues: [email protected]

N.B. We’re planning a detailed blog post about licensing – watch this space!

 

Licensing successes

We’re pleased to announce that two organisations (ISRCTN and GSK) have already changed their terms and conditions to allow far greater use of their data.

In the case of ISRCTN, this covers their trial metadata (e.g. condition, intervention, trial title, phase etc) under a CC BY 4.0 licence. In the case of GSK, this covers both their trial metadata and their collection of documents relating to trials (Protocol Summaries, Scientific Results Summaries, Protocols, and Clinical Study Reports).

We’ve just added these new sources to OpenTrials, meaning that even more trials and documents are now listed – many thanks to both organisations for showing great leadership on this issue!

Commenting on this progress, Andrew Freeman, Head of Medical Policy at GSK said:

We can and do publicly disclose information about our clinical trials on our register. Disclosure is important but it’s not enough. The value of disclosing information can be significantly limited if the information is not readily accessible and usable. To that end, we recently clarified that the use of information on our register is unrestricted provided that it may not be used in applications by others for regulatory approval of a product.

 

How to get involved in the meantime

If you’re keen to get involved in discussions with other OpenTrials users and open data fans, there are a number of ways you can do that while our core team is having a rest.

If you’re interested in looking at the bugs and feature requests we have (or want to file new ones!), take a look at our GitHub issues and feel free to comment on any with your insights. We also have a forum where you can discuss issues relating to OpenTrials, for instance if you know of a data source we might be interested in using, or a way of improving matches or cleaning the data. For the more technical amongst you, feel free to also contribute code and get involved in our Gitter chat room.

We’d also like to hear about any problems you have with the OpenTrials explorer and OpenTrialsFDA – use the ‘Flag an error’ link at the bottom of any page.

And lastly, use our API and/or database downloads to create tools, visualisations, and analyses + let us know what you’ve made!

 

What we’ve been reading

And lastly, as it’s going to be a while until we’re in touch again, here’s a bumper crop of articles from the last few months:

Trial data transparency

Technical

Reporting bias

Other

 

As always, for the latest updates make sure you’re subscribed to our newsletter and follow us on Twitter @opentrials

 

Signing out for now from the OpenTrials team at Open Knowledge International – see you on the other side!

Open Knowledge International’s OpenTrials team, March 2017

Open Science Prize results for OpenTrialsFDA + feedback

After reaching the final three of the Open Science Prize, and being judged by an expert panel, we can announce that our project, OpenTrialsFDA, was placed as a runner-up alongside MyGene2, with Real-Time Evolutionary Tracking for Pathogen Surveillance and Epidemiological Investigation being declared the winner of the grand prize of $230,000. As you may know, OpenTrialsFDA is designed to make clinical trial data from the Food and Drug Administration (FDA) more easily accessible and searchable. Congratulations to all the finalists – it’s been great to see such a range of projects dedicated to innovating and advancing areas of biomedicine and health through open science, content, and data.

Although OpenTrialsFDA didn’t win the grand prize, we were very grateful for all the public votes which took us from the original shortlist of six to the final three. A number of the public also chose to give comments along with their vote – there were many fantastic ones and we wanted to share a selection:

“I made OpenTrialsFDA my first choice because patients have suffered and died because information from research which should be available to health professionals and patients has not been accessible. This project addresses this important problem.”

“I am particularly keen on projects that get data out in there open where it can be subjected to independent analysis.”

“As a professional technologist, I have found the APIs produced by the folks at OpenTrials are of extremely high quality; making integrating with and using the data straightforward yet powerful. If all open science projects had as technically competent and sophisticated people involved then huge opportunities for knowledge growth would open up to us all very quickly.”

“OpenTrialsFDA project will be relevant to a wide range of clinical research fields, with important potential impacts on patients, clinicians, researchers, and health decision makers.”

“It’s great to see a clinical science entry in the Open Science Award. Easier access to the full FDA data for approved indications for approved drugs will assist clinical practitioners, biomedical scientists and educated citizens in making more informed choices on medical and research choices.”

“Improving transparency of clinical data will literally save and transform lives”

“Drugs affect peoples’ lives and all too frequently patients and physicians do not have the data and evidence they need to make truly informed decisions about treatments. Access to data on effectiveness and harms is becoming even more critical with rapidly rising drug prices. Shining a light on what we know and don’t know about treatments will hopefully move us in a direction of making better individual and population-based decisions about treatments.”

Thanks so much for all the kind words – it means a lot to the entire team and we strongly believe in the power of the project to facilitate positive change!

 

What’s next for OpenTrialsFDA?

The search engine will continue to exist at fda.opentrials.net – it currently indexes over 55,000 FDA approval documents and where possible links to clinical trials on OpenTrials (a few examples). While development on OpenTrialsFDA has stopped for the moment (we’re focusing on the main OpenTrials explorer) there are a number of ways you can help support the project:

Upcoming events: OpenTrials at Cochrane UK and Clinical Innovation & Partnering World

On 9th March, our Community Manager, Ben Meghreblian, will participate in two sessions at Clinical Innovation and Partnering World in London. The event is a great opportunity for OpenTrials as it brings together leading experts in the fields of clinical innovation, outsourcing, alliance management and strategic partnering, with attendees from big pharma, biotechnology/technology firms, and contract research organisations (CROs). The focus of the event is on disruptors in clinical outsourcing and innovation, so we’re pleased to have been invited.

Ben will be giving a plenary talk, introducing attendees to OpenTrials, and will also run a roundtable titled ‘Aligning external with internal values’ which will explore some of the issues relating to clinical data sharing, including benefits to patients and researchers, concerns, solutions, and licensing issues (session details). We’re really looking forward to having some useful discussions, hearing industry views, and demonstrating the power of combining datasets in OpenTrials.

 

After presenting OpenTrials at the annual Cochrane Colloquium in Seoul last year, we’ll be giving an updated talk at the Cochrane UK & Ireland Symposium in Oxford on 15th March (session details). One of our researchers, Jessica Fleminger, will join Ben Meghreblian to give an overview of the project and the latest progress, and to ask for your feedback on the OpenTrials explorer.

We encourage anyone coming along (hello systematic reviewers!) to bring a laptop and make yourself familiar with OpenTrials so that we can have some fruitful discussions about what’s good, what’s missing/broken, and generally how we can make it a better tool for users.

 

User testing – medical librarians

We’ll also be conducting some user testing sessions with medical librarians, so if that’s you and you’re interested in helping us improve OpenTrials, fill in this form to let us know.

As always, if you have feedback about OpenTrials you can email us your thoughts, or file a GitHub issue if you’re familiar with it. We’ll keep you updated with any developments on the blog, but please make sure you subscribe to our newsletter and follow us on Twitter!

Data Journalism Hack Day and Bioinformatics & Data Analytics Meetup

On the 1st and 2nd of December, our Community Manager, Ben Meghreblian, was out spreading the word about OpenTrials in London.

Firstly, we were invited to talk to the members of the London Containing Bioinformatics & Data Analytics Meetup – they’re a mix of bio/health/medical informaticians, machine learners, software developers, engineers, and data scientists, so it was a great opportunity to talk about OpenTrials to a slightly different crowd from usual. There were lots of interesting questions, some potential offers of help from those who supervise students, and the conversations continued long after the session ended. Thanks to Paul A for organising and Ben vZ for hosting!

Ben Meghreblian talking about OpenTrials at the London Meetup – photo by Paul Agapow

 

Secondly, we took part in a Data Journalism Hack Day at King’s College London (KCL), in collaboration with Open Knowledge International and SoBigData. The day was organised for students of the KCL Data Journalism Course, and saw them working in groups on issues from international taxation, clinical trials, human rights violations, and natural resource extraction in the Global South. The students had already extracted data from various sources and their aim for the day was to work with data and domain experts to build a story around the data which they would subsequently write up.

KCL students discussing ideas for their data journalism stories – photo by Ben Meghreblian

 

The students working with the OpenTrials data had previously decided to focus on migraine trials and had already used the OpenTrials API to extract the relevant trials, importing them into Excel where they spent most of their time cleaning the data, analysing it, and visualising it.

It was an enjoyable day helping the students better understand how clinical trials work, what sort of issues may be worth considering for their story, and how the data can inform it. It was also interesting to see the OpenTrials data being used in a real, hands-on way, how powerful it can be to answer specific questions, along with challenges and limitations relating to missing data from some registries and non-normalised company names – issues we are aware of and want to address.

We’re looking forward to reading the finished clinical trials ‘exclusive’ by the students – meanwhile we don’t want to steal their journalistic thunder, so we won’t post any of the cool visualisations they’ve created yet, but once the story is live, we’ll link to it here.

 

OpenTrials at Cochrane Colloquium

Earlier this year, two weeks after the beta launch of OpenTrials, two members of our team, Ben Meghreblian (Community Manager) and Jessica Fleminger (Researcher) travelled to the Cochrane Colloquium in Seoul to talk about OpenTrials. As you may know, we are keen to speak to different users of OpenTrials to better understand how it works well, how it doesn’t, and how we can improve it. This was a great opportunity to speak to a mixture of researchers, systematic reviewers, and information specialists and get their feedback, along with spreading the word amongst the Cochrane community.

Our talk covered some of the problems with the information architecture of evidence-based medicine, how OpenTrials aims to help fix them, a technical overview of the platform, how we import data, licensing issues, user examples, and a demo – here are the slides:

Along with some good questions and suggestions for OpenTrials, we also had a number of meetings, both planned and spontaneous, to discuss potential collaborations to improve OpenTrials functionality and integrate others’ data into our system.

A great example of this is the Risk of Bias data which the Cochrane Schizophrenia Group previously kindly gave to us (thanks a lot!). This is structured data produced by researchers, who have graded schizophrenia trials on issues such as blinding and selective reporting. Here are some examples of how we’ve integrated this data onto individual trial pages (hint: have a look at the ‘Methodological rigour’ section at the bottom of each page).



OpenTrials API information

Since our beta launch, we’ve had a steady stream of technical questions about how to use and interact with the data that powers OpenTrials.

Good news! We have a beta public API available that will allow you to retrieve structured data about clinical trials listed in the OpenTrials database to use in your own projects, analyses, and visualisations. While we don’t enforce rate limits on it yet, we ask you to be considerate of our server resources when using the API, limiting yourself to no more than one request every few seconds.
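If you are scripting against the API, one simple way to stay within that request is to put a small throttle in front of every call. The sketch below is a generic illustration, not an official client: the Throttle class, the fetch_all helper, and the three-second interval are all assumptions, and fetch stands for whatever function you already use to make a request.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, fetch, throttle):
    """Call fetch(url) for each URL, pausing between calls as needed."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

With throttle = Throttle(3.0), a loop over many URLs makes at most one request every three seconds, however fast the responses come back.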

We aim to keep adding to this post with new information, demonstrations, and tutorials relating to the OpenTrials API, so bookmark it and come back. If you’ve written something or created an interesting app, visualisation, or analysis, let us know and we’ll add it to the examples below.

 

Here are some links to help you understand the OpenTrials data and API:

 

Code from the OpenTrials Hack Day in Berlin (photo by benmeg / CC BY)

Projects and documentation

Want some inspiration? Here are some projects and documentation which others have created – many thanks to you!

 

OpenTrials API record structure (via Darko Bergant)


If you’ve created a project using the OpenTrials API or used another language to talk to the API, and you’ve documented your efforts, we’d love to showcase and link to them – get in touch!

For those of you looking for database dumps, you can find them here.

We’re also happy to discuss potential collaborations, and ask that if you use our data for academic research, you cite our paper – thanks!

If you have any feedback, bug reports, or feature requests for OpenTrials or the API, please either email us or file an issue on our GitHub repo.

Hack Day summary + API information

It’s been quite a week – we were speaking at the International Open Data Conference in Madrid on Fri 7th, running a Hack Day in Berlin the following day, and then launching OpenTrials beta two days later at the World Health Summit.

Now that the dust has settled, I wanted to give you a brief summary of what we got up to at the Hack Day and also give you some information on the API for those of you who want to play around with the data that powers OpenTrials.

Hack Day

We had a range of people attend our event in Berlin, including researchers, developers, artists, and industry consultants. Some were interested in the concept of OpenTrials and its ability to improve medicine but were new to coding, whereas others were seasoned developers who got their teeth straight into our API.

 

The OpenTrials Hack Day in discussion (photo by benmeg / CC BY)

 

A range of ideas for projects, integrations, and improvements were discussed, including:

  • Integrating OpenTrials information into WikiData
  • Adding information on R&D costs to intervention pages
  • Showing additional summary information in the search results (to avoid clicking through to a trial’s page)
  • Linking more academic publications to clinical trials listed in OpenTrials by searching the full text of journal articles for mentions of clinical trial IDs (we currently only look at the abstracts)
  • Creating a two-way lookup table, showing what fields/variables can be expected from each clinical trial registry source
  • Adding early FDA approval letters (not available from Drugs@FDA) to drug/intervention pages (subject to an ongoing, lengthy FOIA request)
  • Displaying links between conditions, interventions, and trials (see diagram below)
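The full-text linking idea in the list above comes down to scanning text for strings that look like trial identifiers. Here is a minimal sketch of that step. The patterns reflect commonly seen identifier formats (NCT plus eight digits, ISRCTN plus eight digits, and the EudraCT year-number-check layout) and should be treated as assumptions, as should the find_trial_ids name; this is not the matcher OpenTrials uses.

```python
import re

# Assumed identifier formats, keyed by a short hypothetical source label.
TRIAL_ID_PATTERNS = {
    "nct": re.compile(r"\bNCT\d{8}\b"),
    "isrctn": re.compile(r"\bISRCTN\d{8}\b"),
    "euctr": re.compile(r"\b\d{4}-\d{6}-\d{2}\b"),
}


def find_trial_ids(text):
    """Return the distinct trial IDs mentioned in text, grouped by source."""
    found = {}
    for source, pattern in TRIAL_ID_PATTERNS.items():
        ids = sorted(set(pattern.findall(text)))
        if ids:
            found[source] = ids
    return found
```

Run over an article’s full text rather than just its abstract, the same function would pick up identifiers that only appear in the methods section or footnotes.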

 

Mapping conditions, interventions, and trials (photo by benmeg / CC BY)

 

Of particular mention is Matthias Koenig’s hack (see above) – more screenshots and code here.

The conversations of the day generally revolved around medicine, but covered everything from text and data mining, improving research quality, and clinical trial transparency through to open access and open science. We’d love to keep the conversations going, so if you’re inspired to work on something or have already started, please let us know and we’ll make sure to spread the word.

Whether you attended or are interested in working with the OpenTrials data, please join the conversation in our chatroom and view the code behind OpenTrials.

 

API information

If you’re a developer or data scientist who’s interested in playing with our underlying data, we have an API available (caveat emptor: early version, may change) – here are some documents to get you up to speed with the data:

Update: If you use R, be sure to check out Darko Bergant’s great tutorial – Using OpenTrials API with R

 

NB. You can download a database dump of OpenTrials here.



If you have any feedback, bug reports, or feature requests for OpenTrials please either email us or file an issue on our GitHub repo.

OpenTrials launch date + Hack Day

Exciting news! OpenTrials will officially launch its beta on Monday 10th October 2016 at the World Health Summit in Berlin. After months of behind-the-scenes meeting, planning, and developing, we’re all really excited about demoing OpenTrials to the world and announcing how to access and use the site!

The launch will take place at the ‘Fostering Open Science in Global Health’ workshop, with OpenTrials being represented by our Community Manager, Ben Meghreblian. The workshop will be a great opportunity to talk about the role of open data, open science, and generally how being open can bring improvements in medicine and beyond!

If you’ll be attending the conference or the workshop, we’d love to meet you – please do get in touch and let us know.

Hack Day

If that wasn’t enough, we also have a confirmed date and location for the OpenTrials Hack Day – it will take place on Saturday 8th October at the German office of Wikimedia in Berlin.

We’re inviting people from a range of backgrounds. So, if you’re a developer, data scientist, health technologist, open data advocate, or otherwise interested in health, medicine, and clinical trials, come along and learn more about the data that powers OpenTrials, how it’s structured, and how to use our API to search the OpenTrials database or build applications using the data.

On the day our technical lead and a domain expert will be on hand to explain the data and facilitate the day – we’re really looking forward to seeing what clever hacks and mini-projects you’ll create.

For those of you who have already asked, we’ll be releasing documentation on the OpenTrials API and database soon, but meanwhile if you’re interested in the event you’ll find more details on the OpenTrials Eventbrite page, or you can register quickly below.

 

OpenTrials presents at Evidence Live

If you follow us on Twitter you may know that we presented our progress with OpenTrials at the Evidence Live conference in Oxford a few weeks ago. Amongst those attending were leaders across the world of Evidence Based Medicine, including researchers, doctors, and the pharmaceutical industry, so we were excited to participate!

It was great to speak to so many people who are interested in OpenTrials, both in terms of researchers who want to use the platform and those with a general enthusiasm for its impact on medicine.

Around 40 people attended our talk, which explained why OpenTrials is an important infrastructure project for medicine, covered some of the technical aspects of the platform, gave details of what data we’ve imported so far, and ended with a quick demo.

If you’re feeling impatient, here are the slides from the talk, or scroll down for a summary.


 

Ben Goldacre and Vitor Baptista present OpenTrials at Evidence Live 2016 (photo by benmeg / CC BY)

What we’ve imported into the OpenTrials database so far

  • 331,999 deduplicated trials, collected from three clinical trial registries:
    • ClinicalTrials.gov 205,422
    • EU CTR 35,159
    • WHO ICTRP 298,688

Current functionality

  • Basic search (by keyword)
  • Searching for trials with publications
  • Uploading missing data/documents for a particular trial
  • Showing trials with discrepancies (e.g. target sample size)

What we’re importing next

Feedback and get involved

If you attended the talk and have any questions or feedback, please email us. And generally if you’re interested in contributing to OpenTrials, get in touch.

Want to get early access to the data and be a user tester? Sign up and we’ll be in touch soon.

How we collect data in OpenTrials

A vital part of the OpenTrials project is data collection. We collect data from various sources, then process and link it in the next stages of our data pipeline. Let’s take a look at our data collection system:

Overview

For data acquisition we use a component called collectors:


Key facts about collectors:

  • Written in Python
  • Executed in isolated Docker containers (running independently)
  • Deployed to Docker Cloud using Travis CI/CD

Each collector runs as a daemon on Docker Cloud and always restarts after exiting – this is to ensure we provide continuous data collection, staying in sync with the specific data source. As we have to deal with very inconsistent sets of data sources, there are no strict rules about how collectors should be implemented, except that they must comply with the collect(conf, conn, *args) function interface. However, the usual flow is:

  1. Check the most recent update date of collected data records
  2. Iteratively collect data, starting from this date
  3. Log any iteration errors without exiting
  4. Sleep and exit after finishing

After the collector exits, its Docker container is automatically restarted and the process repeats. Using this flow, our collectors run continuously, getting and updating the data from external sources.
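The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the collectors repository: only the collect(conf, conn, *args) signature comes from the text, while make_collector, the three helper callables, and the SLEEP_SECONDS setting are hypothetical stand-ins for source-specific code.

```python
import logging
import time

logger = logging.getLogger("collector-sketch")


def make_collector(get_last_updated, iter_records_since, write_record):
    """Build a collect(conf, conn, *args) function from three helpers.

    All three helpers are hypothetical placeholders for source-specific code.
    """

    def collect(conf, conn, *args):
        # 1. Check the most recent update date of the records already collected.
        since = get_last_updated(conn)
        # 2. Iteratively collect data, starting from that date.
        for record in iter_records_since(since):
            try:
                write_record(conn, record)
            except Exception:
                # 3. Log the failure and carry on with the next record.
                logger.exception("failed to collect %r", record)
        # 4. Sleep, then return so the container exits and is restarted.
        time.sleep(conf.get("SLEEP_SECONDS", 0))

    return collect
```

Because a failed record is logged rather than raised, one malformed page does not stop the rest of the run, and the restart after exit picks up anything that was missed.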

 

Collectors in Detail

Even though each collector is written specifically for a different source, they share many common functionalities. To avoid repeating ourselves, we extracted them to a base collectors library which includes:

  • Record – data model
  • fields – data model fields
  • helpers – various helpers

The most important concept here is a Record defined using Fields:

Record

A record is a representation of an atomic data unit to collect. This class reflects the source data model, so it differs for each source.

Collectors use the Record.create(url, data) factory method to create a record. All passed data is parsed using the corresponding field types. Records are then written to the warehouse using the record.write(conf, conn) method.
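To make the idea concrete, here is a simplified sketch of how a record built from typed fields might look. Only the Record.create(url, data) and record.write(conf, conn) signatures come from the text above. The Field, Text, and Integer classes, the TrialRecord example, and the conn.insert(table, row) call are assumptions made for illustration, so the real base library will differ.

```python
class Field:
    """Base field: parse() turns a raw scraped value into a clean one."""

    def parse(self, value):
        return value


class Text(Field):
    def parse(self, value):
        if value is None:
            return None
        # Collapse whitespace; treat an empty string as missing.
        return " ".join(str(value).split()) or None


class Integer(Field):
    def parse(self, value):
        if value is None or value == "":
            return None
        # Accept values such as "1,200".
        return int(str(value).replace(",", ""))


class Record:
    table = None  # subclasses name their (hypothetical) warehouse table
    fields = {}   # subclasses map field names to Field instances

    def __init__(self, url, data):
        self.url = url
        self.data = data

    @classmethod
    def create(cls, url, data):
        """Parse every declared field from the raw data and build a record."""
        parsed = {name: field.parse(data.get(name)) for name, field in cls.fields.items()}
        return cls(url, parsed)

    def write(self, conf, conn):
        # conf is accepted to match the interface but unused in this sketch.
        # conn is assumed to expose insert(table, row); that API is hypothetical.
        conn.insert(self.table, {"url": self.url, **self.data})


class TrialRecord(Record):
    """Hypothetical record for one example source."""

    table = "example_source"
    fields = {"title": Text(), "target_sample_size": Integer()}
```

For instance, TrialRecord.create(url, {"title": "  A  trial ", "target_sample_size": "1,200"}) would hold the title "A trial" and the integer 1200, ready to be written.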

 

Using Scrapy in Collectors

Collectors can run any Python code to get and write records. However, for many cases the most convenient way is to use the Scrapy framework.

Scraper

Scrapy is a widely used Python framework for web scraping. It manages many tasks including request scheduling, page deduplication, timeouts, and auto throttling (see the Scrapy documentation for more details). In our case, we are interested in the following features:

  • spider – to get HTTP responses for the data we’re interested in
  • parser – to get a Record from an HTTP response from the spider

We use Scrapy’s programming interface to implement this functionality in our collectors. Our base collectors library provides a Scrapy pipeline to write the scraped records to our data warehouse.
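As a rough picture of what such a pipeline does, here is a plain-Python sketch in the process_item(item, spider) shape that Scrapy item pipelines have conventionally used. The RecordPipeline name and its constructor are assumptions, it is not the pipeline shipped in the base library, and hooking it into a real Scrapy project would need extra settings that are left out here.

```python
class RecordPipeline:
    """Sketch of a Scrapy-style item pipeline that writes scraped records."""

    def __init__(self, conf, conn):
        self.conf = conf
        self.conn = conn
        self.written = 0  # simple counter, handy for logging

    def process_item(self, item, spider):
        # Each item is expected to be a record exposing write(conf, conn).
        item.write(self.conf, self.conn)
        self.written += 1
        # Return the item so that any later pipeline stage still receives it.
        return item
```

Keeping the write step in the pipeline means the spider and parser only have to produce records, and every collector built on Scrapy stores them the same way.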

 

Conclusion

This was a high-level introduction to the OpenTrials data collection framework. In the future we will write about some specific aspects in more detail, including using Scrapy, or about the next data pipeline stages (e.g. data processing and linking).

All our code is open-sourced. Feel free to explore, contribute, and use it:

https://github.com/opentrials/collectors

We really appreciate any feedback, ideas and comments.

Thanks!

 

Evgeny Karev

Senior Developer, Open Knowledge International

[email protected]

Twitter: @opentrials