A vital part of the OpenTrials project is data collection. We gather data from various sources, then process and link it in the next stages of our data pipeline. Let’s take a look at our data collection system.

Overview

For data acquisition we use components called collectors:

[Diagram: collectors]

Key facts about collectors:

  - Each collector runs as a daemon on Docker Cloud and is always restarted after exiting, so data collection is continuous and stays in sync with the specific data source.
  - Because we have to deal with very inconsistent data sources, there are no strict rules about how collectors should be implemented, except that they must comply with the collect(conf, conn, *args) function interface.

However, the usual flow is:

  1. Check the most recent update date of collected data records
  2. Iteratively collect data, starting from this date
  3. Log any iteration errors without exiting
  4. Sleep and exit after finishing

After the collector exits, its Docker container is automatically restarted and the cycle begins again. This way, our collectors run continuously, fetching new and updated data from external sources.
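
As an illustration, here is a minimal sketch of a collector following this flow. It is not taken from the real codebase: the helper functions and the SLEEP_INTERVAL setting are hypothetical, and conf is assumed to be dict-like.

    import logging
    import time

    logger = logging.getLogger(__name__)


    def collect(conf, conn, *args):
        """A minimal collector for a hypothetical source, following the usual flow."""

        # 1. Check the most recent update date of collected data records
        last_update = _get_last_update_date(conn)

        # 2. Iteratively collect data, starting from this date
        for item in _fetch_items_since(last_update):
            try:
                record = _make_record(item)
                record.write(conf, conn)
            except Exception:
                # 3. Log any iteration errors without exiting
                logger.exception('Failed to collect item: %r', item)

        # 4. Sleep and exit after finishing; Docker then restarts the container
        time.sleep(conf.get('SLEEP_INTERVAL', 3600))  # assumes a dict-like conf


    # Placeholders for source-specific logic (hypothetical helpers)
    def _get_last_update_date(conn):
        # e.g. query the warehouse for the latest collected update date
        return None


    def _fetch_items_since(last_update):
        # e.g. page through the source API or files
        return []


    def _make_record(item):
        # map a raw source item onto a record (see the next section)
        raise NotImplementedError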


Collectors in Detail

Even though each collector is written specifically for a different source, they share a lot of common functionality. To avoid repeating ourselves, we extracted it into a shared base collectors library.

The most important concept here is a Record defined using Fields:

[Diagram: Record]

A record represents an atomic unit of data to collect. The class reflects the source’s data model, so it differs from source to source.

Collectors use the Record.create(url, data) factory method to create a record. All passed data is parsed using the corresponding field types. Records are then written to the warehouse using the record.write(conf, conn) method.
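
To make this concrete, below is a self-contained toy sketch of the Record/Field pattern described above. It mimics the create/write call shapes from the text but is not the real base library code; the field classes, their parsing rules, and the Trial example are all illustrative.

    from datetime import datetime


    class Field:
        """Base field: knows how to parse one raw value."""
        def parse(self, value):
            return value


    class Text(Field):
        def parse(self, value):
            return str(value).strip()


    class Date(Field):
        def __init__(self, fmt):
            self.fmt = fmt

        def parse(self, value):
            return datetime.strptime(value, self.fmt).date()


    class Record:
        """An atomic data unit; subclasses declare Fields reflecting the source model."""

        @classmethod
        def create(cls, url, data):
            record = cls()
            record.url = url
            # Parse every passed value with its corresponding field type
            for name, field in vars(cls).items():
                if isinstance(field, Field) and name in data:
                    setattr(record, name, field.parse(data[name]))
            return record

        def write(self, conf, conn):
            # The real library writes/updates the record in the warehouse here;
            # this toy version only shows the call shape.
            print('writing', vars(self))


    # A source-specific record (illustrative field names)
    class Trial(Record):
        identifier = Text()
        title = Text()
        registration_date = Date('%Y-%m-%d')


    record = Trial.create('http://example.com/trials/1', {
        'identifier': 'ID000001',
        'title': 'An example trial',
        'registration_date': '2016-08-01',
    })
    record.write(conf={}, conn=None)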


Using Scrapy in Collectors

Collectors can run any Python code to fetch and write records. In many cases, however, the most convenient approach is to use the Scrapy framework.

[Diagram: Scraper]

Scrapy is the most popular Python framework for web scraping. It takes care of many tasks, including request scheduling, page deduplication, timeouts, and auto throttling (see the Scrapy documentation for more details). These are exactly the kinds of features our collectors need.

We use Scrapy’s programming interface to implement this functionality in our collectors, and our base collectors library provides a Scrapy pipeline that writes the scraped records to our data warehouse.
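
For illustration, here is a minimal sketch of how a Scrapy-based collector could be wired up: a spider yields scraped items and an item pipeline writes them out. The spider name, URLs, CSS selectors, and the WarehousePipeline class are all hypothetical; the real pipeline ships with the base collectors library.

    import scrapy


    class RegistrySpider(scrapy.Spider):
        """Hypothetical spider for one trial registry."""
        name = 'example_registry'
        start_urls = ['http://example.com/trials?page=1']

        def parse(self, response):
            # Yield one item per trial listed on the page
            for row in response.css('.trial'):
                yield {
                    'identifier': row.css('.id::text').get(),
                    'title': row.css('.title::text').get(),
                    'url': response.urljoin(row.css('a::attr(href)').get()),
                }
            # Follow pagination, letting Scrapy handle scheduling and throttling
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)


    class WarehousePipeline:
        """Sketch of an item pipeline that writes scraped items to the warehouse."""

        def open_spider(self, spider):
            # In practice these come from the collector's configuration
            self.conf, self.conn = {}, None

        def process_item(self, item, spider):
            # Map the item onto a record and write it, e.g.:
            # Trial.create(item['url'], item).write(self.conf, self.conn)
            return item

Such a pipeline is enabled through Scrapy’s ITEM_PIPELINES setting, so the spider itself stays focused on extracting data.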


Conclusion

This was a high-level introduction to the OpenTrials data collection framework. In future posts we will cover specific aspects in more detail, such as using Scrapy, as well as the next stages of the data pipeline (e.g. data processing and linking).

All our code is open-sourced. Feel free to explore, contribute, and use it:

https://github.com/opentrials/collectors

We really appreciate any feedback, ideas and comments.

Thanks!


Evgeny Karev

Senior Developer, Open Knowledge International

[email protected]

Twitter: @opentrials
