A vital part of the OpenTrials project is data collection. We collect data from various sources, then process and link it in the later stages of our data pipeline. Let’s take a look at our data collection system:
For data acquisition we use components called collectors:
Key facts about collectors:
- Written in Python
- Executed in isolated Docker containers (running independently)
- Deployed to Docker Cloud using Travis CI/CD
Each collector runs as a daemon on Docker Cloud and always restarts after exiting. This ensures continuous data collection, keeping us in sync with each data source. Because our data sources are very inconsistent, there are no strict rules about how a collector should be implemented, except that it must comply with the collect(conf, conn, *args) function interface. However, the usual flow is:
- Check the most recent update date of collected data records
- Iteratively collect data, starting from this date
- Log any iteration errors without exiting
- Sleep and exit after finishing
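The flow above can be sketched as a minimal collector. The helper functions here (get_last_update_date, iter_source_records, write_record) are illustrative stand-ins, not part of the real library; only the collect(conf, conn, *args) interface comes from the text:

```python
import logging
import time

# Illustrative stand-ins for a real source and warehouse.
def get_last_update_date(conn):
    """Most recent update date among already-collected records."""
    return conn.get('last_update')

def iter_source_records(conf, since):
    """Yield raw records from the external source, starting from `since`."""
    yield {'id': 1, 'updated': '2016-02-01'}
    yield {'id': 2, 'updated': '2016-03-01'}

def write_record(conn, raw):
    conn.setdefault('records', []).append(raw)

def collect(conf, conn, *args):
    # 1. Check the most recent update date of collected data records
    since = get_last_update_date(conn)
    # 2. Iteratively collect data, starting from this date
    for raw in iter_source_records(conf, since=since):
        try:
            write_record(conn, raw)
        except Exception:
            # 3. Log any iteration errors without exiting
            logging.exception('Failed to collect record')
    # 4. Sleep and exit (the container restart starts the next cycle)
    time.sleep(conf.get('SLEEP_SECONDS', 0))
```

Keeping the error handling inside the loop is what lets a single bad record fail without stopping the whole collection run.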
After the collector exits, its Docker container is automatically restarted and the cycle begins again. Using this flow, our collectors run continuously, fetching new data and updating existing data from external sources.
Collectors in Detail
Even though each collector is written for a specific source, they share a lot of common functionality. To avoid repeating ourselves, we extracted it into a base collectors library, which includes:
- Record – data model
- fields – data model fields
- helpers – various helpers
The most important concept here is a Record defined using Fields:
A record represents an atomic unit of data to collect. This class reflects the source’s data model, so it differs from source to source.
Collectors use the Record.create(url, data) factory method to create a record. All passed data is parsed using the corresponding field types. Records are then written to the warehouse using the record.write(conf, conn) method.
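A minimal sketch of how this might look. The internals of Field and Record below are illustrative assumptions; only the Record.create(url, data) and record.write(conf, conn) interface comes from the text, and the Trial record and its fields are hypothetical:

```python
class Field:
    """Illustrative field type: parses raw source values on creation."""
    def __init__(self, cast=str):
        self.cast = cast

    def parse(self, value):
        return self.cast(value)


class Record:
    """Illustrative base record; subclasses declare Fields as class attributes."""

    @classmethod
    def create(cls, url, data):
        record = cls()
        record.url = url
        # Parse every passed value using the corresponding field type
        for name, field in vars(cls).items():
            if isinstance(field, Field):
                setattr(record, name, field.parse(data[name]))
        return record

    def write(self, conf, conn):
        # Stand-in for writing the record to the warehouse
        conn.setdefault('warehouse', []).append(vars(self))


class Trial(Record):
    """Hypothetical record reflecting one source's data model."""
    identifier = Field(str)
    enrollment = Field(int)


conn = {}
record = Trial.create('http://example.com/trial/1',
                      {'identifier': 'NCT001', 'enrollment': '120'})
record.write({}, conn)  # conn['warehouse'] now holds the parsed record
```

The point of the factory method is that all type coercion happens in one place: by the time write is called, every field already has its parsed type.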
Using Scrapy in Collectors
Collectors can run any Python code to get and write records. However, for many cases the most convenient way is to use the Scrapy framework.
Scrapy is the most popular Python framework for data scraping. It manages many tasks including request scheduling, duplicate request filtering, timeouts, and auto-throttling (see the Scrapy documentation for more details). In our case, we are interested in the following features:
- spider – to get HTTP responses for the data we’re interested in
- parser – to get a Record from an HTTP response from the spider
We use Scrapy’s programming interface to implement this functionality in our collectors. Our base collectors library provides a Scrapy pipeline to write the scraped records to our data warehouse.
This was a high-level introduction to the OpenTrials data collection framework. In future posts we will cover specific aspects in more detail, such as using Scrapy and the next data pipeline stages (e.g. data processing and linking).
All our code is open-sourced. Feel free to explore, contribute, and use it:
We really appreciate any feedback, ideas and comments.
Senior Developer, Open Knowledge International