CC Open Source Blog

Data flow: from API to DB


by K S Srinidhi Krishna on 2020-07-22

This blog is part of the series: GSoC 2020: CC catalog

Introduction

The CC Catalog project handles the flow of image metadata from the source, or provider, into the database, from which the data is surfaced to the CC Search tool. A workflow is set up for each provider to gather metadata about CC-licensed images, and these workflows are orchestrated with Apache Airflow, an open source tool for scheduling and monitoring workflows.

Airflow intro

Apache Airflow is an open source tool that helps us schedule tasks and monitor workflows. It provides an easy-to-use UI that makes managing tasks straightforward. In Airflow, the tasks we want to schedule are organised into DAGs (Directed Acyclic Graphs). A DAG is a collection of tasks together with the dependencies defined among them, so that the tasks run in an organised manner. DAG files are standard Python files that are loaded from the configured DAG_FOLDER on a host. Airflow picks up every Python file in the DAG_FOLDER that defines a DAG instance globally and executes it to create the corresponding DAG object.
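As a rough illustration of what such a file looks like, here is a minimal sketch of a DAG with two tasks and a dependency between them. The DAG id, task ids, and callables are invented for this example and are not taken from the CC Catalog code.

```python
# A minimal sketch of an Airflow DAG file; all names are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def pull_metadata():
    print("pull metadata from a provider API")


def write_tsv():
    print("write the metadata to a local TSV file")


# The DAG instance defined at the global level is what Airflow picks up.
dag = DAG(
    dag_id="example_provider_workflow",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

pull_task = PythonOperator(task_id="pull_metadata", python_callable=pull_metadata, dag=dag)
write_task = PythonOperator(task_id="write_tsv", python_callable=write_tsv, dag=dag)

# The dependency: pull first, then write.
pull_task >> write_task
```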

CC Catalog Workflow

In the CC Catalog, Airflow is set up inside a Docker container along with the other services. The loader and provider workflows live in the dags directory of the repository, which serves as the DAG folder. Provider workflows pull metadata about CC-licensed images from their respective providers; the pulled data is structured into a standardised format and written to a TSV (Tab Separated Values) file on local disk. These TSV files are then loaded into S3 and finally into the PostgreSQL database by the loader workflow.
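To make the "standardised format" more concrete, each provider script ultimately writes one row per image to a tab-separated file. The columns below are illustrative only and do not reproduce the catalog's actual schema.

```python
# Sketch of writing standardised image metadata to a local TSV file.
# The column set and file name here are illustrative, not the catalog's real schema.
import csv

FIELDS = ["foreign_identifier", "foreign_landing_url", "image_url", "license", "license_version"]

rows = [
    {
        "foreign_identifier": "12345",
        "foreign_landing_url": "https://example.com/image/12345",
        "image_url": "https://example.com/image/12345.jpg",
        "license": "by",
        "license_version": "4.0",
    },
]

with open("example_provider_20200722.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
    for row in rows:
        writer.writerow(row)
```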

Provider API workflow

The provider workflows are usually scheduled at one of two frequencies: daily or monthly.

Providers such as Flickr or Wikimedia Commons, which can be filtered using a date parameter, are usually scheduled as daily jobs. These providers have a large volume of continuously changing data, so daily updates are required to keep the data in sync.
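A hedged sketch of what such a daily, date-filtered workflow can look like: Airflow passes each run's execution date to the task, and that date is forwarded to the provider ingestion function. The function and DAG names here are hypothetical.

```python
# Sketch of a daily provider DAG that forwards Airflow's execution date
# to the ingestion callable; `ingest_flickr_for_date` is a made-up name.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def ingest_flickr_for_date(ds, **kwargs):
    # `ds` is the execution date as YYYY-MM-DD, supplied by Airflow.
    print(f"pull Flickr images uploaded on {ds}")


dag = DAG(
    dag_id="flickr_workflow_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

PythonOperator(
    task_id="ingest_for_date",
    python_callable=ingest_flickr_for_date,
    provide_context=True,  # passes ds/execution_date into the callable (Airflow 1.10)
    dag=dag,
)
```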

Providers scheduled for monthly ingestion are those with a relatively low volume of data, or those for which filtering by date is not possible, which means the entire collection has to be ingested at once. Examples are museum providers like the Science Museum UK or the Statens Museum for Kunst; we don't expect museum providers to change their data on a daily basis.

The scheduling of the DAGs by the scheduler daemon depends on a few parameters of the DAG, chiefly its start_date and schedule_interval.

Example: the Cleveland Museum is currently scheduled for a monthly crawl with a start date of 2020-01-15.

[Image: the cleveland_museum_workflow DAG]
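A minimal sketch of how such a monthly schedule is declared; only the start_date and schedule_interval arguments are the point here, and the dag_id simply echoes the example above.

```python
# Only start_date and schedule_interval matter for the scheduling behaviour
# described above; the rest is boilerplate for the sketch.
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="cleveland_museum_workflow",
    start_date=datetime(2020, 1, 15),
    schedule_interval="@monthly",  # one run per month, starting after 2020-01-15
)
```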

Loader workflow

The data from the provider scripts is not loaded directly into S3. Instead, it is stored in a TSV file on the local disk, and the tsv_postgres workflow handles loading the data to S3 and, eventually, to PostgreSQL. The DAG starts with a task that stages the oldest TSV file from the output directory of the provider scripts into the staging directory. Next, two tasks run in parallel: one loads the TSV file in the staging directory to S3, while the other creates the loading table in the PostgreSQL database. Once the data has been loaded to S3 and the loading table has been created, the data from S3 is loaded into the intermediate loading table and then finally inserted into the image table.

If loading from S3 fails, the data is loaded to PostgreSQL from the locally stored TSV file. When the data has been successfully transferred to the image table, the intermediate loading table is dropped and the TSV files in the staging directory are deleted. If copying the TSV files to S3 or loading them into the database fails, those files are moved to the failure directory for future inspection.
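A simplified sketch of that task ordering as it could be expressed in an Airflow DAG. The task ids and callables are invented for illustration and do not mirror the real loader workflow code, and the fallback path for a failed S3 load is omitted.

```python
# Sketch of the loader workflow's task ordering; all names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def stage_oldest_tsv():
    print("move the oldest TSV from the provider output directory to the staging directory")


def load_to_s3():
    print("copy the staged TSV to S3")


def create_loading_table():
    print("create the intermediate loading table in PostgreSQL")


def load_s3_to_postgres():
    print("load the data from S3 into the loading table, then insert it into the image table")


def clean_up():
    print("drop the loading table and delete the staged TSV")


dag = DAG(
    dag_id="tsv_postgres_workflow_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # triggered as needed in this sketch
)

stage = PythonOperator(task_id="stage_oldest_tsv", python_callable=stage_oldest_tsv, dag=dag)
to_s3 = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3, dag=dag)
create_table = PythonOperator(task_id="create_loading_table", python_callable=create_loading_table, dag=dag)
load_db = PythonOperator(task_id="load_s3_to_postgres", python_callable=load_s3_to_postgres, dag=dag)
cleanup = PythonOperator(task_id="clean_up", python_callable=clean_up, dag=dag)

# Staging first, then the S3 copy and table creation in parallel,
# then the load into PostgreSQL, then cleanup.
stage >> [to_s3, create_table]
[to_s3, create_table] >> load_db
load_db >> cleanup
```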

[Image: the loader workflow DAG]

Acknowledgement

I would like to thank Brent Moran for helping me write this blog post.