Visualize CC Catalog data - data processing

This blog is part of the series: GSoC 2019: The Linked Commons

ℹ️ 2023-08-31: This project was archived along with the shuttering of CC Search (now Openverse). Please also see the Quantifying the Commons project.

Welcome to the data processing part of the GSoC project! In this blog post, I am going to tell you about my first impressions of working with the real data and give you some details of the implementation developed so far.

Data Extraction

Each month, Creative Commons uses Common Crawl data to find all domains that contain CC licensed content. As you might guess, the amount of data is very large, so the CC Catalog data is stored in S3 buckets, and Apache Spark is used to extract it from Common Crawl.

Spark is used again in this project to extract the data, in the form of parquet files, from the buckets. To facilitate the analysis and processing of the data, the files are converted to TSV (tab-separated values). The dataset I work on contains fields such as the provider domain, the links found on each page, and the CC license URL.
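
For reference, here is a minimal sketch of what a parquet-to-TSV conversion could look like with PySpark. This is not the project's actual pipeline; the bucket path and output directory are placeholders.

# Sketch: convert parquet files to TSV with PySpark (paths are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-catalog-extract").getOrCreate()

# Read the parquet files from a (hypothetical) S3 location.
df = spark.read.parquet("s3://example-bucket/cc-catalog/")

# Write the same data out as tab-separated files with a header row.
(df.write
   .option("sep", "\t")
   .option("header", True)
   .csv("cc_catalog_tsv"))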

Each file can easily contain tens of millions of rows. My first approach was to load the data into a Pandas DataFrame, but this can become very slow. Therefore, I will test the data processing scripts on a portion of the real data, and afterwards I will use Dask with the entire dataset. Dask provides advanced parallelism for analytics and exposes an interface similar to Pandas.
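
As an illustration, this is a minimal Dask sketch, assuming the TSV files sit in a local data/ directory; the file pattern is illustrative, not the project's real layout.

# Sketch: load the TSV files lazily with Dask.
import dask.dataframe as dd

# Dask partitions the files and only computes results when asked,
# so the full dataset never has to fit in memory at once.
df = dd.read_csv("data/cc_catalog_*.tsv", sep="\t")

print(df.npartitions)  # number of partitions processed in parallel
print(len(df))         # triggers the actual computation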

Cleansing and Filtering

This step is about preparing the data for analysis and reducing its volume, in order to get a meaningful visualization. The data that comes from the S3 buckets is actually pretty clean (no strange characters, for example, or incomplete rows). Nevertheless, as a first step, duplicate rows (those sharing the same URL) are deleted, as in the sketch below. Next, I develop pruning rules to reduce the dataset further.
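
Here is a rough sketch of the deduplication step with Dask; the URL column name (provider_domain) is an assumption about the schema, not a confirmed field name.

# Sketch: drop rows that share the same URL.
import dask.dataframe as dd

df = dd.read_csv("data/cc_catalog_*.tsv", sep="\t")

# Keep only the first occurrence of each URL.
df = df.drop_duplicates(subset=["provider_domain"])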

Formatting Domain Names

In the dataset, we have domain names in the form of URLs, but we want the labels in the graph to look clean. This is why I am going to extract the domain name from the URLs in the dataset. For this purpose, I use tldextract, a simple and complete open source library for extracting the parts of a domain (e.g. suffix, subdomain, domain name). This package is also available on conda-forge. Here is how tldextract works:

>>> import tldextract
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')  # the extracted domain name is "bbc"

The main part is the extraction of the domain name. This will be applied to the provider_domain and links fields in order to build the graph. The domain names will be the ones displayed over the nodes, as depicted in my first blog post.
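
As a hedged illustration, this is how the extraction could be applied over a column with Pandas; the sample URLs and the domain_name helper are made up for the example.

# Sketch: map URLs in a column to their registered domain names.
import pandas as pd
import tldextract

df = pd.DataFrame({"provider_domain": ["http://forums.bbc.co.uk/path",
                                       "https://blog.flickr.net/en"]})

def domain_name(url):
    """Return only the registered domain part of a URL, e.g. 'bbc'."""
    return tldextract.extract(url).domain

df["domain_name"] = df["provider_domain"].apply(domain_name)
print(df["domain_name"].tolist())  # ['bbc', 'flickr']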

License Validation

Another important aspect is the license types. In the dataset, we do not have the exact license name; rather, we have a URL that points to the license definition on creativecommons.org. We have developed a function with regular expressions that validates the format of these URLs and extracts the license name and version from them. This information will be shown in the pie chart that appears after the user clicks on a node.

'https://creativecommons.org/licenses/by/4.0/'  # valid license URL
'https://creativecommons.org/licenses/zero/'    # invalid license URL
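
To make the idea concrete, here is a sketch (not the project's actual function) of validating such URLs with a regular expression and pulling out the license name and version:

# Sketch: validate a CC license URL and extract its name and version.
import re

LICENSE_RE = re.compile(
    r"^https?://creativecommons\.org/licenses/"
    r"(?P<name>by|by-sa|by-nd|by-nc|by-nc-sa|by-nc-nd)/"
    r"(?P<version>\d\.\d)/?$"
)

def parse_license(url):
    """Return (license_name, version) if the URL is valid, otherwise None."""
    match = LICENSE_RE.match(url)
    if match is None:
        return None
    return match.group("name"), match.group("version")

print(parse_license("https://creativecommons.org/licenses/by/4.0/"))  # ('by', '4.0')
print(parse_license("https://creativecommons.org/licenses/zero/"))    # None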

Coming Soon

You can follow the project's development in the GitHub repo.

CC Data Catalog Visualization is my GSoC 2019 project under the guidance of Sophine Clachar, who has been greatly helpful and considerate since the GSoC application period. Also, my backup mentor, Breno Ferreira, and engineering director Kriti Godey have been very supportive.

Have a nice week!

Maria