CC Open Source Blog

Visualize CC Catalog data - data processing part 2

This blog is part of the series: GSoC 2019: The Linked Commons

This is a continuation of my last blog post about the data processing part of the CC-data catalog visualization project. I recommend you to read that last post for a better understanding of what I'll explain here.

The data

Every dataset needs cleasing and pre processing operations before their analysis. In order to implement validations, I have to know first with what kind of inconsistencies I would deal with. Here are some interesting insights about the dataset:

Aside from the above, I have had to face with almost empty lines(meaning just a single column had information), columns bad separated (not a single but more than one tab between the columns), and some other usual problems of a real and non perfect dataset. I have made validations to catch these inconsistencies.

Data aggregation

It is needed to aggregate the data by provider_domain, in order to get the complete information of every node. Aggregating the images column is simple, as I only have to sum the values in that column. Now the links column is a little bit tricky to be aggregated. We have to remember that this field contains dictionaries, with domains as keys and the times they have been referenced to as values. So for aggregating this column, I need to:

Then, I have to extract creativecommons from the final links dictionary, and put the value into another column, called _Licences_qty_. This is because the quantity of links to can tell us how many licenses the provider_domains uses.

We also need to aggregate the licences column. The goal is to have a data structure that contains the licenses types the provider_domain uses, and to know how many licenses per each license type the provider_domain has. To achieve this, I:

At the end, we will have rows like the following:

Example row of the processed dataset
Example row, with data aggregated.

Considerations and future challenges

I mentioned before that there are provider domains with a lot of images and a few links, and vice versa. As I still have to prune and filter data, I can develop a rule to exclude the domains that are not relevant to the graph. This relevance can be determined by the quantity of images and/or links. My thought with the rules are the following:

The thresholds for the quantity of images and links are my intuitions from having seen the data and manually checking some provider domains. If it is possible I could validate it with some data analysis (checking average, maximum and minimum values of the columns).

Coming soon

You can follow the project development in the Github repo.

CC Data Catalog Visualization is my GSoC 2019 project under the guidance of Sophine Clachar, who has been greatly helpful and considerate since the GSoC application period. Also, my backup mentor, Breno Ferreira, and engineering director Kriti Godey, have been very supportive.

Have a nice week!