CC Open Source Blog

Combating dead links with CC Link Checker

by Bhumij Gupta on 2019-07-15

Creative Commons provides a vast number of public copyright licenses for people who want to enable free distribution of their work. Creative Commons licenses currently cover over 1.6 billion resources. These licenses are translated into multiple languages and ported to different jurisdictions for international use. People link to the respective licenses alongside their licensed works. The licenses are stored as HTML files in the creativecommons/creativecommons.org repo.

Problem Statement

These license files contain links to their deeds, to translations of the license in other languages, to internal pages, and many more. Sometimes, due to errors, these files may contain dead or broken links. Broken links lead to an incorrect or incomplete understanding of the license clauses and permissions by the viewer, which may in turn lead to incorrect use of the licensed resource.

At the time of writing, the repo contains over 930 files with an average of 50 links per file, which comes to roughly 46,500 links in total. New translations of license deeds are regularly added to the repo, and the existing license deeds are also updated frequently. Manually testing these files would take a lot of time, and with future additions of licenses, translations, and jurisdiction ports, the time required for manual testing would increase drastically.

CC Link Checker aims to solve this problem by automating the task of checking links in the licenses and reporting erroneous and broken links. The Python script scrapes all the licenses from the repo and checks the status of the links in the files. The script checks each link for 40X errors, timeout errors, and invalid schema errors.
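To make the three error classes concrete, here is a minimal sketch of how a single link could be classified. It is not the project's actual code: the names `fetch_status` and `check_link`, the status labels, and the use of the standard library's `urllib` are all assumptions made for illustration. The `fetch` parameter is a hypothetical hook that lets the classification logic be exercised without touching the network.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def fetch_status(url, timeout=10.0):
    """Send a HEAD request and return the HTTP status code (hypothetical helper)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the code is still what we want.
        return error.code


def check_link(url, fetch=fetch_status):
    """Classify a link as 'ok', 'invalid schema', 'timeout', 'unreachable' or 'HTTP 4xx'."""
    # Anything that is not http(s) is reported as an invalid schema (an assumption).
    if urlsplit(url).scheme not in ("http", "https"):
        return "invalid schema"
    try:
        status = fetch(url)
    except urllib.error.URLError as error:
        # Connection-phase timeouts arrive wrapped inside URLError.reason.
        if isinstance(error.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
    except (socket.timeout, TimeoutError):
        # Timeouts while reading the response are raised directly.
        return "timeout"
    if 400 <= status < 500:
        return f"HTTP {status}"
    return "ok"
```

Calling `check_link("htp://example.com/")` returns `"invalid schema"` without making a request, because the mistyped scheme is rejected before any network call.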

First, let's get the features out of the way. The script uses multiprocessing to make full use of multi-core processors, has two modes of CLI output (default and verbose), and can also print the error links, a summary of the result, and a mapping of error links to the URLs where they occur in a file.
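The combination of multiprocessing and an error-link mapping can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's real structure: `invert_links`, `map_errors_to_files`, and `check_in_parallel` are hypothetical names, file names stand in for the places where a link occurs, and `check` is assumed to be any picklable top-level function that returns `"ok"` for a working link. Deduplicating links first means each URL is checked once even if it appears in many license files.

```python
from collections import defaultdict
from multiprocessing import Pool


def invert_links(links_by_file):
    """Map each unique link to the sorted list of files that contain it."""
    files_by_link = defaultdict(set)
    for filename, links in links_by_file.items():
        for link in links:
            files_by_link[link].add(filename)
    return {link: sorted(files) for link, files in files_by_link.items()}


def map_errors_to_files(files_by_link, statuses):
    """Keep only links whose status is not 'ok', along with where they occur."""
    return {
        link: {"status": statuses[link], "files": files}
        for link, files in files_by_link.items()
        if statuses[link] != "ok"
    }


def check_in_parallel(links_by_file, check, processes=None):
    """Check every unique link in a worker pool and report the failing ones.

    `check` must be defined at module top level so it can be pickled and
    sent to the worker processes; `processes=None` uses all available cores.
    """
    files_by_link = invert_links(links_by_file)
    links = sorted(files_by_link)
    with Pool(processes=processes) as pool:
        results = pool.map(check, links)
    return map_errors_to_files(files_by_link, dict(zip(links, results)))
```

When run as a script, the call to `check_in_parallel` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.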

Now let's hop on the nerd train and take a deeper look at the development journey.

Development Journey

I started working on the project a month ago, on 13 June. Along the way there were many ups and downs, with some productive days and some totally unproductive ones. For better understanding, let's look at the journey week by week.

Future work

CC Link Checker is only possible thanks to the support and guidance of my mentors Alden Page and Timid Robot Zehta, who have been very supportive at every step of the project. I would also like to thank engineering director Kriti Godey for her continuous support.

You can follow the project on GitHub: creativecommons/cc-link-checker. You can also join the discussion in #cc-link-checker on Slack.

The project is approaching its completion. Can't wait to see it in production.

Signing off, Bhumij Gupta