CC Open Source Blog

Data Science Discovery: Quantifying the Commons

University of California, Berkeley, Data Science Discovery Program Fall 2022

Project Objective

Problem Statement

From 2014 to 2017, Creative Commons (CC) released public reports detailing the growth, size, and usage of the commons, demonstrating the significance and influence of Creative Commons. However, the effort to quantify Creative Commons ceased in the following year. These reports are the preincarnation of our current open-source project: Quantifying the Commons.

An example visualization from the previous report: the 2017 State of the Commons data graph

The reason is that prior efforts to generate usage reports suffered from unreliable data retrieval methods. Besides being prone to breaking whenever a data source updated its website architecture, these extraction methods were not particularly performant and were significantly slower than current methods (on the scale of 5 business days versus an hour).

To advance and continue the work of quantifying the state of CC tools, the student researchers were delegated the design and implementation of reliable data retrieval processes for the CC data sources employed in previous reports, replicating the past efforts of this project's preincarnation and quantifying the size and diversity of CC product usage on the Internet.

Data Retrieval

How to detect the count of CC-licensed documents?

If an online document is protected by a CC tool, it will either be labeled as licensed under that tool or contain a hyperlink to a creativecommons.org webpage that explains the license's rules (the deed).

Therefore, we may use the following approach to identify and count CC-licensed documents:

  1. Select a list of CC tools to inspect (provided by CC).
  2. Use the APIs of different online platforms to detect and count documents that are labeled as licensed by the platform and/or contain a hyperlink to CC license webpages (see the sketch after this list).
  3. Store these counts in tabular form, recording the number of documents protected under each type of CC tool.
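
As an illustrative sketch of the hyperlink check in step 2 (not the project's exact implementation), a document's CC tool can be recognized by matching its links against the deed URL pattern; the `detect_cc_tool` helper below is hypothetical:

```python
import re

# Deed URLs look like https://creativecommons.org/licenses/by-nc/4.0/
# or https://creativecommons.org/publicdomain/zero/1.0/
CC_DEED = re.compile(
    r"creativecommons\.org/(?:licenses|publicdomain)/([a-z\-]+)/(\d\.\d)"
)

def detect_cc_tool(html: str) -> str | None:
    """Return a tool identifier such as 'by-nc 4.0', or None if no deed link."""
    match = CC_DEED.search(html)
    return f"{match.group(1)} {match.group(2)}" if match else None

assert detect_cc_tool(
    '<a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA</a>'
) == "by-sa 2.0"
```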

Which platforms do we collect counts from?

Here is a list of the online platforms we sampled document counts from, as well as the researcher delegated to each platform's data collection, visualization, and modeling in this project:

| Platforms Containing Webpages     | Platforms Containing Photos  | Platforms Containing Videos |
| --------------------------------- | ---------------------------- | --------------------------- |
| Google (Dun-Ming Huang)           | DeviantArt (Dun-Ming Huang)  | Vimeo (Dun-Ming Huang)      |
| Internet Archive (Dun-Ming Huang) | Flickr (Shuran Yang)         | YouTube (Dun-Ming Huang)    |
|                                   | MetMuseum (Dun-Ming Huang)   |                             |
|                                   | WikiCommons (Dun-Ming Huang) |                             |

Exploratory Data Analysis (EDA)

Here are some significant defects found in datasets across sampled platforms during EDA:

Flickr

Google Custom Search API

YouTube Data API

Expanding the Dataset

Here are the reasons and efforts behind expanding the dataset for the platforms that received more data:

Google Custom Search API

YouTube Data API

Visualization

Philosophies and Principles

The visualizations of Quantifying the Commons are meant to be communicative and exhibitory.

Some new aesthetics and principles we adopted (in response to, and as an enhancement of, prior efforts):

Exhibiting a Selection of Visualizations

Diagram 1C

Trend Chart of Creative Commons Usage on Google

There are now more than 2.7 billion webpages protected by Creative Commons and indexed by Google!

Diagram 2

Heatmap of the density of CC-licensed, Google-indexed webpages by country

In particular, Western Europe and the Americas enjoy much more robust use of Creative Commons documents in terms of quantity. Development in Asia and Africa should be encouraged.

Diagram 3C

Barplot for number of webpages protected by six primary CC licenses

We can see that Attribution (BY) and Attribution-NoDerivs (BY-ND) are popular licenses among the 3 billion documents sampled across the dataset.

Diagram 6

Barplot of CC-licensed documents across Free Culture and Non-Free Culture licenses

Roughly 45.3% of the documents under CC protection are covered by Free Culture legal tools.

Flickr Diagrams

Usage of CC licenses on Flickr is concentrated in Australia, Brazil, and the United States of America, while it is quite low in Asian countries.

Note: the sampling frame of these visualizations is locked to the first 4,000 search results for photos under each general license type.

Diagram 7A

Analysis of Creative Commons Usage on Flickr

CC BY-SA 2.0 license usage in Flickr pictures taken during 1962-2022

Diagram 7B

Flickr maximum views of pictures under all licenses

Photos on Flickr under the Attribution-NonCommercial-NoDerivs (BY-NC-ND) license have gained the highest maximum views, while usage of the Public Domain Mark has shown the strongest increasing trend in recent years.

Diagram 7C

Flickr yearly trend of all licenses 2018-2022

Diagram 7D

Flickr Photos under CC BY-NC-SA 2.0 and CC BY-NC 2.0: Category Keywords

Diagram 8

Number of works under Creative Commons Tools across Platforms

DeviantArt presents the largest number of works under Creative Commons licenses and tools, followed by Wikipedia and WikiCommons. The video count for YouTube is underestimated, as demonstrated in Diagram 11B.

Diagram 9B

Barplot of Creative Commons Protected Documents across Countries

Diagram 10

Barplot of Creative Commons Protected Documents across Languages

Diagram 11B

Trend Chart of the Cumulative Count of CC-Licensed YouTube Videos per Two-Month Interval

The orange line stands for the imputed values of new CC-licensed YouTube video counts based on linear regression, which was chosen as the imputation method because the count of CC-licensed documents on most other media platforms also grows linearly.
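
As a minimal sketch of this imputation, assuming scikit-learn and using toy arrays in place of the real bimonthly counts:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Index of each two-month interval; np.nan marks intervals whose new-video
# counts could not be retrieved (toy numbers, not the project's data).
periods = np.arange(10).reshape(-1, 1)
counts = np.array([120.0, 135, 150, np.nan, 180, 195, np.nan, 225, 240, 255])

observed = ~np.isnan(counts)
trend = LinearRegression().fit(periods[observed], counts[observed])

# Impute the missing intervals from the fitted linear trend (the orange
# line), then accumulate to obtain the cumulative counts the chart plots.
counts[~observed] = trend.predict(periods[~observed])
cumulative = np.cumsum(counts)
```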

Modeling

(A side track)

Objectives of Modeling

The models of this project aim to answer: "What is the license type of a webpage or web document, given its content?"

Individual researchers each attempted their own solutions, using different resources and metrics under different modeling contexts:

Model of Google Webpages (Dun-Ming Huang)

Model for Flickr Photos (Shuran Yang)

Training Process Summary: Google Model

Preprocessing Pipeline

  1. Deduplication
  2. Remove Non-English Characters
  3. URL, Punctuation ([^\w\s]), and Stopword Removal
  4. Remove Non-English Words
  5. Remove Short Words, Short Contents
  6. TF-IDF + SVD
  7. SMOTE (steps 6 and 7 are sketched below)
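
As a minimal sketch of how steps 6 and 7 chain together, assuming scikit-learn and imbalanced-learn (the component sizes and variable names are illustrative, not the project's exact hyperparameters):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

preprocess = Pipeline([
    ("tfidf", TfidfVectorizer()),             # step 6: TF-IDF term weights
    ("svd", TruncatedSVD(n_components=100)),  # step 6: SVD dimensionality reduction
    ("smote", SMOTE(random_state=1)),         # step 7: oversample rare license classes
])
# X_balanced, y_balanced = preprocess.fit_resample(cleaned_texts, license_labels)
```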

Model Selection

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB

LogisticRegression(
    penalty="l2",
    solver="liblinear",
    class_weight="balanced",
    C=0.1,
)
SVC(
    C=0.5,
    probability=True,
    kernel="poly",
    degree=1,
    class_weight="balanced",
)
RandomForestClassifier(
    class_weight="balanced_subsample",
    n_estimators=100,
    random_state=1,
)
GradientBoostingClassifier(
    n_estimators=5,
    random_state=1,
)
MultinomialNB(
    fit_prior=True,
    alpha=10,
)
In addition, a BERT-based neural network classifier was assembled from the following Keras layers:

  1. text : InputLayer
  2. preprocessing : KerasLayer
  3. BERT_encoder : KerasLayer
  4. dropout : Dropout
  5. classifier : Dense
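
As a sketch of how these five layers compose, following the standard TensorFlow Hub BERT classifier pattern (the model handles and `NUM_LICENSE_CLASSES` below are placeholder assumptions, not necessarily the project's exact choices):

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops used by the preprocessing layer)

PREPROCESS_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_URL = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2"
NUM_LICENSE_CLASSES = 10  # illustrative

text = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
preprocessing = hub.KerasLayer(PREPROCESS_URL, name="preprocessing")
encoder = hub.KerasLayer(ENCODER_URL, trainable=True, name="BERT_encoder")
net = encoder(preprocessing(text))["pooled_output"]
net = tf.keras.layers.Dropout(0.1, name="dropout")(net)
output = tf.keras.layers.Dense(
    NUM_LICENSE_CLASSES, activation="softmax", name="classifier"
)(net)
model = tf.keras.Model(text, output)
```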

Training Results

Testing Performances across Models by Top-k Accuracy

Training Process Summary: Flickr Model

Preprocessing Pipeline

  1. Deduplication
  2. Translation
  3. Stopword Removal, Lemmatization
  4. TF-IDF

Model Selection

from sklearn.svm import SVC

SVC(
    C=1.0,
    kernel="linear",
    gamma="auto",  # not used by the linear kernel
)

Training Results

An accuracy of 66.87% was reached.
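
For context, here is a minimal end-to-end sketch of such a model, assuming scikit-learn and hypothetical `train_texts`/`train_labels` splits, with the TF-IDF step from the preprocessing pipeline feeding the SVC above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

flickr_model = make_pipeline(
    TfidfVectorizer(),
    SVC(C=1.0, kernel="linear", gamma="auto"),
)
# Hypothetical splits of photo metadata text and license labels:
# flickr_model.fit(train_texts, train_labels)
# flickr_model.score(test_texts, test_labels)
```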

Next Steps

From Preincarnation to Present

Via the efforts addressed above, we have transformed the data retrieval process from one that was unstable, unexplored, and unavailable into an algorithmic, deterministic process that is reliable, documented, and interpretable. The visualizations have also become more exhibitory, concentrating on more effortfully extracted insights and looking at Creative Commons in greater depth and more remarkable breadth.

With significant re-implementations of, and design policies for, the Quantifying the Commons data retrieval process, visualizations can now be produced readily and immediately on command. Through these transformations of visualization production, Creative Commons will gain new insights into product development and eventual policies along the axes from which data was extracted. Furthermore, we expect the models to serve beyond the bounds of a machine learning product, as a means of drawing inferences about product usage.

Such efforts are a short jump start to the long-term reincarnation of Quantifying the Commons.

From Reincarnation onto Baton Touches

The current team would encourage the future team to improve the availability and user experience of our open-source data extraction methods via automation and by-batch data extraction, for which Dun-Ming has written a design policy. For modeling, the team also encourages building inference pipelines that apply ELI5 to the Logistic Regression models, as well as experimenting further with the loss function options of the Gradient Boosting Classifier. For Flickr, the writer of this post would like to suggest a data extraction method outside the Flickr API that still has access to Flickr media, such as the Google Custom Search API.

Additional Reading