CC Search

The largest open source product in CC’s portfolio is CC Search: a search engine for CC licensed and public domain creative works. The product involves the following components:

CC Catalog ingests and processes CC-licensed and public domain works, then makes that data available to the CC Catalog API. The CC Catalog API is publicly accessible and serves the data from the catalog to CC Search.
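As a rough illustration of how a client such as CC Search might ask the API for catalog data, the sketch below builds a search URL. The host (`api.example.org`), the `/v1/images` path, and the `q`, `license`, and `page_size` parameter names are all assumptions made for this example; check the CC Catalog API documentation for the actual endpoint and parameters.

```python
from urllib.parse import urlencode

# Hypothetical base URL: substitute the real CC Catalog API endpoint.
API_BASE = "https://api.example.org/v1/images"


def build_search_url(query, license_codes=None, page_size=20):
    """Build an image search URL (endpoint and parameter names are assumed)."""
    params = {"q": query, "page_size": page_size}
    if license_codes:
        # Assumed convention: several licenses sent as one comma-separated value.
        params["license"] = ",".join(license_codes)
    return f"{API_BASE}?{urlencode(params)}"


print(build_search_url("mountain lake", ["by", "cc0"]))
```

The function only constructs the URL, so it can be tried without network access; sending the request is left to whichever HTTP client you prefer.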

What Are We Up To?

The highlights on our roadmap for the current and upcoming quarters are as follows:

Q2 2020

Task Name | Task Description
OER Planning | Plan for potential integration of OER into CC Search through research and consideration of potential issues.
Improve Search Algorithm with Popularity Data Integration | Make changes to the search algorithm that incorporate image popularity data gathered from sources that provide it.
Plan search algorithm changes for new metadata [AWS Grant] | Plan out search algorithm changes to incorporate image metadata generated via AWS Rekognition.
Catalog Infrastructure Improvements | Improve data processing infrastructure in the Catalog by parallelizing loading and moving storage of data files from providers to S3.
Implement architecture for schema for new metadata [AWS Grant] | Update Catalog schema to include new metadata generated through AWS Rekognition.
Image Selection for Rekognition [AWS Grant] | Develop metrics for and select a set of ~100 million high quality images for which we'll generate additional metadata through AWS Rekognition.
Improve Catalog Deployment and Provisioning | Manage Catalog deployment and provisioning entirely through infrastructure as code.
Use Data Dumps for Wikimedia Ingestion | Switch our Catalog data ingestion for Wikimedia Commons to use the data dumps provided by Wikimedia instead of the MediaWiki API.
Improve Common Crawl Infrastructure | Update our Common Crawl provider infrastructure to (1) use Apache Airflow instead of AWS tools like Data Pipeline & Glue for processing data, and (2) unify provider processing to use the same base classes as API providers.
Improve Documentation for Community Contributors | Create better documentation for community contributors by consolidating internal and public documentation and making it available for everyone.
Machine Processing - Popularity Data to Catalog [AWS Grant] | Save popularity data (views, comments, uses, etc.) associated with images from our sources into the Catalog's database.
Move data cleaning pipeline from API to Catalog | Move our data cleaning code from the ingestion step of the API to the initial data processing step of the Catalog to eliminate unnecessary repetitive data cleaning.
User Reporting Strategy & Implementation | Build a frontend feature where users can report problematic content, along with backend support and an internal process for taking action on content that is reported as problematic.
Design Sprint: Audio UI for CC Search | Design and prototype an upcoming user interface for searching for audio on CC Search.
Audio Support and Integration | Design and user test UIs for audio. Ingest a pilot collection of audio into the Catalog and build support in the API. Integrate the design into the frontend to allow users to search for CC-licensed audio.
Internationalization Infrastructure | Build the infrastructure necessary for internationalization, to allow CC Search to be accessible in other languages.
Accessibility Improvements | Make accessibility improvements to the UI.
JSON Export to CC Open Source for Public Roadmap | Create a public version of the CC Search roadmap on CC Open Source.
Integration of Design Sprint: License Language Changes | Integrate License Language Changes into the CC Search frontend, including tooltips on license filters and adjustments to the language and CTAs on single result pages.
Design Sprint: Meta Search Integration | Integrate meta search functionality into CC Search for sources that are not currently indexed and content types we do not currently support.
CC Search HTML Embed | Design and build an embed of CC Search that can be placed on any website, as a starting point to discover objects in CC Search. Components from the Design Library must be used, with the goal of simplicity.
Offline old CC Search | Take Old Search (oldsearch.creativecommons.org) offline and redirect traffic to CC Search. Prior to this, add messaging on Old Search and support similar functionality on CC Search. See "Meta Search Integration" for related work.
Improved Support Pages | Improve the support pages on CC Search, including the Collections page, for a better experience. Add explanation text for collections and improve flow.
License Explanation/Compliance Improvements | Improve how and where we explain licenses, and consider ways to make it easier for reusers to understand and comply with license requirements.
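
The "Improve Search Algorithm with Popularity Data Integration" task above does not specify how popularity would be combined with relevance. One common approach, shown here purely as a sketch and not as CC Search's actual algorithm, is to boost a text-relevance score by a log-damped view count. The function name, the `popularity_weight` parameter, and the sample data are hypothetical.

```python
import math


def blended_score(relevance, views, popularity_weight=0.2):
    """Boost a text-relevance score with a log-damped popularity signal."""
    # log1p leaves items with zero views unboosted and damps very large counts,
    # so a hugely popular image cannot completely swamp textual relevance.
    popularity = math.log1p(max(views, 0))
    return relevance * (1.0 + popularity_weight * popularity)


# Illustrative results: "a" matches the query slightly better, "b" is far more viewed.
results = [
    {"id": "a", "relevance": 1.0, "views": 0},
    {"id": "b", "relevance": 0.9, "views": 5000},
]
ranked = sorted(
    results,
    key=lambda r: blended_score(r["relevance"], r["views"]),
    reverse=True,
)
print([r["id"] for r in ranked])
```

With these sample values the more popular result ranks first; lowering `popularity_weight` moves the ordering back toward pure relevance.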

Q3 2020

Task Name | Task Description
Web Monetization: Research Phase | Research, mock up, and user test potential integrations for Web Monetization into CC Search and other CC web properties.
API UI with Usage Dashboard | Build a UI for the Catalog API, where users can sign up, manage access, and see usage metrics and statistics.
API documentation improvements | Make CC Catalog API documentation more accessible to CC Search users, and improve user experience.
Scraping & Resizing Work [AWS Grant] | Store a private copy of all the images in the CC Catalog to analyze via machine learning.
Wikidata integration with Catalog & Search Algorithm | Collect and use structured data from Wikidata to enhance our search algorithm with semantic search.
Usage/Reuse Metrics Dashboard | Build an analytics UI that is fed by Google Analytics and our internal analytics database.
Switch from Common Crawl to API | For all possible providers, use their APIs to ingest data into the CC Catalog instead of scraping websites via Common Crawl data.
Run Rekognition on 100m images [AWS Grant] | Generate metadata via machine learning (using AWS Rekognition) on a set of ~100 million high quality images from the CC Catalog.
Upgrade Catalog: Data Lake | Upgrade the CC Catalog database to use a schema-less database instead of the relational database (Postgres) that we currently use.
Provider Review Automation | Automate the process of finding new providers of CC-licensed content to index into the CC Catalog.
Implement Use of Thumbnails in Search & Catalog [AWS Grant] | Implement changes to CC Search (frontend) and Catalog to make use of thumbnails, as they become available.
Partnership guidelines for all integration types | Prepare partnership guidelines for CC Search. Create a page on CC Search publishing these guidelines.
Plan UI Updates in Response to Metadata [AWS Grant] | Design updates to the CC Search UI in response to new metadata available as a result of applying machine learning to selected images in the Catalog. At a minimum, we expect new filters will be an option. Integration of design will take place subsequently.

Q4 2020

Task Name | Task Description
Text Support and Integration | Run a pilot integration of text-based content that is considered educational. Requires selecting a source, structuring the Catalog and API, and frontend design and integration.
Plan use of ccREL for easily adding content to cccatalog | Plan how to scrape ccREL metadata from the web to index new content into the CC Catalog.
User Persona Redevelopment | Update CC Search user personas based on user research during 2020.
Support multiple languages in CC Search | Design and implement seamless support for multiple languages in CC Search, as content in other languages becomes available. This is preceded by Internationalization Infrastructure work.
Implement UI Updates for new Metadata [AWS Grant] | Implement design updates to the CC Search UI. Designs will be created in response to new metadata available as a result of applying machine learning to selected images in the Catalog. At a minimum, we expect new filters to be rolled out.
Implement Search Algorithm Changes [AWS Grant] | Update our search algorithm to use metadata gathered using machine learning analysis (using AWS Rekognition).
Ensure Infrastructure Code is Open Source [AWS Grant] | Release the infrastructure code used to power the CC Catalog, API, and CC Search projects publicly.
Enrich CC Catalog data with data from Common Crawl | Enrich the CC Catalog with data found in the wild using Common Crawl, for example to track where CC-licensed content is reused.
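
To give a sense of what the ccREL task above might involve: ccREL marks a page's license with a `rel="license"` link, so a first pass at discovering licensed content could simply collect those links. The sketch below uses only Python's standard `html.parser`. The class and function names are hypothetical, and a real pipeline would also need to read the remaining ccREL/RDFa properties (attribution name, source work, and so on).

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rel_values = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_values and href:
            self.license_urls.append(href)


def find_license_urls(html):
    """Return every license URL declared in the given HTML string."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return parser.license_urls


sample = (
    '<p>Photo by Ana, <a rel="license" '
    'href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></p>'
)
print(find_license_urls(sample))
```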

Review our Pipeline of Future Ideas here if you want to see what else has been suggested for CC Search.

How Can I Help?

Contribute Code

To contribute code, take the following steps:

  1. Review the following:
  2. Determine which project works best for you
  3. Read through how to find open issues.
  4. Start contributing!

We keep track of our work in three projects in GitHub:

CC Search Active Sprint
• The best place to start! If an issue isn’t in progress yet, and is marked for community contribution, you’ll know it’s our highest priority.

CC Search Backlog
• The column called “Next Sprint” contains our second-highest-priority items.
• The column for the current quarter (Q1, Q2, Q3, or Q4) shows what we plan to work on, up to three months out.
• Check out the “Any Time/Community” column for some fun tickets that aren’t a high priority for CC staff, but would be great if they got built.

CC Catalog Pipeline
• There are two columns with “Ready for Work” tickets. If they’re not blocked or marked as CC Staff only, we welcome your contribution.

Contribute Design

There are two ways you can show your interest in contributing to the design of CC Search:

  1. Follow the steps for suggesting a new feature for CC Search
  2. Join the #cc-search channel in the Creative Commons Slack and start a conversation about your design ideas.

Before you start work on any design project, get familiar with our design library in Figma. All CC Search designs use this design library.

Participate in Usability Tests and User Interviews

When we’re rolling out a specific feature, we run usability tests on the proposed experience.

We also engage with our users on an ongoing basis through user interviews, where we learn more about attitudes towards the product as it stands, and dig into expansion areas we’re considering.

If you’re interested, we invite you to sign up for a time via this Calendly link.

Visit our Usability page (forthcoming) for more details on how to participate.
