CC Search

The largest open source product in CC’s portfolio is CC Search: a search engine for CC-licensed and public domain creative works. The product involves the following components:

CC Catalog ingests and processes CC-licensed and public domain works, then makes that data available to the CC Catalog API. The CC Catalog API is publicly accessible and serves data from the catalog to CC Search.
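
The flow between the three components can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions, not the actual implementation: the class names, the `OPEN_LICENSES` set, the title-matching search, and the example URLs are all hypothetical and do not reflect the real Catalog schema or API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical license labels treated as "CC licensed or public domain" here.
OPEN_LICENSES = {"by", "by-sa", "by-nc", "by-nd", "by-nc-sa", "by-nc-nd", "cc0", "pdm"}


@dataclass
class Work:
    title: str
    license: str
    url: str


@dataclass
class Catalog:
    """Stands in for CC Catalog: ingests works and keeps only open ones."""
    works: List[Work] = field(default_factory=list)

    def ingest(self, candidates):
        for work in candidates:
            if work.license.lower() in OPEN_LICENSES:
                self.works.append(work)


class CatalogAPI:
    """Stands in for the CC Catalog API: serves catalog data to clients."""

    def __init__(self, catalog):
        self._catalog = catalog

    def search(self, query):
        # Naive substring match on the title; the real search is more involved.
        q = query.lower()
        return [w for w in self._catalog.works if q in w.title.lower()]


def cc_search(api, query):
    """Stands in for the CC Search frontend: formats API results for display."""
    return [f"{w.title} ({w.license.upper()})" for w in api.search(query)]


catalog = Catalog()
catalog.ingest([
    Work("Mountain lake", "by", "https://example.org/1"),
    Work("Mountain road", "all-rights-reserved", "https://example.org/2"),
    Work("City skyline", "cc0", "https://example.org/3"),
])
api = CatalogAPI(catalog)
results = cc_search(api, "mountain")
print(results)
```

The point of the layering is that CC Search never reads the catalog directly; it only sees what the API returns, which is also what any other API client would see.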

What Are We Up To?

The highlights on our roadmap for the current and upcoming quarters are as follows:

Q2 2020

Task Name | Task Description
Improve Search Algorithm with Popularity Data Integration | Make changes to the search algorithm that incorporate image popularity data gathered from sources that provide it.
Move data cleaning pipeline from API to Catalog | Move our data cleaning code from the ingestion step of the API to the initial data processing step of the Catalog to eliminate unnecessary repetitive data cleaning.
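
The idea behind moving the data cleaning step can be sketched roughly as follows. The record fields and the function names `clean_record`, `catalog_process`, and `api_ingest` are hypothetical, chosen only to illustrate cleaning once in the Catalog rather than on every API ingestion; the real cleaning rules are not described here.

```python
def clean_record(record):
    """Hypothetical cleaning step: trim whitespace and normalize the license."""
    return {
        "title": record["title"].strip(),
        "license": record["license"].strip().lower(),
    }


def catalog_process(raw_records):
    """Catalog side: clean each record once, when it is first processed."""
    return [clean_record(r) for r in raw_records]


def api_ingest(catalog_records):
    """API side: copy already-clean records as-is, with no second cleaning pass."""
    return [dict(r) for r in catalog_records]


raw = [{"title": "  Mountain lake ", "license": " BY "}]
cleaned = catalog_process(raw)

# Each API ingestion reuses the cleaned data instead of cleaning it again.
served = api_ingest(cleaned)
served_again = api_ingest(cleaned)
print(served)
```

With this arrangement the cost of cleaning is paid once per record, no matter how many times the API re-ingests the catalog.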

Q3 2020

Task Name | Task Description
Improve Catalog Deployment and Provisioning | Manage Catalog deployment and provisioning entirely through infrastructure as code.
Improve Documentation for Community Contributors | Create better documentation for community contributors by consolidating internal and public documentation and making it available for everyone.
Plan search algorithm changes for new metadata | [AWS Grant] Plan out search algorithm changes to incorporate image metadata generated via AWS Rekognition.
Implement architecture for schema for new metadata | [AWS Grant] Update Catalog schema to include new metadata generated through AWS Rekognition.
License Explanation/Compliance Improvements | Improve how and where we explain licenses, and consider ways to make it easier for reusers to understand and comply with license requirements.
Improved Support Pages | Improve the support pages on CC Search, including the Collections page, for a better experience. Add explanation text for collections and improve flow.
Design Sprint: Meta Search Integration | Integrate meta search functionality into CC Search for sources that are not currently indexed and content types we do not currently support.
Offline old CC Search | Take Old Search (oldsearch.creativecommons.org) offline and redirect traffic to CC Search. Prior to this, build in messaging on Old Search, and support similar functionality on CC Search. See "Meta Search Integration" for related work.
Accessibility Improvements | Make accessibility improvements to the UI.
Internationalization Infrastructure | Build infrastructure necessary for internationalization, to allow CC Search to be accessible in other languages.
Design Sprint: Audio UI for CC Search | Design and prototype an upcoming user interface for searching for audio on CC Search.
Audio Support and Integration | Design and user test UIs for audio. Ingest a pilot collection of audio into the Catalog and build support in the API. Integrate the design into the frontend to allow users to search for CC-licensed audio.
Improve Common Crawl Infrastructure | Update our Common Crawl provider infrastructure to: (1) use Apache Airflow instead of AWS tools like Data Pipeline & Glue for processing data; (2) unify provider processing to use the same base classes as API providers.
Use Data Dumps for Wikimedia Ingestion | Switch our Catalog data ingestion for Wikimedia Commons to use the data dumps provided by Wikimedia instead of the MediaWiki API.
Web Monetization: Phase 1 | Research and test potential integrations for Web Monetization into CC Search and other CC web properties.
API UI with Usage Dashboard | Build a UI for the Catalog API, where users can sign up, manage access, and see usage metrics and statistics.
API documentation improvements | Make CC Catalog API documentation more accessible to CC Search users, and improve user experience.
Scraping & Resizing Work | [AWS Grant] Store a private copy of all the images in the CC Catalog to analyze via machine learning.
Wikidata integration with Catalog & Search Algorithm | Collect and use structured data from Wikidata to enhance our search algorithm with semantic search.
Usage/Reuse Metrics Dashboard | Build an analytics UI that is fed by Google Analytics and our internal analytics database.
Switch from Common Crawl to API | For all possible providers, use their APIs to ingest data into the CC Catalog instead of scraping websites via Common Crawl data.
Run Rekognition on 100m images | [AWS Grant] Generate metadata via machine learning (using AWS Rekognition) on a set of ~100 million high-quality images from the CC Catalog.
Upgrade Catalog: Data Lake | Upgrade the CC Catalog database to use a schema-less database instead of the relational database (Postgres) that we currently use.
Provider Review Automation | Automate the process of finding new providers of CC-licensed content to index into the CC Catalog.
Implement Use of Thumbnails in Search & Catalog | [AWS Grant] Implement changes to CC Search (frontend) and Catalog to make use of thumbnails, as they become available.
Partnership guidelines for all integration types | Prepare partnership guidelines for CC Search. Create a page on CC Search publishing these guidelines.
Plan UI Updates in Response to Metadata | [AWS Grant] Design updates to the CC Search UI in response to new metadata available as a result of applying machine learning to selected images in the Catalog. At a minimum, we expect new filters will be an option. Integration of the design will take place subsequently.

Q4 2020

Task Name | Task Description
CC Search HTML Embed | Design and build an embed of CC Search that can be placed on any website, as a starting point to discover objects in CC Search. Components from the Design Library must be used, with the goal of simplicity.
Plan use of ccREL for easily adding content to cccatalog | Plan how to scrape ccREL metadata from the internet to index new content into the CC Catalog.
User Persona Redevelopment | Update CC Search user personas based on user research during 2020.
Support multiple languages in CC Search | Design and implement seamless support for multiple languages in CC Search, as content in other languages becomes available. This is preceded by Internationalization Infrastructure work.
Implement UI Updates for new Metadata | [AWS Grant] Implement design updates to the CC Search UI. Designs will be created in response to new metadata available as a result of applying machine learning to selected images in the Catalog. At a minimum, we expect new filters to be rolled out.
Implement Search Algorithm Changes | [AWS Grant] Update our search algorithm to use metadata gathered using machine learning analysis (using AWS Rekognition).
Ensure Infrastructure Code is Open Source | [AWS Grant] Release the infrastructure code used to power the CC Catalog, API, and CC Search projects publicly.
Enrich CC Catalog data with data from Common Crawl | Enrich CC Catalog data with data found in the wild using Common Crawl, for example, to track where CC-licensed content is reused.

Review our Pipeline of Future Ideas if you want to see what else has been suggested for CC Search.

How Can I Help?

Contribute Code

To contribute code, take the following steps:

  1. Review the following:
  2. Determine which project works best for you
  3. Read through how to find open issues.
  4. Start contributing!

We keep track of our work in three projects on GitHub:

CC Search Active Sprint
• The best place to start! If an issue isn’t in progress yet and is marked for community contribution, you’ll know it’s our highest priority.

CC Search Backlog
• The “Next Sprint” column contains our second-highest-priority items.
• The current quarter (Q1, Q2, Q3, or Q4) shows what we plan to work on, up to three months out.
• Check out the “Any Time/Community” column for some fun tickets that aren’t a high priority for CC staff, but would be great if they got built.

CC Catalog Pipeline
• There are two columns with “Ready for Work” tickets. If they’re not blocked or marked as CC Staff only, we welcome your contribution.

Contribute Design

There are two ways you can show your interest in contributing to the design of CC Search:

  1. Follow the steps for suggesting a new feature for CC Search
  2. Join the #cc-search channel in the Creative Commons Slack and start a conversation about your design ideas.

Before you start work on any design project, get familiar with our design library in Figma. All CC Search designs use this design library.

Participate in Usability Tests and User Interviews

When we’re rolling out a specific feature, we run usability tests on the proposed experience.

At any point in time, we’re engaging with our users through user interviews, where we learn more about attitudes towards the product as it stands, and dig into expansion areas we’re considering.

If you’re interested, we invite you to sign up for a time via this Calendly link.

Visit our Usability page for more details on how to participate.
