[Archived] CC Search

The largest open source product in CC’s portfolio is CC Search: a search engine for CC-licensed and public domain creative works.

The product involves the following components:

CC Catalog ingests and processes CC-licensed and public domain works, then makes that data available to the CC Catalog API. The CC Catalog API is publicly accessible and serves data from the catalog to CC Search.
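The data flow between these components can be pictured with a small in-memory sketch. Everything here is hypothetical and for illustration only: the class names, the `Work` fields, and the substring search are assumptions, not the real Catalog schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    """One catalog record. The fields are illustrative, not the real schema."""
    title: str
    license: str  # e.g. "by", "by-sa", "cc0"
    url: str


class Catalog:
    """Stands in for CC Catalog: ingests works and holds the processed data."""

    def __init__(self) -> None:
        self._works: list[Work] = []

    def ingest(self, raw: dict) -> None:
        # Minimal "processing": trim whitespace and normalize the license code.
        self._works.append(
            Work(
                title=raw["title"].strip(),
                license=raw["license"].strip().lower(),
                url=raw["url"].strip(),
            )
        )

    def all_works(self) -> list[Work]:
        return list(self._works)


class CatalogAPI:
    """Stands in for CC Catalog API: serves catalog data to any client."""

    def __init__(self, catalog: Catalog) -> None:
        self._catalog = catalog

    def search(self, query: str) -> list[Work]:
        # Naive case-insensitive title match, used only to keep the sketch short.
        needle = query.lower()
        return [w for w in self._catalog.all_works() if needle in w.title.lower()]


def render_results(api: CatalogAPI, query: str) -> list[str]:
    """Stands in for the CC Search frontend, which only talks to the API."""
    return [f"{w.title} ({w.license.upper()})" for w in api.search(query)]


catalog = Catalog()
catalog.ingest({"title": " Sunset over Lisbon ", "license": "BY-SA", "url": "https://example.org/1"})
catalog.ingest({"title": "Mountain sunrise", "license": "CC0", "url": "https://example.org/2"})
results = render_results(CatalogAPI(catalog), "sunset")
print(results)
```

The point of the layering is that the frontend never reads the catalog directly, so any other client can use the same public API.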

What Are We Up To?

The highlights on our roadmap for the current and upcoming quarters are as follows:

Q3 2020

Task Name | Task Description
Improve Search Algorithm with Popularity Data Integration | Make changes to the search algorithm that incorporate image popularity data gathered from sources that provide it.
Move data cleaning pipeline from API to Catalog | Move our data cleaning code from the ingestion step of the API to the initial data processing step of the Catalog to eliminate unnecessary repetitive data cleaning.
Implement architecture for schema for new metadata | [AWS Grant] Update Catalog schema to include new metadata generated through AWS Rekognition.
Plan search algorithm changes for new metadata | [AWS Grant] Plan out search algorithm changes to incorporate image metadata generated via AWS Rekognition.
License Explanation/Compliance Improvements | Improve how and where we explain licenses, and consider ways to make it easier for reusers to understand and comply with license requirements.
Offline old CC Search | Take Old Search (oldsearch.creativecommons.org) offline and redirect traffic to CC Search. Before that, add messaging to Old Search and support similar functionality on CC Search. See "Meta Search Integration" for related work.
Web Monetization: Phase 1 | Research and test potential integrations for Web Monetization into CC Search and other CC web properties.
3D Support in UI | Support rendering of 3D objects on the frontend of CC Search.
Improved Support Pages | Improve the support pages on CC Search, including the Collections page, for a better experience. Add explanatory text for collections and improve the flow.
Accessibility Improvements | Make accessibility improvements to the UI.
Internationalization Infrastructure | Build the infrastructure necessary for internationalization, to allow CC Search to be accessible in other languages.
Improve Common Crawl Infrastructure | Update our Common Crawl provider infrastructure to (1) use Apache Airflow instead of AWS tools like Data Pipeline and Glue for processing data, and (2) unify provider processing so it uses the same base classes as API providers.
Design Sprint: Audio UI for CC Search | Design and prototype an upcoming user interface for searching for audio on CC Search.
Audio Support and Integration | Design and user test UIs for audio. Ingest a pilot collection of audio into the Catalog and build support in the API. Integrate the design into the frontend so users can search for CC-licensed audio.
Run Rekognition on 100m images | [AWS Grant] Generate metadata via machine learning (using AWS Rekognition) on a set of ~100 million high-quality images from the CC Catalog.
Switch from Common Crawl to API | For all possible providers, use their APIs to ingest data into the CC Catalog instead of scraping websites via Common Crawl data.
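As an illustration of the "Popularity Data Integration" item in the Q3 list above, here is one way text relevance and popularity could be combined. This is a sketch under stated assumptions, not the algorithm CC Search used: the `relevance` and `views` keys, the log scaling, and the `weight` parameter are all hypothetical. Results from sources that provide no popularity data get no boost.

```python
import math


def popularity_boost(views: int, max_views: int) -> float:
    """Map a raw view count to [0, 1] on a log scale so outliers do not dominate."""
    if max_views <= 0 or views <= 0:
        return 0.0
    return min(1.0, math.log1p(views) / math.log1p(max_views))


def rank(results: list[dict], weight: float = 0.5) -> list[dict]:
    """Order results by text relevance scaled by a popularity boost.

    Each result is a dict with hypothetical keys "id", "relevance", and "views".
    A source without popularity data reports views=None and gets no boost.
    """
    max_views = max((r["views"] or 0 for r in results), default=0)

    def score(r: dict) -> float:
        boost = popularity_boost(r["views"] or 0, max_views)
        return r["relevance"] * (1.0 + weight * boost)

    return sorted(results, key=score, reverse=True)


results = [
    {"id": "a", "relevance": 1.0, "views": None},    # source provides no popularity data
    {"id": "b", "relevance": 0.9, "views": 10_000},
    {"id": "c", "relevance": 0.9, "views": 10},
]
ranked_ids = [r["id"] for r in rank(results)]
print(ranked_ids)
```

Multiplying rather than adding keeps popularity from lifting an irrelevant result: a work with zero relevance stays at zero however popular it is.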

Q4 2020

Task Name | Task Description
Search Relevance Improvements: Language Analysis, Quality Metrics, Minimums | None
Plan UI Updates in Response to Metadata | [AWS Grant] Design updates to the CC Search UI in response to new metadata available as a result of applying machine learning to selected images in the Catalog. At a minimum, we expect new filters will be an option. Integration of the design will take place subsequently.
Provider Review Automation | Automate the process of finding new providers of CC-licensed content to index into the CC Catalog.
Usage/Reuse Metrics Dashboard | Build an analytics UI that is fed by Google Analytics and our internal analytics database.
Scrape all images and set up feed for new ones | Once the Rekognition crawl finishes, crawl the rest of the catalog (without feeding those images to Rekognition). This will give us useful metadata such as dimensions and quality.
Improve Documentation for Community Contributors | Create better documentation for community contributors by consolidating internal and public documentation and making it available to everyone.
Improve Catalog Deployment and Provisioning | Manage Catalog deployment and provisioning entirely through infrastructure as code.
API documentation improvements | Make CC Catalog API documentation more accessible to CC Search users, and improve user experience.
CC Search HTML Embed | Design and build an embed of CC Search that can be placed on any website, as a starting point to discover objects in CC Search. It must use components from the Design Library, with simplicity as the goal.
Plan use of ccREL for easily adding content to cccatalog | Plan how to scrape ccREL metadata from the web to index new content into the CC Catalog.
User Persona Redevelopment | Update CC Search user personas based on user research during 2020.
Support multiple languages in CC Search | Design and implement seamless support for multiple languages in CC Search, as content in those languages becomes available. This follows the Internationalization Infrastructure work.
Implement UI Updates for new Metadata | [AWS Grant] Implement design updates to the CC Search UI. Designs will be created in response to new metadata available as a result of applying machine learning to selected images in the Catalog. At a minimum, we expect new filters to be rolled out.
Implement Search Algorithm Changes | [AWS Grant] Update our search algorithm to use metadata gathered using machine learning analysis (using AWS Rekognition).
Ensure Infrastructure Code is Open Source | [AWS Grant] Publicly release the infrastructure code used to power the CC Catalog, API, and CC Search projects.
Enrich CC Catalog data with data from Common Crawl | Enrich CC Catalog data with data found in the wild via Common Crawl, for example to track where CC-licensed content is reused.
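The ccREL item in the Q4 list above is about discovering licensed content from metadata embedded in web pages. ccREL commonly marks a license in HTML with `rel="license"` on a link, so a first pass can be sketched with Python's standard `html.parser`. This is only an illustration, not the Catalog's approach: the class and function names are hypothetical, and a real crawler would also need full RDFa parsing and URL normalization.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a> and <link> elements marked rel="license"."""

    def __init__(self) -> None:
        super().__init__()
        self.license_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license noopener".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.license_urls.append(href)


def find_cc_licenses(html: str) -> list[str]:
    """Return license URLs on the page that point at creativecommons.org."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return [u for u in parser.license_urls if "creativecommons.org/" in u]


page = """
<p>Photo by Jane Doe, licensed under
<a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>.</p>
<a href="https://creativecommons.org/licenses/by-nc/4.0/">not marked as a license</a>
<a rel="license" href="https://example.org/custom-terms">custom terms</a>
"""
licenses = find_cc_licenses(page)
print(licenses)
```

Only the first link is returned: the second lacks `rel="license"`, and the third is a license link that does not point to a Creative Commons license.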

Review our Pipeline of Future Ideas to see what else has been suggested for CC Search.

How Can I Help?

Contribute Code

To contribute code, take the following steps:

We keep track of our work in three projects on GitHub:
CC Search Active Sprint
• The best place to start! If an issue isn’t in progress yet, and is marked for community contribution, you’ll know it’s our highest priority.
CC Search Backlog
• The column called “Next Sprint” contains our second-highest priority items.
• The column for the current quarter (Q1, Q2, Q3, or Q4) shows what we plan to work on, up to three months out.
• Check out the “Any Time/Community” column for some fun tickets that aren’t a high priority for CC staff but would be great to see built.
CC Catalog Pipeline
• There are two columns with “Ready for Work” tickets. If a ticket isn’t blocked or marked as CC Staff only, we welcome your contribution.

Contribute Design

There are two ways you can show your interest in contributing to the design of CC Search:

Before you start work on any design project, get familiar with our design library in Figma. All CC Search designs use this design library.

Participate in Usability Tests and User Interviews

When we’re rolling out a specific feature, we run usability tests on the proposed experience.

We also run user interviews on an ongoing basis, where we learn more about attitudes toward the product as it stands and dig into expansion areas we’re considering.

If you’re interested, we invite you to sign up for a time via this Calendly link.

Visit our Usability page for more details on how to participate.