Open Source Internships: Project Ideas

This is the project idea list for the current round of open source internships at Creative Commons.



Add Audio to the CC Catalog & CC Search

  • The Problem:

    Currently, CC Search and the CC Catalog API only support image search. We’d like to add more content types, especially audio. This would involve indexing audio sources in the CC Catalog and adding new endpoints to the CC Catalog API. User interface changes are out of scope for this project.

  • Expected Outcome:
    • There would be scripts in the CC Catalog repository that index metadata related to openly licensed audio files and add it to our database.
    • The CC Catalog API would have a set of endpoints that allow searching for and browsing audio files.
  • Internship Tasks:
    • Work with CC’s data engineer to define and implement a database schema for collecting audio file metadata.
    • Write scripts to ingest audio file metadata from open repositories such as Freesound, Free Music Archive, etc.
    • Implement additional API endpoints on the CC Catalog API to expose audio data.
  • Application Tips:
    • Include potential database schemas in your application.
    • Include sources that you might want to ingest in your application.
  • Resources:
  • Skills recommended: CSS, Django, Django REST Framework, HTML, JavaScript, Python, Vue.js
  • Mentors: Alden Page, Brent Moran, Anna Tumadóttir
  • Difficulty: Medium
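
As a possible starting point for the schema discussion mentioned in the application tips, here is a minimal sketch of an audio metadata record with a basic validity check. Every field name is a hypothetical illustration, not the CC Catalog's actual schema, and a real proposal would be worked out with CC's data engineer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AudioRecord:
    """Hypothetical metadata row for one openly licensed audio file."""
    foreign_identifier: str              # ID of the file at the source
    source: str                          # e.g. "freesound"
    audio_url: str                       # direct link to the audio file
    license: str                         # e.g. "by", "by-sa", "cc0"
    license_version: Optional[str] = None
    title: Optional[str] = None
    creator: Optional[str] = None
    duration_ms: Optional[int] = None
    tags: List[str] = field(default_factory=list)


def validate(record: AudioRecord) -> List[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    if not record.audio_url.startswith(("http://", "https://")):
        problems.append("audio_url must be an absolute HTTP(S) URL")
    if not record.license:
        problems.append("license is required")
    if record.duration_ms is not None and record.duration_ms < 0:
        problems.append("duration_ms must not be negative")
    return problems
```

Audio-specific fields such as duration are the main difference from an image record. Which ones to collect depends on what the chosen sources actually expose.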

Add filtering by node to the Linked Commons

  • The Problem:

    For last year’s GSoC, María Belén Guaranda Cabezas created a visualization graph of domains in the Commons, using one month of data from Common Crawl. For details, please see her posts on the CC Open Source Blog, which you can find at her author page. We’d like to expand on the current state of that project by adding the ability to filter by domain and show only nodes within distance 2 (i.e., nodes that can be reached by traveling along no more than two edges from the chosen node). If that is accomplished in short order, we’d also like to give the user the ability to choose the distance of nodes to show from a given node.

  • Expected Outcome:
    • There should be a search box into which the user can type a domain (or part of a domain), and the graph should then show only nodes from that domain, and domains which can be reached with no more than two hops from the original.
    • Stretch Goal: It’d be great if the user could choose a distance from the chosen node via some sort of drop-down menu.
    • If both of those go well, we’d like to explore some graph-theoretic metrics on the graph.
    • We may also add live updating to the data that backs the visualization.
    • If these features are completed ahead of schedule, the intern may suggest further features to add to the visualization.
  • Internship Tasks:

    The intern should implement the first feature above, and if there’s time, implement the second. It may be useful for the intern to assist with setting up live-updating of the data backing the visualization.

  • Application Tips:

    Interest in and/or experience with graph theory would be useful!

  • Resources:
  • Skills recommended: JavaScript, Graph Theory
  • Mentors: Brent Moran, María Belén Guaranda Cabezas
  • Difficulty: Easy
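
The distance filter described above is a breadth-first search that stops expanding once it reaches the chosen number of hops. The sketch below shows the idea in Python for brevity, although the visualization itself is JavaScript. The adjacency-dictionary graph format and both function names are assumptions for illustration, not the project's actual data model.

```python
from collections import deque


def nodes_within(graph, start, max_distance=2):
    """Map each node reachable from start in at most max_distance edges to its distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_distance:
            continue  # do not expand past the distance limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def matching_domains(graph, query):
    """Return every domain in the graph whose name contains query (case-insensitive)."""
    domains = set(graph)
    for targets in graph.values():
        domains.update(targets)
    return sorted(d for d in domains if query.lower() in d.lower())


# Hypothetical link graph: each domain maps to the domains it links to.
links = {
    "a.org": ["b.org"],
    "b.org": ["c.org"],
    "c.org": ["d.org"],
}
print(nodes_within(links, "a.org"))
```

Making `max_distance` a parameter covers the stretch goal of letting the user pick the distance from a drop-down menu.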

Add Provider API Scripts to CC Catalog

  • The Problem:

    The CC Catalog gets a huge amount of its data by pulling image info from APIs via what we are calling ‘Provider API Scripts’. We have a backlog of providers which have been vetted, and we’d like to have scripts that pull data from their public APIs and pass it to our storage class. This would increase the breadth of material available from CC Search and the CC Catalog API.

  • Expected Outcome:

    We would like to have a number of completed, well-tested Provider API Scripts written by the end of this internship, and they should be deployed in production. Deployment in production implies we’d also have Apache Airflow DAGs (Directed Acyclic Graphs) that run the Provider API Scripts on an appropriate schedule.

  • Internship Tasks:

    The intern should write more Provider API Scripts, taking priorities from the backlog linked above. Such a script must pull image information from a public API provided by the provider, and pass it along to a function that will validate the information, format it as necessary, and write it to disk. This validation/storage function is already written, so the intern needs only to write a script that knows how to get the relevant data from the public API of the provider. For examples of what we are expecting from a Provider API Script, see wikimedia_commons.py and flickr.py in the repository, as well as their accompanying tests.

  • Application Tips:

    Knowledge of or experience with pulling real data from public APIs in JSON format would be helpful. It would also help if the intern is familiar with Python.

  • Resources:
  • Skills recommended: Python, JSON, Apache Airflow (optional)
  • Mentors: Brent Moran, Kriti Godey
  • Difficulty: Easy
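
To make the shape of such a script concrete, here is a minimal sketch that walks a paginated JSON API and hands each usable item to a storage callable. The response layout, the field names, and the `fetch_page` and `store_image` callables are all hypothetical stand-ins; the real storage function and its arguments are defined in the CC Catalog repository, so check wikimedia_commons.py and flickr.py for the actual conventions.

```python
def extract_image_fields(item):
    """Map one item of a hypothetical provider response to storage keyword arguments."""
    identifier = item.get("id")
    image_url = item.get("url")
    license_url = item.get("license_url")
    if identifier is None or not image_url or not license_url:
        return None  # skip items that lack the fields we need
    return {
        "foreign_identifier": str(identifier),
        "image_url": image_url,
        "license_url": license_url,
        "title": item.get("title"),
        "creator": item.get("author"),
    }


def pull_images(fetch_page, store_image, max_pages=100):
    """Walk pages until one comes back empty; return how many items were stored.

    fetch_page(page_number) must return the parsed JSON for that page, and
    store_image(**fields) stands in for the existing validation/storage function.
    """
    stored = 0
    for page in range(1, max_pages + 1):
        items = fetch_page(page).get("results", [])
        if not items:
            break
        for item in items:
            fields = extract_image_fields(item)
            if fields is not None:
                store_image(**fields)
                stored += 1
    return stored
```

Passing the fetcher and the storage function in as arguments keeps the script testable without network access, which helps with the "well-tested" part of the expected outcome.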

Improve CC Search Accessibility

  • The Problem:

    Creative Commons is a global community, and yet, CC Search lacks some accessibility features, including not being very user-friendly to users of screen readers. It's also available only in English, which results in the tool being less accessible to international audiences and reduces our reach. As an open web tool and platform, CC Search should be accessible to the widest audience possible, in as many languages as possible.

  • Expected Outcome:

    A release of CC Search which contains:

    1. Improvements to the HTML of CC Search with regard to accessibility, including ARIA attributes, forms, color contrast ratios, and UI changes that make CC Search easier to use for people with disabilities.
    2. An implementation of i18n using vue-i18n: the currently hardcoded text refactored into compatible translation resources, localized numbers and dates, locale detection so that the appropriate language is loaded when the user visits the CC Search website, and a tool to easily integrate new translations with Transifex.
  • Internship Tasks:
    • Research and implement accessibility improvements
    • Perform usability tests with people with disabilities to identify problems and test solutions to accessibility issues
    • Set up vue-i18n in CC Search
    • Refactor hardcoded English text into translation resource files
    • Localize numbers and dates
    • Detect the user's preferred language and load the appropriate locale data
    • Allow users to change the current locale
    • Integrate with Transifex
  • Application Tips:
    • Good understanding of how to change and refactor code that is under active development
    • A plan that's broken into small tasks, each ideally doable in one week, with no big tasks that take multiple weeks to complete
    • Bonus points for good and innovative ideas on how to integrate the actual translations
    • Bonus points for doing research into currently existing accessibility issues on CC Search
  • Resources:
  • Skills recommended: CSS, HTML, JavaScript, Vue.js
  • Mentors: Breno Ferreira, Ari Madian
  • Difficulty: Medium
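
Color contrast is one of the accessibility items above that can be checked mechanically. The sketch below computes the WCAG 2.x contrast ratio between two colors and compares it with the AA thresholds (4.5:1 for normal text, 3:1 for large text). The function names are illustrative, and the sketch only handles six-digit hex colors; it is written in Python as a stand-alone audit helper rather than as part of the CC Search frontend.

```python
def _linearize(channel):
    """Convert one 0-255 sRGB channel to its linear value, per the WCAG definition."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a color written as '#rrggbb'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1.0 (identical colors) up to 21.0 (black on white)."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground, background, large_text=False):
    """True if the color pair meets the WCAG AA contrast threshold."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

A check like this could be run over the colors in the stylesheet to find failing pairs before manual testing with users.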

Improvements to the CC WordPress Plugin

Integrate Vocabulary with CC Open Source & CCGN websites

  • The Problem:

    Creative Commons has many different websites (CC.org, CC Global Summit, CC Open Source, CC Certificates, CC Global Network, CC Chapter Sites, etc.), all of which have different design elements and styles. One of our 2020 goals is to unify them all using our new web design system, Vocabulary. We need help updating the CC Open Source and CCGN websites.

  • Expected Outcome:
    • Updates to the CC Open Source website replacing all styling with components from Vocabulary.
    • Updates to the CC Global Network WordPress theme that build upon our base WordPress theme (this theme is currently in progress) and use components from Vocabulary.
    • Updates to our Figma design library (in collaboration with our UX Designer) and Vocabulary itself for any new components that need to be added when redesigning the sites.
  • Internship Tasks:
    • Create wireframes for the website redesigns using Vocabulary components in Figma
    • Identify new components that need to be added to Vocabulary and work with the UX designer to design and implement them
    • Update the WordPress theme for the CC Global Network website to use Vocabulary exclusively
    • Update the CC Open Source website styling to use Vocabulary exclusively
    • Implement the new components in Vue Vocabulary if time permits
  • Application Tips:

    It is okay if you think you’ll only have time to do one of the two websites. We’d rather you do one of them well than rush.

  • Resources:
  • Skills recommended: CSS, HTML, JavaScript, PHP, WordPress
  • Mentors: Dhruv Bhanushali, Hugo Solar
  • Difficulty: Medium

Reimplement CC’s Legal Database using WordPress or Django

  • The Problem:

    CC maintains a collection of case law and legal scholarship relevant to legal issues around Creative Commons licenses. Users can submit information, which is reviewed by a member of CC’s legal team and edited if necessary before publishing it to the live site. This tool currently has a number of issues:

    • Publishing new data is a cumbersome manual process for both the legal and tech teams.
    • There is no way to browse all resources on the site.
    • It does not use Vocabulary, CC’s new web design system.
  • Expected Outcome:

    A new website, built using either WordPress or Django, that supports the following features:

    • The general public can submit legal information without a user account (the information collected should be identical to the current implementation).
    • The legal team at CC can review and edit incoming information and approve it, at which point it goes live.
    • All the legal information on the site should be browseable and searchable by keyword and country.
    • The user-facing portion of the website should use Vocabulary components.
  • Internship Tasks:
    • Architect the backend of the legal database in either WordPress or Django, including researching appropriate plugins or libraries that will make the task easier.
    • Code the backend of the legal database.
    • Create design mockups for the frontend in collaboration with CC’s UX designer.
    • Implement new components to Vocabulary, if necessary, in collaboration with CC’s UX designer.
    • Code the frontend of the legal database using Vocabulary.
    • Assist CC staff with deployment related tasks, if needed.
  • Application Tips:

    Please specify the architecture and plugins/libraries that you’d like to use in your application. We don’t want you to reinvent the wheel; we’d like to use existing libraries as much as possible.

  • Resources:
  • Skills recommended: CSS, Django, HTML, JavaScript, PHP, Python, WordPress (either Django/Python or WordPress/PHP, not both)
  • Mentors: Kriti Godey, Timid Robot Zehta
  • Difficulty: Medium
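
The submit, review, publish, and search flow in the expected outcome can be sketched independently of the framework choice. The plain-Python model below is only an illustration: the class names and the three fields are assumptions, since the real fields must match what the current implementation collects, and an actual build would use Django models or WordPress post types instead.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    PUBLISHED = "published"


@dataclass
class Submission:
    title: str
    country: str
    summary: str
    status: Status = Status.PENDING


class LegalDatabase:
    """In-memory stand-in for the moderation workflow."""

    EDITABLE = ("title", "country", "summary")

    def __init__(self):
        self._items = []

    def submit(self, title, country, summary):
        """Public submission without an account; always starts as pending."""
        item = Submission(title, country, summary)
        self._items.append(item)
        return item

    def approve(self, item, **edits):
        """A reviewer applies optional edits, then the item goes live."""
        for name, value in edits.items():
            if name not in self.EDITABLE:
                raise ValueError(f"cannot edit field: {name}")
            setattr(item, name, value)
        item.status = Status.PUBLISHED

    def search(self, keyword="", country=None):
        """Published items matching the keyword and, optionally, the country."""
        keyword = keyword.lower()
        return [
            item for item in self._items
            if item.status is Status.PUBLISHED
            and (country is None or item.country == country)
            and (keyword in item.title.lower() or keyword in item.summary.lower())
        ]
```

The key property is that nothing is visible to search until a reviewer approves it, which mirrors the requirement that the legal team reviews incoming information before it goes live.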

Usage & Reuse Metrics Dashboard for CC Search

  • The Problem:

    CC Search, CC’s search engine for making openly licensed content discoverable, has hundreds of thousands of visitors per month. The catalog powering CC Search contains collections ranging from user-generated content at Flickr to priceless art pieces at the Met, and everything in between. The catalog contains a couple of dozen sources at present and is set for significant expansion.

    While we do keep track of certain user actions in our database, and are able to glean other insights with the use of Google Analytics, we do not have a single dashboard to see relevant information.

    This is a problem for two reasons:

    • We lack a comprehensive understanding of user behavior, as we have to look in multiple places and pull disparate information to paint a picture of overall user behavior.
    • Our catalog partners have no way to understand what impact their presence in CC Search is having in terms of discoverability and increased exposure to their collections.
      • In turn we are unable to tell a story about that impact to potential, future partners.
  • Expected Outcome:

    We’d like the outcome to be an analytics interface, of some kind, where all of these metrics are pulled together into one place. The interface should be both aesthetically pleasing and clear to understand, allowing those accessing it to get the pertinent information they need, whether it is by day, week, month, collection, or any other logical grouping.

  • Internship Tasks:
    • Understand the required data to be displayed as defined by CC HQ
    • Research analytics dashboards to provide recommendations on additional data that is valuable
    • Document a cohesive plan for structuring an analytics database storing all pertinent data
    • Mock up (wireframes only) how the data could usefully be presented and what navigational elements, if any, will be required
    • Research existing analytics UIs
    • Present findings and make a recommendation on which one to use
    • Implement an analytics UI for CC Search Usage & Reuse data
  • Application Tips:

    This is a complicated problem, but we believe there is an elegant solution. We’d like the intern to show that they understand the stakeholders, can argue for the value of being able to provide this data internally and externally, have a clear picture of how one would structure an analytics database (which could use example data like page views, unique visitors, time on page, bounce rate, by URL, by day, by month, etc.) with a view to flexibility for expansion and rapid querying, and an indication of their comfort level with both design and programming. The most important thing is solid development skills. Support for frontend design, based on proposed wireframes, can be provided by the organization.

  • Resources:
  • Skills recommended: CSS, design, HTML, Python, PostgreSQL
  • Mentors: Alden Page, Anna Tumadóttir
  • Difficulty: Hard
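
One way to think about the analytics database mentioned in the application tips is a single table of pre-counted events that can be rolled up by day, month, or collection. The sketch below uses Python's built-in sqlite3 module so that it runs anywhere; a real implementation would more likely target PostgreSQL, and the table, column, and event-type names are all assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE usage_event (
    occurred_on TEXT NOT NULL,     -- ISO date, e.g. 2020-03-01
    collection  TEXT NOT NULL,     -- source collection, e.g. 'flickr'
    event_type  TEXT NOT NULL,     -- hypothetical, e.g. 'result_click'
    event_count INTEGER NOT NULL
);
CREATE INDEX usage_event_day ON usage_event (occurred_on, collection);
"""


def open_database():
    """Create an in-memory database with the sketch schema."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(SCHEMA)
    return connection


def totals_by(connection, period="day"):
    """Sum event counts per collection, grouped by 'day' or 'month'."""
    # An ISO date truncated to 10 characters is a day, to 7 characters a month.
    width = {"day": 10, "month": 7}[period]
    rows = connection.execute(
        "SELECT substr(occurred_on, 1, ?) AS bucket, collection, SUM(event_count) "
        "FROM usage_event GROUP BY bucket, collection ORDER BY bucket, collection",
        (width,),
    )
    return rows.fetchall()


connection = open_database()
connection.executemany(
    "INSERT INTO usage_event VALUES (?, ?, ?, ?)",
    [
        ("2020-03-01", "flickr", "result_click", 5),
        ("2020-03-02", "flickr", "result_click", 7),
        ("2020-03-02", "met", "result_click", 2),
    ],
)
print(totals_by(connection, "month"))
```

Storing one row per day, collection, and event type keeps queries fast and leaves room to add new event types later without changing the schema.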

Web Monetization Research Project

  • The Problem:

    CC is a non-profit supported by donations from foundations, corporations, and individuals. Despite massive traffic to CC’s web products, we do not make use of any mainstream monetization methods on our sites, such as advertising. However, as a non-profit, CC is continuously thinking about ways to fund the important work we do. As one potential avenue, CC wants to carry out in-depth research of web monetization options in order to evaluate which, if any, are appropriate for CC to pursue and integrate for our own web products.

    This project is for Outreachy only, since it does not involve code output.

  • Expected Outcome:

    We’d like the outcome of this internship to be a deliverable, in written form, detailing in-depth research regarding the current state of web monetization technology, with recommendations for options for CC. We expect this deliverable to take the form of a CC-licensed research paper, based on which we expect the intern to also write a blog article.

  • Internship Tasks:

    The intern will be expected to create their own detailed work plan, containing activities including but not limited to:

    • preliminary research
    • user interviews
    • preparation of an outline
    • research to write the paper, including examples of existing implementations on other sites, of various web monetization technologies
    • further user interviews to help inform recommendations
    • writing a blog article about their findings

    They will work closely with their mentors, doing frequent check-ins, in order to ensure the direction of the research is on track, and that there aren’t oversights.

  • Application Tips:

    The application should give a clear sense that the intern has excellent writing skills, understands how to carry out research interviews, and knows how to structure findings, whether they are technical truths or takeaways from interviews. Passion for or existing knowledge of online privacy issues, cryptocurrencies, browser technologies, and other related subject matter is also valuable.

  • Resources:
  • Skills recommended: Research, Writing. Insofar as the work is technical, it requires understanding how to navigate the world of add-ons and plugins, and being clear enough on the subject matter to correctly describe how different web monetization technologies work to a non-technical person.
  • Mentors: Anna Tumadóttir, Sarah Pearson
  • Difficulty: Easy

Original Ideas

  • We are open to original ideas for either new features or improvements to our existing projects or new CC-related plugins for third-party software.

    Please talk to us on the #cc-gsoc-outreachy channel on Slack or via the mailing list to find a mentor for the project before submitting your proposal.
