CC Open Source Blog

Case study of a simple but highly effective use of Semantic MediaWiki on the

by akozak on 2010-09-08

Whether it's setting up the columns of a spreadsheet or defining a data structure in Python, any project that involves gathering data that can be structured requires technical support. Choosing the right technical tool for the job means weighing your requirements carefully while acknowledging your constraints.

At Creative Commons, we never run out of ideas for useful collections of information. We're always looking for ways to highlight interesting uses of our legal and technical tools, to estimate the impact that our tools are having, and to better engage our various user bases and communities. But while we have a lot of exciting ideas for new datasets, we don't always have the resources or infrastructure to build new data collection and management tools for each of those projects.

To work within this constraint, we rely heavily on Semantic MediaWiki on the CC Wiki to manage various datasets related to Creative Commons. For the uninitiated, Semantic MediaWiki is an extension to MediaWiki, the popular open source wiki platform that powers Wikipedia (all of the extensions discussed here are also open source). Semantic MediaWiki adds powerful organizational tools to your MediaWiki installation: data queries, data import and export, and powerful methods for organizing and collecting pages. Combined with helper extensions such as Semantic Forms and Semantic Drilldown, it also provides user-friendly creation of template calls and data browsing.
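As a rough illustration of what a data query can look like from outside the wiki, the sketch below builds a request URL for the `ask` module that later versions of Semantic MediaWiki expose through the MediaWiki API. It only constructs the URL and sends nothing. The wiki address, category name, and property names are assumptions made for the example, and the module may not have been available on the CC Wiki when this was written.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_ask_url(api_url, conditions, printouts, limit=50):
    """Build a URL for a Semantic MediaWiki 'ask' API request.

    conditions: query conditions in SMW syntax, e.g. "[[Category:Case Studies]]"
    printouts:  property names to return for each matching page
    """
    # An SMW query is the conditions followed by "|?Property" printout requests.
    query = conditions + "".join(f"|?{name}" for name in printouts)
    query += f"|limit={limit}"
    params = {"action": "ask", "query": query, "format": "json"}
    return api_url + "?" + urlencode(params)


# Hypothetical wiki address, category, and properties.
url = build_ask_url(
    "https://wiki.example.org/api.php",
    "[[Category:Case Studies]]",
    ["Quality", "Importance"],
    limit=10,
)

# Decode the query string again to check what would be sent.
sent = parse_qs(urlsplit(url).query)
print(sent["query"][0])
```

On the wiki itself, the same kind of query would normally be written inline on a page rather than sent over HTTP.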

One example of an effective use of Semantic MediaWiki, which recently underwent some maintenance, is the Case Studies database on the CC Wiki. The Case Studies database uses Semantic MediaWiki and Semantic Forms to collect, annotate, and aggregate data contained on wiki pages about uses of Creative Commons licenses from around the world. For an example page, see the Case Study on Cory Doctorow.

Each Case Study page contains two basic elements: a template call and free text. The free text is unstructured and typically contains no semantic annotations (unless provided by the user). When you create a new wiki page in MediaWiki, you're just editing the page text. We've set up the page-creation form to pre-populate the free text with some suggested structure for the Case Study, but otherwise the free text is a blank slate for whatever content the contributor wants to provide.

The template call is what enables all of the semantic annotations and interesting data queries. In each Case Study, the "Case Study" template is called with parameters defined in that template. The strings in each template call parameter get assigned to semantic properties and processed for rendering (e.g. to turn a string into a link to a wiki page if it exists). There can be many arbitrarily named parameters for any MediaWiki template, and it wouldn't be easy for anyone to add a Case Study if they had to know the proper parameters for the template. Thanks to Semantic Forms, we're able to construct forms for users to fill out that then construct a template call and free text on a page.
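To make the form-to-page step concrete, here is a minimal sketch of how filled-in form fields could be turned into a template call followed by free text. It is not how Semantic Forms is implemented. The function names, parameter names, and values are all hypothetical, and the real "Case Study" template defines its own parameters.

```python
def build_template_call(template_name, fields):
    """Render a MediaWiki template call from parameter names and values."""
    lines = ["{{" + template_name]
    for name, value in fields.items():
        text = "" if value is None else str(value).strip()
        if not text:
            # Leave out fields the contributor did not fill in.
            continue
        lines.append(f"|{name}={text}")
    lines.append("}}")
    return "\n".join(lines)


def build_page(template_name, fields, free_text):
    """Combine the template call and the free text into one page body."""
    return build_template_call(template_name, fields) + "\n\n" + free_text.strip() + "\n"


# Hypothetical form input; these are not the real template parameters.
form_fields = {
    "Description": "An example project released under a CC license",
    "Mediatype": "Text",
    "Country": "",  # left blank in the form, so omitted from the call
}
page_text = build_page("Case Study", form_fields, "Overview of the project goes here.")
print(page_text)
```

The contributor only ever sees the form. The template call is generated for them, which is why they never need to know the parameter names.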

But you might ask: What are the qualities of Semantic MediaWiki that make it useful for my projects?

Semantic MediaWiki enables "view source" for databases. This means that all of the templates, property pages, forms, drilldown filters, and pages are viewable and editable, with a complete page history for each. That is, the markup defining the database is editable by anyone. Of course, the pages could be protected from edits, but in general the markup is at least accessible. This opens up the possibility of user-driven development and rapid feedback. You might define a data structure for a community of users and come back to find that it's been modified to be more useful for its application without you having been involved at all.

Any registered user on the CC Wiki can create a wiki page, and thus any user can contribute an item to the Case Studies database. Because Semantic Forms just creates or modifies template calls on pages, each page constructed or edited with Semantic Forms shows up on Recent Changes, and the complete page history is available for review. In this respect, the data collection process in a Semantic MediaWiki database is transparent. This is important for most kinds of data, since the data you collect will usually either require some subjective assessment or be meant to represent facts. For projects where you expect a large contributor base, you can expect with near-certainty that someone will eventually make a subjective assessment that diverges from common sense, or will add data that misrepresents some important fact. In either case, having a transparent data collection process mitigates the risk of bad data. This holds true for pages outside the main namespace as well (template pages, property pages, forms, etc.).

Lastly, the database structure is highly mutable. In many data collection efforts, particularly those with some idea of how the data will be analyzed or applied, the process of gathering data informs the data you collect. For example, in collecting case studies of Creative Commons licenses, you might find that almost all of them fit into a few media types. With Semantic MediaWiki, it becomes trivial to create a new field in a form and an associated property in the template or, if you have an existing structure for that type of data, to modify the kinds of data that property accepts. You could even change the allowed values for a property or change the data type and easily fix any incompatibilities that arise.

For example, we recently decided to add a method for evaluating Case Studies to the database. All it required was creating a partial form using Semantic Forms that populates two new property mappings in the template (Quality and Importance). This new form just contains two drop-down menus that let users select Quality or Importance values for the page and save that data back into the template call on the page. SMW allowed us to extend the data we collected on each page. Additionally, halfway through the development process we decided to use a different metric for quality. It was trivial to change the list of allowed values on the property page for that property and then query the existing data for pages needing updating to the new metric.
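The last step, finding pages whose stored value no longer matches the allowed list, comes down to a simple filter. On the wiki this would be a Semantic MediaWiki query. The sketch below shows the same logic in plain Python, with hypothetical page names and hypothetical old and new Quality values.

```python
def pages_needing_update(page_values, allowed_values):
    """Return the names of pages whose stored value is not in the allowed list."""
    allowed = set(allowed_values)
    return sorted(
        page
        for page, value in page_values.items()
        # Pages with no value yet have nothing to migrate.
        if value is not None and value not in allowed
    )


# Hypothetical Quality values as stored on pages under the old metric.
stored_quality = {
    "Case Studies/Example A": "Good",
    "Case Studies/Example B": "B",
    "Case Studies/Example C": None,
}
# Hypothetical allowed values under the new metric.
new_allowed = ["A", "B", "C"]

stale = pages_needing_update(stored_quality, new_allowed)
print(stale)
```

Pages that were never evaluated are skipped on purpose, since only existing values need to be migrated to the new metric.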

In short, Semantic MediaWiki is a powerful tool that allows rapid, decentralized development of complex databases with minimal investment in technical infrastructure. It's also a way to create a truly collaborative database that is an asset to you and to your community.