Data is not binary

Science, data, internet, ontology, work and non-work themes converging – my post on O’Reilly Radar, reposted below

Why open data requires credibility and transparency.

by Gavin Starks | June 30, 2010

Guest blogger Gavin Starks is founder and CEO of AMEE, a neutral aggregation platform designed to measure and track all the energy data in the world.

The World Bank has stated that “data in document format is effectively useless”.

However, “open data” is only the beginning of a journey. Simply carrying over the rules of open source software may help us take the first steps, but there are new categories of challenges to face.

Data needs to be computable (i.e., acted upon in context)

“Data” is a much broader term than “code.” The term embodies a range of dimensions: there are more than just the numbers at play, especially with scientific data.

How was the data collected?
How should the data be used?
Are the models for processing the data valid?
What assumptions exist, in words and equations?
What is the significance of the assumptions?

In an age when peer review is an anachronism, we are searching for new solutions for “scientific content management”. When Pascal’s Wager is invoked, it is equally important to remember Gödel’s incompleteness theorems (in any consistent formal system complex enough to express arithmetic, there are true statements that cannot be proved within the system).

Only eight percent of members of the Scientific Research Society agreed that “peer review works well as it is” (Chubin and Hackett, 1990; p.192). Peer review has also been claimed to be “a non-validated charade whose processes generate results little better than does chance.” But in the same context: “Peer review is central to the organization of modern science … why not apply scientific [and engineering] methods to the peer review process” (Horrobin, 2001). The absence of URLs for those two pieces of research is indicative of one of the problems we are trying to solve.

Peer review persists in its current form for historical reasons, but it now occupies a niche, because technology has opened up the use of data to a mass audience.

We must build tools that enable credible engagement

To illustrate our story: we are engaged with the very pressing and complex issue of climate change. At AMEE we codify international, government, and proprietary data, models and methodologies that represent, at the most fundamental level, the algorithms that enable the energy, carbon and environmental cost of consumption and activities to be calculated. AMEE doesn’t just store and re-broadcast data, it performs the calculations based on inputs to the models.

One of our challenges is getting at the raw data in a useful, repeatable, and traceable form. As a result, one of the core services we offer to data and standards managers is a set of tools that enable this.

Releasing raw data is vital. There can be no excuse not to. Releasing source code is optional. It’s truly great for open source review, but it’s also dangerous if everyone just re-runs the same code with the same baked-in implicit and explicit assumptions and errors.

This is where data and code deviate substantially. The logic cascade for the interpretation of data is not unary (there is no single interpretation). It is based on assumptions that may vary and are subject to many quantitative and qualitative inputs, so the interpretation of the data is not even binary.

We believe it’s much better to publish the following five components to provide transparent and auditable disclosure:

The raw data
The circumstances of its collection
The method and assumptions used to process the data (in words and equations)
The results of the processing
The known limitations on the method and significance of the assumptions
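As a minimal sketch of what bundling these five components could look like in practice, the Python record below keeps them together and reports any that were left empty. The class name `Disclosure` and its field names are illustrative assumptions, not part of any AMEE format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disclosure:
    """One published dataset, bundled with the five components listed above.

    All names here are hypothetical and chosen only for illustration.
    """
    raw_data: tuple         # the raw data, unmodified
    collection_notes: str   # the circumstances of its collection
    method: str             # method and assumptions, in words and equations
    results: tuple          # the results of the processing
    limitations: str        # known limitations and significance of assumptions

    def missing_components(self) -> list:
        """Return the names of any components that were left empty."""
        return [name for name, value in vars(self).items() if not value]
```

A publisher could refuse to release a record until `missing_components()` returns an empty list, which makes an incomplete disclosure visible before it reaches users.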

The processing code should be written from scratch as many times as possible to reduce the chance that it affected the results in any way.
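To make the idea concrete, here is a small Python sketch that runs two independently written implementations of the same calculation (a simple mean, chosen only as a stand-in) and checks that they agree. The function names and the tolerance are assumptions made for the example.

```python
import math


def mean_running(values):
    """First implementation: incremental (running) mean."""
    mean = 0.0
    for count, value in enumerate(values, start=1):
        mean += (value - mean) / count
    return mean


def mean_summed(values):
    """Second implementation, written independently: sum, then divide."""
    return math.fsum(values) / len(values)


def cross_check(values, implementations, rel_tol=1e-9):
    """Run every implementation and report whether all results agree."""
    results = [impl(values) for impl in implementations]
    agree = all(
        math.isclose(results[0], other, rel_tol=rel_tol)
        for other in results[1:]
    )
    return agree, results
```

Agreement between implementations does not prove the method is right, since both may share the same assumptions, but a disagreement is a cheap signal that the code rather than the data is shaping the result.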

Once “published,” the challenge is how to build out a credible, and usable, set of services that encourage correct usage.

Building the solution stack

At AMEE we have developed a six-tier solution to try to address some of these issues. Specifically, we address the gap between content creators/managers (e.g. standards bodies) and content users (e.g. software apps, consultants, auditors), with a solution that is both human- and machine-readable.

1. Aggregation — We aggregate the raw data, and track and log the sources. We have a standards spider that checks for changes, not unlike a search engine spider.

2. Content Enhancement — In the process of aggregation, we document the data, and embed provenance, linking back to the source. We also add authority, a measure of the reliability and credibility of the source. We’re beginning to add other taxonomies and semantic links that enable the data to be joined, and are building tools for engagement with the platform to stimulate discussion.

3. Discoverability — AMEE Explorer is the human-readable version of the data, and the only search engine on carbon calculation models (N.B.: we are focused on the industrial and human impacts at the moment, not modeling the climate itself).

4. Repeatable Quality — We have a quality-control process around the underlying data that is similar to a Six Sigma process. Our systems self-test the data every 30 minutes, and human checks are carried out at random intervals to ensure systemic errors have not been introduced. Our target accuracy metric is 100 percent, not five-nines.

5. Computable Engine — We believe we are taking the notion of a master database service to an entirely new level by ensuring that not only is the data robust, but AMEE also performs the actual calculations. AMEE retains an audit history behind both the inputs and the calculations themselves.

6. Interoperability and auditability — The AMEE API is the machine-readable version of the data (in fact all of the content, including metadata and documentation), which enables the calculations to be done. AMEE also stores the audit history of both the inputs and the calculation mechanics. For example: PUT a (flight in an F-15 from London to New York at combat thrust), and GET the kgCO2 for that journey, or PUT (1000kWh reported by my Whirlpool fridge for this month, in Washington, using my preferred energy supplier and my solar panels) and GET the kgCO2.
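Two of the tiers above, tracking changes to sources and performing calculations with an audit history, can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the AMEE API: the `Engine` class, its method names, and the audit entry layout are hypothetical, and the factor passed in is a placeholder rather than a real emission factor.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(content: bytes) -> str:
    """Stable digest of a source document, used to detect changes."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class Engine:
    # Placeholder factor in kgCO2 per kWh. It is NOT a real emission factor.
    factor_kg_per_kwh: float
    factor_source: str
    source_digests: dict = field(default_factory=dict)  # source -> last digest
    audit_log: list = field(default_factory=list)       # one entry per event

    def record_source(self, source: str, content: bytes) -> bool:
        """Tier 1: log a source only when it is new or has changed."""
        digest = fingerprint(content)
        if self.source_digests.get(source) == digest:
            return False
        self.source_digests[source] = digest
        self.audit_log.append(
            {"event": "source_changed", "source": source, "digest": digest}
        )
        return True

    def kg_co2(self, kwh: float) -> float:
        """Tiers 5 and 6: calculate, keeping the inputs and the mechanics."""
        result = kwh * self.factor_kg_per_kwh
        self.audit_log.append(
            {
                "event": "calculation",
                "input_kwh": kwh,
                "factor": self.factor_kg_per_kwh,
                "factor_source": self.factor_source,
                "result_kg_co2": result,
            }
        )
        return result
```

Because every calculation entry records the factor and where it came from alongside the input, a later reader can reproduce the result or see exactly which assumption changed between two runs.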

Challenges

AMEE is positioned right at the junction between cloud, code, API, content, data, and the usage of the data, and as carbon becomes priced, we believe the consequences of getting it wrong are extremely high.

From an “open” standpoint, one of the big challenges we face is defining where the boundaries of “open” lie. Our value, of course, is in the ongoing maintenance and reliability of the system, and connecting the data.

Commercially, we are treading very carefully through the platform and use-case stack (core platform, API, data, algorithms, code, structure, etc.), and increasing transparency at the most relevant points for the end-user (who needs to feel confident about their own inputs and outputs). It’s a complex stack, and no open source or Creative Commons licenses wholly cover the kinds of issues we face.

Our field, carbon footprinting, is what we call a “non-trivial” example of where open data meets the markets: billions of dollars are flowing through or around these data on the carbon markets. For example, thousands of businesses in the UK have to start reporting their carbon footprint to the government this year, and paying for it next year. Very, very few people understand how to use this data, how it all joins together, where the trap doors are, and why it’s important to build an industry-stack to solve the problem.

If we don’t build a credible industry stack, from the ground up, the outcome could be no industry at all (or a tiny one). That would not only have dire consequences for the vendors and businesses in the space (such as SAP, SAS, CA, Microsoft, Google, and others), but would also remove our ability to accelerate solving the underlying issue of carbon and climate change itself. The root cause of this credibility gap has been a lack of transparency, and no one has comprehensively joined the dots to see what is real, and what is not.

We also believe this kind of approach has huge value in many areas beyond the ones AMEE is addressing.

Open data isn’t just about re-broadcasting data, but combining it, re-using it and building upon it. It’s about creating new uses, creating new markets and building credibility into the data as it flows.