Reviewing data for completeness and accuracy

"Just as the reality of daily life and complex ecosystems have high levels of entropy and thus ‘dirtiness,’ so does the data that surrounds it. We can not use this as an excuse to avoid solving problems..."

-- Brett Goldstein (Formerly CDO of the City of Chicago), Bad Data Handbook

Public data sets vary in completeness and quality. Before releasing data to outside users, every department should strive to ensure its data is as accurate, complete, and up to date as possible. A number of guides, written for both data publishers and data consumers, can be helpful in spotting data quality issues.
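One simple pre-release check is to count missing values in each column. The following is a minimal sketch, not a complete quality review: the function name `missing_value_report`, the list of markers treated as "missing," and the sample permit records are all assumptions made for illustration.

```python
import csv
import io


def missing_value_report(csv_text, empty_markers=("", "NA", "N/A", "null")):
    """Count missing values per column in a block of CSV text.

    The set of markers treated as "missing" is an assumption; adjust it
    to match the conventions used in your own data.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total_rows = 0
    for row in reader:
        total_rows += 1
        for name in counts:
            value = row.get(name)
            # Short rows yield None for absent fields; count those as missing too.
            if value is None or value.strip() in empty_markers:
                counts[name] += 1
    return total_rows, counts


# Hypothetical sample data, for illustration only.
sample = """permit_id,issue_date,address
1001,2014-01-05,123 Main St
1002,,456 Oak Ave
1003,2014-01-09,
"""

total, missing = missing_value_report(sample)
for column, count in missing.items():
    print(f"{column}: {count} of {total} rows missing")
```

A report like this does not say whether the values that are present are accurate, but it gives a quick picture of where the gaps are before release.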

However, there is no such thing as "perfect" data, and treating perfection as a prerequisite for release can become an impediment to releasing data at all.

When releasing a data set, be explicit about any limitations you encountered in preparing it, and add caveats that will help data consumers understand them. If the data is subject to revision, if portions have been redacted, or if it covers only a limited time period, state this clearly as part of your data release. Clearly stated limitations and caveats make your data more usable because consumers will have a more thorough understanding of what it represents. See the discussion on the basic tenets of metadata.
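As an illustration, the caveats described above could be recorded in a machine-readable form alongside the data. This is only a sketch: the function `describe_limitations`, the `limitations` field, and the sample values are hypothetical and do not follow any particular metadata standard.

```python
import json


def describe_limitations(subject_to_revision=False, redacted_fields=(), coverage=None):
    """Return a list of plain-language caveats for a data release.

    All parameter names here are illustrative assumptions.
    """
    caveats = []
    if subject_to_revision:
        caveats.append("Figures are preliminary and subject to revision.")
    if redacted_fields:
        caveats.append("Redacted fields: " + ", ".join(redacted_fields) + ".")
    if coverage is not None:
        start, end = coverage
        caveats.append(f"Covers {start} through {end} only.")
    return caveats


# Hypothetical release record, for illustration only.
metadata = {
    "title": "Example permits data set",
    "limitations": describe_limitations(
        subject_to_revision=True,
        redacted_fields=["applicant_name"],
        coverage=("2014-01-01", "2014-06-30"),
    ),
}

print(json.dumps(metadata, indent=2))
```

Keeping the caveats in the same record as the title and description means they travel with the data set rather than living in a separate document that consumers may never see.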

It’s also worth noting that communities of data consumers will often provide feedback on the quality of data and on any perceived inaccuracies. This feedback can be invaluable for improving the quality of your data.

For a good example of this dynamic at work, view some of the discussions on the SEPTA Developer Google Group, or the City of Philadelphia's open data public forum.