Chapter 17 Data Formatting Standards

You must be the change you wish to see in the world - Gandhi

17.1 Formatting

What a boring subject, right? Data format is something that we often take for granted until we run into a problem, and then it can be a big one. My personal opinion is that reporting and data science professionals typically wait too long to get involved in establishing data standards. Sometimes this is due to legacy systems that sprang up at different times and in different parts of the enterprise. Some people use one date format while others use another. Some people use a field for a purpose different from its intent because, at the time, no one else was using it. Then there is the question of character versus numeric formats for student identification numbers (is it really a number?).
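The student identification question can be made concrete with a minimal sketch in Python. The sample data and the column names `student_id` and `score` are hypothetical, invented purely for illustration. The point is that an ID with leading zeros survives when stored as text and is silently altered when treated as a number.

```python
import csv
import io

# Hypothetical sample data: the IDs carry leading zeros that matter.
raw = "student_id,score\n00123,88\n04567,92\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# The csv module reads every field as text, so the IDs survive intact.
ids_as_text = [row["student_id"] for row in rows]

# Converting to int drops the leading zeros, so "00123" becomes 123.
ids_as_numbers = [int(row["student_id"]) for row in rows]

print(ids_as_text)
print(ids_as_numbers)
```

If nobody will ever do arithmetic on a field, storing it as text is usually the safer choice.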

Because of these many issues and the organic nature of data systems, sometimes we are stuck with what we have. Admitting a problem is the first step to recovery. As organisations have become more cognizant of the need for clean, reliable data, the focus on the extract, transform, load (ETL) process has grown, and now is the time for the consumers of the data to engage on the data format side. While it may take a long time to make significant changes to legacy systems, there is no reason not to help develop plans for migration to more standard and consistent formats. This includes using the correct nomenclature on all of the documents and data sources over which you have control. That even includes your own spreadsheets. For example, Karl Broman and Kara Woo published an entire article on this very subject, "Data Organization in Spreadsheets," in The American Statistician.

17.2 Some guidelines

While this list is not exhaustive, it does give a few ideas to help you be consistent.

  • Be consistent in naming conventions

  • Use YYYY-MM-DD format for all dates

  • Keep a metadata table that lists what each field is, its units, and its format
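The three guidelines above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a definitive implementation. The list of input date formats, the helper names (`to_iso_date`, `to_snake_case`, `undocumented_fields`), and the metadata entries are all hypothetical. In particular, the sketch assumes that slash-separated dates are month-first, which you would need to confirm for your own sources.

```python
import re
from datetime import datetime

# Assumed input formats; slash-separated dates are taken to be month-first.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def to_snake_case(name):
    """Lowercase a column name and join its words with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(word.lower() for word in words)


# Hypothetical metadata table: one entry per field.
METADATA = [
    {"field": "student_id", "description": "Student identifier",
     "units": "none", "format": "text"},
    {"field": "enroll_date", "description": "Date of enrollment",
     "units": "none", "format": "YYYY-MM-DD"},
]


def undocumented_fields(columns):
    """Return the columns that have no entry in the metadata table."""
    documented = {entry["field"] for entry in METADATA}
    return [col for col in columns if col not in documented]


print(to_iso_date("03/04/2021"))
print(to_snake_case("Enroll Date"))
print(undocumented_fields(["student_id", "gpa"]))
```

Raising an error on an unrecognized date, rather than guessing, keeps bad values from slipping quietly into a cleaned data set.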

XKCD Comic