Chapter 4 Cleaning and Transforming Data

Cleaning data easily consumes the majority of a data scientist's time. From mundane tasks such as reading in data and correcting typos (where the power of regular expressions cannot be overstated) to parsing a JSON file, the data scientist has to be adept. A general rule of data science is that if the data are available, they are not in the shape or format you need. That is a bit of hyperbole, but at times it can feel that way.
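As a small sketch of the typo-correction side of this work, a few base R regular-expression calls go a long way (the vector of city names below is invented for illustration):

```r
# Invented messy city names: stray spaces and inconsistent case
cities <- c("New  York", "new york", "NEW YORK ")

clean <- trimws(cities)                      # drop leading/trailing whitespace
clean <- gsub(" +", " ", clean)              # collapse repeated spaces
clean <- tools::toTitleCase(tolower(clean))  # normalise the case

clean
## [1] "New York" "New York" "New York"
```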

4.1 Searching for issues

The first rule of any data analysis is that there is an issue with your data; if you don't see it yet, you just haven't found it. While this is a little tongue-in-cheek, it is often the reality that your data have issues, from simple formatting problems to being the wrong shape entirely. As such, transforming the data into a shape suitable for analysis is often the second step.

4.1.1 Checks for data integrity

library(nycflights13)
# count the missing values in a single column
check_na <- function(x) {
  sum(is.na(x))
}
purrr::map_df(flights, check_na)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <int>    <int>
## 1     0     0     0     8255              0      8255     8713
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <int>,
## #   carrier <int>, flight <int>, tailnum <int>, origin <int>, dest <int>,
## #   air_time <int>, distance <int>, hour <int>, minute <int>,
## #   time_hour <int>
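Missing-value counts are only one integrity check. Two others worth running early are duplicate-key checks and range checks; the sketch below uses an invented toy data frame, since the right checks depend on the data at hand:

```r
# Toy data frame with a duplicated id and an implausible delay (invented)
df <- data.frame(id = c(1, 2, 2, 4),
                 dep_delay = c(-5, 20, 1500, NA))

sum(duplicated(df$id))             # duplicated keys
## [1] 1
range(df$dep_delay, na.rm = TRUE)  # extremes make implausible values stand out
## [1]   -5 1500
```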

4.2 Wrangling

4.2.1 Gather/Spread
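A minimal sketch of reshaping with tidyr, using an invented table of quarterly sales (tidyr is assumed to be installed; in newer tidyr, `pivot_longer()` and `pivot_wider()` supersede these functions):

```r
library(tidyr)

# Invented wide table: one row per product, one column per quarter
wide <- data.frame(product = c("A", "B"),
                   q1 = c(10, 20), q2 = c(12, 24))

# gather: wide to long, one row per product-quarter pair
long <- gather(wide, key = "quarter", value = "sales", q1, q2)

# spread: long back to wide
spread(long, key = quarter, value = sales)
```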

4.2.2 SQL-esque
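The dplyr join verbs mirror their SQL counterparts; a sketch with two invented lookup tables (dplyr is assumed to be installed):

```r
library(dplyr)

# Invented tables: flight counts by carrier, and a carrier lookup
counts  <- data.frame(carrier = c("AA", "ZZ"), n = c(2, 1))
lookup  <- data.frame(carrier = "AA", name = "American")

inner_join(counts, lookup, by = "carrier")  # like SQL INNER JOIN: matches only
left_join(counts, lookup, by = "carrier")   # like SQL LEFT JOIN: keeps "ZZ"
```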

4.3 Transformations

4.3.1 Logs
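Log transforms compress right-skewed data so that multiplicative differences become additive; a quick sketch with invented values:

```r
x <- c(1, 10, 100, 1000)  # invented, strongly right-skewed values

log(x)      # natural log
log10(x)    # base 10: 0 1 2 3
log1p(x)    # log(1 + x), safe when the data contain zeros
```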

4.3.2 Square Roots
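The square root is a milder transform than the log and, unlike the log, is defined at zero, which makes it a common choice for count data; a sketch with invented counts:

```r
counts <- c(0, 1, 4, 9, 16)  # invented count data

sqrt(counts)
## [1] 0 1 2 3 4
```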