Chapter 5 Import

Getting data into your analysis platform is step zero. Working in excel or in the raw data is always bad practice. The best way to approach data analysis is through a medium that was designed to do the analsys. This also allows you to leave the raw data unchanged. This is critically important for reproducibility and repeatability.

5.1 Generic Files

There are many ways to read data into R, as many in fact as there are data storage methods. Generally smaller data will come to us in excel spread sheets or comma deliminted files (CSVs). These data can be read in using some packages from the readr and readxl packages. For instance reading in a CSV file can be done with read_csv, a function from read_csv. There is a base R function called read.csv that also accomplishes the job, but it often requires that we specify additional information.

library(readr)
df <- read_csv("data/my_csv.csv")

Additionally if data comes to us in excel files, the package that we can use is readxl. Keeping with the same syntax we can specify the argument as follows:

library(readxl)
df <- read_excel("data/my_excel.xlsx", sheet = "Sheet1", range = "A1:D20")

In the reading in the excel file I have passed some additional arguments to the read_excel function. These are not required if the excel spreadsheet only has one tab, but it is a good practice to specify the tab name regardless in case a new tab is added in the future. I also passed an argument for the cell range. This is a good practice if someone has built a table in the excel spreadhseet and you want to capture just that selection. If you want to read in the entire spreadhseet that is possible by removing this additional range arguement.

Moving into some more exotic data types include SAS, Stata, and SPSS data formats. These too are easily read into R with the haven package.

library(haven)
df <-read_sas("data/sas_data.sas7bdat")

df <- read_spss("data.sas_data.sav")

df <- read_dta("data/stat_data.dta")

The latest style of storing data is in the cloud. One of the more common and popular alternatives is storing data in googlesheets. Googlsheets are flexible, can be fed directly through google forms, and can be shared dynamically over the internet. Luckily, we can also use this to our advantage and read the googlesheet directly from the cloud with the googlesheets package.

library(googlesheets)
my_google_object <- gs_gs()
df <- googlesheets::gs_read_csv(my_google_object)

5.2 Databases

Sometimes the data are just took big to be shared though flat files. If this is the case then you are talking about pulling data from a database. Any easy way to think about a database is that a database is a collection of spreadsheets. Each table stores some information, and the data from one table can be linked to another table through the use of a key. Depending on the database architecture this can be more complicated, but the gist is a collection of big spreadheets in one location that you can cross reference.

Pulling data from servers will most likely require that you have access to the database. This can come in the form of a username and password. With databases there can be user level permissions so a database administrator (DBA) will help set up a particular view, and in this way you get the data that you want and don’t have to worry about additional access. Sometimes this isn’t the case and you will have to have access to all of the tables. If that is the case you will have to manually build your own view that you wish to store. But before we get into that, let’s get connected first.

library(DBI)
library(odbc)
library(nycflights13)
my_con <- dbConnect(odbc::odbc(),
  driver = "PostgreSQL Driver",
  database = "test_db",
  uid = "postgres",
  pwd = "password",
  host = "localhost",
  port = 5432)

library(DBI)
con <- dbConnect(odbc::odbc(),
  .connection_string = "Driver={PostgreSQL Driver};Uid=postgres;Pwd=password;Host=localhost;Port=5432;Database=test_db;")
dbListTables(my_con)
library(DBI)
install.packages("RODBCDBI")
#> Loading required package: methods
rodbc <- dbConnect(RODBCDBI::ODBC(), dsn = "PostgreSQL")
system.time(rodbc_result <- dbReadTable(rodbc, "flights"))