R Tutorial: Data Restructuring and Correlations

---
This section will explore techniques that are useful for longitudinal data, data restructuring, and correlations over time.

Data are often stored in wide format, where each measurement occasion is a separate column and each individual unit has one row. However, most longitudinal analyses in R expect long format, where the measurements are stacked on top of one another, with one variable for time and one for the measurement value.
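To make the two layouts concrete, here is a minimal sketch using a made-up data set. The column names (id, t1, t2, t3, time, value) and all numbers are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data: two units, each measured on three occasions.

# Wide format: one row per unit, one column per measurement occasion.
wide <- data.frame(
  id = c(1, 2),
  t1 = c(10, 20),
  t2 = c(11, 22),
  t3 = c(13, 25)
)

# Long format: one row per unit and occasion, with a time variable
# and a single column holding the measurement value.
long <- data.frame(
  id    = rep(c(1, 2), each = 3),
  time  = rep(1:3, times = 2),
  value = c(10, 11, 13, 20, 22, 25)
)

print(wide)
print(long)
```

Both objects hold the same six measurements; only the arrangement differs.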

The tidyr package can be used for data restructuring. The gather function is used for restructuring data from wide to long and the spread function is used for restructuring data from long to wide. You can learn more about the tidyr package in other DataCamp courses.

Restructuring from long to wide is less common than wide to long, but it is useful for calculating correlations between the outcomes over time. To change the BodyWeight data from long to wide, mutate first creates a new variable that adds the prefix Time_ to the numeric time variable, which ensures valid column names in R. Next, spread restructures the data using two arguments: the first is the column whose values become the names of the new wide columns, and the second is the column that supplies the values for those new columns. Finally, select puts the time variables in chronological order, replacing the alphabetical default order. The resulting wide format is shown for the first three rows.
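The steps just described might look roughly like the sketch below. It assumes the BodyWeight data come from the nlme package (with columns weight, Time, Rat, and Diet); the object name BodyWeight_wide is only illustrative. Note that in recent tidyr versions spread is superseded by pivot_wider, although it still works.

```r
library(nlme)   # assumed source of the BodyWeight data
library(dplyr)
library(tidyr)

BodyWeight_wide <- BodyWeight %>%
  as.data.frame() %>%                        # drop the groupedData class from nlme
  mutate(Time = paste0("Time_", Time)) %>%   # 1, 8, ... become "Time_1", "Time_8", ...
  spread(Time, weight) %>%                   # key = Time, value = weight
  select(Rat, Diet,                          # reorder occasions chronologically
         Time_1, Time_8, Time_15, Time_22, Time_29, Time_36,
         Time_43, Time_44, Time_50, Time_57, Time_64)

# Show the first three rows of the wide data
head(BodyWeight_wide, n = 3)
```

Without the select step, the columns would be sorted as text, so Time_8 would land after Time_64.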

Using the wide structure created in the previous example, the gather function restructures the data back to long format. The function takes the data as the first argument. The next two arguments, key and value, are the names of the new columns in the long format: key names the column that will hold the former wide column names, and value names the column that will hold the data stored in those columns. Finally, the variables to be restructured need to be specified. The colon operator specifies a range of variables, so the syntax Time_1:Time_64 is read as "combine all variables between Time_1 and Time_64". The resulting output shows the first six rows of the restructured data.
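A self-contained sketch of this round trip is shown below. It again assumes BodyWeight comes from the nlme package, and the object names BodyWeight_wide and BodyWeight_long are illustrative. In recent tidyr versions gather is superseded by pivot_longer, although it still works.

```r
library(nlme)   # assumed source of the BodyWeight data
library(dplyr)
library(tidyr)

# Rebuild the wide data so this example runs on its own
BodyWeight_wide <- BodyWeight %>%
  as.data.frame() %>%
  mutate(Time = paste0("Time_", Time)) %>%
  spread(Time, weight) %>%
  select(Rat, Diet,
         Time_1, Time_8, Time_15, Time_22, Time_29, Time_36,
         Time_43, Time_44, Time_50, Time_57, Time_64)

# Wide back to long: the former column names go into Time,
# the stored measurements go into weight
BodyWeight_long <- BodyWeight_wide %>%
  gather(key = Time, value = weight, Time_1:Time_64)

# Show the first six rows of the long data
head(BodyWeight_long)
```

The range Time_1:Time_64 depends on column position, so it only captures every occasion because the columns were put in chronological order first.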

Exploration of correlations over time can show how dependent the multiple measurements are for longitudinal data, and how the dependency changes over time. Does the correlation stay constant or decrease over time?

Three functions from the corrr package, correlate, shave, and fashion, will help explore these questions. The correlate function calculates the correlation matrix, shave removes extraneous information, and fashion formats the correlation matrix. Let's explore an example.

Here the BodyWeight data is converted from long to wide format, as shown earlier. The select function selects all the measurement occasions, which are passed to the correlate function from corrr. The shave function is used to keep only the upper portion of the correlation matrix. Fashion formats the output with three decimals. The output shows nearly constant correlations across all time points, and all of them are very large. The controlled experimental setting has likely inflated the correlations, as the researcher had a large amount of control over the conditions.
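The pipeline described here could be sketched as follows. It assumes BodyWeight comes from the nlme package; the object name corr_structure is illustrative. Because shave blanks the upper triangle by default, the sketch passes upper = FALSE so that the upper portion is the part that is kept.

```r
library(nlme)    # assumed source of the BodyWeight data
library(dplyr)
library(tidyr)
library(corrr)

corr_structure <- BodyWeight %>%
  as.data.frame() %>%
  mutate(Time = paste0("Time_", Time)) %>%
  spread(Time, weight) %>%
  select(Time_1, Time_8, Time_15, Time_22, Time_29, Time_36,
         Time_43, Time_44, Time_50, Time_57, Time_64) %>%  # measurement occasions only
  correlate() %>%               # correlation matrix as a data frame
  shave(upper = FALSE) %>%      # blank the lower triangle, keep the upper
  fashion(decimals = 3)         # format with three decimals

corr_structure
```

Reading along a row shows how the correlation with one occasion changes as the other occasion moves further away in time.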

Time to practice working with longitudinal data.
