| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: Reshaping Data from Long to Wide (or Wide to Long) |
| 4 | +nav_order: 4 |
| 5 | +parent: NCDS |
| 6 | +format: docusaurus-md |
| 7 | +--- |
| 8 | + |
| 9 | + |
| 10 | + |
| 11 | + |
| 12 | +# Introduction |
| 13 | + |
| 14 | +In this section, we show how to reshape data from long to wide (and vice |
| 15 | +versa). To demonstrate, we use data on cohort members’ height and weight |
| 16 | +collected at Sweeps 4 (23y) and 8 (50y). |
| 17 | + |
| 18 | +The packages we use are: |
| 19 | + |
| 20 | +```r |
| 21 | +# Load Packages |
| 22 | +library(tidyverse) # For data manipulation |
| 23 | +library(haven) # For importing .dta files |
| 24 | +``` |
| 25 | + |
| 26 | +# Reshaping Raw Data from Wide to Long |
| 27 | + |
| 28 | +We begin by loading the data from each sweep and merging these together |
| 29 | +into a single wide format data frame; see [Combining Data Across |
| 30 | +Sweeps](https://cls-data.github.io/docs/ncds-merging_across_sweeps.html) |
| 31 | +for further explanation on how this is achieved. Note, the names of the |
| 32 | +height and weight variables in Sweep 4 and Sweep 8 follow a similar |
| 33 | +convention, which is the exception rather than the rule in NCDS data. |
| 34 | +Below, we convert the variable names in the Sweep 4 data frame to upper |
| 35 | +case so that they closely match those in the Sweep 8 data frame. This |
| 36 | +will make reshaping easier. |
| 37 | + |
| 38 | +```r |
| 39 | +df_23y <- read_dta("23y/ncds4.dta", |
| 40 | +                   col_select = c("ncdsid", "dvwt23", "dvht23")) %>% |
| 41 | +  rename_with(str_to_upper) |
| 42 | + |
| 43 | +df_50y <- read_dta("50y/ncds_2008_followup.dta", |
| 44 | +                   col_select = c("NCDSID", "DVWT50", "DVHT50")) |
| 45 | + |
| 46 | +df_wide <- df_23y %>% |
| 47 | +  full_join(df_50y, by = "NCDSID") |
| 48 | +``` |
| 49 | + |
| 50 | +`df_wide` has 5 columns. Besides the identifier, `NCDSID`, there are 4 |
| 51 | +columns for height and weight measurements at each sweep. Each of these |
| 52 | +4 columns is suffixed by two digits indicating the age at assessment. |
| 53 | +We can reshape the dataset into long format (one row per person x sweep |
| 54 | +combination) using the `pivot_longer()` function so that the resulting |
| 55 | +data frame has four columns: one person identifier, a variable for age |
| 56 | +at assessment (`fup`), and variables for height and weight. We specify |
| 57 | +the columns to be reshaped using the `cols` argument, provide the new |
| 58 | +variable names in the `names_to` argument, and the pattern the existing |
| 59 | +column names take using the `names_pattern` argument. For |
| 60 | +`names_pattern` we specify `"^(.*)(\\d\\d)$"`, which breaks the column |
| 61 | +name into two pieces: the first characters (`"(.*)"`) and two digits at |
| 62 | +the end of the name (`"(\\d\\d)$"`). `names_pattern` uses regular |
| 63 | +expressions. `.` matches any single character, and `.*` extends this to |
| 64 | +match zero or more characters. `\\d` is a special character denoting a |
| 65 | +digit. As noted, the final two digits hold information on age at |
| 66 | +assessment; in the reshaped data frame they are stored as a character |
| 67 | +value in a new column `fup`. `.value` is a placeholder for the new |
| 68 | +columns in the reshaped data frame that store the values from the |
| 69 | +columns selected by `cols`; these new columns are named using the first |
| 70 | +piece from `names_pattern` - in this case `DVHT` (height) and `DVWT` |
| 71 | +(weight). |
| 72 | + |
| 73 | +```r |
| 74 | +df_long <- df_wide %>% |
| 75 | +  pivot_longer(cols = matches("DV(HT|WT)\\d\\d"), |
| 76 | +               names_to = c(".value", "fup"), |
| 77 | +               names_pattern = "^(.*)(\\d\\d)$") |
| 78 | + |
| 79 | +df_long |
| 80 | +``` |
| 81 | + |
| 82 | +``` text |
| 83 | +# A tibble: 28,028 × 4 |
| 84 | + NCDSID fup DVHT DVWT |
| 85 | + <chr> <chr> <dbl+lbl> <dbl+lbl> |
| 86 | + 1 N10001N 23 1.63 59.4 |
| 87 | + 2 N10001N 50 NA 66.7 |
| 88 | + 3 N10002P 23 1.90 73.5 |
| 89 | + 4 N10002P 50 NA 79.4 |
| 90 | + 5 N10004R 23 1.65 76.2 |
| 91 | + 6 N10004R 50 NA NA |
| 92 | + 7 N10007U 23 1.63 52.2 |
| 93 | + 8 N10007U 50 NA 72.1 |
| 94 | + 9 N10009W 23 1.73 66.7 |
| 95 | +10 N10009W 50 1.7 78 |
| 96 | +# ℹ 28,018 more rows |
| 97 | +``` |
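
As a rough illustration of how the pattern splits the column names, the sketch below applies the same regular expression with base R's `sub()`. It is not how `pivot_longer()` works internally; the object names `col_names`, `stems` and `ages` are made up for this example.

```r
# Column names of the measurement variables in `df_wide`
col_names <- c("DVHT23", "DVWT23", "DVHT50", "DVWT50")
pattern <- "^(.*)(\\d\\d)$"

# First capture group: everything before the final two digits
stems <- sub(pattern, "\\1", col_names)

# Second capture group: the final two digits (returned as character)
ages <- sub(pattern, "\\2", col_names)

# Side-by-side view of each column name and its two pieces
data.frame(column = col_names, stem = stems, fup = ages)
```

The `stem` piece corresponds to what `.value` turns into new column names, and the second piece to the values stored in `fup`.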
| 98 | + |
| 99 | +# Reshaping Raw Data from Long to Wide |
| 100 | + |
| 101 | +We can also reshape the data from long to wide format using the |
| 102 | +`pivot_wider()` function. In this case, we want to create two new |
| 103 | +columns for each sweep: one for height and one for weight. We specify |
| 104 | +the columns holding the values using the `values_from` argument, give the |
| 105 | +column whose values identify the sweep (`fup`) in the `names_from` |
| 106 | +argument, and use the `names_glue` argument to set the convention for the |
| 107 | +new column names. The `names_glue` argument uses curly braces (`{}`) to |
| 108 | +reference the values of the `names_from` column and the special `.value` |
| 109 | +placeholder. As we are specifying multiple columns in `values_from`, |
| 110 | +`.value` stands for the name of each variable selected in `values_from`. |
| 111 | + |
| 112 | +```r |
| 113 | +df_long %>% |
| 114 | +  pivot_wider(names_from = fup, |
| 115 | +              values_from = matches("DV(HT|WT)"), |
| 116 | +              names_glue = "{.value}{fup}") |
| 117 | +``` |
| 118 | + |
| 119 | +``` text |
| 120 | +# A tibble: 14,014 × 5 |
| 121 | + NCDSID DVHT23 DVHT50 DVWT23 DVWT50 |
| 122 | + <chr> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> |
| 123 | + 1 N10001N 1.63 NA 59.4 66.7 |
| 124 | + 2 N10002P 1.90 NA 73.5 79.4 |
| 125 | + 3 N10004R 1.65 NA 76.2 NA |
| 126 | + 4 N10007U 1.63 NA 52.2 72.1 |
| 127 | + 5 N10009W 1.73 1.7 66.7 78 |
| 128 | + 6 N10011Q 1.68 1.7 63.5 95 |
| 129 | + 7 N10012R 1.96 NA 114. 133. |
| 130 | + 8 N10013S 1.78 NA 83.5 95.2 |
| 131 | + 9 N10014T 1.55 NA 57.2 63.5 |
| 132 | +10 N10015U 1.80 NA 73.0 78 |
| 133 | +# ℹ 14,004 more rows |
| 134 | +``` |
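
To try the round trip without access to the NCDS files, the following minimal sketch runs both reshapes on a small made-up data frame. The values and the objects `toy_wide`, `toy_long` and `toy_back` are hypothetical and not taken from NCDS; only the column naming convention mirrors the real data.

```r
library(tidyr) # part of the tidyverse; provides pivot_longer() and pivot_wider()

# Hypothetical toy data: the values are invented for illustration
toy_wide <- tibble(
  NCDSID = c("A", "B"),
  DVHT23 = c(1.60, 1.75),
  DVWT23 = c(55, 70),
  DVHT50 = c(1.59, NA),
  DVWT50 = c(60, 82)
)

# Wide to long: one row per person x sweep
toy_long <- toy_wide %>%
  pivot_longer(cols = matches("DV(HT|WT)\\d\\d"),
               names_to = c(".value", "fup"),
               names_pattern = "^(.*)(\\d\\d)$")

# Long back to wide: one row per person, one column per variable x sweep
toy_back <- toy_long %>%
  pivot_wider(names_from = fup,
              values_from = c(DVHT, DVWT),
              names_glue = "{.value}{fup}")
```

Comparing `toy_back` with `toy_wide` column by column is a quick way to confirm that no values were lost or misassigned in the reshape.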
| 135 | + |
| 136 | +Note, in the original `df_wide` tibble, `DVHT23` and `DVWT23` were |
| 137 | +labelled numeric vectors - this class allows users to add metadata to |
| 138 | +variables (value labels, etc.). `DVHT50` and `DVWT50`, on the other |
| 139 | +hand, were standard numeric vectors. When reshaping to long format, |
| 140 | +multiple variables are effectively appended together. The final reshaped |
| 141 | +variables can only have one set of properties. `pivot_longer()` merges |
| 142 | +variables together so as to preserve their attributes, but in some cases |
| 143 | +will throw an error (where variables are of inconsistent types) or print |
| 144 | +a warning (where value labels are inconsistent). Note that above, where we |
| 145 | +reshape `df_long` back to wide format, all weight and height variables |
| 146 | +are now labelled numeric vectors. |