Commit 7cb5701

Add NCDS reshape page
1 parent 3330572 commit 7cb5701

6 files changed: +217 -4 lines changed

docs/ncds-data_discovery.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
---
22
layout: default
33
title: Data Discovery
4-
nav_order: 1
4+
nav_order: 2
55
parent: NCDS
66
format: docusaurus-md
77
---

docs/ncds-merging_across_sweeps.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
---
22
layout: default
33
title: Combining Data Across Sweeps
4-
nav_order: 2
4+
nav_order: 3
55
parent: NCDS
66
format: docusaurus-md
77
---

docs/ncds-reshape_long_wide.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
1+
---
2+
layout: default
3+
title: Reshaping Data from Long to Wide (or Wide to Long)
4+
nav_order: 4
5+
parent: NCDS
6+
format: docusaurus-md
7+
---
8+
9+
10+
11+
12+
# Introduction
13+
14+
In this section, we show how to reshape data from long to wide (and vice
15+
versa). To demonstrate, we use data from Sweeps 4 (23y) and 8 (50y) on
16+
cohort members’ height and weight.
17+
18+
The packages we use are:
19+
20+
```r
21+
# Load Packages
22+
library(tidyverse) # For data manipulation
23+
library(haven) # For importing .dta files
24+
```
25+
26+
# Reshaping Raw Data from Wide to Long
27+
28+
We begin by loading the data from each sweep and merging these together
29+
into a single wide format data frame; see [Combining Data Across
30+
Sweeps](https://cls-data.github.io/docs/ncds-merging_across_sweeps.html)
31+
for further explanation on how this is achieved. Note, the names of the
32+
height and weight variables in Sweep 4 and Sweep 8 follow a similar
33+
convention, which is the exception rather than the rule in NCDS data.
34+
Below, we convert the variable names in the Sweep 4 data frame to upper
35+
case so that they closely match those in the Sweep 8 data frame. This
36+
will make reshaping easier.
37+
38+
```r
39+
df_23y <- read_dta("23y/ncds4.dta",
40+
col_select = c("ncdsid", "dvwt23", "dvht23")) %>%
41+
rename_with(str_to_upper)
42+
43+
df_50y <- read_dta("50y/ncds_2008_followup.dta",
44+
col_select = c("NCDSID", "DVWT50", "DVHT50"))
45+
46+
df_wide <- df_23y %>%
47+
full_join(df_50y, by = "NCDSID")
48+
```
49+
50+
`df_wide` has 5 columns. Besides the identifier, `NCDSID`, there are 4
51+
columns for height and weight measurements at each sweep. Each of these
52+
4 columns is suffixed by two numbers indicating the age at assessment.
53+
We can reshape the dataset into long format (one row per person x sweep
54+
combination) using the `pivot_longer()` function so that the resulting
55+
data frame has four columns: one person identifier, a variable for age
56+
of assessment (`fup`), and variables for height and weight. We specify
57+
the columns to be reshaped using the `cols` argument, provide the new
58+
variable names in the `names_to` argument, and the pattern the existing
59+
column names take using the `names_pattern` argument. For
60+
`names_pattern` we specify `"^(.*)(\\d\\d)$"`, which breaks the column
61+
name into two pieces: the first characters (`"(.*)"`) and two digits at
62+
the end of the name (`"(\\d\\d)$"`). `names_pattern` uses regular
63+
expressions. `.` matches any single character, and `.*` modifies this to
64+
match zero or more characters. `\\d` is a special sequence denoting a
65+
digit. As noted, the final two digits hold information on age
66+
of assessment; in the reshaped data frame the two-digit string is stored as a
67+
value in a new column `fup`. `.value` is a placeholder for the new
68+
columns in the reshaped data frame that store the values from the
69+
columns selected by `cols`; these new columns are named using the first
70+
piece from `names_pattern` - in this case `DVHT` (height) and `DVWT`
71+
(weight).
72+
73+
```r
74+
df_long <- df_wide %>%
75+
pivot_longer(cols = matches("DV(HT|WT)\\d\\d"),
76+
names_to = c(".value", "fup"),
77+
names_pattern = "^(.*)(\\d\\d)$")
78+
79+
df_long
80+
```
81+
82+
``` text
83+
# A tibble: 28,028 × 4
84+
NCDSID fup DVHT DVWT
85+
<chr> <chr> <dbl+lbl> <dbl+lbl>
86+
1 N10001N 23 1.63 59.4
87+
2 N10001N 50 NA 66.7
88+
3 N10002P 23 1.90 73.5
89+
4 N10002P 50 NA 79.4
90+
5 N10004R 23 1.65 76.2
91+
6 N10004R 50 NA NA
92+
7 N10007U 23 1.63 52.2
93+
8 N10007U 50 NA 72.1
94+
9 N10009W 23 1.73 66.7
95+
10 N10009W 50 1.7 78
96+
# ℹ 28,018 more rows
97+
```
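
To see how `names_pattern` splits each column name, the regular expression can be tried directly on the names themselves. The following is a minimal sketch using `str_match()` from `stringr` (loaded with the tidyverse); the vector `col_names` is typed out by hand for illustration rather than read from the data.

```r
library(stringr)

# Column names as they appear in df_wide (typed by hand for illustration)
col_names <- c("DVHT23", "DVWT23", "DVHT50", "DVWT50")

# Column 1 of the result holds the full match; columns 2 and 3 hold the
# two captured pieces defined by the brackets in the pattern
pieces <- str_match(col_names, "^(.*)(\\d\\d)$")

pieces[, 2] # first piece: supplies the new column names via `.value`
pieces[, 3] # second piece: supplies the values stored in `fup`
```

Because the second piece is extracted from text, `fup` is stored as a character column; it can be converted with `as.integer()` if a numeric age is needed.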
98+
99+
# Reshaping Raw Data from Long to Wide
100+
101+
We can also reshape the data from long to wide format using the
102+
`pivot_wider()` function. In this case, we want to create two new
103+
columns for each sweep: one for height and one for weight. We specify
104+
the columns to be reshaped using the `values_from` argument, name the
105+
column whose values distinguish the sweeps (`fup`) in the `names_from` argument, and use the `names_glue`
106+
argument to specify the convention to follow for the new column names.
107+
The `names_glue` argument uses curly braces (`{}`) to reference the
108+
values of the `names_from` column (here `fup`) and the special name `.value`. As we are
109+
specifying multiple columns in `values_from`, `.value` is a placeholder
110+
for the names of the variables selected in `values_from`.
111+
112+
```r
113+
df_long %>%
114+
pivot_wider(names_from = fup,
115+
values_from = matches("DV(HT|WT)"),
116+
names_glue = "{.value}{fup}")
117+
```
118+
119+
``` text
120+
# A tibble: 14,014 × 5
121+
NCDSID DVHT23 DVHT50 DVWT23 DVWT50
122+
<chr> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
123+
1 N10001N 1.63 NA 59.4 66.7
124+
2 N10002P 1.90 NA 73.5 79.4
125+
3 N10004R 1.65 NA 76.2 NA
126+
4 N10007U 1.63 NA 52.2 72.1
127+
5 N10009W 1.73 1.7 66.7 78
128+
6 N10011Q 1.68 1.7 63.5 95
129+
7 N10012R 1.96 NA 114. 133.
130+
8 N10013S 1.78 NA 83.5 95.2
131+
9 N10014T 1.55 NA 57.2 63.5
132+
10 N10015U 1.80 NA 73.0 78
133+
# ℹ 14,004 more rows
134+
```
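
The effect of `names_glue` can be checked on a small hand-made data frame. This is a sketch only: `toy_long` and its values are invented for illustration and are not NCDS data.

```r
library(tidyverse)

# Invented long-format data: two people, two sweeps each
toy_long <- tibble(
  NCDSID = c("A", "A", "B", "B"),
  fup    = c("23", "50", "23", "50"),
  DVHT   = c(1.63, 1.62, 1.90, 1.89),
  DVWT   = c(59.4, 66.7, 73.5, 79.4)
)

# "{.value}{fup}" pastes the variable name and the sweep suffix together
toy_wide <- toy_long %>%
  pivot_wider(names_from = fup,
              values_from = c(DVHT, DVWT),
              names_glue = "{.value}{fup}")

names(toy_wide)
```

Reversing the order inside the glue string (for example `"{fup}_{.value}"`) would instead put the age first in each new column name.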
135+
136+
Note, in the original `df_wide` tibble, `DVHT23` and `DVWT23` were
137+
labelled numeric vectors - this class allows users to add metadata to
138+
variables (value labels, etc.). `DVHT50` and `DVWT50`, on the other
139+
hand, were standard numeric vectors. When reshaping to long format,
140+
multiple variables are effectively appended together. The final reshaped
141+
variables can only have one set of properties. `pivot_longer()` merges
142+
variables together so as to preserve variable attributes, but in some cases
143+
will throw an error (where variables are of inconsistent types) or print
144+
a warning (where value labels are inconsistent). Note above, where we
145+
reshape `df_long` back to wide format, all weight and height variables
146+
now have labelled numeric type.
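
If value labels are not needed, one possible way to avoid mixing attributes is to strip them before reshaping with `zap_labels()` from `haven`. The sketch below uses invented data: `toy_wide`, its values, and the "Not known" label for `-1` are assumptions made for illustration, not taken from the NCDS files.

```r
library(tidyverse)
library(haven)

# Invented wide data: one labelled column and one plain numeric column
toy_wide <- tibble(
  NCDSID = c("A", "B"),
  DVHT23 = labelled(c(1.63, -1), labels = c("Not known" = -1)), # hypothetical label
  DVHT50 = c(1.70, 1.82)
)

# Remove value labels first so every column entering the reshape is plain numeric
toy_long <- toy_wide %>%
  mutate(across(matches("^DVHT\\d\\d$"), zap_labels)) %>%
  pivot_longer(cols = matches("^DVHT\\d\\d$"),
               names_to = c(".value", "fup"),
               names_pattern = "^(.*)(\\d\\d)$")

class(toy_long$DVHT)
```

Note that removing labels discards the information about which codes denote missing values, so such codes would need to be handled separately.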

quarto/ncds-data_discovery.qmd

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
---
22
layout: default
33
title: "Data Discovery"
4-
nav_order: 1
4+
nav_order: 2
55
parent: NCDS
66
format: docusaurus-md
77
---

quarto/ncds-merging_across_sweeps.qmd

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
---
22
layout: default
33
title: "Combining Data Across Sweeps"
4-
nav_order: 2
4+
nav_order: 3
55
parent: NCDS
66
format: docusaurus-md
77
---

quarto/ncds-reshape_long_wide.qmd

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
---
2+
layout: default
3+
title: "Reshaping Data from Long to Wide (or Wide to Long)"
4+
nav_order: 4
5+
parent: NCDS
6+
format: docusaurus-md
7+
---
8+
9+
# Introduction
10+
11+
In this section, we show how to reshape data from long to wide (and vice versa). To demonstrate, we use data from Sweeps 4 (23y) and 8 (50y) on cohort members' height and weight.
12+
13+
The packages we use are:
14+
15+
```{r}
16+
#| warning: false
17+
# Load Packages
18+
library(tidyverse) # For data manipulation
19+
library(haven) # For importing .dta files
20+
```
21+
22+
```{r}
23+
#| include: false
24+
# setwd(Sys.getenv("mcs_fld"))
25+
```
26+
27+
# Reshaping Raw Data from Wide to Long
28+
29+
We begin by loading the data from each sweep and merging these together into a single wide format data frame; see [Combining Data Across Sweeps](https://cls-data.github.io/docs/ncds-merging_across_sweeps.html) for further explanation on how this is achieved. Note, the names of the height and weight variables in Sweep 4 and Sweep 8 follow a similar convention, which is the exception rather than the rule in NCDS data. Below, we convert the variable names in the Sweep 4 data frame to upper case so that they closely match those in the Sweep 8 data frame. This will make reshaping easier.
30+
31+
32+
```{r}
33+
df_23y <- read_dta("23y/ncds4.dta",
34+
col_select = c("ncdsid", "dvwt23", "dvht23")) %>%
35+
rename_with(str_to_upper)
36+
37+
df_50y <- read_dta("50y/ncds_2008_followup.dta",
38+
col_select = c("NCDSID", "DVWT50", "DVHT50"))
39+
40+
df_wide <- df_23y %>%
41+
full_join(df_50y, by = "NCDSID")
42+
```
43+
44+
`df_wide` has 5 columns. Besides the identifier, `NCDSID`, there are 4 columns for height and weight measurements at each sweep. Each of these 4 columns is suffixed by two numbers indicating the age at assessment. We can reshape the dataset into long format (one row per person x sweep combination) using the `pivot_longer()` function so that the resulting data frame has four columns: one person identifier, a variable for age of assessment (`fup`), and variables for height and weight. We specify the columns to be reshaped using the `cols` argument, provide the new variable names in the `names_to` argument, and the pattern the existing column names take using the `names_pattern` argument. For `names_pattern` we specify `"^(.*)(\\d\\d)$"`, which breaks the column name into two pieces: the first characters (`"(.*)"`) and two digits at the end of the name (`"(\\d\\d)$"`). `names_pattern` uses regular expressions. `.` matches any single character, and `.*` modifies this to match zero or more characters. `\\d` is a special sequence denoting a digit. As noted, the final two digits hold information on age of assessment; in the reshaped data frame the two-digit string is stored as a value in a new column `fup`. `.value` is a placeholder for the new columns in the reshaped data frame that store the values from the columns selected by `cols`; these new columns are named using the first piece from `names_pattern` - in this case `DVHT` (height) and `DVWT` (weight).
45+
46+
```{r}
47+
#| warning: false
48+
df_long <- df_wide %>%
49+
pivot_longer(cols = matches("DV(HT|WT)\\d\\d"),
50+
names_to = c(".value", "fup"),
51+
names_pattern = "^(.*)(\\d\\d)$")
52+
53+
df_long
54+
```
55+
56+
# Reshaping Raw Data from Long to Wide
57+
58+
We can also reshape the data from long to wide format using the `pivot_wider()` function. In this case, we want to create two new columns for each sweep: one for height and one for weight. We specify the columns to be reshaped using the `values_from` argument, name the column whose values distinguish the sweeps (`fup`) in the `names_from` argument, and use the `names_glue` argument to specify the convention to follow for the new column names. The `names_glue` argument uses curly braces (`{}`) to reference the values of the `names_from` column (here `fup`) and the special name `.value`. As we are specifying multiple columns in `values_from`, `.value` is a placeholder for the names of the variables selected in `values_from`.
59+
60+
```{r}
61+
df_long %>%
62+
pivot_wider(names_from = fup,
63+
values_from = matches("DV(HT|WT)"),
64+
names_glue = "{.value}{fup}")
65+
```
66+
67+
Note, in the original `df_wide` tibble, `DVHT23` and `DVWT23` were labelled numeric vectors - this class allows users to add metadata to variables (value labels, etc.). `DVHT50` and `DVWT50`, on the other hand, were standard numeric vectors. When reshaping to long format, multiple variables are effectively appended together. The final reshaped variables can only have one set of properties. `pivot_longer()` merges variables together so as to preserve variable attributes, but in some cases will throw an error (where variables are of inconsistent types) or print a warning (where value labels are inconsistent). Note above, where we reshape `df_long` back to wide format, all weight and height variables now have labelled numeric type.
