select() does not keep variable.labels attribute created by foreign functions

I am not sure this is a dplyr problem per se, but thought I'd report it to see what you think and if I should take it somewhere else instead.

A dataframe that has column labels stored as a dataframe attribute `variable.labels` will lose them when using `dplyr::select()`. An example of such a dataframe is a labelled SPSS file imported with `foreign::read.spss()`. Other verbs in dplyr (as far as I have tested) do not lose the labels, which is why it is surprising and looks like a bug.

A workaround is to stick to Tidyverse packages and import the data with `haven::read_sav()`, which stores the `label` attribute per column.

But would it be possible for dplyr to cater for the first case, as the attribute comes from a base package (even though the square-bracket slicing doesn't even cater for it)? Or is it out of scope?

Here is a reproducible example:

``` r
library(foreign)
# import example SAV from foreign
sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)

# labels are saved as a variable.labels attribute for the whole dataframe
str(dat_for)
#> 'data.frame':    240 obs. of  13 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...

# dplyr::select() loses the variable.labels attribute:
library(dplyr, warn.conflicts = FALSE)
dat_for %>% select(1:2) %>% str()
#> 'data.frame':    240 obs. of  2 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...

# other verbs don't lose them, for example:
dat_for %>% mutate(test = "test") %>% str()
#> 'data.frame':    240 obs. of  14 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ test    : chr  "test" "test" "test" "test" ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
dat_for %>% filter(AGE > 50) %>% str()
#> 'data.frame':    77 obs. of  13 variables:
#>  $ CASEID  : num  151 161 178 314 344 349 429 460 467 482 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 2 5 3 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  53 54 52 53 52 52 54 53 54 54 ...
#>  $ DBP58   : num  102 93 83 91 107 83 90 99 98 87 ...
#>  $ EDUYR   : num  12 9 9 7 NA 14 8 NA 8 11 ...
#>  $ CHOL58  : num  216 265 269 292 351 292 235 273 215 283 ...
#>  $ CGT58   : num  0 0 15 0 0 10 0 30 20 22 ...
#>  $ HT58    : num  67 65.5 68.8 60.9 68 64.4 74.9 67.3 67.4 68.9 ...
#>  $ WT58    : num  172 165 140 149 160 154 254 168 171 181 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: 3 7 4 NA 6 5 6 4 4 2 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 2 2 2 1 2 1 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 1 2 2 1 2 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
dat_for %>% rename(ages = AGE) %>% str()
#> 'data.frame':    240 obs. of  13 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ ages    : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...

# but to be fair, base slicing also loses the attribute:
dat_for[,1:2] %>% str()
#> 'data.frame':    240 obs. of  2 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...

# import same file with haven
library(haven)
dat_hav <- read_sav(file = sav)

# labels are attributes of columns
str(dat_hav$CASEID)
#>  num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#>  - attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#>  - attr(*, "format.spss")= chr "F4.0"
#>  - attr(*, "display_width")= int 0

# dplyr::select() keeps them
dat_hav %>% select(1:2) %>% str()
#> tibble[,2] [240 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ CASEID  : num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#>   ..- attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#>   ..- attr(*, "format.spss")= chr "F4.0"
#>   ..- attr(*, "display_width")= int 0
#>  $ FIRSTCHD: dbl+lbl [1:240] 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 6, 2, 3, 5, 3, 3, 3, 3, ...
#>    ..@ label        : chr "FIRST CHD EVENT"
#>    ..@ format.spss  : chr "F1.0"
#>    ..@ display_width: int 0
#>    ..@ labels       : Named num [1:5] 1 2 3 5 6
#>    .. ..- attr(*, "names")= chr [1:5] "NO CHD" "SUDDEN  DEATH" "NONFATALMI" "FATAL   MI" ...
#>  - attr(*, "label")= chr "                       SPSS/PC+"
```

<sup>Created on 2021-03-30 by the [reprex package](https://reprex.tidyverse.org) (v1.0.0)</sup>

using:

* R 4.0.4
* dplyr 1.0.5
* foreign 0.8-81
* haven 2.3.1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

select() does not keep variable.labels attribute created by foreign functions #5831

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

select() does not keep variable.labels attribute created by foreign functions #5831

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions