
Using summarise with wide and long formats

  • Derf  · Tech Community  · 3 weeks ago

    I have a data frame in the following format. Status has two levels (PRE, POST).

    SI_mean TU_mean ED_mean MT_mean DT_mean SK_mean ATT_mean Status
    2.6 2.75 2.6 2.8 3.4 2.5 3.8 PRE
    3. 3. 2.4 2.4 3. 3. 4. PRE
    2.4 2.75 2.4 2.2 2.6 2.25 2.8 PRE

    I want to use wilcox.test to compare the values in each column between the two Status levels. My first attempt was:

    df |>
      summarise(across(contains("mean"),~wilcox.test(.x~Status)$p.value))
    

    But I was greeted with

    Error in `summarise()`:
    ℹ In argument: `across(contains("mean"), ~wilcox.test(.x ~ Status)$p.value)`.
    ℹ In row 1.
    Caused by error in `across()`:
    ! Can't compute column `SI_mean`.
    Caused by error in `wilcox.test.formula()`:
    ! grouping factor must have exactly 2 levels
    

    So I switched to the long format, which works as expected:

    df |> pivot_longer(contains("mean"),names_to = "Variable",values_to = "Mean") |>
      group_by(Variable) |>
      summarise(
        wilcox_p_value = wilcox.test(Mean~Status)$p.value
        )
    

    But why does summarise fail on the wide format?

    I am just interested in what I have misunderstood about the summarise function, and how I would make it work on the wide format.

    Data

    df=structure(list(SI_mean = c(2.6, 3, 2.4, 3, 3, 3.2, 2.2, 4, 3.8, 
    2.8, 3.6, 2, 3.6, 3.6, 3.8, 3.2, 3, 4, 4, 3, 3.2, 4, 4, 3.2, 
    3.2, 3, 3.2, 3.8, 4, 4, 4, 3), TU_mean = c(2.75, 3, 2.75, 3, 
    3, 2.75, 3, 3.5, 3.75, 2.5, 3.25, 2, 3.5, 4, 3, 3.25, 3, 4, 4, 
    3, 4, 4, 4, 3.25, 3.25, 3, 3, 3.25, 4, 4, 4, 3), ED_mean = c(2.6, 
    2.4, 2.4, 3, 2.8, 4, 2, 3.8, 2.6, 2, 2.8, 2, 3, 3.4, 3, 1, 3, 
    4, 3.8, 3, 3, 4, 4, 3.2, 4, 2.6, 4, 4, 3.8, 3.6, 4, 3), MT_mean = c(2.8, 
    2.4, 2.2, 3, 2.8, 3.4, 2.2, 3.6, 3.4, 3, 2.6, 1.8, 3.4, 3, 4, 
    2, 3, 3.4, 3.4, 3, 4, 4, 4, 3.2, 4, 2.8, 4, 4, 3.8, 3.6, 4, 3
    ), DT_mean = c(3.4, 3, 2.6, 3, 3, 3.8, 2.4, 3.6, 3, 3, 2.8, 2.4, 
    3.6, 3.6, 3, 2.2, 3, 4, 4, 4, 3.6, 4, 4, 3.6, 3.8, 2.8, 4, 4, 
    4, 3.8, 4, 3), SK_mean = c(2.5, 3, 2.25, 3, 3, 3.5, 2.25, 4, 
    3.25, 2.25, 2.5, 2.5, 3.75, 3.75, 4, 1, 2, 4, 3.25, 3, 3.75, 
    4, 4, 2.75, 4, 3, 4, 4, 4, 4, 4, 3), ATT_mean = c(3.8, 4, 2.8, 
    3, 3, 3.8, 3, 3.6, 3, 4, 4, 3, 3.8, 3.6, 4, 3.8, 4, 4, 4, 4, 
    4, 4, 4, 3.6, 3.8, 3, 4, 4, 4, 4, 4, 4), Status = c("PRE", "PRE", 
    "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", 
    "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "POST", "POST", "POST", 
    "POST", "POST", "POST", "POST", "POST", "POST", "POST", "POST", 
    "POST", "POST", "POST", "POST")), class = c("rowwise_df", "tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -32L), groups = structure(list(
        .rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
            10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 
            21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 
            32L), ptype = integer(0), class = c("vctrs_list_of", 
        "vctrs_vctr", "list"))), row.names = c(NA, -32L), class = c("tbl_df", 
    "tbl", "data.frame")))
    
    1 reply  |  as of 3 weeks ago
  •   Rui Barradas    3 weeks ago

    The problem is in the data.
    Your tibble is grouped by row, as can be seen in the second line of the output of

    df %>% print(n = 1L)
    #> # A tibble: 32 × 8
    #> # Rowwise: 
    

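    and is therefore processed row by row. Within each row, Status holds only that row's single value, but wilcox.test needs a grouping variable with exactly two levels, so it raises the error.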
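
    The row-by-row behaviour can be seen on a small example. This is a minimal sketch on hypothetical toy data (the tibble `toy` and the column `x` are not from the question): it counts how many rows and how many distinct Status values summarise() sees per group, with and without rowwise grouping.

```r
library(dplyr)

# Hypothetical toy data, not the asker's df
toy <- tibble(
  x = c(1, 2, 3, 4),
  Status = c("PRE", "PRE", "POST", "POST")
)

# Ungrouped: summarise() sees all four rows in a single call
ungrouped <- toy |>
  summarise(n_rows = n(), n_levels = n_distinct(Status))

# Rowwise: every row is its own group, so each call sees one row
# and therefore only one Status value
by_row <- toy |>
  rowwise() |>
  summarise(n_rows = n(), n_levels = n_distinct(Status))

print(ungrouped)
print(by_row)
```

    In the rowwise case every group contains a single Status value, which is the situation wilcox.test rejects in the question.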

    The solution is to ungroup the data first and then run the tests.

    suppressPackageStartupMessages(
      library(dplyr)
    )
    
    df %>% print(n = 1L)
    #> # A tibble: 32 × 8
    #> # Rowwise: 
    #>   SI_mean TU_mean ED_mean MT_mean DT_mean SK_mean ATT_mean Status
    #>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl> <chr> 
    #> 1     2.6    2.75     2.6     2.8     3.4     2.5      3.8 PRE   
    #> # ℹ 31 more rows
    
    attributes(df)
    #> $names
    #> [1] "SI_mean"  "TU_mean"  "ED_mean"  "MT_mean"  "DT_mean"  "SK_mean"  "ATT_mean"
    #> [8] "Status"  
    #> 
    #> $class
    #> [1] "rowwise_df" "tbl_df"     "tbl"        "data.frame"
    #> 
    #> $row.names
    #>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
    #> [26] 26 27 28 29 30 31 32
    #> 
    #> $groups
    #> # A tibble: 32 × 1
    #>          .rows
    #>    <list<int>>
    #>  1         [1]
    #>  2         [1]
    #>  3         [1]
    #>  4         [1]
    #>  5         [1]
    #>  6         [1]
    #>  7         [1]
    #>  8         [1]
    #>  9         [1]
    #> 10         [1]
    #> # ℹ 22 more rows
    
    df %>%
      ungroup() %>%
      summarise(across(contains("mean"), ~wilcox.test(.x ~ Status)$p.value))
    #> Warning: There were 7 warnings in `summarise()`.
    #> The first warning was:
    #> ℹ In argument: `across(contains("mean"), ~wilcox.test(.x ~ Status)$p.value)`.
    #> Caused by warning in `wilcox.test.default()`:
    #> ! cannot compute exact p-value with ties
    #> ℹ Run `dplyr::last_dplyr_warnings()` to see the 6 remaining warnings.
    #> # A tibble: 1 × 7
    #>   SI_mean TU_mean  ED_mean MT_mean  DT_mean SK_mean ATT_mean
    #>     <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
    #> 1  0.0147 0.00512 0.000527 0.00114 0.000114 0.00252  0.00613
    

    Created on 2023-11-09 with reprex v2.0.2
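
    The warnings in the output above arise because the columns contain tied values, so wilcox.test cannot compute an exact p-value there. A minimal sketch, assuming a normal approximation is acceptable for the analysis, is to request it explicitly with exact = FALSE, which avoids the ties warning. The toy data and the column names A_mean and B_mean are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with tied values, not the asker's df
toy <- tibble(
  A_mean = c(2, 2, 3, 3, 4, 4, 4, 5),
  B_mean = c(1, 2, 2, 3, 3, 3, 4, 4),
  Status = rep(c("PRE", "POST"), each = 4)
)

# exact = FALSE asks wilcox.test for the normal approximation up front,
# so no exact p-value is attempted on the tied data
p_values <- toy |>
  summarise(
    across(contains("mean"),
           ~ wilcox.test(.x ~ Status, exact = FALSE)$p.value)
  )

print(p_values)
```

    Whether the approximation is appropriate depends on the sample size and the number of ties, so this is a convenience rather than a recommendation.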