
Extract strings from a data frame column that match a pattern in a vector of strings

  •  2
  • Robert Frey  · Tech Community  · 2 months ago

    I have a dataset with the columns below. One of them holds a quote that mentions a state name. Here is an example:

    library(tidyverse)
    df <- tibble(num = c(11,12,13), quote = c("In Ohio, there are plenty of hobos","Georgia, where the peaches are peachy","Oregon, no, we did not die of dysentery"))
    

    I would like to create a column that extracts the state mentioned in each quote.

    Here is what I tried:

    states <- state.name
    df <- df %>% mutate(state = na.omit(as.vector(str_match(quote,states)))[[1]])
    

    It fails with this error:

    Error in `mutate()`:
    ℹ In argument: `state = na.omit(as.vector(str_match(quote, states)))[[1]]`.
    Caused by error in `str_match()`:
    ! Can't recycle `string` (size 3) to match `pattern` (size 50).
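
For context, the error message can be reproduced outside of `mutate()`. The sketch below assumes stringr 1.5.0 or later, which applies the tidyverse recycling rules: `string` and `pattern` must have the same length, or one of them must have length 1. The shortened `quotes` vector and the `result` variable are illustrative only.

```r
library(stringr)

quotes <- c("In Ohio", "Georgia, where", "Oregon, no")

# Length-1 pattern: recycled across all three strings.
str_detect(quotes, "Ohio")
# TRUE FALSE FALSE

# Same-length pattern: compared pairwise, element by element.
str_detect(quotes, c("Ohio", "Georgia", "Texas"))
# TRUE TRUE FALSE

# Length 3 against length 50: neither side has length 1, so recycling fails.
# tryCatch() is used here only so the script keeps running; older stringr
# versions are assumed to warn instead of raising an error.
result <- tryCatch(
  str_detect(quotes, state.name),
  error = function(e) "recycling error",
  warning = function(w) "recycling warning"
)
result
```

This is why passing all 50 state names as separate patterns does not work with a 3-row column.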
    
    1 answer
  •  1
  •   Ronak Shah    2 months ago

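    You need to collapse the state names into a single pattern string, with the names separated by |, and then use str_extract to pull the matching name out of each quote.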

    library(dplyr)
    library(stringr)
    
    df %>% 
      mutate(state = str_extract(quote,str_c(state.name, collapse = "|")))
    
    #    num quote                                   state  
    #  <dbl> <chr>                                   <chr>  
    #1    11 In Ohio, there are plenty of hobos      Ohio   
    #2    12 Georgia, where the peaches are peachy   Georgia
    #3    13 Oregon, no, we did not die of dysentery Oregon 
    

    Here, str_c produces the following pattern string:

    str_c(state.name, collapse = "|")
    [1] "Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|Wyoming"
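
A few edge cases may be worth checking before relying on the collapsed pattern. The sketch below is not part of the original answer: the `\\b` word boundaries, the `bounded` variable, and the extra test sentences are assumptions added for illustration.

```r
library(stringr)

quotes <- c(
  "In Ohio, there are plenty of hobos",
  "Georgia, where the peaches are peachy",
  "Oregon, no, we did not die of dysentery"
)

# One pattern of length 1, so it is recycled against every quote.
pattern <- str_c(state.name, collapse = "|")

# Optional: wrap the alternation in word boundaries so that a state name
# is not matched inside a longer word.
bounded <- str_c("\\b(?:", pattern, ")\\b")

str_extract(quotes, bounded)

# A quote that mentions no state gives NA rather than an error.
str_extract("No state mentioned here", bounded)

# str_extract() keeps only the first match; str_extract_all() returns
# every match as a list with one character vector per input string.
str_extract_all("From Texas to New Mexico", bounded)
```

If a quote can name more than one state, `str_extract_all()` would produce a list column inside `mutate()`, which may need unnesting afterwards.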