Skip to content
This repository was archived by the owner on Aug 7, 2023. It is now read-only.

scroll_search: problem in bind_rows() #58

@ErikEngerd

Description

@ErikEngerd

We have an elasticsearch database that contains documents with varying fields. I believe this sometimes can result in bind_rows() failing with an error like 'bind_rows column can't be converted from list to character'. I believe this is caused by some queries returning different fields and then bind_rows() fails because the dataframes do not have the same columns.

Example:
......Error: Column threadcan't be converted from list to character

In my code I worked around it by retrieving the data in small chunks and then concatening the dataframes using plyr::rbind.fill() instead, but then I have to choose the retrieval time interval. Here choosing a lower interval also doesn't always work so it is not 100% reliable.

Below is some code showing what I am doing (it is not a complete example):

getElasticDataInTimeInterval = function(hostname, t0, t1, fields, 
                                        url = "http://localhost:9200",
                                        deltatminutes = 5) { 
  if (!is.POSIXct(t0)) { 
    t0 = getTime(t0)
  }
  if (!is.POSIXct(t1)) { 
    t1 = getTime(t1)
  }
  # workaround for problem in elasticsearchr
  # Because of paging, it can happen that in different pages
  # the fields differ. But then bind_rows() will fail since the columns
  # do not match. Therefore, we query smaller time intervals to avoid paging
  # and use plyr::rbind.fill() instead to avoid this issue. 
  timeintervals = seq(t0, t1, deltatminutes*60)
  res = list()
  for (i in 1:(length(timeintervals)-1)) { 
    t0 = timeintervals[[i]]
    t1 = timeintervals[[i+1]]
    
    dt0 = date(with_tz(t0, "UTC"))
    dt1 = date(with_tz(t1, "UTC"))
    dates = seq(dt0, dt1, 1)
    
    
    ressingle = lapply(dates, function(date) {
      cat("Fetching data for [", paste(t0), ",", paste(t1), "> \n")
      results = elastic(url, getElasticIndex(date), "doc") %search% (
        filterHostAndTime(hostname, t0, t1) + 
          do.call(selectFields, as.list(fields))
      )
      cat('ok\n')
      results
    })
    res = c(res, ressingle)
  }

  res %>% 
    #bind_rows() %>%
    plyr::rbind.fill() %>%
    rename(log_timestamp = '@timestamp') %>%
    mutate(log_timestamp  = with_tz(parse_datetime(log_timestamp), tzone = "Europe/Amsterdam"), 
           time.start = as.POSIXct(strptime(str_replace(time.start, ",", "."), "%Y-%m-%d %H:%M:%OS")),
           time.end   = as.POSIXct(strptime(str_replace(time.end, ",", "."),  "%Y-%m-%d %H:%M:%OS"))) %>%
    arrange(timestamp)
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions