Describe the bug
I would like to create a word cloud that shows the most frequently used words by group. Unfortunately, the maximum number of words is not split evenly between the groups, as stated in the docs: "The maximum frequency will be split evenly across categories when comparison = TRUE." I couldn't quite figure out what the split is based on, but it seems to depend on the relative importance of the words within each group.
Reproducible code
Dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L4OAKN
library(quanteda)
library(dplyr)  # provides %>%

Bundestagganz <- readRDS("Corp_Bundestag_V2.rds")
datenneu1 <- subset(Bundestagganz, date < "2018-01-01")

# First call: max.words = 10
datenneu1 %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  dfm(verbose = FALSE) %>%
  dfm_group(groups = party) %>%
  quanteda.textplots::textplot_wordcloud(comparison = TRUE, max.words = 10, title.size = 1)

# Second call: max_words = 1000 with smaller font sizes
datenneu1 %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  dfm(verbose = FALSE) %>%
  dfm_group(groups = party) %>%
  quanteda.textplots::textplot_wordcloud(comparison = TRUE, max_words = 1000, min_size = 0.1, max_size = 1)
Expected behavior
My expectation is that max.words describes either the maximum number of words per group, or the maximum number of words in total, split evenly between the groups.
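To make the second reading concrete, here is a minimal base-R sketch of what an even split of a total word budget could look like. It does not use quanteda and is not how textplot_wordcloud() is implemented. The matrix `counts`, its values, and the names `max_words_total`, `per_group` and `top_by_group` are made up for illustration.

```r
# Toy counts: rows are groups (e.g. parties), columns are words.
# All numbers are invented for illustration only.
counts <- matrix(
  c(9, 1, 4, 0, 2, 7,
    1, 8, 0, 6, 3, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("A", "B"),
    c("steuer", "klima", "rente", "europa", "arbeit", "bildung")
  )
)

max_words_total <- 4                             # hypothetical total budget
per_group <- max_words_total %/% nrow(counts)    # even split: 2 words per group

# For each group, keep its `per_group` most frequent words.
top_by_group <- lapply(rownames(counts), function(g) {
  names(sort(counts[g, ], decreasing = TRUE))[seq_len(per_group)]
})
names(top_by_group) <- rownames(counts)

print(top_by_group)
```

Under this reading every group contributes the same number of words, regardless of how its frequencies compare with those of the other groups.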
System information
sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8 LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C LC_TIME=German_Germany.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] word2vec_0.3.4 udpipe_0.8.9 wordcloud_2.6 RColorBrewer_1.1-3 reshape2_1.4.4 quanteda.textplots_0.94.1 forcats_0.5.1 stringr_1.4.0
[9] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6 tidyverse_1.3.2 tidytext_0.3.4 dplyr_1.0.10
[17] ROCR_1.0-11 quanteda_3.2.1
loaded via a namespace (and not attached):
[1] httr_1.4.3 jsonlite_1.8.0 modelr_0.1.8 RcppParallel_5.1.5 assertthat_0.2.1 googlesheets4_1.0.0 cellranger_1.1.0 yaml_2.3.5 pillar_1.7.0 backports_1.4.1 lattice_0.20-45
[12] glue_1.6.2 digest_0.6.29 rvest_1.0.2 colorspace_2.0-3 htmltools_0.5.2 Matrix_1.5-1 plyr_1.8.7 pkgconfig_2.0.3 broom_0.8.0 haven_2.5.0 scales_1.2.0
[23] tzdb_0.3.0 googledrive_2.0.0 generics_0.1.2 ellipsis_0.3.2 withr_2.5.0 cli_3.3.0 magrittr_2.0.3 crayon_1.5.1 readxl_1.4.0 evaluate_0.15 stopwords_2.3
[34] tokenizers_0.2.1 janeaustenr_0.1.5 fs_1.5.2 fansi_1.0.3 SnowballC_0.7.0 xml2_1.3.3 data.table_1.14.2 tools_4.2.2 hms_1.1.1 gargle_1.2.0 lifecycle_1.0.1
[45] munsell_0.5.0 reprex_2.0.1 compiler_4.2.2 rlang_1.0.2 grid_4.2.2 rstudioapi_0.13 rmarkdown_2.13 gtable_0.3.0 DBI_1.1.2 R6_2.5.1 lubridate_1.8.0
[56] knitr_1.38 fastmap_1.1.0 utf8_1.2.2 fastmatch_1.1-3 stringi_1.7.6 Rcpp_1.0.8.3 vctrs_0.4.1 dbplyr_2.1.1 tidyselect_1.1.2 xfun_0.30

