
Conversation

@schochastics (Contributor) commented Apr 5, 2025

Fix #22

@szhorvat (Member) commented Apr 5, 2025

Is it documented how networks are encoded into CSV on Netzschleuder, and if not, can we write a short description and include it in the comments? I expect this won't be as simple to get working with the majority of datasets as it might seem.

@szhorvat (Member) commented Apr 5, 2025

Here's a stress-test ;-) Not too large, but plenty of attributes. https://networks.skewed.de/net/blumenau_drug

@schochastics (Contributor Author)

> Is it documented how networks are encoded into CSV on Netzschleuder, and if not, can we write a short description and include it in the comments? I expect this won't be as simple to get working with the majority of datasets as it might seem.

My plan is to write a script that runs through all networks once to see if there are any differences. My sampling so far showed that it looks pretty consistent:

  • edges.csv: source, target, attributes
  • nodes.csv: name, attributes
  • gprops.csv: same set of metadata
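For illustration, here is a minimal sketch of reading that layout in R (the helper name read_ns_csv_zip is made up, the URL pattern follows the files endpoint discussed later in this thread, and treating gprops.csv as a plain CSV is an assumption):

# Minimal sketch, assuming the consistent layout sampled above.
read_ns_csv_zip <- function(name) {
  url <- sprintf("https://networks.skewed.de/net/%s/files/%s.csv.zip", name, name)
  temp <- tempfile(fileext = ".zip")
  on.exit(unlink(temp))
  utils::download.file(url, temp, mode = "wb")
  list(
    edges  = utils::read.csv(unz(temp, "edges.csv")),
    nodes  = utils::read.csv(unz(temp, "nodes.csv")),
    gprops = utils::read.csv(unz(temp, "gprops.csv"))
  )
}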

@szhorvat (Member) commented Apr 6, 2025

> Here's a stress-test ;-) Not too large, but plenty of attributes. https://networks.skewed.de/net/blumenau_drug

This one has a logical vertex attribute that is not currently imported as such. I'm not sure how easy it is to detect that this is logical while maintaining good performance.

@schochastics (Contributor Author)

> Here's a stress-test ;-) Not too large, but plenty of attributes. https://networks.skewed.de/net/blumenau_drug
>
> This one has a logical vertex attribute that is not currently imported as such. I'm not sure how easy it is to detect that this is logical while maintaining good performance.

Good catch. I would add minty as a dependency for this.
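A minimal sketch of the idea, assuming minty::type_convert() re-guesses column types the way readr::type_convert() does:

# Sketch: re-guess column types after read.csv(), so that a column holding
# only "True"/"False" strings is imported as logical rather than character.
nodes_raw <- utils::read.csv(text = "name,hormone\nDB01174,True\nDB00741,False")
nodes <- minty::type_convert(nodes_raw)
str(nodes$hormone)  # logi [1:2] TRUE FALSE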

@maelle (Contributor) left a comment:

🎉 I find that the function name is a mouthful 😸

@schochastics (Contributor Author)

@maelle how do I properly document the dots argument? R CMD check complains about them.
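For context, the usual roxygen2 pattern would be something like this (the description wording is a placeholder):

#' @param ... Additional arguments passed on to the underlying reader
#'   (placeholder description; adjust to whatever actually consumes the dots).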

@schochastics (Contributor Author)

The function now has custom rate limiting: 20 calls per minute. Is that reasonable, @szhorvat?
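For comparison, httr2's built-in throttle would express the same limit roughly like this (a sketch; the realm string is illustrative, and the PR's actual limiter is a custom, persistent one):

# Sketch: at most 20 requests per minute, shared across calls in this realm.
# Recent httr2 versions express the limit via capacity/fill_time_s instead of rate.
req <- httr2::request("https://networks.skewed.de/api/nets")
req <- httr2::req_throttle(req, rate = 20 / 60, realm = "networks.skewed.de")
resp <- httr2::req_perform(req)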

@schochastics (Contributor Author)

> 🎉 I find that the function name is a mouthful 😸

Indeed. Do you have a suggestion? I don't know how to make it shorter. graph_schleuder?

@szhorvat (Member) commented Apr 10, 2025

Some comments from my side:

Name: I think the name is good. It should be descriptive since it will be a relatively rarely seen function. Auto-complete makes this a non-issue.

Or do you mean that Netzschleuder is just hard to spell for people with no exposure to German?

directed parameter: I suppose you added this because the CSV format does not encode directedness? This must not be left to the user. The API allows for retrieving directedness, see e.g. https://networks.skewed.de/api/net/advogato. The value from there can be used.
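Roughly like this (a sketch; that the directedness flag lives under analyses, and its exact name, are assumptions about the API's JSON):

# Sketch: look up directedness from the API instead of a user-supplied flag.
# The field name "is_directed" under $analyses is an assumption.
resp <- httr2::req_perform(httr2::request("https://networks.skewed.de/api/net/advogato"))
meta <- httr2::resp_body_json(resp)
directed <- isTRUE(meta$analyses$is_directed)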

I get a warning with some networks, e.g. graph_from_netzschleuder('packet_delays'):

Warning message:
`type_convert()` only converts columns of type 'character'.
- `df` has no columns of type 'character' 

Multi-network datasets:

These do not work currently; this is expected, right?

When they do start working, what will happen if only the collection name is given, not a subnetwork?

Can we support the slash syntax, like here? graph_from_netzschleuder('fresh_webs/AkatoreA')

Is it very un-R-like to also support the tuple syntax, graph_from_netzschleuder(c('fresh_webs', 'AkatoreA'))? This is friendly for programmatic access. It would be idiomatic both in Python and Mathematica.

With these approaches we could get rid of the network parameter completely and have a single descriptor for any given network. I don't like the network parameter because some data can be downloaded by specifying a single parameter, some by specifying two. It's better to have a single identifier that will always work, for any network.

@schochastics (Contributor Author)

Thanks for the feedback!

> directed parameter

Yes, that was due to the CSV format and me not wanting to add a dependency when this was still planned as an igraph function. I will implement this via the API, same as for bipartite.

> warning with some networks

Thanks, that's fixable.

> Multi-network datasets

Do you mean that it should be possible to download a whole collection at once, like a list of all fresh_webs networks? If we allow full collection downloads, we have to block some, like openstreetmaps. That's just too huge. Maybe this should go in a separate function?

You are right that currently we need collection name + network name to get one network from a collection. For consistency with gt, I think I will go with the slash notation.

> Is it very un-R-like to also support the tuple syntax

Yes. When I saw this, I immediately thought you wanted to fetch two networks, not fresh_webs/AkatoreA.

@szhorvat (Member)

> Do you mean that it should be possible to download a whole collection at once?

No, I didn't mean to imply that, I was just asking what your plans were. I vote for not allowing a full set download. But we still need to choose some response to providing only a set name instead of a single network.

This could be a simple not found, but perhaps it's nicer to tell the user to choose a specific network? Maybe the error message could even link to the relevant networks.skewed.de dataset page, to aid the user in selecting one?

@schochastics (Contributor Author)

OK, yes, I agree that full collection downloads shouldn't be possible. I will think of a logic for this.

@szhorvat (Member) commented Apr 10, 2025

Some more brainstorming: Now we have read_from_netzschleuder, which gives a dataframe, and graph_from_netzschleuder, which gives a network.

"read" to me does not say "dataframe", but perhaps this is because of my inexperience with R. There are "read" functions that don't produce dataframes, such as read_graph in igraph.

If we could find a more descriptive name that suggests "this downloads network data", then we can shorten netzschleuder to ns (for people who are scared of consonant clusters 😆 ), as it will no longer be how people recognize the function. There will also be a possibility for future additions of other data sources. If R were used by mathematicians, I'd say house of graphs :-)

The output format doesn't necessarily need to be split between two functions. It can also be controlled by a parameter, in which case we could have a single function. For example, network_from_ns('foo', output='graph') could get the graph, output='data.frame' would get a dataframe and output='metadata' would get the metadata returned by the API (no graph).

network_from is not a pretty name and I'm not totally happy with it, but it has the advantages:

  • it's not confused with a typical graph_from igraph function; this is an igraphdata function, which behaves differently, and can output multiple things in multiple formats
  • it's descriptive enough that netzschleuder could be shortened to ns without confusion

Other prefix ideas instead of network_from:

  • read_graph_from
  • get_graph_from
  • get_network
  • net_from
  • ...

@schochastics (Contributor Author)

While you brainstormed, I implemented what we discussed before. Now we have:

pkgload::load_all("~/git/R_packages/igraphdata/")
#> ℹ Loading igraphdata
graph_from_netzschleuder("blumenau_drug")
#> IGRAPH 2849f38 UNW- 75 181 -- 
#> + attr: name (v/c), p_node (v/n), hormone (v/l), len_i (v/n), color
#> | (v/c), label (v/c), class (v/c), x (v/n), y (v/n), tau (e/n),
#> | severity (e/c), weight (e/n), color (e/c), gender (e/c),
#> | patients_norm (e/n), RRI.F (e/n), patients (e/n), tau_norm (e/n),
#> | RRI.M (e/n)
#> + edges from 2849f38 (vertex names):
#>  [1] DB01174--DB00741 DB01174--DB00860 DB01174--DB00635 DB01174--DB00571
#>  [5] DB01174--DB00367 DB01174--DB01223 DB00741--DB00252 DB00252--DB00537
#>  [9] DB00252--DB01223 DB00252--DB00695 DB00252--DB00359 DB00252--DB00338
#> [13] DB00252--DB00196 DB00252--DB01234 DB00252--DB00860 DB00252--DB00829
#> + ... omitted several edges
graph_from_netzschleuder("fresh_webs/AkatoreA")
#> IGRAPH 58361eb DNW- 85 227 -- 
#> + attr: name (v/c), x (v/n), y (v/n), weight (e/n)
#> + edges from 58361eb (vertex names):
#>  [1] 1 ->45 1 ->46 1 ->47 1 ->48 1 ->49 1 ->51 1 ->52 1 ->53 1 ->54 1 ->55
#> [11] 1 ->56 1 ->57 1 ->58 1 ->63 1 ->64 1 ->66 1 ->67 1 ->68 1 ->69 1 ->70
#> [21] 1 ->71 1 ->72 1 ->73 1 ->74 1 ->75 1 ->76 1 ->78 1 ->81 1 ->83 1 ->84
#> [31] 1 ->85 2 ->83 3 ->46 3 ->47 3 ->48 3 ->50 3 ->53 3 ->54 3 ->55 3 ->56
#> [41] 3 ->58 3 ->59 3 ->60 3 ->61 3 ->62 3 ->65 3 ->67 3 ->68 3 ->70 3 ->71
#> [51] 3 ->72 3 ->74 3 ->75 3 ->78 3 ->82 3 ->83 4 ->45 4 ->46 4 ->66 4 ->80
#> [61] 5 ->83 6 ->46 6 ->48 6 ->50 6 ->53 6 ->55 6 ->61 6 ->65 6 ->67 6 ->69
#> [71] 6 ->75 6 ->83 7 ->55 7 ->61 7 ->83 8 ->55 8 ->83 9 ->83 10->71 10->72
#> + ... omitted several edges
graph_from_netzschleuder("fresh_webs")
#> Error in `read_from_netzschleuder()` at igraphdata/R/netzschleuder.R:139:3:
#> ! fresh_webs is a collection and downloading a whole collection is not
#>   permitted.
#> ℹ see <https://networks.skewed.de/net/fresh_webs>
graph_from_netzschleuder("fresh_webs/nope")
#> Error in `read_from_netzschleuder()` at igraphdata/R/netzschleuder.R:139:3:
#> ! nope is not part of the collection fresh_webs.
#> ℹ see <https://networks.skewed.de/net/fresh_webs>

Created on 2025-04-10 with reprex v2.1.1

@schochastics (Contributor Author)

This is maybe personal preference, but I do not like it when the type of a function's output changes depending on a parameter, as network_from_ns('foo', output='graph') would do.

I am not good at naming things, so I am not sure what the best option would be.

@krlmlr (Contributor) left a comment:

Thanks, a very useful function. Some general remarks:

  • Do we need to close the connections opened with unz() to avoid warnings? An easy way is withr::local_connection(unz(...)) if we don't mind the dependency.
  • I prefer [[ over [ for accessing single elements of a vector, and vapply() over sapply() (strictness).
  • Should we use the longer netzschleuder_ prefix instead of ns_ ?
  • I wonder if the df and graph functions should accept a metadata object. We'd add a class and provide a print() method for extra sugar.

Comment on lines 78 to 101
if (net_ident[1] == net_ident[2] && length(unlist(raw$nets)) > 1) {
  cli::cli_abort(
    c(
      "{net_ident[1]} is a collection and downloading a whole collection is not permitted.",
      "i" = "see {.url {collection_url}}"
    )
  )
} else if (net_ident[1] == net_ident[2]) {
  return(raw)
} else {
  idx <- which(unlist(raw[["nets"]]) == net_ident[2])
  if (length(idx) == 0) {
    cli::cli_abort(
      c(
        "{net_ident[2]} is not part of the collection {net_ident[1]}.",
        "i" = "see {.url {collection_url}}"
      ),
      call = call
    )
  }
  raw[["analyses"]] <- raw[["analyses"]][[net_ident[2]]]
  raw[["nets"]] <- raw[["nets"]][idx]
  raw
}
Contributor:

Do you have examples for each of the code paths?

Contributor Author:

Yes, all three are needed. The error handling should tell users what they are supposed to do to get the correct network.

if (grepl("/", x)) {
return(strsplit(x, "/", fixed = TRUE)[[1]])
} else {
c(x, x)
Contributor:

This looks interesting to me. I wonder if we can make the intent here clearer.

Contributor Author:

This mechanism is needed for the API. If a collection has only one network, the URL is https://networks.skewed.de/net/7th_graders/files/7th_graders.csv.zip, so the name is duplicated. For a network in a collection, the zip file has a different name.

Contributor:

Got it. Should we then return a named list instead, to clarify the intent?

Contributor Author:

Sounds good.


download_file <- function(zip_url, token = NULL, file) {
  resp <- make_request(zip_url, token)
  writeBin(httr2::resp_body_raw(resp), file)
}
Contributor:

Are the files always small enough to fit into memory?

Contributor Author:

There are some networks that go up to 10 GB. Should we implement something that stops users from downloading files over a certain size?

Member:

> Should we implement something that stops users from downloading files over a certain size?

IMO it's not a bad idea. Many of our users are inexperienced, and the risk of accidentally trying to download something large is too high. But is it possible to determine the size in advance? Or would you use the edge and vertex count as a proxy for the size?

Do I understand correctly that httr2 automatically caches requests, preventing the re-downloading of large data?

If you do this, I'd suggest having an option to override and download anyway.
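Sketched out with the counts from the API metadata as the proxy (the field name and the threshold are illustrative, not part of this PR):

# Sketch: refuse very large downloads unless the user opts in.
# The $analyses$num_edges field name and the default threshold are illustrative.
check_download_size <- function(meta, max_edges = 1e6, force = FALSE) {
  n_edges <- meta$analyses$num_edges
  if (!force && !is.null(n_edges) && n_edges > max_edges) {
    stop("Network has ", n_edges, " edges; pass force = TRUE to download anyway.")
  }
  invisible(TRUE)
}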

Member:

Something to consider is whether R has limits on the number of elements in data frames, or the number of rows it can read from CSVs. R still has this silly limitation of 32-bit integers, and while it can index with doubles now (weird!), not every object has support for such indexing; e.g., matrices don't.

Contributor:

This is about raw data loaded into memory, which is what this code looks like it is doing. Can we let curl handle the download in this case, writing directly to disk in chunks? There might be a way to show a progress bar too, and the user can cancel if it's too large.

Contributor Author:

I will dig into httr2 a bit on this. I would like to keep using it here too, for the rate limiting; I am sure there is something that can help us. Otherwise my idea was to create an option that specifies a hard upper limit on the number of edges/nodes that are permitted.
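One candidate: httr2::req_perform() can stream the response body straight to disk via its path argument, which would avoid buffering it in memory (a sketch; req_progress() needs a recent httr2, and token handling is omitted):

# Sketch: write the body to `file` in chunks instead of holding it in RAM;
# the progress bar also lets the user interrupt an unexpectedly large download.
download_file <- function(zip_url, token = NULL, file) {
  req <- httr2::request(zip_url)
  req <- httr2::req_progress(req)
  httr2::req_perform(req, path = file)
}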

Contributor:

Is the rate limiting important for downloads, or just for API calls?

We can probably get the body in chunks and write piecemeal if really necessary.

Contributor Author:

The rate limiting is most important for files, because those are the heaviest load on the server.

@schochastics (Contributor Author) commented Apr 17, 2025

> • withr::local_connection(unz(...))

Shouldn't on.exit(unlink(temp)) take care of that?

> Should we use the longer netzschleuder_ prefix instead of ns_?

We discussed further up that ns is preferred.

> I wonder if the df and graph functions should accept a metadata object. We'd add a class and provide a print() method for extra sugar.

Good idea. I will explore that.

@schochastics (Contributor Author)

Not sure the switch to an S3 method for ns_graph is good. The code gets a bit convoluted, since all error handling is done by ns_meta to minimize API calls. ns_meta is also never called directly in ns_graph, only indirectly via ns_df. So in my opinion the current way is the cleanest, although I get that an S3 method and a "meta print" function would be nice.

@szhorvat (Member)

I worked a bit more on trying to break things, and here's what I found:

Note that ns_graph('ambassador/KINSHIP') is correct syntax and downloads the network.

Let's misspell the subnetwork:

> ns_graph('ambassador/KINSIP')
Error in `rlang::abort()`:
! `call` must be a call or environment, not a primitive
  function.
Run `rlang::last_trace()` to see where the error occurred.

This error could be nicer.

Let's write nothing after /:

> ns_graph('ambassador/')
Error in if (net_ident[1] == net_ident[2] && length(unlist(raw$nets)) >  : 
  missing value where TRUE/FALSE needed

Let's add one more slash and an extra name: ns_graph('ambassador/KINSHIP/foo'). This behaves the same as ns_graph('ambassador/KINSHIP'). It should instead error.

Let's take a dataset that has only a single network, so it should not use the / syntax:

> ns_graph('packet_delays/foo')
Error in `rlang::abort()`:
! `call` must be a call or environment, not a primitive
  function.
Run `rlang::last_trace()` to see where the error occurred.

@szhorvat (Member) commented Apr 17, 2025

Should ns_metadata() work directly on collections as well as on individual networks? For example, both ns_metadata('ambassador') and ns_metadata('ambassador/KINSHIP') could work. It seems to me that this might be useful.

@schochastics (Contributor Author)

Thanks for the thorough testing. Now try to break it again ;-).

I think you are right about ns_metadata. I will implement that too.

@szhorvat (Member)

> Now try to break it again ;-).

> ns_graph(c('a','b'))
Error in if (grepl("/", x)) { : the condition has length > 1

This sometimes happens when passing the wrong type, as that type is converted to character, which tends to result in a vector of more than one element.

"foo/" is interpreted as "foo", but I'm not sure I would call this a real problem.

I'm out of ideas on how to break it.


Out of curiosity, are downloaded networks cached? It's my impression that httr2 does not do this automatically?

@schochastics (Contributor Author)

I was thinking about caching. If I find a way that is not too convoluted, then I will implement it. httr2 has a cache function, but I am not sure how persistent it is. Creating a persistent rate limiter was already quite convoluted.

@szhorvat (Member)

This suggests that req_perform(), which you used, already does some caching. Perhaps that's sufficient.

https://httr2.r-lib.org/reference/req_cache.html
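For reference, opting in would look roughly like this (a sketch; the cache directory is an illustrative choice):

# Sketch: opt into httr2's on-disk cache so repeated downloads can be reused.
# Whether a response is actually cached depends on the server's caching headers.
req <- httr2::request("https://networks.skewed.de/api/net/advogato")
req <- httr2::req_cache(req, path = tools::R_user_dir("igraphdata", "cache"))
resp <- httr2::req_perform(req)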

@krlmlr (Contributor) left a comment:

Thanks! A wide range from questions to nitpicks to more serious concerns.

DESCRIPTION Outdated
Comment on lines 28 to 29
cli,
httr2,
Contributor:

All of those must be in "Suggests".

# remove trailing /
x <- sub("/$", "", x)
# remove double slash
x <- sub("//", "/", x)
Contributor:

Why is this important?

Contributor Author:

Because @szhorvat tried to break the name parameter; these lines just clean up some common user input errors that could occur.

if (grepl("/", x)) {
return(strsplit(x, "/", fixed = TRUE)[[1]])
} else {
c(x, x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Should we then return a named list instead, to clarify the intent?


download_file <- function(zip_url, token = NULL, file) {
resp <- make_request(zip_url, token)
writeBin(httr2::resp_body_raw(resp), file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about raw data loaded into memory, which is what it looks like this code is doing. Can we let curl handle the downloading in this case, to write directly to disk in chunks? There might be a way to show a progress bar too, and the user can cancel if it's too large.

nodes_df_raw <- utils::read.csv(unz(temp, node_file_name))
# suppress warning if no character columns are found
nodes_df <- suppressWarnings(minty::type_convert(nodes_df_raw))
names(nodes_df)[1] <- "id"
Contributor:

I'm leaning towards implementing lintr just for this, here and elsewhere.

Suggested change:
- names(nodes_df)[1] <- "id"
+ names(nodes_df)[[1]] <- "id"

#' @rdname netzschleuder
#' @export
ns_graph <- function(name, token = NULL) {
  graph_data <- ns_df(name, token = token)
Contributor:

My comment was not about S3; we can implement poor man's dispatch here (a hand-rolled check for whether it's a character or a classed object). Users may want to inspect or do other things with the meta object before creating a graph or a data frame from it. We could also decouple downloading from data frame creation, but that's a bit of a can of worms.
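Something like this (a sketch; the "ns_meta" class name is hypothetical):

# Sketch of poor man's dispatch: accept a name string or a metadata object.
ns_graph <- function(name, token = NULL) {
  if (inherits(name, "ns_meta")) {
    meta <- name
  } else if (is.character(name) && length(name) == 1) {
    meta <- ns_metadata(name, token = token)
  } else {
    cli::cli_abort("{.arg name} must be a single string or an ns_meta object.")
  }
  # ... build the igraph object from `meta` ...
}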

@schochastics (Contributor Author)

I think I have caught up with all suggestions now. The big ones:

  • a parameter that prevents downloading files that are too big
  • metadata can be used as input to ns_graph() and has its own print() function

@krlmlr changed the title from "feat: add support for netzschleuder" to "feat: New ns_graph() and lower-level ns_df() and ns_metadata() to download from netzschleuder (https://networks.skewed.de)" on Apr 28, 2025
@krlmlr merged commit d80f4b6 into main on Apr 28, 2025
6 checks passed
@krlmlr (Contributor) commented Apr 28, 2025

Thanks! Opening a new issue to address remaining comments.
