Closes #221: Feature request - Provide CSV files#228

Open
Gero1999 wants to merge 13 commits into main from 221-feature-request-provide-csv-files

Conversation

@Gero1999
Collaborator

@Gero1999 Gero1999 commented Jan 31, 2026

Thank you for your Pull Request! We have developed this task checklist from the Development Process Guide to help with the final steps of the process. Completing the below tasks helps to ensure our reviewers can maximize their time on your code as well as making sure the admiral codebase remains robust and consistent.

Please check off each taskbox as an acknowledgment that you completed the task or check off that it is not relevant to your Pull Request. This checklist is part of the Github Action workflows and the Pull Request will not be merged into the devel branch until you have checked off each task.


Implementation description

This pull request introduces the automated export of SDTM datasets to CSV format and adds two example CSV files to the repository. The changes improve reproducibility and make example datasets more accessible for external use or testing.

  • The data generation script (data-raw/create_sdtms_data.R) now saves a CSV version of each dataset to the inst/extdata/ directory, making the datasets directly available to non-R programmers.

  • The .Rbuildignore file is updated to exclude all CSV files in inst/extdata/ from the R package build, ensuring these files are not included in the built package by default for CRAN submissions.


  • Place Closes #<insert_issue_number> into the beginning of your Pull Request Title (Use Edit button in top-right if you need to update)
  • Code is formatted according to the tidyverse style guide. Run styler::style_file() to style R and Rmd files
  • Updated relevant unit tests or have written new unit tests, which should consider realistic data scenarios and edge cases, e.g. empty datasets, errors, boundary cases etc. - See Unit Test Guide
  • If you removed/replaced any function and/or function parameters, did you fully follow the deprecation guidance?
  • Update to all relevant roxygen headers and examples, including keywords and families. Refer to the categorization of functions to tag appropriate keyword/family.
  • Run devtools::document() so all .Rd files in the man folder and the NAMESPACE file in the project root are updated appropriately
  • Address any updates needed for vignettes and/or templates
  • Update NEWS.md if the changes pertain to a user-facing function (i.e. it has an @export tag) or documentation aimed at users (rather than developers)
  • Build pharmaversesdtm site pkgdown::build_site() and check that all affected examples are displayed correctly and that all new functions occur on the "Reference" page.
  • Address or fix all lintr warnings and errors - lintr::lint_package()
  • Run R CMD check locally and address all errors and warnings - devtools::check()
  • Link the issue in the Development Section on the right hand side.
  • Address all merge conflicts and resolve appropriately
  • Pat yourself on the back for a job well done! Much love to your accomplishment!

@Gero1999 Gero1999 linked an issue Jan 31, 2026 that may be closed by this pull request
@Gero1999 Gero1999 marked this pull request as ready for review January 31, 2026 07:44
@bundfussr
Collaborator

I wonder if we should use DatasetJSON instead of CSV. Then we wouldn't lose the labels. And if we want to do the same in pharmaverseadam, it would be easier to handle date, datetime, and time variables.

What do you think?

@Gero1999
Collaborator Author

Gero1999 commented Feb 3, 2026

This is actually a very nice idea, you are totally right!

@Gero1999 Gero1999 marked this pull request as draft February 3, 2026 19:01
@manciniedoardo
Collaborator

@hski-github what do you think about the proposal to use Dataset-JSON instead? Would this still suit your needs?

@hski-bayer

hski-bayer commented Feb 4, 2026

There is pandas.read_json(URL), but the JSON needs to follow a certain structure, such as 'split', 'records', or 'index'.

See https://pandas.pydata.org/docs/reference/api/pandas.read_json.html

But I don't see how metadata could be preserved or used directly from this JSON. pandas.read_json has a dtype parameter for defining data types, but I guess this would require the information to be in a separate file.
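
The separate-file idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the sidecar layout (a JSON object mapping each column to a `dtype` and a `label`), the helper `apply_sidecar`, and the sample values are all hypothetical and are not part of this PR or of pandas.

```python
import json

# Hypothetical data in the 'records' layout: one JSON object per row.
# Values arrive as strings, as they would from an untyped export.
records_json = '[{"USUBJID": "01-701-1015", "AGE": "63"}, {"USUBJID": "01-701-1023", "AGE": "64"}]'

# Hypothetical sidecar metadata that would live in a separate file:
# column name -> intended type and variable label.
sidecar_json = (
    '{"USUBJID": {"dtype": "str", "label": "Unique Subject Identifier"},'
    ' "AGE": {"dtype": "int", "label": "Age"}}'
)

# Map the type names used in the sidecar to Python constructors.
CASTERS = {"str": str, "int": int, "float": float}


def apply_sidecar(records, sidecar):
    """Cast each value to the type named in the sidecar.

    Columns missing from the sidecar and None values pass through unchanged.
    """
    typed = []
    for row in records:
        new_row = {}
        for col, val in row.items():
            if col in sidecar and val is not None:
                new_row[col] = CASTERS[sidecar[col]["dtype"]](val)
            else:
                new_row[col] = val
        typed.append(new_row)
    return typed


records = json.loads(records_json)
sidecar = json.loads(sidecar_json)

typed_records = apply_sidecar(records, sidecar)
labels = {col: meta["label"] for col, meta in sidecar.items()}
dtype_map = {col: meta["dtype"] for col, meta in sidecar.items()}
```

A mapping shaped like `dtype_map` (with pandas-compatible type names) is the kind of value the `dtype` parameter expects, while `labels` shows that variable labels would still have to be carried outside the DataFrame.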

@bundfussr
Collaborator

> There is pandas.read_json(URL), but the JSON needs to follow a certain structure, such as 'split', 'records', or 'index'.
>
> See https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
>
> But I don't see how metadata could be preserved or used directly from this JSON. pandas.read_json has a dtype parameter for defining data types, but I guess this would require the information to be in a separate file.

It seems that there is not much support for Python at the moment. The CDISC pilot focused more on SAS and R (see https://www.cdisc.org/sites/default/files/2023-10/2023-cdisc-dataset-json-plenary-v5_0.pdf).

I don't think that you can use pandas.read_json(URL) directly with a Dataset-JSON file. The structure is similar to orient = 'split' but doesn't match exactly. You would need a Python module that provides functionality similar to the datasetjson R package. As a workaround, you could use dataset-json to convert the files into XPT and then read them in Python.
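
To illustrate the mismatch, here is a minimal sketch of reshaping a Dataset-JSON-like document into the `{"columns": [...], "data": [...]}` layout that orient = 'split' uses. It assumes a simplified layout with top-level `columns` (objects with `name` and `label`) and `rows` keys; real Dataset-JSON files carry more attributes and the exact keys depend on the specification version. The helper `to_split` and the sample values are hypothetical.

```python
import json

# Simplified, hypothetical Dataset-JSON-like document. Only `columns` and
# `rows` are assumed here; real files contain further top-level attributes.
dataset_json = json.dumps({
    "name": "DM",
    "label": "Demographics",
    "columns": [
        {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
        {"name": "AGE", "label": "Age", "dataType": "integer"},
    ],
    "rows": [["01-701-1015", 63], ["01-701-1023", 64]],
})


def to_split(document):
    """Reshape the document into a 'split'-style dict plus a label mapping.

    In the assumed layout `columns` is a list of objects, whereas the
    'split' layout expects a flat list of column names, so the names are
    extracted and the labels are returned separately.
    """
    parsed = json.loads(document)
    split = {
        "columns": [col["name"] for col in parsed["columns"]],
        "data": parsed["rows"],
    }
    labels = {col["name"]: col.get("label") for col in parsed["columns"]}
    return split, labels


split, labels = to_split(dataset_json)
```

The resulting `split` dict could then be serialized with `json.dumps` and should be readable with `pandas.read_json(..., orient="split")`; the labels still need to be kept separately, for example in `DataFrame.attrs`.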

@hski-bayer

hski-bayer commented Feb 4, 2026

Okay, agreed. Then please go back to the original proposal to create CSV files, because they would work nicely with Python's pandas.read_csv, and then the pharmaverse datasets can be used in Python.

@Gero1999
Collaborator Author

Gero1999 commented Feb 4, 2026

Yep, right, we can provide CSV then. But I think it would still be worth creating another issue to provide Dataset-JSON files as well. What do you think, @manciniedoardo @bundfussr?

In the meantime, I will create a package to allow the Python community to work with Dataset-JSON as well as any other bio/pharma JSON standard. It might take me a while to make something solid and publish it, but I think it will be worth it.

@Gero1999 Gero1999 marked this pull request as ready for review February 4, 2026 20:28
@bundfussr
Collaborator

> Yep, right, we can provide CSV then. But I think it would still be worth creating another issue to provide Dataset-JSON files as well. What do you think, @manciniedoardo @bundfussr?
>
> In the meantime, I will create a package to allow the Python community to work with Dataset-JSON as well as any other bio/pharma JSON standard. It might take me a while to make something solid and publish it, but I think it will be worth it.

Yes, I think it makes sense to create CSVs as a temporary solution. Once the Python package is available, we can replace the CSV files with Dataset-JSON files and then also provide Dataset-JSON files in pharmaverseadam.

Collaborator

@bundfussr bundfussr left a comment

Could you add an item to the changelog?

I also think we should mention and link the CSV files somewhere on the webpage. @Lina2689 , @manciniedoardo , @Fanny-Gautier , any ideas?

@manciniedoardo
Collaborator

> I also think we should mention and link the CSV files somewhere on the webpage. @Lina2689 , @manciniedoardo , @Fanny-Gautier , any ideas?

Yes, somewhere near the top of the README? Maybe the data sources section could be renamed to "Data", and then you could have subsections for data sources and data formats.

image

I also think the "How to update" section should mention that CSV versions of the datasets are saved as well. What do you think, @Lina2689?

@Lina2689
Collaborator

> create CSVs as a temporary solution. Once the python package is available we can replace the CSV files with dataset-JSON files and then also provide dataset-

Yeah, linking the CSV files on the webpage would be super helpful! We could add the link near the top, as suggested by @manciniedoardo. And, mentioning the point for CSV versions in the 'How to update' section is definitely useful for users who prefer that format.

@Gero1999 Gero1999 requested review from Lina2689, bundfussr and manciniedoardo and removed request for Lina2689 and manciniedoardo February 18, 2026 13:30
Collaborator

@manciniedoardo manciniedoardo left a comment

Looking good, just left some comments and will leave @Lina2689 to do the final review/approval - thanks

@@ -1,5 +1,7 @@
# pharmaversesdtm <img src="man/figures/logo.png" align="right" width="200" style="margin-left:50px;" alt="pharmaverse sdtm hex"/>

> <sup>Interactive data exploration: <a href="https://pharmaverse.github.io/pharmaversesdtm/articles/preview-sdtm.html">Preview SDTM vignette</a></sup>
Collaborator

Should this be part of this PR? @Lina2689

Collaborator Author

Good catch, that is my bad. I'll remove it later.

Co-authored-by: Edoardo Mancini <53403957+manciniedoardo@users.noreply.github.com>
Collaborator

@Fanny-Gautier Fanny-Gautier left a comment

Minor typo to correct in README. Thank you for the implementation!

Co-authored-by: Fanny Gautier <157114584+Fanny-Gautier@users.noreply.github.com>
NEWS.md Outdated

## Documentation

- Included CSV versions of all SDTM data under `extdata/sdtm-csv/` for ease of use of non R programmers. (#221)
Collaborator

Please update the folder path: you are saving the CSV files under inst/extdata, but here it is given as extdata/sdtm-csv/.

Gero1999 and others added 4 commits February 28, 2026 00:27
Co-authored-by: Lina Patil <157117024+Lina2689@users.noreply.github.com>
Merge remote-tracking branch 'origin/main' into 221-feature-request-provide-csv-files
@Gero1999 Gero1999 requested a review from Lina2689 February 27, 2026 23:38
Development

Successfully merging this pull request may close these issues.

Feature Request: Provide CSV files

6 participants