Skip to content

Error model output#169

Merged
colinvwood merged 24 commits intoqiime2:devfrom
jordenrabasco:error_model_output
Aug 16, 2025
Merged

Error model output#169
colinvwood merged 24 commits intoqiime2:devfrom
jordenrabasco:error_model_output

Conversation

@jordenrabasco
Copy link
Copy Markdown
Contributor

@jordenrabasco jordenrabasco commented Sep 22, 2024

This pull request is to resolve issue #158

Big changes:

  • Q2-DADA2 'denoise-' commands output a collection[DADA2stats] rather than a DADA2STATS object
  • New Q2-DADA2 action stats_viz will visualize all DADA2stats in the collection[DADA2stats] (denoised stats, and error model stats) as a singular visualization with different tabs for each DADA2STATs object.
  • tests were updated accommodate the new output type
  • tests added for the visualization to make sure that all support files are generated

- generates new action to visualize dada2 stats in a distinct tab format
- alters dada2 stats output to be a collection[DADA2Stat] obj
-updates existing tests to accommodate this change in data type
-update existing tests to come into line with the new output collection[dada2stats]
-adds tests for the output error stats table
-adds tests for the error model vizualization
- fixes bug with vizualizing ccs stats
-removing blanks from test files to fix errors in qiime CI
@gregcaporaso
Copy link
Copy Markdown
Member

gregcaporaso commented Sep 26, 2024

Thanks @jordenrabasco!

Because this introduces a breaking change (changing the output to a Collection), we can't merge this before the 2024.10 release because we always give warning about upcoming breaking changes. We can mention that this change is coming though in the 2024.10 release notes, and then review/merge early in the 2025.4 development cycle so this new functionality is available in development versions of QIIME 2.

Sound ok? Let us know if you have any questions.

Todo:

  • add note about the breaking change to the 2024.10 release notes

@jordenrabasco
Copy link
Copy Markdown
Contributor Author

@gregcaporaso Of course! That all sounds good to me. In the meantime, let me know if you need me to change anything with the implementation on my end to make it easier for you guys!

@hagenjp hagenjp self-assigned this Oct 3, 2024
@hagenjp hagenjp removed their assignment Nov 14, 2024
@colinvwood colinvwood self-assigned this Dec 3, 2024
@colinvwood
Copy link
Copy Markdown
Contributor

Hello @jordenrabasco, how do you feel about making the new visualizer only display the transition plots? I suggest this because we already have the metadata tabulate action to display the dada2 stats which makes the "Denoised-Stats" tab in the new visualizer redundant. The metadata tabulate action also is more full featured, for example it displays the column datatype and the direction of sorting. Additionally, one might want to call metadata tabulate on the transition stats so metadata tabulate will remain a key tool for inspecting dada2 output one way or the other.

One other quick thing I noticed while trying out these changes, try to name artifacts-and-visualizers-like-this.qza (no capitalization and hyphens instead of underscores), this is just a convention we try to follow. I'll circle back around to a more in-depth review once the visualizer has been ironed out.

@colinvwood
Copy link
Copy Markdown
Contributor

By the way, if at all possible it would be best if we could avoid adding additional R software to this repository. In fact, we're planning to transition some of the R code here to python using the rpy2 library (see #172). This makes things much easier for us to maintain. For example, the internal_plotErrors function that you added to run_dada.R would be much preferred in python. Or probably even better yet, you could simply reach out to dada2's plotErrors function for this purpose.

@jordenrabasco
Copy link
Copy Markdown
Contributor Author

Hi @colinvwood I can change the names and start transferring things over to rpy2 no problem!

In terms of the tabs in the visualizer I initially incorporated the tab format with dada2-stats and the error-model-stats as it was requested in issue #158, should I remove the visualizer tab format entirely or keep the framework?

@colinvwood
Copy link
Copy Markdown
Contributor

Hello @jordenrabasco,

In terms of the tabs in the visualizer I initially incorporated the tab format with dada2-stats and the error-model-stats as it was requested in issue #158

I see. In the interest of reducing code duplication I think it makes sense to keep them separate. If it's easy to do so you could perhaps borrow the tabulate template from q2-metadata. This way the presentation is as users are used to.

I also think the denoising stats and the transition frequencies are quite distinct sets of information, and most often users will likely only be interested in (and only able to immediately understand) the former. Keeping the two separate also means that the denoising stats are only able to be visualized one way--with metadata tabulate. This keeps things simple.

What's your opinion from an ease of use perspective--separate visualizers or a tabbed one? If you prefer the tabbed approach I think it would be best if we borrowed the tabulate functionality for the denoising stats tab.

I can change the names and start transferring things over to rpy2 no problem!

Don't worry about rpy2. This is something that I would like to do in the future, but I think it only makes sense if the entire run_dada.R script gets converted all in one sweep. What would be helpful however is if the transitions plot were not implemented in R (because this will mean more refactoring to do later when the R script gets removed). Is there a reason we couldn't use the plotErrors function from the dada2 package (see here for an example)? It's quite straightforward to get an R plot into a qiime2 visualizer: you can just save the plot to disk in a temporary directory, copy it into the visualizer's output_dir, and reference it in the index.html file.

@jordenrabasco
Copy link
Copy Markdown
Contributor Author

What's your opinion from an ease of use perspective --separate visualizers or a tabbed one?

Personally I think that the having all of the diagnostic information readily available within the same visualization via tabs would be very useful to the user as opposed to having it be separated out and needing to be viewed discreetly. In my experience people generally only look at the diagnostics when things go wrong downstream, so I think having everything in one place could be handy. However, this may also be a personal preference of mine.

If you prefer the tabbed approach I think it would be best if we borrowed the tabulate functionality for the denoising stats tab.

I thought about doing this but couldn't figure out how to incorporate the tabulate functionality within a tabbed visualizer. I could convert the visualizer into a pipeline but that would just output different .qzv objects for the denoising stats, and the error plot stats.

Don't worry about rpy2. This is something that I would like to do in the future, but I think it only makes sense if the entire run_dada.R script gets converted all in one sweep.

Ah okay my mistake sorry for misunderstanding!

What would be helpful however is if the transitions plot were not implemented in R (because this will mean more refactoring to do later when the R script gets removed).

The transition plots themselves are generated in _vizualizer.py in python however the preprocessing of the plot information is in R within internal_plotErrors. Do you want all the preprocessing to happen in python as well?

Is there a reason we couldn't use the plotErrors function from the dada2 package (see here for an example)? It's quite straightforward to get an R plot into a qiime2 visualizer: you can just save the plot to disk in a temporary directory, copy it into the visualizer's output_dir, and reference it in the index.html file.

plotErrors returns a ggplot2 object and when processing the data in the denoise- actions I was uncertain of how to save that as a .qza file and then incorporate it into the tabbed visualizer in the stats-viz action. It is my understanding that the temporary dir is deleted at the end of a qiime2 action. Is this correct or does it stay for the entirety of your qiime session? With this in mind I opted to preprocess the data, save the df as a .qza and then generate the plot via python in the visualizer itself. I didn't invoke the ploterrors function within the visualizer itself due to not wanting to expand the code base by adding an additional rscript (I refactored the q2-dada2 repo a while ago to get rid of tangential and redundant Rscipts). However, if you think this is the way to go I can implement it this way (i.e. visualizer calls Rscript to make the plot).

Also to be clear I am open to going with whatever you guys think is best!
Let me know what you think!

@colinvwood
Copy link
Copy Markdown
Contributor

@jordenrabasco,

Personally I think that the having all of the diagnostic information readily available within the same visualization via tabs would be very useful to the user

Sounds good, let's go forward with this approach.

I thought about doing this but couldn't figure out how to incorporate the tabulate functionality within a tabbed visualizer. I could convert the visualizer into a pipeline but that would just output different .qzv objects for the denoising stats, and the error plot stats.

This should definitely be doable. Were there any specific problems you ran into that I could help with?

plotErrors returns a ggplot2 object and when processing the data in the denoise- actions I was uncertain of how to save that as a .qza file and then incorporate it into the tabbed visualizer in the stats-viz action.

The temporary directory does go away once you exit the context manager for it, but as long as you copy the saved file into output_dir within the context manager (or save the file right into the output dir) everything works. Take a look at this function as a reference. I think this is probably going to be the best approach because its such a minimal amount of code--one plotErrors call and a few more lines to save the file into output_dir.

Hopefully this is enough to get started. Let me know if you get stuck on anything.

@jordenrabasco
Copy link
Copy Markdown
Contributor Author

jordenrabasco commented Dec 15, 2024

Hi @colinvwood I am having trouble with a few things.

This should definitely be doable. Were there any specific problems you ran into that I could help with?

I am a bit confused here, do you mean you can render a .qzv file within an html tabbed framework index file or is there another way to go about this?

The temporary directory does go away once you exit the context manager for it, but as long as you copy the saved file into output_dir within the context manager (or save the file right into the output dir) everything works. Take a look at this function as a reference. I think this is probably going to be the best approach because its such a minimal amount of code--one plotErrors call and a few more lines to save the file into output_dir.

I am having some issues changing over the code to the format you provided more specifically saving the errorplots via ggsave into the output dir. From my understanding only a visualizer has the output_dir as the default first arg and denoise_* is registered as a method and so it doesn't have access to the output_dir address. Do you know of a way around this?

Sorry if I am misunderstanding and thanks so much!

@colinvwood
Copy link
Copy Markdown
Contributor

Hey @jordenrabasco,

I am a bit confused here, do you mean you can render a .qzv file within an html tabbed framework index file or is there another way to go about this?

You'll have to do some work to make the tabulate visualizer (or the template which it extends) be tabbed. We use jinja2 to manage the various visualizer templates so if you don't have any experience with that you might want to take some time to read its documentation. It looks like the table itself is made with jQuery but you probably shouldn't have to mess with that. The tabulate visualizer is in the q2_metadata repository and the base templates are in the q2templates repository.

All of this is maybe an argument against making a single tabbed visualizer and for making two separate visualizers...

I am having some issues changing over the code to the format you provided more specifically saving the errorplots via ggsave into the output dir. From my understanding only a visualizer has the output_dir as the default first arg and denoise_* is registered as a method and so it doesn't have access to the output_dir address. Do you know of a way around this?

You don't want to make the visualization in the denoise-* method, but in the visualization. My understanding was the the input needed by the plotErrors function was the new output that you added in this PR. Thus the visualizer would take this input, call plotErrors, and then save the figure to the output_dir. Let me know if I'm misunderstanding anything.

@jordenrabasco
Copy link
Copy Markdown
Contributor Author

jordenrabasco commented Dec 16, 2024

@colinvwood

You'll have to do some work to make the tabulate visualizer (or the template which it extends) be tabbed. We use jinja2 to manage the various visualizer templates so if you don't have any experience with that you might want to take some time to read its documentation. It looks like the table itself is made with jQuery but you probably shouldn't have to mess with that. The tabulate visualizer is in the q2_metadata repository and the base templates are in the q2templates repository.

Ah okay I understand now, I thought you wanted me to utilize the tabulate visualizer action itself within the code, instead of augmenting the code to be extended into tabbed format . Apologies for misunderstanding.

All of this is maybe an argument against making a single tabbed visualizer and for making two separate visualizers...

Quite possibly. I will try to see if I can change the table over into tabulate format within a tabbed format and if not I will revisit this idea. If that's okay with you?

You don't want to make the visualization in the denoise-* method, but in the visualization. My understanding was the the input needed by the plotErrors function was the new output that you added in this PR. Thus the visualizer would take this input, call plotErrors, and then save the figure to the output_dir. Let me know if I'm misunderstanding anything.

Ah okay, so the new output from the denoise functions is not the error model output but an augmented dataframe to allow for some preprocessing and to allow for saving both fwd/rev reads within the same obj, as QIIME doesn't allow for optional outputs. I can try and change things over to output the model information instead of the augmented dataframe, however if I do this I will need to create another R script to then import this data back into R and call ploterrors(). I considered this apporach previously but didn't want to expand the code base further via another R file. Is this an approach that would work better for you guys?

Let me know what you think!

@colinvwood
Copy link
Copy Markdown
Contributor

Quite possibly. I will try to see if I can change the table over into tabulate format within a tabbed format and if not I will revisit this idea. If that's okay with you?

Of course, sounds good.

Ah okay, so the new output from the denoise functions is not the error model output but an augmented dataframe to allow for some preprocessing and to allow for saving both fwd/rev reads within the same obj, as QIIME doesn't allow for optional outputs. I can try and change things over to output the model information instead of the augmented dataframe, however if I do this I will need to create another R script to then import this data back into R and call ploterrors(). I considered this apporach previously but didn't want to expand the code base further via another R file. Is this an approach that would work better for you guys?

I think it's totally fine in this case because the R will be so minimal/boilerplate. Eventually it can be refactored using rpy2.

@colinvwood colinvwood removed their assignment Feb 24, 2025
@gregcaporaso
Copy link
Copy Markdown
Member

@jordenrabasco, are you still available to work on this for the 2025.4 release? We'd need to have it wrapped up in the next couple of weeks if so. If not, we can hold off till the next release (2025.10).

@colinvwood colinvwood moved this from In Development to In Review in 2025.4 🌻 Apr 3, 2025
@jordenrabasco
Copy link
Copy Markdown
Contributor Author

jordenrabasco commented Apr 3, 2025

Hi @colinvwood,

I vaguely remember the error plots being visible in a previous version of this PR,

I believe we talked about this option however we never implemented.

What was the reasoning behind the stats-viz visualizer being a separate registered entity (as opposed to the dada2-* functions outputting the visualization directly)?

To my knowledge there is not a way within QIIME 2 to save a visualization (.QZV file) as well as a QIIME 2 data artifact (.QZA file) outside of generating a pipeline. Additionally, the plots for reverse/forward reads are generated independently within the DADA2 workflow. This means that without the intermediate step of either concatenating the images or saving the data together and then parsing it later on (as we do now), an optional output would be needed (fwd plot, rev plot) which QIIME 2 does not allow for. Another reason is that if the data from the intermediate error_stats file is exported from QIIME 2 it will produce a file that is comprehensible to the user and present the values depicted in the graph. I felt this was of importance as it would allow the user to investigate their own data without external resources.

A problem with this approach is that the new stats-viz errors if given the old dada2 stats file (which it seems like it should accept given the input semantic type).

If I am understanding correctly, you tried to input the dada2 read stats into the error stats visualizer? If this is the case the old dada2 stats file should error as it is not dada2 error data, but instead dada2 read retention data. A simple fix to avoid this confusion would be to rename the action error-stats-viz. Does this work on your end?

Let me know what you think!

@colinvwood
Copy link
Copy Markdown
Contributor

Hello @jordenrabasco,

To my knowledge there is not a way within QIIME 2 to save a visualization (.QZV file) as well as a QIIME 2 data artifact (.QZA file) outside of generating a pipeline.

I think a pipeline makes a lot of sense since this new visualizer is completely coupled to the dada2 method.

This means that without the intermediate step of either concatenating the images or saving the data together and then parsing it later on (as we do now), an optional output would be needed (fwd plot, rev plot) which QIIME 2 does not allow for.

The visualizer can remain as is, but only be called from the pipeline (and not registered independently).

If I am understanding correctly, you tried to input the dada2 read stats into the error stats visualizer? If this is the case the old dada2 stats file should error as it is not dada2 error data, but instead dada2 read retention data.

Right, I expected it to fail, but the issue is that it should not have run in the first place. An action/visualizer that accepts semantic type X should run on all instances of X. Essentially, the new error plot stats should be a different semantic type since it is not interchangeable with other instances of the semantic type it has been registered as.

The pipeline technically would resolve this issue, as the call to the visualizer would be encapsulated in it, but even so I think it should be registered as separate semantic type for the reasons above.

@jordenrabasco
Copy link
Copy Markdown
Contributor Author

jordenrabasco commented Apr 3, 2025

@colinvwood,

Right, I expected it to fail, but the issue is that it should not have run in the first place. An action/visualizer that accepts semantic type X should run on all instances of X. Essentially, the new error plot stats should be a different semantic type since it is not interchangeable with other instances of the semantic type it has been registered as.

Ah I see, sorry for the misunderstanding!

The pipeline technically would resolve this issue, as the call to the visualizer would be encapsulated in it, but even so I think it should be registered as separate semantic type for the reasons above.

So just for clarity on my end, you would like me to convert all denoise-* actions into pipeline actions that will output the dada2 read stats obj, the dada2 error stats obj within the denoising stats folder and the dada2 error stats visualization as a separate output? Additionally you would like me to change the semantic type of the error stats obj and make it so that the error stats-viz action is only available internally to the pipeline actions?

Copy link
Copy Markdown
Contributor

@colinvwood colinvwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @jordenrabasco,

Yes, converting to pipelines is one option. The other is registering the error model stats as a different semantic type. Both of these solve the issue that is the error plot visualizer throwing an error when given a valid input type.

- changes to kebab case
- changes input name to be more explicit
-attempts to make new semantic type within dada2 folder
@jordenrabasco
Copy link
Copy Markdown
Contributor Author

jordenrabasco commented Apr 9, 2025

Hi @colinvwood,

The minor changes involving kebab case and the naming conventions have been pushed however, I am struggling to generate a new semantic type that is specific to the score-viz action.
Could you please provide some guidance on how to accomplish this?

-Jorden

@colinvwood colinvwood moved this from In Review to Needs Review in 2025.4 🌻 Apr 15, 2025
@colinvwood colinvwood moved this from Needs Review to In Review in 2025.4 🌻 Apr 15, 2025
@hagenjp hagenjp removed this from 2025.4 🌻 Apr 17, 2025
@github-project-automation github-project-automation bot moved this to Backlog in 2025.10 👻 Apr 17, 2025
@hagenjp hagenjp moved this from Backlog to In Review in 2025.10 👻 Apr 17, 2025
@colinvwood
Copy link
Copy Markdown
Contributor

Hello @jordenrabasco, sorry for the delay in getting back to you.

See here for an example of creating a new semantic type. See here for an example of registering a semantic type and linking it to a format.

Let me know if these examples answer your questions, or if there's anything that still isn't clear!

@gregcaporaso
Copy link
Copy Markdown
Member

@jordenrabasco, need anything on this? Feel free to DM me on the forum if you want to discuss at all.

- semantic type added for error model
@jordenrabasco
Copy link
Copy Markdown
Contributor Author

@gregcaporaso @colinvwood I think we should be all set! I just pushed the requested changes in latest push.
Let me know if there are any changes once you look it over, apologies for the delay on this!

-change descriptions
@jordenrabasco
Copy link
Copy Markdown
Contributor Author

jordenrabasco commented Aug 9, 2025

@colinvwood, the requested changes have been pushed. It looks like the build is failing on the automated tests, I am unsure how to fix this. Let me know if there is anything else I need to do on my end!

@colinvwood colinvwood linked an issue Aug 15, 2025 that may be closed by this pull request
@colinvwood
Copy link
Copy Markdown
Contributor

Thanks @jordenrabasco 👍🏻

@colinvwood colinvwood merged commit d2abd17 into qiime2:dev Aug 16, 2025
4 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Changelog Needed in 2025.10 👻 Aug 16, 2025
@colinvwood colinvwood moved this from Changelog Needed to Completed in 2025.10 👻 Aug 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

No open projects
Status: Completed
Status: Completed

Development

Successfully merging this pull request may close these issues.

ENH: include error model plots in denoise-* output

5 participants