
Alaska2#6

Open
otoien wants to merge 6 commits into paulbrodersen:alaska from otoien:alaska2

Conversation


@otoien otoien commented Feb 21, 2022

I created the Alaska2 branch and completed the modifications to improve handling of the annotation headings etc. It can be merged back into the alaska branch.
(An outstanding issue is that the Local Patient ID is not written back to the headers as it should be by the setHeader command, but this is not important, as the date/time and part of the local recording ID are preserved, and the filenames are matched to the recording files. [This could possibly be due to a bug or feature of the EDF library beyond my knowledge.])

…runcated and automated annotation files.

Added reading and writing of EDF headers from the original file to the truncated and automated annotation files. This preserves the start time of the annotations. At this stage the header is in both cases read from the data file (might need improvement).
1. In 00_truncate_edf_annotations.py, the heading of the new file should be taken from the original annotated file, not the data file.

2. In configuration.py, changed the 'Undefined' token to 'Sleep stage 1' to align with the undefined/unscorable token in Polyman. Added a token for limb movement that could accidentally be included by Polyman, to avoid an exception if the code is present; it is also coded as 0, the same as 'Sleep stage ?'. We do not expect to score limb movement, but Polyman does not allow deleting this code if one accidentally presses m for limb movement.

3. time_resolution has now been set to 10 s, as it causes less wiggling in transitions, more in line with the manual scoring.

4. Removed some comments that are no longer needed.

00_truncate_edf_annotations.py should not be allowed to overwrite the original manual annotation files, as the edf library writes annotations organized differently than the Polyman viewer used for manual scoring (each new annotation in its own record, while in Polyman all are located at record 0), so it might not be possible to revise them in Polyman after the truncation. A new column, 'file_path_manual_state_annot_orig', has been introduced that holds the file path of the original manual annotations. The script writes to the path listed under 'file_path_manual_state_annotation', which is then used as before the modification.

An example spreadsheet is provided in the root folder.
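For illustration, reading the input/output path pairs from such a spreadsheet could look like the sketch below. Only the two column names come from the description above; the helper name and everything else are hypothetical, not the actual pipeline code.

```python
import csv

def iter_truncation_jobs(spreadsheet_path):
    """Yield (original, truncated-output) annotation path pairs.

    Hypothetical helper: reads the configuration CSV and pairs the
    original manual annotation path with the truncation target path.
    """
    with open(spreadsheet_path, newline='') as f:
        for row in csv.DictReader(f):
            yield (row['file_path_manual_state_annot_orig'],
                   row['file_path_manual_state_annotation'])
```

Keeping the two columns distinct means the truncation routine never writes to the folder it reads from.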
A small error in the new heading corrected; not sure how it sneaked back in.

otoien commented Feb 21, 2022

Please hold off a little on the request, as I need to check something - it might just be my config files.

This fixes issue paulbrodersen#7 by loading time_resolution into the truncation routine, using it to truncate the total time of the raw signal to a whole number of epochs, and using that to calculate the needed truncation of the EDF annotations.
Tested working with 10s and 5s time_resolution on all data in Bear_selection.
Update to solve issue paulbrodersen#8 with an inconsistent time axis in plots 2-4 when time_resolution is set to >1 second.

otoien commented Feb 24, 2022

@paulbrodersen, as both issues #7 and #8 have now been solved, it is OK to merge into the main alaska branch, as I do not see any further problems.

@paulbrodersen

Hi Oivind

Thank you for the pull request, and all the work that went into it. I love the bug fixes and I like a lot of the other changes. However, I will probably not merge the PR as is, as

  1. I would like to split the commits a bit differently,
  2. some of the changes belong on the master branch, and
  3. some of the proposed changes require a bit further thought IMO.

Instead, I will copy over some of the changes that I think are uncontroversial and then we can discuss the remainder.

For future reference, it helps to keep PRs as small as possible. That makes it much easier to merge at least some of the changes quickly. For example, each one of your commits could have easily been a separate PR, and then I could have simply merged the parts that I completely agree with.

@paulbrodersen left a comment

Regarding 1ac5cc0: I am trying to understand why this issue has to be resolved with code. Why are you not simply backing up the original files?

@paulbrodersen left a comment

Regarding 812e259: I am having trouble understanding what is going wrong here. Which part of the code base throws an error if the total_time_in_seconds is not an exact multiple of time_resolution?

@paulbrodersen left a comment

Regarding 8cc78d5: issue addressed in 23101a4

import pandas

from argparse import ArgumentParser
from collections import Iterable
@paulbrodersen left a comment

Python v3.3 changes to collections addressed in 58924e0
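The usual way to handle this compatibility break is a guarded import (a sketch; commit 58924e0 may do it differently). `Iterable` lives in `collections.abc` since Python 3.3, and the old `collections` alias was removed in Python 3.10.

```python
# Guarded import: Iterable moved to collections.abc in Python 3.3;
# the legacy alias in collections was removed in Python 3.10.
try:
    from collections.abc import Iterable
except ImportError:  # very old Python versions only
    from collections import Iterable
```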

'Sleep stage R' ,
'Sleep stage R (artefact)' ,
'undefined' ,
'Sleep stage ?' , # ? = Undefined/unscorable code in Polyman
@paulbrodersen left a comment

Polyman terminology for undefined states adopted in 7059e9f.

for (start, stop), state in zip(intervals, states):
    writer.writeAnnotation(start, stop - start, state)
if header:
    writer.setHeader(header)
@paulbrodersen left a comment

Support for header keyword argument added in d0c0e1a.
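The optional-header pattern under review can be illustrated with a stand-in writer class. `FakeEdfWriter` and `export_hypnogram_sketch` are hypothetical names for illustration; the method names mimic pyedflib's `EdfWriter`, which the actual code uses.

```python
class FakeEdfWriter:
    """Stand-in that mimics pyedflib's EdfWriter method names.

    Purely illustrative: it records calls instead of writing an EDF file.
    """
    def __init__(self):
        self.header = None
        self.annotations = []

    def setHeader(self, header):
        self.header = header

    def writeAnnotation(self, onset, duration, description):
        self.annotations.append((onset, duration, description))


def export_hypnogram_sketch(writer, states, intervals, header=None):
    # Only copy a header over when the caller supplies one, so callers
    # without an EDF source header are unaffected.
    if header:
        writer.setHeader(header)
    for (start, stop), state in zip(intervals, states):
        writer.writeAnnotation(start, stop - start, state)
```

Making `header` a keyword argument with a `None` default keeps existing call sites working unchanged.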

@paulbrodersen left a comment

Regarding bed44a4: headers are preserved when truncating files with commit c4c7f3b.

with EdfReader(dataset['file_path_raw_signals']) as f:
    edf_header = f.getHeader()  # read the header from the original file

export_hypnogram(dataset['file_path_automated_state_annotation'], predicted_states, predicted_intervals, edf_header)
@paulbrodersen left a comment

This behaviour is implemented in c020140. However, the headers are only copied, if the target file for the annotations is indeed in an EDF format.


otoien commented Mar 9, 2022

Regarding 1ac5cc0: I am trying to understand why this issue has to be resolved with code. Why are you not simply backing up the original files?

Of course best practice is to always back up original manual scores, but as files are being reviewed, it is better to always be able to read input to the truncation routine from the original Polyman manual annotation folder and then output truncated files into a separate annotation folder used for the Somnotate processing. Otherwise files that are already truncated may be fed to the routine and truncated again, with possible consequences for the headings. (Before the fix below with time_resolution, I also ended up with a truncated file being read again and never reaching the right length, but that is probably no longer the case.)

To begin with I did not realize that the original manual scores would be overwritten in a different format, but luckily I had already copied the scores to a different folder. So this might also be an issue for fresh users of the software, who do not realize that their original manual annotations get overwritten, for instance when specifying the folder location pointed to by Polyman. By providing an example configuration CSV file, it becomes obvious that the truncation routine has an input and a separate output, to avoid overwriting.

In other words, I would argue that a program should never be allowed to overwrite files it did not generate itself without asking for explicit permission.

Regarding 812e259: I am having trouble understanding what is going wrong here. Which part of the code base throws an error if the total_time_in_seconds is not an exact multiple of time_resolution?

Perhaps this is already figured out, but for the record, it was in _automated_state_annotation.py, in _check_inputs. Here is the archived console output with the traceback from when that happened:

Traceback (most recent call last):
File "C:\somnotate\example_pipeline\02_test_state_annotation.py", line 103, in <module>
accuracy[ii] = annotator.score(signal_arrays[ii], np.abs(state_vectors[ii]))
File "C:\somnotate\somnotate_automated_state_annotation.py", line 156, in score
self._check_inputs([signal_array], [state_vector])
File "C:\somnotate\somnotate_automated_state_annotation.py", line 177, in _check_inputs
assert len(arr) == len(vec), "The lengths of the signal array ({}) and the corresponding state vector ({}) do not match!".format(len(arr), len(vec))
AssertionError: The lengths of the signal array (86350) and the corresponding state vector (86345) do not match!

The reason this is needed seems to be that when the data are preprocessed, they are already truncated to a whole number of epochs/time_resolutions. So the truncation procedure for the annotations also needs to do the same, not just use the total time of the recording. Otherwise there might be a mismatch.
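The arithmetic of the fix can be sketched as follows (a minimal illustration under the assumptions described above, not the actual code in the truncation routine): the recording length is rounded down to the largest whole number of epochs, and the annotations are then truncated to that duration.

```python
def truncate_to_whole_epochs(total_time_in_seconds, time_resolution):
    """Largest multiple of time_resolution that fits in the recording.

    Illustrative helper: the preprocessed signal covers exactly this many
    seconds, so the annotations must be truncated to the same duration.
    """
    num_epochs = int(total_time_in_seconds // time_resolution)
    return num_epochs * time_resolution
```

For example, a 2045 s recording scored at a 10 s time_resolution yields 2040 s of preprocessed signal; annotations truncated to the raw 2045 s would then be out of step with the state vector, producing a length mismatch like the one in the traceback.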


otoien commented Mar 9, 2022

Hi Oivind

Thank you for the pull request, and all the work that went into it. I love the bug fixes and I like a lot of the other changes. However, I will probably not merge the PR as is, as

  1. I would like to split the commits a bit differently,
  2. some of the changes belong on the master branch, and
  3. some of the proposed changes require a bit further thought IMO.

Instead, I will copy over some of the changes that I think are uncontroversial and then we can discuss the remainder.

For future reference, it helps to keep PRs as small as possible. That makes it much easier to merge at least some of the changes quickly. For example, each one of your commits could have easily been a separate PR, and then I could have simply merged the parts that I completely agree with.

Thanks for the feedback. Still finding my way around git, I am sorry for not realizing that my many smaller pushes of code changes from my computer to the alaska2 branch would not be reflected in the structure of the pull request - I sort of did not want to bother you before I had a working solution. I keep learning and will keep that in mind going forward. I am happy if some of the changes can even contribute to the main branch.

@paulbrodersen

In other words, I would argue that a program should never be allowed to overwrite files it did not generate itself without asking for explicit permission.

Eh, that is not how most software operates. If you open a text document in Word, the original document is modified; if you open an image in Photoshop, the default is to overwrite the raw image with the processed image. Also, the name of the script explicitly says that it is truncating the annotations, not writing a shortened copy.

But alright: I don't feel particularly strongly about the issue raised per se. I do dislike that I wrote that script in the first place, as input sanitization really should be out of scope and the responsibility of the user. If you try to process a jpg image with Photoshop but the file does not follow the jpg standard, then Photoshop does not sanitize the image for you. Analogously, the data and the annotations should be aligned before an "external" library like somnotate comes into the picture.
Doing input sanitization for a user always carries the risk of making too many assumptions. For example, in this script we assume that annotations and data are always well aligned at the start but out of sync at the end, which seems dangerous to me (but maybe you disagree). Finally, this script is the main reason the alaska branch even exists. Not having the script as part of the example pipeline would limit further balkanization of the code base. All of that being said, as you can see from the existence of 00_convert_sleepsign_files.py, this is not the first time I made this particular blunder. So maybe having these "adapters" is unavoidable.

Apart from these fundamental issues, I also dislike some of the currently proposed changes. In particular, the distinction between file_path_manual_state_annot_orig and file_path_manual_state_annotation is anything but intuitive. Given that you changed the variable name at least once, I am sure you see that issue, too.

I am unsure how to proceed on this front. If you have any suggestions how we retain usability of the code base but move input sanitation out of scope, then I am all ears.

@paulbrodersen

paulbrodersen commented Mar 9, 2022

So the truncation procedure for the annotations also needs to do the same, not just use the total time of the recording. Otherwise there might be a mismatch.

Fair enough. I will incorporate your changes.

@paulbrodersen

I keep learning and will keep that in mind going forward.

You are doing great. "Keep the scope of pull requests small." is literally the number one suggestion on every How-to guide for making pull requests, so I think everybody makes that particular mistake the first time they try to merge some code into someone else's code base. I certainly did.


otoien commented Mar 11, 2022

In other words, I would argue that a program should never be allowed to overwrite files it did not generate itself without asking for explicit permission.

Eh, that is not how most software operates. If you open a text document in Word, the original document is modified; if you open an image in Photoshop, the default is to overwrite the raw image with the processed image. Also, the name of the script explicitly says that it is truncating the annotations, not writing a shortened copy.

Thanks for the comments. The comparison to Word etc. is not quite fair, though, as one has to actively take action with File-Save to overwrite the original file. (Adobe Acrobat Reader is a different rogue case, as it keeps changing file attributes just on load and close, which causes problems for file sync programs.)

But alright: I don't feel particularly strongly about the issue raised per se. I do dislike that I wrote that script in the first place, as input sanitization really should be out of scope and the responsibility of the user. If you try to process a jpg image with Photoshop but the file does not follow the jpg standard, then Photoshop does not sanitize the image for you. Analogously, the data and the annotations should be aligned before an "external" library like somnotate comes into the picture. Doing input sanitization for a user always carries the risk of making too many assumptions. For example, in this script we assume that annotations and data are always well aligned at the start but out of sync at the end, which seems dangerous to me (but maybe you disagree). Finally, this script is the main reason the alaska branch even exists. Not having the script as part of the example pipeline would limit further balkanization of the code base. All of that being said, as you can see from the existence of 00_convert_sleepsign_files.py, this is not the first time I made this particular blunder. So maybe having these "adapters" is unavoidable.

Apart from these fundamental issues, I also dislike some of the currently proposed changes. In particular, the distinction between file_path_manual_state_annot_orig and file_path_manual_state_annotation is anything but intuitive. Given that you changed the variable name at least once, I am sure you see that issue, too.

I am unsure how to proceed on this front. If you have any suggestions how we retain usability of the code base but move input sanitation out of scope, then I am all ears.

I was not necessarily suggesting that the modification with file_path_manual_state_annot_orig and file_path_manual_state_annotation go into master. I certainly understand your wish to keep things clean, but it would simplify the further processing in the alaska branch to keep it. My first naming of the input file in the code was not very clear, sorry for that, so I changed it to something easier to understand. Your initial naming was kept for all output and further processing.

An issue I have is that the Polyman folder with the manual scores does not only contain the scores from this data set, but from any data set that is analyzed with Polyman. I definitely do not want Somnotate to write directly to this folder - it is too easy to forget the backup. After truncation by Somnotate or an edit with EDFbrowser, the EDF file can no longer be revised in Polyman. Initially Somnivore would also not read these files, but that has been fixed. We can instead use FreeFileSync and filter the files of the Polyman folder on the 3 suffixes used in Bear_selection to synchronize them to a processing folder. (I am still worried that someone in the future would point the CSV file to the wrong folder and ruin all the original manual scores without understanding that the file format is changed, but then again the same mistake could be made in the naming of the output folder, so nothing is bombproof in that respect.) When I was working on this, I had this round dance where reprocessing the annotation file with the truncation routine multiple times never would bring the annotation to the right length, but that was before the length discrepancy was solved.

Regarding the length issue, there is no risk of desynchronization at the end. It is simply caused by the fact that our record length for the EDF data is 5 seconds, while the epoch during manual scoring is 10 seconds. So if a file is 2045 seconds long, Polyman will extend the last manual score to 2050 seconds. When Somnotate reads the data, the last time_resolution interval after preprocessing apparently ends at 2040 seconds (understandably, it needs to be a whole number of time_resolutions), so the truncation procedure needs the sample interval information to truncate correctly. Ideally I should have truncated each data file on the last 10 second border instead of the last record border, but the final time_resolution (10 s) to be used was not known at the time the data were converted, and the files have of course already been distributed to the manual scorers. I do not think the comparison to sanitizing a corrupt jpg is fair here, as the user cannot always foresee beforehand what time resolution will work best. Hypothetically, one could have correctly truncated the data at the 10 s border and used a 10 s epoch length for the manual annotations that aligns perfectly, but then found during analysis that the best results were obtained at a 30 s time resolution in Somnotate for some reason. Then truncation would be needed anyway, and one would have the same alignment problem. And after all, this is a standalone routine that does not interfere with the rest of the code, so why regret that it has been created.


otoien commented Mar 12, 2022

Giving the first paragraph of my comment a little more thought, you are probably right that it gets too complicated and tedious having to enter two variants of file_path_manual_state_annotation in the spreadsheet. (The copying can, as noted, be effected externally by using for instance FreeFileSync instead.) So I suggest reversing the specific changes I made in commit 812e259 and commit 1ac5cc0 so that there is only one version of file_path_manual_state_annotation, but then adding the following (italicized) in the documentation under
"1. 00_truncate_edf_annotations.py
Truncate the EDF annotation files to match the length of the EDF data files.
-Make sure to run this on a separate copy of your manual annotation files, as the files are overwritten in a format that for instance the Polyman viewer cannot modify, so further revision of the annotations is unsupported."

Let me know what further action should be taken from my side (i.e. testing or code modifications).

An alternative afterthought: if one wanted to completely get rid of the truncation routine, a much more elegant and safer solution would be to check the length of the annotations against the length of the preprocessed raw data every time the annotations are loaded, and remove possible excess annotation epochs on the fly based on the same principle as the truncation routine, but without writing anything back to the annotation file. It would at least need modifications to both 02_test_state_annotation.py and 03_train_state_annotation.py. However, this is beyond my current understanding of the code at this stage, if it is even feasible.
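The on-the-fly variant could be as simple as trimming both sequences to their common length at load time, under the assumption that signal epochs and state labels share the same time_resolution. This is a hypothetical helper, not existing somnotate code:

```python
def align_on_load(signal_array, state_vector):
    """Trim signal epochs and state labels to their common length.

    Hypothetical load-time fix: instead of rewriting annotation files,
    drop the excess trailing epochs from whichever sequence is longer.
    """
    n = min(len(signal_array), len(state_vector))
    return signal_array[:n], state_vector[:n]
```

The original files stay untouched, so re-running the pipeline can never corrupt the manual scores.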
