Home
Welcome to the WGSExtractRepo wiki!
Basically, how I (Mac) am proceeding:
1. Migrate the current stable, known-to-work system into a git repo + binary-manifest JSON system. Change as little as possible so I don't inadvertently break something during the move, or let the refactoring scope-creep beyond this goal.
2. Make a release script that creates installer packages that reproduce the current dev release process.
3. Make a dev-init / dev-launch system so developers can do active development against the git repo.
4. Test that this works: clean install and reinstall/update, launches, and is able to do its actions on the 3 platforms and multiple installation methods <— I am here
5. Once the move is complete and we have the same install functionality, we start "unpacking" and fixing other things up.
At the end of it we have a git repo we can collaborate in, can create install packages quickly, can actively develop on that repo, and have a test system to know that our changes haven't broken the current build.
- The manual has a long laundry list of items that have been suggested or thought of along the way. Git is just one of hundreds. Upgrading yleaf to v3 is another. Adding IGV would be a nice one in many respects; I could never figure out how to create the JSON it needs to load the desired BAM/CRAM and/or VCF on startup, and then also to add some interesting tracks like all the Y SNP names, gene names, etc. People are always wanting to run that tool. Starting up gene.iobio.io with a supplied file is another interesting one. Proper installers on the platforms (e.g. NSIS on Windows, .dmg on MacOS, etc.) are others.
- I would sure like to turn the whole reference folder into download-and-build-on-demand. There are so many cases of multiple versions of files with different ref-model coordinates and sequence names. Also, some large files, like snpdb at 50 GB on its own, are needed to do VCF annotation.
- The stuff you have in base_reference was just me getting around the fact that I wanted to do simple patch releases and not force an upgrade of a large binary blob too frequently. The spreadsheet changes often, and I used to keep it on GitHub like the JSON files, but I found that unreliable, as people still needed to run when unable to reach GitHub. (The internet is not as accessible and clean for everyone around the world.) I do not recall why the chain files for liftover had to be updated; I must have found a bug in the previous files or something. The key was that the installer and program blobs were just source files that changed often and are easy for people to update, so I would sneak things in there while doing these patch updates. The seed genomes file should really be auto-read and updated over the network, if reachable, before being used. Likely the whole program, for that matter.
- The seed genomes file is important and something unique that we generate and use. But the binary chain files come from a third party and should only be part of the larger reference blob, not in GitHub.
- So if I publish a new ref blob with those patches, we could remove that from the git repo and make things cleaner. I think I already have a new one there; I just did not add it to the JSON for v4 yet.
- FYI, we can test basic functionality with mini-files, and build that into the git release process. Using them does raise some pop-up alerts about poor 30x WGS results, but we can add a flag to bypass those. Some have wanted a command-line capability as it is; the current and original structure still intertwines the GUI and OO data-transformation code too much. (I took over the code after its first year, 2019, when it was only 1k SLOC.) The issue is that many "bugs" come up that only the full, resource-intensive standard files will trigger. (Like coordinate-sorting a BAM on MacOS, or non-deterministic aspects of samtools / bcftools variant calling.)
- FYI, everywhere there is a JSON file entry for a blob, there is usually a make script for that blob at its top level: Cygwin, msys2, bioinfo, bioinfo-msys2, tools, and reference. They are usually removed in the final installation as well. Those files and the companion doc files are very important; they define how to recreate the blob. Not sure I saw those anywhere in the git repo yet.
- The blobs are created automatically, as much as I could make them; in some cases hand editing is required where algorithmic matching was not deemed feasible. Only the reference library is not built with a script. The doc identifying the correct source locations for all the included files is still being built. Some were never found or no longer exist, so we may have to source them from get.wgse.io.
- The reference/seed_genomes spreadsheet for reference models you already know about. I created that and try to keep it current.
- The reference/microarray folder is unique to WGSE and was created manually by myself and the previous (original) author Marko Bauer. (Marko did not want his name out there, so I refer to him as Marko Farmer in the code.) I have been working to overhaul that folder's content and the microarray-generation code. Well, for the past 3+ years; I never seem to have a large enough block of time to complete it.
The limit with samtools sort is the number of open files: samtools tries to open hundreds of files to do the final merge sort. MacOS has a kernel default of 250 open files per process; Linux and Windows are set to over 1,000. One can issue a kernel update command to extend that limit on some platforms, but it requires a reboot to take effect, so it cannot be done at program start time.
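As a concrete sketch (the threshold and the commented launchctl values are illustrative, not taken from the WGSE code), a pre-flight check of the open-file limit might look like:

```shell
# Read the current per-process open-file soft limit and warn if it is
# too low for samtools sort's final merge pass.
soft_limit=$(ulimit -n)
echo "current open-file soft limit: $soft_limit"
if [ "$soft_limit" != "unlimited" ] && [ "$soft_limit" -lt 1024 ]; then
  echo "WARNING: samtools sort may fail its final merge with this limit"
fi
# On MacOS one would raise it with something like (values illustrative,
# needs sudo and a reboot to persist, hence not doable at program start):
#   sudo launchctl limit maxfiles 10240 unlimited
#   ulimit -n 10240    # per-shell, up to the hard limit
```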
release.json points to a latest-release.json file. That file points to all the original packages, which can recreate all source files. But note that some of the packages it points to have to be made separately; for example, the Cygwin release of the bioinformatic tools. Most of this is scripted, but some is still not, as versions change and so does the type of edit the files require. And versions make a big difference. For example, DISCVRSeq.jar was updated to use JRE 11+ (17+) where it had been using JRE 8. Some of the jars package various versions of the GATK tools inside them, which limits the JRE version.
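For illustration only (every key name and URL here is invented, not the real schema), the manifest chain could look something like a latest-release.json of this shape:

```json
{
  "_note": "illustrative sketch only, not the actual WGSE schema",
  "version": "v4.x",
  "packages": [
    { "name": "bioinfo",   "url": "https://example.org/wgse/bioinfo.zip",   "md5": "<digest>" },
    { "name": "reference", "url": "https://example.org/wgse/reference.zip", "md5": "<digest>" }
  ]
}
```

with release.json holding little more than a pointer to the current latest-release.json URL, so patch releases only rewrite the small file.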
scripts/make_release.sh, with its description in scripts/make_release.txt, is the general way I make a release now. It must be run on a MacOS Mx machine due to Apple's restrictions on software downloaded via ZIP file packages.
I historically used VMs for Apple and Linux, but they are not accurate for the resources the program sees running natively, so in general I am not using VMs for MacOS or Windows. For Ubuntu I use WSLg, and for Linux micromamba the same. There are Cygwin and MSYS2 releases for Windows. I do have updates for some of these binary packages in use on an internal v5 code base I have. They are not back-ported to v4, as the code base there cannot deal with the updated versions.
All the AWS, Google, and Azure services tend to have accelerators available for certain bioinformatic tasks, mainly alignment and variant calling. On a 16-core Intel CPU with a fast SSD, you can do alignment of a 30x WGS in about 8 wall-clock hours; that is about the best possible with standard desktops. Xeon or similar higher-end servers can cut it to 2-4 hours. NVIDIA acceleration can achieve 1-2 hours. Acceleration only applies to embarrassingly parallel, SIMD-type tasks that graphics processors are built for, and it does require different (rewritten) software, which modifies the results slightly. A number of bioinformatic tools are not deterministic, which leads to some of our software integrity-check issues with WGSE.
Coordinate-sorting a BAM is another big, resource-intensive task. It parallelizes if you sort chunks in memory first and then do a final merge sort, which is how samtools operates; hence why it wants more open files than the MacOS kernel allows by default. It divides the total available memory by the number of processors, then sorts one chunk that fits in memory per processor, saving those intermediate results into files. It then merge-sorts the intermediate file set (plus the memory blocks not yet written out), often hundreds to a thousand or more files, depending on the amount of memory in the system.
This is inefficient, as nothing is kept compressed: 300-500 GB of disk space for the uncompressed SAM. They could easily bgzip-compress the intermediate chunk files, then uncompress only a block at a time, in memory, as needed. That little extra overhead would greatly reduce the intermediate disk space, and the saved disk I/O would likely make up for the cost of compression. Something I have been meaning to implement in samtools, verify the overall savings, and then submit to the git repository.
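The memory-division step above can be sketched with illustrative numbers (32 GB RAM, 8 threads, roughly 400 GB of uncompressed SAM; none of these figures come from samtools itself):

```shell
# Rough model of how samtools sort sizes its chunks: memory is split
# across threads, each in-memory chunk spills to one temp file, so the
# temp-file count grows with data size / per-thread memory.
total_mem_gb=32
threads=8
uncompressed_gb=400

mem_per_thread_gb=$((total_mem_gb / threads))
tmp_files=$((uncompressed_gb / mem_per_thread_gb))
echo "per-thread chunk: ${mem_per_thread_gb} GB"
echo "temp files to merge: ~${tmp_files}"
# The corresponding invocation would be something like:
#   samtools sort -@ ${threads} -m ${mem_per_thread_gb}G -o out.bam in.bam
```

With these numbers the final merge has to hold about 100 temp files open at once, which is already close to the MacOS per-process default.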
- "my experience with bioinformatics stuff is it’s really easy to do something obscurely wrong and get output that is incorrect without obvious errors or warnings" (? quote from Mac?)
- Wait till you see that minor updates to bioinformatics tools cause results to differ across platforms. MacOS is giving me headaches: they changed the gzip program slightly, so md5sums of zipped files are different on Apple silicon machines than on Apple Intel machines. Still trying to wrap my head around how to fix things like the seed genomes file to account for that.
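One possible way around it (a sketch using standard gzip flags; this is not what WGSE currently does) is to hash the decompressed content instead of the .gz container, since gzip embeds a timestamp in its header and different gzip builds can emit different compressed byte streams for the same input:

```shell
# Two gzips of the same payload can have different md5sums (header
# timestamp, implementation differences); hashing what's *inside*
# the container is stable.
printf 'seed genomes test data' > payload.txt

gzip -c  payload.txt > a.gz    # header carries the file's mtime
gzip -cn payload.txt > b.gz    # -n omits name/timestamp from header

inner_a=$(gzip -dc a.gz | md5sum | cut -d' ' -f1)
inner_b=$(gzip -dc b.gz | md5sum | cut -d' ' -f1)
[ "$inner_a" = "$inner_b" ] && echo "content hashes match"
rm -f payload.txt a.gz b.gz
```

Using `gzip -n` at packaging time also helps, but only content hashing is robust against differing gzip implementations on Apple silicon vs Intel.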
- My v5 code does md5 checks everywhere because people have trouble downloading large blobs. Users in Morocco use cell-phone hotspots as their internet connection: very unreliable, drops often, which makes it difficult to get a larger file to download. So we need to allow for restarts and check for accuracy when all done.
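The restart-and-verify pattern could be sketched like this (the URL and digest are placeholders, and this is not the actual v5 code):

```shell
# Resume-friendly blob download: curl -C - continues a partial file,
# and the md5 check only declares success once the whole blob is
# intact, so a dropped hotspot connection just means retrying.
verify_md5() {
  # $1 = file, $2 = expected md5 hex digest
  actual=$(md5sum "$1" | cut -d' ' -f1)
  [ "$actual" = "$2" ]
}

fetch_blob() {
  # $1 = url, $2 = local file, $3 = expected md5
  until verify_md5 "$2" "$3" 2>/dev/null; do
    curl -L -C - -o "$2" "$1" || sleep 5   # resume, back off on drops
  done
}
# fetch_blob "https://example.org/wgse/reference.zip" reference.zip "<md5>"
```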
- Apple silicon sucks at bioinformatics. I used to use VMware VMs for MacOS and Linux releases; now WSLg is working better for Linux on Intel hardware.
- MacOS will not accept downloaded apps that were not created on their hardware (some key they place in the zip file). So I bought a Mac mini just to make release zips, and now test Apple-hardware releases that way as well. It will be interesting to see what zips created on git do on MacOS.
- "for windows dev, are you using wsl, msys2 or cygwin? Is the msys2 stuff actually usable or is it just cygwin in practice?"
- I am using the installed blobs I created for development, just like a user release would see; often cygwin64. Albeit I often have the shortcut turned on for the aligner to be used in WSL (developed before we could even get it to operate on Windows Cygwin64; still much faster). We had to rewrite the ralloc routine that was missing in cygwin64 to let the aligner do the shared-files-in-memory concept it wants, a basic kernel feature of Unix/Linux that Windows lacks. msys2 is there because one of the main users just hates cygwin64 and, maybe like you with Brew vs Ports, played around to try and make it work; I took his start and finished it off to make the release possible. It is slightly more limited (there was something missing I could not make available). Overall, the cygwin64 release is still much more native and efficient than VMs of the other OSes. This is mostly due to the extensive resource utilization of bioinformatics: CPU, memory, and disk space, both the amount available and I/O performance. WGSE in WSLg Ubuntu or VMware MacOS takes more time than on Windows 11 with Cygwin64, no matter how much I try to tune the VM parameters for file I/O and memory availability. It may be better on Linux if it were native, as most of the tools are developed on *nix systems. If I could switch home-management stuff over to the Mac mini, maybe I would consider a native Linux box for development. But I still need native Windows to make the bioinformatic tool releases that 50+% of the users rely on.
- @RandyHarr : Windows x86 w/ PyCharm + WSL / VMs for linux. Mac mini to sign binaries / zips and test some mac stuff
- "WSLG being available to run WGSE without VMWare was a big relief and improvement (once they fixed the WSLG file I/O issues they had initially)."
- @theontho : M1 Pro w/ various vscode based AI editors. Windows 11 desktop PC to test Windows.
Non-USA people mainly use Linux; MS Windows is not very popular outside the USA. The EU standardized on Linux, LibreOffice, and the like many years ago. The Middle East, India, etc. have not cared for Microsoft since Windows 7.
So, for WGSE, in the USA (~60% of users): Windows (60%), MacOS (30%), Linux (10%). Everywhere else: Linux (75%), MacOS (15%), Windows (10%). This is my best guess based on support issues, without hard numbers to back it up.