Integration Support: KNIME #7388
stellarpower asked this question in Q&A
Admittedly I don't have much experience using it, but KNIME is a FOSS application fairly popular with data analysts, researchers, etc., that aims to bring complex data workflows into a reproducible but low-code environment through a highly visual interface. I keep having to remind myself to use it more often, as doing things in spreadsheets, working directly from files, or writing one-off scripts is not always an efficient use of time.
To give an example of its use: most of the programmers I work with, myself included, tend to automate a task in a script as soon as they realise it is going to be repetitive. It is common to write Python scripts that read and write arbitrary CSV structures from the filesystem, manipulate the data with pandas, and so on; the data is then opened and browsed in LibreOffice, or sometimes visualised with Matplotlib or in a Jupyter notebook.
However, this can sometimes be overkill. A lot of boilerplate is needed when writing these scripts, and it tends to be very similar each time: reading data in, writing it out, possibly parsing command-line arguments (see the sketch below). In a similar way, REST-based interfaces are an awful waste of time when one of several RPC mechanisms would do, if all I need is for code on one machine to talk to code on another. All this wrapping-up of opening and manipulating files distracts from the actual numerical work that is specific to our application, which is where getting things right is crucial. I have recently spent a good amount of time with a gradually increasing mess of scripts converting between file formats, and whilst I plan to write a proper application for this in due course, I realised KNIME is probably a much faster way to rip headers out of tables, interpolate data, add and discard columns, and transform data visually than doing all of this strictly in source code. A lot of time is wasted doing basic things over and over, and once I have saved my workflow, I can run it from the command line to operate on arbitrary files from the shell. In the same way that seasoned programmers know how to do as little as possible, and when to go and find a library or an established standard rather than write something from scratch, I am learning over time that knowing when to avoid writing any code at all is just as important. I reckon a number of Quarto users might find they get more done, more quickly, if they wrote code only for the important parts of their pipelines.
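To make the point concrete, this is roughly the shape those glue scripts tend to take. The file arguments and the "total" column are just made-up examples, but almost everything outside the two data-manipulation lines in the middle is boilerplate:

```python
import argparse
import pandas as pd

def main() -> None:
    parser = argparse.ArgumentParser(description="Read a CSV, transform it, write it back out")
    parser.add_argument("input_csv")   # path of the incoming table
    parser.add_argument("output_csv")  # where to write the result
    args = parser.parse_args()

    df = pd.read_csv(args.input_csv)   # boilerplate: read the table in

    # The only application-specific part: manipulate the data.
    df = df.dropna()                                  # e.g. discard incomplete rows
    df["total"] = df.sum(axis=1, numeric_only=True)   # e.g. add a derived column

    df.to_csv(args.output_csv, index=False)  # boilerplate: write it back out

if __name__ == "__main__":
    main()
```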
So, KNIME is a table-based system (built on top of Apache Arrow) that works with data in graph-based workflows, similar to the node-based pipelines that have become popular in the media industry and in Blender, and which I have personally used for signal processing in Neuromore Studio. Because it uses Arrow, data is typed and columns are keyed more like database tables, so none of the horrible problems people put up with in spreadsheets apply. If you want to convert a number to a string, you do just that, and add or replace a column in the table. Data sources and sinks are nodes; data processing is also performed in individual nodes, and "subroutines" can be made by grouping these into metanodes and components. As well as having data flow along the arcs between nodes, you can also send variables through the pipeline, both for processing on their own and for referencing in the configuration of the nodes.
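As a rough illustration of what typed, Arrow-style tables buy you, here is a small sketch in plain pyarrow rather than KNIME's own API, with invented column names: turning a numeric column into a string column is an explicit, typed operation on a new column, not a cell-format guess.

```python
import pyarrow as pa
import pyarrow.compute as pc

# A small typed table: every column has a declared type, like a database table.
table = pa.table({"id": [1, 2, 3], "value": [1.5, 2.0, 3.25]})

# Cast the numeric column to strings and append it as a new, explicitly typed column.
value_as_text = pc.cast(table["value"], pa.string())
table = table.append_column("value_text", value_as_text)

print(table.schema)  # id: int64, value: double, value_text: string
```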
The image from the article linked here gives some idea of the contrast between the two approaches, and I agree that, for me, the workflow on the right gives a much quicker, more intuitive idea of what is happening with the data.
Altogether, it offers a halfway house between working with program text (powerful, but laborious) and the various MS Office-style desktop applications (rapid, but blunt): both powerful and configurable, yet quicker and easier to use. I can view the data tables at any point in the process, so I can inspect the direct output of a node and check it is what I think it should be. That is even easier than inspecting objects directly in a REPL, which I believe is one of the things that has made scripting languages so popular these days.
Whilst some of the design has aged a little in my opinion, and some of the nodes seem to be oriented more towards business users (i.e. not experienced programmers, so what individual nodes do can be a bit too granular), one of the best things is that you have full access to several programming languages if you want it. KNIME is written in Java, so there are nodes for evaluating arbitrary Java functions against the data tables as part of the pipeline, and it is very easy to point it at a JAR file to use an external library. The Python nodes likewise allow working directly in Python for those who are more comfortable there, with the data converted to and from a pandas DataFrame automatically as part of the process. I have used this successfully to simulate mortgage payments for some different scenarios: the input data can be set up and iterated on in KNIME very quickly, and then I just call into an existing package for the amortisation calculations rather than solving an already-solved problem. I didn't have to waste any time writing glue code, nor did I have to find a formula for the amortisation and debug it. I just assembled the pipeline, using KNIME for the structure of the data processing (which it does a lot better than I would ever achieve in the time available) and someone else's code where possible for the numerical work.
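For a sense of how little code that leaves to write, here is a hedged sketch of what the body of a Python Script node might look like. It assumes KNIME's current knime.scripting.io scripting API and invented column names, and uses numpy-financial purely as an example of "someone else's code" for the payment maths, not the package I actually used:

```python
import knime.scripting.io as knio      # the Python Script node's I/O API
import numpy_financial as npf          # an existing package for amortisation maths

df = knio.input_tables[0].to_pandas()  # the KNIME table arrives as a pandas DataFrame

# One scenario per row; these column names are assumptions.
monthly_rate = df["annual_rate"] / 12.0
n_payments = df["term_years"] * 12
df["monthly_payment"] = -npf.pmt(monthly_rate, n_payments, df["principal"])

knio.output_tables[0] = knio.Table.from_pandas(df)  # hand the result back to the pipeline
```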
Anyway - overall, this feels like a very good fit for Quarto. Understandably it's a little different from the integration with notebooks and similar workflows, but I will shortly need to analyse some data from an experiment and write up a report on it, and having thought back to KNIME, my plan is to make my life easier by using it for most of the data wrangling. Any numerical work I need can be written in Java using a suitable package, or, if I get stuck, numpy and scipy are available just as they would be from a notebook. I can render and format a table suitable for printing within KNIME, and also plot my data, all from the GUI, and the whole reproducible workflow fits very well with Quarto's high-level aim of integrating the data analysis directly with the reporting of the results. Even iterating through the files in a directory to run the workflow on each can be done with readily available nodes, so crystallising exactly what we have done, alongside the logic of how we did it, and integrating this into published documents is all available under one roof. And, as mentioned, anything missing is quite easy to add, with the ability to write custom nodes and extend outwards through one of the several programming languages available.
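Until any deeper integration exists, the loop can already be closed by hand: a Quarto code cell can run a saved workflow headless and pick up its output. This is only a sketch; the paths, the workflow directory, the output file and the flow-variable name are all assumptions, and the batch-executor flags should be checked against the KNIME documentation for your install:

```python
import subprocess
import pandas as pd

# Run the saved workflow headless via KNIME's batch application.
subprocess.run(
    [
        "knime", "-nosplash", "-reset", "-nosave",
        "-application", "org.knime.product.KNIME_BATCH_APPLICATION",
        "-workflowDir=/path/to/my-wrangling-workflow",
        # Pass a flow variable telling the workflow which file to process.
        "-workflow.variable=input_file,/data/experiment_run_01.csv,String",
    ],
    check=True,
)

# The workflow is assumed to write its result to a known location; the report
# then uses it like any other data frame for tables and plots.
results = pd.read_csv("/data/experiment_run_01_processed.csv")
```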
I imagine the amount of work to achieve this could be significant, and I don't know much about Quarto's architecture; but if folks agree that an integration of some sort, as one of the backends for the data processing, makes sense, I think it could be very powerful and boost the popularity of both tools.
Thanks!!