The current code was not written with efficiency in mind, and on large datasets the process can max out an 8-10 GB memory allocation. It would be good to refactor it so that memory use and run time scale better with dataset size.
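
One possible direction, shown only as a minimal sketch: process the data in fixed-size chunks instead of loading everything at once, so peak memory depends on the chunk size rather than the dataset size. This issue does not say what the process actually computes, so the column sum below is a hypothetical stand-in, and the names `iter_chunks`, `column_total`, and `chunk_size` are illustrative assumptions rather than anything in the existing code.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(rows: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield lists of at most chunk_size items without materializing the whole input."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def column_total(source, column: str, chunk_size: int = 10_000) -> float:
    """Sum one numeric column while holding only one chunk of rows in memory.

    Hypothetical stand-in for the real per-row work done by the process.
    """
    reader = csv.DictReader(source)  # reads rows lazily from the file object
    total = 0.0
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row[column]) for row in chunk)
    return total


# Small in-memory stand-in for a large file on disk (assumed CSV input).
sample = io.StringIO("id,value\n1,2.5\n2,3.5\n3,4.0\n")
result = column_total(sample, "value", chunk_size=2)
print(result)
```

Whether this pattern applies depends on whether the real computation can be expressed as an incremental pass over the data. Steps that need the full dataset at once (for example a global sort) would need a different approach.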