Commit be2600d

Merge pull request #237 from int-brain-lab/v3.4.2

This version fixes the saving and loading of insertions parquet tables for offline processing.

### Fixed

- the insertions table has the minimal metadata to allow reloading after a cache dump

2 parents 5714446 + db84ffb commit be2600d
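
As a minimal sketch of the round trip this release fixes (the pid and paths are placeholders, and the exact set of parquet files written depends on which cache tables are populated):

```python
from one.api import ONE

one = ONE()  # remote mode, used to populate the cache tables
eid, pname = one.pid2eid('<pid>')  # fills the in-memory insertions table
one.save_cache('/path/to/cache_dir')  # dumps the tables, now incl. the insertions table

# Reloading the dumped tables in local mode now works for insertions too
one_offline = ONE(mode='local', tables_dir='/path/to/cache_dir')
assert (eid, pname) == one_offline.pid2eid('<pid>')
```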

File tree: 10 files changed, +132 −9 lines changed

CHANGELOG.md — 9 additions & 1 deletion

```diff
@@ -1,5 +1,13 @@
 # Changelog
-## [Latest](https://github.com/int-brain-lab/ONE/commits/main) [3.4.1]
+
+## [Latest](https://github.com/int-brain-lab/ONE/commits/main) [3.4.2]
+This version fixes the saving and loading of insertions parquet tables for offline processing.
+
+### Fixed
+
+- the insertions table has the minimal metadata to allow reloading after a cache dump
+
+## [3.4.1]
 This version fixes issues with corrupt REST cache and REST validation errors.
 
 ### Modified
```

README.md — 1 addition & 1 deletion

````diff
@@ -21,7 +21,7 @@ pip install ONE-api
 For using ONE with a local cache directory:
 ```python
 from one.api import One
-one = One(cache_dir='/home/user/downlaods/ONE/behavior_paper')
+one = One(cache_dir='/home/user/downloads/ONE/behavior_paper')
 ```
 
 To use the default setup settings that connect you to the [IBL public database](https://openalyx.internationalbrainlab.org):
````

docs/notebooks/recording_data_access.ipynb — 69 additions & 0 deletions

New cells appended to the notebook:

## Guide to running ONE with multi-process workflows

When using ONE with multiple processes that make simultaneous requests to the remote database, the requests can fail with connection errors such as `json.JSONDecodeError` and `HTTPError`.

To avoid these errors, it is very useful to generate the above-mentioned parquet files containing the cache data. ONE can then be initialized in local mode and the saved cache loaded. This ensures no further database requests are made during the parallel run, and hence no connection errors.

Below is an example where all the probe insertions for a project that have spike sorting data are first queried, then the cache is saved.

### Example

```python
from itertools import batched  # requires Python >= 3.12; see the fallback below

from one.api import ONE
from one.converters import datasets2records
from one.alf.cache import merge_tables

# To generate the cache, we need to use the remote mode of ONE
one = ONE()
# Get the list of eids that have the spikes.times.npy dataset
eids = one.search(project='u19_proj1_multiareacom', datasets='spikes.times.npy')
# Update the datasets table in batches of 50 sessions to avoid
# making one request to the database per session
for batch in batched(map(str, eids), 50):
    # These REST queries update the one._cache object in memory
    dsets = one.alyx.rest('datasets', 'list', django=f'session__in,{batch}')
    df = datasets2records(dsets)
    merge_tables(one._cache, datasets=df, origin=one.alyx.base_url)

# Provide the location of the directory where the cache will be saved
one.save_cache('multi_area_cache')
```

The above script will save the `datasets.pqt` and `sessions.pqt` files in the directory `multi_area_cache`.

You can then initialize ONE in local mode within your parallelized scripts, whether using SLURM, joblib, or multiprocessing:

```python
from joblib import Parallel, delayed

from one.api import ONE

if __name__ == '__main__':
    one = ONE(mode='local', tables_dir='/path/to/multi_area_cache')
    eid_list = ['eid1', 'eid2', 'eid3']
    results = Parallel(n_jobs=-1, verbose=10)(
        delayed(one.list_collections)(eid) for eid in eid_list)
    print(results)
```

one/__init__.py — 1 addition & 1 deletion

```diff
@@ -1,2 +1,2 @@
 """The Open Neurophysiology Environment (ONE) API."""
-__version__ = '3.4.1'
+__version__ = '3.4.2'
```

one/alf/cache.py — 7 additions & 2 deletions

```diff
@@ -538,13 +538,18 @@ def merge_tables(cache, strict=False, origin=None, **kwargs):
         cache[table] = pd.concat(frames).sort_index()
         updated = datetime.datetime.now()
         # Update the table metadata with the origin
+        table_meta = cache['_meta']['raw'].get(table, {})
         if origin is not None:
-            table_meta = cache['_meta']['raw'].get(table, {})
             if 'origin' not in table_meta:
-                table_meta['origin'] = set(origin)
+                table_meta['origin'] = set(ensure_list(origin))
             else:
                 table_meta['origin'].add(origin)
             cache['_meta']['raw'][table] = table_meta
+        # Make sure that the `date_created` field exists for a new table
+        if 'date_created' not in table_meta.keys():
+            table_meta['date_created'] = datetime.datetime.now().isoformat(
+                sep=' ', timespec='minutes')
+            cache['_meta']['raw'][table] = table_meta
     cache['_meta']['modified_time'] = updated
     return updated
```
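
The switch to `set(ensure_list(origin))` matters because `set` over a bare string iterates its characters; a minimal illustration (not part of the commit, assuming `ensure_list` wraps a lone string in a list as in `iblutil`):

```python
origin = 'https://openalyx.internationalbrainlab.org'
set(origin)    # {'h', 't', 'p', 's', ':', '/', ...} -- set of characters, wrong
set([origin])  # {'https://openalyx.internationalbrainlab.org'} -- intended
```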

one/tests/test_alyxclient.py — 37 additions & 0 deletions

```diff
@@ -3,6 +3,7 @@
 from pathlib import Path
 import unittest
 from unittest import mock
+import http.client
 import urllib.parse
 import random
 import weakref
@@ -602,6 +603,42 @@ def test_download_cache_tables_auth(self, download_file_mock, zipfile_mock):
         finally:
             self.ac._token = token
 
+    @mock.patch('one.webclient.urllib.request')
+    @mock.patch('builtins.open')
+    def test_http_server_auth(self, open_mock, urllib_mock):
+        """Test for http_download_file authentication and headers."""
+        url_response_mock = mock.MagicMock(spec_set=http.client.HTTPResponse)
+        # Simulate file content then end of file
+        url_response_mock.read.side_effect = [b'file content', None]
+        urllib_mock.urlopen.return_value = url_response_mock
+        # When a username and password are set in the parameters, should attempt to authenticate
+        with tempfile.TemporaryDirectory() as temp_dir:
+            file_name, md5 = wc.http_download_file(
+                'https://example.com/file.txt',
+                target_dir=temp_dir,
+                username='user',
+                password='pass',
+                return_md5=True,
+                chunks=(4, 12),
+                headers={'Custom-Header': 'value'}
+            )
+            expected = Path(temp_dir).joinpath('file.txt')
+            # Check file is written to expected location
+            self.assertEqual(expected, Path(file_name))
+            open_mock.assert_called_once_with(expected, 'wb')
+            fid_mock = open_mock()
+            fid_mock.write.assert_called_once_with(b'file content')
+            fid_mock.close.assert_called_once()
+            # Check urlopen called with correct auth header
+            urllib.request.HTTPPasswordMgrWithDefaultRealm.assert_called_once()
+            manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
+            manager.add_password.assert_called_once_with(None, 'https://example.com', 'user', 'pass')
+            # Check the request headers
+            urllib.request.urlopen.assert_called_once()
+            req, = urllib.request.urlopen.call_args[0]
+            req.add_header.assert_any_call('Custom-Header', 'value')
+            req.add_header.assert_any_call('Range', 'bytes=4-15')  # chunks=(4, 12): start at byte 4, read 12 bytes
+
 
 class TestMisc(unittest.TestCase):
     def test_update_url_params(self):
```

one/tests/test_converters.py — 4 additions & 0 deletions

```diff
@@ -346,6 +346,10 @@ def test_pid2eid(self):
         # Check cache table updated
         self.assertIn('insertions', self.one._cache)
         self.assertIn(self.eid, self.one._cache['insertions'].index)
+        # Make sure the metadata of the newly created insertions table is populated
+        meta_data = self.one._cache['_meta']['raw']['insertions']
+        self.assertIn('date_created', meta_data.keys())
+        self.assertEqual(meta_data['origin'], {'https://openalyx.internationalbrainlab.org'})
         # Local mode should now work
         self.assertEqual((self.eid, 'probe00'), self.one.pid2eid(self.pid, query_type='local'))
         # Test behaviour when pid not found
```

one/tests/test_one.py — 1 addition & 1 deletion

```diff
@@ -1602,7 +1602,7 @@ def test_list_datasets(self):
         self.one._cache['datasets'] = self.one._cache['datasets'].iloc[0:0].copy()
 
         dsets = self.one.list_datasets(self.eid, details=True, query_type='remote')
-        expected_n_datasets = 267  # this may change after a BWM release or patch
+        expected_n_datasets = 280  # this may change after a BWM release or patch
         self.assertEqual(expected_n_datasets, len(dsets))
         self.assertEqual(1, dsets.index.nlevels, 'details data frame should be without eid index')
```

one/webclient.py — 1 addition & 1 deletion

```diff
@@ -413,7 +413,7 @@ def http_download_file(full_link_to_file, chunks=None, *, clobber=False, silent=
         Directory in which files are downloaded; defaults to user's Download directory
     return_md5 : bool
         If True an MD5 hash of the file is additionally returned
-    headers : list of dicts
+    headers : dict
         Additional headers to add to the request (auth tokens etc.)
 
     Returns
```
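
A short usage sketch of the corrected parameter (the URL and token are placeholders; the call shape follows the new test above):

```python
from one import webclient as wc

file_path, md5 = wc.http_download_file(
    'https://example.com/file.txt',
    target_dir='/tmp',
    return_md5=True,
    headers={'Authorization': 'Bearer <token>'},  # a single dict, not a list of dicts
)
```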

requirements.txt — 2 additions & 2 deletions

```diff
@@ -1,6 +1,6 @@
 ruff
-numpy>=1.18
-pandas>=1.5.0
+numpy>=1.18, <2.4  # waiting for numba to support 2.4
+pandas>=1.5.0, <3.0.0  # pandas 3 regex on strings issue <one.tests.test_one.TestONECache testMethod=test_filter>
 tqdm>=4.32.1
 requests>=2.22.0
 iblutil>=1.14.0
```
