ASV Benchmarks: Real storage tests
A real storage test must be able to run against any storage supported by ArcticDB - Amazon S3, LMDB, GCP, etc. - without any change to the test logic, only to the parameters (such as the number of rows per dataframe). That requires a small abstraction over the Arctic class which lets us take advantage of the persistence of storages like S3 and also emulate it with LMDB. This abstraction is the TestLibraryManager class.
The main purpose of the class is to give ASV test developers a way to manage the lifecycle of persistent (shared) libraries and modifiable (private) libraries, which only the particular test can access, on any of the supported storages.
- A modifiable (or private) library is isolated and reachable only by the instance of the test that is currently running on a given machine. Instances of the same test on other machines will create their own modifiable libraries, so each test instance runs in isolation. The most important rule here is that these libraries must be cleaned up, either before the test is executed as part of the setup or as part of the teardown. The recommendation is to clean up, in the setup of the test, any libraries that could remain from a previous execution; this ensures the test always runs in a clean, controlled environment.
- A persistent (or shared) library is created once and read many times by a particular ASV test. Its benefit is that the library and symbol infrastructure the test needs is created once and then reused many times. A further benefit is the ability to simulate aging over time: if each test run only adds data to the library's symbols without destroying them, this mimics the aging process that is natural in real usage scenarios (see the sketch below).
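A minimal sketch of that aging pattern (the benchmark name, symbol name and generate_dataframe() helper below are illustrative; only TestLibraryManager, LibraryType and the ArcticDB library read/write/append API come from this page and the library documentation):
lm = TestLibraryManager(Storage.AMAZON, "AGING_EXAMPLE")   # illustrative benchmark name
lib = lm.get_library(LibraryType.PERSISTENT)
symbol = "aging_symbol"                                    # illustrative symbol name
new_chunk = generate_dataframe()                           # assumption: any function producing the next chunk of rows
if lib.has_symbol(symbol):
    lib.append(symbol, new_chunk)   # every subsequent run grows the symbol, simulating aging
else:
    lib.write(symbol, new_chunk)    # the first run creates the symbol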
Another important characteristic of this class is that its logic can be reused outside of ASV, in other types of tests or in tests of the class functionality itself. In other words, it is not coupled to ASV at all.
The class provides a minimal set of methods needed to achieve any user scenario on any of those storages. It does not implement all the methods the Arctic class has, so it can be extended when needed. Here are practical examples of its usage.
# The following code creates one persistent library
lm = TestLibraryManager(Storage.AMAZON, "FINALIZE") # Defines the type of storage and a name for the library
# NOTE: The 'FINALIZE' string will be part of the library name, not the whole name, but it must be unique.
# NOTE 2: The library name MUST NOT be a substring of another library name.
# In other words, do not plan to have 'FINALIZE' and 'FINALIZE_MORE' tests;
# use 'FINALIZE_BASIC' and 'FINALIZE_MORE' instead.
lib = lm.get_library(LibraryType.PERSISTENT) # creates the persistent library if it does not exist
assert lm.has_library(LibraryType.PERSISTENT) # the library is now created
# Creation of several persistent libraries
lm = TestLibraryManager(Storage.AMAZON, "FINALIZE")
for name in ["first", "second", "third"]:
    lib = lm.get_library(LibraryType.PERSISTENT, name)
    assert lm.has_library(LibraryType.PERSISTENT, name)
# 'name' is an optional suffix of the actual library name that will be created:
#   <something>_FINALIZE_name
# where <something> may vary, so your test must not rely on reconstructing
# the actual name; there is a dedicated method to get the full name:
lib_name = lm.get_library_name(LibraryType.PERSISTENT, 'my_suffix')
# NOTE: as seen, the suffix is not mandatory if you plan to have a single library for your test
# Use of modifiable libraries
lm = TestLibraryManager(Storage.AMAZON, "FINALIZE")
lm.clear_all_benchmark_libs() # Use this to clear any libs left over from previous runs
# clears only modifiable libraries; it will not clear any persistent library
# clears all libraries for this test/benchmark created by any process,
# therefore useful in the setup_cache() method as a precondition
for name in ["first", "second", "third"]:
    lib = lm.get_library(LibraryType.MODIFIABLE, name)
    assert lm.has_library(LibraryType.MODIFIABLE, name)
lm.clear_all_modifiable_libs_from_this_process() # clears all libraries created in the loop above
# differs from the other clear method in that it clears only the libs created by the current process,
# therefore it is useful in the teardown() method
# From time to time the storage should be cleaned, as it may still hold traces of libraries
# that were not cleaned up due to malfunctions etc. In that case the machine's private space on shared
# storages can be cleaned with:
lm = TestLibraryManager(Storage.AMAZON, "FINALIZE")
lm.remove_all_modifiable_libs_for_machine()
# Note that this can be an automated process iterating over all tests,
# especially on GitHub
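A possible sketch of such an automated sweep; the list of benchmarks below is illustrative (the two names are taken from the examples further down this page) and only the TestLibraryManager API shown above is used:
# Illustrative list of (storage, benchmark name) pairs; extend with the benchmarks you actually run
benchmarks = [
    (Storage.AMAZON, "READ_WRITE"),
    (Storage.AMAZON, "APPEND_LARGE_WIDE"),
]
for storage, name in benchmarks:
    lm = TestLibraryManager(storage, name)
    lm.remove_all_modifiable_libs_for_machine()  # removes this machine's leftover modifiable libs for that benchmark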
# HOW TO CLEAR PERSISTENT LIBRARIES
# There are cases when you want to delete the persistent libraries for a test/benchmark
# so that they get recreated, e.g. after a change of parameters that triggers a different
# symbol structure. In this case, and ONLY in this case, use:
lm = TestLibraryManager(Storage.AMAZON, "FINALIZE")
lm.remove_all_persistent_libs_for_this_test()
# This process is intended to be a manual, attended one and must never be automated!
# It should not be used in tests/benchmarks
Note that each ASV run of a single benchmark executes in a separate process on the same machine. Therefore the modifiable libraries for each process of the same test/benchmark will be different, each in its own space. Thus you do not need to worry when creating libraries in an ASV test; that is handled automatically by the methods above.
For most benchmark tests, the structure of libraries and symbols needs to be prepared first. The LibraryPopulationPolicy class, along with populate_library and populate_library_if_missing, can be used to help with that preparation. Note that these utilities are not bound to ASV, so they can also be reused outside ASV benchmark tests.
Note that LibraryPopulationPolicy uses the VariableSizeDataframe class by default to generate dataframes. If you need a specific dataframe type, consider implementing one that inherits from the DataFrameGenerator base class.
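If you do need a custom generator, a minimal sketch could look like the following. This assumes the DataFrameGenerator base class exposes an override point that produces a dataframe for a requested number of rows and columns; the method name get_dataframe used here is an assumption, so check the actual base class before copying. The test utilities are assumed to be imported as in the other examples on this page.
import numpy as np
import pandas as pd

class TimestampIndexedGenerator(DataFrameGenerator):  # illustrative generator name
    def get_dataframe(self, number_rows, number_columns):  # assumed override point of DataFrameGenerator
        # Produce a timestamp-indexed dataframe with float columns
        index = pd.date_range("2020-01-01", periods=number_rows, freq="s")
        data = {f"col_{i}": np.random.rand(number_rows) for i in range(number_columns)}
        return pd.DataFrame(data, index=index)

# The custom generator is then passed to the population policy, just as AllColumnTypesGenerator is in the class example below
lpp = LibraryPopulationPolicy(get_console_logger(), TimestampIndexedGenerator())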
Generally there are 3 different types of structures that those utilities will help you create:
- A single library with several symbols, each having the same or a different number of rows, and a fixed number of columns
logger = get_console_logger()
lm = TestLibraryManager(Storage.AMAZON, "FINALIZE")
# This configures generation of 2 symbols with 10 and 20 rows. The number of rows can later be used to get the symbol name.
# Note that this defines that all symbols will have a fixed number of columns = 5
lpp = LibraryPopulationPolicy(logger).set_parameters([10,20], 5)
populate_library(lm, lpp, LibraryType.PERSISTENT)
lib = lm.get_library(LibraryType.PERSISTENT)
symbol = lpp.get_symbol_name(10) # to access the symbol we need its name
data = lib.read(symbol).data
symbol = lpp.get_symbol_name(20) # to access the symbol we need its name
data = lib.read(symbol).data
- A single library with several symbols, each having the same or a different number of columns, and a fixed number of rows
# Using the above example, replace the definition with:
lpp = LibraryPopulationPolicy(logger).set_parameters(3, [10,20])
# This configures generation of 2 symbols with 10 and 20 columns. The number of columns can later be used to get the symbol name.
# Note that this defines that all symbols will have a fixed number of rows = 3
- Populating a library with many identical symbols
ASV benchmark tests that use real storage libraries and the utilities described here should derive the benchmark class from the AsvBase class, as in this example:
Note that the library manager is created as a class variable, but this class variable is also exposed via the get_library_manager() method. That is on purpose, and the tests should use the method rather than the class variable directly.
class AWSReadWrite(AsvBase):
    """
    This class is for general read/write tests
    Uses 1 persistent library for read tests
    Uses 1 modifiable library for write tests
    """
    rounds = 1
    number = 3  # the test runs 3 times between each setup-teardown
    repeat = 1  # defines the number of times the measurements will invoke setup-teardown
    min_run_count = 1
    warmup_time = 0
    timeout = 1200

    param_names = ["num_rows"]
    params = [1_000_000, 2_000_000]

    library_manager = TestLibraryManager(storage=Storage.AMAZON, name_benchmark="READ_WRITE")

    def get_logger(self) -> Logger:
        return get_console_logger(self)

    def get_library_manager(self) -> TestLibraryManager:
        return AWSReadWrite.library_manager

    def get_population_policy(self) -> LibraryPopulationPolicy:
        lpp = LibraryPopulationPolicy(self.get_logger(), AllColumnTypesGenerator()).set_parameters(AWSReadWrite.params)
        return lpp

    def setup_cache(self):
        '''
        In setup_cache we only populate the persistent libraries if they are missing.
        '''
        manager = self.get_library_manager()
        policy = self.get_population_policy()
        populate_library_if_missing(manager, policy, LibraryType.PERSISTENT)
        manager.log_info()  # Logs info about the Arctic URI - always call this last

    def setup(self, num_rows):
        self.population_policy = self.get_population_policy()
        self.symbol = self.population_policy.get_symbol_name(num_rows)
        ...........

    def teardown(self, num_rows):
        # We could clear the modifiable libraries we used
        self.get_library_manager().clear_all_modifiable_libs_from_this_process()
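The actual measured methods are omitted in the example above. A minimal, illustrative sketch of what they could look like, following the ASV time_* naming convention (the method bodies and the self.df_to_write attribute are assumptions, not the real benchmark code):
    def time_read(self, num_rows):
        # Reads the symbol pre-populated in the persistent library
        lib = self.get_library_manager().get_library(LibraryType.PERSISTENT)
        lib.read(self.symbol)

    def time_write(self, num_rows):
        # Writes into a modifiable (private) library so the persistent data stays untouched
        lib = self.get_library_manager().get_library(LibraryType.MODIFIABLE)
        lib.write(self.symbol, self.df_to_write)  # self.df_to_write would be prepared in setup(); illustrative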
You can inherit a test in order to reuse its logic with a different set of parameters. In that case you must re-implement the get_library_manager() and setup_cache() methods, and you might also need to re-implement other methods if they directly access class variables. This is due to how ASV works: it takes parameters from class variables, not from methods. Additionally, if you do not implement setup_cache in the child class, your class effectively has no setup_cache; ASV will not execute it even though it is inherited.
Considering all those notes, in order to construct logic that can be safely inherited, consider moving all setup_cache logic into a helper method and, in setup_cache, simply invoking that helper with the current benchmark parameters, as in this example:
class AWS30kColsWideDFLargeAppendTests(AWSLargeAppendTests):
    rounds = 1
    number = 3  # the test runs 3 times between each setup-teardown
    repeat = 1  # defines the number of times the measurements will invoke setup-teardown
    min_run_count = 1
    warmup_time = 0
    timeout = 1200

    params = [2_500, 5_000]  # [100, 150] for test purposes
    param_names = ["num_rows"]

    library_manager = TestLibraryManager(storage=Storage.AMAZON, name_benchmark="APPEND_LARGE_WIDE")

    number_columns = 3_000

    def get_library_manager(self) -> TestLibraryManager:
        return AWS30kColsWideDFLargeAppendTests.library_manager

    def setup_cache(self):
        return self.initialize_cache(AWS30kColsWideDFLargeAppendTests.warmup_time,
                                     AWS30kColsWideDFLargeAppendTests.params,
                                     AWS30kColsWideDFLargeAppendTests.number_columns,
                                     AWS30kColsWideDFLargeAppendTests.number)

    .......
Note that the class above inherits its logic from the parent benchmark class AWSLargeAppendTests and only binds that logic to the current test parameters instead of the parent's ones.
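For context, a rough sketch of how such a helper on the parent class could be organized (this is an assumption, not the actual AWSLargeAppendTests code; the parameter order is taken from the call above, and how warmup_time and number are used inside the real helper is not shown here):
class AWSLargeAppendTests(AsvBase):
    .......

    def initialize_cache(self, warmup_time, params, number_columns, number):
        # Accepts everything as arguments instead of reading its own class variables,
        # so a child class can call it from its setup_cache() with the child's values.
        # warmup_time and number are accepted only for parity with the call shown above.
        manager = self.get_library_manager()
        policy = LibraryPopulationPolicy(self.get_logger()).set_parameters(params, number_columns)
        populate_library_if_missing(manager, policy, LibraryType.PERSISTENT)
        manager.log_info()

    def setup_cache(self):
        # The parent binds the same helper to its own class variables
        return self.initialize_cache(AWSLargeAppendTests.warmup_time,
                                     AWSLargeAppendTests.params,
                                     AWSLargeAppendTests.number_columns,
                                     AWSLargeAppendTests.number)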
When we develop benchmark tests that use only modifiable libraries, which are private to the test, we do not need to worry about anything. But as soon as a test case needs to work with a persistent library we have to be more careful: such libraries are not deleted over time unless we do that explicitly outside of a test, and we also have to be careful not to touch libraries on the persistent storage that belong to other tests.
One simple feature that can help during that phase is the "test mode" of the TestLibraryManager class. It is activated for an instance of the class by invoking the set_test_mode() method. So, when starting development of a new test or changing an existing one, simply append it to the library_manager declaration like this:
class MyTests(AsvBase):
    ...
    library_manager = TestLibraryManager(storage=Storage.AMAZON, name_benchmark="APPEND_LARGE_WIDE").set_test_mode()
    ...
Now the test will create libraries not on the production persistent storage space but on the test persistent storage space.
When development is finished, just remove the method call. (Even if it stays, the test will run without a problem; the test persistent storage might just be wiped by anyone from time to time, which is fine since the library will simply be recreated.)
Tests that use persistent libraries do not need to delete them; they only check whether they exist in order to run. Since such a test creates the structure once and afterwards only checks whether the libraries exist, you will have a problem if you just modify the parameters of the test and expect everything to be fine. In fact, the test might run OK but produce wrong results, because it executes against the old library structure. Therefore that is not the way to make changes to already existing tests.
When you need to change the parameters of a certain test, you may need to remove the previous persistent libraries for that benchmark, which were created with a different structure. In that case you can use the following lines, replacing the first one with your benchmark's actual TestLibraryManager definition:
lm = TestLibraryManager(Storage.AMAZON, "FINALIZE")
lm.remove_all_persistent_libs_for_this_test()
Executed from any machine, this will remove the persistent libraries for the benchmark.