(** The main goal is to delete files randomly while prioritizing larger files,
    but still giving every file a non-zero probability of being deleted.

    Notation:
    1. s(i) - the size of the i-th file, where i = 0..m-1, with m being the
       total number of files;
    2. Sum(x(i)) = x(0) + ... + x(m-1) - the sigma operator;
    3. T = Sum(s(i)) - the total size of the cache;
    4. p(i) = s(i)/T - the discrete probability distribution of the file
       sizes in the cache, i.e., the likelihood that a randomly chosen file
       from the cache has size s(i);
    5. F(i) = p(i) + p(i-1) + ... + p(0) - the cumulative distribution
       function (CDF); given F(i) we can generate a random number u in the
       range 0..1, using a uniform random number generator, and then find
       such k that F(k-1) < u <= F(k);
    6. |s| = Sum(p(i) * s(i)) = (1/T) * Sum(s(i)^2) - the expected value of
       the size of a cache entry;
    7. |n| = t/|s| - the expected number of deletions that we need to make
       to delete t bytes, e.g., if we want to delete half of the cache:
       |n| = T^2 / (2 * Sum(s(i)^2)).

    Example:
    {v
      sizes = {4, 6, 3, 1, 6}
      the total size of the cache is Sum(sizes(i)) = 20
      the PDF is p(i) = {4/20; 6/20; 3/20; 1/20; 6/20}
      and CDF is F(i) = {4/20; 10/20; 13/20; 14/20; 20/20}
    v}
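
    Plugging the example numbers into 6. and 7.: Sum(s(i)^2) = 16 + 36 + 9 +
    1 + 36 = 98, so |s| = 98/20 = 4.9, and deleting half of the cache
    (t = 10) takes |n| = 10/4.9 = 400/196, i.e., about two deletions on
    average.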

    We don't want to use floating point: there would be too many big and
    small numbers, hence overflows, and in the end all we need is an index.
    We will use rational numbers instead; since formulas 4. and 5. share the
    same denominator (namely T), we can work with the numerators only.

    At a high level, we need to generate a random value u between 0 and T
    and find such k that F(k-1) < u <= F(k); the k-th file will be our
    candidate for removal. We can repeat sampling until we get |n| files
    (of course, deleting the same file twice won't free twice its size, so
    we have to keep track of which files were already selected and repeat
    until we get |n| distinct files). We also don't want a linear search
    over the intervals, but we can notice that F(i) partitions the range of
    sizes 0..T-1 into m disjoint intervals, so we can represent F as a
    finite mapping, e.g., with our example,

    {v
      [0,3]   -> 0
      [4,9]   -> 1
      [10,12] -> 2
      [13,13] -> 3
      [14,19] -> 4
    v}

    Since the intervals do not intersect, we don't need an interval map
    here; the ordinary Map from Core with its [closest_key `Less_or_equal_to]
    lookup is enough. So once we have generated a random size u, we call
    closest_key for u and pick the associated value as the index of the file
    to delete. E.g., let's choose a value randomly from the range 0..19: if
    it falls in 0..3 we pick the first file; if it falls in 4..9, e.g.,
    u = 5, then closest_key returns (4, 1), so we remove the second file.
    Thus we get away from ugly floats and obtain the desired distribution
    with no rounding errors.
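
    A minimal sketch of this lookup, assuming the sizes come as an int
    array; the names [cdf_map] and [pick] below are illustrative and not
    part of this module:

    {[
      open Core

      (* Map the lower bound of every interval to the index of its file,
         e.g. for [|4; 6; 3; 1; 6|] the keys are 0, 4, 10, 13, 14. *)
      let cdf_map sizes =
        fst @@
        Array.foldi sizes ~init:(Map.empty (module Int), 0)
          ~f:(fun i (map, lower) size ->
              (Map.set map ~key:lower ~data:i, lower + size))

      (* Pick the index of the file whose interval contains [u],
         where 0 <= u < T. *)
      let pick map u =
        match Map.closest_key map `Less_or_equal_to u with
        | Some (_, index) -> index
        | None -> assert false (* 0 is always a key, and u >= 0 *)
    ]}

    With the example above, [pick (cdf_map [|4; 6; 3; 1; 6|]) 5] evaluates
    to [1], i.e., the second file.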

    Now, after we have selected |n| distinct files, we can shuffle them and
    delete them without worrying that some other process has already deleted
    one of them. All the processes use the same sequence of pseudorandom
    files, so they will select approximately the same set of files for
    deletion.
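
    For illustration (with [Core] opened as in the sketch above), a shuffle
    that is reproducible across processes can be obtained from Core's
    [List.permute] with an explicitly shared random state; the seed literal
    below is made up:

    {[
      (* Every process builds the same Random.State from an agreed-upon
         seed and therefore produces the same permutation. *)
      let shuffle files =
        let random_state = Random.State.make [| 0x1eaf |] in
        List.permute files ~random_state
    ]}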

    And finally, we don't want our recursive selection to depend on |n|, so
    instead of selecting exactly |n| files for removal we keep selecting
    files until they cover the requested size.
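
    A sketch of that loop, reusing the illustrative [cdf_map] and [pick]
    from above ([select] is likewise a made-up name); it keeps sampling
    until the distinct files collected so far cover the requested number of
    bytes:

    {[
      (* Sample indices until the selected (distinct) files sum up to at
         least [requested] bytes, or until every file has been selected.
         [random_state] is the shared PRNG state discussed above; assumes
         a non-empty cache. *)
      let select ~random_state sizes ~requested =
        let total = Array.sum (module Int) sizes ~f:Fn.id in
        let map = cdf_map sizes in
        let rec loop selected freed =
          if freed >= requested || Set.length selected = Array.length sizes
          then selected
          else
            let i = pick map (Random.State.int random_state total) in
            if Set.mem selected i then loop selected freed
            else loop (Set.add selected i) (freed + sizes.(i))
        in
        loop (Set.empty (module Int)) 0
    ]} *)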

open Core
open Bap.Std