Commit f34b9ab

2 parents c31f6fe + 55e96c1 commit f34b9ab

2 files changed: 100 additions and 0 deletions

README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
1+
# BigArrayList
2+
3+
A BigArrayList is essentially an ArrayList that can handle a much larger amount of data. It is used the same way as a regular ArrayList (same function signatures), but it keeps data that does not fit in memory in files on disk. A BigArrayList currently supports the following operations:
4+
5+
1. Adding elements to the end of the list
6+
2. Getting and Setting elements
7+
3. Removing elements from the list
8+
9+
## BigArrayList Size
10+
11+
A BigArrayList can hold up to 2^63-1 elements. This is a theoretical limit, since storing that many elements would take too much time and space. The practical limit depends on the combination of available RAM and disk space, because storage is likely to be the limiting factor.
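As a rough illustration of why storage is the practical limit, the sketch below estimates how many elements fit in a given amount of disk space. The `CapacityEstimate` class and the figure of 8 bytes per element (a raw `long`) are assumptions made for this example; serialized objects normally carry extra overhead, so the real capacity would be lower.

```java
class CapacityEstimate {
    // Theoretical limit from the text: 2^63 - 1 elements (Long.MAX_VALUE).
    static final long THEORETICAL_MAX = Long.MAX_VALUE;

    // Elements that fit in availableBytes, assuming a fixed cost per element.
    static long maxElements(long availableBytes, long bytesPerElement) {
        return Math.min(THEORETICAL_MAX, availableBytes / bytesPerElement);
    }

    public static void main(String[] args) {
        long oneTebibyte = 1L << 40;
        // Assumption: 8 bytes per element, ignoring serialization overhead.
        System.out.println(maxElements(oneTebibyte, 8)); // prints 137438953472
    }
}
```

Even a full tebibyte of disk holds about 2^37 elements under this assumption, far below the theoretical limit.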
12+
13+
## Internal Working
14+
15+
BigArrayList is internally an ArrayList of ArrayList objects. Each element of the outer ArrayList is a cache block, so the size of the outer ArrayList equals the number of cache blocks. Each inner ArrayList holds the elements of one cache block, so its size equals the cache block size. The number of cache blocks multiplied by the block size is the largest amount of data that can be held in memory at a given time. All other data is stored on disk and swapped into memory when needed. Default values are provided for the number of cache blocks and their size; however, the programmer can set these values.
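One way to picture the block layout is the arithmetic that maps a list index to a block and a position inside it. This is a minimal sketch, not the library's code: the `BlockIndexMapping` class and its method names are hypothetical, and it assumes every block except the last is full.

```java
class BlockIndexMapping {
    // Number of the block that holds the element at the given list index.
    static long blockNumber(long index, int blockSize) {
        return index / blockSize;
    }

    // Position of that element inside its block.
    static int offsetInBlock(long index, int blockSize) {
        return (int) (index % blockSize);
    }

    public static void main(String[] args) {
        int blockSize = 1_000_000; // same block size as the README example
        long index = 7_250_003L;
        System.out.println(blockNumber(index, blockSize));   // prints 7
        System.out.println(offsetInBlock(index, blockSize)); // prints 250003
    }
}
```

With 4 cache blocks of 1 million elements each, at most 4 million elements would be in memory at once under this model.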
16+
17+
BigArrayList uses an LRU cache replacement policy to determine which block of data should be swapped out of memory and written to disk. Data is only written to disk if it has changed since it was read in from disk. This is a small optimization to prevent unnecessary file I/O for content that has not changed (likely due to read-only operations).
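The policy described above can be sketched with a `LinkedHashMap` in access order plus a dirty flag per block. This is an illustration under assumptions, not BigArrayList's actual implementation: the `LruBlockCache` class, its methods, and the `writtenBack` list (which stands in for real file writes) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LruBlockCache {
    // One cached block plus a flag recording whether it changed since it was loaded.
    static class Block {
        final List<Long> data = new ArrayList<>();
        boolean dirty = false;
    }

    // Block numbers that were "written to disk" on eviction (stand-in for file I/O).
    final List<Integer> writtenBack = new ArrayList<>();
    private final LinkedHashMap<Integer, Block> cache;

    LruBlockCache(final int maxBlocks) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        cache = new LinkedHashMap<Integer, Block>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Block> eldest) {
                if (size() <= maxBlocks) {
                    return false;
                }
                if (eldest.getValue().dirty) {
                    writtenBack.add(eldest.getKey()); // only changed blocks are written
                }
                return true; // evict the least recently used block
            }
        };
    }

    // Read access: marks the block as recently used, loading it if absent.
    Block read(int blockNumber) {
        Block b = cache.get(blockNumber);
        if (b == null) {
            b = new Block(); // a real version would deserialize the block here
            cache.put(blockNumber, b);
        }
        return b;
    }

    // Write access: same as read, but also marks the block dirty.
    Block write(int blockNumber) {
        Block b = read(blockNumber);
        b.dirty = true;
        return b;
    }

    boolean isCached(int blockNumber) {
        return cache.containsKey(blockNumber);
    }

    public static void main(String[] args) {
        LruBlockCache c = new LruBlockCache(2);
        c.write(0); // block 0 is modified
        c.read(1);  // block 1 is only read
        c.read(2);  // evicts block 0 (dirty), so it is written back
        c.read(3);  // evicts block 1 (clean), so it is dropped without a write
        System.out.println(c.writtenBack); // prints [0]
    }
}
```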
18+
19+
## Where data is stored
20+
21+
All files written to disk are stored in one folder, which can be specified by the programmer. Each BigArrayList instance has its own file prefix to distinguish it from other instances. The prefix is chosen automatically by examining the files in the designated folder. The first file is always named in the form "<memoryInstanceNumber>_memory_0.jobj", where "memoryInstanceNumber" uniquely identifies the instance. A loop starts with this variable set to zero and increments it until it reaches a file name that does not yet exist. This allows multiple BigArrayList objects, including arrays of BigArrayList objects, to be used in a single program.
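The search for a free instance number might look roughly like the following. Only the "<memoryInstanceNumber>_memory_0.jobj" file name pattern comes from the description above; the `InstanceNumberFinder` class, its method name, and the use of a temporary folder are assumptions for this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class InstanceNumberFinder {
    // Returns the first instance number whose "<n>_memory_0.jobj" file is absent.
    static int firstFreeInstanceNumber(Path folder) {
        int instance = 0;
        while (Files.exists(folder.resolve(instance + "_memory_0.jobj"))) {
            instance++;
        }
        return instance;
    }

    public static void main(String[] args) throws IOException {
        Path folder = Files.createTempDirectory("bal_demo");
        // Simulate two instances that already wrote their first file.
        Files.createFile(folder.resolve("0_memory_0.jobj"));
        Files.createFile(folder.resolve("1_memory_0.jobj"));
        System.out.println(firstFreeInstanceNumber(folder)); // prints 2
    }
}
```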
22+
23+
24+
## Types of serialization
25+
26+
1. Regular Object
27+
2. Memory-mapped
28+
3. [FST](https://github.com/RuedigerMoeller/fast-serialization/wiki/Serialization)
29+
4. Memory-mapped + FST
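To illustrate the first option, "Regular Object", here is a minimal sketch that writes one block to a file and reads it back with standard Java object serialization. The `BlockSerialization` class and its method names are hypothetical, and the library's own file handling may differ.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;

class BlockSerialization {
    // Write one block to a file using standard Java object serialization.
    static void writeBlock(ArrayList<Long> block, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(block);
        }
    }

    // Read a block back; the cast is unchecked because generics are erased at runtime.
    @SuppressWarnings("unchecked")
    static ArrayList<Long> readBlock(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (ArrayList<Long>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<Long> block = new ArrayList<>();
        for (long i = 0; i < 5; i++) {
            block.add(i * 10);
        }
        Path file = Files.createTempFile("block", ".jobj");
        writeBlock(block, file);
        System.out.println(readBlock(file)); // prints [0, 10, 20, 30, 40]
        Files.delete(file);
    }
}
```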
30+
31+
## Code Example
32+
33+
    import com.dselent.bigarraylist.BigArrayList;

    public class SimpleExample
    {
        public static void main(String[] args) throws Exception
        {
            //create a BigArrayList of Longs
            //cache block size = 1 million
            //cache blocks = 4
            BigArrayList<Long> bal = new BigArrayList<Long>(1000000, 4);

            //add 10 million elements
            for(long i=0; i<10000000; i++)
            {
                bal.add(i);
            }

            //get the element at index 5
            System.out.println(bal.get(5));

            //set the element at index 5
            bal.set(5, 100L);

            //get the element at index 5 again to see the new value
            System.out.println(bal.get(5));

            //clear contents on disk
            bal.clearMemory();
        }
    }
64+
65+
66+
## How to Use
67+
68+
Use BigArrayList as if it were a regular ArrayList. Compared with ArrayList, it has additional constructor options for specifying how much data is kept in memory, and it supports fewer functions. Make sure to call "clearMemory()" when you are done using the object. Download the javadocs in the doc folder for more information, and download the BigArrayList.jar file to conveniently add the library to existing projects.
69+
70+
## Notes + Warnings
71+
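Random-access operations are slow and should be avoided, because accesses to scattered indices force cache blocks to be swapped between memory and disk.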
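The cost difference can be seen in a small simulation that counts how many block loads an access pattern causes under an LRU cache. This is a sketch under assumptions: the `AccessPatternDemo` class, the block size of 1,000, and the 4 cached blocks are chosen for illustration and do not measure the library itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

class AccessPatternDemo {
    // Counts how many block loads a sequence of index accesses causes
    // under an LRU cache that holds at most cacheBlocks blocks.
    static int countBlockLoads(long[] indices, int blockSize, final int cacheBlocks) {
        LinkedHashMap<Long, Boolean> cache = new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > cacheBlocks;
            }
        };
        int loads = 0;
        for (long index : indices) {
            long block = index / blockSize;
            if (cache.get(block) == null) { // miss: the block would be read from disk
                loads++;
                cache.put(block, Boolean.TRUE);
            }
        }
        return loads;
    }

    public static void main(String[] args) {
        int n = 100_000, blockSize = 1_000, cacheBlocks = 4;
        long[] sequential = new long[n];
        long[] random = new long[n];
        Random rng = new Random(42); // fixed seed so the run is repeatable
        for (int i = 0; i < n; i++) {
            sequential[i] = i;
            random[i] = rng.nextInt(n);
        }
        // Sequential access loads each of the 100 blocks exactly once.
        System.out.println("sequential loads: " + countBlockLoads(sequential, blockSize, cacheBlocks));
        // Random access misses the cache on most accesses.
        System.out.println("random loads:     " + countBlockLoads(random, blockSize, cacheBlocks));
    }
}
```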
72+
73+
Some types of serialization clear the contents on disk automatically when your program terminates, and some do not. It is recommended to call the "clearMemory()" function when you are done using the BigArrayList. If your program crashes for any reason, you are responsible for clearing any contents left on disk.
74+
75+
You should treat any element retrieved from a BigArrayList as if it were a copy-by-value. The content of a BigArrayList can be serialized and deserialized during any operation, and deserialization creates a new object. Any old references in the program then refer to a different object than the one stored in the BigArrayList. If you retrieve an element from a BigArrayList and change it, make sure to save it back to the list.
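The effect can be reproduced with plain Java serialization, without the library. In this sketch an in-memory round trip stands in for a block being swapped to disk and back; the `StaleReferenceDemo` class and its `roundTrip` helper are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;

class StaleReferenceDemo {
    // Serialize and deserialize a block in memory, standing in for a swap to disk and back.
    @SuppressWarnings("unchecked")
    static ArrayList<int[]> roundTrip(ArrayList<int[]> block) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(block);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ArrayList<int[]>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<int[]> block = new ArrayList<>();
        block.add(new int[] {1});

        int[] held = block.get(0); // reference kept by the caller
        block = roundTrip(block);  // the block is swapped out and back in

        held[0] = 99;              // changes only the caller's stale object
        System.out.println(block.get(0)[0]); // prints 1

        block.set(0, held);        // saving it back makes the change visible
        System.out.println(block.get(0)[0]); // prints 99
    }
}
```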
76+
77+
## How to Build
78+
Import the project as a Gradle project; Gradle handles all dependencies (currently the fast-serialization library). The SimpleTest.java file can be run as a standard Java application to test the build.

paper.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
---
2+
title: 'BigArrayList: A Java API which Allows Programmers to Transparently Use Disk Space as Additional Memory'
3+
tags:
4+
- BigArrayList
5+
- ArrayList
6+
authors:
7+
- name: Douglas Selent
8+
orcid: 0000-0003-2322-6734
9+
affiliation: 1
10+
affiliations:
11+
- name: Worcester Polytechnic Institute
12+
index: 1
13+
date: 6 July 2017
14+
bibliography: paper.bib
15+
---
16+
17+
# Summary
18+
19+
BigArrayList is a Java library that allows programmers to easily store and operate on data that is too large to fit in memory. It has the same function signatures as the existing ArrayList class and automatically handles all I/O operations, which reduces the learning curve for programmers. The BigArrayList data structure maps a group of ArrayList objects stored in memory to a set of files on disk. This allows programmers to work with larger amounts of data without creating their own I/O mechanism. The goal of this library is to provide a generic and easy-to-use solution that automatically uses disk space as extra memory. A common use for this library is operating on large data sets such as the [ASSISTments data sets](https://sites.google.com/site/assistmentsdata/home), for which the library was originally used.
20+
DOI <insert DOI here>.
21+
22+
# References
