-
Notifications
You must be signed in to change notification settings - Fork 10
Expand file tree
/
Copy pathChangeLog
More file actions
143 lines (111 loc) · 6.03 KB
/
ChangeLog
File metadata and controls
143 lines (111 loc) · 6.03 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
2014-01-11 Deniz Yuret <dyuret@ku.edu.tr>
* dlib.c: TODO: Having _d_memsize updated in _d_ malloc
routines make them non-thread-safe. The routines work but the
variable should be ignored. We could use pragmas to make the
operations atomic. The same applies to all dlib routines that use
global variables, in particular dalloc routines. We need to
rethink the design.
2014-01-10 Deniz Yuret <dyuret@ku.edu.tr>
* dlib.h: DONE: have dbg that prints out message if NDEBUG is
not defined.
2014-01-05 Deniz Yuret <dyuret@ku.edu.tr>
* dlib.c: DONE: zlib sucks. Old version is slow. New version is
suspect. xz does not have a similar lib. Best to have a suffix
based version that automatically converts recognized file
extensions into pipes with appropriate commands.
2013-12-29 Deniz Yuret <dyuret@ku.edu.tr>
* dlib.c (_d_error): trying to find the best way to report memory.
Here are before and after lm load numbers for fastsubs (page size
4096 bytes, lm load time 96 secs):
sbrk(0): 8M 456M (x4 = 1825M 1793M)
/proc/stat.vss: 15M 1806M (bytes)
/proc/stat.rss: 206 437474 (pages)
/proc/stat.utime: 0 9366
/proc/stat.stime: 0 147
/proc/statm.size: 3891 441010 (pages=15M 1806M bytes)
/proc/statm.data: 520 437639 (pages=2M 1792M bytes data+stack)
/proc/status.1: VmPeak=VmSize=15564kB VmData=1224kB VmStk=856kB VmExe=32kB (text) VmLib=3172kB (shared libs)
/proc/status.2: VmPeak=2244M VmSize=1764M VmData=1749M VmStk=856kB VmExe=32kB VmLib=3172kB
mallinfo.arena: 135K 448M (system bytes)
mallinfo.uordblks: 1408 445M (in use bytes)
mallinfo.hblkhd: 1052K 1343M (arena+hblkhd=1791627264)
malloc_stats.total system bytes: (arena+hblkhd=1791627264)
malloc_stats.total in use bytes: (uordblks+hblkhd=1788366864)
Seems mallinfo.arena or uordblks definitely was not the whole
story and hblkhd had to be added. More serious problem is
mallinfo has 32 bit ints and fails when more than 4G is allocated,
which makes it a no go.
1. malloc_usable_size could be used in _d_malloc etc. It is a GNU
extension. We'd have to consistently use _d_malloc etc. The size
returned may not exactly equal the size requested.
2. we can use the /proc/self/stat*. They do not give very
accurate numbers (i.e. the total size used by the process rather
than what we explicitly allocated). They will work on any linux
system. No need to instrument malloc etc.
* dlib.c (_d_error): Going back to using clock rather than time.
Here is what I wrote before in procinfo.h:
Issue: clock() starts giving negative numbers after a while. I
checked the header files, CLOCKS_PER_SEC is defined to be 1000000
and the return value of clock, clock_t, is defined as signed long.
This means after 2000 seconds things go negative. Need to use some
other function for longer programs.
Switched to using times, which works with _SC_CLK_TCK defined as
100 so the overflow should take longer.
That also didn't work, finally decided to use the time function
which has a resolution of seconds.
The wrap-around problem is not going to happen on a 64 bit
platform and most dlib routines assume 64 bits and I am probably
not going to try to make it 32 bit compatible.
2013-12-26 <dyuret@ku.edu.tr>
* dlib.h: Hash tables implemented. Still ugly and general for
loop not possible because of types being general. Experiments
show that the linear search idea does not make a difference for
D_HMIN 0-16, slows down for D_HMIN > 16. Counting bigrams on
wsj.tok.gz (5187874 126170376 690948662) takes 35 seconds (5 secs
zcat, 5 secs forline etc) and 212M memory. Vocab size is 608467
and uniq bigrams are 10755423. Space is dominated by uniq
bigrams, the total hash table space allocated is for 20252586
bigrams (costing 8 bytes each), almost double the necessary
amount. test_bigram spends 16 bytes, test_symtable 8 bytes by
using a symbol table. Might be worth considering sparsehash for
small tables, which could halve the memory. Also might be worth
experimenting with obstacks. Should continue reimplementing
features.pl in C.
2013-12-14 <dyuret@ku.edu.tr>
* foreach_line: NULL seems to cause compiler problems, represent
stdin with "". Arbitrary limit on line length disturbing, could
use GNU getline instead, which has a much nicer interface. Two
problems: we did not want to depend on GNU extensions (which can
be solved by including getline code in dlib). Also we added zlib
support, which has gzgets but not gzgetline! Better check the
implementation of getline.
OK, getline uses internals of the *FILE pointer. A more portable
implementation would be to base it on fgets. However it turns out
c99 does not even define popen! It seems I will have to use
_GNU_SOURCE after all?
* sglib: Here is the philosophy behind sglib:
let's take a program defining an array of two-dimensional points
represented by their coordinates:
struct point {
int x;
int y;
} thePointArray[SIZE];
and a macro defining lexicographical ordering on points:
#define CMP_POINT(p1,p2) ((p1.x!=p2.x)?(p1.x-p2.x):(p1.y-p2.y))
In such case the following line of code is sorting the array using
heap sort:
SGLIB_ARRAY_SINGLE_HEAP_SORT(struct point, thePointArray, SIZE, CMP_POINT);
Note that no pointers are involved in this implementation.
This is great. But it is ugly. We'd like sort(thePointArray) if
possible. Also somebody needs to take care of malloc, resize
etc. What is the best way to do this for a hash?
* dlib: Deniz's C library. glib disappoints. code blows up
without error when hash and array sizes increase. unacceptable.
have to write my own version. need hashes (symbols?) and dynamic
arrays. foreach macros and error mechanism. code using dlib
should be easy to read. should not rely on gnu extensions,
although it would be nice to have obstacks, error, malloc
debugging, getline etc. just the c99 or c11 standard, cannot do
without declarations in code, stdint, bool seem useful etc. aim
no warnings with pedantic. inspired by sglib, define most things
as macros. have documentation in dlib.h, avoid dlib.c if you can.