Commit eadd5a3

Merge pull request #51 from allenai/cleanup
Abstracted the SHAP utils, removed the SHAP pin so that any SHAP version can be used, and added tests for that. Also fixed some mypy errors.
2 parents 9a4d33e + 900eade commit eadd5a3

14 files changed, +664 -234 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -182,7 +182,7 @@ Our trained, released models are in the `s3` folder referenced above, and are ca
 Note that by default we are using the `--use_cache` flag, which will cache all the features so future reruns are faster. There are two things to be aware of: (a) the cache is stored in RAM and can be huge (100gb+) and (b) if you intend to change the features and rerun, you'll have to turn off the cache or the new features won't be used.

 ## Licensing
-The code in this repo is released under the Apache 2.0 license (license included in the repo. The dataset is released under ODC-BY (included in S3 bucket with the data). We would also like to acknowledge that some of the affiliations data comes directly from the Microsoft Academic Graph (https://aka.ms/msracad).
+The code in this repo is released under the Apache 2.0 license. The dataset is released under ODC-BY (included in S3 bucket with the data). We would also like to acknowledge that some of the affiliations data comes directly from the Microsoft Academic Graph (https://aka.ms/msracad).

 ## Citation

data/s2and_name_tuples_filtered.txt

Lines changed: 0 additions & 83 deletions
@@ -232,8 +232,6 @@ abi,abigail
 abigail,abi
 jenn,jennifer
 jennifer,jenn
-maria,marcia
-marcia,maria
 theodore,ted
 ted,theodore
 danial,daniel
@@ -362,8 +360,6 @@ julia,julie
 julie,julia
 rudolf,rudi
 rudi,rudolf
-antonio,andrea
-andrea,antonio
 manel,manuel
 manuel,manel
 jc,jean
@@ -372,8 +368,6 @@ katy,katherine
 katherine,katy
 nicola,nikki
 nikki,nicola
-marian,maria
-maria,marian
 ash,ashley
 ashley,ash
 desmond,des
@@ -408,12 +402,8 @@ deb,debra
 debra,deb
 cindy,cynthia
 cynthia,cindy
-peter,patrick
-patrick,peter
 john,johannes
 johannes,john
-marina,mariana
-mariana,marina
 mick,michael
 michael,mick
 arthur,art
@@ -496,8 +486,6 @@ tracy,tracey
 tracey,tracy
 michal,michael
 michael,michal
-antonio,andre
-andre,antonio
 val,valerie
 valerie,val
 walt,walter
@@ -564,8 +552,6 @@ anna,ania
 ania,anna
 jeff,jefferson
 jefferson,jeff
-daniel,danielle
-danielle,daniel
 nickolas,nick
 nick,nickolas
 amirali,amir
@@ -1071,12 +1057,6 @@ marc,marcus
 marcus,marc
 tomas,tomasz
 tomasz,tomas
-christian,christopher
-christopher,christian
-alex,alastair
-alastair,alex
-gabriella,gabriel
-gabriel,gabriella
 elizabeth,elise
 elise,elizabeth
 terrence,terrance
@@ -1099,10 +1079,6 @@ molly,mary
 mary,molly
 jerome,jerry
 jerry,jerome
-melanie,melissa
-melissa,melanie
-edwin,edmund
-edmund,edwin
 jeanne,jc
 jc,jeanne
 randolph,randy
@@ -1629,26 +1605,16 @@ serhii,serhiy
 serhiy,serhii
 kyung,kyoung
 kyoung,kyung
-eduardo,editorial
-editorial,eduardo
 serhii,sergiy
 sergiy,serhii
-muhammad,maria
-maria,muhammad
 dmytro,dmitrii
 dmitrii,dmytro
-ej,emma
-emma,ej
 hasan,hassan
 hassan,hasan
 dimitry,dmitry
 dmitry,dimitry
-wang,weisheng
-weisheng,wang
 amirhossein,amihossein
 amihossein,amirhossein
-antonio,alfredo
-alfredo,antonio
 dmitrii,dmitriy
 dmitriy,dmitrii
 sajikumar,sreedharan
@@ -5091,16 +5057,12 @@ wangli,wanli
 wanli,wangli
 chidambaram,chidabaram
 chidabaram,chidambaram
-pui,ph
-ph,pui
 janio,jano
 jano,janio
 nazek,nazeek
 nazeek,nazek
 thandavaryan,thandavarayan
 thandavarayan,thandavaryan
-loc,lg
-lg,loc
 pradhat,prabhat
 prabhat,pradhat
 jin,jiao
@@ -5571,10 +5533,6 @@ tatyna,tatyana
 tatyana,tatyna
 artemiy,artemy
 artemy,artemiy
-thanh,thao
-thao,thanh
-paola,paolo
-paolo,paola
 zinoviy,zenoviy
 zenoviy,zinoviy
 hossam,hosssam
@@ -5589,8 +5547,6 @@ antonio,antnio
 antnio,antonio
 binyam,biniam
 biniam,binyam
-kin,ko
-ko,kin
 tayfun,taifun
 taifun,tayfun
 nasser,naseer
@@ -7629,12 +7585,8 @@ xiangcehng,xiangcheng
 xiangcheng,xiangcehng
 herbert,herbet
 herbet,herbert
-zhou,zhi
-zhi,zhou
 kuo,kou
 kou,kuo
-ne,nhw
-nhw,ne
 philipp,phillipp
 phillipp,philipp
 vanphanom,vanpahnom
@@ -11515,22 +11467,12 @@ tykhon,tikhon
 tikhon,tykhon
 ryhei,ryohei
 ryohei,ryhei
-pra,pim
-pim,pra
-rjl,rob
-rob,rjl
 norshariani,norshairani
 norshairani,norshariani
 anatolii,anatoalii
 anatoalii,anatolii
-lcs,lucy
-lucy,lcs
 alena,anna
 anna,alena
-mc,mei
-mei,mc
-jon,jh
-jh,jon
 vladinir,vladimir
 vladimir,vladinir
 aida,ayda
@@ -11549,9 +11491,6 @@ yevhen,yeuvgen
 yeuvgen,yevhen
 junichi,jyunichi
 jyunichi,junichi
-jun,js
-js,jun
-davis,david
 raineesh,rajneesh
 rajneesh,raineesh
 necar,nesar
@@ -11560,10 +11499,6 @@ neschiclyaev,neschislyaev
 neschislyaev,neschiclyaev
 hojjatollah,hojatollah
 hojatollah,hojjatollah
-rc,roi
-roi,rc
-emd,eve
-eve,emd
 luciani,luciane
 luciane,luciani
 fabiano,fabiana
@@ -11598,10 +11533,6 @@ gulen,guelen
 guelen,gulen
 hongnyoung,hongnyung
 hongnyung,hongnyoung
-eva,en
-en,eva
-ali,ah
-ah,ali
 jongyeon,joneyeon
 joneyeon,jongyeon
 sergyi,serhii
@@ -11648,12 +11579,8 @@ yubyeol,yubeol
 yubeol,yubyeol
 kebenesa,kebanesa
 kebanesa,kebenesa
-ea,emma
-emma,ea
 abulkadir,abdulkadir
 abdulkadir,abulkadir
-aas,ana
-ana,aas
 gourab,gaurav
 gaurav,gourab
 abdoreza,abdorreza
@@ -11718,14 +11645,10 @@ olav,olaf
 olaf,olav
 chuancun,chuncun
 chuncun,chuancun
-laura,luca
-luca,laura
 serguey,sergei
 sergei,serguey
 stefano,stefhano
 stefhano,stefano
-denny,danny
-danny,denny
 li,le
 le,li
 nuzianna,nunzianna
@@ -11800,8 +11723,6 @@ maryam,mayram
 mayram,maryam
 huyan,hyun
 hyun,huyan
-john,joann
-joann,john
 seungkeol,seungkul
 seungkul,seungkeol
 georhii,georgiy
@@ -11814,12 +11735,8 @@ mutaher,mutahar
 mutahar,mutaher
 amar,amit
 amit,amar
-kee,kt
-kt,kee
 xianonan,xiaonan
 xiaonan,xianonan
-maron,marlon
-marlon,maron
 bangshui,bangshuai
 bangshuai,bangshui
 yafeng,yufeng
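
Most entries in the hunks above appear in both directions (for example `abi,abigail` and `abigail,abi`), while one removed line, `davis,david`, has no reverse entry within its hunk. The following is a minimal sketch, not the repository's own loader, of how such a file could be parsed and checked for one-directional pairs. The helper names `parse_name_pairs` and `missing_reverse` are hypothetical, and the sample lines are copied from the diff.

```python
from typing import Iterable, Set, Tuple


def parse_name_pairs(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Parse 'name_a,name_b' lines into a set of ordered pairs (hypothetical helper)."""
    pairs: Set[Tuple[str, str]] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        first, second = line.split(",", 1)
        pairs.add((first, second))
    return pairs


def missing_reverse(pairs: Set[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Return the pairs whose reversed counterpart is absent from the set."""
    return {(a, b) for (a, b) in pairs if (b, a) not in pairs}


# Sample lines taken from the hunks above.
sample = ["abi,abigail", "abigail,abi", "jun,js", "js,jun", "davis,david"]
pairs = parse_name_pairs(sample)
unpaired = missing_reverse(pairs)
print(sorted(unpaired))
```

A check like this only reports asymmetry in the lines it is given; whether a reverse entry exists elsewhere in the full file cannot be told from the diff alone.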

requirements.in

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,3 @@
-numpy==1.24.3
 scikit-learn==1.2.2
 text-unidecode==1.3
 requests==2.24.0
@@ -9,13 +8,14 @@ fastcluster==1.2.6
 genieclust==1.1.4
 pycld2==0.41
 fasttext==0.9.2
-shap==0.36.0
 matplotlib==3.7.1
 seaborn==0.12.2
 tqdm==4.49.0
 strsimpy==0.2.0
 jellyfish==0.8.2
+numpy==1.24.3
 orjson
+shap

 # For CI and testing
 pytest==6.0.2

requirements_py_311.in

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,3 @@
-numpy==1.24.3
 scikit-learn==1.2.2
 text-unidecode==1.3
 requests==2.24.0
@@ -7,13 +6,14 @@ pandas>=1.2
 lightgbm==3.0.0
 fastcluster==1.2.6
 genieclust==1.1.4
-shap==0.36.0
 matplotlib==3.7.1
 seaborn==0.12.2
 tqdm==4.49.0
 strsimpy==0.2.0
 jellyfish==0.8.2
+numpy==1.24.3
 orjson
+shap

 # For CI and testing
 pytest==8.4.1
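
With `shap` no longer pinned, the installed version can differ between environments. The sketch below shows one way code could look up the installed version at runtime before choosing a code path. It is not the SHAP utility module from this commit (that file is not shown here); `parse_release` and `installed_version` are hypothetical helpers built only on the standard library's `importlib.metadata`.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(raw: str) -> Tuple[int, ...]:
    """Turn a version string such as '0.36.0' into a tuple of leading integers."""
    parts = []
    for piece in raw.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at suffixes such as 'rc1'
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str) -> Optional[Tuple[int, ...]]:
    """Return the installed release tuple, or None if the package is absent."""
    try:
        raw = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return parse_release(raw)


shap_version = installed_version("shap")
if shap_version is None:
    print("shap is not installed")
else:
    print("shap release:", shap_version)
```

Tuples compare element by element, so a guard such as `shap_version >= (0, 40)` can select between code paths; that threshold is only an illustration, not a boundary taken from the commit.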

s2and/data.py

Lines changed: 5 additions & 0 deletions
@@ -40,6 +40,9 @@

 logger = logging.getLogger("s2and")

+# Global variable for multiprocessing
+global_preprocess: bool
+

 class NameCounts(NamedTuple):
     first: Optional[int]
@@ -856,9 +859,11 @@ def split_blocks_helper(
             x.append(block_id)
             y.append(len(signature))

+        # Explicitly set n_init to silence upcoming sklearn default-change warning
         clustering_model = KMeans(
             n_clusters=self.num_clusters_for_block_size,
             random_state=self.random_seed,
+            n_init=10,
         ).fit(np.array(y).reshape(-1, 1))
         y_group = clustering_model.labels_
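
The second hunk passes `n_init=10` to `KMeans` explicitly, so the number of initializations no longer depends on the scikit-learn default. Below is a minimal standalone sketch of the same call pattern, clustering one-dimensional block sizes. The block sizes, `n_clusters=2`, and `random_state=0` are made-up values for illustration; in the diff these come from `self.num_clusters_for_block_size` and `self.random_seed`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up block sizes: three small blocks and three large ones.
block_sizes = [2, 3, 4, 100, 110, 120]

# reshape(-1, 1) turns the flat list into a single-feature column,
# matching the np.array(y).reshape(-1, 1) call in the diff.
features = np.array(block_sizes).reshape(-1, 1)

# n_init is set explicitly, as in the commit, instead of relying on the default.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(features)
labels = model.labels_

print(labels)
```

The cluster ids themselves are arbitrary, so only the grouping is meaningful: the three small sizes share one label and the three large sizes share the other.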
