Merged
2 changes: 1 addition & 1 deletion README.md
@@ -182,7 +182,7 @@ Our trained, released models are in the `s3` folder referenced above, and are ca
Note that by default we are using the `--use_cache` flag, which will cache all the features so future reruns are faster. There are two things to be aware of: (a) the cache is stored in RAM and can be huge (100gb+) and (b) if you intend to change the features and rerun, you'll have to turn off the cache or the new features won't be used.

## Licensing
The code in this repo is released under the Apache 2.0 license (license included in the repo. The dataset is released under ODC-BY (included in S3 bucket with the data). We would also like to acknowledge that some of the affiliations data comes directly from the Microsoft Academic Graph (https://aka.ms/msracad).
The code in this repo is released under the Apache 2.0 license. The dataset is released under ODC-BY (included in S3 bucket with the data). We would also like to acknowledge that some of the affiliations data comes directly from the Microsoft Academic Graph (https://aka.ms/msracad).

## Citation

83 changes: 0 additions & 83 deletions data/s2and_name_tuples_filtered.txt
@@ -232,8 +232,6 @@ abi,abigail
abigail,abi
jenn,jennifer
jennifer,jenn
maria,marcia
marcia,maria
theodore,ted
ted,theodore
danial,daniel
@@ -362,8 +360,6 @@ julia,julie
julie,julia
rudolf,rudi
rudi,rudolf
antonio,andrea
andrea,antonio
manel,manuel
manuel,manel
jc,jean
@@ -372,8 +368,6 @@ katy,katherine
katherine,katy
nicola,nikki
nikki,nicola
marian,maria
maria,marian
ash,ashley
ashley,ash
desmond,des
@@ -408,12 +402,8 @@ deb,debra
debra,deb
cindy,cynthia
cynthia,cindy
peter,patrick
patrick,peter
john,johannes
johannes,john
marina,mariana
mariana,marina
mick,michael
michael,mick
arthur,art
@@ -496,8 +486,6 @@ tracy,tracey
tracey,tracy
michal,michael
michael,michal
antonio,andre
andre,antonio
val,valerie
valerie,val
walt,walter
@@ -564,8 +552,6 @@ anna,ania
ania,anna
jeff,jefferson
jefferson,jeff
daniel,danielle
danielle,daniel
nickolas,nick
nick,nickolas
amirali,amir
@@ -1071,12 +1057,6 @@ marc,marcus
marcus,marc
tomas,tomasz
tomasz,tomas
christian,christopher
christopher,christian
alex,alastair
alastair,alex
gabriella,gabriel
gabriel,gabriella
elizabeth,elise
elise,elizabeth
terrence,terrance
@@ -1099,10 +1079,6 @@ molly,mary
mary,molly
jerome,jerry
jerry,jerome
melanie,melissa
melissa,melanie
edwin,edmund
edmund,edwin
jeanne,jc
jc,jeanne
randolph,randy
@@ -1629,26 +1605,16 @@ serhii,serhiy
serhiy,serhii
kyung,kyoung
kyoung,kyung
eduardo,editorial
editorial,eduardo
serhii,sergiy
sergiy,serhii
muhammad,maria
maria,muhammad
dmytro,dmitrii
dmitrii,dmytro
ej,emma
emma,ej
hasan,hassan
hassan,hasan
dimitry,dmitry
dmitry,dimitry
wang,weisheng
weisheng,wang
amirhossein,amihossein
amihossein,amirhossein
antonio,alfredo
alfredo,antonio
dmitrii,dmitriy
dmitriy,dmitrii
sajikumar,sreedharan
@@ -5091,16 +5057,12 @@ wangli,wanli
wanli,wangli
chidambaram,chidabaram
chidabaram,chidambaram
pui,ph
ph,pui
janio,jano
jano,janio
nazek,nazeek
nazeek,nazek
thandavaryan,thandavarayan
thandavarayan,thandavaryan
loc,lg
lg,loc
pradhat,prabhat
prabhat,pradhat
jin,jiao
@@ -5571,10 +5533,6 @@ tatyna,tatyana
tatyana,tatyna
artemiy,artemy
artemy,artemiy
thanh,thao
thao,thanh
paola,paolo
paolo,paola
zinoviy,zenoviy
zenoviy,zinoviy
hossam,hosssam
@@ -5589,8 +5547,6 @@ antonio,antnio
antnio,antonio
binyam,biniam
biniam,binyam
kin,ko
ko,kin
tayfun,taifun
taifun,tayfun
nasser,naseer
@@ -7629,12 +7585,8 @@ xiangcehng,xiangcheng
xiangcheng,xiangcehng
herbert,herbet
herbet,herbert
zhou,zhi
zhi,zhou
kuo,kou
kou,kuo
ne,nhw
nhw,ne
philipp,phillipp
phillipp,philipp
vanphanom,vanpahnom
@@ -11515,22 +11467,12 @@ tykhon,tikhon
tikhon,tykhon
ryhei,ryohei
ryohei,ryhei
pra,pim
pim,pra
rjl,rob
rob,rjl
norshariani,norshairani
norshairani,norshariani
anatolii,anatoalii
anatoalii,anatolii
lcs,lucy
lucy,lcs
alena,anna
anna,alena
mc,mei
mei,mc
jon,jh
jh,jon
vladinir,vladimir
vladimir,vladinir
aida,ayda
@@ -11549,9 +11491,6 @@ yevhen,yeuvgen
yeuvgen,yevhen
junichi,jyunichi
jyunichi,junichi
jun,js
js,jun
davis,david
raineesh,rajneesh
rajneesh,raineesh
necar,nesar
@@ -11560,10 +11499,6 @@ neschiclyaev,neschislyaev
neschislyaev,neschiclyaev
hojjatollah,hojatollah
hojatollah,hojjatollah
rc,roi
roi,rc
emd,eve
eve,emd
luciani,luciane
luciane,luciani
fabiano,fabiana
@@ -11598,10 +11533,6 @@ gulen,guelen
guelen,gulen
hongnyoung,hongnyung
hongnyung,hongnyoung
eva,en
en,eva
ali,ah
ah,ali
jongyeon,joneyeon
joneyeon,jongyeon
sergyi,serhii
@@ -11648,12 +11579,8 @@ yubyeol,yubeol
yubeol,yubyeol
kebenesa,kebanesa
kebanesa,kebenesa
ea,emma
emma,ea
abulkadir,abdulkadir
abdulkadir,abulkadir
aas,ana
ana,aas
gourab,gaurav
gaurav,gourab
abdoreza,abdorreza
@@ -11718,14 +11645,10 @@ olav,olaf
olaf,olav
chuancun,chuncun
chuncun,chuancun
laura,luca
luca,laura
serguey,sergei
sergei,serguey
stefano,stefhano
stefhano,stefano
denny,danny
danny,denny
li,le
le,li
nuzianna,nunzianna
@@ -11800,8 +11723,6 @@ maryam,mayram
mayram,maryam
huyan,hyun
hyun,huyan
john,joann
joann,john
seungkeol,seungkul
seungkul,seungkeol
georhii,georgiy
@@ -11814,12 +11735,8 @@ mutaher,mutahar
mutahar,mutaher
amar,amit
amit,amar
kee,kt
kt,kee
xianonan,xiaonan
xiaonan,xianonan
maron,marlon
marlon,maron
bangshui,bangshuai
bangshuai,bangshui
yafeng,yufeng
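
The tuples file lists each name pair in both directions (for example `ted,theodore` and `theodore,ted`), so a pruning pass like this one can leave a pair with only one direction present. Below is a minimal sketch, assuming the one-pair-per-line `name,alias` format seen above, of how one might check that the file is still symmetric after an edit. `parse_name_pairs` and `find_one_directional` are hypothetical helpers for illustration and are not part of s2and.

```python
from typing import Iterable, List, Set, Tuple


def parse_name_pairs(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Parse `name,alias` lines into a set of (name, alias) tuples."""
    pairs: Set[Tuple[str, str]] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        left, right = line.split(",", 1)
        pairs.add((left, right))
    return pairs


def find_one_directional(pairs: Set[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return pairs whose reverse is missing, sorted for stable output."""
    return sorted(p for p in pairs if (p[1], p[0]) not in pairs)


# Illustrative input only; in practice the lines would come from the tuples file.
sample = ["ted,theodore", "theodore,ted", "davis,david"]
print(find_one_directional(parse_name_pairs(sample)))  # [('davis', 'david')]
```

Running a check like this over the full file before and after an edit would show whether any remaining pair has lost its reverse entry.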
4 changes: 2 additions & 2 deletions requirements.in
@@ -1,4 +1,3 @@
numpy==1.24.3
scikit-learn==1.2.2
text-unidecode==1.3
requests==2.24.0
@@ -9,13 +8,14 @@ fastcluster==1.2.6
genieclust==1.1.4
pycld2==0.41
fasttext==0.9.2
shap==0.36.0
matplotlib==3.7.1
seaborn==0.12.2
tqdm==4.49.0
strsimpy==0.2.0
jellyfish==0.8.2
numpy==1.24.3
orjson
shap

# For CI and testing
pytest==6.0.2
4 changes: 2 additions & 2 deletions requirements_py_311.in
@@ -1,4 +1,3 @@
numpy==1.24.3
scikit-learn==1.2.2
text-unidecode==1.3
requests==2.24.0
@@ -7,13 +6,14 @@ pandas>=1.2
lightgbm==3.0.0
fastcluster==1.2.6
genieclust==1.1.4
shap==0.36.0
matplotlib==3.7.1
seaborn==0.12.2
tqdm==4.49.0
strsimpy==0.2.0
jellyfish==0.8.2
numpy==1.24.3
orjson
shap

# For CI and testing
pytest==8.4.1
5 changes: 5 additions & 0 deletions s2and/data.py
@@ -40,6 +40,9 @@

logger = logging.getLogger("s2and")

# Global variable for multiprocessing
global_preprocess: bool


class NameCounts(NamedTuple):
    first: Optional[int]
@@ -856,9 +859,11 @@ def split_blocks_helper(
x.append(block_id)
y.append(len(signature))

# Explicitly set n_init to silence upcoming sklearn default-change warning
clustering_model = KMeans(
    n_clusters=self.num_clusters_for_block_size,
    random_state=self.random_seed,
    n_init=10,
).fit(np.array(y).reshape(-1, 1))
y_group = clustering_model.labels_
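
For context on the `n_init` change: in scikit-learn 1.2.x (the version pinned in `requirements.in`), constructing `KMeans` without `n_init` emits a `FutureWarning` because the default is scheduled to change from `10` to `"auto"`. The following is a rough standalone sketch of the same call pattern on one-dimensional data. The block sizes and the choice of three clusters are made-up values for illustration, not taken from S2AND.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical block sizes (signatures per block); illustrative values only.
block_sizes = [1, 2, 3, 50, 55, 60, 1000, 1010]

# KMeans expects a 2-D array, so the 1-D sizes are reshaped to one column.
# Passing n_init explicitly keeps the current behaviour and avoids the warning.
model = KMeans(
    n_clusters=3,
    random_state=0,
    n_init=10,
).fit(np.array(block_sizes).reshape(-1, 1))

labels = model.labels_  # one cluster index per block size
print(labels)
```

Pinning `n_init=10` preserves the behaviour the code had under the old default, so results should not shift when scikit-learn is later upgraded.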
