Can't replicate data preprocessing #98

@morphy380

I am confused about how the data was transformed between the full matrix and the variable genes matrix provided. The variable genes matrix appears to be normalized and log-transformed, but I don't get a perfect correlation after applying that transformation to the full matrix. Could you provide the code for that preprocessing?
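For reference, the transformation I am assuming can be written as follows, where $x_{ij}$ is the raw value of gene $j$ in cell $i$. The scale factor of 10,000 is my own guess, not a documented value:

```latex
y_{ij} = \log\left(1 + 10^{4}\,\frac{x_{ij}}{\sum_{k} x_{ik}}\right)
```

In one attempt the sum over $k$ runs over all genes in the full matrix, and in the other it runs over the variable genes only.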

import numpy as np
import pandas as pd
import scanpy as sc  # assuming `sc` refers to scanpy
import matplotlib.pyplot as plt

PATH = '../../git/wot/notebooks/data/'
CELL_DAYS_PATH = 'data/cell_days.txt'
FULL_DS_PATH = 'data/ExprMatrix.h5ad'
VAR_DS_PATH = 'data/ExprMatrix.var.genes.h5ad'
FLE_COORDS_PATH = 'data/fle_coords.txt'

coord_df = pd.read_csv(PATH + FLE_COORDS_PATH, index_col='id', sep='\t')
days_df = pd.read_csv(PATH + CELL_DAYS_PATH, index_col='id', sep='\t')
# Keep only the cells that have a day annotation
mask = [ind in days_df.index for ind in coord_df.index]
adataf = sc.read_h5ad(PATH + FULL_DS_PATH)[mask]
adata = sc.read_h5ad(PATH + VAR_DS_PATH)

# Compare the first 1000 cells, restricted to the variable genes
df = adata[:1000, :].to_df()
dff = adataf[:1000, :].to_df()
dff_hvg = dff[df.columns]
assert np.all(df.index == dff.index)

# Scale each cell to 10,000 total counts, then apply log1p
dff_norm_hvg = np.log1p(10000 * dff_hvg.div(dff_hvg.sum(axis=1), axis=0))
dff_norm = np.log1p(10000 * dff.div(dff.sum(axis=1), axis=0))
# Correlation is not 1 when normalizing over all genes
plt.scatter(x=dff_norm['Fam150a'], y=df['Fam150a'], s=1)
# Correlation is also not 1 when normalizing over the variable genes only
plt.scatter(x=dff_norm_hvg['Fam150a'], y=df['Fam150a'], s=1)
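
To put numbers on the mismatch instead of judging it from scatter plots, a per-gene Pearson correlation can be computed for each candidate normalization. The following is a minimal sketch on synthetic data, not the actual preprocessing used for the dataset. The helpers `normalize_log1p` and `per_gene_correlation`, the Poisson counts, and the 10,000 target sum are all assumptions made for illustration. It shows that normalizing over all genes and normalizing over a gene subset give different values, so at most one of them can match a given reference matrix exactly.

```python
import numpy as np
import pandas as pd


def normalize_log1p(counts: pd.DataFrame, target_sum: float = 1e4) -> pd.DataFrame:
    """Scale each row (cell) to `target_sum` total counts, then apply log(1 + x)."""
    totals = counts.sum(axis=1)
    return np.log1p(target_sum * counts.div(totals, axis=0))


def per_gene_correlation(a: pd.DataFrame, b: pd.DataFrame) -> pd.Series:
    """Pearson correlation between matching columns, with rows aligned by label."""
    genes = a.columns.intersection(b.columns)
    cells = a.index.intersection(b.index)
    return a.loc[cells, genes].corrwith(b.loc[cells, genes])


# Synthetic stand-in for the real matrices (hypothetical data, not from the dataset)
rng = np.random.default_rng(0)
cells = [f"cell{i}" for i in range(200)]
genes = [f"gene{j}" for j in range(50)]
raw = pd.DataFrame(
    rng.poisson(5.0, size=(200, 50)), index=cells, columns=genes
).astype(float)
hvg = genes[:10]

# Assume, for this sketch only, that the reference matrix was normalized
# over all genes and then subset to the variable genes
published = normalize_log1p(raw)[hvg]

# Candidate 1: normalize over all genes, then subset
corr_full = per_gene_correlation(normalize_log1p(raw)[hvg], published)
# Candidate 2: subset first, then normalize over the variable genes only
corr_hvg = per_gene_correlation(normalize_log1p(raw[hvg]), published)

print("min correlation, normalized over all genes:", corr_full.min())
print("min correlation, normalized over variable genes:", corr_hvg.min())
```

Running the same two helpers on `dff`, `dff_hvg` and `df` from the snippet above would give one correlation per gene for each candidate. If neither candidate reaches 1 for every gene, the published matrix was presumably produced with an additional or different step.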
