6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml

4 changes: 4 additions & 0 deletions .idea/misc.xml

8 changes: 8 additions & 0 deletions .idea/modules.xml

12 changes: 12 additions & 0 deletions .idea/scGCN.iml

6 changes: 6 additions & 0 deletions .idea/vcs.xml

63 changes: 63 additions & 0 deletions .idea/workspace.xml

65 changes: 23 additions & 42 deletions README.md
@@ -1,75 +1,56 @@
scGCN is a Graph Convolutional Networks Algorithm for Knowledge Transfer in Single Cell Omics
The scRNA_GCN pipeline is an upgrade and refinement of the scGCN pipeline; it accepts single-cell RNA gene expression matrices as input and is intended for cross-species cell type comparison.
## Overview

This is a TensorFlow implementation of scGCN for label transfer across different single-cell datasets.
This cross-species single-cell cell type annotation software, based on convolutional neural networks, is a Python package for the Linux platform. It is intended to help users with little programming experience carry out downstream analysis of single-cell RNA sequencing (scRNA-seq) and related omics data. The software provides a convenient algorithm for analyzing gene expression similarity between species, so that cell types of a query species can be annotated from the reference species' cell type annotation without knowing species-specific markers across species. After annotation, it can also perform a simple comparative analysis of gene expression differences between cell types of the two species. The software is easy to operate, making it accessible to researchers without a programming background.

[![DOI](https://zenodo.org/badge/294531199.svg)](https://zenodo.org/badge/latestdoi/294531199)

## Overview

Single-cell omics represent the fastest-growing genomics data type in the literature and the public genomics repositories. Leveraging the growing repository of labeled datasets and transferring labels from existing datasets to newly generated datasets will empower the exploration of the single-cell omics. The current label transfer methods have limited performance, largely due to the intrinsic heterogeneity among cell populations and extrinsic differences between datasets. Here, we present a robust graph artificial intelligence model, single-cell Graph Convolutional Network (scGCN), to achieve effective knowledge transfer across disparate datasets. Benchmarked with other label transfer methods on different single cell omics datasets, scGCN has consistently demonstrated superior accuracy on leveraging cells from different tissues, platforms, and species, as well as cells profiled at different molecular layers. scGCN is implemented as an integrated workflow and provided here.

## Requirements
## Required packages
* setuptools >= 40.6.3
* numpy >= 1.15.4
* tensorflow >= 1.15.0
* networkx >= 2.2
* scipy >= 1.1.0

## Installation
## Installing the pipeline

Download scGCN:
Download the pipeline:
```
git clone https://github.com/QSong-github/scGCN
git clone -b cdy https://github.com/Dee-chen/scGCN
```
Install requirements and scGCN:
Install the dependencies:

```bash
python setup.py install
```
Installation typically takes less than 10 seconds and has been tested on macOS and Linux.

## Run the demo

Load the example data using the data_preprocess.R script.
The example data include the mouse (reference) and human (query) datasets from GSE84133. The reference dataset contains 1,841 cells; the query dataset contains 7,264 cells and 12,182 genes.
## Running the example

The example data contain gene expression matrices for two species.
```bash
cd scGCN
Rscript data_preprocess.R # load example data
python train.py # run scGCN
Rscript data_preprocess.R # load and preprocess the example data
python train.py # run the main program
```
All output will be written to the output_log.txt file, with performance shown at the bottom.
We also provide the Seurat performance on this reference-query set (as in Figure 4), which can be obtained by running

```
Rscript Seurat_result.R
```

## Input data

When using your own data, you have to provide
* the raw data matrix of reference data and cell labels
* the raw data matrix of query data

## Output

The output files with scGCN predicted labels will be stored in the results folder.

## Model options

We also provide other GCN models including GAT (Veličković et al., ICLR 2018), HyperGCN (Chami et al., NIPS 2019) and GWNN (Xu et al., ICLR 2019) for optional use.

## Detecting unknown cells
## Input data

For query data that contain cell types not present in the reference data, we provide a screening step in the scGCN model using two statistical metrics, an entropy score and an enrichment score. Cells in the query data with higher entropy and lower enrichment are assigned as unknown cells. Specifically, set check_unknown=TRUE in the function 'save_processed_data' to detect unknown cells.
When using your own data, you need to provide a gene expression matrix for the query species, plus a gene expression matrix for the reference species together with its corresponding cell type labels (a minimal sketch of the expected layout follows the list below)
* The gene expression matrix is a data frame with cells as rows and genes as columns
* The cell type labels are a single-column data frame whose column header is "type"
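
For illustration only, a minimal sketch (assuming pandas; the file names such as ref_counts.csv and the toy values are placeholders, not part of the pipeline) of how inputs in this layout could be written out:

```python
import pandas as pd

# Toy reference expression matrix: cells as rows, genes as columns.
ref_counts = pd.DataFrame(
    [[5, 0, 2], [0, 3, 1]],
    index=["ref_cell_1", "ref_cell_2"],   # cell barcodes
    columns=["GeneA", "GeneB", "GeneC"],  # gene symbols
)

# Matching cell type labels: a single-column data frame headed "type".
ref_labels = pd.DataFrame({"type": ["alpha", "beta"]}, index=ref_counts.index)

# The query matrix uses the same layout; no labels are required for it.
query_counts = pd.DataFrame(
    [[1, 4, 0]], index=["query_cell_1"], columns=["GeneA", "GeneB", "GeneC"]
)

# Hypothetical file names; use whatever your preprocessing step expects.
ref_counts.to_csv("ref_counts.csv")
ref_labels.to_csv("ref_labels.csv")
query_counts.to_csv("query_counts.csv")
```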

## Reproduction instructions
## Output

The above scripts can reproduce the quantitative results in our manuscript based on our provided data.
The output data are written to the results folder, while the visualization results are written to the directory from which the main program is run.

## Cite
## Differential analysis
Building on the original scGCN, a preliminary differential analysis feature has been added to visualize the main differentially expressed genes of the same cell type across species.
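
The differential analysis code itself is not part of this diff; purely as an illustration of the idea (comparing one annotated cell type between the two species), here is a generic sketch using a rank-sum test from scipy — the function name and inputs are assumptions, not the pipeline's API:

```python
import pandas as pd
from scipy.stats import ranksums

def top_diff_genes(ref_expr, query_expr, n_top=10):
    """Rank genes shared by both species with a Wilcoxon rank-sum test,
    comparing cells of one annotated cell type between reference and query.
    Both inputs are cells-x-genes data frames restricted to that cell type."""
    shared = ref_expr.columns.intersection(query_expr.columns)
    rows = []
    for gene in shared:
        stat, p = ranksums(ref_expr[gene], query_expr[gene])
        rows.append((gene, p, ref_expr[gene].mean() - query_expr[gene].mean()))
    table = pd.DataFrame(rows, columns=["gene", "p_value", "mean_diff"])
    return table.sort_values("p_value").head(n_top)
```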

Please cite our paper and the related GCN papers if you use this code in your own work:
## Citation

Citation for the original scGCN paper:
```
Song, Q., Su, J., & Zhang, W. (2020). scGCN: a Graph Convolutional Networks Algorithm for Knowledge Transfer in Single Cell Omics. bioRxiv.
```
98 changes: 98 additions & 0 deletions main.py
@@ -0,0 +1,98 @@
import argparse
import threading
import time
import os
import configparser
import logging
import coloredlogs
import sys
import subprocess
import multiprocessing
import pandas as pd

class NewConfigParser(configparser.ConfigParser):
def optionxform(self, optionstr):
return optionstr

def err_exit():
sys.exit('\033[1;31;47m!!The program exited abnormally, please check the log file !!\033[0m')

def main():
pwd= os.getcwd()
    parser = argparse.ArgumentParser(description='scGCN pipeline')
    parser.add_argument('-i', type=str, required=True, metavar='input_list',
                        help='Tab-separated list of the count matrices and cell type files.')
    parser.add_argument('-o', type=str, metavar='outputdir', default=os.sep.join([pwd, 'output']),
                        help='The output dir. Default: ' + os.sep.join([pwd, 'output']))
    parser.add_argument('--log', type=str, metavar='logfile',
                        help='Log file name; if omitted, the log is printed to stdout.')
    parser.add_argument('--debug', action='store_true', default=False,
                        help='The log file will also contain the output of each external tool, which is convenient for finding errors (--log is required).')

args = parser.parse_args()


if not args.log:
        coloredlogs.install(
            fmt='%(asctime)s: %(levelname)s\t%(message)s',
            level='INFO', stream=sys.stdout
        )
else:
if args.debug:
l='debug'
else:
l='info'
print('Logs will be written to {}'.format(args.log))
if os.path.exists(args.log):
os.remove(args.log)
logging.basicConfig(
filename=args.log,
filemode='a',
format='%(asctime)s: %(levelname)s\t%(message)s',
datefmt='%H:%M:%S',
level=l.upper()
)
home_dir=os.path.dirname(os.path.abspath(__file__))

logging.info("Start checking the input file..")
try:
input_list=pd.read_csv(args.i, sep='\t', header=0)
except FileNotFoundError:
        logging.error('Checking input file :[ERR] -- The input file :%s does not exist' % (args.i))
err_exit()
    if input_list.columns.tolist() != ["Species", "count", "type", "dbclass"]:
        logging.error('The input file %s is not in the expected format (columns must be: Species, count, type, dbclass).' % (args.i))
err_exit()
for f in input_list['count']:
if not os.path.exists(f):
logging.error('The input count file :%s does not exist.'%f)
err_exit()
for f in input_list['type']:
if not os.path.exists(f):
            logging.error('The cell type file :%s does not exist.' % f)
err_exit()

    outpath = os.path.abspath(args.o)
    os.makedirs(outpath, exist_ok=True)  # do not fail if the output dir already exists

for ref in input_list[input_list.dbclass=="ref"].itertuples():
for query in input_list[input_list.dbclass=="query"].itertuples():

os.mkdir(os.sep.join([outpath,ref.Species+"_"+query.Species]))
logging.info('Start processing %s to %s type predictions.' % (ref.Species, query.Species))
logging.info('Data pre-processing...')
os.chdir(os.sep.join([outpath,ref.Species+"_"+query.Species]))
            # Run the R preprocessing step for this reference/query pair.
            cmd = ['Rscript', os.sep.join([home_dir, 'data_preprocess.R']), ref.count, ref.type, query.count, query.type]
            logging.info(' '.join(cmd))
            result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            logging.debug('{} stdout:\n'.format('Data pre-processing') + result.stdout.decode('utf-8'))
            logging.debug('{} stderr:\n'.format('Data pre-processing') + result.stderr.decode('utf-8'))
            logging.info("Data pre-processing finished.")

logging.info("Cell type predictions...")
stdout = subprocess.run(['python',os.sep.join([home_dir,'train.py'])],stdout=subprocess.PIPE, stderr=subprocess.PIPE)
logging.debug('{} stdout:\n'.format('Data pre-processing') +stdout.stdout.decode('utf-8'))
logging.debug('{} stderr:\n'.format('Data pre-processing') +stdout.stderr.decode('utf-8'))
logging.info("Cell type predictions finish.")

logging.info("ALL finish.")

if __name__ == '__main__':
main()
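
For reference, a hypothetical sketch of the tab-separated input list that `-i` expects — the required column names (Species, count, type, dbclass) come from the check in main.py, while every species name and file path below is a placeholder:

```python
import pandas as pd

# One row per dataset: "count" is the expression matrix file, "type" the cell
# type labels, and "dbclass" marks the dataset as reference ("ref") or query.
input_list = pd.DataFrame({
    "Species": ["mouse", "human"],
    "count":   ["mouse_counts.csv", "human_counts.csv"],
    "type":    ["mouse_types.csv", "human_types.csv"],
    "dbclass": ["ref", "query"],
})
input_list.to_csv("input_list.tsv", sep="\t", index=False)

# The pipeline would then be launched as, for example:
#   python main.py -i input_list.tsv -o output --log run.log
```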

Empty file added scGCN/.Rhistory
Empty file.
Binary file added scGCN/__pycache__/data.cpython-38.pyc
Binary file not shown.
Binary file added scGCN/__pycache__/graph.cpython-38.pyc
Binary file not shown.
Binary file added scGCN/__pycache__/layers.cpython-38.pyc
Binary file not shown.
Binary file added scGCN/__pycache__/models.cpython-38.pyc
Binary file not shown.
Binary file added scGCN/__pycache__/sankey.cpython-38.pyc
Binary file not shown.
Binary file added scGCN/__pycache__/utility.cpython-38.pyc
Binary file not shown.
Binary file added scGCN/__pycache__/utils.cpython-38.pyc
Binary file not shown.
Binary file not shown.
Binary file added scGCN/checkpoints/best_validation.index
Binary file not shown.
Binary file added scGCN/checkpoints/best_validation.meta
Binary file not shown.
2 changes: 2 additions & 0 deletions scGCN/checkpoints/checkpoint
@@ -0,0 +1,2 @@
model_checkpoint_path: "best_validation"
all_model_checkpoint_paths: "best_validation"
9 changes: 5 additions & 4 deletions scGCN/data.py
@@ -29,7 +29,7 @@ def input_data(DataDir,Rgraph=True):
lab_data2 = data2.reset_index(drop=True) #.transpose()
lab_label1.columns = ['type']
lab_label2.columns = ['type']

types = np.unique(lab_label1['type']).tolist()

random.seed(123)
@@ -83,14 +83,15 @@ def input_data(DataDir,Rgraph=True):
lab_train2 = pd.concat([label_train1, lab_label2])

#' save objects

types_all = np.unique([*lab_label1['type'],*lab_label2['type']]).tolist()
PIK = "{}/datasets.dat".format(DataDir)
res = [
data_train1, data_test1, data_val1, label_train1, label_test1,
label_val1, lab_data2, lab_label2, types
label_val1, lab_data2, lab_label2, types_all
]

with open(PIK, "wb") as f:
pkl.dump(res, f)

with open("data.txt","w") as out:
out.write(str(res))
print('load data succesfully....')
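
For context, a standalone sketch (with made-up labels) of what the new types_all line computes — the sorted, de-duplicated union of the reference and query cell type labels, so that cell types present only in the query are also kept:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for lab_label1 (reference) and lab_label2 (query).
lab_label1 = pd.DataFrame({"type": ["alpha", "beta", "alpha"]})
lab_label2 = pd.DataFrame({"type": ["beta", "delta"]})

# Same expression as in the patched data.py.
types_all = np.unique([*lab_label1["type"], *lab_label2["type"]]).tolist()
print(types_all)  # ['alpha', 'beta', 'delta']
```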