
Gene Ontology on NeuralKG

NeuralKG is a Python-based library for diverse representation learning of knowledge graphs, implementing Conventional KGEs, GNN-based KGEs, and Rule-based KGEs. We provide comprehensive documentation for beginners and an online website to organize an open and shared KG representation learning community.

This article uses this open-source toolkit to train knowledge graph embedding models on biomolecular datasets. You only need to provide the dataset in the required format and select a model and its hyperparameters to start training.

See NeuralKG for the installation tutorial.

Data introduction and processing

NeuralKG requires the data to be processed into the following five files:

entities.dict
relations.dict
train.txt
valid.txt
test.txt

In entities.dict and relations.dict, each line is ID\tname, where ID is a sequence number starting from 0. In train.txt, valid.txt, and test.txt, each line is Head entity\tRelation\tTail entity.
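The file layout described above can be produced with a short script. The following is a minimal sketch, assuming the triples are already available as Python tuples and split into train, valid, and test; the function names build_dicts and write_dataset are hypothetical helpers, not part of NeuralKG.

```python
from pathlib import Path


def build_dicts(triples):
    """Assign consecutive IDs (starting at 0) in order of first appearance."""
    entities, relations = {}, {}
    for head, rel, tail in triples:
        for ent in (head, tail):
            entities.setdefault(ent, len(entities))
        relations.setdefault(rel, len(relations))
    return entities, relations


def write_dataset(out_dir, train, valid, test):
    """Write the five files: two ID\\tname dictionaries and three triple files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # IDs are assigned over all splits so every entity and relation is covered.
    entities, relations = build_dicts(train + valid + test)

    for name, mapping in (("entities.dict", entities), ("relations.dict", relations)):
        with open(out / name, "w", encoding="utf-8") as f:
            for label, idx in mapping.items():
                f.write(f"{idx}\t{label}\n")

    for name, split in (("train.txt", train), ("valid.txt", valid), ("test.txt", test)):
        with open(out / name, "w", encoding="utf-8") as f:
            for head, rel, tail in split:
                f.write(f"{head}\t{rel}\t{tail}\n")
    return entities, relations
```

Calling write_dataset("my_dataset", train, valid, test) with three lists of (head, relation, tail) tuples writes all five files into the my_dataset directory (a placeholder name).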

Gene Ontology

The Gene Ontology(GO) resource provides a computational representation of our current scientific knowledge about the functions of genes (or, more properly, the protein and non-coding RNA molecules produced by genes) from many different organisms, from humans to bacteria. It is widely used to support scientific research, and has been cited in tens of thousands of publications.

GO is also at the hub of a major effort to represent the vast amount of biomedical knowledge in a computable form. GO is linked to many other biomedical ontologies, and is a foundation for research applying computer science in biology and medicine.

An ontology is a formal representation of a body of knowledge within a given domain. Ontologies usually consist of a set of classes (or terms or concepts) with relations that operate between them. The GO dataset used here contains 47229 entities, 9 relation types, and 110146 triples. The Gene Ontology describes our knowledge of the biological domain with respect to three aspects:

  • Molecular Function: Molecular-level activities performed by gene products. Molecular function terms describe activities that occur at the molecular level, such as “catalysis” or “transport”.
  • Cellular Component: The locations relative to cellular structures in which a gene product performs a function, either cellular compartments (e.g., mitochondrion), or stable macromolecular complexes of which they are parts (e.g., the ribosome).
  • Biological Process: The larger processes, or ‘biological programs’ accomplished by multiple molecular activities. Examples of broad biological process terms are DNA repair or signal transduction. Examples of more specific terms are pyrimidine nucleobase biosynthetic process or glucose transmembrane transport.

See About the GO for dataset details and download links.

Configuration

NeuralKG provides two ways to configure parameters: a YAML configuration file, or command-line arguments passed at training time. See the parameter description for what each parameter does. Use litmodel_name to select the model family (Conventional KGEs, GNN-based KGEs, or Rule-based KGEs) and model_name to select a specific model.
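To illustrate how the two configuration routes can interact, here is a minimal sketch in which values loaded from a config file act as defaults and command-line arguments override them. This is not NeuralKG's own implementation: merge_config is a hypothetical helper, the dictionary stands in for a parsed YAML file, and the litmodel_name value and the lr key are assumptions chosen only for illustration.

```python
import argparse


def merge_config(file_cfg, argv):
    """Return file_cfg with any command-line options in argv taking precedence."""
    parser = argparse.ArgumentParser()
    for key, value in file_cfg.items():
        # Each config key becomes an optional flag of the same type
        # (works for str, int, and float values; booleans would need extra care).
        parser.add_argument(f"--{key}", type=type(value), default=value)
    return vars(parser.parse_args(argv))


# Stand-in for a parsed YAML file. "KGELitModel" and "lr" are assumed values.
file_cfg = {"model_name": "TransE", "litmodel_name": "KGELitModel", "lr": 0.001}

# Simulated command line: switch the model and lower the learning rate.
cfg = merge_config(file_cfg, ["--model_name", "RotatE", "--lr", "0.0001"])
print(cfg)
```

Keys not mentioned on the command line keep the value from the file, so a single base config can be reused across several runs.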

A YAML file can be obtained by modifying one of the examples in configs. Then run the model with:

python main.py --load_config --config_path <your-config.yaml>

A script file can be obtained by modifying one of the examples in scripts. Then run the model with:

bash <your-script.sh>

Here, we train and evaluate three conventional KGE models: TransE, ComplEx, and RotatE.

Results

Model    MRR    Hit@1  Hit@3  Hit@10
TransE   0.342  0.199  0.435  0.584
ComplEx  0.391  0.318  0.432  0.533
RotatE   0.464  0.393  0.505  0.601
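
For readers unfamiliar with the metrics above, the following minimal sketch shows how MRR and Hit@k are conventionally computed from the rank of the correct entity in each test query. It uses the standard definitions and is not taken from NeuralKG's evaluation code; the function name ranking_metrics is hypothetical, and whether the ranks are raw or filtered is left to the caller.

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MRR and Hit@k from 1-based ranks of the correct entities."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    # MRR: mean of the reciprocal ranks.
    metrics = {"MRR": sum(1.0 / r for r in ranks) / n}
    for k in ks:
        # Hit@k: fraction of queries whose correct entity is ranked in the top k.
        metrics[f"Hit@{k}"] = sum(r <= k for r in ranks) / n
    return metrics


# Toy example with four test queries.
print(ranking_metrics([1, 2, 4, 20]))
```

Higher is better for all four metrics. MRR rewards placing the correct entity near the top of the ranking, while Hit@k only checks whether it falls within the first k candidates.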
