
NeuralKG on DrugBank

NeuralKG is a Python-based library for diverse representation learning of knowledge graphs. It implements Conventional KGEs, GNN-based KGEs, and Rule-based KGEs, and comes with comprehensive documentation for beginners and an online website for organizing an open, shared KG representation learning community.

This article uses this open-source toolkit to train knowledge graph embedding models on a biomolecular dataset. You only need to provide the dataset in the required format and select a model and its hyperparameters to start training.

See NeuralKG for the installation tutorial.

Data introduction and processing

NeuralKG requires the data to be organized into the following five files:

entities.dict
relations.dict
train.txt
valid.txt
test.txt

In entities.dict and relations.dict, each line has the form ID\tname, where ID is a sequence number starting from 0. In train.txt, valid.txt and test.txt, each line has the form head entity\trelation\ttail entity.
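The file layout described above can be produced with a few lines of Python. The following is only a minimal sketch: the function name write_neuralkg_files and the choice to assign IDs in order of first appearance are assumptions made for illustration, not part of NeuralKG.

```python
from pathlib import Path


def write_neuralkg_files(train, valid, test, out_dir):
    """Write the five files from lists of (head, relation, tail) string triples.

    Hypothetical helper for illustration; NeuralKG itself does not ship it.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Assign IDs starting from 0, in order of first appearance (an assumption;
    # the text only requires sequence numbers starting from 0).
    entities, relations = {}, {}
    for h, r, t in list(train) + list(valid) + list(test):
        entities.setdefault(h, len(entities))
        relations.setdefault(r, len(relations))
        entities.setdefault(t, len(entities))

    # entities.dict and relations.dict: one "ID<TAB>name" per line.
    for fname, mapping in (("entities.dict", entities), ("relations.dict", relations)):
        with open(out / fname, "w", encoding="utf-8") as f:
            for name, idx in mapping.items():
                f.write(f"{idx}\t{name}\n")

    # train/valid/test: one "head<TAB>relation<TAB>tail" per line.
    for fname, triples in (("train.txt", train), ("valid.txt", valid), ("test.txt", test)):
        with open(out / fname, "w", encoding="utf-8") as f:
            for h, r, t in triples:
                f.write(f"{h}\t{r}\t{t}\n")

    return entities, relations
```

Building the dictionaries from all three splits keeps every entity and relation in valid.txt and test.txt resolvable to an ID.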

DrugBank

In DrugBank, conditions are often specific medical states, including diseases, symptoms, and other health-related characteristics or problems. Conditions may also be used to describe other clinical human phenomena, such as procedures, therapies, and the presence or absence of certain genes.

Information regarding conditions is taken from a variety of reputable sources, including academic journals, product labels, and clinical trials. In addition, conditions include synonyms that aid in searches - for example, “gout flares” and “acute gout” refer to the same condition.

DrugBank started at the University of Alberta as a project to make it easier for academic researchers to get detailed, structured information about drugs. It combines detailed drug data with comprehensive drug target information. There are 13791 drug items, including 2653 small molecule drugs, 1417 biotechnology (protein / peptide) drugs, 131 nutrients and 6451 experimental drugs. Each DrugCard entry contains more than 200 data fields, half of which are used for drug / chemical data and the other half for drug target or protein data.

DrugBank supports downloading all drug information in several formats: the full database as XML, structures as SDF, externally linked data as CSV, protein identifiers as CSV, and target sequences as FASTA.

See DrugBank for more details and download links.

Here we use the processed RDF-format data from the paper "A physical embedding model for knowledge graphs", published at JIST 2019; see this link for its open-source code. Once the data has been converted to the format required by NeuralKG, the experiment can be started.
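The conversion step might look like the sketch below. It assumes the RDF data is serialized as N-Triples with one triple per line, which the article does not state, and it keeps only triples whose subject, predicate and object are all IRIs. The function name ntriples_to_tsv_lines is hypothetical.

```python
import re

# Matches an N-Triples line of the form: <subject> <predicate> <object> .
# Lines with literal or blank-node terms do not match and are skipped.
_IRI_TRIPLE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def ntriples_to_tsv_lines(lines):
    """Turn N-Triples lines into 'head<TAB>relation<TAB>tail' strings.

    Assumes N-Triples input; other RDF serializations need a real RDF parser.
    """
    converted = []
    for line in lines:
        match = _IRI_TRIPLE.match(line)
        if match is None:
            continue  # comment, blank line, or a triple with a non-IRI term
        converted.append("\t".join(match.groups()))
    return converted
```

The resulting lines still have to be split into train.txt, valid.txt and test.txt; the split used for the experiment is not described here.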

Configuration

NeuralKG provides two ways to configure parameters: edit a YAML configuration file, or pass the parameters on the command line when launching training. See the parameter description for what each parameter does. Use litmodel_name to select among Conventional KGEs, GNN-based KGEs, and Rule-based KGEs, and model_name to select a specific model.

A YAML file can be obtained by modifying one of the examples in configs; then run the model with:

python main.py --load_config --config_path <your-config.yaml>

A script file can be obtained by modifying one of the examples in scripts; then run the model with:

bash <your-script.sh>

Here, we choose three KGE models, TransE, ComplEx and RotatE, for testing.

Results

Model     MRR     Hit@1   Hit@3   Hit@10
TransE    0.054   0.003   0.080   0.149
ComplEx   0.134   0.123   0.137   0.153
RotatE    0.093   0.049   0.091   0.194
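
For readers unfamiliar with the metrics, MRR and Hit@k are standard rank-based link prediction measures: MRR is the mean of 1/rank of the correct entity over all test queries, and Hit@k is the fraction of queries whose correct entity is ranked in the top k. The sketch below shows the standard definitions only. It is not NeuralKG's evaluation code, and the function name mrr_and_hits is hypothetical.

```python
def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Compute MRR and Hit@k from 1-based ranks of the correct entities.

    Illustrative helper using the standard definitions, not NeuralKG's API.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    n = len(ranks)
    # Mean reciprocal rank: average of 1/rank over all test queries.
    mrr = sum(1.0 / r for r in ranks) / n
    # Hit@k: share of queries whose correct entity is ranked within the top k.
    hits = {k: sum(1 for r in ranks if r <= k) / n for k in ks}
    return mrr, hits
```

Both metrics lie between 0 and 1, and higher is better.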
