NeuralKG is a Python-based library for diverse representation learning of knowledge graphs. It implements conventional KGEs, GNN-based KGEs, and rule-based KGEs. We provide comprehensive documentation for beginners and an online website to organize an open and shared KG representation learning community.
This article uses the open source toolkit to train knowledge graph embedding models on biomolecular datasets. You only need to provide the dataset in the required format and select a model and its hyperparameters to start training.
See NeuralKG for the installation tutorial.
Data introduction and processing
NeuralKG requires the data to be processed into the following five files:
entities.dict
relations.dict
train.txt
valid.txt
test.txt
In entities.dict and relations.dict, each line has the form ID\tname, where ID is a sequence number starting from 0. In train.txt, valid.txt, and test.txt, each line has the form Head entity\tRelation\tTail entity.
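The file layout described above can be produced with a few lines of Python. The following is a minimal sketch, not part of NeuralKG: the helper names (`build_dicts`, `dict_lines`, `export_dataset`) are hypothetical, and it assumes the triples are already available as (head, relation, tail) string tuples.

```python
from pathlib import Path


def build_dicts(triples):
    """Assign consecutive IDs (starting from 0) in first-seen order."""
    entities, relations = {}, {}
    for head, rel, tail in triples:
        for ent in (head, tail):
            entities.setdefault(ent, len(entities))
        relations.setdefault(rel, len(relations))
    return entities, relations


def dict_lines(mapping):
    """Return 'ID<TAB>name' lines ordered by ID."""
    ordered = sorted(mapping.items(), key=lambda kv: kv[1])
    return [f"{idx}\t{name}" for name, idx in ordered]


def export_dataset(out_dir, train, valid, test):
    """Write the five files into out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # IDs must cover every entity and relation in all three splits.
    entities, relations = build_dicts(train + valid + test)
    files = {
        "entities.dict": dict_lines(entities),
        "relations.dict": dict_lines(relations),
        "train.txt": ["\t".join(t) for t in train],
        "valid.txt": ["\t".join(t) for t in valid],
        "test.txt": ["\t".join(t) for t in test],
    }
    for name, lines in files.items():
        (out / name).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return entities, relations
```

Calling `export_dataset("my_dataset", train, valid, test)` with three lists of triples writes the five files into a `my_dataset` directory (an illustrative name).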
DrugBank
In DrugBank, conditions are often specific medical states, including diseases, symptoms, and other health-related characteristics or problems. Conditions may also be used to describe other clinical human phenomena, such as procedures, therapies, and the presence or absence of certain genes.
Information regarding conditions is taken from a variety of reputable sources, including academic journals, product labels, and clinical trials. In addition, conditions include synonyms that aid in searches - for example, “gout flares” and “acute gout” refer to the same condition.
DrugBank started at the University of Alberta as a project to make it easier for academic researchers to get detailed, structured information about drugs. It combines detailed drug data with comprehensive drug target information. There are 13791 drug entries, including 2653 small molecule drugs, 1417 biotechnology (protein / peptide) drugs, 131 nutrients, and 6451 experimental drugs. Each DrugCard entry contains more than 200 data fields, half of which are used for drug / chemical data and the other half for drug target or protein data.
DrugBank supports downloading all drug information in different formats: XML for the full version, SDF for structures, CSV for externally linked data and for protein identifiers, and FASTA for target sequences.
See DrugBank for more details and download links.
Here we use the processed RDF-format data from the paper “A physical embedding model for knowledge graphs”, published at JIST 2019; see this link for its open source code. After the data is converted to the format required by NeuralKG, the experiment can be started.
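The conversion step could look roughly like the sketch below. It rests on assumptions not stated in the source: that the RDF data is serialized as N-Triples with IRI subjects, predicates, and objects, and that a random 80/10/10 split is acceptable when the data does not already ship with one. The function names and split fractions are hypothetical.

```python
import random
import re

# Matches an N-Triples statement whose subject, predicate, and object are IRIs.
NT_LINE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def parse_ntriples(lines):
    """Return (head, relation, tail) tuples; skip anything that does not match."""
    triples = []
    for line in lines:
        match = NT_LINE.match(line.strip())
        if match:  # comments, blank lines, and literal objects are skipped
            triples.append(match.groups())
    return triples


def split_triples(triples, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle with a fixed seed and split into train, valid, and test lists."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


def to_tsv_lines(triples):
    """Format triples as 'head<TAB>relation<TAB>tail' lines."""
    return ["\t".join(t) for t in triples]
```

Statements with literal objects are dropped on purpose in this sketch, since link prediction models operate on entity-to-entity triples; whether that matches the original preprocessing is not confirmed by the source.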
Configuration
NeuralKG provides two ways to configure parameters: use a YAML configuration file, or pass arguments on the command line during training. See the parameter description for what each parameter does. Use litmodel_name to select among conventional KGEs, GNN-based KGEs, and rule-based KGEs, and model_name to select a specific model.
You can create the YAML file by modifying one of the examples in configs, then run the model with:
python main.py --load_config --config_path <your-config.yaml>
You can create the script file by modifying one of the examples in scripts, then run the model with:
bash <your-script.sh>
Here, we train and evaluate three conventional KGE models: TransE, ComplEx, and RotatE.
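For readers unfamiliar with these three models, the sketch below shows the core scoring idea of each in plain Python. It is illustrative only and is not NeuralKG's implementation: margin terms, batching, and embedding initialization are omitted, the function names are hypothetical, and the RotatE distance here is the sum of element-wise moduli.

```python
import cmath
import math


def transe_score(h, r, t):
    """TransE: negative L2 distance between h + r and t (real vectors)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def complex_score(h, r, t):
    """ComplEx: real part of sum_i h_i * r_i * conj(t_i) (complex vectors)."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real


def rotate_score(h, phases, t):
    """RotatE: each relation element is a unit-modulus rotation exp(i * phase).

    The score is the negative sum of moduli of (h_i * r_i - t_i).
    """
    return -sum(
        abs(hi * cmath.exp(1j * phase) - ti)
        for hi, phase, ti in zip(h, phases, t)
    )
```

In all three cases a higher score means the triple is considered more plausible, which is what the ranking metrics reported in the results rely on.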
Results
Model | MRR | Hit@1 | Hit@3 | Hit@10 |
---|---|---|---|---|
TransE | 0.054 | 0.003 | 0.080 | 0.149 |
ComplEx | 0.134 | 0.123 | 0.137 | 0.153 |
RotatE | 0.093 | 0.049 | 0.091 | 0.194 |
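
The metrics in the table are standard link prediction measures: MRR is the mean reciprocal rank of the correct entity, and Hit@k is the fraction of test queries in which the correct entity is ranked in the top k. A minimal sketch of how they are computed from a list of ranks follows; the function names are hypothetical, and how the ranks themselves are obtained (for example, whether known true triples are filtered out) depends on the evaluation code and is not shown here.

```python
def mrr(ranks):
    """Mean reciprocal rank; ranks are 1-based positions of the correct entity."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked at position k or better."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def rank_of(correct_score, candidate_scores):
    """1-based rank of the correct entity: one plus the number of candidates
    scoring strictly higher (higher score means more plausible)."""
    return 1 + sum(1 for score in candidate_scores if score > correct_score)
```

Because MRR weights each query by the reciprocal of its rank while Hit@k only counts whether a threshold is met, the two can order models differently, as the ComplEx and RotatE rows show for MRR and Hit@10.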