Knowledge Graphs (KGs) consist of a large number of entities and relations among them, represented as typed edges. The goal of the Question Answering over KG (KGQA) task is to answer natural language queries posed over the KG.
Knowledge graph embedding methods can effectively support multi-hop KGQA by reducing KG sparsity through missing-link prediction. In the paper Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings, accepted to ACL 2020, researchers proposed EmbedKGQA, which utilizes the link prediction properties of KG embeddings to mitigate the KG incompleteness problem without using any additional data, overcoming the limited-neighborhood-size constraint imposed by existing multi-hop KGQA methods.
Here, we use NeuralKG to accomplish the embedding-training step of EmbedKGQA, and show the whole procedure of training EmbedKGQA based on NeuralKG.
Data Processing
- We download the code from the GitHub repo of EmbedKGQA (we use commit c101d58), and download data.zip from the Google Drive link provided in the README.md;
- We put the directory `data` into `EmbedKGQA-master`;
- We transform the data format in `data/MetaQA_half` (we use MetaQA_half as an example) into the data format used by NeuralKG. Specifically, in the original `entities.dict` and `relations.dict`, the indices are in the second column, as follows:
```
# relations.dict
directed_by	0
directed_by_reverse	1
has_genre	2
has_genre_reverse	3
```
So we put the indices into the first column instead (a small conversion script is sketched after the example):
```
# relations.dict
0	directed_by
1	directed_by_reverse
2	has_genre
3	has_genre_reverse
```
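A minimal Python sketch of this index swap, assuming the original dict files are tab-separated `name<TAB>index` lines; the output directory name here is only illustrative:

```python
import os

def swap_columns(src_path, dst_path):
    # rewrite a dict file so the index comes first and the name second
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            name, idx = line.rstrip('\n').split('\t')
            dst.write(f'{idx}\t{name}\n')

os.makedirs('MetaQA_half_neuralkg', exist_ok=True)  # illustrative output directory
for fname in ['entities.dict', 'relations.dict']:
    swap_columns(f'data/MetaQA_half/{fname}', f'MetaQA_half_neuralkg/{fname}')
```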
- Furthermore, some entities in MetaQA_half contain space characters, and the fields of the triples in the original `train.txt`, `valid.txt`, and `test.txt` are separated by tab characters. We replace the spaces inside entity names with `_` and replace the tab separators in the triples with spaces; a small cleanup script is sketched below.
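A minimal sketch of this cleanup, again assuming tab-separated `head<TAB>relation<TAB>tail` triples and an illustrative output directory; if `entities.dict` keeps the original names with spaces, the same `_` replacement is presumably needed there as well so the triples still match the dictionary:

```python
import os

def clean_triples(src_path, dst_path):
    # replace spaces inside entity names with '_' and separate fields with a single space
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            head, rel, tail = line.rstrip('\n').split('\t')
            dst.write(f"{head.replace(' ', '_')} {rel} {tail.replace(' ', '_')}\n")

os.makedirs('MetaQA_half_neuralkg', exist_ok=True)  # illustrative output directory
for split in ['train.txt', 'valid.txt', 'test.txt']:
    clean_triples(f'data/MetaQA_half/{split}', f'MetaQA_half_neuralkg/{split}')
```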
Embedding Training
- We put the processed MetaQA_half dataset into the `dataset` directory of the NeuralKG code downloaded from its GitHub repo;
- We write the following script and run it (these may not be the best hyperparameters):
```bash
MODEL_NAME=ComplEx
DATASET_NAME=MetaQA_half
DATA_DIR=dataset  # the NeuralKG dataset directory holding the processed MetaQA_half
DATA_PATH=$DATA_DIR/$DATASET_NAME
LITMODEL_NAME=KGELitModel
MAX_EPOCHS=1000
EMB_DIM=200
LOSS=Adv_Loss
ADV_TEMP=1.0
TRAIN_BS=1024
EVAL_BS=16
NUM_NEG=64
MARGIN=200.0
LR=5e-3
REGULARIZATION=1e-5
CHECK_PER_EPOCH=20
NUM_WORKERS=16
GPU=3

CUDA_VISIBLE_DEVICES=$GPU python -u src/main.py \
    --model_name $MODEL_NAME \
    --dataset_name $DATASET_NAME \
    --data_path $DATA_PATH \
    --litmodel_name $LITMODEL_NAME \
    --max_epochs $MAX_EPOCHS \
    --emb_dim $EMB_DIM \
    --loss $LOSS \
    --adv_temp $ADV_TEMP \
    --train_bs $TRAIN_BS \
    --eval_bs $EVAL_BS \
    --num_neg $NUM_NEG \
    --margin $MARGIN \
    --lr $LR \
    --regularization $REGULARIZATION \
    --check_per_epoch $CHECK_PER_EPOCH \
    --num_workers $NUM_WORKERS \
    --save_config
```
- Finally, we get the best model checkpoint in `./output/link_prediction/MetaQA_half/ComplEx`, and we read out the entity and relation embeddings as follows:
```python
import torch
import numpy as np

# load the best checkpoint saved by NeuralKG (fill in the actual file name)
model = torch.load('epoch=xxx-Eval|mrr=x.xxx.ckpt')
rel_emb = model['state_dict']['model.rel_emb.weight']
ent_emb = model['state_dict']['model.ent_emb.weight']
np.save('R.npy', rel_emb.cpu().numpy())
np.save('E.npy', ent_emb.cpu().numpy())
```
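As an optional sanity check (not part of the original pipeline), the number of rows in `E.npy` and `R.npy` should match the number of lines in `entities.dict` and `relations.dict`:

```python
import numpy as np

# optional sanity check: embedding row counts should match the dict files
ent_emb = np.load('E.npy')
rel_emb = np.load('R.npy')
with open('entities.dict') as f:
    num_ent = sum(1 for _ in f)
with open('relations.dict') as f:
    num_rel = sum(1 for _ in f)
assert ent_emb.shape[0] == num_ent, (ent_emb.shape, num_ent)
assert rel_emb.shape[0] == num_rel, (rel_emb.shape, num_rel)
print('entity embeddings:', ent_emb.shape, 'relation embeddings:', rel_emb.shape)
```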
Run EmbedKGQA
- In the code of EmbedKGQA, we replace the code starting at line 286 in `./KGQA/LSTM/main.py` with:
```python
hops = args.hops
if hops in ['1', '2', '3']:
    hops = hops + 'hop'
if args.kg_type == 'half':
    data_path = '../../data/QA_data/MetaQA/qa_train_' + hops + '_half.txt'
else:
    data_path = '../../data/QA_data/MetaQA/qa_train_' + hops + '.txt'
print('Train file is ', data_path)

hops_without_old = hops.replace('_old', '')
valid_data_path = '../../data/QA_data/MetaQA/qa_dev_' + hops_without_old + '.txt'
test_data_path = '../../data/QA_data/MetaQA/qa_test_' + hops_without_old + '.txt'

model_name = args.model
kg_type = args.kg_type
print('KG type is', kg_type)
# embedding_folder = '../../pretrained_models/embeddings/' + model_name + '_MetaQA_' + kg_type
embedding_folder = '../../embedding'

entity_embedding_path = embedding_folder + '/E.npy'
relation_embedding_path = embedding_folder + '/R.npy'
entity_dict = embedding_folder + '/entities.dict'
relation_dict = embedding_folder + '/relations.dict'
# w_matrix = embedding_folder + '/W.npy'
w_matrix = None

bn_list = []
# for i in range(3):
#     bn = np.load(embedding_folder + '/bn' + str(i) + '.npy', allow_pickle=True)
#     bn_list.append(bn.item())
```
- We comment out the code related to batch normalization in `./KGQA/LSTM/model.py` (lines 92-105, line 139, and line 179), since we don't use any batch normalization during embedding training.
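If you prefer not to delete those lines by hand, a hedged alternative with a similar effect is to swap any `BatchNorm1d` modules for identity mappings after the model is built; this is only a sketch, and the exact module layout of EmbedKGQA's model may differ:

```python
import torch.nn as nn

def strip_batchnorm(module):
    # recursively replace BatchNorm1d layers with identity mappings
    # so no batch normalization is applied anywhere in the model
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm1d):
            setattr(module, name, nn.Identity())
        else:
            strip_batchnorm(child)
    return module
```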
- We run the command line as follows:
```bash
python main.py --mode train --relation_dim 200 --hidden_dim 256 \
    --gpu 2 --freeze 0 --batch_size 128 --validate_every 5 --hops 2 --lr 0.0005 --entdrop 0.1 --reldrop 0.2 --scoredrop 0.2 \
    --decay 1.0 --model ComplEx --patience 5 --ls 0.0 --kg_type half
```
Finally, we get a test result of 83.85.