Knowledge Graphs (KGs) consist of a large number of entities and relations among them, represented as typed edges. The goal of the Question Answering over KG (KGQA) task is to answer natural language queries posed over the KG.
Knowledge graph embedding methods can effectively support multi-hop KGQA by reducing KG sparsity through missing-link prediction. In the paper Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings, accepted to ACL 2020, researchers proposed EmbedKGQA, which utilizes the link prediction properties of KG embeddings to mitigate the KG incompleteness problem without using any additional data, overcoming the limited-neighborhood-size constraint imposed by existing multi-hop KGQA methods.
Here, we use NeuralKG to accomplish the embedding-training step of EmbedKGQA, and show the whole procedure of training EmbedKGQA based on NeuralKG.
Data Processing
- We download the code from the GitHub repo of EmbedKGQA (we use commit c101d58), and download data.zip from the Google Drive link provided in the README.md;
- We put the directory `data` into `EmbedKGQA-master`;
- We transform the data format in `data/MetaQA_half` (we use MetaQA_half as an example) into the data format used by NeuralKG. Specifically, in the original `entities.dict` and `relations.dict`, the indices are in the second column, as follows:
```
# relations.dict
directed_by	0
directed_by_reverse	1
has_genre	2
has_genre_reverse	3
```
So we put the indices into the first column instead (a small conversion script is sketched after the example):
```
# relations.dict
0	directed_by
1	directed_by_reverse
2	has_genre
3	has_genre_reverse
```
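A minimal Python sketch of this index swap, assuming the original dict files are tab-separated `name<TAB>index` lines; the output directory name here is only illustrative:

```python
import os

def swap_columns(src_path, dst_path):
    # rewrite a dict file so the index comes first and the name second
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            name, idx = line.rstrip('\n').split('\t')
            dst.write(f'{idx}\t{name}\n')

os.makedirs('MetaQA_half_neuralkg', exist_ok=True)  # illustrative output directory
for fname in ['entities.dict', 'relations.dict']:
    swap_columns(f'data/MetaQA_half/{fname}', f'MetaQA_half_neuralkg/{fname}')
```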
- Furthermore, some entities in MetaQA_half contain space characters, and the fields of the triples in the original `train.txt`, `valid.txt`, and `test.txt` are separated by tab characters. We replace the spaces inside entity names with `_` and replace the tab separators in the triples with spaces; a small cleanup script is sketched below.
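A minimal sketch of this cleanup, again assuming tab-separated `head<TAB>relation<TAB>tail` triples and an illustrative output directory; if `entities.dict` keeps the original names with spaces, the same `_` replacement is presumably needed there as well so the triples still match the dictionary:

```python
import os

def clean_triples(src_path, dst_path):
    # replace spaces inside entity names with '_' and separate fields with a single space
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            head, rel, tail = line.rstrip('\n').split('\t')
            dst.write(f"{head.replace(' ', '_')} {rel} {tail.replace(' ', '_')}\n")

os.makedirs('MetaQA_half_neuralkg', exist_ok=True)  # illustrative output directory
for split in ['train.txt', 'valid.txt', 'test.txt']:
    clean_triples(f'data/MetaQA_half/{split}', f'MetaQA_half_neuralkg/{split}')
```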
Embedding Training
- We put the processed MetaQA_half dataset into the `dataset` directory of the NeuralKG code downloaded from its GitHub repo;
- We write the following script and run it (these may not be the best hyperparameters):
```bash
MODEL_NAME=ComplEx
DATASET_NAME=MetaQA_half
DATA_DIR=dataset  # the NeuralKG dataset directory holding the processed MetaQA_half
DATA_PATH=$DATA_DIR/$DATASET_NAME
LITMODEL_NAME=KGELitModel
MAX_EPOCHS=1000
EMB_DIM=200
LOSS=Adv_Loss
ADV_TEMP=1.0
TRAIN_BS=1024
EVAL_BS=16
NUM_NEG=64
MARGIN=200.0
LR=5e-3
REGULARIZATION=1e-5
CHECK_PER_EPOCH=20
NUM_WORKERS=16
GPU=3

CUDA_VISIBLE_DEVICES=$GPU python -u src/main.py \
    --model_name $MODEL_NAME \
    --dataset_name $DATASET_NAME \
    --data_path $DATA_PATH \
    --litmodel_name $LITMODEL_NAME \
    --max_epochs $MAX_EPOCHS \
    --emb_dim $EMB_DIM \
    --loss $LOSS \
    --adv_temp $ADV_TEMP \
    --train_bs $TRAIN_BS \
    --eval_bs $EVAL_BS \
    --num_neg $NUM_NEG \
    --margin $MARGIN \
    --lr $LR \
    --regularization $REGULARIZATION \
    --check_per_epoch $CHECK_PER_EPOCH \
    --num_workers $NUM_WORKERS \
    --save_config
```
- Finally, we get the best model checkpoint in `./output/link_prediction/MetaQA_half/ComplEx`, and we read out the entity and relation embeddings as follows:
```python
import torch
import numpy as np

# load the best checkpoint saved by NeuralKG (fill in the actual file name)
model = torch.load('epoch=xxx-Eval|mrr=x.xxx.ckpt')
rel_emb = model['state_dict']['model.rel_emb.weight']
ent_emb = model['state_dict']['model.ent_emb.weight']
np.save('R.npy', rel_emb.cpu().numpy())
np.save('E.npy', ent_emb.cpu().numpy())
```
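As an optional sanity check (not part of the original pipeline), the number of rows in `E.npy` and `R.npy` should match the number of lines in `entities.dict` and `relations.dict`:

```python
import numpy as np

# optional sanity check: embedding row counts should match the dict files
ent_emb = np.load('E.npy')
rel_emb = np.load('R.npy')
with open('entities.dict') as f:
    num_ent = sum(1 for _ in f)
with open('relations.dict') as f:
    num_rel = sum(1 for _ in f)
assert ent_emb.shape[0] == num_ent, (ent_emb.shape, num_ent)
assert rel_emb.shape[0] == num_rel, (rel_emb.shape, num_rel)
print('entity embeddings:', ent_emb.shape, 'relation embeddings:', rel_emb.shape)
```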
Run EmbedKGQA
- In the code of EmbedKGQA, we replace the code starting at line 286 in `./KGQA/LSTM/main.py` with:
```python
hops = args.hops
if hops in ['1', '2', '3']:
    hops = hops + 'hop'
if args.kg_type == 'half':
    data_path = '../../data/QA_data/MetaQA/qa_train_' + hops + '_half.txt'
else:
    data_path = '../../data/QA_data/MetaQA/qa_train_' + hops + '.txt'
print('Train file is ', data_path)

hops_without_old = hops.replace('_old', '')
valid_data_path = '../../data/QA_data/MetaQA/qa_dev_' + hops_without_old + '.txt'
test_data_path = '../../data/QA_data/MetaQA/qa_test_' + hops_without_old + '.txt'

model_name = args.model
kg_type = args.kg_type
print('KG type is', kg_type)
# embedding_folder = '../../pretrained_models/embeddings/' + model_name + '_MetaQA_' + kg_type
embedding_folder = '../../embedding'

entity_embedding_path = embedding_folder + '/E.npy'
relation_embedding_path = embedding_folder + '/R.npy'
entity_dict = embedding_folder + '/entities.dict'
relation_dict = embedding_folder + '/relations.dict'
# w_matrix = embedding_folder + '/W.npy'
w_matrix = None

bn_list = []
# for i in range(3):
#     bn = np.load(embedding_folder + '/bn' + str(i) + '.npy', allow_pickle=True)
#     bn_list.append(bn.item())
```
- We comment out the code related to batch normalization in `./KGQA/LSTM/model.py` (lines 92-105, line 139, and line 179), since we don't use any batch normalization during embedding training.
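If you prefer not to delete those lines by hand, a hedged alternative with a similar effect is to swap any `BatchNorm1d` modules for identity mappings after the model is built; this is only a sketch, and the exact module layout of EmbedKGQA's model may differ:

```python
import torch.nn as nn

def strip_batchnorm(module):
    # recursively replace BatchNorm1d layers with identity mappings
    # so no batch normalization is applied anywhere in the model
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm1d):
            setattr(module, name, nn.Identity())
        else:
            strip_batchnorm(child)
    return module
```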
- We run the command line as follows:
```bash
python main.py --mode train --relation_dim 200 --hidden_dim 256 \
    --gpu 2 --freeze 0 --batch_size 128 --validate_every 5 --hops 2 --lr 0.0005 --entdrop 0.1 --reldrop 0.2 --scoredrop 0.2 \
    --decay 1.0 --model ComplEx --patience 5 --ls 0.0 --kg_type half
```
Finally, we get a test result of 83.85.