A knowledge graph (KG) consists of a large number of entities and the relations among them, represented as typed edges. The goal of the question answering over KG (KGQA) task is to answer natural language queries posed over the KG.

Knowledge graph embedding methods support multi-hop KGQA effectively: by predicting missing links, they reduce KG sparsity. In the ACL 2020 paper Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings, the researchers proposed EmbedKGQA, which uses the link prediction properties of KG embeddings to mitigate the KG incompleteness problem without any additional data, and which overcomes the limited neighborhood size constraint imposed by existing multi-hop KGQA methods.
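
To make the link prediction idea concrete, here is a minimal sketch of the ComplEx scoring function, the embedding model trained later in this walkthrough. The function names and the use of NumPy complex arrays are assumptions made for illustration; this is not code from NeuralKG or EmbedKGQA.

```python
import numpy as np


def complex_score(head, relation, tail):
    """ComplEx score Re(<h, r, conj(t)>) for three complex-valued vectors."""
    return float(np.real(np.sum(head * relation * np.conj(tail))))


def rank_candidates(head, relation, candidates):
    """Return the row indices of `candidates`, best-scoring tail first.

    `candidates` is an (n, d) complex array with one candidate tail per row.
    """
    scores = np.real((head * relation) @ np.conj(candidates).T)
    return np.argsort(-scores)
```

A higher score means the triple is considered more plausible, so ranking all entities as candidate tails is one way to predict a missing link.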

Here, we use NeuralKG to accomplish the step of embedding training in EmbedKGQA, and show the whole procedure of training EmbedKGQA based on NeuralKG.

Data Processing

  1. We download the code from the GitHub repo of EmbedKGQA (we use commit c101d58), and download the data from the Google Drive link provided in that repo;
  2. We put the directory data into EmbedKGQA-master;
  3. We transform the data in data/MetaQA_half (we use MetaQA_half as an example) into the data format used by NeuralKG. Specifically, in the original entities.dict and relations.dict, the indices are in the second column, as follows:
    # relations.dict
    directed_by  0
    directed_by_reverse  1
    has_genre    2
    has_genre_reverse    3

    So we move the indices into the first column, like this:

    # relations.dict
    0 directed_by
    1 directed_by_reverse
    2 has_genre
    3 has_genre_reverse
  4. Furthermore, some entity names in MetaQA_half contain spaces, and the fields of each triple in the original `train.txt`, `valid.txt` and `test.txt` are separated by tabs. We replace the spaces inside entity names with `_`, and replace the tabs between triple fields with spaces.
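
The conversions in steps 3 and 4 can be sketched as below. This is a minimal sketch, not the script we actually used: the function names are hypothetical, and it assumes that the original dict files are tab-separated `name<TAB>index` lines and that a single space is an acceptable separator in the converted files.

```python
def convert_dict_line(line, sep=" "):
    """'name<TAB>index' -> 'index<sep>name', with spaces in the name replaced by '_'."""
    name, index = line.rstrip("\n").rsplit("\t", 1)
    return f"{index.strip()}{sep}{name.strip().replace(' ', '_')}"


def convert_triple_line(line):
    """'head<TAB>relation<TAB>tail' -> 'head relation tail', spaces in names become '_'."""
    fields = line.rstrip("\n").split("\t")
    return " ".join(field.strip().replace(" ", "_") for field in fields)


def convert_file(src, dst, convert_line):
    """Apply `convert_line` to every non-empty line of `src` and write the result to `dst`."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():
                fout.write(convert_line(line) + "\n")


# Example usage (paths are placeholders):
# convert_file("MetaQA_half/entities.dict", "out/entities.dict", convert_dict_line)
# convert_file("MetaQA_half/train.txt", "out/train.txt", convert_triple_line)
```

Applying the same underscore replacement to the dict files and to the triples keeps the entity names consistent between them.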

Embedding Training

  1. We put the processed MetaQA_half dataset into the dataset directory of the NeuralKG code downloaded from its GitHub repo;
  2. We write the following script and run it (these may not be the best hyperparameters).
    CUDA_VISIBLE_DEVICES=$GPU python -u src/ \
       --model_name $MODEL_NAME \
       --dataset_name $DATASET_NAME \
       --data_path $DATA_PATH \
       --litmodel_name $LITMODEL_NAME \
       --max_epochs $MAX_EPOCHS \
       --emb_dim $EMB_DIM \
       --loss $LOSS \
       --adv_temp $ADV_TEMP \
       --train_bs $TRAIN_BS \
       --eval_bs $EVAL_BS \
       --num_neg $NUM_NEG \
       --margin $MARGIN \
       --lr $LR \
       --regularization $REGULARIZATION \
       --check_per_epoch $CHECK_PER_EPOCH \
       --num_workers $NUM_WORKERS \
       --save_config
  3. Finally, we get the best model checkpoint in ./output/link_prediction/MetaQA_half/ComplEx, and we read out the entity and relation embeddings as follows:
    import torch
    import numpy as np
    model = torch.load('epoch=xxx-Eval|')
    rel_emb = model['state_dict']['model.rel_emb.weight']
    ent_emb = model['state_dict']['model.ent_emb.weight']
    # Save the relation and entity embeddings as .npy files for EmbedKGQA
    np.save('R.npy', rel_emb.cpu().numpy())
    np.save('E.npy', ent_emb.cpu().numpy())
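
Before handing the embeddings to EmbedKGQA, it can be worth checking that each saved matrix has exactly one row per entry in the corresponding dict file. The sketch below is a hypothetical helper, not part of NeuralKG or EmbedKGQA, and it assumes the converted `index name` dict format described in Data Processing.

```python
import numpy as np


def read_index_dict(lines):
    """Parse 'index name' lines into a {index: name} mapping."""
    id2name = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        index, name = line.split(maxsplit=1)
        id2name[int(index)] = name
    return id2name


def rows_match_dict(embeddings, id2name):
    """True when there is exactly one embedding row for each index 0..n-1."""
    n_rows = embeddings.shape[0]
    return len(id2name) == n_rows and set(id2name) == set(range(n_rows))


# Example usage (file names as in the steps above):
# with open("entities.dict", encoding="utf-8") as f:
#     assert rows_match_dict(np.load("E.npy"), read_index_dict(f))
```

A mismatch here usually means the dict files and the embeddings come from differently processed copies of the dataset.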

Run EmbedKGQA

  1. In the code of EmbedKGQA, we replace the code starting at line 286 in ./KGQA/LSTM/ with:
    hops = args.hops
    if hops in ['1', '2', '3']:
       hops = hops + 'hop'
    if args.kg_type == 'half':
       data_path = '../../data/QA_data/MetaQA/qa_train_' + hops + '_half.txt'
    else:
       data_path = '../../data/QA_data/MetaQA/qa_train_' + hops + '.txt'
    print('Train file is ', data_path)
    hops_without_old = hops.replace('_old', '')
    valid_data_path = '../../data/QA_data/MetaQA/qa_dev_' + hops_without_old + '.txt'
    test_data_path = '../../data/QA_data/MetaQA/qa_test_' + hops_without_old + '.txt'
    model_name = args.model
    kg_type = args.kg_type
    print('KG type is', kg_type)
    # embedding_folder = '../../pretrained_models/embeddings/' + model_name + '_MetaQA_' + kg_type
    embedding_folder = '../../embedding'
    entity_embedding_path = embedding_folder + '/E.npy'
    relation_embedding_path = embedding_folder + '/R.npy'
    entity_dict = embedding_folder + '/entities.dict'
    relation_dict = embedding_folder + '/relations.dict'
    # w_matrix =  embedding_folder + '/W.npy'
    w_matrix = None
    bn_list = []
    # for i in range(3):
    #     bn = np.load(embedding_folder + '/bn' + str(i) + '.npy', allow_pickle=True)
    #     bn_list.append(bn.item())
  2. We comment out the code related to batchnorm in ./KGQA/LSTM/, since we do not use any batchnorm during embedding training. This includes lines 92-105, line 139 and line 179.
  3. We run the following command:
    python --mode train --relation_dim 200 --hidden_dim 256 \
    --gpu 2 --freeze 0 --batch_size 128 --validate_every 5 --hops 2 --lr 0.0005 --entdrop 0.1 --reldrop 0.2  --scoredrop 0.2 \
    --decay 1.0 --model ComplEx --patience 5 --ls 0.0 --kg_type half

Finally, we get a test result of 83.85.