18 changes: 10 additions & 8 deletions JaQuAD.ipynb
@@ -125,7 +125,7 @@
" 'batch_size': 32, # <=32 for TPUv2-8\n",
" 'lr': 2e-5, # Learning Rate\n",
" 'max_length': 384, # Max Length input size\n",
-"    'doc_stride': 128,  # The interval of the context when splitting is needed\n",
+"    'doc_stride': 128,  # The overlap of the context when splitting is needed\n",
" 'epochs': 4, # Max Epochs\n",
" 'dataset': 'SkelterLabsInc/JaQuAD',\n",
" 'huggingface_auth_token': None,\n",
@@ -195,7 +195,8 @@
" val += [padding] * pad_len\n",
" return val\n",
"\n",
-"    for i in range(0, input_len - max_seq_len + stride, stride):\n",
@akeyhero (Author) commented on Mar 1, 2022:

This range will be empty when input_len <= max_seq_len - stride.
+"    step = max_seq_len - question_len - stride\n",
+"    for i in range(0, max(context_len - stride, step), step):\n",
Comment on lines +198 to +199
@akeyhero (Author) commented on Mar 1, 2022:

A stride is the length of the overlapping token sequence between spans, in the Hugging Face sense (if I am correct).

Reply:

Thank you for this comment, but we chose to keep this meaning of stride.

As you say, the stride of a Tokenizer means the length of the overlapping tokens. However, Hugging Face sometimes uses stride as the interval between two spans (e.g. squad.py). In my view, this is implementation-specific.

@akeyhero (Author) replied on Mar 1, 2022:

Thank you for your comment. That's so confusing 😭
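The two readings of "stride" debated in this thread can be sketched with toy numbers (the values and variable names below are illustrative only, not taken from the notebook):

```python
# Two readings of "stride" when splitting a long token sequence into spans.
tokens = list(range(20))
max_seq_len, stride = 10, 3

# (a) stride = number of overlapping tokens (tokenizer sense):
#     consecutive spans share `stride` tokens, so starts advance by max_seq_len - stride.
overlap_starts = list(range(0, len(tokens), max_seq_len - stride))

# (b) stride = interval between two span starts (squad.py sense):
#     starts advance by `stride` itself.
interval_starts = list(range(0, len(tokens), stride))

print(overlap_starts)   # [0, 7, 14]
print(interval_starts)  # [0, 3, 6, 9, 12, 15, 18]

# The bug noted above: under reading (a), the original loop
#   range(0, input_len - max_seq_len + stride, stride)
# is empty whenever input_len <= max_seq_len - stride.
input_len = 5
assert list(range(0, input_len - max_seq_len + stride, stride)) == []
```

With reading (a) the fixed loop steps by `max_seq_len - question_len - stride`, so every span is visited even for short inputs.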

" span = {key: make_value(val, i) for key, val in inputs.items()}\n",
" answer_start = answer_start_position - i\n",
" answer_end = answer_end_position - i\n",
@@ -482,11 +483,12 @@
"\n",
" ctx_start = tokens.index(self.tokenizer.sep_token_id) + 1\n",
" answer_start_index = ctx_start\n",
" answer_end_index = len(offsets) - 1\n",
-"        while offsets[answer_start_index][0] < start_char:\n",
+"        while offsets[answer_start_index][1] < start_char:\n",
@akeyhero (Author) commented on Mar 1, 2022:

One may not like this change, but I prefer inclusive answer chunks.

E.g. where 分間 is a single token:
Original answer: 九十分
Previous answer chunk: 九十
Proposed answer chunk: 九十分間

Reply:

When I tested both options, I found that the inclusive answer chunks performed better. Thank you.
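A minimal sketch of the inclusive selection on the 九十分 example from this thread (the `offsets` values are hypothetical, chosen to mimic the tokens 九十 and 分間; the loops mirror the proposed diff):

```python
# Character offsets (start, end) per context token; toy values for "九十" and "分間".
offsets = [(0, 2), (2, 4)]
start_char, answer = 0, "九十分"   # gold answer covers characters [0, 3)

# Proposed start search: skip tokens that END before the answer starts,
# so a token straddling the start boundary is kept.
answer_start_index = 0
while offsets[answer_start_index][1] < start_char:
    answer_start_index += 1

# Proposed end search: include every token that STARTS before the answer ends,
# so a token straddling the end boundary is also kept.
answer_end_index = answer_start_index
while answer_end_index < len(offsets) \
        and offsets[answer_end_index][0] < start_char + len(answer):
    answer_end_index += 1

# Tokens [0, 2) decode to 九十分間: an inclusive chunk containing 九十分.
print(answer_start_index, answer_end_index)  # 0 2
```

The previous exclusive end search walked back from the last token while its end offset exceeded the answer's end, which dropped the boundary-straddling token 分間 and truncated the chunk to 九十.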

" answer_start_index += 1\n",
-"        while offsets[answer_end_index][1] > start_char + len(answer):\n",
-"            answer_end_index -= 1\n",
Comment on lines -488 to -489
@akeyhero (Author) commented:

We would get an end index that is smaller by 1 when the length of the token following the answer's end is >= 2.

+"        answer_end_index = answer_start_index\n",
+"        while answer_end_index < len(offsets) \\\n",
+"                and offsets[answer_end_index][0] < start_char + len(answer):\n",
+"            answer_end_index += 1\n",
"\n",
" span_inputs = {\n",
" 'input_ids': tokens,\n",
@@ -660,7 +662,7 @@
},
"outputs": [],
"source": [
-"def get_answers(model: AutoModelForQuestionAnswering,\n",
+"def get_answers(model: QAModel,\n",
" context: str,\n",
" question: str,\n",
" n_best_size: int = 5,\n",
@@ -686,7 +688,7 @@
" 1:-1].tolist()\n",
" end_indexes = np.argsort(end_logits)[-1:-n_best_size - 1:-1].tolist()\n",
" cur_offsets = offsets[i:]\n",
-"        i += doc_stride\n",
+"        i += max_seq_len - question_len - doc_stride\n",
" for start_index in start_indexes:\n",
" for end_index in end_indexes:\n",
" if 0 < start_index <= end_index < len(cur_offsets):\n",
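The change above keeps the decoding loop's offset cursor in step with how the spans were built during encoding; a toy sketch (the concrete values are assumptions, not taken from the notebook):

```python
# If spans were built with step = max_seq_len - question_len - stride,
# the offset cursor while decoding predictions must advance by the same step.
max_seq_len, question_len, doc_stride = 384, 20, 128
step = max_seq_len - question_len - doc_stride   # 236 fresh context tokens per span

cursor = 0
span_offset_starts = []
for _ in range(3):            # e.g. three spans produced for a long context
    span_offset_starts.append(cursor)
    cursor += step            # previously `cursor += doc_stride`, desyncing the offsets

print(span_offset_starts)  # [0, 236, 472]
```

Advancing by `doc_stride` alone would have mapped the second and later spans' logits onto the wrong character offsets whenever the encode-side step differed.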