Sampling for BERT
    
  
In this tutorial we provide the list of steps required to prepare samples with text opinions for the BERT language model.
NOTE: This documentation page represents an updated version of the post "Process Mass-Media relations for Language Models with AREkit"; in the prior one we describe the sampling process from scratch and under the older API version AREkit-0.22.0.
First, it is necessary to declare the labels expected to be adopted in the further samples preparation process.
In this post we focus on sentiment-related data sampling and therefore consider the following set of labels: Positive, Negative, and additionally the neutral one of type NoLabel, which AREkit provides by default.
class Positive(Label):
    pass

class Negative(Label):
    pass

Next, we declare a label scaler.
A scaler (the BaseLabelScaler class) allows us to provide conversion from a Label type to int/uint values and vice versa.
We declare the sentiment scaler as follows:
class SentimentLabelScaler(BaseLabelScaler):
    def __init__(self):
        int_to_label = OrderedDict([(NoLabel(), 0), (Positive(), 1), (Negative(), -1)])
        uint_to_label = OrderedDict([(NoLabel(), 0), (Positive(), 1), (Negative(), 2)])
        super(SentimentLabelScaler, self).__init__(int_to_label, uint_to_label)
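Once declared, the scaler can be used for conversions in both directions. Below is a minimal usage sketch, assuming label_to_uint and uint_to_label are the conversion methods exposed by BaseLabelScaler:

scaler = SentimentLabelScaler()
# Label instance to unsigned value adopted in output samples (assumed API).
print(scaler.label_to_uint(Negative()))  # expected: 2
# And back: unsigned value to Label instance (assumed API).
print(scaler.uint_to_label(1))           # expected: a Positive instance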
In terms of its input, the BERT model deals with a sequence optionally separated by a [SEP] token into two parts: TextA and TextB.
For the classification task, TextB may be treated as a prompt with auxiliary information that might be considered in the resulting class decision.
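Schematically, following the standard BERT convention for sequence pairs, the serialized input is composed as follows:

[CLS] TextA [SEP] TextB [SEP]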
For the sentiment analysis and relation extraction domain, you may examine more approaches in the Awesome Sentiment Attitude Extraction Repository.
At present, the text_b template is expected to contain placeholders for subject, object, and context, where context corresponds to the text part between subject and object.
For texts in Russian, we assign the following NLI-styled (Natural Language Inference) prompt:
NOTE: you may leave text_b_template as None if you don't want to consider a separate sequence.
text_b_template = '{subject} к {object} в контексте : << {context} >>'
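For English texts, an analogous template might look as follows; this variant is purely illustrative and not part of the original tutorial:

# An assumed English analogue of the Russian NLI-styled prompt above.
text_b_template = '{subject} towards {object} in the context : << {context} >>'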
Next, we focus on the text provider.
First, there is a need to set up a terms mapper.
Terms mappers allow us to customize the way terms are displayed in samples.
AREkit provides BertDefaultStringTextTermsMapper, which, among all the different term types, allows you to customize the formatting of mentioned named entities.
For the latter we have a separate wiki page:
Entity Values Formatting Examples.
From that tutorial, here we adopt CustomEntitiesFormatter and assign the #S and #O masks to the text opinion participants, i.e. the subject and object respectively.
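For instance, given a hypothetical text opinion with subject X and object Y, this masking turns the first line below into the second:

X criticized Y over the recent decision.
#S criticized #O over the recent decision.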
Depending on the text_b_template, we may declare either a single text provider (i.e. TextA only) or a pair-based one:
terms_mapper = BertDefaultStringTextTermsMapper(
    entity_formatter=CustomEntitiesFormatter(
        subject_fmt="#S", object_fmt="#O"))
text_provider = BaseSingleTextProvider(terms_mapper) \
    if text_b_template is None else \
        PairTextProvider(text_b_template, terms_mapper)

Finally, we may compose the sample rows provider:
sample_rows_provider = BaseSampleRowProvider(
    label_provider=MultipleLabelProvider(SentimentLabelScaler()),
    text_provider=text_provider)

Next, we initialize the information related to the samples format and the output directory/path.
As for the format, there is a need to declare a type inherited from BaseWriter.
By default, AREkit provides TsvWriter, which is a CSV-style formatter.
Side note: the tsv title prefix comes from the format proposed by google-BERT.
writer = TsvWriter(write_header=True)
samples_io = SamplesIO("out/", writer, target_extension=".tsv.gz")

With the samples format and output settings prepared, we declare the serialization pipeline item:

pipeline_item = BertExperimentInputSerializerPipelineItem(
    sample_rows_provider=sample_rows_provider,
    samples_io=samples_io,
    # Save labels for all data types.
    save_labels_func=lambda data_type: True,
    # Perform class balancing only for the Train data type.
    balance_func=lambda data_type: data_type == DataType.Train)

Please refer to the following posts in order to initialize your text opinion annotation pipeline (annot_pipeline) and set up Data Folding (data_folding):
and setup Data Folding (data_folding):
Or just follow the complete tutorial implementation.
Finally, we can compose the pipeline by wrapping the predefined pipeline_item and then run it!
This could be accomplished as follows:
pipeline = BasePipeline([
    pipeline_item
])
pipeline.run(input_data=None,
             params_dict={
                 "data_folding": data_folding,
                 "data_type_pipelines": annot_pipeline 
             })Finally our result is a content of the out directory.
The contents depend on the Data Folding format.
For example, in the case of a fixed folding onto the Train and Test data types, it is expected to see the following set of contents:
./out/
    sample_train.tsv.gz
    sample_test.tsv.gz
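To quickly inspect the serialized samples, one might load them with pandas (a sketch; the exact column set depends on the AREkit version):

import pandas as pd

# Read the gzipped, tab-separated samples produced by the pipeline above.
df = pd.read_csv("out/sample_train.tsv.gz", sep="\t", compression="gzip")
print(df.head())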