This is the repo for the ENSAE class *Machine Learning for Natural Language Processing*.
Main Instructor: Pierre COLOMBO colombo.pierre@gmail.com
- A project on a sequence labelling task (Dialogue Act and Emotion/Sentiment classification). The pre-trained RoBERTa and DeBERTa models were used to tackle this task, and contextual embeddings pre-trained on written corpora proved useful for spoken language. You can also find the paper on OpenReview.
  Team members: Yunhao CHEN, Hélène RONDEY
- Four labs that walk through the basic NLP pipeline. Some interesting points:
  - The `pandas-profiling` library automatically generates an analysis report.
  - TensorFlow Embedding Projector: interactive visualization and analysis of high-dimensional data (PCA, t-SNE, etc.).
  - `torchinfo.summary` can display a model summary, as in TensorFlow.
  - `sklearn.metrics.classification_report`: builds a text report showing the main classification metrics.
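
  A minimal sketch of how these two utilities can be called (the toy model and labels here are made up):

  ```python
  import torch.nn as nn
  from sklearn.metrics import classification_report
  from torchinfo import summary

  # A toy classifier, just to have something to summarize
  model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 4))
  summary(model, input_size=(32, 300))  # (batch_size, input_dim)

  # Per-class precision / recall / F1 in a readable text report
  y_true = [0, 1, 2, 2, 1]
  y_pred = [0, 1, 1, 2, 1]
  print(classification_report(y_true, y_pred))
  ```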
- `gensim` library: an unsupervised toolkit for learning vector representations of text and topics, including TF-IDF, LSA, LDA, word2vec, etc. Worth spending some time on; refer to its documentation: https://radimrehurek.com/gensim/apiref.html
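
  For instance, a minimal word2vec sketch with gensim (the toy corpus is made up; `vector_size` is the gensim 4 name of the dimension parameter):

  ```python
  from gensim.models import Word2Vec

  # gensim expects an iterable of tokenized sentences
  sentences = [["natural", "language", "processing"],
               ["deep", "learning", "for", "language"]]

  model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=20)
  print(model.wv["language"].shape)         # (50,)
  print(model.wv.most_similar("language"))  # nearest neighbours in this toy space
  ```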
- Stop words: words to be filtered out (i.e. "stopped") before or after NLP processing. There is no universal list of stop words.
  `nltk.download('stopwords')` provides a list of stop words. We usually combine it with punctuation, which we can get from `from string import punctuation` and convert to a list with `list(punctuation)`.
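
  For example:

  ```python
  import nltk
  from nltk.corpus import stopwords
  from string import punctuation

  nltk.download('stopwords')  # only needed once

  # Combine English stop words with punctuation marks
  stop_tokens = set(stopwords.words('english')) | set(punctuation)

  tokens = ["this", "is", ",", "a", "small", "example", "!"]
  print([t for t in tokens if t.lower() not in stop_tokens])  # ['small', 'example']
  ```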
- `gensim.models.phrases`: automatically detects common phrases (e.g. multi-word expressions, word n-gram collocations) from a stream of sentences. For example, we would like "New York" to be one expression instead of "New" and "York".
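
  A small sketch (the thresholds are set low only because the toy corpus is tiny):

  ```python
  from gensim.models.phrases import Phrases

  sentences = [["i", "love", "new", "york"],
               ["new", "york", "is", "huge"],
               ["new", "york", "never", "sleeps"]]

  bigram = Phrases(sentences, min_count=1, threshold=1)
  print(bigram[["i", "love", "new", "york"]])  # ['i', 'love', 'new_york'] with these settings
  ```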
- NLTK Punkt: a sentence tokenizer that divides a text into a list of sentences. The NLTK data package includes a pre-trained Punkt tokenizer for English.
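
  For example:

  ```python
  import nltk
  from nltk.tokenize import sent_tokenize

  nltk.download('punkt')  # pre-trained Punkt models, only needed once

  text = "Dr. Smith went to New York. He arrived at 5 p.m. and it was raining."
  print(sent_tokenize(text))  # a list of sentences; Punkt knows "Dr." is an abbreviation
  ```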
- The `torchtext` package is part of PyTorch and provides data processing utilities for NLP. We can load pre-trained token vectors via `GloVe` or `FastText`.
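
  A minimal sketch with `GloVe` (the first call downloads the 6B-token vectors, which are large):

  ```python
  from torchtext.vocab import GloVe

  glove = GloVe(name="6B", dim=100)

  # Look up vectors for a few tokens; unknown tokens map to zero vectors by default
  vecs = glove.get_vecs_by_tokens(["hello", "world"])
  print(vecs.shape)  # torch.Size([2, 100])
  ```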
- Text cleaning process (a sketch combining these steps follows below):
  - Divide samples into sentences, using NLTK Punkt.
  - Tokenize each sentence, using tokenizers in `nltk.tokenize`, then clean the obtained tokens (e.g. remove the HTML tags...).
  - Detect and combine multi-word expressions, using `Phrases`.
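
  A rough sketch of such a pipeline (the HTML-stripping regex and the Phrases thresholds are illustrative choices):

  ```python
  import re

  import nltk
  from gensim.models.phrases import Phrases
  from nltk.tokenize import sent_tokenize, word_tokenize

  nltk.download('punkt')

  def clean_document(raw_text):
      """Split a raw document into cleaned, tokenized sentences."""
      text = re.sub(r"<[^>]+>", " ", raw_text)           # crude HTML tag removal
      return [[tok.lower() for tok in word_tokenize(s)]  # 2. tokenize + lowercase
              for s in sent_tokenize(text)]              # 1. split into sentences

  docs = ["<p>I love New York. New York never sleeps.</p>"]
  tokenized_sentences = [sent for doc in docs for sent in clean_document(doc)]

  # 3. detect multi-word expressions over the whole corpus
  phrases = Phrases(tokenized_sentences, min_count=1, threshold=1)
  print([phrases[sent] for sent in tokenized_sentences])
  ```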
- Creating a `vocab` instance from an `OrderedDict`. We can then insert the special tokens (UNK, PAD...), set the default index, etc. Combining the vocab instance with a tokenizer, one can represent a sentence as a list of indices, and possibly pad it. Refer to the section `Sequence Classification` in the lab 3 notebook. Note: a pre-trained `AutoTokenizer` from transformers performs this step automatically (it converts text to a list of indices and pads it). An `AutoTokenizer` is always paired with its corresponding pre-trained model, which contains the appropriate embedding layer. One can then feed this list of indices into a pre-trained embedding layer.
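
  A minimal sketch with `torchtext.vocab.vocab` (the token counts and special-token names here are illustrative):

  ```python
  from collections import Counter, OrderedDict

  from torchtext.vocab import vocab

  # Token frequencies, e.g. computed over the tokenized training set
  counter = Counter(["the", "cat", "sat", "on", "the", "mat"])
  ordered_dict = OrderedDict(sorted(counter.items(), key=lambda x: x[1], reverse=True))

  v = vocab(ordered_dict, min_freq=1)
  v.insert_token("<pad>", 0)       # special tokens
  v.insert_token("<unk>", 1)
  v.set_default_index(v["<unk>"])  # unknown tokens map to <unk>

  print([v[t] for t in ["the", "dog", "sat"]])  # "dog" is out-of-vocabulary -> index of <unk>
  ```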
  PS: `torch.nn.Embedding.from_pretrained` allows loading pre-trained embeddings into a (customized) model.
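
  For example (`freeze=False` lets the vectors be fine-tuned; the random matrix stands in for real pre-trained vectors):

  ```python
  import torch
  import torch.nn as nn

  # Pretend these are pre-trained vectors aligned with our vocabulary (vocab_size x dim)
  pretrained = torch.randn(1000, 100)

  embedding = nn.Embedding.from_pretrained(pretrained, freeze=False, padding_idx=0)
  token_ids = torch.tensor([[1, 5, 7, 0]])  # a padded sentence of indices
  print(embedding(token_ids).shape)         # torch.Size([1, 4, 100])
  ```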
- Learn how to build a vocabulary from scratch: define `stoi` (string to index), `itos` (index to string) and the special tokens, and add the special tokens into sentences. Attention! A vocabulary is defined over tokens, not words: we always start from the tokenized text!
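
  A from-scratch sketch (the special-token names are a common convention, not necessarily the lab's exact ones):

  ```python
  class Vocabulary:
      """Toy vocabulary built from tokenized sentences (tokens, not raw words!)."""

      def __init__(self, tokenized_sentences, specials=("<pad>", "<unk>", "<sos>", "<eos>")):
          self.itos = list(specials)
          for sent in tokenized_sentences:
              for tok in sent:
                  if tok not in self.itos:
                      self.itos.append(tok)
          self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}

      def encode(self, tokens):
          # Wrap the sentence with special tokens, map OOV tokens to <unk>
          tokens = ["<sos>"] + list(tokens) + ["<eos>"]
          return [self.stoi.get(tok, self.stoi["<unk>"]) for tok in tokens]

  voc = Vocabulary([["hello", "world"], ["hello", "nlp"]])
  print(voc.encode(["hello", "unknown"]))  # the OOV token gets the <unk> index
  ```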
- Padding the text at two levels: dialogue level and sentence level. Refer to the section `Sequence Classification with Conversational Context` in lab 3.
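
  A rough sketch of the idea (the function and argument names here are mine, not the lab's):

  ```python
  def pad_dialogue(dialogue, pad_id=0, max_len=None, max_turns=None):
      """Pad a dialogue (a list of utterances, each a list of token ids) at both levels."""
      max_len = max_len or max(len(u) for u in dialogue)  # sentence level
      max_turns = max_turns or len(dialogue)              # dialogue level
      # Pad (or truncate) every utterance to max_len tokens
      padded = [u[:max_len] + [pad_id] * (max_len - len(u)) for u in dialogue]
      # Pad the dialogue itself with empty utterances up to max_turns
      padded += [[pad_id] * max_len] * (max_turns - len(padded))
      return padded[:max_turns]

  dialogue = [[4, 8, 15], [16, 23], [42]]
  print(pad_dialogue(dialogue, max_len=4, max_turns=4))
  ```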
- `input()` is a built-in function that lets the user enter input interactively.
- Use `multiprocessing.cpu_count()` to get the number of available CPU cores, in order to set `n_workers`. Setting the number of workers to the number of cores is a good rule of thumb.
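
  For example, with a PyTorch `DataLoader` (assuming `train_dataset` is an existing `Dataset`):

  ```python
  import multiprocessing

  from torch.utils.data import DataLoader

  n_workers = multiprocessing.cpu_count()  # one worker per core as a rule of thumb

  train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                            num_workers=n_workers)
  ```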
- `nvidia-smi` (bash command) to check GPU info; `lscpu` to check CPU info.
- `nn.CrossEntropyLoss` lets you define class weights to handle imbalanced data: the less frequent a class, the more weight it gets (here `batch`, `args`, and `device` come from the surrounding training loop).

  ```python
  from collections import Counter
  import torch
  import torch.nn as nn

  # Count how often each label appears in the current batch
  labels = batch['label'].detach().cpu().tolist()
  b_counter = Counter(labels)
  # Inverse-frequency weights: rarer classes get larger weights (0 if absent from the batch)
  b_weights = torch.tensor([len(labels) / b_counter[label] if b_counter[label] > 0 else 0.0
                            for label in range(args['num_class'])])
  b_weights = b_weights.to(device)
  loss_function = nn.CrossEntropyLoss(weight=b_weights)
  ```

- BERT uses `AdamW` as its optimizer, a variant of Adam with decoupled weight decay (rather than plain L2 regularization). Pay attention to this when fine-tuning BERT. Moreover, BERT uses a linear `lr_scheduler` with warm-up steps.
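
  A typical fine-tuning sketch (assuming a Hugging Face model whose forward pass returns a loss, plus `train_loader` and `num_epochs` already defined; the 10% warm-up ratio is just a common choice):

  ```python
  import torch
  from transformers import get_linear_schedule_with_warmup

  optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

  num_training_steps = len(train_loader) * num_epochs
  scheduler = get_linear_schedule_with_warmup(
      optimizer,
      num_warmup_steps=int(0.1 * num_training_steps),
      num_training_steps=num_training_steps,
  )

  for epoch in range(num_epochs):
      for batch in train_loader:
          loss = model(**batch).loss  # the batch is assumed to contain the labels
          loss.backward()
          optimizer.step()
          scheduler.step()            # the linear schedule is updated every step
          optimizer.zero_grad()
  ```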