
NLP ENSAE 2023

This is the repository for the ENSAE course Machine Learning for Natural Language Processing.

Main Instructor: Pierre COLOMBO colombo.pierre@gmail.com


What have we learned?

  1. A project on the sequence labelling tasks of Dialogue Act and Emotion/Sentiment classification. Pre-trained RoBERTa and DeBERTa models were used to tackle these tasks, and contextual embeddings pre-trained on written corpora proved useful for spoken language as well. The paper is also available on OpenReview.

    Team members: Yunhao CHEN, Hélène RONDEY

  2. Four labs covering the basic NLP pipeline. Some interesting points are summarized below:


Visualization tools

  • pandas-profiling library: automatically generates an analysis report of a DataFrame
  • TensorFlow Embedding Projector: interactive visualization and analysis of high-dimensional data (PCA, t-SNE, etc.)
  • torchinfo.summary displays a model summary similar to Keras's model.summary() in TensorFlow
  • sklearn.metrics.classification_report: builds a text report showing the main classification metrics
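
A minimal sketch of the last two tools (the toy model, input size, and labels below are placeholders, not from the labs):

```python
import torch
import torch.nn as nn
from torchinfo import summary
from sklearn.metrics import classification_report

# Keras-style summary of an arbitrary PyTorch model
model = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(), nn.Linear(64 * 10, 4))
summary(model, input_size=(32, 10), dtypes=[torch.long])

# Text report with per-class precision, recall and F1
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]
print(classification_report(y_true, y_pred))
```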

NLP packages

  • gensim library: unsupervised tools to learn vector representations of topics in text, including TF-IDF, LSA, LDA, word2vec, etc. It is worth spending some time on. Refer to its documentation: https://radimrehurek.com/gensim/apiref.html

  • Stop words: words to be filtered out (i.e. stopped) before or after NLP processing. There is no universal list of stop words.

    nltk.download('stopwords') provides a list of stop words. We usually combine it with punctuation, obtained via from string import punctuation and converted with list(punctuation). A combined sketch follows this list.

  • gensim.models.phrases: automatically detects common phrases (e.g. multi-word expressions, word n-gram collocations) from a stream of sentences, e.g. treating "New York" as one expression instead of "New" and "York".

  • nltk punkt: a sentence tokenizer that divides a text into a list of sentences. The NLTK data package includes a pre-trained Punkt tokenizer for English.

  • The torchtext package is part of PyTorch, consisting of data processing utilities for NLP. Pre-trained token vectors can be loaded via GloVe or FastText.
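
A minimal sketch combining these tools on a toy corpus (the example text, thresholds, and filtering choices are illustrative, not the labs' settings):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from string import punctuation
from gensim.models.phrases import Phrases
from gensim.models import Word2Vec

nltk.download('punkt')
nltk.download('stopwords')

text = "I love New York. New York is never boring!"

# Stop words plus punctuation to filter out
to_filter = set(stopwords.words('english')) | set(punctuation)

# Punkt sentence splitting, then word tokenization and cleaning
sentences = [word_tokenize(s.lower()) for s in sent_tokenize(text)]
sentences = [[tok for tok in sent if tok not in to_filter] for sent in sentences]

# Detect multi-word expressions such as "new_york"
phrases = Phrases(sentences, min_count=1, threshold=1)
print([phrases[sent] for sent in sentences])

# Learn word vectors with gensim word2vec on the (tiny) corpus
w2v = Word2Vec(sentences, vector_size=50, min_count=1)
print(w2v.wv.most_similar("new", topn=2))
```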


Pipeline in NLP

  • Text cleaning process (see the sketch in the NLP packages section above):

    1. Split the samples into sentences, using nltk punkt.
    2. Tokenize each sentence, using tokenizers from nltk.tokenize, then clean the obtained tokens (e.g. remove HTML tags).
    3. Detect and merge multi-word expressions, using Phrases.
  • Create a vocab instance from an OrderedDict. Then we can insert the special tokens (UNK, PAD, ...), set the default index, etc.

    Combining the vocab instance and a tokenizer, one can represent a sentence as a list of indices and, if needed, pad it. Refer to the section Sequence Classification in the lab 3 notebook. Note: the pre-trained AutoTokenizer in transformers performs this step automatically (converting text to a list of indices and padding it). AutoTokenizer is always paired with its corresponding pre-trained model, which contains the appropriate embedding layer.

    Then, one can feed this list of indices into a pre-trained embedding layer (see the sketch below).

    PS: torch.nn.Embedding.from_pretrained loads pre-trained embeddings into a (customized) model.
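
A minimal sketch with torchtext (assuming a recent torchtext version; the toy tokens, special-token names, and GloVe dimension are illustrative):

```python
import torch
import torch.nn as nn
from collections import Counter, OrderedDict
from torchtext.vocab import vocab as build_vocab, GloVe

# Toy corpus of already-tokenized sentences
tokenized = [["new_york", "is", "great"], ["i", "love", "new_york"]]

# Build an OrderedDict of token frequencies, then a Vocab with special tokens
counts = OrderedDict(Counter(tok for sent in tokenized for tok in sent).most_common())
voc = build_vocab(counts, specials=["<unk>", "<pad>"])
voc.set_default_index(voc["<unk>"])

# Sentence -> list of indices, padded to the longest sentence in the batch
ids = [voc(sent) for sent in tokenized]
padded = nn.utils.rnn.pad_sequence(
    [torch.tensor(x) for x in ids], batch_first=True, padding_value=voc["<pad>"]
)

# Feed the indices into a pre-trained embedding layer (GloVe downloads on first use)
glove = GloVe(name="6B", dim=100)
emb = nn.Embedding.from_pretrained(glove.get_vecs_by_tokens(voc.get_itos()), freeze=True)
out = emb(padded)  # shape: (batch, seq_len, 100)
```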

  • Learn how to build a vocabulary from scratch: define stoi (string to index) and itos (index to string) mappings plus the special tokens, and add the special tokens into the sentences. A sketch follows.

    Attention! A vocabulary is defined over tokens, not words: we always start from the tokenized text!
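
A minimal from-scratch sketch (the token lists and special-token names are illustrative), starting from tokenized text:

```python
from collections import Counter

tokenized = [["i", "love", "new_york"], ["new_york", "is", "great"]]
specials = ["<pad>", "<unk>", "<bos>", "<eos>"]

# itos: index -> token; stoi: token -> index; special tokens come first
itos = specials + sorted(Counter(tok for sent in tokenized for tok in sent))
stoi = {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens):
    """Wrap the sentence with special tokens, then map unknown tokens to <unk>."""
    tokens = ["<bos>"] + tokens + ["<eos>"]
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

print(encode(["i", "love", "paris"]))
```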

  • Pad the text at two levels: dialogue level and sentence level (a sketch follows). Refer to the section Sequence Classification with Conversational Context in lab 3.
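
A minimal two-level padding sketch (the index values and padding index are illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 0
# A batch of 2 dialogues; each dialogue is a list of tokenized utterances (as index lists)
dialogues = [
    [[5, 6, 7], [8, 9]],            # dialogue 1: 2 utterances
    [[10, 11], [12], [13, 14, 15]]  # dialogue 2: 3 utterances
]

# Sentence level: pad each utterance to the longest utterance in the batch
max_len = max(len(u) for d in dialogues for u in d)
padded_utts = [
    torch.stack([torch.tensor(u + [PAD] * (max_len - len(u))) for u in d])
    for d in dialogues
]

# Dialogue level: pad each dialogue to the same number of utterances
batch = pad_sequence(padded_utts, batch_first=True, padding_value=PAD)
print(batch.shape)  # (n_dialogues, max_n_utterances, max_utterance_len) -> (2, 3, 3)
```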


About Python

  • input() is a built-in function that reads input interactively from the user.

  • Use multiprocessing.cpu_count() to get the number of available CPU cores when setting the number of workers; setting it to the number of cores is a good rule of thumb (see the snippet after this list).

  • The nvidia-smi shell command shows GPU info; lscpu shows CPU info.
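
A minimal example with a PyTorch DataLoader (the dataset below is a placeholder):

```python
import multiprocessing
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))
loader = DataLoader(dataset, batch_size=16,
                    num_workers=multiprocessing.cpu_count())  # one worker per CPU core
```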

About Training

  • nn.CrossEntropyLoss accepts per-class weights to handle imbalanced data: the less frequent a class, the larger its weight.
```python
from collections import Counter
import torch
import torch.nn as nn

# Per-class weights inversely proportional to the class frequency in the batch
labels = batch['label'].detach().cpu().tolist()
b_counter = Counter(labels)
b_weights = torch.tensor([len(labels) / b_counter[label] if b_counter[label] > 0 else 0.0
                          for label in range(args['num_class'])])
b_weights = b_weights.to(device)

loss_function = nn.CrossEntropyLoss(weight=b_weights)
```
  • BERT fine-tuning uses AdamW as the optimizer, a variant of Adam with decoupled weight decay (rather than plain L2 regularization). Pay attention to this when fine-tuning BERT. Moreover, BERT uses a linear learning-rate scheduler with warm-up steps.
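
A minimal fine-tuning setup sketch, assuming the transformers library; the model name, learning rate, and step counts are placeholders:

```python
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

# AdamW: Adam with decoupled weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# Linear schedule: warm up for the first steps, then decay linearly to zero
num_training_steps = 1000
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
)

# Inside the training loop, after each batch:
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```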
