This document provides an overview of the codebase for generating synthetic MWOs by LLMs and humanising them using rule-based approaches.
The following functionalities are implemented:
llm_generate.py: code to prepare few-shot examples, generate synthetic MWO sentences using GP-4o mini, processing of LM outputsget_all_paths(): get all stored paths from json files inpath_patternsdirectoryget_generate_prompt(): prepare prompt for LLM to generate synthetic MWO sentencesget_generate_fewshot(): prepare few-shot examples for LLMgenerate_mwo(): generate synthetic MWO sentences using LLM (simple)generate_diverse_mwo(): generate diverse synthetic MWO sentences using LLMprocess_mwo_response(): process LLM outputs of synthetic MWO sentencesget_samples(): samples paths from each path type
llm_prompt.py: code to get list of prompt variations, processing of LLM outputs, and paraphrasing the promptsinitialise_prompts(): get list of prompt variations for LLMcheck_similarity(): check similarity between prompt variationsprocess_prompt_response(): process LLM outputs of prompt variationsparaphrase_prompt(): paraphrase the prompts for LLM
diversity_experiment.ipynb: experiments for increasing the diversity of the LLM-generated MWO sentences per path- Same prompt VS Variations of prompt
- Single generation VS Batch generation
- Generation VS Paraphrasing
- You can find the few-shot examples used in the
fewshot_messagesdirectory. - Some logs of the LLM-generated MWO sentences can be found in the
mwo_sentencesdirectory.
The following functionalities are implemented:
humanise_experiment.ipynb: experiments for humanising synthetic MWO sentenceshumanise.py: rule-based approach for humanising synthetic MWO sentencesinitialise_globals(): initialise global dictionaries for humanisationload_dictionary(): load the corrections dictionary for humanisationshuffle_dictionary(): shuffle the corrections dictionaryintroduce_contractions(): introduce English contractions to synthetic MWO sentences (50% probability)introduce_abbreviations(): introduce abbreviations/jargon to synthetic MWO sentences (40% probability)rule_introduce_typos(): introduce up to 3 typos in the synthetic MWO sentenceshumanise_sentence(): apply the above rules to humanise synthetic MWO sentences