Building a sufficiently large, high-quality dataset for natural language processing is often a significant challenge, especially for individual researchers and students. The dataset must be diverse enough that models trained on it do not overfit, and achieving this demands considerable effort and patience.
Data augmentation, which creates new training examples by transforming existing ones, is commonly used to prevent overfitting, where a model focuses excessively on minor details in the training data. It can also make a model easier to understand: by adjusting the transformations applied to the data, we can observe how the model responds and improve it accordingly.
This week, we discuss one popular family of data augmentation methods: rule-based augmentation. In particular, Easy Data Augmentation (EDA), introduced by Wei and Zou [1], consists of four operations:
- Synonym Replacement (SR): Replacing words in a sentence with synonyms. For example, "I am climbing a tree" can become "I am ascending a tree."
- Random Insertion (RI): Inserting a synonym of a randomly chosen word at a random position in the sentence to enrich it and change its structure.
- Random Swap (RS): Randomly swapping the positions of two words in the sentence to create a new variation. For example, "I go swimming" might become "I swimming go."
- Random Deletion (RD): Randomly removing some words from the sentence to produce a shorter variant, ideally without changing its overall meaning.
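The four operations above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation from [1]: the `SYNONYMS` table is a hypothetical toy dictionary (a real pipeline would likely draw synonyms from a thesaurus such as WordNet), and the function names and parameters `n` and `p` are chosen here for illustration only.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "climbing": ["ascending", "scaling"],
    "big": ["large", "huge"],
}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng):
    """n times: insert a synonym of a randomly chosen word at a random position."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in SYNONYMS]
        if not candidates:
            break
        new_word = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), new_word)
    return out


def random_swap(words, n, rng):
    """n times: swap the words at two distinct random positions."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]


rng = random.Random(0)  # fixed seed so runs are repeatable
sentence = "I am climbing a big tree".split()
print(" ".join(synonym_replacement(sentence, 1, rng)))
print(" ".join(random_insertion(sentence, 1, rng)))
print(" ".join(random_swap(sentence, 1, rng)))
print(" ".join(random_deletion(sentence, 0.2, rng)))
```

Each function returns a new list and leaves the input untouched, so several augmented variants can be generated from the same original sentence.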
Rule-based augmentation relies on knowledge of language and grammar, and it can be made as simple or as complex as the data augmentation goals require.
It is important to note that language comprehension and vocabulary categorization skills are crucial when designing rule-based data augmentation, and a well-structured vocabulary can lead to significant improvements. For instance, "I am jogging" is closer in meaning to "I am swimming" than to "I am screaming," so replacing "jogging" with "swimming" is the safer substitution. Attention to such details can lead to effective and transparent data augmentation results.
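One way to act on this observation is to group the vocabulary into categories and only substitute words within the same category. The sketch below assumes a small hand-built `VOCAB` table; the category names, word lists, and the function `replace_within_category` are hypothetical and exist only to illustrate the idea.

```python
import random

# Hypothetical vocabulary grouped by category; the groupings are illustrative only.
VOCAB = {
    "exercise": ["jogging", "swimming", "cycling"],
    "vocal": ["screaming", "shouting", "whispering"],
}

# Reverse lookup: word -> category name.
WORD_TO_CATEGORY = {w: cat for cat, ws in VOCAB.items() for w in ws}


def replace_within_category(words, rng):
    """Replace each known word with a different word from the same category."""
    out = []
    for w in words:
        cat = WORD_TO_CATEGORY.get(w)
        if cat is None:
            out.append(w)  # unknown words are kept unchanged
        else:
            out.append(rng.choice([x for x in VOCAB[cat] if x != w]))
    return out


rng = random.Random(0)
print(" ".join(replace_within_category("I am jogging".split(), rng)))
```

With this constraint, "I am jogging" can become "I am swimming" or "I am cycling" but never "I am screaming," because "screaming" belongs to a different category.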
Further information: https://www.facebook.com/dsociety.uit.ise/posts/pfbid0TBhKRirm6x6C7DQeDb...
Written by: Ha Bang
Translated by: Ngoc Diem