Adversarial Examples in NLP using BERT-based attacks

Tina Huang
8 min read · May 4, 2020


Review of the following article:

BAE: BERT-based Adversarial Examples for Text Classification by Siddhant Garg, Goutham Ramakrishnan


  • Adversarial example: an input crafted to resemble the training data but cause a machine learning model to produce an incorrect label when it encounters it.
  • Adversarial attack: a pipeline that generates adversarial examples.


Deep neural networks have been very popular since AlexNet’s performance brought back hope in deep learning. One long-standing challenge is to build a deep neural network that is robust at detecting and/or resisting adversarial examples. Examples of attacked applications are shown in Figure 1[1].

Figure 1. Attacked applications and Benchmark Dataset discussed by [1] in Table 4

Who does the problem affect, and is it a big problem?

Black box attacks were originally introduced in the computer vision domain, where it is easier to mimic original data by infusing fuzzy data (noise) into the continuous pixel intensities. Unlike computer vision, natural language processing uses discrete words as tokens, and it is often challenging to swap out tokens without changing the semantics (meaning of the sentence), pragmatics (underlying sentiments and implications), and syntax of a sentence.

Why did the authors do what they did?

Siddhant Garg and Goutham Ramakrishnan propose a method for generating adversarial examples in NLP using a BERT-based attack that retains semantic meaning while achieving a high degree of attack strength.


What is new?

Based on the survey review, last revised in October 2019, the more commonly attacked deep neural networks (DNNs) are LSTMs and CNNs[1]. Less has been invested in attacking models like BERT, and no prior work used a BERT-based attack method before this novel BERT-based adversarial attack.


Why is what the authors did new and worthy of publication?

The authors propose a new approach to generating adversarial examples using BERT-predicted tokens, which retains the semantic meaning of the sentence while producing natural-looking sentences. Unlike other pipelines, it uses a dynamic (contextual) embedding space rather than a fixed embedding space, and the proposed attacks have differing attack strengths depending on needs.


What did the authors do?

The authors propose two types of perturbations/tokens/“changes” that can be introduced into a sentence. One type, token t, replaces a word in the sentence, while the other, token t’, is inserted while preserving the other words in the sentence. This is formally written in the following notation. (Note: a word in this context is not literally just ONE word. It could be an inseparable word phrase.)

Formal definition of tokens t and t’ in sentence S

There are 4 attack modes constructed from combinations of the two perturbation types discussed earlier: only replacements (BAE-R), only insertions (BAE-I), replacement or insertion (BAE-R/I), and both replacement and insertion (BAE-R+I). Each attack mode is not limited to a single perturbation, i.e., one token per sentence, because one might not be sufficient to produce a successful attack. Following the attack mode, tokens are perturbed iteratively in ranked order (highest to lowest importance) until the sentence becomes an adversarial example. Table 3 illustrates the 4 attack modes.
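The two basic perturbation types can be sketched in a few lines of toy code (my own illustration, not the authors’ implementation; a “sentence” here is just a list of word tokens):

```python
# Toy sketch of BAE's two perturbation types on a tokenized sentence.

def perturb_replace(tokens, i, t):
    """BAE-R style: replace the token at position i with candidate t."""
    return tokens[:i] + [t] + tokens[i + 1:]

def perturb_insert(tokens, i, t):
    """BAE-I style: insert candidate t to the right of position i,
    preserving every original token."""
    return tokens[:i + 1] + [t] + tokens[i + 1:]

s = ["the", "movie", "was", "good"]
print(perturb_replace(s, 3, "decent"))   # ['the', 'movie', 'was', 'decent']
print(perturb_insert(s, 3, "overall"))   # ['the', 'movie', 'was', 'good', 'overall']
```

BAE-R/I picks one of these per perturbation, while BAE-R+I may apply both to the same sentence.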

Example of attacks by replacing and insertion

The complete experiment including their innovative steps?

Just earlier, we explained the 4 attack modes. Now let’s talk about how this experiment is actually implemented.

BAE-R Pseudo code explained

(If you understand the pseudo code above, feel free to jump down a bit)

Given a sentence S with ground-truth label y and classifier C, this algorithm outputs an adversarial example S_adv if one exists. When evaluating the tokens in S, the algorithm perturbs the highest-importance tokens first to increase efficiency. There are many ways to do importance ranking, e.g., genetic algorithms, deletion, replacement. This paper uses masking.

Essentially, important tokens have a higher influence on the prediction task than less important ones (e.g., the determiner “the”). By masking, we replace a word in S with a mask token and observe how the prediction changes; the greater the change, the greater the token’s importance. There is another masking step later, but it is different from the one used for importance ranking.
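The masking-based importance ranking can be sketched as follows (a toy illustration: `toy_positive_prob` is a hypothetical stand-in for the real classifier C, not the authors’ code):

```python
# Minimal sketch of mask-based token importance.

def toy_positive_prob(tokens):
    """Stand-in classifier: probability the sentence is 'positive'.
    It only reacts to a couple of sentiment words."""
    score = 0.5
    if "good" in tokens:
        score += 0.4
    if "terrible" in tokens:
        score -= 0.4
    return score

def token_importance(tokens, classify):
    """Importance of token i = drop in the predicted probability when
    token i is replaced by a [MASK] placeholder."""
    base = classify(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        scores.append((base - classify(masked), i))
    # Highest-importance tokens first, as in the BAE pipeline.
    return sorted(scores, reverse=True)

ranking = token_importance(["the", "movie", "was", "good"], toy_positive_prob)
print(ranking[0][1])  # index of the most important token -> 3 ("good")
```

Masking “good” moves the toy prediction the most, so that position is attacked first.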

Going back to the algorithm, the token importance step returns the token indices ranked in descending order. In the loop, a masked sentence is generated at a token position. The BERT model then predicts a set of top-K tokens, T, for the masked position. Unlike the previous mask, this masking step is used as a “target” for finding “similar” tokens as captured in the BERT embedding space. Note that BERT embeddings are not perfect either: a filter on the set is required to remove candidates t ∈ T that generate ungrammatical syntax or are antonyms.
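The candidate-prediction and filtering step can be sketched like this (toy code: `toy_topk_predictions` and the tiny antonym table are hypothetical stand-ins for BERT’s masked language model and the paper’s grammar/antonym filters):

```python
# Sketch of top-K candidate generation and filtering at a masked position.

def toy_topk_predictions(tokens, mask_pos, k=3):
    """Stand-in for BERT's top-K predictions at the masked position."""
    return ["great", "bad", "fine"][:k]

ANTONYMS = {"good": {"bad"}}  # tiny hypothetical antonym table

def filter_candidates(original_word, candidates):
    """Drop candidates that are antonyms of the word being replaced;
    BAE additionally filters ungrammatical candidates (e.g. by POS)."""
    return [t for t in candidates
            if t not in ANTONYMS.get(original_word, set())]

tokens = ["the", "movie", "was", "good"]
cands = toy_topk_predictions(tokens, mask_pos=3)
print(filter_candidates("good", cands))  # ['great', 'fine']
```

Filtering out “bad” matters because an antonym would flip the label by changing the meaning, which is exactly what an adversarial example must avoid.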

Using each t ∈ T, the algorithm generates a set of new sentences L by keeping the original unmasked tokens and placing t at the mask’s position. By feeding the new sentences L into the classifier, we can compute C(L(t)), determine whether the attack is successful by comparing against y, and return the sentence if so. If no successful attack is found in that round, the best unsuccessful adversarial example carries over to the next round for another perturbation. Eventually, the algorithm loops through all combinations to find a misclassified case if one exists.
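One round of this candidate-evaluation loop can be sketched as follows (my own simplification of the paper’s BAE-R pseudocode, with the same hypothetical toy classifier as above):

```python
# Sketch of one BAE-R round: try candidates, return on success, else
# carry the best unsuccessful attempt into the next round.

def toy_positive_prob(tokens):
    """Hypothetical stand-in for the real classifier C."""
    score = 0.5
    if "good" in tokens:
        score += 0.4
    if "mediocre" in tokens:
        score -= 0.2
    return score

def bae_r_round(tokens, y_true, mask_pos, candidates, classify):
    """Try each candidate at mask_pos; return (success, best_sentence).
    On failure, keep the candidate that lowered the true-class
    probability the most, to perturb further in the next round."""
    best, best_prob = tokens, classify(tokens)
    for t in candidates:
        s_new = tokens[:mask_pos] + [t] + tokens[mask_pos + 1:]
        prob = classify(s_new)
        pred = "positive" if prob >= 0.5 else "negative"
        if pred != y_true:
            return True, s_new          # successful adversarial example
        if prob < best_prob:
            best, best_prob = s_new, prob
    return False, best                  # carry best attempt forward

ok, s_adv = bae_r_round(["the", "movie", "was", "good"], "positive",
                        3, ["great", "mediocre"], toy_positive_prob)
print(ok, s_adv)  # True ['the', 'movie', 'was', 'mediocre']
```

In the real pipeline the candidates come from BERT’s filtered top-K predictions and C is the attacked model; the control flow is the same.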

(Note: t may be a phrase of words rather than a single token. In the insertion algorithm, the mask is placed between the words of the sentence.)


(Let’s resume for those who skipped)

Now that we can generate adversarial examples, the paper targets 3 different models, Word-LSTM, Word-CNN, and BERT, on 7 datasets as illustrated in Table 1. The attacks are evaluated on text classification tasks, including sentiment classification, subjectivity detection, and question type classification; the quantitative metrics are the attacked accuracy percentage under maximal perturbation, grammatical correctness, and sentiment accuracy.

The BAE attacks are compared to TEXTFOOLER. Long story short, TEXTFOOLER is very similar to the current paper’s approach, except that it uses fixed vector embeddings from Nikola Mrkšić’s paper, a fixed number of synonyms (50), and a fixed word-similarity threshold of 0.80. Quality is judged by 2 human evaluators assessing the grammatical correctness and sentiment of the sentences.
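TEXTFOOLER’s fixed-embedding similarity filter can be sketched as follows (a toy illustration with hypothetical 3-dimensional embeddings; the real system uses high-dimensional counter-fitted word vectors and the same 0.80 threshold):

```python
# Sketch of TEXTFOOLER-style synonym filtering with fixed embeddings.
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical fixed embeddings for a few words.
EMB = {
    "good":  [0.9, 0.1, 0.2],
    "great": [0.85, 0.15, 0.25],
    "table": [0.1, 0.9, 0.3],
}

def similar_enough(w1, w2, threshold=0.80):
    """Accept a substitute only if its fixed-embedding similarity
    to the original word clears the threshold."""
    return cosine(EMB[w1], EMB[w2]) >= threshold

print(similar_enough("good", "great"))  # True
print(similar_enough("good", "table"))  # False
```

Because these embeddings never change with context, TEXTFOOLER can miss context-appropriate substitutes that BAE’s dynamic BERT predictions capture.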

Results and Discussion

How’d they do?

This paper shows that at least 1 BAE attack mode beats TEXTFOOLER at lowering accuracy in Table 2. The results also show that the 4 attack modes have different attack strengths: BAE-R+I > BAE-R/I > BAE-R or BAE-I. BAE-R+I attacks best because natural language often uses different phrasal variations for paraphrasing, which also makes the result look more natural and suffer less from grammatical incorrectness. More importantly, BAE-R+I is consistently the strongest attack across all datasets and models, AND still holds similar average semantic similarity with the original sentence.

From Figure 2, they show that BAE-R+I generally does better at lowering test accuracy while using a lower perturbation percentage. What does that mean? It spends less effort finding an example, and it is more effective.

The article also shows that the adversarial examples look more natural and paraphrase more accurately. However, using 2 evaluators might not be sufficient, because judgments of grammatical correctness and sentiment are quite subjective. The paper does not discuss the amount of disagreement between the human evaluators (e.g., a standard deviation), although with only 2 evaluators and no way to break ties, there is admittedly little point in doing so.

Did they achieve what they said they would?

Yes, the method is more efficient (less perturbation needed) and more effective (lower attacked accuracy) at generating adversarial examples.


  • The authors propose 4 effective and efficient BERT-based attack modes, simply using insertions and replacements of tokens.
  • The adversarial examples look more natural than the baseline’s, which is preferable for adversarial training.
  • We could leverage these attack modes to generate adversarial examples for attack applications in NLP.

Related Works:

[1] Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey is a great review paper for grasping what has been researched and which areas remain untouched but need exploration. It is highly encouraged for beginners who need to grasp the bigger picture quickly.

[2] Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment by Di Jin et al. is the original paper that proposes TEXTFOOLER, which is used here as the baseline attack (its fixed word embeddings come from Nikola Mrkšić’s counter-fitting work). This paper also performs an ablation study on the effectiveness of importance ranking and the need for synonym constraints. It is a great paper for gaining insight into why these factors increase the adversarial examples’ effectiveness and robustness.

[3] Besides text classification in general NLP domains, it is also important to leverage existing techniques to improve the detection of adversarial examples in other domains, e.g., the medical domain, where they could support fraudulent billing activities, as discussed in Adversarial attacks on medical machine learning.

[4] Extending to biomedical decision making, vulnerabilities in biomedical NLP could be devastating for medical decision tasks: if an AI proposes a wrong decision, it will negatively impact patients’ health. Adversarial Examples for Biomedical NLP Tasks, submitted recently in April 2020, discusses adversarial examples generated by a BERT-based model in BioNLP. It may be worth comparing and contrasting the two techniques. If the two pipelines are similar, we could evaluate the transferability of these attack modes; if not, we could apply the BAE attack pipeline to BioNLP tasks.

[5] Adversarial Learning of Knowledge Embeddings for the Unified Medical Language System (May 2019) discusses how adversarial examples improve the precision and effectiveness of knowledge embeddings through adversarial training. Perhaps we could improve task performance there by applying these BAE attacks.