White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
Authors
Abstract
Adversarial examples are important for understanding the behavior of neural models, and can improve their robustness through adversarial training. Recent work in natural language processing generated adversarial examples by assuming white-box access to the attacked model, and optimizing the input directly against it (Ebrahimi et al., 2018). In this work, we show that the knowledge implicit in the optimization procedure can be distilled into another more efficient neural network. We train a model to emulate the behavior of a white-box attack and show that it generalizes well across examples. Moreover, it reduces adversarial example generation time by 19x-39x. We also show that our approach transfers to a black-box setting, by attacking The Google Perspective API and exposing its vulnerability. Our attack flips the API-predicted label in 42% of the generated examples, while humans maintain high-accuracy in predicting the gold label.
1 Introduction
Adversarial examples (Goodfellow et al., 2014) have gained tremendous attention recently, as they elucidate model limitations, and expose vulnerabilities in deployed systems. Work in natural language processing (NLP) either (a) used simple heuristics for generating adversarial examples (Jia and Liang, 2017; Belinkov and Bisk, 2017; , or (b) assumed white-box access, where the attacker has access to gradients of the model with respect to the input Ebrahimi et al., 2018) . In this approach, adversarial examples are constructed through an optimization process that uses gradient descent to search for input examples that maximally change the predictions of a model. However, developing attacks with only black-box access to a model (no access to gradients) is still under-explored in NLP. We train our black-box attacker, DISTFLIP, to emulate the white-box attack. (B) We use DISTFLIP to attack a black-box model.
Inspired by work in computer vision (Papernot et al., 2016; Liu et al., 2017) , we show in this work that a neural network can learn to emulate the optimization process of a white-box attack and generalize well to new examples. Figure 1 gives a high-level overview of our approach. We assume a text classification model and a white-box attack that flips characters in the input to modify the model prediction (Ebrahimi et al., 2018) . We generate output adversarial examples using the white-box attack and train a neural network from these inputoutput examples to imitate the white-box attack. This results in a much more efficient attack whose run-time is independent of the optimization process. Moreover, assuming adversarial examples transfer between different models, our distilled model can now be used to generate adversarial examples for black-box attacks directly.
We use our approach to attack a toxicity classifier, aimed at detecting toxic language on social media (Hosseini et al., 2017) . We show that our model achieves a speed-up of 19x-39x in generating adversarial examples while maintaining similar attack quality, compared to an optimization-based method. We then use our model for a black-box attack against Google Perspective API for detecting toxic sentences, and find that 42% of our generated sentences are misclassified by the API, while humans agree that the sentences are toxic.
Our code can be downloaded from http:// github.com/orgoro/white-2-black.
2 Background
Adversarial examples have been extensively used recently in NLP for probing and understanding neural models (Jia and Liang, 2017; Weber et al., 2018) . Methods for generating such examples include adding random or heuristically constructed noise (Belinkov and Bisk, 2017; Rodriguez and Rojas-Galeano, 2018; Gao et al., 2018) , meaningpreserving modifications that change the surface form Ribeiro et al., 2018) , and human-in-the-loop generation . A weakness of such models is that they do not directly try to modify the prediction of a model, which can reduce efficacy (Kurakin et al., 2016) . In the white-box setting, Feng et al. 2018have changed the meaning of an input without changing model output using access to gradients, and Ebrahimi et al. (2018) proposed HOTFLIP, the aforementioned white-box attack that we emulate for flipping input characters.
In computer vision, Papernot et al. (2016) and Liu et al. (2017) have shown that adversarial examples generated by a white-box model can be helpful for a black-box attack. Generating adversarial text examples is more challenging than adversarial images, because images are points in a continuous space, and thus it is easier to apply norm restrictions. Text examples have a discrete structure and thus such approaches have not been investigated for adversarial text generation yet.
3 Hotflip
HOTFLIP (Ebrahimi et al., 2018 ) is a white-box method for generating adversarial examples for a character-level neural model. It uses the gradient with respect to a 1-hot input representation to estimate the character flip that incurs the highest cost. We briefly describe HOTFLIP, which we use to generate training examples for our distilled model.
Let x = ((x 1 1 , . . . , x n 1 ), . . . , (x 1 m , . . . , x n m )
) be a sentence represented as a sequence of m characters, encoded as 1-hot vectors over a vocabulary of size n. Define L(x, y) to be the loss of a trained model for the input x with respect to a label y.
HOTFLIP requires one function evaluation (forward pass) and one gradient computation (backward pass) to compute a first-order estimate of the best possible character flip in x. Flipping the i th character from a to b can be represented by this vector:
− → v i b = (. . . , (0, . . . , −1, . . . , 1, . . . , 0) i , . . . )
, where −1 and 1 are in the positions for the a th and b th characters respectively. A first-order estimate of the change in loss can be obtained by computing ∇ x L(x, y) with back-propagation, and taking a directional derivative along − →
v i b : ∇− → v i b L(x, y) = ∇ x L(x, y) • − → v i b .
We can now choose the character-flip a to b that maximizes this estimate using the gradients with respect to the input x:
arg max i,b [∇L(x, y) • − → v i b ] = arg max i,b [ ∂L ∂x b i − ∂L ∂x a i ].
To perform a sequence of flips, any search procedure can be applied. HOTFLIP uses beam search of r steps, keeping at each step the top-K flip sequences that increased L(x, y) the most. This require O(K • r) forward and backward passes. Character insertions and deletions are modeled as multiple flips, but for simplicity, we only consider character flips in our work. The main drawbacks of HOTFLIP are that it does not gain any knowledge from optimizing over multiple examples, and that its efficiency is strongly tied to the search procedure used (O(K •r) forward and backward passes per example for beam-search). Next, we present our model that overcomes these limitations.
4 Distilling A Black-Box Attack
We are interested in whether (a) the knowledge in the optimization process of HOTFLIP can be distilled into a neural model, and (b) whether this model can generalize to a black-box attack. Therefore, the outline of our approach is as follows: 1. Train a source text classification model on data from a similar distribution to the data used to train the target black-box model.
2. Generate Adversarial Examples By Performing
white-box optimization (with HOTFLIP) on the source model. 3. Train an efficient attacker to generate adversarial examples, and perform a black-box attack against the target model.
We assume a training set D = {(x i , y i )} N i=1 used to train a character-based source model S(•) that takes a character sequence x as input, and returns a distribution over the output space Y (details on the source model are in Section 5). We now elaborate on the processes of data generation and training of the attacker.
Data Generation
We take examples (x, y) from D and run HOTFLIP with search until we obtain an adversarial examplex such that the probability of the gold label is low, that is, [S(x)] y < τ , where [S(x)] y is the probability given by S(x) to y ∈ Y and τ is a threshold (we use τ = 0.15).
Let s = ((x (0) = x), x (1) , . . . , (x (l) =x)) be the sequence of sentences generatingx, where every consecutive pair (x (i) , x (i+1) ) differs by a single character: the character in position j (i) was flipped to the character c (i) . Our attacker is trained from examples (x (i) , (j (i) , c (i) )) generated from every pair of consecutive sentences in s. For example, if the sentence is the one-word sentence "Asshole", and after flipping one character it becomes "Assnole", the example would be ("Asshole", (4, 'n')).
Model training Our attacker takes a character sequence x as input and outputs a pair (j * , c * ), where j * ∈ [1, . . . , m] is the position of the character to be flipped, and c * is the target character. Figure 2 describes the architecture of our model. Our model embeds each character using a pretrained 300-dimensional character embedding 1 , and then passes the character sequence through a 1layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997) with 512-dimensional hidden states. The BiLSTM h j hidden state in every position are passed through two feed-forward networks, one for replacement prediction (which character to flip) and one for target prediction (what target character to choose). The network that perfoms replacement prediction has 2 hidden layers of dimensions 100 and 50 with ReLU activations, and a single logit l j as output per position. The output distribution over the sentence positions is given by a softmax over all character positions. At inference time j * is computed with an argmax instead of a softmax.
The network that produces target prediction has two-hidden layers of dimensions 100 and 100 with ReLU activations and outputs a vector of logits v j ∈ R 96 per position with a softmax layer, which Our loss function is simply the sum of two crossentropy terms: one for the gold position, and one for the gold character in the gold position.
Running our model is more efficient that HOT-FLIP and run-time is independent of the optimization procedure. A forward pass in our model is equivalent to 2 • K steps in HOTFLIP with beamsearch. We show this leads to large practical speedups in Section 5.
5 Experiments
We now empirically investigate whether our method can be used to attack classifiers for detecting "toxic" language on social media. Recently a challenge by Alphabet aimed to improve labeling of toxic comments that are rude and disrespectful. Alphabet released a dataset 2 of 160K comments from Wikipedia discussions, and classified each comment to six labels. We focus on the Toxic label only, which represents 9.5% of the dataset.
We used the dataset to train the source model S(•), which runs a 2-layer bidirectional GRU (Cho et al., 2014) over the input x, and then uses an attention (Bahdanau et al., 2015) pooling layer to obtain a fixed dimensional vector for x. This vector is passed through a feed-forward layer to compute the probability that x is toxic. The accuracy of the source model is 96.5% AUC -comparable to the top submissions in the challenge.
We used the 13,815 toxic-labeled sentences from the training set to generate adversarial examples for training the attacker as described above. Dataset generation depends on the search procedure, and we experiment with 3 setups: (a) HOTFLIP-5: beam-search with K=5, (b) HOTFLIP-10: beamsearch with K=10, and (C) HOTFLIP+: a more expensive search procedure that calls S(•) more frequently. We describe the details of this procedure in Appendix A. Because our attacker is independent of the search procedure, inference time is not affected by the search procedure at data generation time. This results in three distilled models: DISTFLIP-5, and DISTFLIP+. 3 Attacking the source model We compare the performance of DISTFLIP variants against HOT-FLIP variant, including HOTFLIP-1 (K=1). We also compare to a RANDOM baseline, which chooses a position and target character randomly, and to an ATTENTION baseline, which uses the attention layer of S(•) to choose the character position with maximum attention to flip, and selects a target character randomly. Table 1 summarizes our results. We report the average number of flips required to change the prediction of toxic sentences in the source model, the slow-down per single character-flip, and the slowdown per attack, which is computed by multiplying slow-down per flip by the ratio of the number of flips required per attack. Because roughly 15% of the examples in the dataset mostly contain repeated profanities that require many flips, we also report the average number of flips for the other 85% of the examples.
We observe that HOTFLIP+ requires the fewest flips to change model prediction, but attacks are very slow. The number of flips per attack for DISTFLIP+ is comparable to HOTFLIP-5 and HOTFLIP-10, but it achieves a speed-up of 19x-39x. DISTFLIP-5 and DISTFLIP-10 require a few more flips compared to DISTFLIP+. Overall attack quality is high, with less than two flips necessary for 85% of the examples for DISTFLIP+. Figure 3 provides a more fine-grained view of the results by showing the proportion of sentences classified as toxic as a function of the number of flips. Overall, the picture is similar to Table 1, with DISTFLIP+ being comparable to HOTFLIP-10.
Attacking The Google Perspective API The Google perspective API 4 returns the probability that a sentence is toxic, where probability > 0.7 is classified as toxic, < 0.3 is non-toxic, and otherwise uncertain. The model itself is not publicly available. We randomly selected 136 toxic examples from the validation set and attacked them with DISTFLIP+ until the source model probability was < 0.5.
We measured the toxicity probability before and after our attack and found that the average toxicity probability decreased from 0.9 to 0.67, with an average of 5.0 flips per sentence. The label is flipped from toxic to uncertain or non-toxic in 42% of these examples.
Human validation To validate that toxic sentences remain toxic after our attack, we showed 5 independent annotators a total of 150 sentences from three classes: toxic sentences, non-toxic sentences, and attacked sentences (attacked by DISTFLIP-5). The same annotator was never shown a toxic sentence and its attacked counterpart. We asked annotators whether sentences are toxic, and measured average annotated toxicity.
We found that 89.8% of the toxic sentences were annotated as toxic, compared to 87.6% in the attacked toxic sentences. This shows that humans clearly view the sentences as toxic, even after our attack. Table 2 shows examples for sentences attacked by DISTFLIP-10 and the change in toxicity score according to The Google Perspective API.
6 Conclusion
In this work we present a general approach for distilling the knowledge in any white-box attack into a trained model. We show that this results in substantial speed-up (19x-39x) Table 1 : Average number of flips to change the prediction of toxic sentences, average number of flips to change the prediction for 85% of the examples that do not contain repeated profanities, slow-down per flip compared to DISTFLIP+, and slow-down per sentence attack compared to DISTFLIP+. One character-flip using DISTFLIP takes 12ms on an Nvidia GTX1080Ti.
Original Adversarial Google score decline=I don't fucking think so decline=I don't vucking think so 0.79 → 0.07 its an AP article you asshole, how is it not neutral???
its an AP article you assnole, how is it not neutral??? 0.96 → 0.48 I think the 1 million sales is total bullshit though.
I think the 1 million sales is total bullshkt though. 0.79 → 0.07 do not know juden! silence is golde juden, now shut up.
do not know juden! silence is golden juden, now xhjt up. 0.69 → 0.37 This article sucks . . woo woo wooooooo This article xuxks . . woo woo wooooooo 0.93 → 0.22 Also bring back the brendle article you piece of shit.
Also bring back the brendle article you pkece of xhkt. 0.98 → 0.32 to be driven away and die to be driven away and dke 0.82 → 0.32 Table 2 : Examples of sentences attacked by DISTFLIP-10 and The Google Perspective toxicity score before and after the attack.
attack transfers to a black-box setting: we expose the vulnerability of The Google Perspective API, and are able to change the prediction for 42% of input toxic examples, while humans easily detect that examples are still toxic.
https://github.com/minimaxir/charembeddings/blob/master/glove.840B.300dchar.txt
https://www.kaggle.com/c/jigsawtoxic-comment-classification-challenge/ data
DISTFLIP-5 is trained on 64K generated examples, DISTFLIP-10 on 62K examples, and DISTFLIP+ on 46K.4 https://www.perspectiveapi.com