This is the official code for "Imitation Attacks and Defenses for Black-box Machine Translation Systems". This repository contains the code for replicating our adversarial attack experiments on your own MT models.
Read our blog and our paper for more information on the method.
This code is written using Fairseq and PyTorch, and is based on an older version of Fairseq, from this commit. It is designed to run on a single GPU or on CPU; I used one GTX 1080 for all the experiments, and most experiments finish in a few minutes.
An easy way to install the code is to create a fresh anaconda environment:
```bash
conda create -n attacking python=3.6
source activate attacking
pip install -e .  # install local version of fairseq
pip install -r requirements.txt
```
Now you should be ready to go!
The repository is broken down by attack type:
- `malicious_nonsense.py` contains the malicious nonsense attack.
- `targeted_flips.py` contains the targeted flips attack.
- `universal.py` contains the two universal attacks (untargeted and suffix dropper).
The file `attack_utils.py` contains shared code for evaluating models, computing embedding gradients, the first-order Taylor expansion, and evaluating the top candidates for each attack. Overall, the code in this repository is a stripped-down and cleaned-up version of the code used in the paper; it is designed to be easy to understand and quick to get started with.
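As a rough illustration of what the first-order Taylor expansion buys you: the change in loss from swapping a token's embedding e_i for a candidate embedding e_j is approximated by (e_j - e_i) · dL/de_i, so every vocabulary entry can be scored with a single matrix product. The sketch below is a minimal numpy illustration under that assumption, not the repository's actual implementation; `score_candidates` and all values are hypothetical.

```python
import numpy as np

def score_candidates(embedding_matrix, current_embed, grad):
    """First-order (Taylor) estimate of the loss change from swapping the
    current token's embedding for each vocabulary embedding:
        L(e_j) - L(e_i) ~ (e_j - e_i) . dL/de_i
    Lower scores estimate a larger decrease in the attack loss."""
    return (embedding_matrix - current_embed) @ grad

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 512))    # toy vocabulary embedding matrix
e_i = E[42]                         # embedding of the token being replaced
g = rng.normal(size=512)            # gradient of the loss w.r.t. e_i
scores = score_candidates(E, e_i, g)
top_candidates = np.argsort(scores)[:10]  # best estimated replacements
```

In practice the top-scoring candidates are then re-evaluated with real forward passes, since the linear approximation is only locally accurate.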
First, you need to get a machine translation model. Fortunately, fairseq already has a number of pretrained models available. See this repository for a complete list. Here we will download a transformer-based English-German model that is trained on the WMT16 dataset.
```bash
wget https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2
wget https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2
bunzip2 wmt16.en-de.joined-dict.transformer.tar.bz2
bunzip2 wmt16.en-de.joined-dict.newstest2014.tar.bz2
tar -xvf wmt16.en-de.joined-dict.transformer.tar
tar -xvf wmt16.en-de.joined-dict.newstest2014.tar
```

Now we can run an interactive version of the malicious nonsense attack.
```bash
export CUDA_VISIBLE_DEVICES=0
python malicious_nonsense.py wmt16.en-de.joined-dict.newstest2014/ --arch transformer_vaswani_wmt_en_de_big --restore-file wmt16.en-de.joined-dict.transformer/model.pt --bpe subword_nmt --bpe-codes wmt16.en-de.joined-dict.transformer/bpecodes --interactive-attacks --source-lang en --target-lang de
```

The arguments we passed in are, in order: the dataset we downloaded, the model architecture (we downloaded a Transformer Big model), the model checkpoint path, the path to the BPE codes, and a flag to enable interactive attacks. The `--source-lang` and `--target-lang` flags can usually be omitted because fairseq infers the language pair automatically. If you want to run the attack on the WMT16 test set rather than interactively, omit the `--interactive-attacks` flag and pass `--valid-subset test`. If you do not have a GPU, skip the `export CUDA_VISIBLE_DEVICES=0` command and add the `--cpu` flag.
Now you can enter a sentence that you want to turn into malicious nonsense. Let's try something benign like *I am a student at the University down the hill*. You can also try something more malicious such as *Barack Obama was shot by a rebel group*, or whatever malicious input/output pair you want to elicit from the model.
The other attacks take the same arguments as malicious nonsense.

```bash
python targeted_flips.py wmt16.en-de.joined-dict.newstest2014/ --arch transformer_vaswani_wmt_en_de_big --restore-file wmt16.en-de.joined-dict.transformer/model.pt --bpe subword_nmt --bpe-codes wmt16.en-de.joined-dict.transformer/bpecodes --interactive-attacks --source-lang en --target-lang de
```

For targeted flips we currently assume that `--interactive-attacks` is set.
First, enter the sentence that you want to attack, e.g., *I am sad*, which translates to *Ich bin traurig* for the English-German model we downloaded above. Then, choose the word on the target side that you want to flip, e.g., *traurig*, and what you want to flip it to, e.g., *froh* (which means happy/glad in English). You can then enter nothing for the optional lists. This should cause the attack to flip the input from *I am sad* to *I am glad*.
Of course, *I am glad* is not "adversarial" in the sense that the model is still making a correct translation. We can instead forbid the attack from using the word *glad* in the input. With that restriction, the attack finds *I am lee*, which the model translates as *Ich bin froh*.
```bash
python universal.py wmt16.en-de.joined-dict.newstest2014/ --arch transformer_vaswani_wmt_en_de_big --restore-file wmt16.en-de.joined-dict.transformer/model.pt --bpe subword_nmt --bpe-codes wmt16.en-de.joined-dict.transformer/bpecodes --interactive-attacks --source-lang en --target-lang de
```

This command defaults to the untargeted attack; passing `--suffix-dropper` performs the suffix dropper attack instead.
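The core loop of a universal attack can be summarized as: average the loss gradient at each trigger position over a batch of inputs, then use the same first-order approximation to pick a vocabulary swap that increases the loss. The following is a hypothetical numpy sketch of that loop, not the code in `universal.py`; all names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, dim, trigger_len = 500, 64, 3
E = rng.normal(size=(vocab, dim))                   # toy embedding matrix
trigger = rng.integers(0, vocab, size=trigger_len)  # current trigger token ids

# Stand-in for gradients of the (untargeted) loss w.r.t. each trigger
# position's embedding, averaged over a batch of real inputs.
avg_grads = rng.normal(size=(trigger_len, dim))

# For each position, estimate the loss change for every vocabulary swap;
# since the attack is untargeted, pick the swap that *increases* the loss.
for pos in range(trigger_len):
    scores = (E - E[trigger[pos]]) @ avg_grads[pos]
    trigger[pos] = int(np.argmax(scores))
```

In a real run this update alternates with fresh gradient computations on the model until the trigger stops improving.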
Please consider citing our work if you found this code or our paper beneficial to your research.
```bibtex
@article{Wallace2020Stealing,
  author  = {Eric Wallace and Mitchell Stern and Dawn Song},
  title   = {Imitation Attacks and Defenses for Black-box Machine Translation Systems},
  journal = {arXiv preprint arXiv:2004.15015},
  year    = {2020}
}
```
This code was developed by Eric Wallace, who can be contacted at ericwallace@berkeley.edu.
If you'd like to contribute code, feel free to open a pull request. If you find an issue with the code, please open an issue.