Image CAPTCHA Dataset

This page contains part of the CAPTCHA image dataset that we collected during our experiments for the project "I Am Robot: (Deep) Learning to Break Semantic Image CAPTCHAs". The work was developed at the Network Security Lab, Columbia University, and was presented and published at Euro S&P 2016 and Black Hat Asia 2016.

Collaborators: Suphannee Sivakorn, Jason Polakis and Angelos D. Keromytis

Copyright and License:
All code and documentation are copyright of the project collaborators and the Network Security Lab at Columbia University, New York, NY, USA. CAPTCHA images are copyright of their respective CAPTCHA services. The project is released under the MIT License.

Feel free to use our dataset:
Below is a list of works and projects that relate to CAPTCHA and deep-learning research and have cited this dataset or paper as a data source. Thank you!

One Thousand CAPTCHA Photos Organized with a Neural Network

Google reCaptcha Image Dataset

This dataset contains 700 CAPTCHA challenges (7,000 images) collected from Google reCaptcha version 1.0 between early and mid 2015. Each challenge contains 10 images (one example image and nine candidate images) and an instruction text for the human solver. Here is an example:

Dataset Format

Each challenge set is stored in its own folder, which contains:

1 example image file: samp_[number].png
9 candidate image files: cand_[number].png
1 challenge screenshot image file: full.png
1 information (about the challenge) file: info.txt

The "info.txt" file holds information about the challenge and the images in the set, including annotated tags and scores from deep learning services (please refer to our paper for more information), the challenge description (instruction), and the keyword (hint). The file is written in a JSON-style format.
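
As a rough illustration of the folder layout described above, the following Python sketch gathers the files of one challenge set. The function name load_challenge and the returned dictionary keys are hypothetical and not part of the dataset. The sketch assumes that info.txt parses as JSON; because the exact contents of that file are not documented here, it falls back to the raw text if parsing fails.

```python
import json
from pathlib import Path


def load_challenge(folder):
    """Collect the files of one challenge folder into a dict (hypothetical helper).

    Assumes the layout described above: samp_*.png, cand_*.png,
    full.png, and info.txt.
    """
    folder = Path(folder)
    text = (folder / "info.txt").read_text(encoding="utf-8")
    try:
        info = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: if the file is not strict JSON, keep the raw text
        # so the caller can inspect it manually.
        info = text

    def by_number(path):
        # Shorter names first, so cand_2.png sorts before cand_10.png.
        return (len(path.name), path.name)

    return {
        "name": folder.name,
        "samples": sorted(folder.glob("samp_*.png"), key=by_number),
        "candidates": sorted(folder.glob("cand_*.png"), key=by_number),
        "screenshot": folder / "full.png",
        "info": info,
    }
```

Calling load_challenge on an extracted challenge folder should return the candidate images in numeric order, which makes it easier to match them against the candidate numbers in the response file.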

In addition, we provide the correct responses for these challenge sets in a file named "correct_responses.txt". These responses were given by humans. Each response has the following format:

[challenge set name] | [candidate number answers]
Here is an example:
accounts.snapchat.com_2015-04-25_23-14-30 | 2,8,9
where "accounts.snapchat.com_2015-04-25_23-14-30" is the challenge set name and 2,8,9 are the correct candidate numbers, i.e., cand_2.png, cand_8.png, and cand_9.png.
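
The response format above can be parsed with a few lines of Python. This is a minimal sketch, not code from the project: the function names parse_response_line and candidate_files are hypothetical, and the sketch assumes that every line follows the "name | numbers" pattern shown in the example.

```python
def parse_response_line(line):
    """Split '[challenge set name] | [candidate numbers]' into its parts."""
    name, _, answers = line.partition("|")
    # int() tolerates surrounding whitespace; empty tokens are skipped.
    numbers = [int(tok) for tok in answers.split(",") if tok.strip()]
    return name.strip(), numbers


def candidate_files(numbers):
    """Map candidate numbers to the file names used in a challenge folder."""
    return ["cand_%d.png" % n for n in numbers]


name, numbers = parse_response_line(
    "accounts.snapchat.com_2015-04-25_23-14-30 | 2,8,9"
)
print(name)
print(candidate_files(numbers))
```

For the example line, name is the challenge folder name and candidate_files(numbers) yields cand_2.png, cand_8.png, and cand_9.png.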

Download

The link to download is here: recapt_offline.tar.gz (513 MB).

Facebook CAPTCHA Image Dataset

This dataset contains 200 CAPTCHA challenges collected from the Facebook image CAPTCHA between early and mid 2015. Each challenge contains 12 candidate images and an instruction. Here is an example:

Dataset Format

This dataset has the same format as the Google reCaptcha Image Dataset.

Download

The link to download is here: fb_offline.tar.gz (145 MB).

Labeled CAPTCHA Images

This dataset contains 3,000 images from Google reCaptcha version 1.0, the same source as above. However, all images in this set were labeled by humans according to the reCAPTCHA categories available at that time. Because they are labeled, they are useful for training. An image that does not belong to any category, or that is too unclear to identify, is labeled "unclear".

Dataset Format

All files are saved with the ".png" extension and named in the following format:

[MD5hashvalue]_[label].png
e.g., fe7d3d349b44cca3e7704a103fafa55f_soup.png, ff059aff88fb0caf2e6d2ecc2d5a73b7_pizza.png, f2c5949b07823513b656d3623acc9ccc_unclear.png, where MD5hashvalue is the 128-bit MD5 hash of the image in hexadecimal digits and label is the label of the image.
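
The naming scheme above makes it straightforward to recover the label from a file name. The sketch below is illustrative only: the helper names parse_labeled_name and md5_matches are hypothetical, and the integrity check assumes that the hash was computed over the raw bytes of the PNG file, which this page does not state explicitly.

```python
import hashlib
from pathlib import Path


def parse_labeled_name(filename):
    """Return (md5_hex, label) from '[MD5hashvalue]_[label].png'."""
    stem = Path(filename).stem
    # An MD5 hex digest contains no underscore, so split at the first one.
    digest, _, label = stem.partition("_")
    return digest, label


def md5_matches(path):
    """Check the file name digest against the file contents.

    Assumption: the digest in the name is the MD5 of the raw file bytes.
    """
    digest, _ = parse_labeled_name(path)
    actual = hashlib.md5(Path(path).read_bytes()).hexdigest()
    return actual == digest.lower()


print(parse_labeled_name("fe7d3d349b44cca3e7704a103fafa55f_soup.png"))
```

If md5_matches returns False for files in the extracted archive, the hash was likely computed over something other than the file bytes (for example, decoded pixel data), so the check should be treated as a convenience rather than a guarantee.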

Download

The link to download is here: img_labeled_3000.tar.gz (73 MB).