Writer Identification Data Set based on Text Block (Word Region)

Sheng He and Lambert Schomaker, "FragNet: Writer Identification using Deep Fragment Networks," IEEE Trans. on Information Forensics and Security, 2020.

IAM data set

    First, download the data set from the IAM website.

    Our split of the word images into training and testing sets is available.

    Training set: IAM-train-list.txt

    Testing set: IAM-test-list.txt

    Format: test/000-a01-107u-00-01-We.png

    test/: the image belongs to the testing set.

    000: the writer id, which can be found in ascii/forms.txt of the downloaded IAM data set.

    a01-107u-00-01: the word id (word 01 of line 00 in the form a01-107u), which can be found in ascii/words.txt of the downloaded IAM data set.

    We: the text of the word in this image.

    All other information, such as the bounding rectangle of each word, can be found in ascii/words.txt of the downloaded IAM data set.
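
    The naming scheme above can be split into its fields with a few lines of Python. This is only a sketch: the names IamWordName and parse_iam_entry are hypothetical, and it assumes every entry follows the layout of the single example shown (an optional folder prefix, a form id made of two hyphen-separated parts, and a .png extension).

```python
from typing import NamedTuple


class IamWordName(NamedTuple):
    subset: str     # folder prefix, "test" in the example
    writer_id: str  # e.g. "000"
    form_id: str    # e.g. "a01-107u"
    word_id: str    # e.g. "a01-107u-00-01"
    text: str       # text of the word, e.g. "We"


def parse_iam_entry(entry: str) -> IamWordName:
    """Split one entry such as 'test/000-a01-107u-00-01-We.png'."""
    subset, _, filename = entry.strip().rpartition("/")
    if not filename.endswith(".png"):
        raise ValueError(f"not a .png entry: {entry!r}")
    stem = filename[: -len(".png")]
    # maxsplit=5 keeps any hyphen inside the word text intact.
    parts = stem.split("-", 5)
    if len(parts) != 6:
        raise ValueError(f"unexpected name layout: {entry!r}")
    writer_id, form_a, form_b, line_no, word_no, text = parts
    # Assumption: the form id always has two hyphen-separated parts.
    form_id = f"{form_a}-{form_b}"
    word_id = f"{form_id}-{line_no}-{word_no}"
    return IamWordName(subset, writer_id, form_id, word_id, text)


example = parse_iam_entry("test/000-a01-107u-00-01-We.png")
print(example)
```

    The word_id field can then be used as the key for looking up the matching row in ascii/words.txt.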

CVL data set

    First, download the data set from the CVL website.

    Our split of the word images into training and testing sets is available.

    Training set: CVL-train-list.txt

    Testing set: CVL-test-list.txt

    All of these word images are included in the downloaded data set; you can find them in the words/ folder of its train/test sets.

    Note that our train/test sets differ from the original train/test split, since our work is on writer identification.

Firemaker data set

    We segmented the pages into word regions.

    Training set: Firemaker-train-images.tar.gz

    Testing set: Firemaker-test-images.tar.gz

    Format: 48704-right-line-3-n-1-y-396-x-692-h-67-w-760.png

    48704: writer id.

    right: the right page (the fourth page of the Firemaker data set).

    line: the line id; n: the index of the word in the line; y, x, h, w: the word region.
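
    A file name in this format can be decoded with a regular expression. The sketch below is not part of the released data; parse_firemaker_name and FIREMAKER_NAME_RE are hypothetical names, and the pattern is inferred from the single example above, so it may need adjusting if other files deviate from it.

```python
import re

# Assumed layout, inferred from the one example file name:
# <writer>-<page>-line-<line>-n-<n>-y-<y>-x-<x>-h-<h>-w-<w>.png
FIREMAKER_NAME_RE = re.compile(
    r"^(?P<writer>[^-]+)-(?P<page>[^-]+)"
    r"-line-(?P<line>\d+)-n-(?P<n>\d+)"
    r"-y-(?P<y>\d+)-x-(?P<x>\d+)-h-(?P<h>\d+)-w-(?P<w>\d+)\.png$"
)


def parse_firemaker_name(filename: str) -> dict:
    """Return the fields encoded in a Firemaker word-image file name."""
    match = FIREMAKER_NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected name layout: {filename!r}")
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("line", "n", "y", "x", "h", "w"):
        fields[key] = int(fields[key])
    return fields


info = parse_firemaker_name("48704-right-line-3-n-1-y-396-x-692-h-67-w-760.png")
print(info)
```

    The writer field is kept as a string so that it can be used directly as a class label.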

CERUG-EN data set

    Our split of the word images into training and testing sets is available.

    Training set: CERUG-EN-train-images.tar.gz

    Testing set: CERUG-EN-test-images.tar.gz

    Format: Writer7070_03-02-line-2-n-5-y-355-x-1251-h-81-w-78.png

    Writer7070_03: writer id.

    02: the second paragraph.

    line: the line id; n: the index of the word in the line; y, x, h, w: the word region.
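
    Since the writer id is the first field of each file name, the word images can be grouped per writer without any extra annotation file. The following is a minimal sketch under the assumption that all files follow the example layout above; CERUG_NAME_RE and group_by_writer are hypothetical names, and "notes.txt" is a made-up entry used only to show that non-matching files are skipped.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the one example file name:
# <writer>-<paragraph>-line-<line>-n-<n>-y-<y>-x-<x>-h-<h>-w-<w>.png
CERUG_NAME_RE = re.compile(
    r"^(?P<writer>[^-]+)-(?P<paragraph>\d+)"
    r"-line-(?P<line>\d+)-n-(?P<n>\d+)"
    r"-y-(?P<y>\d+)-x-(?P<x>\d+)-h-(?P<h>\d+)-w-(?P<w>\d+)\.png$"
)


def group_by_writer(filenames):
    """Map each writer id to the list of its word-image file names."""
    groups = defaultdict(list)
    for name in filenames:
        match = CERUG_NAME_RE.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed layout
        groups[match.group("writer")].append(name)
    return dict(groups)


names = [
    "Writer7070_03-02-line-2-n-5-y-355-x-1251-h-81-w-78.png",
    "notes.txt",  # made-up non-image entry, ignored by the pattern
]
groups = group_by_writer(names)
print(groups)
```

    In practice the list of names could come from os.listdir() on the folder extracted from the training or testing archive.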