Notes
FILES

- 'all-annotations': annotations of all 1946 photos from VIST-VAL that have 5 associated narrative sentences. Annotations were made by the first author.
- 'external-annotations': annotations of 514 photos made by a person unfamiliar with the study, used to calculate the interrater reliability score.

DATASET ENTRIES

Each key in the JSON file corresponds to one photo of the VIST-VAL dataset. Each photo entry has the following attributes:

- photo_id: the original photo id in VIST-VAL.
- azureCaption: the caption generated automatically by the machine learning technique adopted.
- photo_quality: a score between 0 and 3 based on the number of contextual categories (environment, people/object, activity) the photo clearly depicts (0 when ambiguous). It is the sum of "photo_quality_location", "photo_quality_subject", and "photo_quality_activity".
- photo_quality_location: a 0/1 score indicating whether the location of the photographed scene is clearly depicted.
- photo_quality_subject: a 0/1 score indicating whether the subject (person or object) of the photographed scene is clearly depicted.
- photo_quality_activity: a 0/1 score indicating whether the activity in the photographed scene is clearly depicted.
- azureCaption_quality: a score between 0 and 3 given to the generated azureCaption, according to these rules: 0) not generated or completely unrelated; 1) misses most important elements, OR contains most important elements and a few unrelated elements; 2) contains most important elements, OR all important elements and a few unrelated elements; 3) contains all important elements in the photo and no unrelated elements.
- groundTruthSIS: a set of five narrative sentences from VIST-VAL associated with the photo_id.
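As a minimal sketch of how an entry following the schema above can be validated, here is a hedged Python example. The example values below are invented for illustration (they are not taken from the dataset), and the consistency checks simply restate the rules described above (photo_quality is the sum of its three 0/1 components, azureCaption_quality is in 0-3, groundTruthSIS holds five sentences):

```python
import json

# A hypothetical entry illustrating the schema; in the real file the
# top-level keys are VIST-VAL photo ids mapping to entries of this shape.
sample = {
    "123456": {
        "photo_id": "123456",
        "azureCaption": "a group of people standing outside a building",
        "photo_quality": 2,
        "photo_quality_location": 1,
        "photo_quality_subject": 1,
        "photo_quality_activity": 0,
        "azureCaption_quality": 2,
        "groundTruthSIS": [
            "sentence 1", "sentence 2", "sentence 3",
            "sentence 4", "sentence 5",
        ],
    }
}

def check_entry(entry):
    """Check an entry against the documented schema rules."""
    parts = (
        entry["photo_quality_location"],
        entry["photo_quality_subject"],
        entry["photo_quality_activity"],
    )
    # Each component is a 0/1 score, and photo_quality is their sum.
    assert all(p in (0, 1) for p in parts)
    assert entry["photo_quality"] == sum(parts)
    # azureCaption_quality ranges over 0..3.
    assert 0 <= entry["azureCaption_quality"] <= 3
    # groundTruthSIS holds the five narrative sentences.
    assert len(entry["groundTruthSIS"]) == 5

# In practice the data would come from json.load(open(...)); here we
# iterate over the in-memory sample instead.
for photo_id, entry in sample.items():
    check_entry(entry)
```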