Training, Test and Evaluation Sets for the AMI Corpus
People working on automatic annotation of the AMI corpus should, where possible, use the same designations to ensure comparability of their results, in particular for a single annotation type.
On this page you will find our (strongly encouraged) suggested division of the corpus into training, development and test sets, as well as splits into 5 and 10 parts for use in cross-validation.
Motivation
In WP5, we are doing a number of content abstraction tasks. For each task, there may be several groups using algorithms of varying computational complexity and with different data requirements. We need to be able to compare, for instance, the results of systems for doing the same task that use ten-fold cross-validation, five-fold cross-validation, one simple training set, or a dev set and a training set. For this reason, we are holding out a portion of the data that no system will see but that will be used only for evaluation.
Main division into seen and unseen data (in hours):
| | seen data | unseen eval data |
|---|---|---|
| scenario (S) | rest (61-ish) | 11 |
| non-scenario (N and B) | rest (28-ish) | 2.5-3 |
We divide the seen data into training and development data for systems that require that distinction. The data is divided (in hours) as follows:
| | train | dev |
|---|---|---|
| scenario | rest (50-ish) | 11 |
| non-scenario | rest (25-ish) | 2.5-3 |
Partition for Scenario meetings
- SA (TRAINING PART OF SEEN DATA): ES2002, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3008, TS3009, TS3010, TS3011, TS3012 (25 sets, 25*4-2 = 98 meetings)
- SB (DEV PART OF SEEN DATA): ES2003, ES2011, IS1008, TS3004, TS3006 (5 sets, 5*4 = 20 meetings)
- SC (UNSEEN DATA FOR EVALUATION): ES2004, ES2014, IS1009, TS3003, TS3007 (5 sets, 5*4 = 20 meetings)
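The set and meeting counts in the list above can be checked with a short script. This is a minimal sketch, not part of the corpus tooling: the names `PARTITION`, `MISSING` and `meetings_in` are hypothetical, and the convention that each set consists of meetings with suffixes a to d is an assumption inferred from the notes "(no a)" and "(no d)".

```python
# Scenario meeting sets per partition, copied from the list above.
PARTITION = {
    "SA": [
        "ES2002", "ES2005", "ES2006", "ES2007", "ES2008", "ES2009",
        "ES2010", "ES2012", "ES2013", "ES2015", "ES2016",
        "IS1000", "IS1001", "IS1002", "IS1003", "IS1004", "IS1005",
        "IS1006", "IS1007",
        "TS3005", "TS3008", "TS3009", "TS3010", "TS3011", "TS3012",
    ],
    "SB": ["ES2003", "ES2011", "IS1008", "TS3004", "TS3006"],
    "SC": ["ES2004", "ES2014", "IS1009", "TS3003", "TS3007"],
}

# Sets with one meeting missing: "IS1002 (no a)" and "IS1005 (no d)".
MISSING = {"IS1002": {"a"}, "IS1005": {"d"}}


def meetings_in(set_ids):
    """Expand set IDs such as 'ES2002' into meeting IDs such as 'ES2002a'.

    Assumes four meetings per set, suffixed a-d, minus the known gaps.
    """
    return [
        set_id + suffix
        for set_id in set_ids
        for suffix in "abcd"
        if suffix not in MISSING.get(set_id, set())
    ]


if __name__ == "__main__":
    for name, set_ids in PARTITION.items():
        print(name, len(set_ids), "sets,", len(meetings_in(set_ids)), "meetings")
```

Keeping the two incomplete sets in a separate table makes the 25*4-2 arithmetic explicit rather than hard-coding the total.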
Look at the bottom of the page for annotation statistics, and the script used to generate them.
K-fold Cross-Validation Partition for Scenario meetings
Each line (for k=10) or each pair of lines (for k=5) is one part of the partition. For each fold, use one part for testing and the other nine (k=10) or four (k=5) for training. The columns indicate which set of the three-way partition (SA, SB, SC) each meeting comes from; for example, to run cross-validation without any unseen data, use only the meetings in the SA and SB columns.
| SA | SB | SC | k=10 | k=5 |
|---|---|---|---|---|
| ES2002, IS1000 | TS3004 | ES2004 | 1 | 1 |
| ES2007, IS1001, TS3005 | | | 2 | 1 |
| ES2005, IS1002 | TS3006 | ES2014 | 3 | 2 |
| ES2006, IS1003, TS3008 | | | 4 | 2 |
| TS3009, IS1004 | ES2003 | IS1009 | 5 | 3 |
| ES2008, ES2013, TS3010 | | | 6 | 3 |
| TS3011, IS1006 | ES2011 | TS3003 | 7 | 4 |
| ES2010, IS1007, ES2015 | | | 8 | 4 |
| ES2009, TS3012 | IS1008 | TS3007 | 9 | 5 |
| ES2012, IS1005, ES2016 | | | 10 | 5 |
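The fold construction described above can be sketched as follows. This is an illustrative sketch only: the names `PARTS`, `COLUMNS`, `parts` and `folds` are hypothetical, and the code works on meeting-set IDs (e.g. ES2002) rather than individual meetings.

```python
# One entry per line of the table: (SA sets, SB sets, SC sets).
# Line i (0-based) is k=10 part i+1; each consecutive pair of lines
# forms one k=5 part.
PARTS = [
    (["ES2002", "IS1000"], ["TS3004"], ["ES2004"]),
    (["ES2007", "IS1001", "TS3005"], [], []),
    (["ES2005", "IS1002"], ["TS3006"], ["ES2014"]),
    (["ES2006", "IS1003", "TS3008"], [], []),
    (["TS3009", "IS1004"], ["ES2003"], ["IS1009"]),
    (["ES2008", "ES2013", "TS3010"], [], []),
    (["TS3011", "IS1006"], ["ES2011"], ["TS3003"]),
    (["ES2010", "IS1007", "ES2015"], [], []),
    (["ES2009", "TS3012"], ["IS1008"], ["TS3007"]),
    (["ES2012", "IS1005", "ES2016"], [], []),
]
COLUMNS = ("SA", "SB", "SC")


def parts(k, use=("SA", "SB")):
    """Return the k parts, keeping only sets from the columns named in `use`."""
    if k not in (5, 10):
        raise ValueError("only k=5 and k=10 are defined by the table")
    lines = [
        [s for col, sets in zip(COLUMNS, line) if col in use for s in sets]
        for line in PARTS
    ]
    step = 10 // k  # 1 line per part for k=10, 2 lines per part for k=5
    return [sum(lines[i:i + step], []) for i in range(0, 10, step)]


def folds(k, use=("SA", "SB")):
    """Yield (train, test) lists of set IDs, one pair per fold."""
    all_parts = parts(k, use)
    for i, test in enumerate(all_parts):
        train = [s for j, p in enumerate(all_parts) if j != i for s in p]
        yield train, test


if __name__ == "__main__":
    for n, (train, test) in enumerate(folds(5), start=1):
        print("fold", n, "test:", test, "train size:", len(train))
```

The default `use=("SA", "SB")` corresponds to cross-validation without unseen data; passing `use=("SA", "SB", "SC")` includes the evaluation sets as well.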
Representation in NXT format
This information is encoded in NXT format in the following manner. The corpus resource file meetings.xml contains information about all meetings in the corpus. There are four relevant attributes on the 'meeting' elements:
- visibility - values are either 'seen' or 'unseen'. 'seen' meetings are in the training or development set; 'unseen' meetings are in the evaluation set.
- seen_type - if the visibility is 'seen', this attribute contains either 'training' or 'development'.
- k10 - numerical value that divides the corpus into 10 parts for 10-fold cross-validation.
- k5 - numerical value that divides the corpus into 5 parts for 5-fold cross-validation.
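Selecting meetings by these attributes might look like the sketch below. It parses a small hypothetical fragment rather than the real meetings.xml: the `meetings` root element and the `observation` attribute used as the meeting identifier are assumptions, and the real file may differ in structure or use namespaces. Only the attribute names visibility, seen_type, k10 and k5 and the 'meeting' element name come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration; not copied from the real file.
# "observation" as the meeting ID attribute is an assumption.
SAMPLE = """
<meetings>
  <meeting observation="ES2002a" visibility="seen" seen_type="training" k10="1" k5="1"/>
  <meeting observation="TS3004a" visibility="seen" seen_type="development" k10="1" k5="1"/>
  <meeting observation="ES2004a" visibility="unseen" k10="1" k5="1"/>
  <meeting observation="ES2007a" visibility="seen" seen_type="training" k10="2" k5="1"/>
</meetings>
"""


def select(root, **wanted):
    """Return IDs of 'meeting' elements whose attributes equal all given values.

    Attribute values are compared as strings, e.g. k10="1".
    """
    return [
        m.get("observation")
        for m in root.iter("meeting")
        if all(m.get(name) == value for name, value in wanted.items())
    ]


root = ET.fromstring(SAMPLE.strip())
training = select(root, visibility="seen", seen_type="training")
unseen = select(root, visibility="unseen")
# Test part of the first fold in 10-fold CV, restricted to seen data.
fold1_test = select(root, visibility="seen", k10="1")

if __name__ == "__main__":
    print(training, unseen, fold1_test)
```

To run this against the actual resource file, replace `ET.fromstring(SAMPLE.strip())` with `ET.parse(path).getroot()` and adjust the identifier attribute to whatever the file uses.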