
The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings


Training, Test and Evaluation Sets for the AMI Corpus

last modified 2009-01-15 17:38

People working on automatic annotation of the AMI corpus should, where possible, use the same designations to ensure comparability of their results, in particular for a single annotation type.

This page gives our (strongly encouraged) suggestion for dividing the corpus into training, development and test sets, along with splits into 5 and 10 parts for use in cross-validation.


Motivation

In WP5, we are doing a number of content abstraction tasks. For each task, there may be several groups using algorithms of varying computational complexity and with different data requirements. We need to be able to compare, for instance, the results of systems for doing the same task that use ten-fold cross-validation, five-fold cross-validation, one simple training set, or a dev set and a training set. For this reason, we are holding out a portion of the data that no system will see but that will be used only for evaluation.

Main division into seen and unseen data (in hours):


                          seen data        unseen eval data
scenario (S)              rest (61-ish)    11
non-scenario (N and B)    rest (28-ish)    2.5-3

We divide the seen data into training and development data for systems that require that distinction. The data is divided (in hours) as follows:


                train            dev
scenario        rest (50-ish)    11
non-scenario    rest (25-ish)    2.5-3

Partition for Scenario meetings

  • SA (TRAINING PART OF SEEN DATA): ES2002, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3008, TS3009, TS3010, TS3011, TS3012 (25 sets, 25*4-2 = 98 meetings)

  • SB (DEV PART OF SEEN DATA): ES2003, ES2011, IS1008, TS3004, TS3006 (5 sets, 5*4 = 20 meetings)

  • SC (UNSEEN DATA FOR EVALUATION): ES2004, ES2014, IS1009, TS3003, TS3007 (5 sets, 5*4 = 20 meetings)
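The three lists above can be written down as data and checked against the stated counts. The following is a minimal sketch, not an official script: it assumes that each set consists of four meetings named by appending a, b, c or d to the set name (so that "IS1002 (no a)" means meeting IS1002a is absent), and the names MISSING, meetings and PARTITION are made up for illustration.

```python
# Scenario-meeting sets, copied from the SA / SB / SC lists on this page.
SA = ("ES2002 ES2005 ES2006 ES2007 ES2008 ES2009 ES2010 ES2012 ES2013 ES2015 ES2016 "
      "IS1000 IS1001 IS1002 IS1003 IS1004 IS1005 IS1006 IS1007 "
      "TS3005 TS3008 TS3009 TS3010 TS3011 TS3012").split()
SB = "ES2003 ES2011 IS1008 TS3004 TS3006".split()
SC = "ES2004 ES2014 IS1009 TS3003 TS3007".split()

# Assumption: "IS1002 (no a)" and "IS1005 (no d)" refer to these meeting IDs.
MISSING = {"IS1002a", "IS1005d"}


def meetings(sets):
    """Expand set names into meeting IDs, assuming the suffixes a-d."""
    return [name + part
            for name in sets
            for part in "abcd"
            if name + part not in MISSING]


PARTITION = {
    "training": meetings(SA),     # SA: training part of the seen data
    "development": meetings(SB),  # SB: dev part of the seen data
    "evaluation": meetings(SC),   # SC: unseen data for evaluation
}

for label, ids in PARTITION.items():
    print(label, len(ids))
```

Under these assumptions the three parts contain 98, 20 and 20 meetings, matching the counts given in the lists.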

Look at the bottom of the page for annotation statistics, and the script used to generate them.

K-fold Cross-Validation Partition for Scenario meetings

Each line (for k=10) or each pair of lines (for k=5) is one part of the partition. For each fold, you use one part for testing and the other nine (or four, for k=5) for training. The columns indicate from which set of the tripartition the meetings are taken, e.g. if you want to run CV using no unseen data, you'd only use meetings from the SA and SB columns.

SA                      SB      SC      k=10  k=5
ES2002  IS1000          TS3004  ES2004  1     1
ES2007  IS1001  TS3005                  2     1
ES2005  IS1002          TS3006  ES2014  3     2
ES2006  IS1003  TS3008                  4     2
TS3009  IS1004          ES2003  IS1009  5     3
ES2008  ES2013  TS3010                  6     3
TS3011  IS1006          ES2011  TS3003  7     4
ES2010  IS1007  ES2015                  8     4
ES2009  TS3012          IS1008  TS3007  9     5
ES2012  IS1005  ES2016                  10    5
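The fold construction described above can be sketched in a few lines. This is an illustration only: the names K10_ROWS and folds are hypothetical, the rows are transcribed from the table on this page, and the sketch works on set names (such as ES2002), each of which stands for the meetings in that set.

```python
# One entry per k=10 part (one table row); k=5 part p is rows 2p-1 and 2p.
K10_ROWS = {
    1: {"SA": ["ES2002", "IS1000"], "SB": ["TS3004"], "SC": ["ES2004"]},
    2: {"SA": ["ES2007", "IS1001", "TS3005"], "SB": [], "SC": []},
    3: {"SA": ["ES2005", "IS1002"], "SB": ["TS3006"], "SC": ["ES2014"]},
    4: {"SA": ["ES2006", "IS1003", "TS3008"], "SB": [], "SC": []},
    5: {"SA": ["TS3009", "IS1004"], "SB": ["ES2003"], "SC": ["IS1009"]},
    6: {"SA": ["ES2008", "ES2013", "TS3010"], "SB": [], "SC": []},
    7: {"SA": ["TS3011", "IS1006"], "SB": ["ES2011"], "SC": ["TS3003"]},
    8: {"SA": ["ES2010", "IS1007", "ES2015"], "SB": [], "SC": []},
    9: {"SA": ["ES2009", "TS3012"], "SB": ["IS1008"], "SC": ["TS3007"]},
    10: {"SA": ["ES2012", "IS1005", "ES2016"], "SB": [], "SC": []},
}


def folds(k=10, columns=("SA", "SB")):
    """Yield (test_part, train_sets, test_sets) for each of the k folds.

    `columns` selects which sets of the tripartition take part; the default
    leaves out SC, i.e. cross-validation without any unseen data.
    """
    if k not in (5, 10):
        raise ValueError("k must be 5 or 10")
    for test_part in range(1, k + 1):
        train, test = [], []
        for row, cells in K10_ROWS.items():
            # For k=5, two consecutive rows share one part number.
            part = row if k == 10 else (row + 1) // 2
            sets = [name for col in columns for name in cells[col]]
            (test if part == test_part else train).extend(sets)
        yield test_part, train, test


for test_part, train, test in folds(k=5):
    print(test_part, len(train), len(test))
```

Passing columns=("SA", "SB", "SC") would include the unseen evaluation sets in the folds as well.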


Representation in NXT format

This information is encoded in NXT format in the following manner. The corpus resource file meetings.xml contains information about all meetings in the corpus. There are four relevant attributes on the 'meeting' elements:
  • visibility - values are either 'seen' or 'unseen'. 'seen' means the meeting is in the training or development set; 'unseen' means the meeting is in the evaluation set.
  • seen_type - if the visibility is 'seen', this attribute contains either 'training' or 'development'.
  • k10 - numerical value that divides the corpus into 10 parts for 10-fold cross-validation.
  • k5 - numerical value that divides the corpus into 5 parts for 5-fold cross-validation.
As with any elements and attributes in the NXT corpus, these can be queried using the NXT Query Language.
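Outside of NXT, the same attributes can also be read with a generic XML parser. The sketch below is hedged accordingly: only the element name 'meeting' and the four attributes visibility, seen_type, k10 and k5 come from this page. The root element, the "observation" attribute used as a meeting ID, and the helper name select are assumptions made for illustration, and the sample is a made-up excerpt rather than the real meetings.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt. The root element name and the "observation" ID
# attribute are assumptions; check them against the real meetings.xml.
SAMPLE = """<meetings>
  <meeting observation="ES2002a" visibility="seen" seen_type="training" k10="1" k5="1"/>
  <meeting observation="TS3004a" visibility="seen" seen_type="development" k10="1" k5="1"/>
  <meeting observation="ES2004a" visibility="unseen" k10="1" k5="1"/>
</meetings>"""


def select(xml_text, id_attr="observation", **wanted):
    """Return the IDs of 'meeting' elements whose attributes equal `wanted`."""
    root = ET.fromstring(xml_text)
    return [m.get(id_attr)
            for m in root.iter("meeting")
            if all(m.get(name) == value for name, value in wanted.items())]


print(select(SAMPLE, visibility="seen", seen_type="training"))
print(select(SAMPLE, visibility="unseen"))
```

Attribute values are compared as strings, so a fold number is passed as, for example, k10="3".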

