Re: TREC-10 video track task discussion



Paul Over wrote:

> Beyond
> low-level tasks like segmentation (I assume not of use by itself to
> end-users but is it of interest to researchers still?) 

... I suspect that there are research groups out there who would be
interested in evaluating their shot and scene boundary work on something
that is easy for others to do likewise, thus facilitating direct
comparison.  Until recently there were papers like "Another technique
for shot boundary detection evaluated on 2 minutes of my favourite movie"
and now these are "A comparison of SBD techniques evaluated on 2 hours
of my favourite movie", but such work cannot be compared across research
groups because everybody uses different data.  What makes the NIST
collection attractive for this is its availability.

> and short of unconstrained high-level requests (e.g., I need at least
> 4 seconds of military aircraft in flight or I need a segment with
> President Reagan handing out an award.) I was thinking we might ask
> systems to identify objects or object types.
> 
... now that is a jump in complexity of a couple of orders of magnitude
... it would bring in the image processing people, perhaps.

> One sort of object might be humans since examples appear in all of the
> videos - people alone, in small groups, and in crowds. Some people appear
> multiple times. They appear in various sorts of clothing, hats,
> locations, positions, etc. They show up in pictures within pictures,
> floating by, zooming, etc. They sometimes appear with their names and
> titles.
> 
>   Example:  Find all the humans (bit broad)
>             Find all humans with given characteristics (walking, talking,
>             ..., indoors, ...)
>             Find all instances of a particular human, given
>                 an image
>                 a set of images
>                 the name (text)
>                 ...
> 
> Another sort of recurring object is superimposed text of various sorts
> including names of people, titles, but also, on graphs and charts, views
> of documents, etc.
> 
>    Example: Find all the text (bit broad)
>             Find all text concerning given subject
> 
> There are of course other less frequently occurring objects: hands,
> fingers, airplanes, ships, cranes, fasteners, etc.
> 
> I think these two tasks are possible given the videos, but are they of
> interest to any/enough researchers??

The second task, object-based, would certainly be at the leading edge, I
think, but I'm not sure how many groups would be ready for it ...
remember that the function of a TREC-10 Video track is to be a dry run,
and we want to make it easy for groups who are already active in video
to take part.  If your group does video indexing and retrieval, but not
object-based work, then you can't do the above.  Am I correct in this
assumption?
> 
> Third, to Victor's list of issues:
> 
> * What is a sensible level of granularity for retrieval? Do we want to
> retrieve objects, frames, shots, scenes, etc?
> 
>   I assume the level would be chosen to fit the task, but perhaps the
>   proposal will include several possible tasks, each with a different
>   granularity for the retrieval. At the moment, I'm unable to come up
>   with a task that naturally targets something larger than a shot.
> 
> * Is it a good idea to prevent participants from "downgrading" to a
> simpler speech retrieval or speech-based question answering task?
> 
>   I think we want to try to design things so that systems must deal
>   with the video channel as opposed to primarily or exclusively the
>   audio. It should not become primarily a speech recognition task.
> 
I (Alan) agree wholly with this.

> * Should we also consider measuring the efficacy of browsing based
> approaches? They seem particularly appropriate for video retrieval
> tasks.
> 
>   Yes, if we can figure out an appropriate task and measurements.
>   So this would mean a task definition which includes a human searcher.
> 
and I (Alan) agree wholly with this too, referring to my point above.
Remember the situation we have as of now vis-a-vis research into
indexing/browsing/retrieval of video ... we have several (many?)
research groups all doing their own thing, everybody doing some kind of
shot boundary detection and then everybody heading in different
directions.  Some do OCR on captions, some face recognition, some scene
clustering, some use audio for navigation or for helping with SBD, some
work on object tracking ... etc.  What we all have in common, though,
is that we all have search/browsing interfaces.  So a
lowest-common-denominator task would be something "rough", like
browsing, which works for all and lets a variety of research groups take
part.

Once again, I'm wondering, am I correct in this assumption?

- Alan




