Re: Video TREC mailing list
Paul Over wrote:
>
> Posting for the group at DCU.....
>
> I am currently working on audio feature extraction for video
> TREC, and encountered an issue which is potentially relevant
> to others. The common shot boundary files provide a
> description of the duration of each shot, which can be as
> short as 1 frame. This is roughly 33 ms in time.
>
> I am working on speech and music (instrumental sound)
> extraction. To distinguish between music and speech, humans
> require several hundred milliseconds of sound. However, the
> current shot boundaries are often far shorter than that,
> which creates a problem in these applications.
>
> My proposal to solve this problem is to suggest a minimum
> shot duration of 15 frames (i.e. 0.5 sec. in time), below
> which the shots are not considered to be relevant for audio
> feature extraction. In this way the common shot boundary
> numbering is preserved, but audio features will not be
> extracted from shots with fewer than 15 frames, as they are
> irrelevant from an audio perspective.

Hello,

The reference segmentation is far from perfect. It has not been
produced (or even checked) manually. As mentioned in the README
file, it was produced by an automatic system (which is not so
bad considering the TREC 2001 SBD evaluation). This system is
strongly oriented towards high transition recall (i.e. missing
as few transitions as possible), and there is therefore a lot of
over-segmentation in the reference segmentation. It would also
have been preferable to get segmentations from several SBD
systems and to merge the results by a majority vote.

I computed the histogram of shot durations in the reference
segmentation, and it looks like this:

duration(frames) | count | ratio(/35099)
               1 |     1 |  0.00%
               2 |  1293 |  3.68%
               3 |   910 |  2.59%
               4 |   215 |  0.61%
               5 |  1168 |  3.33%
               6 |   220 |  0.63%
               7 |   195 |  0.56%
               8 |   183 |  0.52%
               9 |   149 |  0.42%
              10 |   358 |  1.02%
              11 |   140 |  0.40%
              12 |   190 |  0.54%
              13 |   184 |  0.52%
              14 |   185 |  0.53%
              15 |   432 |  1.23%
              16 |   203 |  0.58%
              17 |   191 |  0.54%
              18 |   175 |  0.50%
              19 |   171 |  0.49%
              20 |   311 |  0.89%
              21 |   221 |  0.63%
              22 |   181 |  0.52%
              23 |   199 |  0.57%
              24 |   201 |  0.57%
              25 |   255 |  0.73%
              26 |   188 |  0.54%
              27 |   149 |  0.42%
              28 |   156 |  0.44%
              29 |   143 |  0.41%
              30 |   240 |  0.68%
            1-15 |  5823 | 16.59%
            1-30 |  8807 | 25.09%

For comparison, the histogram for the manually annotated/checked
TREC 2001 SBD collection looks like:

duration(frames) | count | ratio(/3026)
               1 |     0 |  0.00%
               2 |     0 |  0.00%
               3 |     0 |  0.00%
               4 |     0 |  0.00%
               5 |     1 |  0.03%
               6 |     0 |  0.00%
               7 |     2 |  0.07%
               8 |     2 |  0.07%
               9 |     0 |  0.00%
              10 |     3 |  0.10%
              11 |    11 |  0.36%
              12 |     5 |  0.17%
              13 |     7 |  0.23%
              14 |     2 |  0.07%
              15 |    12 |  0.40%
              16 |     2 |  0.07%
              17 |     4 |  0.13%
              18 |     7 |  0.23%
              19 |     5 |  0.17%
              20 |    11 |  0.36%
              21 |     3 |  0.10%
              22 |     2 |  0.07%
              23 |     2 |  0.07%
              24 |     2 |  0.07%
              25 |     4 |  0.13%
              26 |     3 |  0.10%
              27 |     6 |  0.20%
              28 |     9 |  0.30%
              29 |     4 |  0.13%
              30 |    10 |  0.33%
            1-15 |    45 |  1.49%
            1-30 |   119 |  3.93%

This tends to indicate that most of the short and very short
shots correspond to over-segmentation (probably during portions
with high motion or large illumination changes).
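
For anyone wanting to reproduce this kind of table on their own
segmentation, here is a minimal Python sketch. It is not the tool
used for the tables above; the function names and the toy
durations are assumptions made for illustration, and the input is
assumed to be a plain list of shot durations in frames.

```python
from collections import Counter

def duration_histogram(durations, max_frames=30):
    """Return (duration, count, percent of all shots) for 1..max_frames."""
    if not durations:
        raise ValueError("no shots given")
    total = len(durations)
    counts = Counter(durations)
    return [(d, counts.get(d, 0), 100.0 * counts.get(d, 0) / total)
            for d in range(1, max_frames + 1)]

def cumulative_share(durations, limit):
    """Count and percentage of shots lasting at most `limit` frames."""
    if not durations:
        raise ValueError("no shots given")
    n = sum(1 for d in durations if d <= limit)
    return n, 100.0 * n / len(durations)

# Toy input: hypothetical durations, not taken from the collection.
durations = [2, 2, 5, 40, 120, 300, 3, 90, 15, 600]
for d, n, pct in duration_histogram(durations, max_frames=5):
    print(f"{d:>16} | {n:>5} | {pct:5.2f}%")
n, pct = cumulative_share(durations, 15)
print(f"{'1-15':>16} | {n:>5} | {pct:5.2f}%")
```

The ratio is taken over the total number of shots in the
segmentation being measured, so each collection should be
divided by its own shot count.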

Considering the problems mentioned, two options are possible,
in line with the suggestions made:

1) Leave the reference segmentation "as is" and simply ignore
shots whose duration is inappropriate for a given task (the
threshold may depend upon the task). Shots whose duration is
less than 0.5 s account for about 16% of the shot count but
only for about 1% of the collection duration.

2) Recompute the segmentation with an additional constraint of
a minimum shot duration. I see at least two ways of doing it:
one fast and simple (stupid merge), and the other a bit more
complicated and costly (remove the weakest transitions around
too-short shots, which needs a software modification and a
complete segmentation re-run). This also requires that the
minimum shot duration be identical for all tasks.
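
The fast and simple merge in option 2 could look roughly like
the Python sketch below. The merge rule is an assumption, since
the message does not specify one: a too-short shot is absorbed
into the preceding shot, and a too-short first shot is merged
into the one that follows it. Shots are assumed to be contiguous
(start, end) frame pairs with the end frame exclusive, and the
function name is hypothetical.

```python
def merge_short_shots(shots, min_frames=15):
    """Merge shots shorter than min_frames into a neighbouring shot.

    shots: contiguous (start, end) pairs in frames, end exclusive.
    """
    merged = []
    for start, end in shots:
        if merged and (end - start) < min_frames:
            # Assumed rule: absorb the short shot into the previous one.
            prev_start, _ = merged[-1]
            merged[-1] = (prev_start, end)
        else:
            merged.append((start, end))
    # The first shot has no predecessor, so merge it forward if needed.
    if len(merged) > 1 and (merged[0][1] - merged[0][0]) < min_frames:
        merged[0:2] = [(merged[0][0], merged[1][1])]
    return merged

# Toy example with hypothetical boundaries.
shots = [(0, 3), (3, 100), (100, 102), (102, 107), (107, 400)]
print(merge_short_shots(shots, min_frames=15))
```

Note that any merge of this kind changes the shot numbering,
unlike option 1.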

Mixed solutions are also possible: a minimum duration for the
reference segmentation plus, if necessary, a higher minimum
duration for some tasks. In all cases, fixing a minimum shot
duration around 10-15 frames is likely to significantly improve
the segmentation quality and usability. If there is a consensus
about the need, and if there is still time to modify the
reference segmentation, I believe I can do it in less than one
week.
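
Ignoring short shots on a per-task basis, while keeping the
common numbering, might be sketched as follows in Python. The
task names and thresholds in TASK_MIN_FRAMES and the toy
durations are hypothetical; only the idea of a task-dependent
minimum duration comes from the discussion above.

```python
# Hypothetical per-task minimum durations, in frames.
TASK_MIN_FRAMES = {"audio": 15, "visual": 5}

def usable_shots(durations, min_frames):
    """Indices of shots long enough for a task; numbering is unchanged."""
    return [i for i, d in enumerate(durations) if d >= min_frames]

def short_shot_shares(durations, min_frames):
    """Percent of the shot count and of the total duration that is ignored."""
    short = [d for d in durations if d < min_frames]
    count_share = 100.0 * len(short) / len(durations)
    duration_share = 100.0 * sum(short) / sum(durations)
    return count_share, duration_share

# Toy input: hypothetical shot durations in frames.
durations = [2, 5, 300, 10, 450, 233]
print(usable_shots(durations, TASK_MIN_FRAMES["audio"]))
print(short_shot_shares(durations, TASK_MIN_FRAMES["audio"]))
```

On the toy input, half of the shots are ignored for the audio
task, yet they cover under 2% of the total duration, which is
the same kind of imbalance as the 16% versus 1% figures quoted
for the reference segmentation.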

One comment about audio processing: visual transitions usually
do not coincide with audio transitions (speaker/speaker,
speaker/music, ... or silence/music). So a better strategy for
audio feature extraction/detection might be to extract and
segment the features on whole files, independently of the image
track segmentation, and then to choose, for each resulting audio
segment, the shots that best match it, since results are to be
reported per shot.
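
A minimal Python sketch of that last matching step is given
below. The matching criterion (largest temporal overlap) is an
assumption, as are the function names, the labels and the toy
timings; the audio segmentation itself is taken as already
computed on the whole file.

```python
def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) spans."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def best_matching_shot(audio_span, shots):
    """Index of the shot overlapping audio_span the most, or None."""
    best_index, best_overlap = None, 0.0
    for i, shot in enumerate(shots):
        o = overlap(audio_span, shot)
        if o > best_overlap:
            best_index, best_overlap = i, o
    return best_index

# Toy data: shot boundaries and audio segments in seconds (hypothetical).
shots = [(0.0, 4.0), (4.0, 4.1), (4.1, 10.0)]
audio_segments = [("speech", (0.5, 4.05)), ("music", (4.05, 9.0))]
for label, span in audio_segments:
    print(label, "-> shot", best_matching_shot(span, shots))
```

With this criterion, a very short shot such as the 0.1 s one
in the toy data is never selected when a longer neighbour
overlaps the same audio segment.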
Best regards.
Georges Quénot.
Email: Georges.Quenot@imag.fr
CLIPS-IMAG, 385, rue de la Bibliothèque, B.P. 53, 38041 Grenoble Cedex 9
Tel: (33-4) 76 63 58 55, Fax: (33-4) 76 44 66 75