This thesis proposes a method to detect moving objects, especially in high
spatial resolution video streams. The method uses data taken from the
compressed domain of the video: the sizes of the macroblocks to detect the
skeleton of a moving object and the motion vectors to detect its details.
CHAPTER 2.
METHODOLOGY
2.1. Video compression standard H264
Before proposing the moving object detection method, this chapter presents some
information about H264, a popular video compression standard, which is used to
encode and decode the surveillance video in this thesis.
Nowadays, installing surveillance cameras in homes has become quite common.
Video data from a surveillance camera recorded over a long period of time is
usually very large. Consequently, videos need to be preprocessed and encoded
before being used and transmitted over the network. Many compression standards
are recognized and widely used. One of them is H264, or MPEG-4 Part 10 [26], a
compression standard ratified by the ITU-T Video Coding Experts Group and the
ISO/IEC Moving Picture Experts Group.
2.1.1. H264 file structure
Typically, video captured from a camera is compressed using a common video
compression standard such as H261, H263, MPEG-4, H264/AVC, or H265/HEVC. In
this thesis, I encode and decode the video using H264/AVC, also known as MPEG-4
Part 10, standardized jointly by the ITU-T Video Coding Experts Group and the
ISO/IEC Moving Picture Experts Group.
Typically, an H264 file is split into packets called Network Abstraction Layer
Units (NALUs) [27], as shown in Fig. 2.1.
Figure 2.1. The structure of an H264 file
The first byte of a NALU indicates its type, which determines the NALU's
structure: it can carry a slice or a parameter set needed for decompression.
The meanings of the NALU types are given in Table 2.1; a small parsing sketch
follows the table.
Table 2.1. NALU types
Type Definition
0 Undefined
1 Slice layer without partitioning, non-IDR
2 Slice data partition A layer
3 Slice data partition B layer
4 Slice data partition C layer
5 Slice layer without partitioning, IDR
6 Additional information (SEI)
7 Sequence parameter set
8 Picture parameter set
9 Access unit delimiter
10 End of sequence
11 End of stream
12 Filler data
13..23 Reserved
24..31 Undefined
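As a small illustration, the sketch below decodes the one-byte NALU header into
its three fields as laid out in ISO/IEC 14496-10 (one forbidden_zero bit, two
nal_ref_idc bits, five nal_unit_type bits); the struct and function names are
my own.

```cpp
#include <cstdint>

// Decoding the one-byte NALU header (ISO/IEC 14496-10): one forbidden_zero
// bit, two nal_ref_idc bits, five nal_unit_type bits.
struct NaluHeader {
    bool    forbiddenZeroBit;  // must be 0 in a conforming stream
    uint8_t nalRefIdc;         // 0 means the NALU is not used as a reference
    uint8_t nalUnitType;       // a value from Table 2.1 (1 = non-IDR slice, 5 = IDR, 7 = SPS, ...)
};

NaluHeader parseNaluHeader(uint8_t firstByte) {
    NaluHeader h;
    h.forbiddenZeroBit = (firstByte & 0x80) != 0;
    h.nalRefIdc        = (firstByte >> 5) & 0x03;
    h.nalUnitType      = firstByte & 0x1F;
    return h;
}
```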
Apart from the NALU header, the rest of the NALU is called the RBSP (Raw Byte
Sequence Payload). The RBSP contains the data of the SODB (String Of Data
Bits). According to the H264 specification (ISO/IEC 14496-10), if the SODB is
empty (no bits are present), the RBSP is also empty. The first byte of the RBSP
contains 8 bits of the SODB; each following byte of the RBSP contains up to 8
bits of the SODB, and so on until fewer than 8 bits of the SODB remain.
Figure 2.2. RBSP structure
A video is normally divided into frames, and the encoder encodes them one by
one. Each frame is encoded into slices, and each slice is divided into
macroblocks (MBs). Typically, each frame corresponds to one slice, but
sometimes a frame can be split into multiple slices. Slices are divided into
the categories shown in Table 2.2. A slice consists of a header and a data
section (Fig. 2.3). The header of the slice contains the slice type, the types
of MBs in the slice, and the frame number of the slice. The header also
contains information about the reference frame and the quantization parameters.
The data portion of the slice carries the information about the macroblocks.
Table 2.2. Slice types
Type Description
0 P-slice. Consists of P-macroblocks (each macroblock is predicted using
one reference frame) and/or I-macroblocks.
1 B-slice. Consists of B-macroblocks (each macroblock is predicted
using one or two reference frames) and/or I-macroblocks.
2 I-slice. Contains only I-macroblocks. Each macroblock is predicted
from previously coded blocks of the same slice.
3 SP-slice. Consists of P and/or I-macroblocks and lets you switch
between encoded streams.
4 SI-slice. It consists of a special type of SI-macroblocks and lets you
switch between encoded streams.
5 P-slice (as type 0, but all slices of the picture have this type).
6 B-slice (as type 1, but all slices of the picture have this type).
7 I-slice (as type 2, but all slices of the picture have this type).
8 SP-slice (as type 3, but all slices of the picture have this type).
9 SI-slice (as type 4, but all slices of the picture have this type).
Figure 2.3. Slice structure
2.1.2. Macroblock
The basic principle of a compression standard is to split the video into groups
of frames. Each frame is divided into basic processing units; in the H264/AVC
standard, this unit is the macroblock (MB), a region of 16x16 pixels. In data
regions carrying more detail, MBs are further subdivided into smaller blocks
(e.g., 8x8 or 4x4 pixels). After compression, each MB contains the information
used to recover the video later, including the motion vector, residual values,
quantization parameter, etc., as in Fig. 2.4, where:
• ADDR is the position of the macroblock in the frame;
• TYPE is the macroblock type;
• QUANT is the quantization parameter;
• VECTOR is the motion vector;
• CBP (Coded Block Pattern) indicates which blocks inside the MB contain coded
residual coefficients;
• bN is the encoded residual data of the color channels (4 luma Y blocks, 1 Cr,
1 Cb).
Figure 2.4. Macroblock structure
During decompression, the video decoder receives the compressed video data as a
stream of binary data, parses the syntax elements, and extracts the encoded
information, including the transform coefficients, the size of each MB (in
bits), the motion prediction information, and so on, and then performs the
inverse transformations to restore the original image data.
2.1.3. Motion vector
With H264 compression, the macroblocks of a frame are predicted based on
information that has already been transferred from the encoder to the decoder.
There are two ways of prediction: intra-frame prediction and inter-frame
prediction. Intra-frame prediction predicts a macroblock from already
compressed image data in the same frame, while inter-frame prediction uses
previously compressed frames. Inter-frame prediction is accomplished through a
motion estimation and compensation process: the motion estimator finds the
macroblock in the reference frame that is closest to the new macroblock and
calculates the motion vector, which characterizes the displacement of the
macroblock being encoded relative to the reference frame.
The referenced macroblock is sent to the subtractor together with the new
macroblock to be coded in order to find the prediction error, or residual
signal, which characterizes the difference between the predicted macroblock and
the actual macroblock. The residual signal is transformed with the Discrete
Cosine Transform and quantized to reduce the number of bits to be stored or
transmitted. These coefficients, together with the motion vectors, are passed
to the entropy coder to produce the bitstream. The binary video stream includes
transform coefficients, motion prediction information, compressed data
structure information, and more. To perform video compression, one compares the
values of two frames, one of which is used as a reference. When we want to
compress the MB at position i of a frame, the video compression algorithm
searches the reference frame for the MB that differs least from the MB at
position i. If such an MB is found in the reference frame at position j, the
displacement between i and j is called the motion vector (MV) of the MB at
position i (Fig. 2.5). Normally, an MV consists of two values: x (the
horizontal displacement) and y (the vertical displacement).
Figure 2.5. The motion vector of a Macroblock
Note that the MV of an MB does not really describe the motion of the objects in
that MB; it merely represents the displacement to the reference-frame pixels
that best match the pixels in the MB.
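To make the encoder-side idea concrete, the following sketch shows a naive
full-search block-matching routine that returns the displacement minimizing the
sum of absolute differences (SAD) over a 16x16 block. This is only a minimal
illustration: real encoders such as JM use fast search patterns and
rate-distortion costs, and all names here are illustrative.

```cpp
#include <climits>
#include <cstdint>
#include <cstdlib>

struct MotionVector { int x; int y; };

// Naive full-search block matching over a +/-range window. cur and ref are
// 8-bit luma planes of size width x height; (bx, by) is the top-left pixel of
// the 16x16 MB being encoded.
MotionVector estimateMV(const uint8_t* cur, const uint8_t* ref,
                        int width, int height, int bx, int by, int range) {
    MotionVector best{0, 0};
    long bestSad = LONG_MAX;
    for (int dy = -range; dy <= range; ++dy)
        for (int dx = -range; dx <= range; ++dx) {
            const int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height) continue;
            long sad = 0;  // sum of absolute differences of the two 16x16 blocks
            for (int y = 0; y < 16; ++y)
                for (int x = 0; x < 16; ++x)
                    sad += std::abs(int(cur[(by + y) * width + (bx + x)]) -
                                    int(ref[(ry + y) * width + (rx + x)]));
            if (sad < bestSad) { bestSad = sad; best = {dx, dy}; }
        }
    return best;  // stored as the MB's motion vector (x, y)
}
```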
2.2. Proposed method
This section describes the processing of the proposed moving object detection
method. The processing includes three phases: macroblock-based segmentation,
object-based segmentation, and object refinement.
2.2.1. Processing the video bitstream
The video data is taken directly from the surveillance camera in the form of an
H264 bitstream and transported to the processing device. To get the MV and MB
information, I use the libraries LIVE555 [28] and JM 19.0 [29]. LIVE555 is a
free, open-source C++ library for sending and receiving media streams over the
RTP/RTCP, RTSP, and SIP protocols. The LIVE555 Streaming Media module is
responsible for connecting, authenticating, and receiving data from the RTSP
stream taken directly from the surveillance camera. Besides receiving packets,
LIVE555 Streaming Media also strips the packet headers; the results from this
module are therefore NALUs (refer to ISO/IEC 14496-10 [26]). Each NALU is then
transferred to JM 19.0, a free H264 reference decoder commonly used in study
and research, for processing. The original input of the JM 19.0 decoder module
is a compressed video file in the H264 format (described in Annex B of ISO/IEC
14496-10), and its original output is the decompressed video in YUV format.
However, in order to reduce the computation time and volume as originally
planned, I modified this library to stop after extracting the required
information, without fully decoding the video.
The MVs and MBs are then used to detect moving objects. I propose a method that
combines both MVs and MB sizes to determine the motion in the video. This
method can be applied to both indoor and outdoor environments. Because it uses
data from the compressed domain, its processing time is much lower than that of
methods working in the pixel domain. The moving object detection method
consists of three phases: macroblock-based segmentation, object-based
segmentation, and object refinement, as shown in Fig. 2.6.
Figure 2.6. The process of moving object detection method
2.2.2. Macroblock-based Segmentation
This phase is based on Poppe's approach [24]. I use the storage size of each MB
after encoding (in bits) to determine which MBs contain movement. This works
because MBs containing moving objects are often more detailed than others;
their compression ratio is therefore usually lower, making their size much
larger than that of the MBs in the background. Fig. 2.8 gives an example of an
outdoor frame and an indoor frame and shows the correlation between the motion
information and the MB sizes. Fig. 2.8 (a) shows the original frames (first an
outdoor frame, then an indoor frame), and Fig. 2.8 (b) shows the map of the MB
sizes in those frames. Each square in Fig. 2.8 (b) represents the size of one
MB: the larger the size, the whiter the square. As we can see, the MB sizes are
larger in the moving regions (e.g., the vehicles and the shaking leaves).
I use the MB size to classify MBs into two types: MBs that can belong to a
moving object and MBs that can belong to the background. To do this, I compare
the size of each MB with a threshold Ts. If the size of the MB is greater than
Ts, I mark the MB as "can be the moving object"; otherwise, I mark it as "can
be the background".
Figure 2.7. Skipped Macroblock
However, Poppe's approach [24] has an important constraint: "A general
conclusion is that MBs corresponding to (the edges of) moving objects will
typically contain more bits in the bitstream than those representing BG". This
means the algorithm works well only on MBs that contain the edges of moving
objects. The reason is that H264 provides a "skip_mode" for some special MBs:
if a region has uniform color, such as a shirt, a wall, or a car door, the
encoder does not need to send the information of the MBs in that region to the
decoder. The decoder estimates a value for each skipped MB from the neighboring
coded MBs and uses it to calculate a motion compensated prediction for the
skipped MB. Since there is no residual information, the motion compensated
prediction is inserted directly into the decoded frame or field. As a result,
some MBs that should be considered moving objects have a size equal to zero. To
solve this problem, we apply a preprocessing step that recalculates the size of
every skipped MB as the average of the sizes of the MBs on its left, above, and
above-right (Fig. 2.7). All MBs considered to be moving objects are then merged
using the 8-neighbor algorithm to yield segments before the next phase,
object-based segmentation, is applied.
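The two steps of this paragraph can be sketched as follows; the grid layout and
function names are illustrative, not taken from the implementation:

```cpp
#include <queue>
#include <utility>
#include <vector>

// Sizes are stored row-major in a (cols x rows) grid; skipped MBs arrive with
// size 0.

// Recalculate each skipped MB's size as the average of the MBs on its left,
// above, and above-right (Fig. 2.7), where those neighbors exist.
void fillSkippedSizes(std::vector<int>& size, const std::vector<bool>& skipped,
                      int cols, int rows) {
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            const int i = r * cols + c;
            if (!skipped[i]) continue;
            int sum = 0, n = 0;
            if (c > 0)                 { sum += size[i - 1];        ++n; } // left
            if (r > 0)                 { sum += size[i - cols];     ++n; } // above
            if (r > 0 && c + 1 < cols) { sum += size[i - cols + 1]; ++n; } // above-right
            size[i] = n ? sum / n : 0;
        }
}

// Merge candidate MBs into segments with 8-connectivity (BFS flood fill).
// Returns one label per MB: 0 = background, 1..K = segment id.
std::vector<int> label8Connected(const std::vector<bool>& moving, int cols, int rows) {
    std::vector<int> label(moving.size(), 0);
    int next = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            if (!moving[r * cols + c] || label[r * cols + c]) continue;
            label[r * cols + c] = ++next;
            std::queue<std::pair<int, int>> q;
            q.push({c, r});
            while (!q.empty()) {
                const auto [cc, cr] = q.front();
                q.pop();
                for (int dr = -1; dr <= 1; ++dr)
                    for (int dc = -1; dc <= 1; ++dc) {
                        const int nc = cc + dc, nr = cr + dr;
                        if (nc < 0 || nr < 0 || nc >= cols || nr >= rows) continue;
                        const int ni = nr * cols + nc;
                        if (moving[ni] && !label[ni]) { label[ni] = next; q.push({nc, nr}); }
                    }
            }
        }
    return label;
}
```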
Figure 2.8. (a) Outdoor and indoor frames, (b) the "size map" of the frames,
(c) the "motion map" of the frames
2.2.3. Object-based Segmentation
It is desirable that the background model adapt to gradual changes in the
appearance of the scene. For example, in an outdoor environment, or through the
window of a house, the light intensity typically varies during the day, and
dynamic background such as rain, moving clouds, swaying tree branches, etc. can
be seen anywhere. Observations of the motion vector field have shown that the
motion vectors of rigid moving objects usually have similar direction and
length, while the motion vectors of uninteresting moving objects (in the
following sections we call them noise motion), such as swaying tree branches,
usually have various directions and lengths. Noise motion like leaves or tree
branches still produces large MB sizes, but its segments usually contain holes,
whereas the movement of a human sometimes has various directions and lengths
but no holes in the segment. As shown in Fig. 2.9, the car, motorbike, and
human (in the rectangles) are the moving objects of interest, while the waves
of water and the lights (in the circles) are the uninteresting moving objects,
or noise.
The level of consistency of the MV field and the density of a segment are
exploited to identify the interesting motions, such as the movement of humans
and vehicles, and to remove noise motion, especially swaying branches. I define
a segment as having a "consistent" MV field if its MV directions and MV lengths
are "consistent". The motion vector directions are "consistent" if there exist
TC (90%) motion vectors such that the angle between any two of them stays
smaller than TA (10°). Likewise, the motion vector lengths are "consistent" if
there exist TC (90%) motion vectors whose pairwise length difference stays
smaller than TL (20). The density of a segment relates the number of MBs in the
segment to the number of MBs on its margin. The process of object-based
segmentation thus consists of two steps: checking the level of consistency of
the motion vector field and checking the level of the segment's density.
Figure 2.9. Example of "consistent" motion vectors
For the level of consistency of the motion vectors, we first normalize the MV
directions to angles (in degrees) between the MVs and the positive X-axis (in
Cartesian coordinates) and the motion lengths to integer values. Specifically,
an MV (x, y) with direction Md and length Ml is normalized as follows:
$$M_d = \begin{cases} \operatorname{round}\!\left(\dfrac{\arctan(y/x)}{\pi}\cdot 180\right) & \text{with } x, y \neq 0,\\[4pt] 90 & \text{with } x = 0,\ y \neq 0,\\[2pt] 0 & \text{with } x \neq 0,\ y = 0. \end{cases} \qquad (1)$$

$$M_l = \operatorname{round}\!\left(\sqrt{x^2 + y^2}\right). \qquad (2)$$
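A direct transcription of this normalization might look as follows. Note that
Eq. (1) is written with arctan(y/x), whose range only spans half a circle;
since the consistency check later treats 0° and 359° as adjacent, this sketch
uses atan2 mapped into [0°, 360°), which is my interpretation rather than
something the thesis states:

```cpp
#include <cmath>

// Normalize an MV (x, y) to an integer direction Md in degrees and an
// integer length Ml, following Eqs. (1) and (2).
void normalizeMV(int x, int y, int& Md, int& Ml) {
    const double kPi = std::acos(-1.0);
    if (x == 0 && y == 0) { Md = 0; Ml = 0; return; }   // zero MV: no motion
    double deg = std::atan2(double(y), double(x)) * 180.0 / kPi;
    if (deg < 0) deg += 360.0;                          // map (-180, 180] to [0, 360)
    Md = int(std::lround(deg)) % 360;                   // Eq. (1), integer degrees
    Ml = int(std::lround(std::sqrt(double(x) * x + double(y) * y))); // Eq. (2)
}
```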
After that, Chebyshev's inequality is applied to check the consistency of the
MV field:

$$p(|X - \mu_A| \ge k\sigma_A) \le \frac{1}{k^2} \qquad (3)$$
where X is a random variable representing the direction of a motion vector, and
$\mu_A$ and $\sigma_A$ are the mean and standard deviation of the distribution
of motion vector directions. From equation (3), in order to confirm that the MV
directions are consistent, we take $k = \frac{T_A}{2\sigma_A}$ and require
$\frac{1}{k^2} \le 1 - T_C$; therefore $\sigma_A \le \sqrt{\frac{T_A(1-T_C)}{4}}$.
So, if $\sigma_A \le \sqrt{\frac{T_A(1-T_C)}{4}}$, the segment is considered to
have consistent MV directions. The same condition is applied to the MV lengths:
if $\sigma_L \le \sqrt{\frac{T_L(1-T_C)}{4}}$, where $\mu_L$ and $\sigma_L$ are
the mean and standard deviation of the distribution of motion vector lengths,
the MV lengths are consistent. However, since the MV directions are normalized
to angles between the MV and the positive X-axis, 0° and 359° are adjacent on
the circle. Thus, we check $\sigma_A$ for every rotation of the direction
distribution (taking each angle from 0° to 359° in turn as the middle of the
axis); if there exists any $\sigma_{A_i}$ satisfying
$\sigma_{A_i} \le \sqrt{\frac{T_A(1-T_C)}{4}}$, the MV directions of the
segment are considered consistent.
For the level of density of a segment, we relate the number of MBs in each
segment, $N_{S_i}$, to the number of MBs lying on its margin, $M_{S_i}$. As
discussed above, a segment considered to be noise (not containing true motion)
usually contains holes (MBs with no information), so the share of its MBs that
lie on a margin is much larger than for a true motion segment. Therefore, we
can classify noise and true motion segments with the density level (or ratio):

$$Density = \frac{M_{S_i}}{N_{S_i}} \le T_{Density}.$$
Finally, a segment is considered an interesting moving object when it
satisfies:

$$\left(\sigma_L \le \sqrt{\frac{T_L(1-T_C)}{4}}\right) \wedge \left(\sigma_A \le \sqrt{\frac{T_A(1-T_C)}{4}}\right) \wedge \left(\frac{M_{S_i}}{N_{S_i}} \le T_{Density}\right). \qquad (4)$$
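Putting the pieces together, a sketch of the per-segment test of Eq. (4) could
look like the following. Handling the circular direction domain by exhaustively
trying every rotation is one simple realization of the check described above;
the thesis does not prescribe this exact procedure:

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// dirs and lens hold the normalized Md / Ml values of one segment's MVs;
// nMB and nMargin are its total and margin MB counts. Default thresholds
// follow the thesis: TC = 90%, TA = 10 deg, TL = 20, TDensity = 80%.
static double stddev(const std::vector<double>& v) {
    const double mu = std::accumulate(v.begin(), v.end(), 0.0) / v.size();
    double s2 = 0.0;
    for (double x : v) s2 += (x - mu) * (x - mu);
    return std::sqrt(s2 / v.size());
}

bool isInterestingObject(const std::vector<int>& dirs, const std::vector<int>& lens,
                         int nMB, int nMargin, double TC = 0.90, double TA = 10.0,
                         double TL = 20.0, double TDensity = 0.80) {
    const double thrA = std::sqrt(TA * (1.0 - TC) / 4.0);
    const double thrL = std::sqrt(TL * (1.0 - TC) / 4.0);

    // Directions live on a circle (0 and 359 degrees are adjacent), so try
    // every rotation and accept if any rotation yields a small sigma_A.
    bool dirConsistent = false;
    for (int shift = 0; shift < 360 && !dirConsistent; ++shift) {
        std::vector<double> rotated;
        rotated.reserve(dirs.size());
        for (int d : dirs) rotated.push_back((d + shift) % 360);
        dirConsistent = stddev(rotated) <= thrA;
    }

    const std::vector<double> l(lens.begin(), lens.end());
    const bool lenConsistent = stddev(l) <= thrL;

    // Density: segments full of holes have proportionally many margin MBs.
    const bool dense = double(nMargin) / double(nMB) <= TDensity;

    return dirConsistent && lenConsistent && dense;  // Eq. (4)
}
```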
2.2.4. Object Refinement
As discussed above, MBs containing moving objects usually carry more detail
than others, which means the block sizes of moving objects and of motion noise
(except in "skip_mode") are larger than those of the background. However, when
a moving object contains flat regions, these regions can be predicted well, so
their block sizes become small and the macroblock-based segmentation removes
these parts of the object. In this step, we try to recover them. Furthermore,
observation has shown that this case only occurs in objects that have
consistent motion. Based on an analysis of the motion directions and lengths,
we can check the MBs around the object to see whether they are part of it.
Starting from the MBs already marked as a moving object, we use the
breadth-first search algorithm to examine each layer around the segment, from
near to far, and test each MB with the following hypothesis: an MB with motion
direction A and motion length L is considered to belong to the moving object
if:
$$\left([\mu_A - \sigma_A] \le A \le [\mu_A + \sigma_A]\right) \wedge \left([\mu_L - \sigma_L] \le L \le [\mu_L + \sigma_L]\right). \qquad (5)$$
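A sketch of this refinement, assuming the normalized per-MB motion values and
the segment statistics ($\mu_A$, $\sigma_A$, $\mu_L$, $\sigma_L$) are already
available; the grid layout and names are illustrative:

```cpp
#include <cmath>
#include <queue>
#include <vector>

struct MbMotion { double A; double L; };  // normalized direction and length

// Refinement per Eq. (5): grow outwards from the detected MBs by breadth-first
// search, absorbing a neighboring MB if its motion direction A and length L
// lie within one standard deviation of the segment's mean.
void refineSegment(std::vector<bool>& moving, const std::vector<MbMotion>& mv,
                   int cols, int rows, double muA, double sigA,
                   double muL, double sigL) {
    std::queue<int> q;
    for (int i = 0; i < cols * rows; ++i)
        if (moving[i]) q.push(i);                 // seed with the object's MBs
    while (!q.empty()) {
        const int i = q.front(); q.pop();
        const int c = i % cols, r = i / cols;
        for (int dr = -1; dr <= 1; ++dr)
            for (int dc = -1; dc <= 1; ++dc) {
                const int nc = c + dc, nr = r + dr;
                if (nc < 0 || nr < 0 || nc >= cols || nr >= rows) continue;
                const int n = nr * cols + nc;
                if (moving[n]) continue;
                const bool dirOk = std::fabs(mv[n].A - muA) <= sigA; // Eq. (5)
                const bool lenOk = std::fabs(mv[n].L - muL) <= sigL; // Eq. (5)
                if (dirOk && lenOk) { moving[n] = true; q.push(n); } // recover MB
            }
    }
}
```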
Chapter Summarization
This chapter describes some basic information about the video compression
standard H264; the details of the standard can be found in the ISO/IEC Moving
Picture Experts Group document [26]. In this thesis, to receive the video
bitstream from the camera and parse it into NALUs, I use LIVE555, a free,
open-source library for processing H264 bitstreams. After that, JM 19.0 is used
to handle the received video and extract the MVs and MB sizes of each frame.
These MVs and MB sizes are the inputs of the object detection method described
above.
This chapter also proposes a new moving object detection method using MB sizes
and MVs. The method includes three phases. The first phase, macroblock-based
segmentation, detects the "skeleton" of the movement regions by comparing MB
sizes with a threshold Ts. After that, the object-based segmentation phase
determines which moving regions belong to interesting moving objects and which
belong to noise. Finally, the object refinement phase recovers the missing
movement MBs of flat regions. In the next chapter, I present the experimental
results and an application built using the method.
CHAPTER 3.
RESULTS
This thesis was carried out within the framework of the research project
"Nghiên Cứu Công Nghệ Tóm Tắt Video" (Research on Video Summarization
Technology), a cooperation between the University of Engineering and Technology
(UET) and VP9 Vietnam. Therefore, apart from the experimental results, my team
and I built an application using the proposed method. This application was
handed over to and approved by VP9 Vietnam. In the application, in order to aid
quick search for the moments that contain movement in a video, we provide a
suitable data structure for storing the motion information. With this data
structure, instead of having to search for motion over the whole frame, users
can search for motion in a region of interest to get better results.
3.1. The moving object detection application
When using surveillance cameras, the need to store and search for the moments
when movement happens is very important. When there is movement, the moving
image area is the area of interest; the other, static regions are called the
background. When the background is static (almost no change in pixel values),
motion detection can be performed simply by subtracting the current frame from
a reference frame (as in the pixel-domain methods mentioned above). In reality,
however, the background often changes due to noise or unwanted movements (such
as camera noise, shaking leaves, or stray light). Thus, detecting motion in
video frames in real time, and from that detecting and locating events in a
specific segment of a long video file, is a challenge. Searching for events in
large volumes of video, especially long-duration surveillance video, is a
time-consuming and laborious task for users and processors. Several published
studies offer solutions for automatically detecting and locating the time in a
video at which an event occurred. However, fast and efficient search for the
video segment containing an event has not been satisfactorily resolved, and
processing video data to find where an event occurred is still limited.
Therefore, analyzing and summarizing video data so that search becomes
convenient and effective still requires better solutions. Several related
results on video storage have been applied in industry.
The invention US6697523 [26], "Method for summarizing a video using motion and
color descriptors," relates to a method of extracting the motion information of
a video for automatic summarization. The method of this invention uses a
partially compressed video data stream but also image information (full
decompression), which consumes the computing resources of the device. A video
summary can be made simply by retrieving a frame that represents a video clip,
or by color analysis; this loses information, decreases the accuracy of the
results when searching, or requires complex computation in the pixel domain.
Furthermore, the invention does not propose an effective storage solution for
the synthesized information.
The invention US5956026A [27], "Method for hierarchical summarization and
browsing of digital video," relates to summarization and browsing by creating a
simplified hierarchical representation of the video using representative
frames. Each image represents a video shot, and the system must determine the
scene and frame number. Browsing is done through these representative frames.
The invention also uses extracted audio data to compute the video summary. It
does not offer a simplified method for storing the information of a video.
The invention US7751632B2 [28], "Intelligent, dynamic, long-term digital
surveillance media storage system," provides a method of analyzing multimedia
data streams for encoding and indexing data according to the real requirements
of the monitoring system. In particular, video content analysis is done based
on the classification of the motion data of each frame. From there, the system
chooses the optimal encoding technique for each frame and creates its own
descriptors for efficient storage. After the analysis, the video segments used
to choose the optimal encoding are deleted from the original file, and only the
descriptors are saved. The invention does not propose an integrated analysis of
frame-by-frame motion information, does not support frame-based motion search,
and does not have a hierarchical storage system for motion information.
The following section describes the application built using the proposed
method.
3.1.1. The process of the application
The process of the application is shown in Fig. 3.1. As mentioned above, the
video data is first taken directly from the surveillance camera in the form of
an H264 bitstream, which is essentially a real-time H264 file. The LIVE555 and
JM 19.0 libraries are used to implement step (1), entropy decoding. LIVE555 is
a free, open-source C++ library for sending and receiving media streams over
the RTP/RTCP, RTSP, and SIP protocols. The LIVE555 Streaming Media module is
responsible for connecting, authenticating, and receiving data from the RTSP
stream taken directly from the surveillance camera. Besides receiving packets,
LIVE555 Streaming Media also strips the packet headers; the results from this
module are therefore NALUs (refer to ISO/IEC 14496-10). Each NALU is then
transferred to JM 19.0, a free H264 reference decoder commonly used in study
and research, for processing. The original input of the JM 19.0 decoder module
is a compressed video file in the H264 format (described in Annex B of ISO/IEC
14496-10), and the original output is the decoded video in YUV format. However,
in order to reduce the computation time and volume as originally planned, I
modified this library to stop after extracting the required information,
without fully decoding the video.
This information is then used to perform process (2), moving object detection,
using the method proposed in chapter 2. The result of this process is a matrix
describing the positions of motion in each frame, called the movement map: a
position with motion has the value 1, otherwise 0.
Figure 3.1. The implementation process of the approach
The information in the movement map is used to perform step (3), synthesizing
movement. This step evaluates and classifies motion into several levels,
depending on the frequency and appearance of the motion, to obtain the motion
information.
The motion description information obtained from the above steps is then
reshaped and stored in a data structure that is convenient for later retrieval
and use in step (4), storing movement information. The details of steps (3) and
(4) are described below.
3.1.2. The motion information
In this thesis, motion information is understood as a value representing the
level of motion of the objects in the video. To obtain this information, we
first classify the motion in the video into real motion (caused by objects such
as human beings, vehicles, etc.) and motion due to noise. The types of noise
that can be observed are:
• Noise due to camera shake: characterized by large motion over the entire
frame, with a cycle.
• Noise due to camera quality: caused by low light intensity; usually small,
acyclic, but fairly evenly distributed.
• Noise due to light: blinking lights (cyclic noise), tube lights, etc. These
types of noise are cyclic, large, and hard to determine.
• Noise due to weather factors such as rain, clouds, etc.
Real motion can be divided into two types: normal movement and meaningful
movement. The notions of normal and meaningful depend on the circumstances of
the video. For example, in home video, shaking curtains cause visible movement,
but the meaningful movement is the movement of people in the scene; for motion
on the road, the types of motion are more difficult to define. In general, we
can divide motion as follows:
• Movement of cyclically moving equipment (such as rotor blades or rotating
wheels).
• Motion caused by the wind (leaves, curtain fabric). These movements are
usually large and can be cyclic.
• Movements of external light sources such as sunshine or lights (motorcycle
and automobile lights from afar). These movements are often difficult to
determine; they usually appear in night-time video.
• Lastly, real motion, such as people or vehicles moving in the observation
area.
3.1.3. Synthesizing movement information
The synthesis and classification of motion begins with calculating a motion
weight for each position in the frame (each position corresponds to one MB)
over a time interval T. For each position, we weight the motion of its MB at
each moment (i.e., each frame) during T as follows:
• If the MB is moving at the moment under review, the weight of motion at that
moment equals the count of the immediately preceding consecutive moments of
motion.
• Otherwise, if the MB has no motion at the moment under review, the weight is
zero.
Then the motion weight of each position in the composite frame after time T
equals the sum of the per-moment weights over the whole period T.
After calculating the motion weights, we evaluate the motion level to classify
the motion of each position in the composite frame after time T, based on the
weight calculated in the previous step. The level of motion is divided into
four levels encoded with two bits: no movement (00), little movement or noise
(01), movement (10), and much movement (11). The movement level values are then
saved in a two-dimensional array (see the sketch below).
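A sketch of the weight computation and level classification over an interval of
T frames; the binary movement maps come from process (2), and the concrete
level thresholds below are illustrative choices, since the thesis does not
state them:

```cpp
#include <cstdint>
#include <vector>

// maps holds one binary movement map per frame over the interval T
// (maps[t][i] == 1 if MB i moves in frame t). The per-frame weight of a moving
// MB is the count of immediately preceding consecutive moving frames; the
// total weight is the sum over T.
std::vector<uint8_t> synthesizeLevels(const std::vector<std::vector<int>>& maps,
                                      int numMBs) {
    std::vector<long> weight(numMBs, 0), run(numMBs, 0);
    for (const auto& frame : maps)
        for (int i = 0; i < numMBs; ++i) {
            if (frame[i]) { weight[i] += run[i]; ++run[i]; } // weight = preceding run length
            else run[i] = 0;                                 // no motion: weight 0, run resets
        }
    const double T = double(maps.size());
    std::vector<uint8_t> level(numMBs);
    for (int i = 0; i < numMBs; ++i) {
        const double w = weight[i] / (T * T);  // rough normalization (max ~ 0.5)
        if (weight[i] == 0) level[i] = 0b00;   // no movement
        else if (w < 0.05)  level[i] = 0b01;   // little movement / noise
        else if (w < 0.25)  level[i] = 0b10;   // movement
        else                level[i] = 0b11;   // much movement
    }
    return level;
}
```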
Figure 3.2. Data structure for storing motion information
3.1.4. Storing Movement Information
This step stores the movement information obtained from the synthesis step
described above. The movement information is stored according to the spatial
and temporal hierarchy of the video. The structure that stores the motion
description information is depicted in Fig. 3.2, where:
• Level 1 is a folder that contains the aggregate data of each video, stored
period by period.
• Level 2 contains the folders holding the data files, organized by horizontal
position in the frame along the temporal dimension.
• Level 3 contains the files holding the movement information of the blocks in
each column of the frame along the temporal dimension.
• Level 4 is the content of the files at level 3. These files contain two-bit
values from 0 to 3; each value is the level of motion of a block during a time
unit T (which may be 1 second, 2 seconds, 3 seconds, 10 seconds, etc.). The
user can modify T through a parameter.
The advantage of this data structure appears when searching for the moments
when movement happens: the user can choose an area (corresponding to some MBs),
and the search time is shorter because the application only searches the files
corresponding to the chosen MBs. Moreover, predefining the search region
(region of interest) makes the results more accurate than searching the full
frame, as the sketch below illustrates.
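The sketch shows the intended access pattern: a query restricted to a range of
MB columns only opens the level-3 files of those columns. The "col_<n>.bin"
naming scheme is my assumption for illustration; the thesis only specifies the
four-level hierarchy.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Region-restricted lookup: with one level-3 file per MB column, a query over
// a region of interest only touches the files of the columns in that region.
std::vector<std::vector<uint8_t>>
queryRegion(const std::string& videoDir, int colFirst, int colLast) {
    std::vector<std::vector<uint8_t>> levelsPerColumn;
    for (int c = colFirst; c <= colLast; ++c) {
        std::ifstream f(videoDir + "/col_" + std::to_string(c) + ".bin",
                        std::ios::binary);
        // One motion level (0..3) per time unit T, in chronological order.
        std::vector<uint8_t> levels((std::istreambuf_iterator<char>(f)),
                                    std::istreambuf_iterator<char>());
        levelsPerColumn.push_back(std::move(levels));
    }
    return levelsPerColumn; // scan for values >= 2 to locate movement moments
}
```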
3.2. Experiments
3.2.1. Dataset
The proposed method is designed to operate with a fixed, downward-facing
camera. The maximum resolution of the videos is 1920x1080 pixels. The program
can be installed directly on a device attached to the camera, such as a
Raspberry Pi running the Linux operating system, while still guaranteeing
real-time processing.
The experimental data was provided by the VP9 Vietnam company and processed by
the HMI laboratory, University of Engineering and Technology. The data set
includes 43 videos with resolutions of 1280x720 and 1920x1080. In addition, the
method was run on live data from more than 100 cameras installed in Hanoi and
Da Nang City, provided by VP9, covering both indoor and outdoor scenes. The
videos exhibit various lighting and environmental conditions, including outdoor
light (strong and low sunshine), artificial light (tube, LED), wind, rain, etc.
It can be said that the data set supplies the different situations and
environments needed for the moving object detection problem.
Figure 3.3. Example frames of test videos
To gather statistics for this report, I made the ground truth for 7 videos with
resolutions of 1280x720 and 1920x1080 and used these videos for the
experimental results. Table 3.1 describes the videos used. Fig. 3.3 shows
example frames of the test videos (Figure 3.3a is a frame of TrongNha_02,
Figure 3.3b of DNG8_1708, Figure 3.3c of NEM1_131, Figure 3.3d of HMI_WetRoad,
Figure 3.3e of CuaHang_01, and Figure 3.3f of HMI_OutDoor). These videos were
captured in different environments and circumstances. Fig. 3.4 depicts some of
their frames and the corresponding ground truth.
Table 3.1. The information of test videos

Video       | Resolution  | Place
HMI_WetRoad | 1920 × 1080 | Outdoor
HMI_OutDoor | 1280 × 720  | Outdoor
GVO2_0308   | 1280 × 720  | Outdoor
NEM1_131    | 1920 × 1080 | Indoor
DNG8_1708   | 1920 × 1080 | Outdoor
CuaHang_01  | 1280 × 720  | Indoor
TrongNha_02 | 1280 × 720  | Indoor
In addition, to compare with the approach of Poppe [24], on which the
macroblock-based segmentation phase is based, we use a second dataset from the
IEEE Change Detection Workshop 2014 [30]. The experimental process is thus
carried out on 2 datasets comprising 11 test sequences, divided into 2 groups.
The first group consists of 4 test sequences from the baseline profile of the
IEEE Change Detection Workshop 2014: PETS2006, Pedestrians, Highway, and
Office. Both the video frames and the motion ground truth can be downloaded
from the Changedetection homepage. We use ffmpeg [31] to create compressed
video from the given frames with all encoding parameters set to default. Fig.
3.5 shows an example frame of the Pedestrians test sequence (a) and its motion
ground truth (b). Table 3.2 lists the information of the four videos: the 1st
column is the video name, and the next three columns are the resolution, the
frame rate, and the quantization parameter (qp) of each video, respectively.
The videos in the 1st group have different resolutions, but they are all
low-resolution videos; the frame rate is 25 fps and the qp value depends on the
video. These videos are quite similar to the videos in Poppe's experiment.
Figure 3.4. Example frames and their ground truth
Table 3.2. The information of test sequences in group 1

Video       | Resolution | fps | qp
pedestrians | 360 × 240  | 25  | 25
PETS2006    | 720 × 576  | 25  | 27
Highway     | 320 × 240  | 25  | 23
Office      | 360 × 240  | 25  | 23
The videos in the 2nd group are the 7 videos mentioned above. They come from
actual indoor and outdoor surveillance cameras, without scripting or prior
arrangement, and are all high spatial resolution videos. We made the motion
ground truth ourselves by investigating the videos frame by frame.
Figure 3.5. An example frame of Pedestrians (a) and ground truth image (b)
3.2.2. Evaluation methods
The efficiency of the method is evaluated by the precision, the recall, and the
F1 score. The precision is calculated as

$$Precision = \frac{TruePositive}{TruePositive + FalsePositive},$$

the recall as

$$Recall = \frac{TruePositive}{TruePositive + FalseNegative},$$

and the F1 score as

$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall},$$
where:
• TruePositive: the total number of macroblocks correctly detected as a moving
object;
• FalsePositive: the total number of macroblocks that are background but are
detected as a moving object;
• FalseNegative: the total number of macroblocks that are a moving object but
are not detected.
High precision means that the accuracy of the method is good; high recall means
that the percentage of missed moving objects is low. A perfect system would
have both precision and recall at 100%, which is impossible in practice:
normally, tuning the system to favor precision reduces recall, and vice versa.
In that case, we can use the F1 score, which balances precision and recall.
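For completeness, the three metrics in code, checked against the averages of
Table 3.3:

```cpp
#include <cstdio>

// The three metrics from macroblock-level counts (or percentages).
double precision(double tp, double fp) { return tp / (tp + fp); }
double recall(double tp, double fn)    { return tp / (tp + fn); }
double f1(double p, double r)          { return 2.0 * p * r / (p + r); }

int main() {
    // Checking the averages in Table 3.3 (our approach): precision 80% and
    // recall 84% give F1 = 2*80*84/(80+84) = 81.95122.
    std::printf("F1 = %.5f\n", f1(80.0, 84.0));
    return 0;
}
```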
3.2.3. Implementations
The method proposed in this thesis is implemented in C++. Our experiments were
done on a Windows PC with an Intel Core i5-3337U at 1.8 GHz and 8 GB RAM. Based
on observation, Ts should be chosen empirically for each test video. The other
parameters are set to TC = 90%, TA = 10°, TL = 20, and TDensity = 80%.
3.2.4. Experimental results
For the videos in the 1st group, the experiment was performed many times and
the best result selected. Table 3.3 compares the experimental results of the 2
approaches on these videos. Using the proposed method, the average precision
over the four videos is 80%, the average recall is 84%, and the F1 score is
81.95122. Using Poppe's method, the average precision is 81%, the average
recall is 83%, and the F1 score is 81.9878. We can see that the performance of
our method is equivalent to that of Poppe's method when applied to
low-resolution video.
Table 3.3. The performance of the two approaches on Pedestrians, PETS2006,
Highway, and Office

            |         Our approach            |       Poppe's approach
Video       | Precision(%) Recall(%) F1       | Precision(%) Recall(%) F1
pedestrians | 84           95        89.16201 | 80           90        84.70588
PETS2006    | 87           80        83.35329 | 88           78        82.6988
Highway     | 77           81        78.94937 | 78           80        78.98734
Office      | 72           82        76.67532 | 75           83        78.79747
Average     | 80           84        81.95122 | 81           83        81.9878
For the 2nd group, the high-resolution videos, the proposed method was run many
times with different Ts values, and the 4 best results were selected. Table 3.4
shows the experimental results of Poppe's approach on these videos, and Table
3.5 those of the proposed method. The results show that the recall values of
Poppe's approach are usually smaller than those of the proposed method, meaning
that Poppe's approach misses more moving objects. This happens because there
are many "skip_mode" MBs in a frame of a high-resolution video.
Table 3.4. The experimental results of Poppe's approach on the 2nd group
Video Precision Recall F1
HMI_WetRoad 0.4954 0.8943 0.6376
HMI_OutDoor 0.5145 0.7711 0.6172
GVO2_0308 0.6821 0.6016 0.6393
NEM1_131 0.6055 0.7602 0.6741
DNG8_1708 0.8777 0.7489 0.8082
CuaHang_01 0.7468 0.8339 0.788
TrongNha_02 0.8341 0.7247 0.7756
In addition, the experimental results in Table 3.5 show that the videos with
good results are those with less noise and a clear distinction between the
background and the moving objects; the results do not depend on whether the
video comes from an outdoor or indoor camera. As the results table shows, the
best result is on the TrongNha_02 video (Fig. 3.3a), with an F1 score of
0.8771. This video was recorded in a working room (namely a police station)
under good environmental conditions with low noise. The moving object is a
person who is clearly distinguishable from the floor; the person's shirt has
only one color but is not uniform due to many wrinkles.
The worst result is on NEM1_131 (Fig. 3.3c), with an F1 score of 0.6235.
Although this video is recorded indoors, it has an outward-facing view, and the
entrance of the room is made of glass, which easily reflects moving objects.
The video was recorded in the evening, so the light from outside the room
easily creates noise.
Table 3.5. The experimental results of the proposed method on the 2nd group
Video Ts Precision Recall F1
HMI_WetRoad
90 0.7409 0.8644 0.7979
100 0.734 0.8935 0.8059
110 0.736 0.8197 0.7756
120 0.7461 0.9453 0.834
HMI_OutDoor
70 0.6916 0.8681 0.7699
80 0.641 0.8656 0.7366
90 0.7055 0.8962 0.7895
100 0.7195 0.9151 0.8056
GVO2_0308
70 0.5926 0.8018 0.6815
80 0.577 0.8653 0.6923
90 0.5376 0.836 0.6543
100 0.5821 0.916 0.7118
NEM1_131
90 0.4762 0.8183 0.602
100 0.4655 0.9333 0.6211
110 0.4847 0.8737 0.6235
120 0.4855 0.8702 0.6233
DNG8_1708
60 0.7612 0.8164 0.7878
65 0.7889 0.9217 0.8501
70 0.7843 0.9157 0.8449
75 0.777 0.8789 0.8248
CuaHang_01
75 0.7498 0.8796 0.8095
80 0.7676 0.9302 0.8411
85 0.7372 0.8598 0.7938
90 0.6828 0.9339 0.7889
TrongNha_02
50 0.8283 0.9319 0.8771
55 0.8139 0.9095 0.859
60 0.8248 0.9261 0.8725
65 0.8254 0.9247 0.8722
The experimental results also show that choosing the threshold Ts is quite
difficult; this is a limitation of the proposed method. Normally, the less
noise a video has, the smaller its threshold Ts compared with that of a noisier
video.
Under the system conditions described above, the processing speed is between 17
and 23 fps. If the program is installed on a Raspberry Pi 2 device, the
processing speed is between 22 and 27 fps, depending on the amount of motion in
each frame of the video. This speed fully meets the real-time requirements of
the problem.
Chapter Summarization
This chapter presents the experimental results of the thesis. The experimental
datasets are taken from the Change Detection Workshop 2014 database and from
more than 100 actual surveillance cameras installed in Hanoi and Da Nang City,
provided by VP9, covering indoor and outdoor scenes. These videos were captured
without scripting or prior arrangement. The results show that the proposed
method can accurately detect moving objects in the benchmark videos of the
Change Detection Workshop 2014. In addition, on high-resolution videos, the
proposed method performs better in real time than the related works; this may
be due to the many "skip_mode" MBs in the frames of high-resolution video. The
proposed method has also been used to build a moving object detection
application for industrial use.
CONCLUSIONS
This thesis proposes a new moving object detection approach in the H264/AVC
compressed domain for high-resolution video surveillance that exploits not only
the sizes of MBs but also the characteristics of the MV fields of moving
objects to identify the moving objects of interest. The method can quickly
detect most regions that contain moving objects, even objects of uniform color.
The thesis is the result of a real company project, so its practical
applicability is high. The application built on the proposed method helps
people search for and detect the moments when movement happens more
effectively, saving a lot of time and effort.
However, the proposed method still needs empirical thresholds in order to
accurately detect the moving objects of interest. In some scenes, noise motion
such as swaying tree branches cannot be removed because the motion value of the
tree branches is high. For future work, we will focus on making the system
self-tune the thresholds using machine learning to get the best results.
List of the author's publications related to the thesis
1. Minh Hoa Nguyen, Tung Long Vuong, Dinh Nam Nguyen, Do Van
Nguyen, Thanh Ha Le and Thi Thuy Nguyen, “Moving Object Detection
in Compressed Domain for High Resolution Videos,” SoICT ’17, pp. 364-
369, 2017.
2. Nguyễn Đình Nam, Nguyễn Thị Thủy, Nguyễn Đỗ Văn, Nguyễn Minh Hòa, Vương Tùng
Long, Lê Thanh Hà, "Phương pháp phân tích và lưu trữ thông tin mô tả chuyển
động trong nội dung viđeo và phương tiện lưu trữ dữ liệu tổng hợp mô tả chuyển
động trong nội dung viđeo" (Method for analyzing and storing motion description
information in video content, and medium for storing aggregated motion
description data in video content). Patent pending, filed 03/05/2017.
REFERENCES
[1] S. Aslam, "Omnicore," Omnicore Group, 18 Sep. 2018. [Online]. Available:
https://www.omnicoreagency.com/youtube-statistics/.
[2] M. Piccardi, "Background subtraction techniques: a review," IEEE
International Conference on Systems, Man and Cybernetics, pp. 3099-3104, 2004.
[3] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder:
real-time tracking of the human body," IEEE Trans. on Pattern Anal. and Machine
Intell., vol. 19, pp. 780-785, 1997.
[4] D. Koller et al., "Towards Robust Automatic Traffic Scene Analysis in
Real-time," Proc. ICPR'94, pp. 126-131, 1994.
[5] B. P. L. Lo and S. A. Velastin, "Automatic congestion detection system for
underground platforms," Proc. ISIMP 2001, pp. 158-161, 2001.
[6] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting moving
objects, ghosts, and shadows in video streams," IEEE Trans. on Pattern Anal.
and Machine Intell., vol. 25, pp. 1337-1342, 2003.
[7] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for
real-time tracking," Proc. IEEE CVPR 1999, pp. 246-252, 1999.
[8] P. Power and J. A. Schoonees, "Understanding background mixture models for
foreground segmentation," Proc. of IVCNZ 2002, pp. 267-271, 2002.
[9] R. Venkatesh Babu, M. Tom, and P. Wadekar, "A survey on compressed domain
video analysis techniques," Multimedia Tools and Applications, vol. 75, pp.
1043-1078, 2016.
[10] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of
the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems
for Video Technology, vol. 13, pp. 560-576, 2003.
[11] W. Zeng, J. Du, W. Gao, and Q. Huang, "Robust moving object segmentation
on H.264/AVC compressed video using the block-based MRF model," Real-Time
Imaging, vol. 11, pp. 290-299, 2005.
[12] Z. Liu et al., "Real-time spatiotemporal segmentation of video objects in
the H.264 compressed domain," Journal of Visual Communication and Image
Representation, vol. 18, pp. 275-290, 2007.
[13] C. Solana-Cipres et al., "Real-time moving object segmentation in H.264
compressed domain based on approximate reasoning," International Journal of
Approximate Reasoning, vol. 51, pp. 99-114, 2009.
[14] C.-M. Mak and W.-K. Cham, "Real-time video object segmentation in H.264
compressed domain," IET Image Processing, vol. 3, pp. 272-285, 2009.
[15] S. De Bruyne et al., "Estimating motion reliability to improve moving
object detection in the H.264/AVC domain," IEEE International Conference on
Multimedia and Expo, pp. 290-299, 2009.
[16] S. Wang et al., "Surveillance video synopsis in the compressed domain for
fast video browsing," Journal of Visual Communication and Image Representation,
vol. 24, pp. 1431-1442, 2013.
[17] M. Laumer, P. Amon, A. Hutter, and A. Kaup, "Compressed Domain Moving
Object Detection by Spatio-Temporal Analysis of H.264/AVC Syntax Elements,"
Picture Coding Symposium (PCS), pp. 282-286, 2015.
[18] M. Tom, R. Venkatesh Babu, and R. G. Praveen, "Compressed domain human
action recognition in H.264/AVC video streams," Multimedia Tools and
Applications, vol. 74, no. 21, pp. 9323-9338, 2015.
[19] S. Biswas and R. V. Babu, "Real-time anomaly detection in H.264 compressed
videos," National Conference on Computer Vision, Pattern Recognition, Image
Processing and Graphics, pp. 1-4, 2013.
[20] S. Biswas and R. V. Babu, "Anomaly detection in compressed H.264/AVC
video," Multimedia Tools and Applications, pp. 1-17, 2014.
[21] V. Thilak and C. D. Creusere, "Tracking of extended size targets in H.264
compressed video using the probabilistic data association filter," 12th
European Signal Processing Conference, pp. 281-284, 2004.
[22] W. You, M. S. H. Sabirin, and M. Kim, "Moving object tracking in H.264/AVC
bitstream," Multimedia Content Analysis and Mining, pp. 483-492, 2007.
[23] C. Käs and H. Nicolas, "An Approach to Trajectory Estimation of Moving
Objects in the H.264 Compressed Domain," Advances in Image and Video
Technology, pp. 318-329, 2009.
[24] C. Poppe, S. De Bruyne, T. Paridaens, P. Lambert, and R. Van de Walle,
"Moving object detection in the H.264/AVC compressed domain for video
surveillance applications," Journal of Visual Communication and Image
Representation, vol. 20, pp. 428-437, 2009.
[25] A. Vacavant, L. Robinault, S. Miguet, C. Poppe, and R. Van de Walle,
"Adaptive background subtraction in H.264/AVC bitstreams based on macroblock
sizes," Computer Vision Theory and Applications (VISAPP), pp. 51-58, 2011.
[26] A. Divakaran, K. A. Peker, and H. Sun, "Method for summarizing a video
using motion and color descriptors," US Patent US6697523 (application
US09634364), filed 09 Aug. 2000.
[27] K. Ratakonda, "Method for hierarchical summarization and browsing of
digital video," US Patent US5956026A, filed 19 Dec. 1997.
[28] L. Liu et al., "Intelligent, dynamic, long-term digital surveillance media
storage system," US Patent US7751632B2, filed 15 Feb. 2005.
[29] ISO/IEC JTC 1, "ISO/IEC 14496-10," ISO and IEC, 2014. [Online]. Available:
https://www.iso.org/obp/ui/#iso:std:iso-iec:14496:-10:ed-8:v1:en.
[30] D. M., "Gentle Logic," 16 Nov. 2011. [Online]. Available:
[31] R. Finlayson, "LIVE555.COM," Live Networks, Inc. [Online]. Available:
[32] K. Suehring, "Fraunhofer," Fraunhofer Heinrich Hertz Institute. [Online].
Available:
[33] V. L. and K. Wong, "Design & Reuse," Ocean Logic Pty Ltd. [Online].
Available: https://www.design-reuse.com/articles/12849/designing-a-real-time-hdtv-1080p-baseline-h-264-avc-encoder-core.html.