Thesis: Motion analysis from encoded video bitstream

The thesis proposes a new moving object detection approach in the H264/AVC compressed domain for high-resolution video surveillance. The approach exploits not only the size of MBs but also the characteristics of the MV fields of moving objects to identify the interested moving objects. The method can quickly detect most regions that contain moving objects, even objects with uniform color. The thesis is the result of a real project with a company, so its practical applicability is high. The application built on the proposed method helps users search for and detect the moments when movement happens more effectively, saving considerable time and effort. However, the proposed method still needs empirical thresholds in order to accurately detect the interested moving objects. In some scenes, noise motion such as swaying tree branches cannot be removed because its motion value is high. For future work, we will focus on making the system self-tune the thresholds by using machine learning to get the best results.

...detect moving objects, especially in high spatial resolution video streams. The method uses data taken from the compressed video domain, including the size of the macroblocks to detect the skeleton of the moving object, and the motion vectors to detect the details of the moving object.

CHAPTER 2. METHODOLOGY

2.1. Video compression standard H264

Before proposing the moving object detection method, this chapter presents some information about H264, a popular video compression standard, which is used to encode and decode the surveillance video in the thesis. These days, the installation of surveillance cameras in houses has become quite common. Video data recorded by a surveillance camera over a long period of time usually has a very large size. Consequently, videos need to be preprocessed and encoded before being used and transmitted over the network. There are many widely used, officially recognized compression standards. One of these is H264, or MPEG-4 Part 10 [26], a compression standard recognized by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group.

2.1.1. H264 file structure

Normally, the video captured from the camera is compressed using a common video compression standard such as H261, H263, MP4, H264/AVC, H265/HEVC, etc. In the thesis, I encode and decode the video using H264/AVC. The H264 video codec, or MPEG-4 Part 10, is recognized by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. Typically, an H264 file is split into packets called Network Abstraction Layer Units (NALU) [27], as shown in Fig. 2.1.

Figure 2.1. The structure of an H264 file

The first byte of a NALU indicates its type, which shows what the NALU's structure is: it can carry a slice or parameter sets for decompression. The meanings of the NALU types are given in Table 2.1.

Table 2.1.
NALU types

  Type    Definition
  0       Undefined
  1       Slice layer without partitioning, non-IDR
  2       Slice data partition A layer
  3       Slice data partition B layer
  4       Slice data partition C layer
  5       Slice layer without partitioning, IDR
  6       Additional information (SEI)
  7       Sequence parameter set
  8       Picture parameter set
  9       Access unit delimiter
  10      End of sequence
  11      End of stream
  12      Filler data
  13..23  Reserved
  24..31  Undefined

Apart from the first (header) byte, the rest of the NALU is called the RBSP (Raw Byte Sequence Payload). The RBSP contains the data of the SODB (String Of Data Bits). According to the H264 specification document (ISO/IEC 14496-10), if the SODB is empty (no bits are present), the RBSP is also empty. The first byte of the RBSP (left side) contains 8 bits of the SODB; each following byte of the RBSP contains up to 8 bits of the SODB, and so on until fewer than 8 bits of the SODB remain.

Figure 2.2. RBSP structure

A video is normally divided into frames and the encoder encodes them one by one. Each frame is encoded into slices, and each slice is divided into macroblocks (MB). Typically, each frame corresponds to one slice, but sometimes a frame can be split into multiple slices. The slice types are listed in Table 2.2. A slice consists of a header and a data section (Fig. 2.3). The header of the slice contains information about the type of slice, the type of MBs in the slice, and the frame number of the slice. The header also contains information about the reference frame and quantization parameters. The data portion of the slice is the information about the macroblocks.

Table 2.2. Slice types

  Type  Description
  0     P-slice. Consists of P-macroblocks (each macroblock is predicted using one reference frame) and/or I-macroblocks.
  1     B-slice. Consists of B-macroblocks (each macroblock is predicted using one or two reference frames) and/or I-macroblocks.
  2     I-slice. Contains only I-macroblocks. Each macroblock is predicted from previously coded blocks of the same slice.
  3     SP-slice.
        Consists of P- and/or I-macroblocks and allows switching between encoded streams.
  4     SI-slice. Consists of a special type of SI-macroblocks and allows switching between encoded streams.
  5     P-slice.
  6     B-slice.
  7     I-slice.
  8     SP-slice.
  9     SI-slice.

Figure 2.3. Slice structure

2.1.2. Macroblock

The basic principle of a compression standard is to split the video into groups of frames. Each frame is divided into basic processing units; in the H264/AVC standard, this unit is the macroblock (MB), a region of 16x16 pixels. In data regions carrying more detail, the MBs are subdivided into smaller sub-macroblocks (4x4 or 8x8 pixels). After compression, each MB contains the information used later to recover the video, including the motion vector, residual values, quantization parameter, etc., as in Fig. 2.4, where:
• ADDR is the position of the macroblock in a frame;
• TYPE is the macroblock type;
• QUANT is the quantization parameter;
• VECTOR is the motion vector;
• CBP (Coded Block Pattern) shows how the MB is split into smaller blocks;
• bN is the encoded residual data of the color channels (4 Y, 1 Cr, 1 Cb).

Figure 2.4. Macroblock structure

During decompression, the video decoder receives the compressed video data as a stream of binary data, decodes the syntax elements and extracts the encoded information, including transform coefficients, the size of each MB (in bits), motion prediction information, and so on, and performs the reverse transformation to restore the original image data.

2.1.3. Motion vector

In H264 compression, macroblocks are predicted based on information that has already been transferred from the encoder to the decoder. There are two prediction modes: intra-frame prediction and inter-frame prediction. Intra-frame prediction uses compressed image data in the same frame as the macroblock being compressed, while inter-frame prediction uses previously compressed frames.
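The macroblock fields listed for Fig. 2.4 can be modeled as a small record type. The following is a minimal C++ sketch for illustration only; the names are hypothetical and do not correspond to the actual JM 19.0 data structures:

```cpp
#include <cstdint>
#include <utility>

// Hypothetical record for one parsed macroblock. Field names mirror
// Fig. 2.4 (ADDR, TYPE, QUANT, VECTOR, CBP), not the real JM structs.
struct MacroblockInfo {
    int addr;               // ADDR: position of the MB in the frame
    int type;               // TYPE: macroblock type (I/P/B)
    int qp;                 // QUANT: quantization parameter
    std::pair<int, int> mv; // VECTOR: motion vector (x, y)
    uint32_t cbp;           // CBP: coded block pattern
    int sizeBits;           // storage size of the MB after encoding, in bits
};

// 16x16-pixel MBs: map a linear MB address to grid coordinates.
inline int mbCols(int frameWidthPx)          { return frameWidthPx / 16; }
inline int mbRow(int addr, int frameWidthPx) { return addr / mbCols(frameWidthPx); }
inline int mbCol(int addr, int frameWidthPx) { return addr % mbCols(frameWidthPx); }
```

For a 1280-pixel-wide frame there are 80 MBs per row, so MB address 81 sits at row 1, column 1 of the MB grid.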
Inter-frame prediction is accomplished through a motion prediction and compensation process, in which the motion predictor retrieves the macroblock in the reference frame closest to the new macroblock and calculates the motion vector; this vector characterizes the shift of the new macroblock being encoded compared to the reference frame. The referenced macroblock is sent to the subtractor together with the new macroblock that needs coding to find the prediction error, or residual signal, which characterizes the difference between the predicted macroblock and the actual macroblock. The residual signal is transformed with the Discrete Cosine Transform and quantized to reduce the number of bits to be stored or transmitted. These coefficients, together with the motion vectors, are passed to the entropy coder to form the bit stream. The resulting binary stream includes transform coefficients, motion prediction information, compressed data structure information, and more.

To perform video compression, one compares the values of two frames, one of which is used as a reference. When we want to compress a MB at position i of a frame, the video compression algorithm tries to find, in the reference frame, the MB with the smallest difference compared to the MB at position i. If such a MB is found in the reference frame at position j, the displacement between i and j is called the motion vector (MV) of the MB at position i (Fig. 2.5). Normally an MV consists of two values: x (the column displacement of the MB) and y (the row displacement of the MB).

Figure 2.5. The motion vector of a macroblock

Note that the MV of a MB does not really describe the motion of the objects in that MB, but merely represents the movement of the pixels closest to the pixels in the MB.

2.2. Proposed method

This section describes the processing of the proposed moving object detection method. The processing includes three phases: Macroblock-based segmentation, Object-based segmentation, and Object refinement.

2.2.1.
Process video bitstream

The video data is taken directly from the surveillance camera in the form of an H264 bitstream and transmitted to the processing device. To get the MV and MB information, I use the libraries LIVE555 [28] and JM 19.0 [29]. LIVE555 is a free, open-source C++ library that allows sending and receiving media streams through the RTP/RTCP, RTSP, and SIP protocols. The LIVE555 Streaming Media module is responsible for connecting, authenticating and receiving data from the RTSP stream taken directly from the surveillance camera. In addition to receiving packets, LIVE555 Streaming Media also strips the headers of the packets. The results from this module are therefore NALUs (refer to ISO/IEC 14496-10 [26]). The NALUs are then transferred to JM 19.0, a free H264 decoder commonly used in study and research, for processing. The input of the original JM 19.0 decoder module is a compressed video file in the H264 format (with the format described in Annex B of ISO/IEC 14496-10), and the original output is the decompressed video file in YUV format. However, in order to reduce the time and volume of computation, I modified this library so that it stops after extracting the required information, without fully decoding the video.

The MVs and MBs are then used to detect the moving object. I propose a method that uses a combination of both MVs and MBs to determine the motion in the video. This method can be applied to both indoor and outdoor environments. Because it uses data from the compressed domain, the processing time of the method is easily reduced compared with methods that use data in the pixel domain. The moving object detection method consists of three phases: Macroblock-based segmentation, Object-based segmentation, and Object refinement, as shown in Fig. 2.6.

Figure 2.6. The process of the moving object detection method

2.2.2.
Macroblock-based Segmentation

This phase is based on Poppe's approach [24]. I use the storage size of each MB after encoding (in bits) to determine which MBs contain movement. This is possible because MBs containing moving objects are often more detailed than others; the compression ratio of these MBs is usually lower, making their size much larger than that of the MBs in the background. Fig. 2.8 gives an example of an outdoor frame and an indoor frame and shows the correlation between the motion information and the size of the MBs. Fig. 2.8 (a) shows the original frames (the first is outdoor, the second indoor), and Fig. 2.8 (b) shows the map of the sizes of the MBs in those frames. Each square in Fig. 2.8 (b) represents the size of one MB: the larger the size, the whiter the square. As we can see, the size of the MBs is larger in the moving regions (e.g. the vehicles, the shaking leaves).

I use the size of each MB to classify the MBs into two types: MBs that can belong to a moving object and MBs that can belong to the background. To do this, I compare the size of the MB with a threshold Ts. If the size of the MB is greater than Ts, I mark the MB as "can be the moving object"; otherwise, I mark it as "can be the background".

Figure 2.7. Skipped Macroblock

However, Poppe's approach [24] still has an important constraint: "A general conclusion is that MBs corresponding to (the edges of) moving objects will typically contain more bits in the bitstream than those representing BG". This means the algorithm works well only on the MBs that contain the edges of moving objects. The reason is that H264 provides a "skip mode" for some special MBs: if a region has uniform color, such as a shirt, a wall, or a car door, the encoder does not need to send the information of the MBs in that region to the decoder.
The decoder will estimate the value for the skipped MBs from neighboring coded MBs and use it to calculate a motion-compensated prediction for the skipped MBs. Since there is no residual information, the motion-compensated prediction is directly inserted into the decoded frame or field. As a result, some MBs that should be considered moving objects have a size equal to zero. To solve this problem, we apply a preprocessing step that recalculates the size of every skipped MB as the average of the sizes of the MBs on its left, above, and above right (Fig. 2.7). All MBs considered to be moving objects are then merged using the 8-neighbor algorithm to yield segments before the next phase is applied.

Figure 2.8. (a) An outdoor and an indoor frame, (b) the "size-map" of the frames, (c) the "motion-map" of the frames

2.2.3. Object-based Segmentation

It is desirable that the background model can adapt to gradual changes in the appearance of the scene. For example, in an outdoor environment, or looking through the window of a house, the light intensity typically varies during the day; dynamic background such as rain, movements of clouds, swaying tree branches, etc. can be seen anywhere. Observations of the motion vector field have shown that the motion vectors of rigid moving objects usually have similar direction and length, while the motion vectors of uninterested moving objects (in the following sections, we consider them noise motion), such as swaying tree branches, usually have various directions and lengths. Noise motion like leaves or tree branches still has large-size MBs, but its segments usually contain holes, while the movement of a human sometimes has various directions and lengths but produces segments without holes. As shown in Fig.
2.9, the car, motorbike and human (in the rectangles) are interested moving objects, while the waves of water and the lights (in the circles) are uninterested moving objects, or noise. The level of consistency of the MV field and the density are exploited to identify the interested motions, such as the movement of humans and vehicles, and to remove noise motion, especially swaying branches. I define a segment as having a "consistent" MV field if its MV directions and MV lengths are "consistent". The motion vector directions are "consistent" if there exist at least TC (90%) motion vectors such that the angle between any two of them is smaller than TA (10°). The motion vector lengths are "consistent" if there exist at least TC (90%) motion vectors such that the length difference between any two of them is smaller than TL (20). The density of a segment is measured by the ratio between the number of MBs lying on the margin of the segment and the total number of MBs in the segment. The process of object-based segmentation consists of two steps: checking the level of consistency of the motion vector field and checking the level of the segment's density.

Figure 2.9. Example of the "consistency" of motion vectors

For the level of consistency of the motion vectors, we first normalize the MV directions to angles (in degrees) between the MVs and the positive X-axis (in Cartesian coordinates), and the motion lengths to integer values. Specifically, a MV (x, y) with direction Md and length Ml is normalized as follows:

Md = round(arctan(y/x) / π × 180) if x, y ≠ 0;  Md = 90 if x = 0, y ≠ 0;  Md = 0 if x ≠ 0, y = 0.   (1)

Ml = round(√(x² + y²)).
(2)

After that, Chebyshev's inequality is applied to assess the consistency of the MV field:

p(|X − μA| ≥ kσA) ≤ 1/k²   (3)

where X is a random variable representing the direction of a motion vector, and μA and σA are the mean and standard deviation of the distribution of motion vector directions. From equation (3), in order to confirm that the MV directions are consistent, we set k = TA / (2σA) and require 1/k² ≤ 1 − TC; therefore, σA ≤ (TA/2)·√(1 − TC). So, if σA ≤ (TA/2)·√(1 − TC), the segment is considered to have consistent MV directions. The same condition is applied to the MV lengths: if σL ≤ (TL/2)·√(1 − TC), where μL and σL are the mean and standard deviation of the distribution of motion vector lengths, the MV lengths are consistent.

However, since we normalize MV directions to angles with the positive X-axis, 0° and 359° are next to each other. Thus, we need to recompute σA with each angle from 0° to 359° used as the middle of the axis; if there exists any recentering i for which σAi ≤ (TA/2)·√(1 − TC), the MV directions of the segment are considered to be consistent.

For the level of density of a segment, we calculate the ratio between the number of MBs lying on the margin of the segment, MSi, and the total number of MBs in the segment, NSi. As discussed above, if a segment is noise (not containing true motion), it usually contains holes (no information), so its margin-to-area ratio is much larger than that of a true motion segment. Therefore, we can classify noise and true motion segments based on this density ratio:

Density = MSi / NSi ≤ TDensity.

Finally, a segment is considered an interested moving object when it satisfies:

(σL ≤ (TL/2)·√(1 − TC)) ∧ (σA ≤ (TA/2)·√(1 − TC)) ∧ (MSi / NSi ≤ TDensity).   (4)

2.2.4. Object Refinement

As discussed above, MBs containing moving objects usually carry more detail than others.
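Before continuing, the object-based segmentation test of the previous section can be sketched in code. This is an illustrative reading under stated assumptions, not the thesis implementation: atan2 is used to obtain the full 0°–359° direction range the text implies (the formula itself writes arctan(y/x)), and the 0°/359° wrap-around recentering check is omitted for brevity:

```cpp
#include <cmath>
#include <vector>

static const double PI = std::acos(-1.0);

// Eqs. (1)-(2): normalize an MV (x, y) to an integer direction in
// degrees and an integer length.
int mvDirection(int x, int y) {
    if (x == 0 && y == 0) return 0;
    double deg = std::atan2(static_cast<double>(y), static_cast<double>(x)) * 180.0 / PI;
    if (deg < 0.0) deg += 360.0;                 // map (-180, 180] to [0, 360)
    return static_cast<int>(std::lround(deg)) % 360;
}

int mvLength(int x, int y) {
    return static_cast<int>(std::lround(std::sqrt(static_cast<double>(x * x + y * y))));
}

// Population standard deviation; assumes a non-empty sample.
double stddev(const std::vector<int>& v) {
    double mean = 0.0, var = 0.0;
    for (int s : v) mean += s;
    mean /= v.size();
    for (int s : v) var += (s - mean) * (s - mean);
    return std::sqrt(var / v.size());
}

// Condition (4) with the thesis defaults TC = 90%, TA = 10 deg,
// TL = 20, TDensity = 80%: a segment is an interested moving object if
// its MV directions and lengths satisfy the Chebyshev-derived bounds
// and it is dense enough (few margin MBs relative to its total count).
bool isInterestedSegment(const std::vector<int>& dirs,
                         const std::vector<int>& lens,
                         int marginMBs, int totalMBs,
                         double TC = 0.9, double TA = 10.0,
                         double TL = 20.0, double TDensity = 0.8) {
    const double boundA = (TA / 2.0) * std::sqrt(1.0 - TC);
    const double boundL = (TL / 2.0) * std::sqrt(1.0 - TC);
    const bool consistent = stddev(dirs) <= boundA && stddev(lens) <= boundL;
    const bool dense = static_cast<double>(marginMBs) / totalMBs <= TDensity;
    return consistent && dense;
}
```

A compact segment whose MVs all point the same way passes the test, while a segment whose MVs scatter in all directions (the swaying-branch case) fails on the consistency bound.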
That means the block size of moving objects or of motion noise (except in "skip mode") is usually larger than that of the background. However, when a moving object contains flat regions, those regions can be predicted accurately, so their block sizes become small and the macroblock-based segmentation phase removes some parts of the object. In this step, we try to recover these parts. Observation has shown that this case only occurs in objects that have consistent motion. Based on an analysis of the motion directions and motion lengths, we can check the MBs around the object to see whether each is a part of the object or not. Starting from the MBs that are marked as a moving object, we use the breadth-first search algorithm to visit each layer around the segments from near to far and check each MB using hypothesis testing: a MB with motion direction A and motion length L is considered to belong to the moving object if

(μA − σA ≤ A ≤ μA + σA) ∧ (μL − σL ≤ L ≤ μL + σL).   (5)

2.3. Chapter Summarization

This chapter describes some basic information about the video compression standard H264. The details of the standard can be found in the document of the ISO/IEC Moving Picture Experts Group [26]. In the thesis, to receive the video bit stream from the camera and parse it into NALUs, I use LIVE555, an open-source, free library for processing the H264 bit stream. After that, JM 19.0 is used to handle the received video and extract the MVs and the sizes of the MBs of each frame. The MVs and the MB sizes are the inputs of the object detection method described above.

This chapter also proposes a new moving object detection method using the size of the MBs and the MVs. The method includes three phases. The first phase, macroblock-based segmentation, detects the "skeleton" of the movement region by comparing the size of the MBs with a threshold Ts.
After that, in the object-based segmentation phase, I try to determine which of the moving regions belong to interested moving objects and which belong to noise. Finally, in the object refinement phase, some missing movement MBs of flat regions are recovered. In the next chapter, I present the experimental results and an application built using the method.

CHAPTER 3. RESULTS

The thesis was done within the framework of the research project "Nghiên Cứu Công Nghệ Tóm Tắt Video" (Research on Video Summarization Technology), a cooperation between the University of Engineering and Technology (UET) and VP9 Vietnam. Therefore, apart from the experimental results, my team and I have built an application using the proposed method. This application was handed over to and approved by VP9 Vietnam. In the application, in order to aid in quickly searching for the moments that contain movement in a video, we provide a suitable data structure to store the motion information. With this data structure, instead of having to search for motion over the whole frame, users can search for motion in a region of interest to get better results.

3.1. The moving object detection application

When using surveillance cameras, the need to store and search for the moments in which movement happens is very important. When there is movement, the moving image area is the area of interest; the other, static regions are called the background. When the background is static (almost no change in the pixel values), motion detection can be performed simply by subtracting the current frame from a reference frame (as in the pixel-domain methods mentioned above). However, in reality, the background often changes due to noise or unwanted movements (such as camera noise, shaking leaves or stray light). Thus, real-time motion detection in a video frame, and from there detecting and locating events in a specific segment of a long video file, is a challenge.
The problem of searching for events in large volumes of video, especially long-duration video surveillance, is a time-consuming and laborious task for users and processors. In published related studies, there are several solutions for automatically searching for, detecting and locating the time in the video where an event occurred. However, fast and efficient search of the video segment containing the event has not been satisfactorily resolved, and the processing of video data to find where an event occurred is still limited. Therefore, the problem of analyzing and summarizing video data so that search is convenient and effective still requires better solutions.

There are many related results applied in industry for video storage. The invention US6697523 [24], named "Method for summarizing a video using motion and color descriptors", relates to a method of extracting the motion information of a video for the purpose of automatic summarization. The method of this invention uses a partially compressed video data stream and also image information (full decompression), which consumes the computing resources of the device. A video summary can be made simply by retrieving a frame that represents a video clip, or by color analysis. This either loses information and decreases the accuracy of the search results, or requires complex computation in the image domain. Furthermore, the invention does not propose an effective storage solution for the synthesized information.

The invention US5956026A [25], named "Method for hierarchical summarization and browsing of digital video", relates to summarization and browsing by creating a simplified hierarchical representation of the video using representative frames. Each image represents a video shot, and the system must determine the scene and frame number. Browsing is done through the representative frames. The invention also uses extracted audio data to compute the video summary.
The invention does not offer a simplified method for storing the information of a video.

The invention US7751632B2 [26], named "Intelligent, dynamic, long-term digital surveillance media storage system", provides a method of analyzing multimedia data streams for encoding and indexing the data stored according to the real requirements of the monitoring system. In particular, video content analysis is done based on the classification of the motion data of each frame. From there, the system chooses the optimal encoding technique for each frame and creates its own descriptors for efficient storage. After the analysis, the video segments chosen for optimal encoding are deleted from the original file and only the descriptors are saved. The invention does not propose an integrated analysis of frame-to-frame motion information, does not support frame-based motion search, and does not have a hierarchical storage system for motion video information.

In the following section, I describe some information related to the application built using the proposed method.

3.1.1. The process of the application

The process of the application is shown in Fig. 3.1. As mentioned above, the video data is first taken directly from the surveillance camera, in the form of an H264 bitstream; basically, this is a real-time H264 file. The libraries LIVE555 and JM 19.0 are used to implement step (1), Entropy decode. LIVE555 is a free, open-source C++ library that allows sending and receiving media streams through the RTP/RTCP, RTSP, and SIP protocols. The LIVE555 Streaming Media module is responsible for connecting, authenticating and receiving data from the RTSP stream taken directly from the surveillance camera. In addition to receiving packets, LIVE555 Streaming Media also strips the headers of the packets. The results from this module are therefore NALUs (refer to ISO/IEC 14496-10).
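The NALU type mentioned earlier lives in the low five bits of the first NALU byte; the remaining three high bits are the forbidden_zero_bit and nal_ref_idc fields of the H264 specification. A small sketch:

```cpp
#include <cstdint>

// Extract the nal_unit_type from the first byte of a NALU.
// Values follow Table 2.1: 1 = non-IDR slice, 5 = IDR slice,
// 7 = sequence parameter set, 8 = picture parameter set, ...
inline int naluType(uint8_t firstByte) { return firstByte & 0x1F; }

// Slice-carrying NALU types (1..5 in Table 2.1) are the ones that
// contain the macroblock data the method needs.
inline bool isSliceNalu(uint8_t firstByte) {
    int t = naluType(firstByte);
    return t >= 1 && t <= 5;
}
```

For example, an SPS NALU commonly begins with the byte 0x67 (type 7) and an IDR slice with 0x65 (type 5).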
The NALUs are then transferred to JM 19.0, a free H264 decoder commonly used in study and research, for processing. The input of the original JM 19.0 decoder module is a compressed video file in the H264 compression format (with the format described in Annex B of ISO/IEC 14496-10), and the output is the decoded video in YUV format. However, in order to reduce the time and volume of computation, I modified this library so that it stops after extracting the required information, without fully decoding the video. This information is then used to perform process (2), Moving object detection, implemented with the method proposed in Chapter 2. The result obtained after this process is a matrix, called the Movement map, that describes the position of motion in each frame: a position with motion has the value 1, otherwise the value is 0.

Figure 3.1. The implementation process of the approach

The information in the Movement map is used to perform (3), Synthesizing movement. This process evaluates and classifies motions into varying degrees depending on the frequency and appearance of the motion to obtain the motion information. The motion description information obtained from the above steps is then reshaped and stored in a data structure convenient for later retrieval and use, in step (4), Storing movement information. The details of steps (3) and (4) are described below.

3.1.2. The motion information

The motion information in the thesis is understood as a value representing the level of motion of the objects in the video. In order to obtain information describing motion, we first classify the motion in the video into real motion (caused by objects such as human beings, vehicles, etc.) and motion due to noise. The types of noise that can be observed are:

• Noise due to camera shake: the characteristic of this noise is large motion over the entire frame, with a cycle.
• Noise due to camera quality: this is caused by low light intensity; it is usually small and acyclic, but fairly evenly distributed.
• Noise due to light: blinking lights (cyclic noise), tube lights, etc. These types of noise are cyclic, large and hard to determine.
• Noise due to weather factors such as rain, clouds, etc.

Real motion can be divided into two types: normal movement and meaningful movement. The concepts of normal and meaningful here depend on the circumstances of the video. For example, in home video, shaking curtains cause visible movement, but the meaningful movement is human movement in the scene; for motion on the road, the types of motion are more difficult to define. The general types of motion can be divided as follows:

• Movement of cyclic motion equipment (such as rotor blades, rotating wheels).
• Motion caused by the wind (leaves, curtain fabric). These movements are usually large and can be cyclic.
• Movements caused by external light sources such as sunshine and lights (motorcycle lights, automobile lights from afar). These movements are often difficult to determine; however, they usually appear in night-time video.
• Lastly, real motions such as people and vehicles moving in the observation area.

3.1.3. Synthesizing movement information

The method of synthesizing and classifying motion begins with calculating a motion weight for each position in the frame (each position corresponds to one MB) over a time interval T. For a position, the motion weight at each moment (frame) during T is computed as follows:

• If the MB is moving at the moment under review, the weight of motion at that moment is equal to the count of the immediately preceding consecutive moments of motion.
• Otherwise, if the MB has no motion at the moment under review, the weight is zero.

Then, the motion weight of each position in the composite frame after the time T is the sum of the weights at all moments in the period T.
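Under one reading of the weighting rule above (the weight of a moving moment equals the number of immediately preceding consecutive moving moments), the per-position synthesis over a window T can be sketched as:

```cpp
#include <vector>

// Motion weight of ONE MB position over a window of T frames.
// moving[t] is 1 if the MB moved in frame t, 0 otherwise.
// A moving frame is weighted by the count of immediately preceding
// consecutive moving frames, so sustained motion accumulates a much
// larger total than isolated, flickering noise.
int motionWeight(const std::vector<int>& moving) {
    int total = 0, run = 0;
    for (int m : moving) {
        if (m) { total += run; ++run; }  // weight = preceding consecutive hits
        else   { run = 0; }              // no motion at this moment: weight 0
    }
    return total;
}
```

A run of four moving frames scores 0 + 1 + 2 + 3 = 6, while four alternating noisy frames score 0, which is exactly the separation the classification step needs.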
After calculating the motion weight, we evaluate the motion level to classify the motion of each position in the composite frame after the time T, based on the weight calculated in the previous step. The level of motion is divided into four levels denoted by two-bit symbols: no movement (00), little movement or noise (01), movement (10), and much movement (11). The movement level values are then saved into a two-dimensional array.

Figure 3.2. Data structure for storing motion information

3.1.4. Storing Movement Information

This step stores the movement information obtained after the synthesis step described above. The movement information is stored according to the spatial and temporal hierarchy of the video. The structure that stores the motion description information is depicted in Fig. 3.2, where:

• Level 1 is a folder that contains the aggregate data of each video, stored period by period.
• Level 2 is the folder that contains the files holding the information data along the horizontal dimension of the frame in a temporal dimension.
• Level 3 is the files that contain the movement information data of the blocks in the columns of the frame in a temporal dimension.
• Level 4 is the contents of the files in level 3. These files contain two-bit values from 0 to 3; each value is the level of motion of a block in a time T (which may be 1 second, 2 seconds, 3 seconds, 10 seconds, etc.). The user can modify T via a parameter.

The advantage of this data structure appears when you want to search for the moments in which movement happens: you can choose an area (corresponding to some MBs), and the search time is shorter because the application only searches the files corresponding to the MBs you choose. Moreover, predefining the search region (region of interest) makes the accuracy of the result higher than searching over the full frame.

3.2. Experiments

3.2.1.
Dataset

The proposed method is designed to operate with a fixed, downward-facing camera. The maximum resolution of the videos is 1920x1080 pixels. The program can be installed directly on a device attached to the camera, such as a Raspberry Pi running the Linux operating system, which guarantees real-time processing. The experimental data was provided by the VP9 Vietnam company and processed by the HMI laboratory, University of Engineering and Technology. The data set includes 43 videos with resolutions of 1280x720 and 1920x1080. In addition, the method uses live data from more than 100 cameras installed in the cities of Hanoi and Da Nang, provided by VP9, including indoor and outdoor data. The videos cover various lighting and environmental conditions, including outdoor light (sunlight, low sunshine), artificial light (tube, LED), wind, rain, etc. It can be said that the data set supplies diverse situations and environments for the moving object detection problem.

Figure 3.3. Example frames of the test videos

For gathering the statistics for the report, I made the ground truth for 7 videos with resolutions of 1280x720 and 1920x1080 and used these videos for the experimental results. Table 3.1 describes the information about the videos used for the experimental results. Fig. 3.3 shows some example frames of the test videos (Figure 3.3a is a frame of TrongNha_02, Figure 3.3b of DNG8_1708, Figure 3.3c of NEM1_131, Figure 3.3d of HMI_WetRoad, Figure 3.3e of CuaHang_01 and Figure 3.3f of HMI_OutDoor). These videos were captured in different environments and circumstances. Fig. 3.4 depicts some of their frames and the corresponding ground truth.

Table 3.1.
The information of test videos

Video         Resolution    Place
HMI_WetRoad   1920 × 1080   Outdoor
HMI_OutDoor   1280 × 720    Outdoor
GVO2_0308     1280 × 720    Outdoor
NEM1_131      1920 × 1080   Indoor
DNG8_1708     1920 × 1080   Outdoor
CuaHang_01    1280 × 720    Indoor
TrongNha_02   1280 × 720    Indoor

In addition, to compare with the approach of Poppe [24], on which our macroblock-based segmentation phase is based, we use a second dataset from the IEEE Change Detection Workshop 2014 [30]. The experiments are therefore carried out on 2 datasets, comprising 11 test sequences divided into 2 groups. The first group consists of 4 test sequences from the baseline profile of the IEEE Change Detection Workshop 2014: PETS2006, Pedestrians, Highway, and Office. Both the video frames and the motion ground truth can be downloaded from the Changedetection homepage. We use ffmpeg [31] to create compressed video from the given frames with all encoding parameters set to default. Fig. 3.5 shows an example frame of the Pedestrians test sequence (a) and its motion ground truth (b). Table 3.2 lists the information of the four videos: the 1st column is the name of the video, and the next three columns are its resolution, frame rate, and quantization parameter (qp) value, respectively. As we can see, the videos in the 1st group have different resolutions, but they are all low-resolution videos. Their frame rate is 25 fps, and the qp value depends on each video. These videos are quite similar to the videos in Poppe's experiment.

Figure 3.4. Example frames and their ground truth

Table 3.2. The information of test sequences in group 1

Video         Resolution   fps   qp
pedestrians   360 × 240    25    25
PETS2006      720 × 576    25    27
Highway       320 × 240    25    23
Office        360 × 240    25    23

The videos in the 2nd group are the 7 videos mentioned above. These videos come from actual indoor and outdoor surveillance cameras, without scripting or prior arrangement.
The motion ground truth was made by ourselves by inspecting the videos frame by frame. They are all high spatial resolution videos.

Figure 3.5. An example frame of Pedestrians (a) and ground truth image (b)

3.2.2. Evaluation methods

The efficiency of the method is evaluated by the precision, recall, and F1 score values. The precision value is calculated by:

    Precision = TruePositive / (TruePositive + FalsePositive)

the recall value is calculated by:

    Recall = TruePositive / (TruePositive + FalseNegative)

and the F1 score is calculated by:

    F1 = 2 × Precision × Recall / (Precision + Recall)

where:
• TruePositive: the total number of macroblocks correctly detected as a moving object
• FalsePositive: the total number of macroblocks that are background but detected as a moving object
• FalseNegative: the total number of macroblocks that are a moving object but not detected

High precision means that the accuracy of the method is good. High recall means that the percentage of missed moving objects is low. A perfect system would have both precision and recall at 100%; in practice this is impossible. Normally, tuning the system to prioritize precision reduces recall, and vice versa. In that case, we can use the F1 score, which balances precision and recall.

3.2.3. Implementations

The proposed method in this thesis is implemented in C++. Our experiments were done on a Windows PC with an Intel Core i5-3337U, 1.8 GHz, and 8 GB RAM. Based on observation, we have seen that Ts should be chosen empirically for each test video. The other parameters are set to Tc = 90%, TA = 10°, TL = 20, and Tdensity = 80%.

3.2.4. Experimental results

The experiment on the videos in the 1st group was run many times and the best result was selected. Table 3.3 shows the comparative experimental results of the 2 approaches on these videos.
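As a concrete reference for how the per-video numbers in Tables 3.3–3.5 follow from the macroblock counts defined in Section 3.2.2, here is a minimal sketch. The set-based representation of detected and ground-truth MBs is an assumption for illustration, not the thesis implementation.

```python
# Minimal sketch of the evaluation metrics above. `detected` and `truth`
# are sets of (row, col) macroblock coordinates; this representation is an
# illustrative assumption, not the thesis code.
def evaluate(detected, truth):
    tp = len(detected & truth)   # moving MBs correctly detected
    fp = len(detected - truth)   # background MBs flagged as moving
    fn = len(truth - detected)   # moving MBs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, a frame where two of three detected MBs are correct and one moving MB is missed gives precision = recall = 2/3.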
When using the proposed method, the average precision over the four videos is 80%, the average recall is 84%, and the average F1 score is 81.95122. When using Poppe's method, the average precision is 81%, the average recall is 83%, and the average F1 score is 81.9878. We can see that the performance of our method is equivalent to that of Poppe's method when applied to low-resolution video.

Table 3.3. The performance of the two approaches on Pedestrians, PETS2006, Highway, and Office

              Our approach                              Poppe's approach
Video         Precision (%)  Recall (%)  F1             Precision (%)  Recall (%)  F1
pedestrians   84             95          89.16201       80             90          84.70588
PETS2006      87             80          83.35329       88             78          82.6988
Highway       77             81          78.94937       78             80          78.98734
Office        72             82          76.67532       75             83          78.79747
Average       80             84          81.95122       81             83          81.9878

With the 2nd video group, the high-resolution videos, the proposed method was run many times with different Ts parameters and the 4 best results were selected. Table 3.4 shows the experimental results of Poppe's approach and Table 3.5 the experimental results of the proposed method on these videos. The results show that the recall values of Poppe's approach are usually smaller than those of the proposed method, meaning that Poppe's approach misses more moving objects than the proposed method. This happens because there are many "skip_mode" MBs in a frame of a high-resolution video.

Table 3.4. The experimental result of Poppe's approach on the 2nd group

Video         Precision   Recall   F1
HMI_WetRoad   0.4954      0.8943   0.6376
HMI_OutDoor   0.5145      0.7711   0.6172
GVO2_0308     0.6821      0.6016   0.6393
NEM1_131      0.6055      0.7602   0.6741
DNG8_1708     0.8777      0.7489   0.8082
CuaHang_01    0.7468      0.8339   0.7880
TrongNha_02   0.8341      0.7247   0.7756

In addition, the experimental results in Table 3.5 show that the videos with good results are those with less noise and a clear distinction between the background and the moving objects.
Moreover, the results do not depend on whether the videos were captured by outdoor or indoor cameras. As shown in the results table, the best result is on the TrongNha_02 video (Fig. 3.3a), with an F1 score of 0.8771. This video was recorded in a working room (namely a police station), under good environmental conditions with low noise. The moving object is a person who is clearly distinguished from the floor. The shirt of the moving object has only one color, but it is not uniform due to many wrinkles. The worst video is NEM1_131 (Fig. 3.3d), with an F1 score of 0.6235. Although this video was recorded indoors, it has an outward-facing view, and the entrance of the room is made of glass, which easily reflects moving objects. The video was recorded in the evening, so the light outside the room easily creates noise.

Table 3.5. The experimental result of the proposed method on the 2nd group

Video         Ts    Precision   Recall   F1
HMI_WetRoad   90    0.7409      0.8644   0.7979
              100   0.7340      0.8935   0.8059
              110   0.7360      0.8197   0.7756
              120   0.7461      0.9453   0.8340
HMI_OutDoor   70    0.6916      0.8681   0.7699
              80    0.6410      0.8656   0.7366
              90    0.7055      0.8962   0.7895
              100   0.7195      0.9151   0.8056
GVO2_0308     70    0.5926      0.8018   0.6815
              80    0.5770      0.8653   0.6923
              90    0.5376      0.8360   0.6543
              100   0.5821      0.9160   0.7118
NEM1_131      90    0.4762      0.8183   0.6020
              100   0.4655      0.9333   0.6211
              110   0.4847      0.8737   0.6235
              120   0.4855      0.8702   0.6233
DNG8_1708     60    0.7612      0.8164   0.7878
              65    0.7889      0.9217   0.8501
              70    0.7843      0.9157   0.8449
              75    0.7770      0.8789   0.8248
CuaHang_01    75    0.7498      0.8796   0.8095
              80    0.7676      0.9302   0.8411
              85    0.7372      0.8598   0.7938
              90    0.6828      0.9339   0.7889
TrongNha_02   50    0.8283      0.9319   0.8771
              55    0.8139      0.9095   0.8590
              60    0.8248      0.9261   0.8725
              65    0.8254      0.9247   0.8722

The experimental results also show that the choice of the threshold Ts is quite difficult; this is a limitation of the proposed method. Normally, the less noisy the video, the smaller its threshold Ts compared with that of a noisier video. Under the system conditions described above, the processing speed is between 17 and 23 fps.
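The per-video search for Ts described above amounts to re-running the detector for each candidate threshold and keeping the one with the highest F1 score. A small sketch, using the HMI_WetRoad rows of Table 3.5; the dict-based interface is an assumption for illustration:

```python
# Sketch of the Ts selection behind Table 3.5: compute F1 for each candidate
# threshold's (precision, recall) pair and keep the threshold with the best F1.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

def best_threshold(runs):
    """runs maps a candidate Ts to its measured (precision, recall) pair."""
    return max(((ts, f1_score(p, r)) for ts, (p, r) in runs.items()),
               key=lambda item: item[1])

# (precision, recall) per Ts for HMI_WetRoad, taken from Table 3.5
hmi_wetroad = {90: (0.7409, 0.8644), 100: (0.7340, 0.8935),
               110: (0.7360, 0.8197), 120: (0.7461, 0.9453)}
best_ts, best_f1 = best_threshold(hmi_wetroad)   # Ts = 120, F1 ≈ 0.834
```

This reproduces the table's bolded choice for HMI_WetRoad, but each video still needs its own set of candidate runs, which is exactly the limitation discussed above.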
If the program is installed on a Raspberry Pi 2 device, the processing speed is between 22 and 27 fps, depending on the amount of motion in each frame of the video. This speed fully meets the real-time requirements of the problem.

Chapter Summarization

This chapter presents the experimental results of the thesis. The experimental datasets are taken from the database of the Change Detection Workshop 2014 and from more than 100 actual surveillance cameras installed in Hanoi and Da Nang, provided by VP9, covering both indoor and outdoor scenes. These videos were captured without scripting or prior arrangement. The results show that the proposed method can accurately determine moving objects in the benchmark videos of the Change Detection Workshop 2014. In addition, on high-resolution videos, the proposed method performs in real time better than the related works. This may be due to the appearance of many "skip_mode" MBs in a frame of a high-resolution video. The proposed method has also been used to build a moving object detection application for industrial use.

CONCLUSIONS

The thesis proposes a new moving object detection method in the H264/AVC compressed domain for high-resolution video surveillance that exploits not only the sizes of MBs but also the characteristics of the MV fields of moving objects to identify the moving objects of interest. The method can quickly detect most regions that contain moving objects, even objects of uniform color. The thesis is the result of a real project of a company, so its applicability in practice is very high. An application using the proposed method can help people search for and detect the moments when movement happens more effectively, saving a lot of time and effort. However, the proposed method still needs empirical thresholds in order to accurately detect the moving objects of interest.
In some scenes, noisy motion such as swaying tree branches cannot be removed because the motion value of the branches is high. For future work, we will focus on making the system self-tune the thresholds by using machine learning to get the best results.

List of author's publications related to the thesis

1. Minh Hoa Nguyen, Tung Long Vuong, Dinh Nam Nguyen, Do Van Nguyen, Thanh Ha Le and Thi Thuy Nguyen, "Moving Object Detection in Compressed Domain for High Resolution Videos," SoICT '17, pp. 364-369, 2017.

2. Nguyễn Đình Nam, Nguyễn Thị Thủy, Nguyễn Đỗ Văn, Nguyễn Minh Hòa, Vương Tùng Long, Lê Thanh Hà, "Method for analyzing and storing motion description information in video content, and storage medium for aggregated motion description data in video content" (original title in Vietnamese: "Phương pháp phân tích và lưu trữ thông tin mô tả chuyển động trong nội dung viđeo và phương tiện lưu trữ dữ liệu tổng hợp mô tả chuyển động trong nội dung viđeo"). Pending patent, filed 03/05/2017.

REFERENCES

[1] S. Aslam, "Omnicore," Omnicore Group, 18 9 2018. [Online]. Available: https://www.omnicoreagency.com/youtube-statistics/.
[2] M. Piccardi, "Background subtraction techniques: a review," IEEE International Conference on Systems, Man and Cybernetics, pp. 3099-3104, 2004.
[3] C. Wren, A. Azarbayejani, T. Darrell and A. Pentland, "Pfinder: real-time tracking of the human body," IEEE Trans. on Pattern Anal. and Machine Intell., vol. 19, pp. 780-785, 1997.
[4] J. T. J. G. B. a. S. D. Koller, "Towards Robust Automatic Traffic Scene Analysis in Real-time," Proc. ICPR '94, pp. 126-131, 1994.
[5] B. Lo and S. A. Velastin, "Automatic congestion detection system for underground platforms," Proc. ISIMP 2001, pp. 158-161, 2001.
[6] R. Cucchiara, C. Grana, M. Piccardi and A. Prati, "Detecting moving objects, ghosts, and shadows in video streams," IEEE Trans. on Pattern Anal. and Machine Intell., vol. 25, pp. 1337-1342, 2003.
[7] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," Proc. IEEE CVPR 1999, pp. 246-252, 1999.
[8] P. Power and J. A. Schoonees, "Understanding background mixture models for foreground segmentation," Proc. of IVCNZ 2002, pp. 267-271, 2002.
[9] M. T. a. P. R.
Venkatesh Babu, "A survey on compressed domain video analysis techniques," Multimedia Tools and Applications, vol. 75, pp. 1043–1078, 2016.
[10] T. Wiegand, G. J. Sullivan, G. Bjøntegaard and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, 2003.
[11] W. Zeng, J. Du, D. Gao and Q. Huang, "Robust moving object segmentation on H.264/AVC compressed video using the block-based MRF model," Real-Time Imaging, vol. 11, pp. 36-44, 2009.
[12] Zhi Liu, Y. Lu and Z. Zhang, "Real-time spatiotemporal segmentation of video objects in the H.264 compressed domain," Journal of Visual Communication and Image Representation, vol. 18, pp. 275–290, 2007.
[13] C. Solana-Cipres, G. Fernandez-Escribano, L. Rodriguez-Benitez, J. Moreno-Garcia and L. Jimenez-Linares, "Real-time moving object segmentation in H.264 compressed domain based on approximate reasoning," International Journal of Approximate Reasoning, vol. 51, pp. 99–114, 2009.
[14] C.-M. Mak and W.-K. Cham, "Real-time video object segmentation in H.264 compressed domain," IET Image Processing, vol. 3, pp. 272–285, 2009.
[15] S. De Bruyne, C. Poppe, S. Verstockt, P. Lambert and R. Van de Walle, "Estimating motion reliability to improve moving object detection in the H.264/AVC domain," IEEE International Conference on Multimedia and Expo, pp. 290–299, 2009.
[16] Shizheng Wang, Z. Y. Wang and R. M. Hu, "Surveillance video synopsis in the compressed domain for fast video browsing," Journal of Visual Communication and Image Representation, vol. 24, pp. 1431–1442, 2013.
[17] M. Laumer, P. Amon, A. Hutter and A. Kaup, "Compressed Domain Moving Object Detection by Spatio-Temporal Analysis of H.264/AVC Syntax Elements," Picture Coding Symposium (PCS), pp. 282–286, 2015.
[18] M. Tom, R. Venkatesh Babu and R. G. Praveen, "Compressed domain human action recognition in H.264/AVC video streams," Multimedia Tools and Applications, vol. 74, no. 21, pp. 9323–9338, 2015.
[19] B. R.
Biswas S, "Real-time anomaly detection in H.264 compressed videos," National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, pp. 1-4, 2013.
[20] S. Biswas and R. V. Babu, "Anomaly detection in compressed H.264/AVC video," Multimedia Tools and Applications, pp. 1–17, 2014.
[21] V. Thilak and C. D. Creusere, "Tracking of extended size targets in H.264 compressed video using the probabilistic data association filter," 12th European Signal Processing Conference, pp. 281–284, 2004.
[22] W. You, M. S. Sabirin and M. Kim, "Moving object tracking in H.264/AVC bitstream," Multimedia Content Analysis and Mining, pp. 483-492, 2007.
[23] C. Käs and H. Nicolas, "An Approach to Trajectory Estimation of Moving Objects in the H.264 Compressed Domain," Advances in Image and Video Technology, pp. 318-329, 2009.
[24] C. Poppe, S. De Bruyne, T. Paridaens, P. Lambert and R. Van de Walle, "Moving object detection in the H.264/AVC compressed domain for video surveillance applications," Journal of Visual Communication and Image Representation, vol. 20, pp. 428–437, 2009.
[25] A. Vacavant, L. Robinault, S. Miguet, C. Poppe and R. Van de Walle, "Adaptive background subtraction in H.264/AVC bitstreams based on macroblock sizes," Computer Vision Theory and Applications (VISAPP), pp. 51–58, 2011.
[26] A. Divakaran, K. A. Peker and H. Sun, "Method for summarizing a video using motion and color descriptors," US Patent US09634364, 09 08 2000.
[27] K. Ratakonda, "Method for hierarchical summarization and browsing of digital video," US Patent US5956026A, 19 12 1997.
[28] K. C. L. H. T. O. Lipin Liu, "Intelligent, dynamic, long-term digital surveillance media storage system," US Patent US7751632B2, 15 02 2005.
[29] ISO/IEC JTC 1, "ISO/IEC 14496-10," ISO and IEC, 2014. [Online]. Available: https://www.iso.org/obp/ui/#iso:std:iso-iec:14496:-10:ed-8:v1:en.
[30] D. M., "Gentle Logic," 16 11 2011. [Online]. Available:
[31] R. Finlayson, "LIVE555.COM," Live Networks, Inc., [Online].
Available:
[32] K. Suehring, "Fraunhofer," Fraunhofer Heinrich Hertz Institute. [Online]. Available:
[33] V. L. and K. Wong, "Design & Reuse," Ocean Logic Pty Ltd. [Online]. Available: https://www.design-reuse.com/articles/12849/designing-a-real-time-hdtv-1080p-baseline-h-264-avc-encoder-core.html.
