Visual Saliency Detection by Spatially Weighted Dissimilarity

Lijuan Duan1, Chunpeng Wu1, Jun Miao2, Laiyun Qing3, Yu Fu4
1College of Computer Science and Technology, Beijing University of Technology, Beijing 100124, China
2Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
3School of Information Science and Engineering, Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
4Department of Computing, University of Surrey, Guildford, Surrey, UK GU2 7XH
Abstract

In this paper, a new visual saliency detection method based on spatially weighted dissimilarity is proposed. We measure the saliency of an image patch by integrating three elements: the dissimilarities between image patches, evaluated in a reduced dimensional space; the spatial distances between patches; and the central bias. Each dissimilarity is inversely weighted by the corresponding spatial distance, and a weighting mechanism reflecting the bias of human fixations toward the center of the image is applied. Principal component analysis (PCA) is used for dimensionality reduction, with the principal components (PCs) extracted by sampling patches from the current image. Our method is compared with four saliency detection approaches on three image datasets. Experimental results show that it outperforms current state-of-the-art methods at predicting human fixations.

1. Introduction

The human visual system is able to select salient information from a mass of visual input and focus on it. Observers never form a complete, detailed representation of their surroundings [1]. This selective attention mechanism enables us to efficiently capture prey and evade predators, and is a crucial feature for survival. Due to its biological importance, much effort has been devoted to probing the nature of attention [2]. Meanwhile, for many applications in graphics, design and human-computer interaction, e.g., image search, understanding where humans look in a scene is an essential function [3]. Therefore, computationally modeling this mechanism has become a popular research topic in recent years [4-8], and more and more work has been published on effectively simulating this intelligent behavior of the human visual system.

In this paper, our goal is to measure the saliency of each patch drawn from an image. There are three elements in our definition of saliency: dissimilarity, spatial distance and central bias. We first combine two kinds of information: the dissimilarities between patches, evaluated in a reduced dimensional space, and the spatial distances between them, evaluated in the spatial domain. If one patch is more distinct than all the other patches in the reduced dimensional space, it is more likely to be a candidate salient region. However, as the spatial distance between two patches increases, the influence of the dissimilarity between them decreases. Therefore, the dissimilarity is inversely weighted by the distance, which we call the spatially weighted dissimilarity. In addition, according to previous studies on the distribution of human fixations on images [27], people tend to gaze at the center. We therefore also propose a weighting mechanism that reflects a strong bias toward the center of the image.

In our method, we use PCA to reduce the dimensionality of each image patch, represented as a vector. The PCs discard dimensions that are noise with respect to the saliency calculation (e.g., high spatial frequencies that are ignored when evaluating fixations due to peripheral blur). The method of Rajashekar et al. [9] also inspired us: they filter the image using PCs extracted from patches at fixation locations and linearly add the saliency maps generated from each PC, with promising experimental results. Because we apply our method to predicting human fixations, we also rely on the following conclusion [10]: fixation locations have a steeper two-point correlation function than locations selected at random, and therefore the spatial correlations analyzed by PCA can be used to distinguish fixation locations from random ones.

In order to evaluate the performance of our proposed method, we carried out experiments on two color image datasets and a gray image dataset. By comparing the saliency maps generated by several state-of-the-art saliency detection approaches and by our method against the corresponding eye-tracking data, we demonstrate that our method predicts human fixations more effectively.

The remainder of the paper is organized as follows. Previous work is discussed in the following section. In Section 3, we describe the framework of our saliency detection method in detail. In Section 4, we present experimental results on three image datasets and compare them with four other saliency detection methods. In Section 5, we discuss applications of our method. Conclusions are given in Section 6.

2. Previous Work

Based on Treisman's feature integration theory [11], the bottom-up model proposed by Itti et al. [4] focuses on the role of early visual features such as intensity, color and orientation. These features are used by many methods [3, 8, 12, 13], in which the relevant structures are determined manually, such as selecting Gabor filters as visual filters. In contrast, Kienzle et al. [14] inferred these relevant structures automatically from data. In our method, we find low-level fixation attractors similar to those discussed by Rajashekar et al. [9] without specifying which image features should be analyzed. However, compared to [9], we extract the PCs by sampling patches from the current image rather than from a large number of images. We suppose that PCA over patches within each image emphasizes the variability within that image, which is what drives fixation.

In the frequency domain, Hou et al. [15] analyzed the log amplitude spectrum of an image and obtained a spectral residual that indicates saliency, while Guo et al. [16] argued that the phase spectrum of the Fourier transform, rather than the amplitude spectrum, is the key to locating salient areas. The method in [15] is in fact based on the well-known 1/f law (i.e., scale invariance), which measures the spatial correlations between pixels in an image [17]. In our method, we analyze the correlations directly in the reduced dimensional space rather than in the frequency domain.

Following the visual saliency model of [4], the center-surround mechanism has been widely studied [7, 12]. Besides this measure, other criteria for finding "irregular patterns" in images have also been used in the literature. Bruce et al. [6] computed saliency based on self-information. Hou et al. [18] introduced the Incremental Coding Length (ICL) to measure the perspective entropy gain from each feature. Harel et al. [19] and Gopalakrishnan et al. [20] formulated saliency detection as Markov random walks on images represented as graphs. In our method, saliency is determined by the spatially weighted dissimilarity.

3. Proposed Saliency Measurement

The proposed framework is shown in Fig. 1. There are three main steps in our method: representing image patches, reducing dimensionality and evaluating the spatially weighted dissimilarity. Non-overlapping patches drawn from an image are represented as vectors of pixels. All patches are mapped into a reduced dimensional space. The saliency of each image patch is then determined by aggregating the spatially weighted dissimilarities between this patch and all the other patches in the image; the weighting mechanism reflecting the central bias is used in this step. Finally, the saliency map is normalized, resized to the scale of the original image, and smoothed with a Gaussian filter ($\sigma = 3$). In Fig. 1, for better illustration, all patches are reduced to 3 dimensions, which can be arbitrary in practice.
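The post-processing just described can be written compactly. The snippet below is a minimal sketch, assuming the per-patch saliency values from Section 3.3 are already available as a 2D NumPy array; the SciPy calls used for resizing and smoothing are our implementation choices, not part of the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def postprocess(sal, image_shape, sigma=3.0):
    """Normalize, resize to the original image scale, and smooth (sigma = 3)."""
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)      # normalize to [0, 1]
    factors = (image_shape[0] / sal.shape[0], image_shape[1] / sal.shape[1])
    sal = zoom(sal, factors, order=1)                              # resize to H x W
    return gaussian_filter(sal, sigma=sigma)                       # Gaussian smoothing
```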

3.1. Image Patches Representation

Given an $H \times W$ image $I$, non-overlapping patches of size $k \times k$ pixels are drawn from it, so the total number of patches is $L = \lfloor H/k \rfloor \cdot \lfloor W/k \rfloor$. A patch is denoted as $p_i$, where $i = 1, 2, \ldots, L$. Each patch is represented as a column vector $\mathbf{f}_i$ of pixel values. The length of the vector is $3k^2$, since the color space has three components. Finally, we obtain the sample matrix $\mathbf{A} = [\mathbf{f}_1\ \mathbf{f}_2\ \cdots\ \mathbf{f}_L]$, where $L$ is the total number of patches as stated above. In Section 4.2, we will comment on the effect of allowing patches to overlap and of allowing the grid divisions to be arbitrarily located.

Figure 1: The framework of the proposed method. There are three main steps in our method: representing image patches, reducing dimensionality, and evaluating the spatially weighted dissimilarity. The weighting mechanism reflecting the central bias is used in the third step. In the figure, for better illustration, all patches are reduced to 3 dimensions, which can be arbitrary in practice.
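As an illustration of this patch representation, the following sketch builds the sample matrix A from non-overlapping k x k patches. It assumes the image is already a NumPy array in the chosen three-channel color space; the helper name and the returned patch centers (kept for the spatial weights of Section 3.3) are our own additions.

```python
import numpy as np

def build_sample_matrix(img, k=14):
    """Collect non-overlapping k x k patches as columns of A (shape 3*k*k x L)."""
    H, W, _ = img.shape
    rows, cols = H // k, W // k                     # L = floor(H/k) * floor(W/k)
    patches, centers = [], []
    for r in range(rows):
        for c in range(cols):
            patch = img[r * k:(r + 1) * k, c * k:(c + 1) * k, :]
            patches.append(patch.reshape(-1).astype(np.float64))   # f_i, length 3*k*k
            centers.append(((r + 0.5) * k, (c + 0.5) * k))          # patch center (pixels)
    A = np.stack(patches, axis=1)                   # A = [f_1 f_2 ... f_L]
    return A, np.array(centers)
```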

3.2. Dimensionality Reduction

We aim to describe patches effectively in a relatively low dimensional space. As discussed in Section 1, we use a method equivalent to PCA to reduce the data dimension. Each column of the matrix $\mathbf{A}$ has the average over the columns subtracted from it. We then compute the co-similarity matrix $\mathbf{G} = (\mathbf{A}^T \mathbf{A}) / L^2$, so the size of $\mathbf{G}$ is $L \times L$. The eigenvalues and eigenvectors of $\mathbf{G}$ are calculated, and the $d$ largest eigenvalues are selected together with their eigenvectors $\mathbf{U} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_d]^T$, where $\mathbf{X}_i$ is an eigenvector. The size of the matrix $\mathbf{U}$ is $d \times L$. As illustrated in Fig. 1, a patch is mapped to a point in the reduced dimensional space, and the positions of these points are determined by the values of the eigenvectors in $\mathbf{U}$. For example, the patch $p_i$ is represented by the corresponding column $(x_{1i}, \ldots, x_{di})^T$ of $\mathbf{U}$, where $i = 1, 2, \ldots, L$. In Section 4.2, we will show the results of our method with and without the dimensionality reduction step.
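A minimal sketch of this Gram-matrix formulation is given below, assuming the sample matrix A from Section 3.1. The function name and the use of NumPy's symmetric eigendecomposition are our choices; the steps mirror the description above.

```python
import numpy as np

def reduce_dimensionality(A, d=11):
    """Return the d x L matrix U; column i is the reduced representation of patch p_i."""
    L = A.shape[1]
    A = A - A.mean(axis=1, keepdims=True)       # subtract the mean patch from each column
    G = (A.T @ A) / (L ** 2)                    # L x L co-similarity matrix
    eigvals, eigvecs = np.linalg.eigh(G)        # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:d]         # indices of the d largest eigenvalues
    return eigvecs[:, idx].T                    # U = [X_1, ..., X_d]^T, size d x L
```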

3.3. Evaluating the Spatially Weighted Dissimilarity

Two factors are considered when evaluating saliency: the dissimilarities between image patches in the reduced dimensional space, and their spatial distances. As the spatial distance between two patches increases, the influence of the dissimilarity between them decreases. Therefore, the dissimilarities are inversely weighted by the corresponding spatial distances. Furthermore, the distance of each patch from the center of the image is incorporated into the saliency evaluation because of the central bias [3, 27]: as the distance between a patch and the center increases, the saliency of the patch should be appropriately reduced.

By integrating dissimilarity, spatial distance and central bias, the saliency of patch $p_i$ is defined as

$$\mathrm{Saliency}(i) = \omega_2(i) \sum_{j=1}^{L} \big\{ \omega_1(i,j) \cdot \mathrm{Dissimilarity}(i,j) \big\} \quad (1)$$

where $\omega_1(i,j)$ is defined as

$$\omega_1(i,j) = \frac{1}{1 + \mathrm{Dist}(p_i, p_j)} \quad (2)$$

and $\mathrm{Dist}(p_i, p_j)$ is the spatial distance between the centers of patch $p_i$ and patch $p_j$ in the image. The dissimilarity between patch $p_i$ and patch $p_j$ in the reduced dimensional space is defined as

$$\mathrm{Dissimilarity}(i,j) = \sum_{s=1}^{d} |x_{si} - x_{sj}| \quad (3)$$

In Equation (1), besides the weight $\omega_1$, which represents a biologically plausible characteristic similar to the Mexican hat function, $\omega_2$ is a second weighting mechanism that we propose according to the average saliency map of human eye fixations, which shows a bias toward the center of the image (see the bottom right of Fig. 4 in Judd et al. [3]). $\omega_2(i)$ is defined as

$$\omega_2(i) = 1 - \mathrm{DistToCenter}(p_i) / D \quad (4)$$

where $\mathrm{DistToCenter}(p_i)$ is the spatial distance between the center of patch $p_i$ and the center of the original image, and $D = \max_j \{\mathrm{DistToCenter}(p_j)\}$ is a normalization factor.

We will demonstrate in Section 4.2 that both spatial weighting mechanisms $\omega_1$ and $\omega_2$ play a significant role in saliency computation.
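Putting Equations (1)-(4) together, the sketch below computes the saliency of every patch from the reduced representation U (Section 3.2) and the patch centers (Section 3.1). The vectorized NumPy form is our own; each commented line corresponds to one of the equations above.

```python
import numpy as np

def compute_saliency(U, centers, image_center):
    """Saliency(i) for all patches, following Equations (1)-(4)."""
    # Dissimilarity(i, j): L1 distance in the reduced space, Eq. (3)
    dissim = np.abs(U[:, :, None] - U[:, None, :]).sum(axis=0)          # L x L
    # omega_1(i, j) = 1 / (1 + Dist(p_i, p_j)), Eq. (2)
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    w1 = 1.0 / (1.0 + dist)
    # omega_2(i) = 1 - DistToCenter(p_i) / D, Eq. (4)
    dist_to_center = np.linalg.norm(centers - image_center, axis=1)
    w2 = 1.0 - dist_to_center / dist_to_center.max()
    # Saliency(i) = omega_2(i) * sum_j omega_1(i, j) * Dissimilarity(i, j), Eq. (1)
    return w2 * (w1 * dissim).sum(axis=1)
```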

4. Experimental Validation

We applied our method to three public image datasets to evaluate its performance. It was compared with four state-of-the-art saliency detection models using a commonly used validation approach. The same parameters were used across all datasets to illustrate the robustness of our method.

Figure 2: Results for a qualitative comparison between our method and human fixations on color image dataset 1. The columns from left to right are: the input images, our saliency maps, and the human fixation density maps.

4.1. Parameters Selection

There are three parameters in our method: (1) the color space to which each patch is transformed, (2) the dimension d to which each vector representing a patch is reduced, and (3) the size k of each patch. For the color space, we choose YCbCr, which is perceptually uniform and better approximates color processing in the human visual system, as Gopalakrishnan et al. show in [20]. For each vector, the Y, Cb and Cr channels are stacked one after another. We choose d = 11 as the reduced dimension because it maximizes the saliency prediction performance; the relationship between the dimensionality and the performance of our method is shown in Section 4.2. For the patch size k, we choose 14 because it is the smallest size that achieves the maximal AUC; the relationship between the patch size and performance is also shown in Section 4.2.
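To show how these parameter choices plug into the sketches of Section 3, here is a hypothetical end-to-end driver. It assumes the helper functions sketched earlier and uses scikit-image's rgb2ycbcr for the color conversion, which is an implementation detail we chose and not something specified in the paper.

```python
import numpy as np
from skimage.color import rgb2ycbcr

def saliency_map(rgb_image, k=14, d=11, sigma=3.0):
    """Hypothetical driver combining the Section 3 sketches with k = 14, d = 11."""
    img = rgb2ycbcr(rgb_image).astype(np.float64)          # YCbCr color space
    A, centers = build_sample_matrix(img, k=k)             # Section 3.1 sketch
    U = reduce_dimensionality(A, d=d)                      # Section 3.2 sketch
    H, W = img.shape[:2]
    sal = compute_saliency(U, centers, np.array([H / 2.0, W / 2.0]))   # Section 3.3 sketch
    sal = sal.reshape(H // k, W // k)                      # per-patch saliency map
    return postprocess(sal, (H, W), sigma=sigma)           # normalize, resize, smooth
```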

4.2. Results on Color Image Dataset 1

The color image dataset we used was introduced by Bruce et al. [6]. It contains 120 indoor and outdoor images (all 681 × 511 pixels), and fixations of 20 subjects are recorded for each image. To compare the saliency maps with the human fixations, we use the popular validation approach introduced by Bruce et al. and Tatler et al. [6, 21]: the area under the Receiver Operating Characteristic (ROC) curve, i.e., the AUC, is used to quantitatively evaluate model performance.
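As a rough illustration of this evaluation, the sketch below computes an AUC from a saliency map and a binary fixation map with scikit-learn. It is a simplified stand-in that treats fixated pixels as positives and all remaining pixels as negatives; it is not the exact procedure implemented in Harel et al.'s code, which we used for the reported numbers.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_auc(sal, fixation_map):
    """AUC of the saliency map `sal` against a binary fixation map of the same size."""
    return roc_auc_score(fixation_map.ravel().astype(int), sal.ravel())
```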

Fig. 2 qualitatively compares our saliency maps with the fixation density maps, which are generated as a sum of 2D Gaussians approximating the drop-off of the density of human fixations. We also compare our saliency maps with those of four other state-of-the-art approaches [4, 6, 18, 19] in Fig. 3. The comparison shows that the most salient locations (the regions in the red circles in particular) on our saliency maps are more consistent with the human fixation density maps. For instance, in the fifth image, the car near the center is attended by human observers, but it is detected as salient only by our method. As shown in Table 1, our method outperforms the other four methods at predicting human fixations. Furthermore, the ROC curves compared in Fig. 4 show that our method achieves higher hit rates at lower false positive rates.

Table 1: Performances on the color image dataset 1

Attention Model      AUC      Improvement
Itti et al. [4]      0.7049   -
Bruce et al. [6]     0.7613   0.0564
Hou et al. [18]      0.7923   0.0310
Harel et al. [19]    0.8021   0.0098
Our method           0.8333   0.0312


Figure 3: Results for a qualitative comparison between our method and the other four approaches on the color image dataset 1. The columns from the left to the right are: the input images, the saliency maps of Itti et al.’s method [4], Bruce et al.’s method [6], Hou et al.’s method [18], Harel et al.’s method [19], our method and the human fixation density maps. The most salient locations (the regions in the red circles in particular) on our saliency maps are more consistent with the human fixation density maps.

The results listed in Table 1 for the four compared methods differ from those published in the corresponding papers [6, 18, 19] because the sampling density we used to obtain the thresholds differs from theirs. However, in our experiments all methods were evaluated with the same validation approach (we generated the ROC curves using Harel et al.'s code [19]), so their relative performance should not be affected.

We compared the results generated by our method with and without dimensionality reduction; the comparison is shown in Fig. 5. Based on these two curves, we investigated the relationship between the AUC and the size of the image patches. In Fig. 5(a), although the trends of the two curves are nearly the same, the method without dimensionality reduction always yields a smaller AUC for the same patch size. Numerically, the average AUC difference between the two methods is 0.02, which, judging from the improvements listed in the third column of Table 1, means that the dimensionality reduction is important. In Fig. 5(a), the curve for our proposed method (the red curve with circles) shows that the AUCs for patch sizes from 11 to 24 are all above 0.83 and are not significantly different. Among these sizes, we choose 14, as stated in Section 4.1.

Fig. 5(b) shows the relationship between the AUC and the dimension d to which each vector representing an image patch is reduced, with the patch size k fixed at 14. When the dimensionality d is larger than 10, the AUC decreases as the dimensionality increases. That is to say, apart from the first few PCs, most of the other PCs rarely contribute to saliency detection. In fact, according to Hyvärinen et al. [28], these PCs (except the first few) do not have meaningful spatial structure. In addition, in Fig. 5(b), the AUCs above 0.83 correspond to dimensionalities d from 6 to 17. To maximize saliency prediction, we choose d = 11 in our method.

We also investigated the role of the two weighting mechanisms $\omega_1$ and $\omega_2$ introduced in Section 3.3. Setting $\omega_2$ to 1 decreases the AUC from 0.8333 to 0.7986, which is lower than the second-largest AUC of 0.8021 obtained with Harel et al.'s method [19]. If we use flat weights, i.e., both $\omega_1$ and $\omega_2$ set to 1, the AUC drops sharply from 0.8333 to 0.7297. Therefore, both spatial weighting mechanisms play a significant role in saliency computation.

Figure 5: (a) Comparison between our method with and without dimensionality reduction; each curve shows the relationship between the AUC and the patch size k. (b) The relationship between the AUC and the dimension d when k is 14.

Figure 4: The ROC curves of our model and the other four approaches on the color image dataset 1.

As described in Section 3.1, non-overlapping patches are drawn from the images in our method. When patches are half-overlapped, the AUC decreases from 0.8333 to 0.8276. This is because more similar patches are produced near the current patch; they are given higher weights by $\omega_1(i,j)$ in Equation (1) and dominate the decrease of the dissimilarity between them and the current patch. However, the decreased AUC of 0.8276 is still the highest result among the state-of-the-art methods. In addition, adopting arbitrary locations for the grid divisions does not cause a significant variation in the AUC, as shown in Table 2. For example, "Offset = 0.5" in Table 2 indicates that the grid of divisions is moved by 0.5 block sizes rightwards and downwards; the AUC varies only by -0.0006 and is still the highest result.

Table 2: Arbitrary location of grid divisions and AUC results

Offset (block size)   AUC
0.1                   0.8333
0.3                   0.8322
0.5                   0.8327
0.7                   0.8300
0.9                   0.8288
1.0                   0.8293

4.3. Results on Color Image Dataset 2

We tested our method on another color image dataset, introduced by Judd et al. [3], using the same parameters as stated in Section 4.1. This dataset contains 1003 natural images of different scenes and objects, with the corresponding human fixations recorded. Unlike the Bruce dataset used in Section 4.2, the images in this dataset vary in size (the width varies from 682 to 1024 pixels and the height from 628 to 1024 pixels). The AUC results are shown in Table 3; our method achieves the highest AUC. A comparison of saliency maps is shown in Fig. 6.

Table 3: Performances on the color image dataset 2

Attention Model      AUC
Bruce et al. [6]     0.7181
Hou et al. [18]      0.7625
Itti et al. [4]      0.7640
Harel et al. [19]    0.8172
Our method           0.8330

4.4. Results on Gray Image Dataset

The gray image dataset is DOVES, introduced by van der Linde et al. [22]. The same parameters as stated in Section 4.1 are used. DOVES is a collection of eye movements recorded from 29 human observers viewing 101 natural calibrated images. We removed the first fixation of each eye movement trace superimposed on an image, because these fixations are forced [22] and are not influenced by the saliency information of the current image. In Fig. 7, we compare our saliency maps with those of Itti et al.'s approach [4]. We used the AUC to quantitatively evaluate the performance of our method; the results are listed in Table 4.

Compared to the AUC results on color images, the results on gray images do not decrease much, as shown in Table 4. However, the salient regions detected on gray images by Itti et al.'s method [4] and by our method are not as consistent with human fixations as the regions detected by these methods on color images. This might be because the images in the color image dataset contain semantic objects, whereas the images in the gray image dataset contain more raw signals. Meanwhile, Heidemann [23] states that for the same eigenvalue, the corresponding PC from a color image has lower spatial frequency than the PC from a gray image, and features with lower spatial frequency are preferable for many vision tasks (e.g., biological vision tasks) because they are more robust to translation and image distortion.

Figure 6: Results for a qualitative comparison between our method and the other four approaches on the color image dataset 2. The columns from left to right are: the input images, the saliency maps of Bruce et al.'s method [6], Hou et al.'s method [18], Itti et al.'s method [4], Harel et al.'s method [19], our method, and the human fixation density maps.

Table 4: Performances on the gray image dataset

Attention Model    AUC
Itti et al. [4]    0.7101
Our method         0.8352

5. Discussion

The saliency maps produced by our method are somewhat spread out, i.e., large regions of the image appear quite salient. In contrast, Achanta et al.'s method [8] generates sharper and more uniformly highlighted salient regions. Using the ROC evaluation described above, we found that the AUC of [8] is smaller than those of Harel et al.'s method [19] and our method. As pointed out in [8], "the true usefulness of a saliency map is determined by the application". Our saliency maps integrate information from several biologically plausible elements, and different salient regions can be obtained by setting different thresholds for different visual tasks. For example, with a higher threshold, the saliency map produces sparse salient regions approximating human fixations; with a lower threshold, the diffuse saliency map produces a "coarse segmentation" effect in which the salient regions contain the salient objects. Based on this, skeleton features can be extracted and used for fine segmentation. The work of Achanta et al. [8] and Yu et al. [26] indicates the possible application of the proposed method to saliency-guided segmentation.
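As a simple illustration of this thresholding, assuming a normalized saliency map with values in [0, 1], a minimal sketch follows; the threshold values are arbitrary examples.

```python
import numpy as np

def salient_regions(sal, threshold):
    """Binary mask of pixels whose saliency exceeds the threshold."""
    return sal >= threshold

# A higher threshold keeps sparse, fixation-like peaks; a lower one gives a
# coarse segmentation mask containing the salient objects.
# fixation_like = salient_regions(sal, 0.8)
# coarse_mask   = salient_regions(sal, 0.3)
```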

6. Conclusions and Future Work

In this paper, we have presented a visual saliency detection method. To measure saliency, we combine the spatially weighted dissimilarity with a weighting mechanism reflecting the central bias of human fixations; Section 4.2 demonstrates the significant role of both weighting mechanisms. We extract the PCs by sampling patches from the current image, because we suppose that PCA over patches within each image emphasizes the variability within that image, which is what drives fixation. Experimental results on three public image datasets show that our method outperforms several state-of-the-art saliency detection approaches at predicting human fixations. The saliency maps produced by our method are spread out; however, different salient regions can be obtained by setting different thresholds for different visual tasks.

One limitation of PCA is that it only captures structure in which linear pairwise correlations are the most important form of statistical dependence [24]. We are therefore considering how to extend our analysis to include non-linear dependencies in the data. Also, the size of the image patches is predefined and the image is not represented at multiple scales, so the performance of our method may degrade on multi-scale visual attention regions. In future work, we will extend our method with a mechanism that selects suitable scales for a particular location. In addition, in our PCA-based method the 2D image patches must be transformed into 1D vectors, which leads to a high-dimensional space. To spend less time determining the corresponding eigenvectors and to evaluate the covariance matrix more accurately, we will try the two-dimensional PCA introduced by Yang et al. [25], which does not transform the image matrix into a vector.

Acknowledgement. The authors would like to thank the graduate student Chen Chi for technical assistance and Dr. Zhen Yang and Dr. Yuzhi Chen for helpful discussions. This research is partially sponsored by the National Basic Research Program of China (No. 2009CB320902), the Hi-Tech Research and Development Program of China (No. 2006AA01Z122), the Beijing Natural Science Foundation (Nos. 4072023 and 4102013), the Natural Science Foundation of China (Nos. 60702031, 60970087, 61070116, 61070149 and 60802067) and the President Fund of the Graduate University of Chinese Academy of Sciences (No. 085102HN00).

Figure 7: Results for a qualitative comparison between our method and Itti et al.'s approach [4] on the gray image dataset. The columns from left to right are: the input images, the saliency maps of Itti et al.'s method, the saliency maps of our method, and the human fixation density maps.

References

[1] R. Rensink, K. O'Regan and J. Clark. To see or not to see: The need for attention to perceive changes in scenes. Psychological Science, 1997.
[2] J. K. Tsotsos, L. Itti and G. Rees. A brief and selective history of attention. In: Neurobiology of Attention, Editors Itti, Rees & Tsotsos, Elsevier Press, 2005.
[3] T. Judd, K. Ehinger, F. Durand and A. Torralba. Learning to predict where humans look. ICCV, 2009.
[4] L. Itti, C. Koch and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. PAMI, 1998.
[5] T. Kadir and M. Brady. Saliency, scale and image description. IJCV, 2001.
[6] N. D. B. Bruce and J. K. Tsotsos. Saliency based on information maximization. NIPS, 2005.
[7] D. Gao and N. Vasconcelos. Bottom-up saliency is a discriminant process. ICCV, 2007.
[8] R. Achanta, S. Hemami, F. Estrada and S. Susstrunk. Frequency-tuned salient region detection. CVPR, 2009.
[9] U. Rajashekar, L. K. Cormack and A. C. Bovik. Image features that draw fixations. ICIP, 2003.
[10] P. Reinagel and A. M. Zador. Natural scene statistics at the centre of gaze. Network: Computation in Neural Systems, 1999.
[11] A. M. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 1980.
[12] T. Liu, J. Sun, N. N. Zheng, X. Tang and H. Y. Shum. Learning to detect a salient object. CVPR, 2007.
[13] V. Gopalakrishnan, Y. Hu and D. Rajan. Salient region detection by modeling distributions of color and orientation. IEEE Transactions on Multimedia, 2009.
[14] W. Kienzle, M. O. Franz, B. Scholkopf and F. A. Wichmann. Center-surround patterns emerge as optimal predictors for human saccade targets. Journal of Vision, 2009.
[15] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. CVPR, 2007.
[16] C. Guo, Q. Ma and L. Zhang. Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform. CVPR, 2008.
[17] J. Gluckman. Higher order whitening of natural images. CVPR, 2005.
[18] X. Hou and L. Zhang. Dynamic visual attention: Searching for coding length increments. NIPS, 2008.
[19] J. Harel, C. Koch and P. Perona. Graph-based visual saliency. NIPS, 2006.
[20] V. Gopalakrishnan, Y. Hu and D. Rajan. Random walks on graphs to model saliency in images. CVPR, 2009.
[21] B. W. Tatler, R. J. Baddeley and I. D. Gilchrist. Visual correlates of fixation selection: effects of scale and time. Vision Research, 2005.
[22] I. van der Linde, U. Rajashekar, A. C. Bovik and L. K. Cormack. DOVES: A database of visual eye movements. Spatial Vision, 2009.
[23] G. Heidemann. The principal components of natural images revisited. PAMI, 2006.
[24] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.
[25] J. Yang, D. Zhang, A. F. Frangi and J. Y. Yang. Two-dimensional PCA: A new approach to appearance-based face representation and recognition. PAMI, 2004.
[26] H. Yu, J. Li, Y. Tian and T. Huang. Automatic interesting object extraction from images using complementary saliency maps. ACM International Conference on Multimedia, 2010.
[27] B. W. Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 2007.
[28] A. Hyvärinen, J. Hurri and P. O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer: London, 2009.

