Visualizing data with bounded uncertainty

合集下载

数据挖掘与数据分析,数据可视化试题

数据挖掘与数据分析,数据可视化试题

数据挖掘与数据分析,数据可视化试题1. Data Mining is also referred to as ……………………..data analysisdata discovery(正确答案)data recoveryData visualization2. Data Mining is a method and technique inclusive of …………………………. data analysis.(正确答案)data discoveryData visualizationdata recovery3. In which step of Data Science consume Almost 80% of the work period of the procedure.Accumulating the dataAnalyzing the dataWrangling the data(正确答案)Recapitulation of the Data4. Which Step of Data Science allows the model to consistently improve and provide punctual performance and deliverapproximate results.Wrangling the dataAccumulating the dataRecapitulation of the Data(正确答案)Analyzing the data5. Which tool of Data Science is robust machine learning library, which allows the implementation of deep learning ?algorithms. STableauD3.jsApache SparkTensorFlow(正确答案)6. What is the main aim of Data Mining ?to obtain data from a less number of sources and to transform it into a more useful version of itself.to obtain data from a less number of sources and to transform it into a less useful version of itself.to obtain data from a great number of sources and to transform it into a less useful version of itself.to obtain data from a great number of sources and to transform it into a more useful version of itself.(正确答案)7. In which step of data mining the irrelevant patterns are eliminated to avoid cluttering ? Cleaning the data(正确答案)Evaluating the dataConversion of the dataIntegration of data8. Data Science t is mainly used for ………………. purposes. Data mining is mainly used for ……………………. purposes.scientific,business(正确答案)business,scientificscientific,scientificNone9. Pandas ………………... is a one dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.).Series(正确答案)FramePanelNone10. How many principal components Pandas DataFrame consists of ?4213(正确答案)11. Important data structure of pandas is/are ___________SeriesData FrameBoth(正确答案)None of the above12. Which of the following command is used to install pandas?pip install pandas(正确答案)install pandaspip pandasNone of the above13. Which of the following function/method help to create Series? series()Series()(正确答案)createSeries()None of the above14. NumPY stands for?Numbering PythonNumber In PythonNumerical Python(正确答案)None Of the above15. Which of the following is not correct sub-packages of SciPy? scipy.integratescipy.source(正确答案)scipy.interpolatescipy.signal16. How to import Constants Package in SciPy?import scipy.constantsfrom scipy.constants(正确答案)import scipy.constants.packagefrom scipy.constants.package17. ………………….. involveslooking at and describing the data set from different angles and then summarizing it ?Data FrameData VisualizationEDA(正确答案)All of the above18. what involves the preparation of data sets for analysis by removing irregularities in the data so that these irregularities do not affect further steps in the process of data analysis and machine learning model building ?Data AnalysisEDA(正确答案)Data FrameNone of the above19. What is not Utility of EDA ?Maximize the insight in the data setDetect outliers and anomaliesVisualization of dataTest underlying assumptions(正确答案)20. what can hamper the further steps in the machine learning model building process If not performed properly ?Recapitulation of the DataAccumulating the dataEDA(正确答案)None of the above21. Which plot for EDA to check the dependency between two variables ? HistogramsScatter plots(正确答案)MapsTime series plots22. What function will tell you the top records in the data set?shapehead(正确答案)showall of the aboce23. what type of data is useful for internal policymaking and business strategy building for an organization ?public dataprivate data(正确答案)bothNone of the above24. The ………… function can “fill in” NA valueswith non-null data ?headfillna(正确答案)shapeall of the above25. If you want to simply exclude the missing values, then what function along with the axis argument will be use?fillnareplacedropna(正确答案)isnull26. Which of the following attribute of DataFrame is used to display data type of each column in DataFrame?DtypesDTypesdtypes(正确答案)datatypes27. Which of the following function is used to load the data from the CSV file into a DataFrame?read.csv()readcsv()read_csv()(正确答案)Read_csv()28. how to Display first row of dataframe ‘DF’ ?print(DF.head(1))print(DF[0 : 1])print(DF.iloc[0 : 1])All of the above(正确答案)29. Spread function is known as ................ in spreadsheets ?pivotunpivot(正确答案)castorder30. ................. extract a subset of rows from a data fram based on logical conditions ? renamefilter(正确答案)setsubset31. We can shift the DataFrame’s index by a certain number of periods usingthe …………. Method ?melt()merge()tail()shift()(正确答案)32. We can join melted DataFrames into one Analytical Base Table using the ……….. function.join()append()merge()(正确答案)truncate()33. What methos is used to concatenate datasets along an axis ?concatenate()concat()(正确答案)add()merge()34. Rows can be …………….. if the number of missing values is insignificant, as thiswould not impact the overall analysis results.deleted(正确答案)updatedaddedall35. There is a specific reason behind the missing value.What stands for Missing not at randomMCARMARMNAR(正确答案)None of the above36. While plotting data, some values of one variable may not lie beyond the expectedrange, but when you plot the data with some other variable, these values may lie far from the expected value.Identify the type of outliers?Univariate outliersMultivariate outliers(正确答案)ManyVariate outlinersNone of the above37. if numeric values are stored as strings, then it would not be possible to calculatemetrics such as mean, median, etc.Then what type of data cleaning exercises you will perform ?Convert incorrect data types:(正确答案)Correct the values that lie beyond the rangeCorrect the values not belonging in the listFix incorrect structure:38. Rows that are not required in the analysis. E.g ifobservations before or after a particular date only are required for analysis.What steps we will do when perform data filering ?Deduplicate Data/Remove duplicateddataFilter rows tokeep only therelevant data.(正确答案)Filter columns Pick columnsrelevant toanalysisBring the datatogether, Groupby required keys,aggregate therest39. you need to…………... the data in order to get what you need for your analysis. searchlengthorderfilter(正确答案)40. Write the output of the following ?>>> import pandas as pd >>> series1 =pd.Series([10,20,30])>>> print(series1)0 101 202 30dtype: int64(正确答案)102030dtype: int640 1 2 dtype: int64None of the above41. What will be output for the following code?import numpy as np a = np.array([1, 2, 3], dtype = complex) print a[[ 1.+0.j, 2.+0.j, 3.+0.j]][ 1.+0.j]Error[ 1.+0.j, 2.+0.j, 3.+0.j](正确答案)42. What will be output for the following code?import numpy as np a =np.array([1,2,3]) print a[[1, 2, 3]][1][1, 2, 3](正确答案)Error43. What will be output for the following code?import numpy as np dt = dt =np.dtype('i4') print dtint32(正确答案)int64int128int1644. What will be output for the following code?import numpy as np dt =np.dtype([('age',np.int8)]) a = np.array([(10,),(20,),(30,)], dtype = dt)print a['age'][[10 20 30]][10 20 30](正确答案)[10]Error45. We can add a new row to a DataFrame using the _____________ methodrloc[ ]iloc[ ]loc[ ](正确答案)None of the above46. Function _____ can be used to drop missing values.fillna()isnull()dropna()(正确答案)delna()47. The function to perform pivoting with dataframes having duplicate values is _____ ? pivot(unique = True)pivot()pivot_table(unique = True)pivot_table()(正确答案)48. A technique, which when performed on a dataframe, rearranges the data from rows and columns in a report form, is called _____ ?summarisingreportinggroupingpivoting(正确答案)49. Normal Distribution is symmetric is about ___________ ?VarianceMean(正确答案)Standard deviationCovariance50. Write a statement to display “Amount” as x-axis label. (consider plt as an alias name of matplotlib.pyplot)bel(“Amount”)plt.xlabel(“Amount”)(正确答案)plt.xlabel(Amount)None of the above51. Fill in the blank in the given code, if we want to plot a line chart for values of list ‘a’ vs values of list ‘b’.a = [1, 2, 3, 4, 5]b = [10, 20, 30, 40, 50]import matplotlib.pyplot as pltplt.plot __________(a, b)(正确答案)(b, a)[a, b]None of the above52. #Loading the datasetimport seaborn as snstips =sns.load_dataset("tips")tips.head()In this code what is tips ?plotdataset name(正确答案)paletteNone of the above53. Visualization can make sense of information by helping to find relationships in the data and support (or disproving) ideas about the dataAnalyzeRelationShip(正确答案)AccessiblePrecise54. In which option provides A detailed data analysis tool that has an easy-to-use tool interface and graphical designoptions for visuals.Jupyter NotebookSisenseTableau DesktopMATLAB(正确答案)55. Consider a bank having thousands of ATMs across China. In every transaction, Many variables are recorded.Which among the following are not fact variables.Transaction charge amountWithdrawal amountAccount balance after withdrawalATM ID(正确答案)56. Which module of matplotlib library is required for plotting of graph?plotmatplotpyplot(正确答案)None of the above57. Write a statement to display “Amount” as x-axis label. (consider plt as an alias name of matplotlib.pyplot)bel(“Amount”)plt.xlabel(“Amount”)(正确答案)plt.xlabel(Amount)None of the above58. What will happen when you pass ‘h’ as as a value to orient parameter of the barplot function?It will make the orientation vertical.It will make the orientation horizontal.(正确答案)It will make line graphNone of the above59. what is the name of the function to display Parameters available are viewed .set_style()axes_style()(正确答案)despine()show_style()60. In stacked barplot, subgroups are displayed as bars on top of each other. How many parameters barplot() functionhave to draw stacked bars?OneTwoNone(正确答案)three61. In Line Chart or Line Plot which parameter is an object determining how to draw the markers for differentlevels of the style variable.?x.yhuemarkers(正确答案)legend62. …………………..similar to Box Plot but with a rotated plot on each side, giving more information about the density estimate on the y axis.Pie ChartLine ChartViolin Chart(正确答案)None63. By default plot() function plots a ________________HistogramBar graphLine chart(正确答案)Pie chart64. ____________ are column-charts, where each column represents a range of values, and the height of a column corresponds to how many values are in that range.Bar graphHistograms(正确答案)Line chartpie chart65. The ________ project builds on top of pandas and matplotlib to provide easy plotting of data.yhatSeaborn(正确答案)VincentPychart66. A palette means a ________.. surface on which a painter arranges and mixed paints. circlerectangularflat(正确答案)all67. The default theme of the plotwill be ________?Darkgrid(正确答案)WhitegridDarkTicks68. Outliers should be treated after investigating data and drawing insights from a dataset.在调查数据并从数据集中得出见解后,应对异常值进行处理。

《一类边界条件依赖特征参数多项式的不连续Sturm-Liouville问题特征值的渐近估计》范文

《一类边界条件依赖特征参数多项式的不连续Sturm-Liouville问题特征值的渐近估计》范文

《一类边界条件依赖特征参数多项式的不连续Sturm-Liouville问题特征值的渐近估计》篇一一、引言Sturm-Liouville问题是一类重要的数学物理问题,广泛存在于量子力学、工程力学、控制论等多个领域。

其特征值和特征函数的求解对于理解这些领域的物理现象具有重要意义。

当Sturm-Liouville问题的边界条件依赖于特征参数时,问题的复杂性增加,尤其是当问题中涉及多项式且存在不连续性时,特征值的渐近估计成为研究的关键。

本文将探讨一类具有这种特性的Sturm-Liouville问题的特征值渐近估计方法。

二、问题描述考虑一类具有不连续边界条件的Sturm-Liouville问题,其特征参数以多项式的形式出现在边界条件中。

该问题的微分方程部分在连续区域内具有标准的Sturm-Liouville形式,但在不连续点处,由于边界条件的变化,导致问题的求解变得复杂。

我们的目标是找到这类问题的特征值的渐近估计。

三、特征值的渐近估计方法为了求解此类问题,我们首先需要对方程进行适当的变换,以便利用现有的渐近估计方法。

具体来说,我们将采用匹配渐近法,通过在不同区域建立适当的近似解,并在交界处进行匹配,从而得到特征值的渐近估计。

1. 区域划分与近似解的建立根据问题的特点,我们将整个定义域划分为若干个区域。

在每个区域内,由于边界条件的变化较小,我们可以采用标准的Sturm-Liouville方法求解微分方程,得到该区域的近似解。

这些近似解将作为后续匹配的基础。

2. 匹配渐近法在交界处,我们需要将不同区域的近似解进行匹配。

这通常涉及到求解一系列非线性方程,以确定匹配条件。

通过求解这些方程,我们可以得到特征值的渐近估计。

四、数值实验与结果分析为了验证我们的方法,我们进行了一系列的数值实验。

我们构造了一个具体的不连续Sturm-Liouville问题,其边界条件依赖于特征参数的多项式。

然后,我们使用我们的方法进行特征值的渐近估计,并将结果与精确解进行比较。

维数约简技术在图像识别中的效果

维数约简技术在图像识别中的效果

维数约简技术在图像识别中的效果一、维数约简技术概述维数约简技术是数据预处理中的一种重要方法,它通过降低数据的维度来减少计算复杂度和提高数据分析的效率。

在图像识别领域,维数约简技术的应用尤为重要,因为图像数据通常具有高维性,这使得直接处理变得非常困难和低效。

维数约简技术的核心目标是在保留图像数据重要特征的同时,去除冗余信息,从而提高图像识别的准确性和速度。

1.1 维数约简技术的核心原理维数约简技术的核心原理是将原始高维数据映射到一个低维空间中,这个映射过程需要尽可能地保留原始数据的结构和特征。

常见的维数约简方法包括主成分分析(PCA)、线性判别分析(LDA)、奇异值分解(SVD)等。

这些方法通过不同的数学手段来实现数据的降维,以达到优化识别效果的目的。

1.2 维数约简技术在图像识别中的应用图像识别是计算机视觉领域的一个关键应用,它涉及到从图像中识别和分类不同的对象。

维数约简技术在图像识别中的应用主要体现在以下几个方面:- 提高计算效率:通过降低数据维度,可以减少模型训练和预测时的计算量。

- 增强特征表达:在降维过程中,可以突出图像中的关键特征,抑制噪声和不相关的变化。

- 避免过拟合:高维数据容易导致模型过拟合,而维数约简有助于提取更加泛化的特征,提高模型的泛化能力。

二、维数约简技术在图像识别中的关键方法在图像识别领域,维数约简技术的应用需要结合图像数据的特点,选择合适的方法来实现最佳的降维效果。

以下是一些在图像识别中常用的维数约简方法:2.1 主成分分析(PCA)主成分分析是一种统计方法,通过正交变换将可能相关的变量转换为一组线性不相关的变量,这些变量称为主成分。

在图像识别中,PCA可以有效地减少图像特征的维度,同时保留大部分的数据方差。

2.2 线性判别分析(LDA)线性判别分析是一种监督学习的降维技术,它不仅考虑了数据的方差,还考虑了数据的类别信息。

LDA的目标是找到一个最佳的线性组合,使得不同类别的数据在该组合下具有最大的分离度。

《2024年度二维弹性快速多极边界元算法及截断误差分析》范文

《2024年度二维弹性快速多极边界元算法及截断误差分析》范文

《二维弹性快速多极边界元算法及截断误差分析》篇一一、引言在计算力学领域,边界元法(Boundary Element Method, BEM)是一种重要的数值技术,尤其适用于解决弹性力学问题。

随着计算机技术的发展,二维弹性快速多极边界元算法成为了研究的热点。

该算法结合了多极展开与边界元法,可大幅提高计算效率并降低存储需求。

本文将详细介绍二维弹性快速多极边界元算法的原理,并对其截断误差进行分析。

二、二维弹性快速多极边界元算法原理二维弹性快速多极边界元算法基于边界元法的基本原理,通过多极展开技术来提高计算效率。

该算法的主要步骤包括:问题定义、离散化处理、多极展开、快速求解和后处理。

1. 问题定义在二维弹性问题中,我们需定义问题的几何形状、材料属性及边界条件。

这些信息将作为后续计算的基础。

2. 离散化处理将问题域离散化为一系列的子域,每个子域的边界上设置节点。

这些节点将用于后续的边界元法计算。

3. 多极展开多极展开是该算法的核心部分。

通过将场源项的多极展开,可以将计算复杂度从O(N^2)降低到O(NlogN),其中N为节点的数量。

这一步显著提高了计算效率。

4. 快速求解利用快速求解技术,如迭代法或直接法,对离散化后的系统进行求解。

通过多极展开和快速求解的结合,可大幅提高计算速度。

5. 后处理对求解结果进行后处理,如提取应力、位移等数据,并进行可视化展示。

三、截断误差分析在二维弹性快速多极边界元算法中,截断误差主要来源于多极展开的截断和数值求解的误差。

本部分将对这两部分误差进行分析。

1. 多极展开的截断误差多极展开的截断会导致一定的计算误差。

这种误差主要与展开的阶数、场源项的复杂性及问题的规模有关。

一般来说,阶数越高,截断误差越小,但计算复杂度也越高。

因此,在实际应用中,需要根据问题的规模和精度要求来选择合适的阶数。

2. 数值求解的误差数值求解的误差主要来源于迭代法或直接法的求解过程。

这种误差可以通过选择合适的求解器、优化算法参数等方式来减小。

维度缺失补全方法

维度缺失补全方法

维度缺失补全方法Data completeness is crucial in any data analysis or machine learning task. However, missing values are a common issue that needs to be addressed before conducting any meaningful analysis. Incomplete data can lead to biased or inaccurate results, affecting the overall interpretation of the data. Therefore, it is essential to have reliable methods to handle missing values effectively. 数据完整性在任何数据分析或机器学习任务中都至关重要。

然而,缺失值是一个常见问题,需要在进行任何有意义的分析之前解决。

不完整的数据可能导致结果有偏差或不准确,影响数据的整体解释。

因此,有可靠的方法来有效处理缺失值是至关重要的。

One common method for dealing with missing values is to simply remove observations with missing data. While this approach is straightforward, it may not always be the best solution, as it can lead to a loss of valuable information. Removing observations with missing data can reduce the sample size and potentially bias the results. Therefore, it is important to consider other methods for handling missing values that do not result in the loss of significant data. 处理缺失值的一种常见方法是简单地删除带有缺失数据的观测。

变分自编码器损失函数 -回复

变分自编码器损失函数 -回复

变分自编码器损失函数-回复什么是变分自编码器损失函数?变分自编码器(Variational Autoencoder,简称VAE)是一种生成模型,其中变分自编码器损失函数是用于训练VAE的一种关键性方法。

在理解变分自编码器损失函数之前,我们需要先了解一些背景知识。

自编码器(Autoencoder)是一种无监督学习算法,它用于将输入数据压缩为低维特征空间,并通过重构目标来学习数据的隐藏表示。

自编码器由编码器和解码器两个部分组成。

编码器将输入数据映射到低维潜在空间,解码器则将潜在表示映射回原始数据空间。

然而,传统的自编码器存在一些问题。

首先,它们的潜在表示是固定的,即使在数据分布中存在明显的变化,它们也不能自适应地生成对应的样本。

其次,自编码器难以学习数据的分布结构,因为它们没有明确的概率模型。

这些问题限制了自编码器的生成能力和可解释性。

为了解决这些问题,研究者提出了变分自编码器。

变分自编码器引入了概率建模的思想,将自编码器扩展为生成模型。

它通过从潜在空间中采样,并通过解码器生成样本。

这样,变分自编码器不仅可以压缩输入数据,还可以生成新的样本。

在变分自编码器中,损失函数被设计为最大化数据的对数似然。

变分自编码器损失函数可以写为如下形式:\[\mathcal{L} = -\mathbb{E}_{q(z x)}[\log p(x z)] + D_{\text{KL}}[q(z x) p(z)]\]其中,第一项是重构误差,衡量解码器重构样本与原始样本之间的差异。

第二项是正则化项,通过计算编码器生成的潜在表示与先验分布之间的差异来惩罚模型。

这两个项共同组成了变分自编码器损失函数。

让我们逐步解释变分自编码器损失函数的每一部分。

首先,我们来看重构误差,即第一项\(\mathbb{E}_{q(z x)}[\log p(x z)]\)。

其中,\(q(z x)\) 是给定输入数据\(x\) 后,在潜在空间中采样\(z\) 所得到的编码器的输出。

变分自编码器损失函数 -回复

变分自编码器损失函数 -回复

变分自编码器损失函数-回复什么是变分自编码器(VAE)损失函数?变分自编码器(Variational Autoencoder, VAE)是一种生成模型,通过学习数据的潜在分布来实现数据的生成和重构。

VAE使用了一种特殊的损失函数,被称为“变分自编码器损失函数”。

这种损失函数结合了自编码器(Autoencoder)和变分推断(Variational Inference)的思想,使得VAE能够在数据生成和推断中达到更好的性能。

要理解VAE的损失函数,首先需要了解自编码器和变分推断的基本概念。

自编码器是一种无监督学习算法,其任务是将输入数据重新构建出来,同时学习数据的有效表示。

自编码器包括一个编码器和一个解码器,编码器将输入数据映射到低维潜在空间中,解码器将潜在空间的表示映射回原始数据空间。

变分推断是一种贝叶斯推断方法,用于从给定的观测数据中推断潜在变量的分布。

它通过引入一个潜变量的分布来近似真实的后验分布,从而解决了直接计算后验分布困难的问题。

VAE结合了自编码器和变分推断的思想,使用编码器将输入数据映射到潜在空间中的分布,并通过采样来生成潜在变量,解码器将潜在变量映射回原始数据空间。

为了保证学习到的分布能够更好地逼近真实的潜在分布,VAE引入了KL散度(Kullback-Leibler Divergence)作为正则化项。

VAE的损失函数可以表示为重构误差和KL散度的加权和。

具体来说,VAE 的损失函数可以分为两个部分,分别是重构损失和正则化项。

重构损失是衡量VAE重构能力的指标,它表示重构数据与原始数据之间的差异。

重构损失一般使用均方误差(Mean Squared Error, MSE)或交叉熵(Cross Entropy)来计算,其中交叉熵在处理离散数据时更为常用。

正则化项则是通过KL散度来约束潜在变量的分布接近单位高斯分布。

KL 散度用于衡量两个分布之间的差异,VAE使用KL散度作为正则化项的原因是为了使得学习到的分布更加平滑和连续。

变分自编码器损失函数 -回复

变分自编码器损失函数 -回复

变分自编码器损失函数-回复什么是变分自编码器(VAE)损失函数?变分自编码器(Variational Autoencoder,简称VAE)是一种生成模型,可以通过学习输入数据的概率分布来生成新的样本。

VAE的损失函数是用来衡量网络生成样本与原始数据之间的差异并通过训练来减小这种差异。

在VAE中,损失函数包括重建误差和KL散度两部分。

重建误差是指生成样本与原始数据的差距,通常使用均方误差(MSE)或交叉熵损失来衡量。

而KL散度则是衡量生成样本与潜在空间(latent space)中的分布之间的差异。

下面将详细介绍VAE的损失函数及其每个部分:1. 重建误差:VAE的目标是学习生成数据的概率分布,并通过采样从该分布中生成新的样本。

重建误差是衡量生成样本与原始数据之间的相似性的度量。

对于二进制数据,常用交叉熵损失;对于连续数据,常用均方误差(MSE)。

对于二进制数据:设原始数据为x,生成样本为x_hat,重建误差损失函数为:L_recon = -∑(x * log(x_hat) + (1 - x) * log(1 - x_hat))对于连续数据:L_recon = (x - x_hat) 2重建误差越小,生成样本越接近原始数据。

2. KL散度:KL散度是一种度量两个概率分布之间差异的指标。

在VAE中,KL散度用于衡量生成样本分布与潜在空间中先验分布之间的差异。

先验分布通常假设为高斯分布,均值为0,方差为1。

KL散度越小,表示生成样本分布越接近先验分布。

设生成样本的分布为Q(Z X),潜在空间的先验分布为P(Z),则KL散度损失函数为:L_KL = KL(Q(Z X), P(Z))KL散度通常可以通过解析计算,对于高斯分布来说具体的计算公式为:KL(Q(Z X), P(Z)) = -0.5 * ∑(1 + log(var) - mean^2 - var)其中mean和var分别表示生成样本分布的均值和方差。

  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。

Visualizing Data with Bounded UncertaintyChris Olston∗Stanford University olston@Jock D.Mackinlay Palo Alto Research Center mackinlay@AbstractVisualization is a powerful way to facilitate data analy-sis,but it is crucial that visualization systems explicitlyconvey the presence,nature,and degree of uncertaintyto users.Otherwise,there is a danger that data willbe falsely interpreted,potentially leading to inaccurateconclusions.A common method for denoting uncer-tainty is to use error bars or similar techniques designedto convey the degree of statistical uncertainty.Whileuncertainty can often be modeled statistically,a secondform of uncertainty,bounded uncertainty,can also arisethat has very different properties than statistical uncer-tainty.Error bars should not be used for bounded uncer-tainty because they do not convey the correct properties,so a different technique should be used instead.In thispaper we describe a technique for conveying boundeduncertainty in visualizations and show how it can beapplied systematically to common displays of abstractcharts and graphs.Interestingly,it is not always possi-ble to show the exact degree of uncertainty,and in somecases it can only be displayed approximately.Keywords:uncertainty visualization,bounded uncer-tainty1IntroductionIn most data-intensive applications,uncertainty is a fact of life. For example,in scientific applications,error-prone measurements or incomplete sampling often result in uncertain data.Another ex-ample isfinancial analysis,where it is common for some data to represent uncertain projections about future behavior.Even when it is possible to gather precise data,there are many real-time appli-cations,such as network monitoring,mobile object tracking,and wireless ecosystem monitoring,in which uncertainty may be in-troduced intentionally to conserve system resources while data is being transmitted or processed.When data is uncertain,it is crit-ically important that analysis tools,including information visual-ization tools,make users aware of the presence,nature,and de-gree of uncertainty in the data as these factors can greatly impact decision-making.If users are misinformed about the uncertainty2Related WorkIn certain visualization scenarios,data may be unavailable for display or even purposefully omitted for a variety of possible reasons,giving rise to uncertainty.The importance of visually informing the user of the absence of data has been identified [WO98]and techniques for doing so have been proposed in,e.g., Clouds[HAC+99]and Restorer[TCS94].We focus on a different type of uncertainty where all the data is present but precise values are not known.Numerous ways to convey the degree of uncertainty in data using overlayed annotations and glyphs have been proposed,as in,e.g.,[PWL97].Another approach is to make the positions of grid lines used for positional reference ambiguous[CR00].Un-certainty can also be indicated by adjusting the color,hue,trans-parency,etc.of graphical features as in,e.g.,[DK97,Mac92, vdWvdGG98].Some techniques for conveying uncertainty by widening the boundaries of graphical elements have also been proposed.For example,in[WSF+96],the degree of uncertainty in the angle of rotation of vectors is encoded in the width of the vector arrows.Also,[PWL97]proposes varying the thickness of three-dimensional surfaces to indicate the degree of uncertainty.To our knowledge,however,none have focused on accurately and unambiguously conveying not only the presence and degree but also the form of uncertainty in data,as we do.We also be-lieve that our work is thefirst to establish systematic methods for conveying bounded uncertainty by widening the boundaries and positions of graphical elements in abstract charts and graphs.The approach in[FWR99]for displaying cluster densities gives a vi-sual appearance similar to our ambiguated line charts(discussed later)but serves a different purpose.3Forms of UncertaintyIn this paper we consider two commonplace forms of uncertainty, as described in[TK94],[PWL97],and elsewhere.Consider a nu-meric data object O whose exact value V is not known with cer-tainty.There are two predominant ways in which partial knowl-edge about the possible values of V can be represented:statis-tical uncertainty and bounded uncertainty.Under statistical un-certainty,the uncertain value of a data object can be represented in a number of ways,depending on the statistical model.In one common case,when errors follow a normal distribution,the un-certain value of a data object can be represented by a three-tuple E,D,P of real numbers,where D≥0and P∈(0,1].Here, E is an estimate that represents the most likely candidate for the unknown value V,and P is the probability that V lies in the con-fidence interval[E−D,E+D].Typically,P isfixed at,say, P=0.95,and D is chosen so that the value V lies inside the confidence interval[E−D,E+D]with probability P.Under bounded uncertainty,there is some numeric interval[L,H]that is guaranteed to contain the exact value V,i.e.,L≤V≤H.Under bounded uncertainty,the probability that V is outside the interval is zero,but,unlike with statistical uncertainty,no assumptions can be made about the probability distribution of possible values in-side the interval.Both forms of uncertainty commonly occur in scientific and other applications[TK94].For example,bounded uncertainty can occur when measurements are taken using a device hav-ing an unknown degree of imprecision that lies within known bounds.Statistical uncertainty can occur,for example,when sin-gle or repeated measurements are taken in conditions exhibiting experimental variability,often resulting in an unbounded proba-bility distribution over possible values featuring a central peak. Both bounded and statistical uncertainty can also occur in emerg-ing data delivery paradigms that intentionally introduce uncer-tainty for performance reasons,e.g.,[HAC+99,OW02].In these paradigms there is often the opportunity to adjust the uncertainty levels interactively,unlike with traditional sources of uncertainty. In the extended version of this paper[OM02],we discuss interac-tive data delivery techniques that exhibit these properties.4Representing Uncertain Data Visually Having described the two common forms of uncertainty and some ways they can occur,we are now ready to discuss ways to repre-sent uncertain data visually.In most abstract charts and graphs, data values are graphically encoded either in the positions of graphical elements,as in a scatterplot,or in the extent(size)of elements along one or more dimensions,as in a bar chart.When the underlying data is uncertain,we believe it is appropriate to clearly indicate not only the presence and degree but also the form of uncertainty.As described in Section3,statistical and bounded uncertainty encode two dramatically different distributions of po-tential values.Due to this key difference,using the same display technique to represent both forms of uncertainty could mislead the user.Instead,we advocate two alternative methods for conveying uncertainty in the positions or extents of graphical representations of data:error bars for statistical uncertainty and ambiguation for bounded uncertainty.We begin by describing these general tech-niques and then show how they can be applied to some common types of charts and graphs.4.1Error BarsError bars and their variants have been well studied as a suitable means to convey statistical uncertainty[Cle85,Tuf01,Tuk77]. For each uncertain data value to be represented visually,the idea is to use the normal display technique to render the estimate E in place of the unknown exact value V.Error bars are then added to indicate uncertainty in the position or boundary location in pro-portion to the size of the confidence interval[E−D,E+D]. Some standard uses of error bars are illustrated in the upper left quadrant of Figure1.When uncertainty occurs in bounded rather than statistical form,it is important to avoid the use of error bars since the accepted interpretation implies a potentially unbounded distribution extending beyond the error bars.Even worse,render-ing an exact estimate using the normal display technique strongly implies the existence of a most likely value E,but in bounded uncertainty no most likely value can be assumed.4.2AmbiguationTo convey the presence and degree of bounded uncertainty,we propose the use of a technique we call ambiguation.The main idea behind ambiguation when uncertain data is encoded in the extent of a graphical element is to widen the boundary to suggest a range of possible boundary locations and therefore a range of pos-sible extents.The ambiguous region between possible boundaries can be drawn as graphical fuzz,giving an effect that resembleserror bars ambiguation100%absoluteFigure 1:Error bars and ambiguation applied to some common chart types.ink smearing.A straightforward application of this technique is illustrated in the ambiguated bar chart in the upper right quadrant of Figure 1.To indicate positional uncertainty,rather than draw-ing a crisp representation of the graphical element at a particular position,the representation is elongated in one or more directions and drawn using fuzz.A simple application of this technique is il-lustrated in the ambiguated scatterplot in the upper right quadrant of Figure 1.Other variations of boundary or position ambiguation may be possible,but the necessary feature is that no particular estimate or most likely value should be indicated.Rather,the entire range of possible values for the boundary or position of the graphical element should be presented with equal weight.This key charac-teristic is in contrast with error bars and other approaches such as fuzzygrams and gradient range symbols [Har99]that emphasize a known probability distribution over data values.4.3DiscussionThe complementary use of error bars and ambiguation makes the presence,degree,and form of uncertainty clear.First,these tech-niques make it easy to identify the specific data values that are uncertain by suggesting imprecision in the graphical property (po-sition or boundary location)in which the values are encoded.For bounded uncertainty,the position or boundary is made ambigu-ous using fuzzy ink,and for statistical uncertainty,error bars are added to visually suggest the possibility of a shift in position or boundary location.Second,these techniques allow the degree of uncertainty to be read in a straightforward manner using the same scale used to interpret the data itself.Finally,the use of two visu-ally distinct techniques makes it clear which of the two forms of uncertainty is present,and each technique conveys the properties of the form of uncertainty it represents.Ambiguation and error bars work well when data is encoded as the position or extent of graphical elements.Coping with dis-plays that use other graphical attributes such as color and texture to encode data is left as a topic for future work.In the absence of analogous techniques for other graphical attributes,when uncer-tainty is present it is desirable to only use charts and graphs that encode data using position and extent alone so the presence,de-gree,and form of uncertainty can be clearly and unambiguously depicted.4.4Application to Common Chart TypesFigure 1illustrates how error bars and ambiguation can be applied to some common chart types (exhaustive illustration on all known chart types is omitted for brevity).While these techniques are general and can be applied to a broad range of displays that use position and extent to encode data,we focus on abstract charts and graphs,which can be classified into two categories:absolute dis-plays and 100%displays .In absolute displays,each data value is given a graphical representation whose extent or position is plot-ted on an absolute scale.Examples of absolute displays include simple bar charts (which encode data in the upper boundaries of bars),scatterplots (which encode data in the positions of points),and line graphs (which encode data in the positions of points and lines).It is generally straightforward to add error bars or apply ambiguation to boundaries and positions in absolute displays suchas those displayed in the top half of Figure1.1In100%displays,the scale ranges from0%to100%,and n values V1,V2,...,V n are plotted on this relative scale.Each value V i is plotted as a graphical element whose size is propor-tional to the fraction V i1Displays such as stacked bar charts that are not normalized to sum to100%are problematic when used in conjunction with these uncertainty indicators because the interpretation can be ambiguous.For example,in an absolute stacked bar chart,an error bar or a fuzzy region appearing at the top of the stack can be interpreted either as uncertainty in the topmost element or as uncertainty in the overall height of the stack(the sum over all elements).[HAC+99]J.M.Hellerstein,R.Avnur,A.Chou,C.Hidber,C.Olston,V.Raman,T.Roth,and P.Haas.In-teractive data analysis with CONTROL.IEEEComputer,August1999.[Har99]rmation Graphics.Oxford Uni-versity Press,1999.[Mac92] A.M.MacEachren.Visualizing uncertain infor-mation.Cartographic Perspective,(13):10–19,1992.[OM02] C.Olston and J.Mackinlay.Visualizingdata with bounded uncertainty.Technical re-port,Stanford University Computer ScienceDepartment,2002./pub/2002-20.[OW02] C.Olston and J.Widom.Approximate cachingfor continuous queries over distributed datasources.Technical report,Stanford Uni-versity Computer Science Department,2002./pub/2002-8.[PWL97] A.T.Pang,C.M.Wittenbrink,and S.K.Lodha.Approaches to uncertainty visualization.The Vi-sual Computer,pages370–390,November1997. [TCS94]R.Twiddy,J.Cavallo,and S.M.Shiri.Restorer:A visualization technique for handling missingdata.In Proceedings of the IEEE VisualizationConference,pages212–216,Washington,D.C.,October1994.[TK94] B.N.Taylor and C.E.Kuyatt.Guidelinesfor evaluating and expressing the uncertaintyof nist measurement results.Technical report,National Institute of Standards and TechnologyNote1297,Gaithersburg,Maryland,1994. [Tuf01] E.R.Tufte.The Visual Display of QuantitativeInformation,Second Edition.Graphics Press,2001.[Tuk77]J.W.Tukey.Exploratory Data Analysis.AddisonWesley,1977.[vdWvdGG98] F.J.M.van der Wel,L.C.van der Gaag,andB.G.H.Gorte.Visual exploration of uncertaintyin remote-sensing classifiputers andGeosciences,24(4):335–343,1998.[WO98] A.Woodruff and C.Olston.Iconificationand omission in information exploration.In SIGCHI’98Workshop on Innovationand Evaluation in Information ExplorationInterfaces,Los Angeles,CA,April1998./ConferencesWorkshops/CHI98IE/submissions/woodruff.htm.[WSF+96] C.M.Wittenbrink, E.Saxon,J.J.Furman,A.T.Pang,and S.K.Lodha.Glyphs forvisualizing uncertainty in environmental vectorfields.IEEE Transactions on Visualization andComputer Graphics,pages266–279,September1996.。

相关文档
最新文档