全部版块 我的主页
论坛 数据科学与人工智能 数据分析与数据科学 SPSS论坛
4070 5
2014-03-25
毕业论文还有3天就要交了,写到需要对时间序列进行密度函数核估计,求出拟合模型,这个需要怎样做呀???哪位大神能帮忙解答??万分感谢!!!
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

全部回复
2014-3-26 23:28:49
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2014-3-26 23:29:44
For Stata

http://www.stata.com/manuals13/rkdensity.pdf
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2014-3-26 23:31:37



Exploratory Data Analysis: Kernel Density Estimation using R


Introduction


Recently, I began a series on exploratory data analysis; so far, I have written about computing descriptive statistics and creating box plots in R for a univariate data set with missing values.  Today, I will continue this series by analyzing the same data set with kernel density estimation, a useful non-parametric technique for visualizing the underlying distribution of a continuous variable.  I will show how to construct kernel density estimates and plot them in R.  I will also introduce rug plots and show how they can complement kernel density plots.  Recall from my previous post that I am using the ozone data from the “airquality” data set that is built into R and simulating a second ozone data set from a fictitious city called Ozonopolis.  Just like the post on box plots, this is a long post with many details and useful tips in R, so read it carefully.

What is a Kernel?

Before defining kernel density estimation, let’s define a kernel.  (To my surprise and disappointment, many textbooks that talk about kernel density estimation or use kernels do not define this term.)

A kernel is a special type of probability density function (PDF) with the added property that it must be even.  Thus, a kernel is a function with the following properties:
  • Non-negative
  • Real-valued
  • Its definite integral over its support set must equal to 1
  • Some common PDFs are kernels; they include the Uniform(-1,1) and standard normal distributions.  Wikipedia has a list of common kernels.

What is Kernel Density Estimation?

Kernel density estimation is a non-parametric method of estimating the probability density function (PDF) of a continuous random variable.  It is non-parametric because it does not assume any underlying distribution for the variable.  Essentially, at every datum, a kernel function is created with the datum at its centre – this ensures that the kernel is symmetric about the datum.  The PDF is then estimated by adding all of these kernel functions and dividing by the number of data to ensure that it satisfies the 2 properties of a PDF:
  • Every possible value of the PDF is non-negative.
  • The definite integral of the PDF over its support set equals to 1.


Intuitively, a kernel density estimate is a sum of “bumps”.  A “bump” is assigned to every datum, and the size of the “bump” represents the probability assigned at the neighbourhood of values around that datum; thus, if the data set contains

2 data at x = 1.5
1 datum at x = 0.5
then the “bump” at x = 1.5 is twice as big as the “bump” at x = 0.5.

Each “bump” is centred at the datum, and it spreads out symmetrically to cover the datum’s neighbouring values.  Each kernel has a bandwidth, and it determines the width of the “bump” (the width of the neighbourhood of values to which probability is assigned).  A bigger bandwidth results in a shorter and wider “bump” that spreads out farther from the centre and assigns more probability to the neighbouring values.

Constructing a Kernel Density Estimate: Step by Step

1) Choose a kernel; the common ones are normal (Gaussian), uniform (rectangular), and triangular.
2) At each datum, xi, build the scaled kernel function
3) Add all of the individual scaled kernel functions and divide by n; this places a probability of 1/n to each x_i.  It also ensures that the kernel density estimate integrates to 1 over its support set.


The density() function in R computes the values of the kernel density estimate.  Applying the plot() function to an object created by density() will plot the estimate.  Applying the summary() function to the object will reveal useful statistics about the estimate.

Choosing the Bandwidth

It turns out that the choosing the bandwidth is the most difficult step in creating a good kernel density estimate that captures the underlying distribution of the variable (Trosset, Page 166).  Choosing the bandwidth is a complicated topic that is better addressed in a more advanced book or paper, but here are some useful guidelines:

  • A small h results in a small standard deviation, and the kernel places most of the probability on the datum.  Use this when the sample size is large and the data are tightly packed.
  • A large h results in a large standard deviation, and the kernel spreads more of the probability from the datum to its neighbouring values.  Use this when the sample size is small and the data are sparse.
There is a tremendous amount of literature on how best to choose the bandwidth, and I will not go into detail about that in this post.  There are 2 methods that are common in R: nrd0 and SJ, which refers to the method by Sheather & Jones.  These are string arguments for the option ‘bw’ in the density() function.  The default in density() is bw = ‘nrd0′.

Here is some R code to generate the following plots of 2 uniform kernel functions; note the use of the dunif() function.

# Plot 2 Uniform Kernel Functions with Different Bandwidths
# define support set of X
x = seq(-1.5, 1.5, by = 0.01)

# Obtain uniform kernel function values
uniform1 = dunif(x, min = -0.25, max = 0.25)
uniform2 = dunif(x, min = -1.00, max = 1.00)

# Optional printing of PNG image to chosen directory
png('INSERT YOUR DIRECTORY PATH HERE/uniform kernels.png')
plot(x, uniform1, type = 'l', ylab = 'f(x)', xlab = 'x', main = '2 Uniform Kernels with Different Bandwidths', col = 'red')

# Add plot of second kernel function
lines(x, uniform2, col = 'blue')

# Add legend; must specify 'lty' option, because these are line plots
legend(0.28, 1.5, c('Uniform(-0.25, 0.25)', 'Uniform(-1.00, 1.00)'), lty = c(1,1), col = c('red', 'blue'), box.lwd = 0)
dev.off()
uniform kernels

Example: Ozone Pollution Data from New York and Ozonopolis



Recall that I used 2 sets of ozone data in my last post about box plots.  One came from the “airquality” data set that is built into R.  I simulated the other one and named its city of origin “Ozonopolis”.  Here are the code and the plot of the kernel density estimates of the 2 ozone pollution data sets.  I used the default settings in density() – specifically, I used the normal (Gaussian) kernel and the “nrd0” method of choosing the bandwidth.  I encourage you to try the other settings.  I have used the set.seed() function so that you can replicate my random numbers.

# Kernel Density Estimation

# extract "Ozone" data vector for New York
ozone = airquality$Ozone

# calculate the number of non-missing values in "ozone"

n = sum(!is.na(ozone))

# calculate mean, variance and standard deviation of "ozone" by excluding missing values
mean.ozone = mean(ozone, na.rm = T)
var.ozone = var(ozone, na.rm = T)
sd.ozone = sd(ozone, na.rm = T)

# simulate ozone pollution data for ozonopolis
# set seed for you to replicate my random numbers for comparison
set.seed(1)
ozone2 = rgamma(n, shape = mean.ozone^2/var.ozone+3, scale = var.ozone/mean.ozone+3)

# obtain values of the kernel density estimates
density.ozone = density(ozone, na.rm = T)
density.ozone2 = density(ozone2, na.rm = T)

# number of points used in density plot
n.density1 = density.ozone$n
n.density2 = density.ozone2$n

# bandwidth in density plot
bw.density1 = density.ozone$bw
bw.density2 = density.ozone2$bw

png('INSERT YOUR DIRECTORY PATH HERE/kernel density plot ozone.png')
plot(density.ozone2, main = 'Kernel Density Estimates of Ozone \n in New York and Ozonopolis', xlab = 'Ozone (ppb)', ylab = 'Density', ylim = c(0, max(density.ozone$y, na.rm = T)), lty = 1)

# add second density plot
lines(density.ozone, lty = 3)

# add legends to state sample sizes and bandwidths; notice use of paste()
legend(100, 0.015, paste('New York: N = ', n.density1, ', Bandwidth = ', round(bw.density1, 1), sep = ''), bty = 'n')
legend(100, 0.013, paste('Ozonopolis: N = ', n.density2, ', Bandwidth = ', round(bw.density2, 1), sep = ''), bty = 'n')

# add legend to label plots
legend(115, 0.011, c('New York', 'Ozonopolis'), lty = c(3,1), bty = 'n')
dev.off()
kernel density plot ozone
It is clear that Ozonopolis has more ozone pollution than New York.  The right-skewed shapes of both curves also suggest that the normal distribution may not be suitable.  (If you read my blog post carefully, you will already see evidence of a different distribution!)  In a later post, I will use quantile-quantile plots to illustrate this.  Stay tuned!

To give you a better sense of why the density plots have higher “bumps” at certain places, take a look at the following plot of the ozone pollution just in New York.  Below the density plot, you will find a rug plot – a plot of tick marks along the horizontal axis indicating where the data are located.  Clearly, there are more data in the neighbourhood between 0 and 50, where the highest “bump” is located.  Use the rug() function to get the rug plot in R.

# Kernel Density Plot with Rug Plot
png('INSERT YOUR DIRECTORY PATH HERE/kernel density plot with rug plot ozone New York.png')
plot(density.ozone, main = 'Kernel Density Plot and Rug Plot of Ozone \n in New York', xlab = 'Ozone (ppb)', ylab = 'Density')
rug(ozone)
dev.off()

References
  • Trosset, Michael W. An introduction to statistical inference and its applications with R. Chapman and Hall/CRC, 2011.
  • Everitt, Brian S., and Torsten Hothorn. A handbook of statistical analyses using R. Chapman and Hall/CRC, 2006.

二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2014-3-26 23:53:56
I dont think you can do KDE in SPSS except Kernel Density Plots as below. In Weka you can do only univariate KDE. In R you can use the "np" package to do KD.
Kernel Density Plots in R,SAS and SPSS:


二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2014-3-27 00:01:19
for SAS, you can try Proc KDE and Proc timeseries.
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

相关推荐
栏目导航
热门文章
推荐文章

说点什么

分享

扫码加好友,拉您进群
各岗位、行业、专业交流群