2017-03-07
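Bounty: 15 forum coins (solved)
Dear forum members,

I have two datasets saved in CSV format. They have the same variable names, the same number of observations, and the same value for every observation, so the two datasets should be identical.

However, after importing each of them into Stata and comparing the two datasets with the cf command, Stata reports that some of the variables differ. Screenshot below:

Yet the values of these supposedly "mismatched" observations still look numerically identical.

Why does this happen?
The data are clearly identical, so why does Stata treat them as different?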



[Attachment not shown]

Best answer: by Newkoarla (full text in the first reply below)
All replies
2017-3-7 21:50:56
Basically, I would first ask whether you sorted both datasets. The best practice is to sort the data before running the comparison.
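To illustrate why sorting matters: as far as I know, cf compares the two datasets observation by observation in their current order, so identical rows stored in a different order get reported as mismatches. A minimal sketch, where the file names order_a and order_b and all values are made up for illustration:

```stata
* Hypothetical example data: the same three rows in a different order.
clear
input id x
1 10
2 20
3 30
end
save order_a, replace

clear
input id x
3 30
1 10
2 20
end
save order_b, replace

* Compare without sorting: row 1 is checked against row 1, and so on.
use order_a, clear
capture noisily cf _all using order_b, verbose
local rc_unsorted = _rc      // nonzero: mismatches are reported

* Sort the second file by id, save it, and compare again.
use order_b, clear
sort id
save order_b, replace
use order_a, clear
capture noisily cf _all using order_b, verbose
local rc_sorted = _rc        // 0: no mismatches remain

display "unsorted rc = `rc_unsorted', sorted rc = `rc_sorted'"
```

If the mismatches disappear after sorting both files on a unique key, row order was the culprit; if they persist, something else (such as storage precision) is going on.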

You can refer to the article I have copied below:
How do I check that the same data input by two people are consistently entered? | Stata FAQ
When two people enter the same data (double data entry), a concern is whether discrepancies exist between the two datasets (the rationale of double data entry), and if so, where. We start by reading in the two datasets, one entered by person1 and the second by person2. After we read in the data, we sort the datasets by the id variable id and then save the data.
clear
input id str8 name  age ht wt income
11 john    23 68 145 23000
12 charlie 25 72 178 45000
13 sally   21 64 135 12000
4  mike    34 70 156  5600
43 paul    30 73 189 15600
end

sort id
save person1, replace

clear
input id str8 name age ht wt income
11 john    23.5 68 145 23000
12 charles   25 52 178 45000
13 sally     21 64  .  12000
4  michael   34 70 156  5600
43 Paul      30 73 189  5600
end

sort id
save person2, replace
We compare the two datasets with the cf command to see if any discrepancies exist between the two datasets.
use person1, clear
cf _all using person2, verbose

              id:  match
            name:  3 mismatches
             age:  1 mismatches
              ht:  1 mismatches
              wt:  1 mismatches
          income:  1 mismatches
r(9);
The cf command revealed that differences do exist; however, it did not specify for which observations the mismatches occurred, which is our main objective. To find out where the errors occurred, we start by creating a large dataset that combines the two. In the large dataset, however, we must distinguish the data input by person1 from the data input by person2. We choose to rename all variables from person1, except for the id variable (which is needed for matching), by adding the suffix "_person1" via the rename command. We use the foreach command to make the renaming process more efficient. Once the variables are renamed, person2 is merged with person1 by the id variable, id, and then the merged dataset is listed.
use person1, clear

foreach var of varlist name-income{
  rename `var' `var'_person1
}

merge id using person2
list

     +---------------------------------------------------------------------------------------------------------+
     | id   name_p~1   age_pe~1   ht_per~1   wt_per~1   income~1      name    age   ht    wt   income   _merge |
     |---------------------------------------------------------------------------------------------------------|
  1. |  4       mike         34         70        156       5600   michael     34   70   156     5600        3 |
  2. | 11       john         23         68        145      23000      john   23.5   68   145    23000        3 |
  3. | 12    charlie         25         72        178      45000   charles     25   52   178    45000        3 |
  4. | 13      sally         21         64        135      12000     sally     21   64     .    12000        3 |
  5. | 43       paul         30         73        189      15600      Paul     30   73   189     5600        3 |
     +---------------------------------------------------------------------------------------------------------+
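A side note on the merge step: `merge id using person2` is the older merge syntax. In Stata 11 and later the same match is normally written as merge 1:1, which does not need the files to be sorted beforehand and stops with an error if id is not unique in either file. A small sketch on cut-down copies of the two files (m1 and m2 are made-up file names, and only the age variable is kept):

```stata
* Hypothetical cut-down version of person1: id and age only.
clear
input id age
4 34
11 23
end
rename age age_person1
save m1, replace

* Hypothetical cut-down version of person2.
clear
input id age
4 34
11 23.5
end
save m2, replace

* One-to-one merge on id using the newer syntax.
use m1, clear
merge 1:1 id using m2

* Every id should be present in both files (_merge == 3).
assert _merge == 3

* List the ids where the two entries disagree.
list id age age_person1 if age != age_person1
quietly count if age != age_person1
local n_diff = r(N)
```

With the newer syntax _merge carries value labels, so a listing shows "matched (3)" rather than a bare 3.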
In exploring the discrepancies, we can display them either by variable or by observation. We start by variable. The foreach command loops over the variables from person2 (name-income, which carry no suffix). The if clause, `var' != `var'_person1, restricts the listing to observations where the value entered by person2 (`var') differs from the value entered by person1 (`var'_person1). For those observations, we list id, the value entered by person2 (`var'), and the value entered by person1 (`var'_person1).
Note that when we list the variables, the variables with no suffix correspond to the entries made by person2.
*Discrepancies listed by variables.

foreach var of varlist name-income{
  list id `var' `var'_person1 if `var' != `var'_person1, abbreviate(15)
}
     +-----------------------------+
     | id      name   name_person1 |
     |-----------------------------|
  1. |  4   michael           mike |
  3. | 12   charles        charlie |
  5. | 43      Paul           paul |
     +-----------------------------+

     +-------------------------+
     | id    age   age_person1 |
     |-------------------------|
  2. | 11   23.5            23 |
     +-------------------------+

     +----------------------+
     | id   ht   ht_person1 |
     |----------------------|
  3. | 12   52           72 |
     +----------------------+

     +----------------------+
     | id   wt   wt_person1 |
     |----------------------|
  4. | 13    .          135 |
     +----------------------+

     +------------------------------+
     | id   income   income_person1 |
     |------------------------------|
  5. | 43     5600            15600 |
     +------------------------------+
When we list discrepancies by observation, we need to modify the prior program to evaluate the variables on a case-by-case basis; i.e., for observation 1, we evaluate the entries across all variables given in the foreach. Once observation 1 is checked and its discrepancies listed, we move to observation 2. This process is repeated until the last observation is completed. First, we find how many observations are in the data with the count command and then insert that value in the forvalues loop. The forvalues loop allows us to evaluate discrepancies on a case-by-case basis. We add _n == `i' to the if clause in the list command so that the variables in the foreach command are evaluated for a given observation before moving to the next observation.
*Discrepancies listed by id variable.

count
    5

forvalues i = 1/5 {
   foreach var of varlist name-income{
   list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
   }
}

     +-----------------------------+
     | id      name   name_person1 |
     |-----------------------------|
  1. |  4   michael           mike |
     +-----------------------------+

     +-------------------------+
     | id    age   age_person1 |
     |-------------------------|
  2. | 11   23.5            23 |
     +-------------------------+

     +-----------------------------+
     | id      name   name_person1 |
     |-----------------------------|
  3. | 12   charles        charlie |
     +-----------------------------+

     +----------------------+
     | id   ht   ht_person1 |
     |----------------------|
  3. | 12   52           72 |
     +----------------------+

     +----------------------+
     | id   wt   wt_person1 |
     |----------------------|
  4. | 13    .          135 |
     +----------------------+

     +--------------------------+
     | id   name   name_person1 |
     |--------------------------|
  5. | 43   Paul           paul |
     +--------------------------+

     +------------------------------+
     | id   income   income_person1 |
     |------------------------------|
  5. | 43     5600            15600 |
     +------------------------------+
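The loop above hard-codes the observation count (1/5). A small variation, sketched here on a made-up three-row merged dataset with only age and ht, picks the count up from r(N) so the loop does not need editing when the data change:

```stata
* Hypothetical merged data: person2 entries (no suffix) next to person1 entries.
clear
input id age age_person1 ht ht_person1
4  34   34 70 70
11 23.5 23 68 68
12 25   25 52 72
end

* Store the number of observations instead of typing it in by hand.
quietly count
local n = r(N)

* Same by-observation listing as above, driven by the stored count.
forvalues i = 1/`n' {
    foreach var of varlist age ht {
        list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
    }
}
```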


If you can post the datasets and your code, I might be able to help you further.

Good luck!

2017-3-8 14:44:49
You would have to ask a computing professional about this.

2017-3-8 18:58:51
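It is probably related to the computer's numerical precision. For the details, you should look at a book on numerical computation.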
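One way the precision explanation could play out in Stata, sketched under the assumption that the two imports ended up with different storage types (the original poster's files are not available, so this is only a guess at the cause). The file names prec_float and prec_double are made up. The same value 0.1 stored as float and as double looks identical in a listing but is not equal when compared:

```stata
* Hypothetical files: the value 0.1 stored once as float, once as double.
clear
set obs 1
generate float x = 0.1
save prec_float, replace

clear
set obs 1
generate double x = 0.1
save prec_double, replace

* Both copies list as .1 under the default display formats ...
use prec_float, clear
list x
* ... but the float copy is not exactly 0.1 once more digits are shown.
display %20.18f x[1]

capture noisily cf _all using prec_double, verbose
local rc_raw = _rc           // nonzero: x is reported as a mismatch

* Round the double copy to float precision, then compare again.
use prec_double, clear
replace x = float(x)
save prec_double, replace
use prec_float, clear
capture noisily cf _all using prec_double, verbose
local rc_rounded = _rc       // 0: the values now agree

display "raw rc = `rc_raw', rounded rc = `rc_rounded'"
```

Running describe on both imported datasets and comparing the storage type of each mismatched variable is a quick way to check whether this applies.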

2017-3-10 00:22:07
Here is the syntax of CF
[Attachment not shown]

2017-4-23 17:11:21
Newkoarla wrote on 2017-3-10 00:22:
Here is the syntax of CF
Thank you very much!