Checking whether the observations in two datasets are identical

xmwise

2017-03-07

Bounty: 15 forum coins (resolved)

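A question for everyone on the forum:

I have two datasets saved as CSV files. The variable names are the same, the number of observations is the same, and every observation takes the same values, so the two datasets should be completely identical.

However, after importing each of them into Stata and comparing the two with the cf command, Stata reports that some of the variables differ, as shown in the attached screenshot:

Yet the supposedly mismatching observations look numerically identical.

Why does this happen?
Why does Stata treat data that look exactly the same as different?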

Best answer: Newkoarla

All replies

Newkoarla

2017-3-7 21:50:56

First, did you sort both datasets the same way? cf compares the two datasets observation by observation, so it is best practice to sort both on the same key before running the comparison.
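
To illustrate why row order matters, here is a minimal sketch. The toy data, the key variable id and the tempfile names are all assumptions made for illustration; with real CSV files the data would come from import delimited instead of input.

```stata
* Sketch only: the toy data and the key variable "id" are assumptions.
clear
input id x
3 30
1 10
2 20
end
tempfile unsorted sorted
save `unsorted'

* Same observations, stored in a different row order
sort id
save `sorted'

* cf pairs observation 1 with observation 1, and so on,
* so the unsorted copy is reported as mismatching
use `unsorted', clear
capture noisily cf _all using `sorted'
display "return code before sorting: " _rc

* Once both datasets are sorted on the same key, cf finds no differences
sort id
capture noisily cf _all using `sorted'
display "return code after sorting: " _rc
```

If the return code is still nonzero after sorting both files on the same key, row order is probably not the cause and the values themselves need a closer look.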

You can refer to the article I have copied below:
How do I check that the same data input by two people are consistently entered? | Stata FAQ
When two people enter the same data (double data entry), the concern is whether discrepancies exist between the two datasets (the rationale for double data entry) and, if so, where. We start by reading in the two datasets, one entered by person1 and the other by person2. After reading in each dataset, we sort it by the id variable and save it.
clear
input id str8 name  age ht wt income
11 john 23 68 145 23000
12 charlie 25 72 178 45000
13 sally 21 64 135 12000
4  mike 34 70 156  5600
43 paul 30 73 189 15600
end

sort id
save person1, replace

clear
input id str8 name age ht wt income
11 john 23.5 68 145 23000
12 charles 25 52 178 45000
13 sally    21 64  .  12000
4  michael 34 70 156  5600
43 Paul    30 73 189  5600
end

sort id
save person2, replace
We compare the two datasets with the cf command to see if any discrepancies exist between the two datasets.
use person1, clear
cf _all using person2, verbose

          id:  match
        name:  3 mismatches
         age:  1 mismatches
          ht:  1 mismatches
          wt:  1 mismatches
      income:  1 mismatches
r(9);
The cf command revealed that differences exist, but it did not say in which observations the mismatches occurred, which is our main objective. To find out where the errors are, we combine the two datasets into one larger dataset in which the entries from person1 and person2 can be told apart. We rename every variable from person1 except id (which is needed for matching) by adding the suffix "_person1" with the rename command, using foreach to make the renaming more efficient. Once the variables are renamed, person2 is merged with person1 on id, and the merged dataset is listed.
use person1, clear

foreach var of varlist name-income{
  rename `var' `var'_person1
}

merge id using person2
list

     +----------------------------------------------------------------------------------------------+
     | id  name_p~1  age_pe~1  ht_per~1  wt_per~1  income~1     name   age  ht   wt  income  _merge |
     |----------------------------------------------------------------------------------------------|
  1. |  4      mike        34        70       156      5600  michael    34  70  156    5600       3 |
  2. | 11      john        23        68       145     23000     john  23.5  68  145   23000       3 |
  3. | 12   charlie        25        72       178     45000  charles    25  52  178   45000       3 |
  4. | 13     sally        21        64       135     12000    sally    21  64    .   12000       3 |
  5. | 43      paul        30        73       189     15600     Paul    30  73  189    5600       3 |
     +----------------------------------------------------------------------------------------------+
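The merge id using person2 form is the older syntax. Since Stata 11 the match type is written explicitly, for example merge 1:1. A minimal sketch of the same step in that form follows; the two-row toy data and the tempfile name p2 are hypothetical and only stand in for person1 and person2.

```stata
* Sketch only: toy data; the tempfile name p2 is hypothetical.
clear
input id age
 4 34
11 23.5
end
tempfile p2
save `p2'

clear
input id age
 4 34
11 23
end
rename age age_person1

* 1:1 requires id to uniquely identify observations in both datasets
merge 1:1 id using `p2'

* Every id should appear in both files; stop with an error otherwise
assert _merge == 3
list, noobs
```

Checking _merge right after the merge catches ids that were entered in only one of the two files before any values are compared.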
When exploring the discrepancies, we can display them either by variable or by observation. We start by variable. The foreach command loops over the variables from person2 (the ones without the suffix), name-income. The if clause, `var' != `var'_person1, restricts the listing to observations where the value entered by person2 (`var') differs from the value entered by person1 (`var'_person1). For those observations we list id, the value entered by person2 and the value entered by person1.
Note that in the listings, the variables with no suffix hold the entries made by person2.
*Discrepancies listed by variables.

foreach var of varlist name-income{
  list id `var' `var'_person1 if `var' != `var'_person1, abbreviate(15)
}
   +-----------------------------+
   | id    name name_person1 |
   |-----------------------------|
  1. |  4 michael          mike |
  3. | 12 charles       charlie |
  5. | 43    Paul          paul |
   +-----------------------------+

   +-------------------------+
   | id age age_person1 |
   |-------------------------|
  2. | 11 23.5          23 |
   +-------------------------+

   +----------------------+
   | id ht ht_person1 |
   |----------------------|
  3. | 12 52          72 |
   +----------------------+

   +----------------------+
   | id wt wt_person1 |
   |----------------------|
  4. | 13 .       135 |
   +----------------------+

   +------------------------------+
   | id income income_person1 |
   |------------------------------|
  5. | 43    5600          15600 |
   +------------------------------+
To list discrepancies by observation, we modify the previous loop to evaluate the variables one case at a time: for observation 1 we check every variable named in the foreach; once its discrepancies are listed we move on to observation 2, and so on until the last observation. First we find the number of observations with the count command and use that value as the upper bound of the forvalues loop. Adding _n == `i' to the if clause of the list command restricts each pass to a single observation.
*Discrepancies listed by id variable.

count
5

forvalues i = 1/5 {
    foreach var of varlist name-income {
        list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
    }
}

   +-----------------------------+
   | id    name name_person1 |
   |-----------------------------|
  1. |  4 michael          mike |
   +-----------------------------+

   +-------------------------+
   | id age age_person1 |
   |-------------------------|
  2. | 11 23.5          23 |
   +-------------------------+

   +-----------------------------+
   | id    name name_person1 |
   |-----------------------------|
  3. | 12 charles       charlie |
   +-----------------------------+

   +----------------------+
   | id ht ht_person1 |
   |----------------------|
  3. | 12 52          72 |
   +----------------------+

   +----------------------+
   | id wt wt_person1 |
   |----------------------|
  4. | 13 .       135 |
   +----------------------+

   +--------------------------+
   | id name name_person1 |
   |--------------------------|
  5. | 43 Paul          paul |
   +--------------------------+

   +------------------------------+
   | id income income_person1 |
   |------------------------------|
  5. | 43    5600          15600 |
   +------------------------------+
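
The loop bound above is typed in by hand after running count. One possible variation, sketched below on a cut-down stand-in for the merged dataset (three made-up rows and only the name and age variables, both assumptions), reads the number of observations from _N instead.

```stata
* Sketch only: a cut-down stand-in for the merged dataset above.
clear
input id str8 name_person1 age_person1 str8 name age
 4 mike  34 michael 34
11 john  23 john    23.5
13 sally 21 sally   21
end

* _N holds the current number of observations (3 here),
* so the upper bound does not need to be hard-coded
forvalues i = 1/`=_N' {
    foreach var of varlist name age {
        list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
    }
}

* Number of observations with at least one discrepancy
count if name != name_person1 | age != age_person1
```

Written this way, the same loop keeps working if observations are added to or dropped from the merged data.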

If you can post the data and your code, I may be able to help further.

Good luck!

血浪星空

2017-3-8 14:44:49

You would have to ask a computing specialist about this.

lile23

2017-3-8 18:58:51

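This is probably related to the computer's numerical precision; for the details, have a look at a textbook on numerical computation.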
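
One way precision could produce this symptom, offered only as a guess since the original data are not available: the same number stored as float in one dataset and as double in the other looks identical under the default display formats but does not compare as equal. A minimal sketch, with made-up variable names x_float and x_double:

```stata
* Sketch only: 0.1 is just a convenient value with no exact binary representation.
clear
set obs 1
generate float  x_float  = 0.1
generate double x_double = 0.1

* Under the default display formats both print as .1
list, noobs

* Yet they are not equal: the float copy was rounded to about 7 digits
count if x_float == x_double

* Showing more digits reveals the difference
format x_float x_double %20.18f
list, noobs

* Rounding the double to float precision before comparing makes them match
count if x_float == float(x_double)
```

If this is what happened, running describe on both datasets to compare storage types, or widening the display format as above, should show where the values diverge.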

Newkoarla

2017-3-10 00:22:07

Here is the syntax of CF

xmwise

2017-4-23 17:11:21

Newkoarla posted on 2017-3-10 00:22
Here is the syntax of CF

Thank you very much!
