| Title  | Dealing with reports of repeated time values within panel |
| Author | Nicholas J. Cox, Durham University, UK, and Michael Mulcahy, University of Connecticut |
| Date   | December 2005 |
Question

I have panel data. I want to exploit the power of tsset (see [TS] tsset), but when I type

. tsset id time

I get the report

repeated time values within panel
r(451);

What should I do next?
Answer

Panel data are defined by an identifier variable and a time variable. Each combination of identifier and time should occur at most once. That is, any such combination might appear either once or not at all, as gaps are allowed in panel data. The report of "repeated time values within panel" is thus serious, as Stata is unable to proceed with any commands that depend upon your data being accepted as panel data.
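For concreteness, here is a minimal sketch of data that tsset accepts; the variable names id and time follow the question above, and the values are invented for illustration. Note that the gap at time 3 in panel 1 is perfectly acceptable:

. clear
. input id time
  1 1
  1 2
  1 4
  2 1
  2 2
  end
. tsset id time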
Two common reactions to this report are to suppose that it cannot be true, as you know you have panel data, or that there must be a bug or at least a misunderstanding here. In our experience, the problem will, on closer inspection, be found embedded in the dataset. Here we discuss various methods for approaching it. The underlying idea is that knowing several ways of going further is much better than knowing none. All the methods discussed are also applicable to other problems.
1. Do identifier and time uniquely identify the data?

Observations in panel data are uniquely identified by the combination of identifier and time. Thus isid may be used to check this, for example,

. isid id time

With isid, no news is good news. However, if the variables specified do not jointly identify the data, an error message will appear.
The logic of isid may be implemented in other ways. At its heart is an operation

. bysort id time: assert _N == 1

asserting that each combination of identifier and time is unique. Again, with assert, no news is good news. If the statement asserted is not true everywhere that it is tested, an error message will ensue.
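A variation on the same idea flags the offending observations instead of stopping at an error, which makes a convenient bridge to the next step. A minimal sketch, again assuming variables named id and time:

. bysort id time: generate byte bad = (_N > 1)
. count if bad
. list id time if bad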
2. Check for duplicates

If you have received confirmation of a problem, the next step is to track it down. With a very small dataset, a list or edit of the data may be sufficient, but even then, a more systematic approach is preferable. Here is what we did in a specific example using the duplicates command, which is a small bundle of tools for investigating possible problems arising from duplicated observations.
The dataset consists of several variables for various cities and years, with identifier id and time variable year. The number of observations is 7,813, large enough for a visual scan of the data to be a poor solution. The subcommand duplicates report quantifies the extent of the problem: 26 observations occur in duplicate, as 13 pairs sharing values of id and year. The subcommand duplicates list shows that they all involve id 467. The subcommand duplicates tag is used to tag the observations to examine more closely. An edit then gives all the details.
. duplicates report id year

Duplicates in terms of id year

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |         7787             0
        2 |           26            13
--------------------------------------
. duplicates list id year

Duplicates in terms of id year

+----------------------------+
| group:   obs:    id   year |
|----------------------------|
|      1   6059   467   1990 |
|      1   6060   467   1990 |
|      2   6061   467   1991 |
|      2   6062   467   1991 |
|      3   6063   467   1992 |
|----------------------------|
|      3   6064   467   1992 |
|      4   6065   467   1993 |
|      4   6066   467   1993 |
|      5   6067   467   1994 |
|      5   6068   467   1994 |
|----------------------------|
|      6   6069   467   1995 |
|      6   6070   467   1995 |
|      7   6071   467   1996 |
|      7   6072   467   1996 |
|      8   6073   467   1997 |
|----------------------------|
|      8   6074   467   1997 |
|      9   6075   467   1998 |
|      9   6076   467   1998 |
|     10   6077   467   1999 |
|     10   6078   467   1999 |
|----------------------------|
|     11   6079   467   2000 |
|     11   6080   467   2000 |
|     12   6081   467   2001 |
|     12   6082   467   2001 |
|     13   6083   467   2002 |
|----------------------------|
|     13   6084   467   2002 |
+----------------------------+
. duplicates tag id year, gen(isdup)
Duplicates in terms of id year
. edit if isdup
. drop isdup
The final edit command reveals the precise problem: two cities, Royal Oak, MI, and Bristol, CT, have been assigned the same identifier. We need to fix that by changing the identifier of one city to something else.
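How that fix looks in practice depends on how the two cities can be told apart. As a purely hypothetical sketch, suppose a string variable city (not part of the dataset as described above) distinguishes them and that 9999 is an unused identifier:

. * give one of the two cities an unused identifier (city is hypothetical)
. replace id = 9999 if id == 467 & city == "Royal Oak, MI"
. tsset id year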
Not all these steps are essential. Some users omit the report. On the other hand, in a large dataset, the list could be lengthy. Either way, duplicates offers various handles for the problem.
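Two of those handles, sketched here as possibilities rather than as steps we took above: duplicates examples condenses a long listing to one example per group of duplicates, and duplicates drop removes surplus copies, which is safe only when the repeated observations are exact copies of entire observations (not the case in our example, where two distinct cities shared an identifier):

. * one example from each group, useful when the full list would be long
. duplicates examples id year
. * drop surplus copies of entire observations; wrong for our example
. duplicates drop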