Basic description
Think of merge as being master + using = merged result.
Call the dataset in memory the master dataset, and the dataset on disk the using dataset. This way
we have general names that are not dependent on individual datasets.
Suppose we have two datasets,
master in memory    on disk in file filename
  +----------+            +----------+
  | id   age |            | id   wgt |
  |----------|            |----------|
  |  1    22 |            |  1   130 |
  |  2    56 |            |  2   180 |
  |  5    17 |            |  4   110 |
  +----------+            +----------+
We would like to join together the age and weight information. We notice that the id variable
identifies unique observations in both datasets: if you tell me the id number, then I can tell you
the one observation that contains information about that id. This is true for both the master and
the using datasets.
Because id uniquely identifies observations in both datasets, this is a 1:1 merge. We can bring in
the dataset from disk by typing
. merge 1:1 id using filename
   in memory        in filename.dta
     master      +       using       =      merged result
  +----------+        +----------+        +----------------+
  | id   age |        | id   wgt |        | id   age   wgt |
  |----------|        |----------|        |----------------|
  |  1    22 |        |  1   130 |        |  1    22   130 |   (matched)
  |  2    56 |        |  2   180 |        |  2    56   180 |   (matched)
  |  5    17 |        |  4   110 |        |  5    17     . |   (master only)
  +----------+        +----------+        |  4     .   110 |   (using only)
                                          +----------------+
The original data in memory are called the master data. The data in filename.dta are called the
using data. After merge, the merged result is left in memory. The id variable is called the key
variable. Stata jargon is that the datasets were merged on id.
The observations for id==1 existed in both the master and using datasets and so were combined in the
merged result. The same occurred for id==2. For id==5 and id==4, however, no matches were found, so
each became a separate observation in the merged result. Each observation in the merged result
therefore came from one of three possible sources:
     numeric      equivalent
      code           word        description
    ----------------------------------------------------------------
        1           master       originally appeared in master only
        2           using        originally appeared in using only
        3           match        originally appeared in both
    ----------------------------------------------------------------
merge encodes this information into new variable _merge, which merge adds to the merged result:
   in memory        in filename.dta
     master      +       using       =           merged result
  +----------+        +----------+        +-------------------------+
  | id   age |        | id   wgt |        | id   age   wgt   _merge |
  |----------|        |----------|        |-------------------------|
  |  1    22 |        |  1   130 |        |  1    22   130        3 |
  |  2    56 |        |  2   180 |        |  2    56   180        3 |
  |  5    17 |        |  4   110 |        |  5    17     .        1 |
  +----------+        +----------+        |  4     .   110        2 |
                                          +-------------------------+
Note: Above we show the master and using data sorted by id before merging; this was for
illustrative purposes. The dataset resulting from a 1:1 merge will have the same data, regardless
of the sort order of the master and using datasets.
The formal definition for merge behavior is the following: Start with the first observation of the
master. Find the corresponding observation in the using data, if there is one. Record the matched
or unmatched result. Proceed to the next observation in the master dataset. When you finish
working through the master dataset, work through unused observations from the using data. By
default, unmatched observations are kept in the merged data, whether they come from the master
dataset or the using dataset.
Remember this formal definition. It will serve you well.
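For concreteness, here is one way to reproduce the age-and-weight example from scratch. The filename
wgtdata.dta is just a placeholder, and the data lines following each input command are typed without
the dot prompt:

. clear
. input id wgt
  1 130
  2 180
  4 110
  end
. save wgtdata
. clear
. input id age
  1 22
  2 56
  5 17
  end
. merge 1:1 id using wgtdata
. list

The listing shows id, age, wgt, and _merge with the values displayed in the merged result above.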
1:1 merges
The example shown above is called a 1:1 merge, because the key variable uniquely identified each
observation in each of the datasets.
A variable or variable list uniquely identifies the observations if each distinct value of the
variable(s) corresponds to one observation in the dataset.
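In practice, you can check this before merging. For example, with the age data above in memory, isid
exits with an error if the listed variable(s) do not uniquely identify the observations, and
duplicates report tallies how many times each value of the key occurs:

. isid id
. duplicates report id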
In some datasets, multiple variables are required to identify the observations. Imagine data
obtained by observing patients at specific points in time so that variables pid and time, taken
together, identify the observations. Below we have two such datasets and run a 1:1 merge on pid
and time,
. merge 1:1 pid time using filename
        master          +         using         =             merged result
  +-----------------+      +-----------------+      +-------------------------------+
  | pid   time   x1 |      | pid   time   x2 |      | pid   time   x1   x2   _merge |
  |-----------------|      |-----------------|      |-------------------------------|
  |  14      1    0 |      |  14      1    7 |      |  14      1    0    7        3 |
  |  14      2    0 |      |  14      2    9 |      |  14      2    0    9        3 |
  |  14      4    0 |      |  16      1    2 |      |  14      4    0    .        1 |
  |  16      1    1 |      |  16      2    3 |      |  16      1    1    2        3 |
  |  16      2    1 |      |  17      1    5 |      |  16      2    1    3        3 |
  |  17      1    0 |      |  17      2    2 |      |  17      1    0    5        3 |
  +-----------------+      +-----------------+      |  17      2    .    2        2 |
                                                    +-------------------------------+
This is a 1:1 merge because the combination of the values of pid and time uniquely identifies
observations in both datasets.
There is nothing about a 1:1 merge that implies that all, or even any, of the observations match.
Above, five observations matched, one observation was only in the master (subject 14 at time 4), and
another was only in the using (subject 17 at time 2).
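Here is a sketch that reproduces this example, with x2data.dta as a placeholder name for the using
file:

. clear
. input pid time x2
  14 1 7
  14 2 9
  16 1 2
  16 2 3
  17 1 5
  17 2 2
  end
. save x2data
. clear
. input pid time x1
  14 1 0
  14 2 0
  14 4 0
  16 1 1
  16 2 1
  17 1 0
  end
. merge 1:1 pid time using x2data
. list

The resulting _merge values are 3 for the five matches, 1 for subject 14 at time 4, and 2 for
subject 17 at time 2.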
m:1 merges
In an m:1 merge, the key variable or variables uniquely identify the observations in the using
data, but not necessarily in the master data. Suppose you had person-level data within regions and
you wished to bring in regional data. Here is an example:
. merge m:1 region using filename
        master         +        using        =            merged result
  +-----------------+      +-------------+      +-------------------------------+
  | id   region   a |      | region    x |      | id   region   a    x   _merge |
  |-----------------|      |-------------|      |-------------------------------|
  |  1        2  26 |      |      1   15 |      |  1        2  26   13        3 |
  |  2        1  29 |      |      2   13 |      |  2        1  29   15        3 |
  |  3        2  22 |      |      3   12 |      |  3        2  22   13        3 |
  |  4        3  21 |      |      4   11 |      |  4        3  21   12        3 |
  |  5        1  24 |      +-------------+      |  5        1  24   15        3 |
  |  6        5  20 |                           |  6        5  20    .        1 |
  +-----------------+                           |  .        4   .   11        2 |
                                                +-------------------------------+
To bring in the regional information, we need to merge on region. The values of region uniquely
identify observations in the using data, but region is not an identifier in the master data.
We show the merged dataset sorted by id because this makes it easier to see how the merged dataset
was constructed. For each observation in the master data, merge finds the corresponding observation
in the using data and adds the values of the using variables to that master observation.
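Here is a sketch that reproduces this example, with regiondata.dta as a placeholder name for the
region-level file:

. clear
. input region x
  1 15
  2 13
  3 12
  4 11
  end
. save regiondata
. clear
. input id region a
  1 2 26
  2 1 29
  3 2 22
  4 3 21
  5 1 24
  6 5 20
  end
. merge m:1 region using regiondata
. sort id
. list

Sorting by id before listing displays the result in the order shown above; the observation
contributed by region 4, which has a missing id, sorts to the end.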
1:m merges
1:m merges are similar to m:1, except that now the key variables identify unique observations in
the master dataset. Any datasets that can be merged using an m:1 merge may be merged using a 1:m
merge by reversing the roles of the master and using datasets. Here is the same example as used
previously, with the master and using datasets reversed:
. merge 1:m region using filename
      master       +          using         =             merged result
  +-------------+      +-----------------+      +--------------------------------+
  | region    x |      | id   region   a |      | region    x   id    a   _merge |
  |-------------|      |-----------------|      |--------------------------------|
  |      1   15 |      |  1        2  26 |      |      1   15    2   29        3 |
  |      2   13 |      |  2        1  29 |      |      1   15    5   24        3 |
  |      3   12 |      |  3        2  22 |      |      2   13    1   26        3 |
  |      4   11 |      |  4        3  21 |      |      2   13    3   22        3 |
  +-------------+      |  5        1  24 |      |      3   12    4   21        3 |
                       |  6        5  20 |      |      4   11    .    .        1 |
                       +-----------------+      |      5    .    6   20        2 |
                                                +--------------------------------+
This merged result is identical to the merged result in the previous section, except for the sort
order and the contents of _merge. This time, we show the merged result sorted by region rather
than id. Reversing the roles of the files causes a reversal in the 1s and 2s for _merge: where
_merge was previously 1, it is now 2, and vice versa. These exchanged _merge values reflect the
reversed roles of the master and using data.
For each observation in the master data, merge found the corresponding observation(s) in the using
data and then wrote down the matched or unmatched result. Once the master observations were
exhausted, merge wrote down any observations from the using data that were never used.
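Continuing the sketch from the previous section, the same merged result can be produced with the
roles reversed; persondata.dta is a placeholder name for the person-level file, which is now the
using data:

. clear
. input id region a
  1 2 26
  2 1 29
  3 2 22
  4 3 21
  5 1 24
  6 5 20
  end
. save persondata
. clear
. input region x
  1 15
  2 13
  3 12
  4 11
  end
. merge 1:m region using persondata
. sort region
. list

Apart from the order of observations within each region, which is not guaranteed, this reproduces
the merged result shown above.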
m:m merges
m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched
within equal values of the key variable(s), with the first observation being matched to the first;
the second, to the second; and so on. If the master and using have an unequal number of
observations within the group, then the last observation of the shorter group is used repeatedly to
match with subsequent observations of the longer group. Thus the result of an m:m merge depends on
the current sort order of the data -- something that should never happen.
Because m:m merges are such a bad idea, we are not going to show you an example. If you think that
you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or
m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
Sequential merges
In a sequential merge, there are no key variables. Observations are matched solely on their
observation number:
. merge 1:1 _n using filename
  master    +     using     =        merged result
  +----+          +----+          +------------------+
  | x1 |          | x2 |          | x1   x2   _merge |
  |----|          |----|          |------------------|
  | 10 |          |  7 |          | 10    7        3 |
  | 30 |          |  2 |          | 30    2        3 |
  | 20 |          |  1 |          | 20    1        3 |
  |  5 |          |  9 |          |  5    9        3 |
  +----+          |  3 |          |  .    3        2 |
                  +----+          +------------------+
In the example above, the using data are longer than the master, but that could be reversed. In
most cases where sequential merges are appropriate, the datasets are expected to be of equal
length, and you should type
. merge 1:1 _n using filename, assert(match) nogen
Sequential merges, like m:m merges, are dangerous. Both depend on the current sort order of the
data.
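Here is a sketch of the sequential merge shown above, with seqdata.dta as a placeholder name for the
using file:

. clear
. input x2
  7
  2
  1
  9
  3
  end
. save seqdata
. clear
. input x1
  10
  30
  20
  5
  end
. merge 1:1 _n using seqdata
. list

Had the two datasets contained the same number of observations, adding the assert(match) and
nogenerate options, as shown above, would both verify that every observation matched and omit
_merge from the result.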