Clusters are the key to DNA analysis
The following article was published by my local family history society in The Ancestral Searcher Vol. 49, No. 2:
What is a cluster?
This is the simplest statement in all of genetic genealogy –
a cluster is a group of matches who share DNA with each other. To most people
the term genetic network is synonymous.
We have been manually clustering in a piecemeal fashion for
years, but perhaps only focusing on high cM matches, and those we can place on
our tree, or looking for people with known surnames. It was all I thought you
could do with DNA, till I found my own mystery ancestral couple and had to
learn new methods of manually bringing matches into clusters. An auto-cluster automates
the process we have used ever since DNA matches became available, that of
bringing together people who share the same bits of DNA, with us and with each
other. Auto-clusters, by default, include the important component of “matching
with each other”.
Who has been gathered into a cluster?
The most important thing about a cluster is to understand
who was gathered into it when we start adding people we had never heard of.
Never be discouraged by this – it is these strangers who hold the potential to
tell you something that you don’t know.
So, if you recognise one or more matches who represent the
family of interest (see below) then you can rightly assume that the other
members of the auto-cluster also share that family of interest (in general). If
you then gathered the shared matches of every member in the auto-cluster, to
expand this cluster, you would bring in all the matches who share this Family’s
DNA but who may not have satisfied the tight guard-rails of the Ancestry
auto-cluster. The sad news for people who have already extensively dotted/labelled much of their
match list is that auto-clusters have little to offer, even with the new Ancestry
features.
There is a caveat so important, it deserves to be called a
principle – every cluster contains descendants of the couple where the mystery
lies, as well as both the paternal and maternal families of that mystery
ancestor or couple. In other words, every cluster contains multiple family
lines. This critical principle that underpins working with clusters is
elegantly termed Theirs, His and Hers, and was coined by DNA guru Dana Leeds.
In general, we know the THEIRS because this is the generation we can trace back
to. But for clusters to break through brick walls, we need the families we don’t
yet know – we must expect to find two additional families – a
paternal and a maternal family – HIS and HERS.
The history of auto-clusters
Auto-clusters have been around for well over eight years. They are created as matrices by
computer algorithms that methodically go through your entire match list,
identifying matches who share DNA with one another often enough to be drawn together. Who
matches whom, and who else?
Initially marketed by a subscription service called Genetic
Affairs, the methodology was quickly sold to MyHeritage and Gedmatch and
appeared in other third-party tools like DNAGedcom. FamilyTreeDNA offers a
matrix tool which is a very similar process but presented as numbers rather
than coloured boxes; so too, DNAPainter now offers a tool for you to build your
own number-based matrix, which allows you to combine matches from different
testing companies. But the tool that made the biggest splash was undoubtedly
the auto-cluster at MyHeritage.
MyHeritage set the bar for auto-clusters when they were
released. This almost magical dynamic display was enough to entice people to
transfer their DNA results to MyHeritage largely for this auto-cluster, to
access segment data and for other good tools. Clusters at all sites change as
new matches join, and some kits are deleted. So, it is good practice to screen
shot your auto-clusters for
Much attention has been paid recently to the launch of auto-clusters for Ancestry ProTools
subscribers. As Ancestry has the largest database of matches it was hoped that
these would be more useful. What an auto-cluster does extremely efficiently is
to bring together many matches who share DNA, with some important guard-rails
to help us. At Ancestry, every match in the cluster has to share at least
20cM with you, and also share at least 20cM with almost everyone else in the cluster. This
prevents anomalous matches taking up our research time. These misleading or
false matches occur less frequently at
Ancestry because of this 20cM guard-rail, but other companies report
matches down to 8cM. A 10cM match is around 50% likely to be false, or to be misattributed
to a known common ancestor when the true source of the DNA is an entirely
different common ancestor, or one that neither party has been able to document.
At 20cM this risk is far lower.
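To make the guard-rails concrete, here is a minimal Python sketch of the entry rule described above. It is my own illustration, not Ancestry's actual algorithm. The function name `passes_guard_rails`, the `min_fraction` value of 0.8 standing in for "almost everyone", and all the cM values are assumptions made up for the example.

```python
def passes_guard_rails(candidate, members, cm_with_you, shared_cm,
                       min_cm=20.0, min_fraction=0.8):
    """Return True if `candidate` meets both cluster entry rules.

    cm_with_you: dict of name -> cM shared with you.
    shared_cm:   dict of frozenset({name_a, name_b}) -> cM shared by that pair.
    min_fraction is an ASSUMED stand-in for "almost everyone else".
    """
    # Rule 1: the candidate must share at least min_cm with you.
    if cm_with_you.get(candidate, 0.0) < min_cm:
        return False
    others = [m for m in members if m != candidate]
    if not others:
        return True
    # Rule 2: the candidate must share at least min_cm with most other members.
    hits = sum(
        1 for m in others
        if shared_cm.get(frozenset((candidate, m)), 0.0) >= min_cm
    )
    return hits / len(others) >= min_fraction


# Invented cM values, for illustration only.
cm_with_you = {"Sue": 230.0, "John": 45.0, "Ann": 28.0, "Pat": 12.0, "Lee": 40.0}
shared_cm = {
    frozenset(("Sue", "John")): 60.0,
    frozenset(("Sue", "Ann")): 35.0,
    frozenset(("John", "Ann")): 22.0,
    frozenset(("Sue", "Lee")): 30.0,
}
members = ["Sue", "John", "Ann"]

for name in ("Ann", "Pat", "Lee"):
    print(name, passes_guard_rails(name, members, cm_with_you, shared_cm))
```

In this made-up data Ann is admitted, Pat fails because she shares under 20cM with you, and Lee fails because he matches only one of the three existing members.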
The following graphics hopefully illustrate the process
whereby a cluster is formed biologically. In this highly schematic
example, Sue shares DNA with you on chromosomes 3, 4, 10, 13, 16, and 19 (six
segments) and you happily think this accords with her status as a known second
cousin. John is a more distant cousin. He shares part of chromosome 16 with
you, and in this example it is part of the exact same segment that Sue shares with you. We call this triangulation:
three people descending from the common ancestors have inherited the same
exact bits of DNA.
However, it is not common for the triad of you, your chosen
match and any one person on the shared match list to triangulate. Let’s look at another John. Sue is still your
second cousin, sharing segments on the same chromosomes with you. However,
John shares parts of chromosomes 1 and 17 with you. But why is he a shared match
of Sue? Well, in this schematic example Sue and John share segments on different
chromosomes, perhaps on 1, 2, 8 and 17. None of the segments John shares with you overlaps with those Sue shares with you, and
this is very much the norm. But each pair in
this group of three must share DNA – you must match Sue, you must match John, and Sue must
match John. This is often called three pair-wise comparisons.
I’d like you to think of this process as a chain reaction: one segment of DNA brings a new person into the cluster, that person brings in yet more bits of DNA, and those in turn might bring in more people. This is how clusters are formed. In this example, all the different segments of DNA were derived from the common ancestral couple. But let me stress - we don’t need to care about which segments are involved to safely assume these three people share a common ancestral couple/ancestor.
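The difference between triangulation and three pair-wise matches can be sketched in a few lines of Python. This is a simplified illustration only: the `Segment` type, the `classify_trio` function and every position value are invented. Real tools also apply minimum segment lengths, which I have left out.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chromosome: int
    start: float  # invented position along the chromosome
    end: float


def overlaps(a, b):
    """True if two segments sit on the same chromosome and overlap."""
    return a.chromosome == b.chromosome and a.start < b.end and b.start < a.end


def classify_trio(you_sue, you_john, sue_john):
    """Classify a trio from the segments shared by each of the three pairs."""
    # All three pair-wise comparisons must show some shared DNA.
    if not (you_sue and you_john and sue_john):
        return "not a shared match"
    # Triangulation (simplified): one segment from each pair, all overlapping.
    for a in you_sue:
        for b in you_john:
            for c in sue_john:
                if overlaps(a, b) and overlaps(a, c) and overlaps(b, c):
                    return "triangulated"
    return "shared match only"


you_sue = [Segment(3, 10, 40), Segment(16, 20, 60)]

# First John: shares part of the same chromosome 16 segment.
print(classify_trio(you_sue, [Segment(16, 30, 50)], [Segment(16, 30, 50)]))

# Second John: matches you and Sue, but on different chromosomes.
print(classify_trio(
    you_sue,
    [Segment(1, 5, 25), Segment(17, 40, 55)],
    [Segment(2, 10, 30), Segment(8, 15, 35)],
))
```

The first call prints "triangulated" and the second prints "shared match only". Both Johns belong in the cluster, which is the point: membership needs only the three pair-wise matches.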
Importantly clusters minimise the risk of paying attention
to misleading or false matches.
Key message – clusters of matches are far more powerful
than any single match (except for very high matches say over 2000cM)
Working with clusters
Increasingly, working with DNA has evolved into a process of
finding a cluster of people mostly unknown to us but who between them share
DNA, and labelling them in such a way as to identify which of our family lines
they belong to. The power of DNA is in connecting people who share DNA – not
characteristics such as born in Germany or surnamed Jones.
With this awareness, the important task is to springboard
off matches we do recognise, and who are in the family of interest, but then to
gather all their shared matches into a much bigger group, which we call a
cluster or a network. The cluster
therefore includes some people we know, perhaps a sibling or 1C or 2C who also
inherited that shared segment of DNA, and who will be at the top of a cluster
list based on cM values. It then includes the many other people we do not
recognise. However, what we can recognise about every match is their
genetic relationship to us, which is the cM value, and this is a proxy for
their genealogical relationship to us. So, strangers on our cluster list with
cM values of 200-250 are likely to be second cousins. Strangers with cM values of 75-100 are likely
to be third cousins, and those with cM values of 25-50 are likely to be fourth cousins.
Those in the low 20s are probably fifth cousins. See below for a table and
discussion of half and removed relationships.
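The rule of thumb above can be written as a small lookup. This sketch uses only the ranges quoted in this article. Treating "low 20s" as 20 to 25cM is my assumption, and the function name `likely_relationship` is invented. Values that fall between the quoted ranges are deliberately left unlabelled rather than guessed.

```python
# (low cM, high cM, label), taken from the ranges quoted in the text.
# The 20-25 band for "low 20s" is an ASSUMPTION made for illustration.
RANGES = [
    (200.0, 250.0, "second cousin"),
    (75.0, 100.0, "third cousin"),
    (25.0, 50.0, "fourth cousin"),
    (20.0, 25.0, "fifth cousin"),
]


def likely_relationship(cm):
    """Return the likely full-cousin relationship for a cM value, or None."""
    for low, high, label in RANGES:
        if low <= cm <= high:
            return label
    return None  # outside the quoted ranges: needs a fuller table


for cm in (230, 80, 30, 22, 150):
    print(cm, likely_relationship(cm))
```

A value such as 150cM returns None here, which is a reminder that half and removed relationships fill the gaps between these bands.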
When you have a group of matches, either found through an
auto-cluster or through dotting, and there is internal consistency within the
cluster, you’ll know you are not committing the Cardinal Sin of creating
clusters – bringing in the wrong DNA. We don’t want DNA from other family lines
to contaminate the cluster. This will cause you to spend many hours looking for
connections that do not exist. What you’ll find as you review the shared
matches of each person in the cluster is that the same matches appear over and
over, with only a few new ones for any one match. And eventually as you gather
the shared matches of the smallest cM value members of the cluster, there will
be no new matches. Now your cluster is complete.
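The gathering process just described, where you keep reviewing shared matches until no new names appear, can be sketched as follows. This is an illustration under assumptions: `shared_matches` is a hypothetical dictionary you would build yourself from each match's shared-match list, and the names and links are invented.

```python
from collections import deque


def expand_cluster(seeds, shared_matches):
    """Gather shared matches repeatedly until no new matches turn up.

    seeds:          names you recognise from the family of interest.
    shared_matches: dict of name -> list of names on that person's
                    shared-match list (hypothetical, hand-built data).
    """
    cluster = set(seeds)
    queue = deque(sorted(seeds))
    while queue:
        person = queue.popleft()
        for other in shared_matches.get(person, ()):
            if other not in cluster:  # a new name: add it and review it later
                cluster.add(other)
                queue.append(other)
    return cluster  # the queue is empty, so no new matches remain


# Invented shared-match lists.
shared_matches = {
    "Sue": ["John", "Ann"],
    "John": ["Sue", "Ann", "Kim"],
    "Ann": ["Sue", "John"],
    "Kim": ["John"],
    "Zed": ["Yan"],
    "Yan": ["Zed"],
}

print(sorted(expand_cluster(["Sue"], shared_matches)))
```

Starting from Sue alone, the sketch gathers Ann, John and Kim, and never touches Zed or Yan. Note that it adds every shared match without judgement. In practice you still need to check each addition for internal consistency, so that DNA from another family line does not creep in.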
Our known or recognised matches can only point the way to a
particular family line. Because we already know where they sit on our tree,
they cannot break through brick walls. It is the small mystery matches who
hold the key.
Key message – it is the smaller cM value matches within a
cluster that break through brick walls
Examining a cluster
MyHeritage and Ancestry auto-clusters are presented as a
grid of coloured boxes, with the same names along the left and top sides. Thus,
the darker box on the diagonal is where each name on the left meets the same name
along the top. A blank box indicates that the two matches do not share enough
DNA to satisfy the testing company’s parameters to be a match.
When you hover over any box you see how many cM are shared
between the two matches.
In this example of my largest cluster on MyHeritage, I examine it for the following features:
Do I recognise any matches? – I see the names McGuiness, Hunt, Buckley, Reeves – all members of my paternal grandfather’s line
Do I NOT recognise any names? – e.g. O’Brien, Barr, Private – further research identifies them
Are there any “irregularities”? – Yes – Blackie and illingham match very few other people in the cluster – are they still important?
Only the last line (a McGuiness) and Hunt (#3) match ALL people in the cluster – they are likely to be from an earlier generation than others in the cluster
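The last two checks in this list can be sketched in Python by counting the filled boxes in each row of the grid. The pairings below are invented for illustration and are not my real cluster. The function name `summarise_cluster` and the `sparse_fraction` cut-off for an "irregular" member are also assumptions.

```python
def summarise_cluster(members, pairs, sparse_fraction=0.5):
    """Count how many other cluster members each person matches.

    members: names in grid order.
    pairs:   iterable of (name_a, name_b) tuples, one per filled box.
    Returns (counts, match_all, sparse).
    """
    matched = {frozenset(p) for p in pairs}
    possible = len(members) - 1  # everyone except themselves
    counts = {
        m: sum(1 for o in members if o != m and frozenset((m, o)) in matched)
        for m in members
    }
    # People who match ALL others may come from an earlier generation.
    match_all = [m for m in members if counts[m] == possible]
    # "Irregular" members match few others (cut-off is an assumption).
    sparse = [m for m in members if counts[m] <= possible * sparse_fraction]
    return counts, match_all, sparse


# Invented pairings, using surnames from the example above.
members = ["Hunt", "Buckley", "Reeves", "Blackie", "McGuiness"]
pairs = [
    ("Hunt", "Buckley"), ("Hunt", "Reeves"), ("Hunt", "Blackie"),
    ("Hunt", "McGuiness"), ("McGuiness", "Buckley"),
    ("McGuiness", "Reeves"), ("McGuiness", "Blackie"),
    ("Buckley", "Reeves"),
]

counts, match_all, sparse = summarise_cluster(members, pairs)
print(counts)
print("Match everyone:", match_all)
print("Match few:", sparse)
```

With this made-up data, Hunt and McGuiness match everyone and Blackie is flagged as matching few others. A flag like this is a prompt to look more closely, not a reason to discard the match.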
How does a cluster break through brick walls to mystery ancestors?
We are generally asking DNA to break down brick walls where
we do not know the name of a particular ancestor. This might be because of a known illegitimate
birth where no father was named or because we are at say a x2 or x3 or x4 great
grandmother who has been identified only by her first name. What has also
emerged from the millions of DNA tests already taken, is the high frequency of
non-expected-paternal events, or NPEs, generally of the father but it also
applies to women who did not raise their child. We expect to find surname X
because it is in our tree and on all the documents, but none of our matches
have surname X, or an ancestor named X. Instead, there will be a group of
matches who do all share DNA and do not sit on any other family line we have documented,
but with a name that we have never heard before and therefore can’t recognise.
In the past we may have passed them off as ‘too hard to work out’. At an
individual match level, such people are indeed too hard to work out. But with
a group of matches embodying the DNA of Family X, some with good trees back to
the generation where our two lines would connect, others with tiny trees, but
with names that repeat enough to signal the connection, we can work out who
Family X are. This process is described in the case study of working with
clusters elsewhere in this edition of TAS [and blog]


