Clusters are the key to DNA analysis

The following article appeared in The Ancestral Searcher Vol. 49, No. 2, published by my local family history society:

What is a cluster?

This is the simplest statement in all of genetic genealogy – a cluster is a group of matches who share DNA with each other. To most people the term genetic network is synonymous.

We have been manually clustering in a piecemeal fashion for years, but perhaps only focusing on high cM matches, and those we can place on our tree, or looking for people with known surnames. It was all I thought you could do with DNA, till I found my own mystery ancestral couple and had to learn new methods of manually bringing matches into clusters. An auto-cluster automates the process we have used ever since DNA matches became available, that of bringing together people who share the same bits of DNA, with us and with each other. Auto-clusters, by default, include the important component of “matching with each other”.

Who has been gathered into a cluster?

The most important thing about a cluster is to understand who was gathered into it when we start adding people we had never heard of. Never be discouraged by this – it is these strangers who hold the potential to tell you something that you don’t know.

So, if you recognise one or more matches who represent the family of interest (see below) then you can rightly assume that the other members of the auto-cluster also share that family of interest (in general). If you then gathered the shared matches of every member in the auto-cluster, to expand this cluster, you would bring in all the matches who share this family’s DNA but who may not have satisfied the tight guard-rails of the Ancestry auto-cluster. The sad news for people who have already extensively dotted/labelled much of their match list is that auto-clusters have little to offer, even with the new Ancestry features.

There is a caveat so important, it deserves to be called a principle – every cluster contains descendants of the couple where the mystery lies, as well as both the paternal and maternal families of that mystery ancestor or couple. In other words, every cluster contains multiple family lines. This critical principle that underpins working with clusters is elegantly termed Theirs, His and Hers, a term coined by DNA guru Dana Leeds. In general, we know the THEIRS because this is the generation we can trace back to. But for clusters to break through brick walls, we need the families we don’t yet know – we must expect to find two additional families, a paternal and a maternal family – HIS and HERS.

 


The history of auto-clusters

Auto-clusters have been around for well over eight years. They are created as matrices by computer algorithms that methodically go through your entire match list, identifying matches who share DNA with each other often enough to be drawn together. Who matches who and who else?

Initially marketed by a subscription service called Genetic Affairs, the methodology was quickly sold to MyHeritage and Gedmatch and appeared in other third-party tools like DNAGedcom. FamilyTreeDNA offers a matrix tool which is a very similar process but presented as numbers, rather than coloured boxes; so too DNAPainter now offers a tool for you to build your own number-based matrix, which allows you to combine matches from different testing companies. But the tool that made the biggest splash was undoubtedly the auto-cluster at MyHeritage.

MyHeritage set the bar for auto-clusters when they were released. This almost magical dynamic display was enough to entice people to transfer their DNA results to MyHeritage largely for this auto-cluster, to access segment data and for other good tools. Clusters at all sites change as new matches join, and some kits are deleted. So, it is good practice to screen shot your auto-clusters for

Much attention has been paid recently to the launch of auto-clusters for Ancestry ProTools subscribers. As Ancestry has the largest database of matches it was hoped that these would be more useful. What an auto-cluster does extremely efficiently is to bring together many matches who share DNA, with some important guard-rails to help us. At Ancestry, every match in the cluster has to share at least 20cM with you, and also share at least 20cM with almost everyone else in the cluster. This prevents anomalous matches taking up our research time. These misleading or false matches occur less frequently at Ancestry because of this 20cM guard-rail, but other companies are reporting matches down to 8cM. A 10cM match is 50% likely to be false, or to be misattributed to a known common ancestor when the true source of the DNA is an entirely different common ancestor, perhaps one that one or neither party has been able to document. At 20cM the risk of this is 0%.
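The 20cM guard-rail can be sketched in a few lines of Python. This is only an illustration of the idea and not Ancestry's actual algorithm: the names MIN_FRACTION and passes_guard_rail are invented here, the cM figures are made up, and the 0.8 standing in for "almost everyone else" is purely an assumption.

```python
MIN_CM = 20          # threshold quoted in the article
MIN_FRACTION = 0.8   # ASSUMED stand-in for "almost everyone else"


def passes_guard_rail(candidate, cluster, cm_with_you, cm_between):
    """Return True if candidate satisfies both guard-rails for this cluster.

    cm_with_you: dict of match name -> cM shared with the tester.
    cm_between:  dict of frozenset({a, b}) -> cM shared between two matches.
    """
    # Guard-rail 1: the candidate must share at least 20cM with you.
    if cm_with_you.get(candidate, 0) < MIN_CM:
        return False
    others = [m for m in cluster if m != candidate]
    if not others:
        return True
    # Guard-rail 2: the candidate must share at least 20cM with
    # (almost) everyone else in the cluster.
    strong = sum(
        1 for m in others
        if cm_between.get(frozenset({candidate, m}), 0) >= MIN_CM
    )
    return strong / len(others) >= MIN_FRACTION


# Made-up figures for three hypothetical matches.
cm_with_you = {"Sue": 230, "John": 45, "Ann": 12}
cm_between = {
    frozenset({"Sue", "John"}): 60,
    frozenset({"Sue", "Ann"}): 25,
    frozenset({"John", "Ann"}): 22,
}
cluster = ["Sue", "John", "Ann"]

kept = [m for m in cluster
        if passes_guard_rail(m, cluster, cm_with_you, cm_between)]
print(kept)  # ['Sue', 'John'] because Ann shares only 12cM with you
```

Ann matches both Sue and John above 20cM, yet she is excluded because her own match to you falls below the threshold. Both conditions have to hold.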

The following graphics hopefully illustrate the process whereby a cluster is formed biologically. In this highly schematic example, Sue shares DNA with you on chromosomes 3, 4, 10, 13, 16, and 19 (six segments) and you happily think this accords with her status as a known second cousin. John is a more distant cousin. He shares part of chromosome 16 with you, and it is part of the exact same segment that Sue shares with you. We call this triangulation, when three people descending from the common ancestors have inherited the same exact bits of DNA.

However, it is not common for the triad of you, your chosen match and any one person on the shared match list to triangulate. Let’s look at another John. Sue is still your second cousin, sharing segments on the same chromosomes with you. However, John shares parts of chromosomes 1 and 17 with you. But why is he a shared match of Sue? Well, in this schematic example Sue and John share segments on different chromosomes, perhaps on 1, 2, 8 and 17. The segments John shares with you do not overlap those Sue shares with you, and this is very much the norm. But there must be shared DNA between each pair in this group of three – you must match Sue, you must match John, and Sue must match John. This is often called three pair-wise comparisons.
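The difference between the two Johns can be made concrete with a small Python sketch. The chromosome numbers follow the schematic example, but the start and end positions are invented, and the function names overlaps, three_pairwise and triangulates are hypothetical helpers rather than any testing company's tool.

```python
def overlaps(seg_a, seg_b):
    """True if two (chromosome, start, end) segments overlap."""
    chrom_a, start_a, end_a = seg_a
    chrom_b, start_b, end_b = seg_b
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def three_pairwise(you_a, you_b, a_b):
    """All three pairs in the trio share at least one segment."""
    return bool(you_a) and bool(you_b) and bool(a_b)


def triangulates(you_a, you_b, a_b):
    """Some stretch of DNA is common to all three pairwise comparisons."""
    return any(
        overlaps(x, y) and overlaps(x, z) and overlaps(y, z)
        for x in you_a for y in you_b for z in a_b
    )


# Segments as (chromosome, start, end); positions are invented.
you_sue = [(3, 10, 40), (4, 5, 30), (10, 60, 90),
           (13, 20, 45), (16, 20, 60), (19, 5, 25)]

# First John: shares part of the same chromosome 16 segment.
you_john_1 = [(16, 30, 50)]
sue_john_1 = [(16, 30, 50)]

# Second John: matches you and Sue, but on other chromosomes.
you_john_2 = [(1, 5, 30), (17, 10, 35)]
sue_john_2 = [(1, 100, 130), (2, 50, 80), (8, 15, 45), (17, 40, 60)]

print(three_pairwise(you_sue, you_john_1, sue_john_1))  # True
print(triangulates(you_sue, you_john_1, sue_john_1))    # True
print(three_pairwise(you_sue, you_john_2, sue_john_2))  # True
print(triangulates(you_sue, you_john_2, sue_john_2))    # False
```

Both Johns pass the three pair-wise comparisons, so both belong in the group with you and Sue. Only the first John triangulates.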


I’d like you to think of this process as a chain reaction: one segment of DNA matches the next new person who comes into the cluster, who in turn brings in new bits of DNA that might bring in more people. This is how clusters are formed. In this example, all the different segments of DNA were derived from the common ancestral couple. But let me stress – we don’t need to care about which segments are involved to safely assume these three people share a common ancestral couple/ancestor.

Importantly, clusters minimise the risk of paying attention to misleading or false matches.

Key message – clusters of matches are far more powerful than any single match (except for very high matches, say over 2000cM)

Working with clusters

Increasingly, working with DNA has evolved into a process of finding a cluster of people mostly unknown to us but who between them share DNA, and labelling them in such a way as to identify which of our family lines they belong to. The power of DNA is in connecting people who share DNA – not characteristics such as born in Germany or surnamed Jones.

With this awareness, the important task is to springboard off matches we do recognise, and who are in the family of interest, but then to gather all their shared matches into a much bigger group, which we call a cluster or a network. The cluster therefore includes some people we know, perhaps a sibling or 1C or 2C who also inherited that shared segment of DNA, and who will be at the top of a cluster list based on cM values. It then includes the many other people we do not recognise. However, what we can recognise about every match is their genetic relationship to us, which is the cM value, and this is a proxy for their genealogical relationship to us. So, strangers on our cluster list with cM values of 200-250 are likely to be second cousins. Strangers with cM values of 75-100 are likely to be third cousins, those with cM values of 25-50 are likely to be fourth cousins, and those in the low 20s are probably fifth cousins. See below for a table and discussion of half and removed relationships.
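As a rough aid, the cM ranges quoted above can be turned into a small lookup. This Python sketch uses only the figures in this article. Reading "low 20s" as 20-24 is an assumption, the function name likely_relationship is invented, and values that fall between the quoted ranges are deliberately left unlabelled rather than guessed.

```python
# Ranges as quoted in the article: (low cM, high cM, likely relationship).
# The 20-24 range for "low 20s" is an ASSUMPTION made for illustration.
RANGES = [
    (200, 250, "second cousin"),
    (75, 100, "third cousin"),
    (25, 50, "fourth cousin"),
    (20, 24, "fifth cousin"),
]


def likely_relationship(cm):
    """Return the likely full-cousin relationship for a cM value,
    or None if the value falls outside the ranges quoted above."""
    for low, high, label in RANGES:
        if low <= cm <= high:
            return label
    return None


for cm in (220, 88, 31, 22, 150):
    print(cm, likely_relationship(cm))
# 220 second cousin
# 88 third cousin
# 31 fourth cousin
# 22 fifth cousin
# 150 None
```

A value such as 150cM returns None because it sits between the quoted ranges, which is where half and removed relationships need to be considered.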

When you have a group of matches, found either through an auto-cluster or through dotting, and there is internal consistency within the cluster, you’ll know you are not committing the Cardinal Sin of creating clusters – bringing in the wrong DNA. We don’t want DNA from other family lines to contaminate the cluster, as this will cause you to spend many hours looking for connections that do not exist. What you’ll find as you review the shared matches of each person in the cluster is that the same matches appear over and over, with only a few new ones for any one match. And eventually, as you gather the shared matches of the smallest cM value members of the cluster, there will be no new matches. Now your cluster is complete.
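The "keep gathering shared matches until no new ones appear" routine is, in programming terms, a simple expansion over a list of who shares matches with whom. The Python sketch below assumes you have already recorded each person's shared matches in a dictionary. The names and data are made up, and expand_cluster is a hypothetical helper, not a feature of any testing site.

```python
from collections import deque


def expand_cluster(seeds, shared_matches):
    """Gather shared matches, starting from recognised matches (seeds),
    until reviewing a member's shared-match list adds nobody new."""
    cluster = set(seeds)
    to_review = deque(seeds)
    while to_review:
        person = to_review.popleft()
        for other in shared_matches.get(person, []):
            if other not in cluster:
                cluster.add(other)       # a new member joins the cluster
                to_review.append(other)  # review their shared matches too
    return cluster


# Made-up shared-match lists for illustration only.
shared_matches = {
    "Sue": ["John", "Ann"],
    "John": ["Sue", "Ann", "Pat"],
    "Ann": ["Sue", "John"],
    "Pat": ["John"],
    "Zed": ["Quinn"],   # unrelated line: never reached from Sue
}

print(sorted(expand_cluster(["Sue"], shared_matches)))
# ['Ann', 'John', 'Pat', 'Sue']
```

Note that the sketch does nothing to stop DNA from another family line creeping in. It will follow any shared match it is given, so the check for internal consistency still has to be done by you.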

Our known or recognised matches can only point the way to a particular family line. Because we already know where they sit on our tree, they cannot break through brick walls. It is these small mystery matches who hold the key.

Key message – it is the smaller cM value matches within a cluster that break through brick walls

Examining a cluster

MyHeritage and Ancestry auto-clusters are presented as a grid of coloured boxes, with the same names along the left and top sides. Thus, the darker box on the diagonal is where a name on the left meets the same name along the top. A blank box indicates that the two matches do not share enough DNA to satisfy the testing company’s parameters to be a match.

When you hover over any box you see how many cM are shared between the two matches. 



In this example of my largest cluster on MyHeritage, I examine it for the following features:

Do I recognise any matches – and I see the names McGuiness, Hunt, Buckley, Reeves – all members of my paternal grandfather’s line

Do I NOT recognise any names – e.g. O’Brien, Barr, Private – further research identifies them

Are there any “irregularities”? – Yes –  Blackie and illingham match very few other people in the cluster – are they still important?

Only the last line (a McGuiness) and Hunt (#3) match ALL people in the cluster – they are likely to be from an earlier generation than others in the cluster
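The last two checks in this list, spotting members who match very few others and members who match everyone, amount to counting the filled boxes in each row of the grid. The Python sketch below uses placeholder members A to E with invented pairings rather than my real cluster. The "fewer than half" cut-off for an irregular member is an assumption, and match_counts is a hypothetical helper.

```python
members = ["A", "B", "C", "D", "E"]

# Each pair is one filled box in the grid (invented for illustration).
pairs = {frozenset(p) for p in [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "C"), ("B", "D"), ("C", "D"),
]}


def match_counts(members, pairs):
    """Count how many other cluster members each person matches."""
    return {
        m: sum(1 for o in members if o != m and frozenset((m, o)) in pairs)
        for m in members
    }


counts = match_counts(members, pairs)
others = len(members) - 1

# Members who match ALL others in the cluster.
matches_all = [m for m in members if counts[m] == others]
# Members who match fewer than half the others (ASSUMED cut-off).
matches_few = [m for m in members if counts[m] < others / 2]

# Print the grid: '#' for a filled box or the diagonal, '.' for a blank.
for row in members:
    line = "".join(
        "#" if row == col or frozenset((row, col)) in pairs else "."
        for col in members
    )
    print(row, line)

print(matches_all)  # ['A']
print(matches_few)  # ['E']
```

Here A plays the part of a member who matches everyone, and E the part of an irregular member who matches only one other person and deserves a closer look.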

How does a cluster break through brick walls to mystery ancestors?

We are generally asking DNA to break down brick walls where we do not know the name of a particular ancestor. This might be because of a known illegitimate birth where no father was named, or because we are at, say, a x2 or x3 or x4 great grandmother who has been identified only by her first name. What has also emerged from the millions of DNA tests already taken is the high frequency of non-expected-paternal events, or NPEs, generally involving the father, though the term also applies to women who did not raise their child. We expect to find surname X because it is in our tree and on all the documents, but none of our matches have surname X, or an ancestor named X. Instead, there will be a group of matches who do all share DNA and do not sit on any other family line we have documented, but with a name that we have never heard before and therefore can’t recognise. In the past we may have passed them off as ‘too-hard-to-work-out’. At an individual match level, such people are indeed far ‘too-hard-to-work-out’. But with a group of matches embodying the DNA of Family X, some with good trees back to the generation where our two lines would connect, others with tiny trees, but with names that repeat enough to signal the connection, we can work out who Family X are. This process is described in the case study of working with clusters elsewhere in this edition of TAS [and blog].

 

 

 


