Burdett DNA Project Analysis

The folks at the Burdett DNA Project have done a great job of presenting the project's results and keeping them up to date, but I wanted to see HOW individual related groups were related, as a first step to trying to match the data to a family tree.

One way of looking at the data is as a mathematical graph, a collection of nodes and interconnecting undirected edges. This analysis doesn't use any known or suspected family tree information, only the DNA results (so far). Here, the nodes represent a particular DNA marker set result, and are labelled with the kit number(s) of the DNA tests, and the group from the project results page. The group information is used only to display on the result. The nodes connected with lines are a genetic distance of 1 apart, and the lines are labelled with the name of the marker that changed. Fast-changing markers are in red, slower-changing markers are in black.

For a description of how these graphs are created, and how you could do it manually, see the howto page.
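As a rough illustration only (this is not the program used to draw the graphs here), the graph described above can be sketched in a few lines of Python. The sketch assumes that genetic distance is the sum of the absolute differences in repeat counts across markers. The function names, kit names, marker names and values are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def genetic_distance(a, b):
    """Sum of absolute repeat-count differences (an assumed stepwise distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_graph(kits, markers):
    """Group identical results into nodes and link nodes that are distance 1 apart.

    kits maps a kit name to {marker name: repeat count}.
    Returns (nodes, edges): nodes maps a tuple of marker values to the kits
    sharing it; each edge is (kits of node 1, kits of node 2, changed marker).
    """
    nodes = defaultdict(list)
    for kit in sorted(kits):
        nodes[tuple(kits[kit][m] for m in markers)].append(kit)
    edges = []
    for (h1, k1), (h2, k2) in combinations(nodes.items(), 2):
        if genetic_distance(h1, h2) == 1:
            # Exactly one marker differs, by one step: use it as the edge label.
            changed = next(m for m, x, y in zip(markers, h1, h2) if x != y)
            edges.append((tuple(k1), tuple(k2), changed))
    return dict(nodes), edges


# Hypothetical kits and values, for illustration only.
markers = ["m1", "m2", "m3"]
kits = {
    "kitA": {"m1": 13, "m2": 24, "m3": 14},
    "kitB": {"m1": 13, "m2": 24, "m3": 14},
    "kitC": {"m1": 13, "m2": 25, "m3": 14},
    "kitD": {"m1": 15, "m2": 27, "m3": 14},
}
nodes, edges = build_graph(kits, markers)
print(edges)
```

Here kitA and kitB share a node, kitC is linked to that node by an edge labelled m2, and kitD is left unconnected because it is more than 1 away from everything else.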

Sometimes nodes that are further apart than 1 are still related enough to be interesting. In this case the program creates nodes with intermediate DNA results and labels them with 't' followed by a number so you can refer to them. Since the distance between such nodes is greater than one, usually more than one marker has changed, and you don't know which marker changed first, so there are usually 2 intermediate nodes between nodes of distance 2, and 6 intermediate nodes between nodes of distance 3. If you look at a graph of two nodes of distance 3 and their 6 intermediate nodes from an angle, it will look like a cube (perhaps bent out of shape a bit). This is not a coincidence. There are normally 3 different markers involved, and each is effectively a different dimension of difference.
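The counts of 2 and 6 can be checked with a small sketch. It assumes each marker changes one step at a time and that intermediate results only take values lying between the two endpoints. The function name and the marker values are hypothetical.

```python
from itertools import product


def intermediates(a, b):
    """Return every result lying between a and b, excluding a and b themselves.

    a and b map marker names to repeat counts and must have the same markers.
    """
    markers = sorted(a)
    # For each marker, every value between the two endpoints (inclusive).
    ranges = [range(min(a[m], b[m]), max(a[m], b[m]) + 1) for m in markers]
    ends = {tuple(a[m] for m in markers), tuple(b[m] for m in markers)}
    return [dict(zip(markers, combo))
            for combo in product(*ranges) if combo not in ends]


# Hypothetical marker values, for illustration only.
base = {"m1": 13, "m2": 24, "m3": 14}
two_apart = {"m1": 14, "m2": 25, "m3": 14}    # two markers differ by 1
three_apart = {"m1": 14, "m2": 25, "m3": 15}  # three markers differ by 1

print(len(intermediates(base, two_apart)))    # 2: the other corners of a square
print(len(intermediates(base, three_apart)))  # 6: the other corners of a cube
```

With n markers each differing by one step, the two results are opposite corners of an n-dimensional cube with 2^n corners, so there are 2^n - 2 intermediate nodes. If a single marker differs by 2 instead, the same sketch returns only 1 intermediate node, which is why the counts above are "usually" 2 and 6 rather than always.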

This example graph uses 37 markers and creates intermediate nodes for nodes of distance 2. Some things to note:

  • 73417 and 73476 both have the same DNA results for 37 markers. Listing all the identical results (up to some number of markers) in the same node is something that will not scale well when the project has hundreds of results, and eventually I'll have to call it "Class 1" or something and list the members of Class 1 at the bottom of the graph.
  • 77434 and 73544 are distance 1 apart, and CDY a is the marker that changed between them.
  • Nodes with no connections, such as 76600 and 76794, are too far apart from any other node to be considered connected. Placement of the node relative to other nodes is somewhat random and irrelevant, and the graph visualization software I'm using tends to arrange them in a strange way.
  • The nodes t1 and t2 do not have any corresponding test results but one of them might correspond to the DNA of a common ancestor.
  • 67 markers, distance 2
  • 67 markers, distance 3
  • 37 markers, distance 2
  • 37 markers, distance 3
  • 25 markers, distance 2
  • 25 markers, distance 3
  • Filling in nodes of distance 2 or 3 for 12-marker DNA results is stretching the concept of "relatedness" beyond the breaking point, except perhaps in the sense of "I am more closely related to this chimpanzee than this lizard". Linkups found at best show relationships between haplogroups.
  • 12 markers, distance 2
  • 12 markers, distance 3

    Matching DNA to Family Trees

    Now I'm going to try to use the difference graphs and known relationships to figure out where all the mutations occurred. I'm assuming that mutations are relatively rare. In any line of descent, it's possible for marker X to mutate in one generation and mutate back in the next, leaving no net change, but I'll assume that doesn't happen.

    It takes at least 3 tests to make this interesting. For one test, you can predict that all of the ancestors match exactly, in the absence of any other information. For two tests, you can predict that any differences between them occurred somewhere along one of the lines back to the common ancestor, but you don't know where.

    For three or more tests, you can start predicting where the change occurred. If three people descended from a common ancestor, and two have one marker value and the other has a different one, you can assume the mutation happened in the line of the odd man out.
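The "odd man out" reasoning can be sketched as a per-marker majority vote. This is a minimal sketch, assuming each tested person descends from the common ancestor by a separate line and that no marker mutated and then mutated back. The function name, kit names and values are hypothetical.

```python
from collections import Counter


def locate_mutations(results):
    """Guess the ancestor's value for each marker and who carries a mutation.

    results maps a kit name to {marker name: repeat count}.
    Returns (ancestor, mutated): ancestor maps each marker to the majority
    value, or None when there is a tie; mutated lists (marker, kit) pairs
    whose value differs from the majority.
    """
    markers = sorted(next(iter(results.values())))
    ancestor, mutated = {}, []
    for m in markers:
        counts = Counter(r[m] for r in results.values()).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            ancestor[m] = None  # no clear majority, so nothing can be said
            continue
        ancestor[m] = counts[0][0]
        mutated += [(m, kit) for kit, r in sorted(results.items())
                    if r[m] != ancestor[m]]
    return ancestor, mutated


# Hypothetical results for three descendants, for illustration only.
results = {
    "kitA": {"m1": 13, "m2": 24},
    "kitB": {"m1": 13, "m2": 24},
    "kitC": {"m1": 14, "m2": 24},
}
ancestor, mutated = locate_mutations(results)
print(ancestor)  # {'m1': 13, 'm2': 24}
print(mutated)   # [('m1', 'kitC')]
```

With only two tests every differing marker is a tie, which matches the point above that two tests cannot say on which line a change occurred.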

    On the descendant charts:

  • A blue node represents a 67-marker test.
  • A green node represents a 37-marker test.
  • An orange node represents a 25-marker test.
  • A yellow node represents a 12-marker test.
  • A grey node represents a test with results pending.
  • A pink node represents ancestors presumed to exist from family tree information.
  • A red dashed arrow represents an unknown number of generations connecting two lines whose relationship through a common ancestor is unknown.

    Group 1 is so far the only interesting group. Here, there are 3 related subgroups but the exact relationship is unknown.

    Group 3 has the potential to be interesting, but it contains a disconnect somewhere, as two supposedly-related people end up being unrelated according to DNA.

    The graph is divided into "segments". A segment is part of a line of descent that stops at either a person with a test result or someone with more than one descendant. The segment is marked on the graph as "[X]", and a mutation can be located to somewhere between a person on a specific segment and their father. Exactly where can't be determined without more tests of different people, and there is no way to determine which mutation happened first.
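One way to read the segment definition is sketched below. It assumes the tree is given as a child-to-father mapping, that "more than one descendant" means more than one son in the tree, and that a segment runs upward from a stopping point to just below the next one. The function and the names in the example tree are hypothetical.

```python
def find_segments(father, tested):
    """Split a descendant tree into segments.

    father maps each person to their father (the earliest ancestor has no entry).
    tested is the set of people with a test result.
    Returns {stopping point: [people on the segment, youngest first]}; a
    mutation on a segment lies between one of those people and their father.
    """
    sons = {}
    for child, dad in father.items():
        sons.setdefault(dad, []).append(child)
    people = set(father) | set(father.values())
    # A segment stops at a tested person or at someone with more than one son.
    stops = {p for p in people if p in tested or len(sons.get(p, [])) > 1}
    segments = {}
    for stop in sorted(stops):
        chain, person = [], stop
        while person in father:
            chain.append(person)
            person = father[person]
            if person in stops:
                break
        if chain:
            segments[stop] = chain
    return segments


# Hypothetical tree: A has sons B and C; D and F are the tested descendants.
father = {"B": "A", "C": "A", "D": "B", "E": "C", "F": "E"}
segments = find_segments(father, tested={"D", "F"})
print(segments)  # {'D': ['D', 'B'], 'F': ['F', 'E', 'C']}
```

In this example a marker on which D differs from the ancestral value can only be placed somewhere on the D segment (between D and B, or between B and A) until someone else on that line is tested.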