How Do I Generate the Graphs?
The graphs are just a way to visualize genetic distance.
The idea behind genetic distance is that mutations are fairly uncommon,
and when they do occur, one of the markers will change by one, either
up or down. The number of mutations (and, therefore, a rough idea of
how closely two people are related) is approximated by the genetic distance,
which is mathematically the sum of the absolute values of the differences
in corresponding markers. In simpler terms, subtract numbers for the
same markers in two different people, treat the results as positive
numbers, and add them.
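As a small illustration of the calculation just described, here is a minimal Python sketch. The function name `genetic_distance` is made up for this example, and the two marker lists are the first five markers of Andy and David from the example data in Step 1.

```python
def genetic_distance(a, b):
    """Sum of the absolute differences between corresponding marker values."""
    if len(a) != len(b):
        raise ValueError("both people must have the same number of markers")
    return sum(abs(x - y) for x, y in zip(a, b))


# First five markers for Andy and David from the example table.
andy = [11, 12, 13, 27, 15]
david = [11, 12, 13, 26, 16]

print(genetic_distance(andy, david))  # 2 (markers 444 and 555 each differ by 1)
```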
I use a program to generate the graphs, but here's a description of
how you could do it manually.
Step 1: Gather Your Data
This example uses fictional data with an unrealistically small number of
markers, which makes it easier to work with. Perhaps we're doing
DNA tests on cockroaches, or only looking at a subset of the markers.
| Name | Marker 111 | Marker 222 | Marker 333 | Marker 444 | Marker 555 | Marker 666 |
| Andy | 11 | 12 | 13 | 27 | 15 | 12 |
| Bob | 11 | 12 | 13 | 26 | 15 | |
| Charles | 11 | 12 | 13 | 27 | 15 | 13 |
| David | 11 | 12 | 13 | 26 | 16 | 12 |
| Frank | 11 | 12 | 13 | 26 | | |
| George | 11 | 12 | 13 | 27 | | |
| Harry | 12 | 12 | 11 | 25 | 15 | |
Step 2: Decide How Many Markers and Distance for the Graph
You can make a graph for any set of markers and for a distance of 1 (no
intermediate nodes at all), 2, or 3, but you have to decide
now. For this simplified example, I'll be using 5 markers and distance 1.
The more markers you use, the more detail you get, but you also lose people
who didn't get the tests for those markers. The obvious values to use
for Family Tree DNA tests are 12, 25, 37, and 67. You probably want to
make a graph for each of the 4 values if all 4 types of tests are in the data.
Step 3: Throw Out Incomplete Data
Eliminate any people who don't have data for all the markers required.
For this example, goodbye Frank and George.
They would show up on a 3-marker graph.
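If the data were held in a program, this filtering step might look like the sketch below. The layout is an assumption made for illustration: each person maps to a list of marker values in table order, with `None` where a marker was not tested. The name `complete_people` is hypothetical.

```python
# Example data from Step 1; None marks a marker the person was not tested for.
people = {
    "Andy":    [11, 12, 13, 27, 15, 12],
    "Bob":     [11, 12, 13, 26, 15, None],
    "Charles": [11, 12, 13, 27, 15, 13],
    "David":   [11, 12, 13, 26, 16, 12],
    "Frank":   [11, 12, 13, 26, None, None],
    "George":  [11, 12, 13, 27, None, None],
    "Harry":   [12, 12, 11, 25, 15, None],
}


def complete_people(people, marker_count):
    """Keep only people with a value for each of the first marker_count
    markers, and truncate their data to exactly those markers."""
    return {
        name: tuple(values[:marker_count])
        for name, values in people.items()
        if len(values) >= marker_count
        and all(v is not None for v in values[:marker_count])
    }


print(sorted(complete_people(people, 5)))  # ['Andy', 'Bob', 'Charles', 'David', 'Harry']
print(len(complete_people(people, 3)))     # 7 (Frank and George are back)
```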
Step 4: Group Identical Results
Find all of the people with identical markers, subject to the number of
markers being used, and form groups of them. For our example, Andy and Charles
have identical markers (marker 666 doesn't count, since we are only
using 5 markers). So, our groups are (Andy and Charles), (Bob), (David), and (Harry).
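The grouping can be sketched in Python by using each person's marker values as a dictionary key. This is only one way to do it; `group_identical` is a hypothetical name, and the data is the 5-marker example after Frank and George have been removed.

```python
from collections import defaultdict


def group_identical(people):
    """Map each distinct tuple of marker values to the people who share it."""
    groups = defaultdict(list)
    for name, markers in people.items():
        groups[tuple(markers)].append(name)
    return {markers: sorted(names) for markers, names in groups.items()}


# The five people with complete data for the first 5 markers.
people = {
    "Andy":    (11, 12, 13, 27, 15),
    "Bob":     (11, 12, 13, 26, 15),
    "Charles": (11, 12, 13, 27, 15),
    "David":   (11, 12, 13, 26, 16),
    "Harry":   (12, 12, 11, 25, 15),
}

groups = group_identical(people)
for markers, names in groups.items():
    print(names, markers)
```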
Step 5: Generate Intermediate Nodes
Intermediate nodes are created to force a connection between nodes that are
"close enough to be interesting", but not of distance 1. (This is especially
true of 37- and 67-marker FTDNA tests.)
For a graph of distance 1, you don't create any intermediate nodes. For a
graph of distance 2, you create intermediate nodes between each pair of
original nodes that are distance 2. For a graph of distance 3, you create
intermediate nodes between each pair of original nodes that are
distance 2 or 3. Ignore nodes that duplicate the DNA of nodes you've
already got.
If you're reading this for the first time, and don't need all the gory
details, you might want to skip the rest of this section.
Normally, for two nodes of distance N, there are 2^N - 2 intermediate
nodes to create: 2 nodes for distance 2 (looks like a square),
6 nodes for distance 3 (looks like a cube),
and 14 nodes for distance 4 (resembles a 4-dimensional "cube", whatever
that looks like). An exception to this happens when the difference
involves a step of more than 1 in any single marker. Some of the
intermediate nodes end up having the same DNA, so there are fewer
of them when duplicates are removed.
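As a rough check of the counts 2, 6, and 14, here is a sketch that assumes the intermediate nodes are every distinct combination of marker values lying between the two original nodes (the corners of the square or cube, minus the two originals). Under that assumption the count is the product of (difference + 1) over all markers, minus 2, which equals 2^N - 2 when every differing marker is off by exactly 1. The function name is hypothetical.

```python
from math import prod


def intermediate_node_count(a, b):
    """Distinct marker combinations strictly between a and b,
    assuming every value between the two endpoints is allowed per marker."""
    return max(0, prod(abs(x - y) + 1 for x, y in zip(a, b)) - 2)


print(intermediate_node_count((10, 10), (11, 11)))                  # 2  (square)
print(intermediate_node_count((10, 10, 10), (11, 11, 11)))          # 6  (cube)
print(intermediate_node_count((10, 10, 10, 10), (11, 11, 11, 11)))  # 14
print(intermediate_node_count((10,), (12,)))                        # 1  (distance 2 in one marker)
```

The last line shows the exception: a distance of 2 that comes from a single marker changing by 2 leaves only one intermediate node instead of two.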
You create intermediate nodes between pairs of nodes by noting which
markers differ between the two.
For example, the (Andy, Charles) node and the (David) node
are distance 2 apart, and differ in markers 444 and 555. Now take the markers
from the (Andy, Charles) node, change one of the markers one step towards
the (David) node. Repeat for each of the other markers that are different.
Now you've got nodes for 11:12:13:26:15 and 11:12:13:27:16. Compare these to
existing nodes. 11:12:13:26:15 duplicates the node for Bob, so you don't
need to create a new one. 11:12:13:27:16 is new, so give it a temporary name
(e.g. t1) and add it to the list.
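The worked example above can be reproduced with a short sketch. It is not necessarily how my program does it: it assumes that enumerating every marker combination between the two nodes gives the same result as stepping one marker at a time, and the names `intermediate_nodes`, `existing`, and `new_nodes` are made up for this illustration.

```python
from itertools import product


def intermediate_nodes(a, b):
    """All marker combinations between a and b, excluding a and b themselves."""
    ranges = [range(min(x, y), max(x, y) + 1) for x, y in zip(a, b)]
    return [p for p in product(*ranges) if p != tuple(a) and p != tuple(b)]


# Existing groups from Step 4, keyed by their marker values.
existing = {
    (11, 12, 13, 27, 15): "Andy, Charles",
    (11, 12, 13, 26, 15): "Bob",
    (11, 12, 13, 26, 16): "David",
    (12, 12, 11, 25, 15): "Harry",
}

# Create temporary nodes only for DNA that no existing node already has.
new_nodes = {}
for markers in intermediate_nodes((11, 12, 13, 27, 15), (11, 12, 13, 26, 16)):
    if markers not in existing and markers not in new_nodes:
        new_nodes[markers] = f"t{len(new_nodes) + 1}"

print(new_nodes)  # {(11, 12, 13, 27, 16): 't1'}
```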
Step 6: Link Groups With Distance 1
Group (Andy and Charles) is distance 1 from group (Bob), so link these two (marker 444 is different).
Group (Bob) is distance 1 from group (David), so link these two (marker 555 is different).
If we're using the intermediate node t1, it's distance 1 from (Andy and Charles) and distance 1 from (David), so link these.
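Finding the links amounts to checking every pair of nodes for a genetic distance of exactly 1 and noting which marker differs. Here is a minimal sketch of that check for the distance 1 graph (no t1 node); the function names and the `nodes` layout are assumptions for this example.

```python
from itertools import combinations


def genetic_distance(a, b):
    """Sum of the absolute differences between corresponding marker values."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_one_links(nodes):
    """Return (label, label, index of the differing marker) for each pair
    of nodes whose genetic distance is exactly 1."""
    links = []
    for (m1, n1), (m2, n2) in combinations(nodes.items(), 2):
        if genetic_distance(m1, m2) == 1:
            index = next(i for i, (x, y) in enumerate(zip(m1, m2)) if x != y)
            links.append((n1, n2, index))
    return links


marker_names = ["111", "222", "333", "444", "555"]
nodes = {
    (11, 12, 13, 27, 15): "Andy, Charles",
    (11, 12, 13, 26, 15): "Bob",
    (11, 12, 13, 26, 16): "David",
    (12, 12, 11, 25, 15): "Harry",
}

links = distance_one_links(nodes)
for a, b, i in links:
    print(f"{a} -- {b} (marker {marker_names[i]})")
# Andy, Charles -- Bob (marker 444)
# Bob -- David (marker 555)
```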
Step 7: Draw the Graph
Draw ovals for each group, label the group, and draw lines between each
pair of linked groups. Label each line with the name of the marker that's
different between the two. I use some free graph visualization software to come
up with reasonable placement of the groups, but doing it by hand with a
small number of groups is fairly simple. It's also not that hard to do
for a graph with many small clusters of interconnected groups, and
many isolated (entirely unconnected) groups.
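For anyone who wants software to do the placement, here is a sketch that writes the groups and links as Graphviz DOT text. Graphviz is an assumption picked for illustration, not necessarily the software I use, and `to_dot` is a hypothetical helper. The labels and links are the ones from the distance 1 example.

```python
def to_dot(labels, links):
    """Build an undirected Graphviz DOT description: one oval per group,
    one labeled edge per linked pair."""
    lines = ["graph genetic_distance {", "    node [shape=ellipse];"]
    for label in labels:
        lines.append(f'    "{label}";')
    for a, b, marker in links:
        lines.append(f'    "{a}" -- "{b}" [label="{marker}"];')
    lines.append("}")
    return "\n".join(lines)


labels = ["Andy, Charles", "Bob", "David", "Harry"]
links = [("Andy, Charles", "Bob", "444"), ("Bob", "David", "555")]

dot_text = to_dot(labels, links)
print(dot_text)
```

Listing every group as its own node line keeps unconnected groups such as Harry in the picture even though no edge mentions them.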
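Here's what the graph looks like for distance 1 (no intermediate nodes
added):
[Graph image not reproduced: (Andy, Charles) linked to (Bob) by marker 444, (Bob) linked to (David) by marker 555, and (Harry) unconnected.]
And here's what the graph looks like for distance 2 (an intermediate
node was added):
[Graph image not reproduced: the same graph plus node t1, linked to (Andy, Charles) by marker 555 and to (David) by marker 444.]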
Interpreting the Graphs
People listed in the same node have the same DNA (for the given number of
markers), as in Andy and Charles.
People listed in connected nodes have a genetic distance of 1
(for the given number of markers), as in Andy and Bob.
In the distance 1 graph above, Andy and Charles
are distance 2 from David, and are connected through Bob. This probably
means that Andy and David are more distantly related than Andy and Bob or
Bob and David, and that Andy and David's relationship is probably through
Bob's ancestors.
Harry's node is not connected to any other, so he is distant from the others
(a genetic distance of 4 or more from every other group in this example).