What Can You Do With the World's Largest Family Tree?

Scientists are beginning to find out.


Your family tree might contain a few curious revelations. It might alert you to the existence of long-lost third cousins. It might tell you your 10-times-great-grandfather once bought a chunk of Brooklyn. It might reveal that you have royal blood. But when family trees include millions of people, maybe even tens of millions of people, then we’re beyond the realm of individual stories.

When genealogies get so big, they’re not just the story of a family anymore; they contain the stories of whole countries and, at the risk of sounding grandiose, even all of humanity.

Last week, scientists using data from Ancestry.com and Geni.com each unveiled papers analyzing the genealogies for patterns like migrations, lifespan, and when people stopped marrying family members. Ancestry.com sells both subscriptions to its genealogy research site and a popular genetic test through its subsidiary AncestryDNA. Its geneticists, along with a historian, used the genetic data of 770,000 AncestryDNA customers along with the genealogy records of their ancestors to map migrations in North America. The team first analyzed the DNA tests to find clusters of closely related people in the present. Then, they matched up the people in those clusters with genealogy records containing 20 million people, which included the birthplaces of several generations of ancestors. With that, they could march backwards in time to see how those ancestors migrated across the U.S.

Erin Battat, a historian and author of Ain’t Got No Home: America’s Great Migrations and the Making of an Interracial Left, joined the research to verify the patterns in Ancestry’s data. She noticed, for example, that Alabama saw an influx of people from South Carolina in the early 19th century. What happened was that intensive cotton cultivation had depleted the soil in South Carolina and Georgia. And in 1814, the Treaty of Fort Jackson compelled the Creek Indians to cede land in Alabama. This set off an episode of “Alabama Fever,” in which South Carolinians traveled through Georgia to settle in a new state with land open for cultivation. “That’s the kind of puzzle I was solving as a historian,” says Battat.

In the case of Geni.com’s data, the company allowed scientists from the New York Genome Center, Columbia, MIT, and Harvard to scrape crowdsourced public records that ultimately contained 43 million people, largely in North America and western Europe. It included the single largest known family tree with 13 million people. (And yes, that family tree included Kevin Bacon.)

The researchers were largely geneticists and computational biologists, but they also recognized the potential value of the data for historians and social scientists. So in an analysis that is not yet peer-reviewed, posted to the preprint server bioRxiv, they looked at several variables, including the distance men traveled before marrying (on average, farther than women did) and the genetic relatedness of couples (which decreases markedly after 1850). They also noticed that even when people started marrying partners born farther from their own birthplaces, they didn’t stop marrying their relatives right away. The decline in marrying relatives, the team hypothesizes, might have more to do with changing cultural taboos than with the ability to travel farther via swifter transportation. (Yaniv Erlich, who led this work, declined an interview about his paper because the preprint is still under review at a journal.)

These observations are interesting, but do they reveal anything new? Jan Van Bavel at the University of Leuven writes in an email that they largely confirm earlier research in demography, which is the use of statistics to study the structure of historical populations. “But I think that is a good thing,” he writes. “First, these databases need to be validated, i.e. see if they can replicate well-known facts. If that is the case, that is reassuring to go on and use these data to answer new questions.”

One of the drawbacks of these user-generated genealogies is that they are neither a complete nor a random sample of the population. They underrepresent people who don’t have descendants or whose descendants have no interest in contributing genealogy to these sites. “Modern demographers really want to know about the whole population,” says Philip Cohen, a demographer at the University of Maryland. “We would be very reluctant to generalize to the whole social order.” Where the data might be most useful is in studying specific subpopulations, say in a specific region, where the records are quite complete.

A good example of such a group is the Mormons. The Mormon church has a keen interest in genealogy, and its records are the original backbone of the Utah Population Database, which merges family, medical, and genetic data. “Genealogies are an amazing resource because they are the bedrock from which you can do very interesting and innovative genetics,” says Ken Smith, who is in charge of the database housed at the Huntsman Cancer Institute at the University of Utah. Research with the database has led to breakthroughs in the genetics of melanoma, breast cancer, colon cancer, and cardiac arrhythmia.

Knowing how volunteers in a genetics study are related can be a shortcut to pinpointing genes involved in disease. Studying families where colon cancer is common, for example, originally helped geneticists find genetic causes of the disease. Having deep genealogies meant that the researchers could also connect people in different states who shared the same cancer gene and a common ancestor hundreds of years ago.

The Utah Population Database lays out a blueprint for melding genealogy and genetics research. The groups working with Ancestry and Geni’s data have an eye on genetics, too. Ancestry’s scientists note, for example, that their study illuminates how different populations of people in various regions of the country might have different risks for disease alleles. That could matter in recruiting patients to clinical trials. And the group using Geni’s data created an interface that can match patients in a genetics study, via a Facebook login-like mechanism, to their genealogy records. (Otherwise, the data is anonymized.) To further drive home the importance of genealogy records in genetics research, consider this: Erlich announced last week he was going on leave from his academic job to be chief scientific officer of MyHeritage, Geni’s parent company.

Research using large-scale genealogies is only just beginning to take off. The family trees on Ancestry and Geni are the work of genealogy enthusiasts, interested in tracing the stories of their individual families. But together, they have accidentally created a database that traces the stories of entire populations, and potentially even more. Don’t we all, after all, belong to a single giant family tree?

Sarah Zhang is a staff writer at The Atlantic.