Scaling the Heights of Data Science
rOpenSci moves mountains of data to transform environmental research
America’s archetypal ecologist may be someone like Aldo Leopold: binoculars to the eyes, boots on the ground, equally at ease with pad and pencil or rod and reel, a thinking man at home in the field. But this romanticized vision overlooks the myriad challenges and tensions facing the modern ecologist, who must do more than traipse through fields and forests observing flora and fauna. He or she must also climb mountains of data, run complex computer models, and transcend space and time to account for a planet in flux due to climate change.
Perhaps the quintessential ecologist of the 21st century looks more like Carl Boettiger. A 31-year-old assistant professor of eco-informatics in the Department of Environmental Science, Policy, and Management, Boettiger believes that the key to the field’s future—and, in a sense, to humanity’s proper stewardship of the natural world—is technology, and more specifically, technology’s ability to unify a global community of researchers and their data.
“We’re at a time of unprecedented access to data and increased ability to share and move large amounts of data,” Boettiger says. “I think this has the potential to be transformative in the ecological and environmental sciences.” This transformation, he explains, is not a matter of altering the core purpose of ecology—to understand how natural systems function and, often, to use this understanding to inform management and policy decisions—but rather of expanding researchers’ perspectives to include new variables, new data sources, and new environments.
Connecting the dots
Say you’re a researcher forecasting wildfire severity in the Jeffrey pine forests of the Eastern Sierra. One species of native ant is known to clear pine needles from around its nests at the bases of some trees, thus reducing the likelihood that those trees will burn. Wouldn’t you want to take that into account, especially if someone else had already collected the data and shared it online?
The ethics of the new ecologist, Boettiger believes, are born not of solitude in the great outdoors or the lab, but rather of cooperation and sharing. They’re rooted in software development as much as in hunting and fishing, drawing less from religion and philosophy and more from mathematics and statistics.
Traditional methods in ecology rely on extrapolating from limited data sets collected firsthand. What Boettiger espouses—through his course Data Science in Ecology and the Environment, his own research into ways of improving ecological decision-making, and an open-science software platform called rOpenSci that he helped launch in 2011—is a more inclusive, communal approach. “We want to do the classic science, and then we want to scale it up, learn more from it,” he explains. “We just want to take the next step up.”
Boettiger came to this way of thinking naturally but unpredictably. As an undergraduate studying physics at Princeton, he didn’t even know that ecology was an academic field. By the time he learned, it was too late to switch majors. After graduating, he studied ecology at UC Davis, where he subsequently received funding to do graduate research at the Lawrence Berkeley National Laboratory, aided by the U.S. Department of Energy’s “humongous” computers there. “That forced me, almost against my will, to think more computationally,” he says.
In 2012, he completed his PhD in population biology at Davis and headed to UC Santa Cruz, where, as a postdoctoral scholar, he explored themes that continue to dominate his research and teaching today: moving advanced ecological concepts and data science out of the realm of ideas and into real-world management and decision-making.
This progression occurred against the backdrop of a larger evolution in science itself. Thanks to new tools for data collection, the pool of pertinent data potentially available to researchers is growing every day. Microsensors collect tiny bits of data more accurately and efficiently than older instruments, and remote-sensing devices aboard satellites record massive amounts of information about the earth’s atmosphere and surface from thousands of miles away.
Making “scaled-up science” more accessible
Added to this enormous data pool are climate change and its seemingly endless perturbations of the natural world: some understood, some still a mystery, but all generating a massive amount of data.
Consider California’s declining steelhead trout population. At a certain age, the fish decide whether to stay in their home river or head to the ocean. Boettiger says there’s growing concern that climate change is influencing their decision: that shifting temperatures are confusing the fish. This could have important consequences for how we manage fish populations—and thus requires a careful consideration of how, exactly, temperatures in their territory are changing.
But that doesn’t have to mean months or years of data collection; the rOpenSci website offers easy access to the National Oceanic and Atmospheric Administration’s exhaustive temperature and climate database. For that matter, rOpenSci is also tapped into AntWeb, the world’s largest database of ant info.
The project’s straightforward interface and use of the R programming language, which has been widely adopted by scientists and statisticians around the world, lower the bar for researchers to access and make better use of sources like these, as well as share their own field data. The bigger leap toward embracing open science is conceptual, not technical.
“The goal of rOpenSci has been to lower the barriers, so if you’re just curious about something like how global average temperatures over the last 30 years have affected your field site, you can just pull up the data and look at it,” Boettiger says. Simply put, the project aims to make “scaled-up science” more practical than ever before.
Expanding to other disciplines
From ecologist Aldo Leopold’s journal: A hand-drawn graph and chart of ornithological observations made in 1943. Archive photo courtesy of the Aldo Leopold Foundation
Some researchers still aren’t convinced, however, and won’t be until they can see a direct benefit, says rOpenSci cofounder Karthik Ram, a quantitative ecologist at the Berkeley Initiative for Global Change Biology (BiGCB) and a fellow at the Berkeley Institute for Data Science. So it helps that many of the site’s tools aren’t geared toward sharing or consuming data at all. In addition to those functions, rOpenSci offers software tools for proofing, organizing, visualizing, correcting, or otherwise processing data in an efficient way, often using statistical methods or accessing other databases. These features prove to be an easy sell—and a sort of icebreaker for rOpenSci’s broader purpose.
“We’re helping people clean up their existing data based on current information and automating many of those things that they would normally do by hand over many months,” Ram says. “Once people see this, they get more excited about getting involved and participating in open science.”
Growth in rOpenSci’s early years has been rapid, though “not quite exponential,” he concedes. To date, rOpenSci has registered about 100 contributors—power users who participate in the building of this online ecosystem, so to speak, by writing their own software or improving existing tools. But it has many thousands of users who log on simply to access those tools, of which there are now about 75.
Recent private funding has allowed rOpenSci to broaden its reach beyond its core audience—which is scientists studying the natural world—into a wide range of disciplines. It has developed new tools and run workshops at several conferences and universities as far away as Switzerland. And it has expanded its team from two to six.
Indeed, open science offers immense appeal, and utility, to people working in a host of fields. Look no further than Boettiger’s Data Science class for proof: About half of the students are majoring in math, statistics, computer science, and engineering, while the other half come from nontechnical fields like business, anthropology, biology, and ecology. (Like the campus-wide data-science initiative of which it is a part, the class is geared toward freshmen, though in its first year it included a mix of freshmen and sophomores.)
Today, free online fish-catch data help researchers quickly assess the impact of fisheries on marine ecosystems worldwide.
Boettiger embraces this diversity in his environment-centric class, which includes units on climate change, fisheries collapse, megafaunal extinctions, and ecological networks. Students learn to locate, access, and understand messy, real-world data sets, writing software tools that allow them to “walk through the data” and “discover the story themselves,” shaping research from a “raw forest” of information. Boettiger calls this “the art of data carpentry,” a skill rarely taught to undergraduates despite an abundance of material. “Data science is not just a professional skill that we need to give to graduate students who are closer to research; it’s an important part of a liberal education,” he says.
From undergrads to managers of wildfire risk or steelhead trout habitat, Boettiger believes that working with big data is not just a new skill but a fundamental one. “We’re lucky we already have science informing policy, but we need to inform science that it can use all of the available data to make better decisions,” he says. “We can do so much better if we have more data and we know how to use it.”