As I mentioned a while back, Alex Albright (of The Little Dataset That Could) and I had the chance to present some of our thoughts at Bloomberg's first annual Data for Good Exchange. We decided to talk about what we view as the shortcomings in popular data science education programs and bootcamps. Specifically, we wanted to shine a light on the ways that data scientists are (and are not) adequately trained to contribute to social good projects and work with foreign data. I've included the abstract (below) and introduction (after the jump). You can also read the full text and check out the poster. Thanks to Alex for working on this while on vacation in Portland and thanks to SLS for letting us write things we believe and not firing us for it.

One Size Does Not Fit All: The Shortcomings of the Mainstream Data Scientist Working for Social Good

Data scientists are increasingly called on to contribute their analytical skills outside of the corporate sector in pursuit of meaningful insights for nonprofit organizations and social good projects. We challenge the assumption that the skills and methods necessary for successful data analysis come in a “one size fits all” package for both the nonprofit and for-profit sectors. By comparing and contrasting the key elements of data science in both domains, we identify the skills critical for the successful application of data science to social good projects. We then analyze five well-known data science programs and bootcamps in order to evaluate their success in providing training that transfers smoothly to social impact projects. After surveying these programs, we make a number of recommendations with respect to data science training curricula, non-profit hiring systems, and the data science for social good community’s practices.

While the overwhelming majority of data scientists are employed in the for-profit sector, there is a growing movement taking advantage of their technological savvy and unique toolkit for the benefit of social good projects and programs. Conventionally trained data-scientists are encouraged more and more to play a pivotal role in data-driven social good projects as team members, consultants, or volunteers. However, this phenomenon assumes that the data scientists’ standard toolkit in the for-profit sector translates seamlessly to the realm of social good. We challenge this assumption and argue that while the term “data scientist” has become an amorphous catch-all for programmers, statisticians, bloggers, and other empirically inclined individuals, the skills and methodological knowledge required of a data scientist can and should differ across the for-profit and non-profit sectors. We use this paper as an opportunity to highlight the shortcomings of mainstream data science education and practice when it comes to the non-profit sector and social impact endeavors.

We begin by comparing and contrasting the roles of data scientists in the for-profit and non-profit environments, and identify three key differences. First, while for-profit data scientists often work with in-house data, non-profit data science often involves working with foreign data that merits greater scrutiny and sensitivity in its treatment. Second, while the corporate environment provides control over the quality of “insights” in the form of management, the non-profit environment can lack effective checks and balances on data and analysis quality. Third, in experimental design, for- profit data scientists often have near-omniscient control over the environment containing study variables, whereas real-world data and studies are seldom so fortunate. We conclude that whereas for-profit data science can often afford to be “insights”-driven and results-oriented, non-profit data science must be less content- driven and more process oriented to avoid results, conclusions, and even policies that are built on poor quality data and inappropriate methods.

Next, we survey popular data science curricula across bootcamps, online courses, and master’s degree programs in order to generalize the baseline knowledge of emerging data scientists. We then compare and contrast the skills delivered by contemporary data science education with those required for meaningful contribution to social impact projects, and find that the former caters strikingly to a for-profit position. For example, we find that there is little to no focus in current data science education on investigating the quality of data or the identification and integrity of experimental variables. The curricula of these courses illustrate that data scientists are molded to be corporate workers as the default, necessitating a further mechanism to help empirical researchers transition across sectors, even if they bear the same title: “data scientist.”

Ultimately, we make several recommendations as to (1) how data science training programs can better prepare their students for roles in organizations doing social good, (2) how non-profit organizations can and must be more targeted in their hiring practices to find data scientists who are adequately suited for their projects, and (3) how the data science for social good community can and must develop best practices and ethical codes akin to those in the academic community.