19 Dec

Google BigQuery provides insight into Stack Overflow discussion data

    Software development discussion site Stack Overflow has started offering quarterly snapshots of its question-and-answer database through Google’s BigQuery.

    Stack Exchange, parent company for Stack Overflow and its sister sites, has previously made its data available to researchers through its online data explorer. But now researchers with a Google Cloud Platform account can plug directly into the data set using Google’s data exploration tools, which have fewer limitations than Stack Overflow’s.

    If you have a Google Cloud account, you can log in and begin exploring the data directly from a SQL-style web interface. Results from queries can be exported to CSV or JSON, saved to other tables in Google BigQuery, or exported to Google Sheets. BigQuery also comes with a REST API, so it can be used with third-party visualization tools or software stacks.
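    As a rough illustration of that workflow, the Python sketch below builds a SQL query and converts result rows to CSV or JSON. It rests on assumptions the article does not state: the table name `bigquery-public-data.stackoverflow.posts_questions`, its `id`, `title`, `score`, and pipe-separated `tags` columns, and the use of the `google-cloud-bigquery` client library with working credentials. The helper names are hypothetical.

```python
import csv
import io
import json

# Assumption: table and column names are not given in the article.
TABLE = "bigquery-public-data.stackoverflow.posts_questions"


def build_top_questions_query(limit=10):
    """Return SQL for the highest-scoring questions carrying the tag @tag."""
    limit = int(limit)  # reject anything that is not a plain integer
    return (
        "SELECT id, title, score\n"
        f"FROM `{TABLE}`\n"
        "WHERE @tag IN UNNEST(SPLIT(tags, '|'))\n"
        "ORDER BY score DESC\n"
        f"LIMIT {limit}"
    )


def run_query(sql, tag):
    """Run the query; needs google-cloud-bigquery and Google Cloud credentials."""
    from google.cloud import bigquery  # imported lazily so the helpers work offline

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("tag", "STRING", tag)]
    )
    result = client.query(sql, job_config=config).result()
    return [dict(row.items()) for row in result]


def rows_to_csv(rows):
    """Serialize a list of row dictionaries to CSV text with a header line."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_json(rows):
    """Serialize a list of row dictionaries to a JSON array."""
    return json.dumps(rows, indent=2)
```

    Calling `run_query(build_top_questions_query(5), "python")` would return the rows as dictionaries, provided the assumed table and credentials are in place; the two export helpers need no network access.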

    Stack Overflow’s question-and-answer format is popular with developers seeking quick solutions to common problems. Though it has a reputation for being insular and unwelcoming, it’s heavily trafficked, and many of its highest-voted answers are widely circulated as great explainers. For example, a popular question about why processing a sorted array is faster than processing an unsorted one drew an answer that not only gives a detailed technical explanation, but also serves as a great explainer for the concept of branch prediction failure.

    One possible application for Stack Overflow’s data, with or without BigQuery’s tool set, is sentiment analysis of the topics and discussions taking place on the site; in other words, getting broad hints about how developers feel about a technology.
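    To make the idea of sentiment analysis concrete, here is a deliberately naive Python sketch that scores a piece of text against two small word lists. The word lists and the function name are invented for illustration; real sentiment analysis would use a trained model or a much larger lexicon, and the article does not prescribe any particular method.

```python
import re

# Hypothetical, tiny word lists chosen only for illustration.
POSITIVE = {"love", "great", "fast", "easy", "helpful"}
NEGATIVE = {"hate", "slow", "broken", "confusing", "painful"}


def sentiment_score(text):
    """Return a score in [-1.0, 1.0]; 0.0 means neutral or no known words."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE for word in words)
    negative_hits = sum(word in NEGATIVE for word in words)
    total = positive_hits + negative_hits
    if total == 0:
        return 0.0
    return (positive_hits - negative_hits) / total
```

    Averaging such scores over all posts that mention a given technology would give the kind of broad hint described above, though a word-list approach misses sarcasm, negation, and context.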

    If discussions about a language are paired with discussions about an IDE for that language, those threads could be parsed for details about what people are (or aren’t) doing most often with that pairing. Thus, you could figure out what developers might need but aren’t yet asking for.
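    One simple way to find such pairings is to count how often a language tag and an IDE tag appear on the same question. The Python sketch below assumes each question's tags arrive as a single pipe-separated string (for example `python|pycharm|debugging`); that format, the two tag sets, and the function name are assumptions made for illustration, not details from the article.

```python
from collections import Counter

# Hypothetical tag sets; a real analysis would need a curated list.
LANGUAGE_TAGS = {"python", "java"}
IDE_TAGS = {"pycharm", "eclipse", "visual-studio-code"}


def count_pairings(tag_strings):
    """Count (language, IDE) pairs that share a question."""
    counts = Counter()
    for raw in tag_strings:
        tags = set(raw.split("|"))
        # Every language tag on the question pairs with every IDE tag on it.
        for language in tags & LANGUAGE_TAGS:
            for ide in tags & IDE_TAGS:
                counts[(language, ide)] += 1
    return counts
```

    The questions behind the most frequent pairs could then be read or parsed further to see what developers are trying to do with that combination.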

    Stack Overflow’s yearly surveys of its developers provide a similar snapshot of its audience’s mindsets: what languages are popular or how developers classify themselves. But such surveys are self-conscious and self-reporting, and they’re limited to the categories devised for them. Discussions on the site could provide more open-ended, direct, and detailed data about what developers like, hate, look for, and struggle with.

    Note that this data set comes from Stack Overflow, and not from any of the other IT-related Stack Exchange sites, such as Server Fault (for IT admins) or Super User (for “computer enthusiasts and power users”). If these data sets go online through Google BigQuery as well, they could open up possibilities for even larger and more sophisticated analyses across multiple IT disciplines.

