{"id":2687,"date":"2019-03-26T21:07:54","date_gmt":"2019-03-26T21:07:54","guid":{"rendered":"http:\/\/www.ralphstraumann.ch\/blog\/?p=2687"},"modified":"2019-03-27T16:58:57","modified_gmt":"2019-03-27T16:58:57","slug":"scraping-data-from-a-gis-web-forum-using-python","status":"publish","type":"post","link":"https:\/\/www.ralphstraumann.ch\/blog\/2019\/03\/scraping-data-from-a-gis-web-forum-using-python\/","title":{"rendered":"Scraping data from a GIS web forum using Python"},"content":{"rendered":"\r\n<p><strong>For a private project of mine as well as in the context of a data analysis foray on behalf of <\/strong><a href=\"https:\/\/www.geobeer.ch\/\"><strong>GeoBeer<\/strong><\/a><strong>, I wanted to use the content that has been published on the GIS-centric message board <\/strong><a href=\"https:\/\/geowebforum.ch\/\"><strong>Geowebforum<\/strong><\/a><strong>. In order to achieve this, I had to do some webscraping. To that end, I used Python and BeautifulSoup.<\/strong><\/p>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\">Two principal news sources<\/h2>\r\n\r\n\r\n\r\n<p>The GIS community of Switzerland has two forums for discussions and the exchange of ideas, I&#8217;d say: First, there is the <a href=\"https:\/\/twitter.com\/rastrau\/lists\/swissgis\">#SwissGIS list on Twitter<\/a>: This is a Twitter list of &#8220;Switzerland-based people who talk about spatial data science, GIS, geoviz, carto and geo&#8221; that is curated by me.<\/p>\r\n\r\n\r\n\r\n<p>At the time of writing, the list features about 400 members and about 100 subscribers. I use the list as a means to follow Switzerland-related GIS news. To that end, I display the list as a column in <a href=\"https:\/\/tweetdeck.twitter.com\/\">Tweetdeck<\/a>. Over time, the SwissGIS list has been the <a href=\"http:\/\/www.ralphstraumann.ch\/blog\/2014\/11\/geocoding-twitter-users-the-swissgis-map\/\">subject of<\/a> <a href=\"https:\/\/geo.ebp.ch\/2015\/11\/11\/swiss-gis-network-on-twitter\">several<\/a> <a href=\"http:\/\/www.ralphstraumann.ch\/blog\/2016\/08\/swiss-gis-news-twitter-edition\/\">analyses<\/a>. To me, this Twitter list is an invaluable resource for learning, sharing, and community building. It is also the nucleus of the <a href=\"https:\/\/www.geobeer.ch\/\">GeoBeer event series<\/a> that <a href=\"http:\/\/ping13.net\">Stephan<\/a> and I started in 2013 \u2013 after one of those analyses that are linked to above. But this is a topic for another post.<\/p>\r\n\r\n\r\n\r\n<p>The second forum for GIS in Switzerland, in my perception, is the <a href=\"https:\/\/geowebforum.ch\/\">Geowebforum<\/a>. It is a message board-style online forum in which, after free registration, anybody can share their news, announcements and ideas around GIS. The content is greenlit by a small group of moderators. This makes the forum a more static marketplace of ideas. As opposed to Twitter, it&#8217;s rare to see a thread growing into a discussion of 2, 3 or even more posts.<\/p>\r\n\r\n\r\n\r\n<p>(Of course, there are some other relevant channels besides Twitter and Geowebforum. In my case, e.g.: personal interactions, community events such as GeoBeer, and LinkedIn.)<\/p>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\">Getting the data<\/h2>\r\n\r\n\r\n\r\n<p>For private projects of mine, I wanted to use the content that has been published on <a href=\"https:\/\/geowebforum.ch\/\">Geowebforum<\/a> since its initiation in 2003. While the content is public (anybody can read the entirety of the forum contents, even without registration), it is not available in structured form (other than the RSS feed as was pointed out to me; but the feed only contains a small subset of all posts, namely the most recent ones). Thus, in order to get the content of Geowebforum into a usable form, I did some webscraping and data structuring in a database.<\/p>\r\n\r\n\r\n\r\n<figure class=\"wp-block-image\"><a href=\"https:\/\/geowebforum.ch\/thema.php?themenID=454\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"684\" class=\"wp-image-2618\" src=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geowebforum-jobs-1024x684.png\" alt=\"\" srcset=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geowebforum-jobs-1024x684.png 1024w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geowebforum-jobs-300x200.png 300w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geowebforum-jobs-768x513.png 768w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geowebforum-jobs.png 1343w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a>\r\n<figcaption>The &#8220;Jobs&#8221; topic on the Geowebforum<\/figcaption>\r\n<\/figure>\r\n\r\n\r\n\r\n<p>To that end, I used Python, the <a href=\"https:\/\/docs.python.org\/2\/library\/urllib2.html\">urllib2<\/a> library for making HTTP requests and the <a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\">BeautifulSoup<\/a> library for analysing and extracting the page content. You can find all code associated with my project (with an open licence) (as well as the resulting database of Geowebforum content) on my GitHub: <a href=\"https:\/\/github.com\/rastrau\/geowebforum-scraper\">github.com\/rastrau\/geowebforum-scraper<\/a>. The work is done in <a href=\"https:\/\/github.com\/rastrau\/geowebforum-scraper\/blob\/master\/scrape-geowebforum.py\">scrape-geowebforum.py<\/a>. This script sets up an SQLite database and scrapes the forum content into it.<\/p>\r\n\r\n\r\n\r\n<p>The script first iterates through &#8216;topics&#8217; \u2013 what I called the highest level in the information architecture of the Geowebforum. In the screenshot above you can see, for example, the <a href=\"https:\/\/geowebforum.ch\/thema.php?themenID=454\">first page of the topic &#8216;Jobs&#8217;, listing its threads<\/a>.<\/p>\r\n\r\n\r\n\r\n<p>The Python code to request that page and to find all links to threads looks as follows:<\/p>\r\n\r\n\r\n\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">page = urllib2.urlopen('%s&amp;offset=%s' % (topic_url, offset))\r\nsoup = BeautifulSoup(page, 'lxml')\r\nthread_links = soup.find_all('a', {'class': ['threadtitle']})<\/pre>\r\n\r\n\r\n\r\n<p>With the call to <strong>soup.find_all(&#8216;a&#8217;, {&#8216;class&#8217;: [&#8216;threadtitle&#8217;]})<\/strong> we can extract all hyperlink DOM elements (&lt;a href=&#8221;&#8230;.&#8221;&gt; tags) of class &#8220;threadtitle&#8221; in the page. Later, when iterating through the collected thread links, we can access the name of the thread as well as the hyperlink to the thread as follows:<\/p>\r\n\r\n\r\n\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">thread_name = thread_link.string \r\nthread_url = 'https:\/\/geowebforum.ch\/%s' % thread_link['href']<\/pre>\r\n\r\n\r\n\r\n<p>Similarly, the code iterates through all topics, all threads within each topic and all posts within each thread. It records the URLs and the titles of these items as well as the information architecture of the website. For posts, it extracts a lot of additional information such as the post title, the post content with and without HTML tags, the post time and the anonymised post author.<\/p>\r\n<p>There are some peculiarities I had to take care of: e.g., upon paging too far through a thread, Geowebforum does not display an error page or return an HTTP error code, but rather shows an emptyish page \u2013 the structure of which I detect using BeautifulSoup in order to stop the paging process.<\/p>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\">Structuring the data<\/h2>\r\n\r\n\r\n\r\n<p>The data that is obtained from scraping is structured and filled into an SQLite database that was set up at the start of the processing. This is done using the <strong>sqlite3 <\/strong>library of Python. For example, given an active connection to the database and a cursor, the code for ingesting thread data into the database looks as follows:<\/p>\r\n\r\n\r\n\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def ingest_thread(cursor, thread_id, thread_name, thread_url, topic_id):\r\n    cursor.execute(\"INSERT INTO threads VALUES (?, ?, ?, ?, ?)\",\r\n                   [thread_id, thread_name, thread_url, topic_id, 0])\r\n    return<\/pre>\r\n\r\n\r\n\r\n<p>The database is further set up with triggers, using Python. This way, upon adding a post into the table of posts, a trigger fires and updates the post count per thread in the table pertaining to forum threads.<\/p>\r\n<p>As a further add-on, I included a classifier of the language of posts. This is done using the <strong><a href=\"https:\/\/github.com\/Mimino666\/langdetect\">langdetect<\/a><\/strong> Python library. This library lets me compute the probability that a post is written in German, French, Italian or English. Out of the four Swiss national languages the (by far) smallest, Romansh, is not covered by langdetect. But I have my doubts that there are any posts in Romansh in the Geowebforum. Besides the language-specific probabilities, I store the most likely language in the database as well. But keep in mind that probabilities can be a very sensible metric to look at too, as, in my impression, a non-insignificant amount of forum posts combine two or more languages.<\/p>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\">Getting the data<\/h2>\r\n\r\n\r\n\r\n<p>You can get all the data that I extracted from the Geoebforum between 2003 and 2018 from <a href=\"https:\/\/github.com\/rastrau\/geowebforum-scraper\">github.com\/rastrau\/geowebforum-scraper<\/a>. Specifically, you&#8217;ll want to download the SQLite database <a href=\"https:\/\/github.com\/rastrau\/geowebforum-scraper\/blob\/master\/data.sqlite\">data.sqlite<\/a>. You can look at, and analyse, this dataset with any SQLite compatible tool, for example <a href=\"https:\/\/dbeaver.io\/\">DBeaver<\/a> or R using <a href=\"https:\/\/db.rstudio.com\/databases\/sqlite\/\">RSQLite<\/a>.<\/p>\r\n\r\n\r\n\r\n<p>The database contains four tables:<\/p>\r\n\r\n\r\n\r\n<ul class=\"wp-block-list\">\r\n<li><strong>metadata<\/strong><\/li>\r\n<li><strong>topics<\/strong><\/li>\r\n<li><strong>threads<\/strong><\/li>\r\n<li><strong>posts<\/strong><\/li>\r\n<\/ul>\r\n\r\n\r\n\r\n<p>All these tables <a href=\"https:\/\/github.com\/rastrau\/geowebforum-scraper#outputs\">are described in detail in the Readme of my repository<\/a>. The posts table contains most information. It stores the following attributes for each forum post:<\/p>\r\n\r\n\r\n\r\n<ul class=\"wp-block-list\">\r\n<li><code>post_id<\/code> (int, primary key): numeric ID of the geowebforum post (not linked to geowebforum.ch, but inherent to <em>geowebforum-scraper<\/em>)<\/li>\r\n<li><code>post_time<\/code>\u00a0(text; SQLite has no datetime format): Timestamp of when the post was made. The timestamp follows the format\u00a0<code>%Y-%m-%d %H:%M:00.000<\/code>.<\/li>\r\n<li><code>post_author<\/code>\u00a0(text): Hashed (encrypted) name of the author of the post.<\/li>\r\n<li><code>post_content<\/code>\u00a0(text): Content of the post\u00a0<em>complete with HTML tags<\/em>, i.e.\u00a0<code>post_content<\/code>\u00a0for example contains link targets.<\/li>\r\n<li><code>post_text<\/code>\u00a0(text): Same as\u00a0<code>post_content<\/code>\u00a0but\u00a0<em>with HTML code removed<\/em>.\u00a0<code>post_text<\/code>\u00a0corresponds to the text you see when you read a post in your browser.<\/li>\r\n<li><code>post_lang<\/code>\u00a0(text): Most likely language of the post as detected using the Python package\u00a0<code>langdetect<\/code>. Possible values are\u00a0<code>de<\/code>,\u00a0<code>fr<\/code>,\u00a0<code>it<\/code>\u00a0and\u00a0<code>en<\/code>\u00a0for German, French, Italian and English, respectively. Romansh \/ Rumantsch cannot be detected by\u00a0<code>langdetect<\/code>.<\/li>\r\n<li><code>de<\/code>\u00a0(real): probability in the interval of 0 to 1 of the post being in German.<\/li>\r\n<li><code>fr<\/code>\u00a0(real): probability in the interval of 0 to 1 of the post being in French.<\/li>\r\n<li><code>it<\/code>\u00a0(real): probability in the interval of 0 to 1 of the post being in Italian.<\/li>\r\n<li><code>en<\/code>\u00a0(real): probability in the interval of 0 to 1 of the post being in English.<\/li>\r\n<li><code>thread_id<\/code>\u00a0(int, foreign key): numeric ID of the geowebforum thread that contains this post.<\/li>\r\n<li><code>topic_id<\/code>\u00a0(int, foreign key): numeric ID of the geowebforum topic that contains this post.<\/li>\r\n<\/ul>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\">Insights<\/h2>\r\n\r\n\r\n\r\n<p><a href=\"https:\/\/github.com\/rastrau\/geowebforum-scraper\/tree\/master\/example-analyses\">I posted some insights that can be gained from this dataset complete with the code<\/a> (R and SQL) that was used to generate them. Here are some examples (<a href=\"https:\/\/github.com\/rastrau\/geowebforum-scraper\/blob\/master\/example-analyses\/README.md\">you can explore the full list here<\/a>):<\/p>\r\n\r\n\r\n\r\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"696\" class=\"wp-image-2669\" src=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/open-smart-3d-per-year-absolute-1024x696.png\" alt=\"\" srcset=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/open-smart-3d-per-year-absolute-1024x696.png 1024w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/open-smart-3d-per-year-absolute-300x204.png 300w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/open-smart-3d-per-year-absolute-768x522.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\r\n<figcaption>Number of posts, per year, mentioning some trendy terms<br \/><br \/><\/figcaption>\r\n<\/figure>\r\n\r\n\r\n\r\n<figure><\/figure>\r\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"696\" class=\"wp-image-2670\" src=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/e-geo-per-year-absolute-1024x696.png\" alt=\"\" srcset=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/e-geo-per-year-absolute-1024x696.png 1024w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/e-geo-per-year-absolute-300x204.png 300w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/e-geo-per-year-absolute-768x522.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\r\n<figcaption>A lifecycle: Number of posts, per year, mentioning the e-geo initiative<br \/><br \/><\/figcaption>\r\n<\/figure>\r\n\r\n\r\n\r\n<figure><\/figure>\r\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"696\" class=\"wp-image-2668\" src=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geosummit-geobeer-per-year-indexed-1024x696.png\" alt=\"\" srcset=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geosummit-geobeer-per-year-indexed-1024x696.png 1024w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geosummit-geobeer-per-year-indexed-300x204.png 300w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geosummit-geobeer-per-year-indexed-768x522.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\r\n<figcaption>Some GIS-centric events in Switzerland: One of which takes place every two years and one of which was started only in 2013.<br \/><br \/><\/figcaption>\r\n<\/figure>\r\n\r\n\r\n\r\n<figure><\/figure>\r\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"696\" class=\"wp-image-2671\" src=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/mgdm-geodatenmodell-per-year-absolute-1024x696.png\" alt=\"\" srcset=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/mgdm-geodatenmodell-per-year-absolute-1024x696.png 1024w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/mgdm-geodatenmodell-per-year-absolute-300x204.png 300w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/mgdm-geodatenmodell-per-year-absolute-768x522.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\r\n<figcaption>Number of posts, per year, mentioning standardisation instruments<br \/><br \/><\/figcaption>\r\n<\/figure>\r\n\r\n\r\n\r\n<figure><\/figure>\r\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"696\" class=\"wp-image-2672\" src=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/bim-per-year-absolute-1024x696.png\" alt=\"\" srcset=\"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/bim-per-year-absolute-1024x696.png 1024w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/bim-per-year-absolute-300x204.png 300w, https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/bim-per-year-absolute-768x522.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\r\n<figcaption>Number of posts, per year, mentioning one of the most recent big GIS integration trends<br \/><br \/><\/figcaption>\r\n<\/figure>\r\n\r\n\r\n\r\n<p><strong>Please let me know if you build something on top of this data, if the code helped you or if you come across an interesting insight!<\/strong><\/p>\r\n","protected":false},"excerpt":{"rendered":"<p>For a private project of mine as well as in the context of a data analysis foray on behalf of GeoBeer, I wanted to use the content that has been published on the GIS-centric message board Geowebforum. In order to achieve this, I had to do some webscraping. To that end, I used Python and &hellip; <a href=\"https:\/\/www.ralphstraumann.ch\/blog\/2019\/03\/scraping-data-from-a-gis-web-forum-using-python\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Scraping data from a GIS web forum using Python<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":2744,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[8],"tags":[],"class_list":["post-2687","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-piece"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/www.ralphstraumann.ch\/blog\/wp-content\/uploads\/2019\/03\/geowebforum.png","jetpack_shortlink":"https:\/\/wp.me\/p3pPwF-Hl","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/posts\/2687","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/comments?post=2687"}],"version-history":[{"count":17,"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/posts\/2687\/revisions"}],"predecessor-version":[{"id":2756,"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/posts\/2687\/revisions\/2756"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/media\/2744"}],"wp:attachment":[{"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/media?parent=2687"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/categories?post=2687"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ralphstraumann.ch\/blog\/wp-json\/wp\/v2\/tags?post=2687"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}