<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Visualizing Cáceres’ OpenData</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>The city of Cáceres runs an online service that publishes <a href="http://opendata.caceres.es/">Open Data</a> across a wide range of <a href="http://opendata.caceres.es/dataset">categories</a>, all of which are very interesting to explore!</p>
<div class="date-created-modified">Created 2020-03-09<br>
Modified 2020-03-19</div>
<p>We have chosen two different datasets, and will explore four different ways to visualize the data.</p>
<p>This post is co-authored with Classmate.</p>
<h2 class="title" id="obtain_the_data"><a class="anchor" href="#obtain_the_data">¶</a>Obtain the data</h2>
<p>We are interested in the JSON format of the <a href="http://opendata.caceres.es/dataset/informacion-del-padron-de-caceres-2017">2017 census</a> and of the <a href="http://opendata.caceres.es/dataset/vias-urbanas-caceres">vias of the city</a> (its streets and other public ways). With both, we can explore the population and where it lives in interesting ways! You may follow those two links and select the JSON format under Resources to download each file.</p>
<p>Why JSON? We will be using <a href="https://python.org/">Python</a> (3.7 or above) and <a href="https://matplotlib.org/">matplotlib</a> for quick iteration, and loading the data with <a href="https://docs.python.org/3/library/json.html">Python’s <code>json</code> module</a> will be trivial.</p>
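<p>As a rough illustration of why loading is trivial, here is a minimal sketch of reading a document shaped like the ones used in this post. The sample is made up (the URI and the counts are hypothetical); only the <code>results</code>, <code>bindings</code>, <code>value</code> nesting mirrors what the parsing code further down relies on.</p>

```python
import json

# Hypothetical, hand-written sample; the real files are far larger.
SAMPLE = '''
{
  "results": {
    "bindings": [
      {
        "uri": {"value": "http://example.org/census/2017-123"},
        "foaf_gender": {"value": "Male (10);Female (12)"}
      }
    ]
  }
}
'''

data = json.loads(SAMPLE)
rows = data['results']['bindings']
first = rows[0]

# Every field is wrapped in an object whose "value" key holds the payload.
print(first['uri']['value'])
print(first['foaf_gender']['value'])
```

<p>Everything arrives as plain dictionaries, lists and strings, so no extra dependency is needed before the real parsing starts.</p>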
<h2 id="implementation"><a class="anchor" href="#implementation">¶</a>Implementation</h2>
<h3 id="imports_and_constants"><a class="anchor" href="#imports_and_constants">¶</a>Imports and constants</h3>
<p>We are going to need quite a few things in this code: <code>json</code> to load the data, <code>matplotlib</code> to visualize it, and a handful of data types and type hints.</p>
<p>We also want automatic download of the JSON files if they’re missing, so we add their URLs and download paths as constants.</p>
<pre><code>import json
import os
import re
import sys
import urllib.request
from collections import namedtuple
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Optional

import matplotlib.pyplot as plt

# Endpoints that serve each dataset as JSON.
CENSUS_URL = 'http://opendata.caceres.es/GetData/GetData?dataset=om:InformacionCENSUS&amp;year=2017&amp;format=json'
VIAS_URL = 'http://opendata.caceres.es/GetData/GetData?dataset=om:Via&amp;format=json'

# Local paths where the downloaded files are kept.
CENSUS_JSON = Path('data/demografia/Padrón_Cáceres_2017.json')
VIAS_JSON = Path('data/via/Vías_Cáceres.json')
</code></pre>
<h3 id="data_classes"><a class="anchor" href="#data_classes">¶</a>Data classes</h3>
<p><a href="https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/">Parse, don’t validate</a>. By defining a clear data model, we will be able to tell at a glance what information we have available. It will also be typed, so we won’t be confused as to what is what! Python 3.7 introduced <a href="https://docs.python.org/3/library/dataclasses.html"><code>dataclasses</code></a>, a wonderful feature to define… well, data classes concisely.</p>
<p>We also have a <a href="https://docs.python.org/3/library/collections.html#collections.namedtuple"><code>namedtuple</code></a> for points, because it’s extremely common to represent them as tuples.</p>
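<pre><code># A (longitude, latitude) pair.
Point = namedtuple('Point', 'long lat')

@dataclass
class Census:
    year: int                    # census year, taken from the URI
    via: int                     # identifier of the via, taken from the URI
    count_per_year: dict         # birth year to number of people
    count_per_city: dict         # birth place to number of people
    count_per_gender: dict       # gender to number of people
    count_per_nationality: dict  # nationality to number of people
    time_year: int               # value of the time_year field

@dataclass
class Via:
    name: str
    kind: str
    code: int
    # The remaining fields may be missing in the source data.
    history: Optional[str]
    old_name: Optional[str]
    length: Optional[float]
    start: Optional[Point]
    middle: Optional[Point]
    end: Optional[Point]
    geometry: Optional[list]
</code></pre>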
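<p>To get a feel for what these two building blocks give us for free, here is a small sketch. <code>Street</code> is a hypothetical, trimmed-down stand-in for <code>Via</code> (with a default value the real class does not have), and the name and coordinates are made up.</p>

```python
from collections import namedtuple
from dataclasses import dataclass
from typing import Optional

Point = namedtuple('Point', 'long lat')

@dataclass
class Street:  # hypothetical, trimmed-down stand-in for Via
    name: str
    code: int
    middle: Optional[Point] = None

# Made-up values, purely for illustration.
plaza = Street(name='Plaza Mayor', code=1, middle=Point(long=-6.37, lat=39.47))

# A namedtuple unpacks like a plain tuple but also has named fields.
long, lat = plaza.middle
print(plaza.middle.lat == lat)

# Dataclasses generate __init__, __repr__ and __eq__ automatically.
print(plaza)
print(plaza == Street('Plaza Mayor', 1, Point(-6.37, 39.47)))
```

<p>The generated <code>__repr__</code> is what makes exploring the parsed data in a REPL pleasant, which matters when the source format is as messy as this one.</p>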
<h3 id="helper_methods"><a class="anchor" href="#helper_methods">¶</a>Helper methods</h3>
<p>We will have a little helper function to automatically download the JSON when missing. This is just for convenience; we could just as well download it manually. But it is fun to automate things.</p>
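<pre><code>def ensure_file(file, url):
    """Download url into file, unless the file already exists."""
    if not file.is_file():
        # Progress goes to stderr so that it stays apart from regular output.
        print('Downloading', file.name, 'because it was missing...', end='', flush=True, file=sys.stderr)
        # Create the parent directories first, or the download would fail.
        file.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, file)
        print(' Done.', file=sys.stderr)
</code></pre>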
<h3 id="parsing_the_data"><a class="anchor" href="#parsing_the_data">¶</a>Parsing the data</h3>
<p>I will be honest, parsing Cáceres’ OpenData is a pain in the neck! The official descriptions are huge and not all that helpful, except maybe when one needs documentation for a specific field. But luckily for us, the names are pretty self-descriptive, and we can explore the data to get a feel for what we will find.</p>
<p>We define two functions, one to iterate over <code>Census</code> values, and another to iterate over <code>Via</code> values. Here’s where our friend <a href="https://docs.python.org/3/library/re.html"><code>re</code></a> comes in, and oh boy the format of the data…</p>
<p>For example, the year and via identifier are best extracted from the URI! The information is also available in the <code>rdfs_label</code> field, but that’s just Spanish text! At least the URI will be more reliable… hopefully.</p>
<p>Birth date. They could have used a JSON list, but nah, that would’ve been too simple. Instead, you are given a string separated by semicolons. The values? They could have been dictionaries with names for «year» and «count», but nah! That would’ve been too simple! Instead, you are given strings that look like «2001 (7)», and that’s the year and the count.</p>
<p>The birth place? Sometimes it’s «City (Province) (Count)», but sometimes the province is missing. Gender? Semicolon-separated, too. And there are only two genders. I know a few people who would be upset just reading this, but it’s not my data, it’s theirs. Oh, and plenty of things are optional. That was a lot of <code>AttributeError: 'NoneType' object has no attribute 'foo'</code> to work through!</p>
<p>But as a reward, we have nicely typed data, and we no longer have to deal with this mess when trying to visualize it. For brevity, we will only be showing how to parse the census data, and not the data for the vias. This post is already long enough on its own.</p>
<pre><code>def iter_census(file):
    # Assumes the downloaded file is UTF-8, the usual encoding for JSON.
    with file.open(encoding='utf-8') as fd:
        data = json.load(fd)

    for row in data['results']['bindings']:
        # The last URI segment looks like "year-via".
        year, via = map(int, row['uri']['value'].split('/')[-1].split('-'))

        # Items look like "2001 (7)": birth year, then count.
        count_per_year = {}
        for item in row['schema_birthDate']['value'].split(';'):
            y, c = map(int, re.match(r'(\d+) \((\d+)\)', item).groups())
            count_per_year[y] = c

        # Items look like "City (Province) (Count)"; the province may be missing.
        count_per_city = {}
        for item in row['schema_birthPlace']['value'].split(';'):
            match = re.match(r'([^(]+) \(([^)]+)\) \((\d+)\)', item)
            if match:
                city, _province, c = match.groups()
            else:
                city, c = re.match(r'([^(]+) \((\d+)\)', item).groups()

            count_per_city[city] = int(c)

        count_per_gender = {}
        for item in row['foaf_gender']['value'].split(';'):
            gender, c = re.match(r'([^(]+) \((\d+)\)', item).groups()
            count_per_gender[gender] = int(c)

        # Some nationalities carry an alternative name in an extra parenthesis.
        count_per_nationality = {}
        for item in row['schema_nationality']['value'].split(';'):
            match = re.match(r'([^(]+) \((\d+)\)', item)
            if match:
                nationality, c = match.groups()
            else:
                nationality, _alt_name, c = re.match(r'([^(]+) \(([^)]+)\) \((\d+)\)', item).groups()

            count_per_nationality[nationality] = int(c)

        time_year = int(row['time_year']['value'])

        yield Census(
            year=year,
            via=via,
            count_per_year=count_per_year,
            count_per_city=count_per_city,
            count_per_gender=count_per_gender,
            count_per_nationality=count_per_nationality,
            time_year=time_year,
        )
</code></pre>
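<p>The two regular expressions can be tried in isolation. The sketch below reuses the patterns from <code>iter_census</code> on a made-up string in the «City (Province) (Count)» shape; <code>parse_item</code> is a hypothetical helper name, not part of the original code.</p>

```python
import re

# Patterns copied from the parsing code above.
WITH_EXTRA = re.compile(r'([^(]+) \(([^)]+)\) \((\d+)\)')
SIMPLE = re.compile(r'([^(]+) \((\d+)\)')

def parse_item(item):
    """Return (name, count), ignoring the optional middle parenthesis."""
    match = WITH_EXTRA.match(item)
    if match:
        name, _extra, count = match.groups()
    else:
        name, count = SIMPLE.match(item).groups()
    return name, int(count)

# Made-up sample; the second item has no province.
sample = 'Cáceres (Cáceres) (120);Madrid (8)'
counts = dict(parse_item(item) for item in sample.split(';'))
print(counts)
```

<p>Trying the three-group pattern first and falling back to the two-group one is what lets a single loop cope with both shapes.</p>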
<h2 id="visualizing_the_data"><a class="anchor" href="#visualizing_the_data">¶</a>Visualizing the data</h2>
<p>Here comes the fun part! After parsing all the desired data from the JSON files, we plotted it in four different charts using Python’s <a href="https://matplotlib.org/"><code>matplotlib</code> library</a>, a powerful tool for creating all sorts of visualizations.</p>
<h3 id="visualizing_the_genders_in_a_pie_chart"><a class="anchor" href="#visualizing_the_genders_in_a_pie_chart">¶</a>Visualizing the genders in a pie chart</h3>
<p>After seeing that there are only two genders in the census data, we, displeased, started work on a chart for it. A pie chart was the best option since we only wanted to show the percentage of each gender. The result looks like this:</p>
<p><img src="pie_chart.png" alt="" /></p>
<p>Pretty straightforward, isn’t it? To display this wonderful graphic, we used the following code:</p>
<pre><code>def pie_chart(ax, data):
    # Sort by label so that the slices always come out in the same order.
    lists = sorted(data.items())

    # Transpose the (label, count) pairs into labels (x) and counts (y).
    x, y = zip(*lists)
    ax.pie(y, labels=x, autopct='%1.1f%%',
           shadow=True, startangle=90)
    ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
</code></pre>
<p>We pass in the axis to draw on (later we will explain why) and the gender data collected from the JSON, a dictionary whose keys are the labels and whose values are the tally for each gender. We sort the data and with some unpacking magic we split it into two sequences: <code>x</code> being the labels and <code>y</code> the amount of each gender.</p>
<p>After that we plot the pie chart with the data and labels from <code>y</code> and <code>x</code>. With the <code>autopct</code> parameter we ask for percentages with one decimal place, we enable shadows for the presentation, and we set the start angle to 90°.</p>
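<p>The unpacking magic and the <code>autopct</code> format can both be checked without drawing anything. This is a small sketch with made-up tallies; the loop at the end only mimics how a format string such as <code>'%1.1f%%'</code> is applied to each percentage.</p>

```python
# Made-up tallies, only to show the shape of the data.
genders = {'Male': 480, 'Female': 520}

# sorted() on the dict items orders the pairs by key ...
pairs = sorted(genders.items())
# ... and zip(*pairs) transposes them into one tuple of keys and one of values.
labels, counts = zip(*pairs)
print(labels)
print(counts)

# autopct='%1.1f%%' is an old-style format string applied to each percentage.
total = sum(counts)
for label, count in zip(labels, counts):
    print(label, '%1.1f%%' % (100 * count / total))
```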
<h3 id="date_tick_labels"><a class="anchor" href="#date_tick_labels">¶</a>Date tick labels</h3>
<p>We wanted to know how many of the living people were born in each year, so we are making a date plot! The census has the year each person was born in, and using that information is an easy task once the data is parsed (parsing was an important part of this work). The result looks as follows:</p>
<p><img src="date_tick.png" alt="" /></p>
<p>How did we do this? The following code was used:</p>
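<pre><code>def date_tick(ax, data):
    # Sort by year so that the line is drawn from left to right.
    lists = sorted(data.items())

    x, y = zip(*lists)
    # Turn each year into a date (January 1st) so the x axis holds dates.
    x = [date(year, 1, 1) for year in x]
    ax.plot(x, y)
</code></pre>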
<p>Again, we pass in an axis and the data on birth years. We sort it and split it into two sequences, the keys being the years and the values the number of people born in each year. After that, we convert the years to <code>date</code> values so that the plot treats the horizontal axis as dates. Finally, we plot the values into that wonderful graphic.</p>
<h3 id="stacked_bar_chart"><a class="anchor" href="#stacked_bar_chart">¶</a>Stacked bar chart</h3>
<p>We wanted to know if there was any relation between latitude and the count per gender, so we developed the following code:</p>
<pre><code>def stacked_bar_chart(ax, data):
    labels = []
    males = []
    females = []

    # One bar per latitude; the labels are strings so they act as categories.
    for latitude, genders in data.items():
        labels.append(str(latitude))
        males.append(genders['Male'])
        females.append(genders['Female'])

    ax.bar(labels, males, label='Males')
    # bottom=males makes each female bar start where the male bar ends.
    ax.bar(labels, females, bottom=males, label='Females')

    ax.set_ylabel('Counts')
    ax.set_xlabel('Latitudes')
    ax.legend()
</code></pre>
<p>The key of the data dictionary is the latitude rounded to two decimals, and the value is another dictionary, which maps the name of each gender to the number of people of that gender. So, in a single entry of the data dictionary we have a latitude and how many people per gender live at that latitude.</p>
<p>We iterate over the dictionary to extract the different latitudes and people per gender (because we know only two genders are used, we hardcode them as two lists). Then we plot the <code>males</code> bars first and stack the <code>females</code> bars on top of them by passing <code>bottom=males</code>, and set the labels of each axis. The result is the following:</p>
<p><img src="stacked_bar_chart-1.png" alt="" /></p>
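<p>The snippets in this post do not show how the dictionary passed to <code>stacked_bar_chart</code> is built. One way it might be done is sketched below; <code>group_by_latitude</code>, its parameters and the sample values are all hypothetical, and the only thing taken from the post is the target shape (latitude rounded to two decimals, then gender, then count).</p>

```python
from collections import defaultdict

def group_by_latitude(census_rows, via_latitudes):
    """Sum gender counts per latitude rounded to two decimals.

    census_rows: iterable of (via_code, {gender: count}) pairs.
    via_latitudes: dict mapping via_code to a latitude, or None if unknown.
    """
    result = defaultdict(lambda: defaultdict(int))
    for via_code, per_gender in census_rows:
        latitude = via_latitudes.get(via_code)
        if latitude is None:
            continue  # No known position for this via; skip it.
        bucket = result[round(latitude, 2)]
        for gender, count in per_gender.items():
            bucket[gender] += count
    # Convert to plain dicts, sorted by latitude for a tidy x axis.
    return {lat: dict(result[lat]) for lat in sorted(result)}

# Made-up sample values.
rows = [
    (1, {'Male': 3, 'Female': 4}),
    (2, {'Male': 1, 'Female': 2}),
    (3, {'Male': 9, 'Female': 9}),
]
lats = {1: 39.4712, 2: 39.4689, 3: None}
grouped = group_by_latitude(rows, lats)
print(grouped)
```

<p>Rounding is what merges nearby vias into a single bar; with more decimals there would be far too many bars to read.</p>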
<h3 id="scatter_plots"><a class="anchor" href="#scatter_plots">¶</a>Scatter plots</h3>
<p>This last graphic was very tricky to get right. It’s incredibly hard to find the extent of a city online! We were getting confused because some of the points were way farther from the centre of Cáceres than we expected, and the city background is a bit stretched even if the coordinates appear correct. But in the end, we did a pretty good job on it.</p>
<pre><code>def scatter_map(ax, data):
    xs = []
    ys = []
    areas = []
    for (long, lat), count in data.items():
        xs.append(long)
        ys.append(lat)
        areas.append(count / 100)  # Scale down, or the circles would be huge.

    # CACERES_MAP and CACERES_EXTENT are constants not shown in this post.
    if CACERES_MAP.is_file():
        ax.imshow(plt.imread(str(CACERES_MAP)), extent=CACERES_EXTENT)
    else:
        print('Note:', CACERES_MAP, 'does not exist, not showing it', file=sys.stderr)

    # s is the marker area; a low alpha keeps overlapping points readable.
    ax.scatter(xs, ys, s=areas, alpha=0.1)
</code></pre>
<p>This time, the keys in the data dictionary are points and the values are the total count of people at that point. We use a normal <code>for</code> loop to create the different lists. To decide how big each circle will be, we divide the count of people by some number, like <code>100</code>, or otherwise the circles would be huge.</p>
<p>If the map file is present, we render it so that we can get a sense of where the points are, but if the file is missing we print a warning.</p>
<p>Finally, we draw the scatter plot with a low alpha value (there are a lot of overlapping points). The result is <em>absolutely gorgeous</em>. (For some definitions of gorgeous, anyway):</p>
<p><img src="scatter_map.png" alt="" /></p>
<p>Just for fun, here’s what it looks like if we don’t divide the count by 100 and lower the opacity to <code>0.01</code>:</p>
<p><img src="scatter_map-2.png" alt="" /></p>
<p>That’s a big solid blob, and the opacity is only set to <code>0.01</code>!</p>
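<p>The two map constants used by <code>scatter_map</code> are not defined in the snippets shown here. The sketch below shows roughly what they could look like, given that <code>imshow</code> expects its <code>extent</code> as <code>(left, right, bottom, top)</code> in data coordinates. The path and the bounding box are placeholders and not the values behind the figures above; <code>inside_extent</code> is a hypothetical helper for spotting points that fall outside the image.</p>

```python
from pathlib import Path

# Hypothetical path and bounding box; NOT the values used for the figures above.
CACERES_MAP = Path('data/caceres.png')
# imshow's extent is (left, right, bottom, top) in data coordinates,
# so here: (min longitude, max longitude, min latitude, max latitude).
CACERES_EXTENT = (-6.45, -6.30, 39.42, 39.52)

def inside_extent(point, extent=CACERES_EXTENT):
    """Tell whether a (long, lat) point falls within the map image."""
    long, lat = point
    left, right, bottom, top = extent
    return left <= long <= right and bottom <= lat <= top

# Made-up points: one near the centre and one far outside the box.
positions = {(-6.37, 39.47): 1200, (-6.10, 39.90): 15}
visible = {p: c for p, c in positions.items() if inside_extent(p)}
print(visible)
```

<p>A check like this would have made the far-away points stand out early, and a bounding box whose proportions differ from the image is one possible reason for a stretched background.</p>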
<h3 id="drawing_all_the_graphs_in_the_same_window"><a class="anchor" href="#drawing_all_the_graphs_in_the_same_window">¶</a>Drawing all the graphs in the same window</h3>
<p>To draw all the graphs in the same window instead of getting four different windows, we made use of the <a href="https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.subplots.html"><code>subplots</code> function</a>, like this:</p>
<pre><code>fig, axes = plt.subplots(2, 2)
</code></pre>
<p>This will create a two-by-two grid of axes that we store in the <code>axes</code> variable (fitting name!). Following this code are the calls to the functions described above, where we access each individual axis and pass it in to be drawn on:</p>
<pre><code>pie_chart(axes[0, 0], genders)
date_tick(axes[0, 1], years)
stacked_bar_chart(axes[1, 0], latitudes)
scatter_map(axes[1, 1], positions)
</code></pre>
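<p>The <code>genders</code> and <code>years</code> arguments are not defined in the snippets shown. Since every <code>Census</code> value holds the counts for a single via, one plausible way to get city-wide tallies is to add them up, as sketched below. <code>CensusRow</code> is a hypothetical, reduced stand-in for <code>Census</code>, <code>totals</code> is a made-up helper name, and the numbers are invented.</p>

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class CensusRow:  # hypothetical, reduced stand-in for Census
    count_per_gender: dict = field(default_factory=dict)
    count_per_year: dict = field(default_factory=dict)

def totals(rows):
    """Add up the per-via dictionaries into city-wide tallies."""
    genders, years = Counter(), Counter()
    for row in rows:
        genders.update(row.count_per_gender)  # update() adds the counts
        years.update(row.count_per_year)
    return dict(genders), dict(years)

# Made-up rows, for illustration only.
rows = [
    CensusRow({'Male': 3, 'Female': 4}, {1990: 5, 2001: 2}),
    CensusRow({'Male': 1, 'Female': 2}, {2001: 3}),
]
genders, years = totals(rows)
print(genders)
print(years)
```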
<p>Lastly, we show the figure with all four graphics:</p>
<pre><code>plt.show()
</code></pre>
<p>Wrapping everything together, here’s the result:</p>
<p><img src="figures-1.png" alt="" /></p>
<p>The numbers in some of the graphs are a bit crammed together, but we’ll blame that on <code>matplotlib</code>.</p>
<h2 id="closing_words"><a class="anchor" href="#closing_words">¶</a>Closing words</h2>
<p>Wow, that was a long journey! We hope that this post helped you pick up some interest in data exploration; it’s such a fun world. We also offer the full download for the code below, because we know it’s quite a bit!</p>
<p>Which of the graphs was your favourite? I personally like the count per date, I think it’s nice to see the growth. Let us know in the comments below!</p>
<p><em>download removed</em></p>
</main>
</body>
</html>