Introduction
Nowadays digital road maps are used in a daily basis: from navigation to self-driving cars. However, its creation and maintenance are associated with high costs and exhausting manual activity. Therefore, many automated solutions have been proposed to infer road maps. Although many data sources can be used to extract road maps, for instance GPS tracks or aerial imagery, in this paper the focus is on satellite imagery.
The proposed approach in this paper, Sat2Graph, overcomes the inherent limitations present in prior works, which can be classified in two categories: pixel-wise segmentation-based approaches and graph-based approaches.
Segmentation-based approach
Given the growing popularity of deep learning and the excellent performance of convolutional neural networks (CNN) in the field of computer vision, a logical step to address this problem is the usage of CNNs. One of the proposed approaches consists in training an end-to-end cascade CNN to extract road segmentation from satellite imagery [1] (see Figure 1). Then, a binary threshold is applied to the road segmentation followed by morphological thinning to extract the road center-lines. Finally, a road graph is produced through tracing the single-pixel-width road center-lines.
Fig. 1. Cascade CNN proposed by Cheng et al. [9].
Graph-based approach
A recent graph-based approach proposes the construction of the road network graph in an iterative fashion [2]. Starting from a know location of the map, the method iteratively proposes the next point (i.e., vertex) to visit along the road. In order to propose the next point, a neural network is used to process the information present at the surrounding satellite imagery of the current location.
Advantages and disadvantages of segmentation-based and graph-based approaches
Despite recent advantages [3, 4, 5], both categories are still suffering from inherent limitations.
Table 1. Advantages and disadvantages of segmentation-based and graph-based approaches
Segmentation-based | Graph-based | |
---|---|---|
Advantage | Wider receptive field which helps capturing global information. | High road connectivity |
Disadvantage | Noisy and lower precision road connectivity due to an intermediate non-graph representation (i.e., segmentation map) and post-processing heuristic (e.g.: morphological thinning). | Iterative procedure which focuses more on local information rather than global information. |
By observing Table 1, approaches in these two categories often tradeoff with each other. Based on this observation, a natural question to ask is if it possible to combine both approaches into one unified approach that can benefit from the advantages of both.
Methodology
Sat2Graph
Sat2Graph is a new road network extraction approach which combines the inherent advantages of segmentation-based approaches and graph-based approaches into one simple, unified network. In order to achieve this, the authors design a novel encoding scheme, graph-tensor encoding (GTE). As its name suggests, this method allows to encode the road network graph into a three-dimensional tensor.
Fig. 2. Basic pipeline of Sat2Graph.
As shown in Figure 2, the input the neural network is the satellite image, and the output is the GTE representation of it. On one hand, Sat2Graph is overcoming the limitations of segmentation-based approaches by not depending on an intermediate non-graph representation but directly estimating the graph structure of the road in the form of a tensor. On the other hand, Sat2Graph also overcomes the limitations of the graph-based approaches by capturing the global information and computing the graph holistically (i.e., in one shot) and not iteratively.
Graph-Tensor Encoding (GTE)
Tensor structure
A visual representation of the tensor structure is shown in Figure 3. The first two dimensions of the tensor (T) correspond to the two spatial axes in the 2D plane. The vector at each spatial location u_{x,y}=\left[T_{x,y,1},T_{x,y,2},\ldots,T_{x,y,\left(1+3\cdot D_{max}\right)}\right] is used to encode the graph information. Here, the vector u_{x,y} has (1 + 3 \cdot D_{max})elements, where D_{max}is the maximum number of edges that can be encoded at position (x,y). Its first element p_{v} \in [0,1](vertexness) encodes the probability of having a vertex at position (x,y). Following the first element, there are D_{max} groups of 3 elements, each of which encodes the information of a potential outgoing edge from position (x,y). For the i-th group, its first element p_{e_{i}} \in [0,1] (edgeness) encodes the probability of having an outgoing edge pointing from (x,y) to \left(x+dx_i,y+dy_i\right). In the paper, the hyperparameter was set to six.
Fig. 3. GTE encoding of a graph.
Nonetheless, if no further modifications are made to the tensor encoding explained so far, there may be multiple geometrically equivalent representations of the same graph (i.e., isomorphic representations). For instance, every permutation of the edge encoding part of the tensor leads to the same graph. Hence, GTE only uses the i-th 3-element group to encode edges pointing toward a \frac{360}{D_{max}}-degree sector from \left(i-1\right)\cdot\frac{360}{D_{max}} degrees to i\cdot\frac{360}{D_{max}} degrees. This restriction is also depicted in Figure 3. This strategy imposes a restriction on the encoded graphs – for each vertex in the encoded graph, there can only be at most one outgoing edge toward each \frac{360}{D_{max}}-degree sector. This might not impact the representation ability of GTE for most road graphs. However, it is thought that it may present problems in very curvy roads where the angle between each road is less than \frac{360}{D_{max}} degrees.
Encoding stage
Encoding a road graph into GTE is straightforward. First, the encoding algorithm generates the minimum number of evenly distributed vertices along the segment of straight road in the road network graph. After this step, all vertices are mapped to the 3D tensor so that the algorithm sets the vertexness p_{v} of (x,y) to 1 when there is a vertex in position (x,y) and 0 otherwise. The same mechanism is followed for the -th directional block of the tensor and the edgeness p_{e_{i}}.
Decoding stage
In this stage, GTE’s decoding algorithm converts the predicted GTE of a graph back to the regular graph format G=\{V,\ E\}. The decoding algorithm consists of two steps:
- Vertex extraction: only those vertices with vertexness p_{v} greater than a threshold p_{thr} are considered potential vertices.
Edge connection stage: for each candidate vertex v \in V, the decoding algorithm connects its outgoing edges to other vertices. The i-th edge of vertex v \in V with edgeness p_{e_{i}} greater than p_{thr} is connected to the nearest vertex u following the distance function,
d(v,i,u) = \|(v_{x} + dx_{i}, v_{y} + dy_{i}) - (u_{x}, u_{y})\| + w\cdot cos_{dist}((dx_{i}, dy_{i}), (u_{x} - v_{x}, u_{y} - v_{y})) where cos_{dist}(v_{1}, v_{2}) is the cosine distance and w is the weight of the cosine distance.
Training
The training of Sat2Graph is conceived as a mix between classification and regression. The authors chose the following loss function,
L\left(T,\hat{T}\right)=\sum_{\left(x,y\right)\in\left[1\ldots W\right]\times\left[1\ldots H\right]}\left(L_{CE}\left(p_v,\widehat{p_v}\right)+\widehat{T_{x,y,1}}\cdot\left(\sum_{i=1}^{D_{max}}\left(L_{CE}\left(p_{e_i},\widehat{p_{e_i}}\right)+L_2\left(\left(dx_i,dy_i\right),\left(\widehat{dx_i},\widehat{dy_i}\right)\right)\right)\right)\right) |
where the term L_{CE}\left(p_v,\widehat{p_v}\right) refers to the cross-entropy of the vertexness. The second term computes the cross-entropy of the edgeness, L_{CE}\left(p_{e_i},\widehat{p_{e_i}}\right), and the L_{2} loss of the edge vectors, L_2\left(\left(dx_i,dy_i\right),\left(\widehat{dx_i},\widehat{dy_i}\right)\right), only where there is a vertex in the ground truth, \widehat{T_{x,y,1}}.
One of the advantages that this method provides is that it is agnostic to the CNN backbones. In this paper, the Deep Layer Aggregation (DLA) [10] segmentation architecture is chosen as the CNN backbone.
Experiments and results
The performance of Sat2Graph was compared with the performance of four segmentation-based models, ranging from a straightforward UNet [6] application to more complex models [7][3], and one graph-based approach [2]. The data was acquired from two large datasets containing satellite imagery: City-Scale Dataset and SpaceNet Roads Dataset. The models were evaluated through two metrics, TOPO [8] and APLS[9]. These metrics focus more on the topology correctness of the inferred road rather than edge-wise correctness. TOPO metric measures the similarity of sub-graphs in terms of precision, recall and F1-score. APLS measures the quality of the shortest path between two locations on the graph, where greater values imply similar shortest paths.
Quantitative evaluation
The best achievable TOPO F_{1}-score and APLS score of each approach are shown in Table 2. For a given precision, Sat2Graph always has the best recall; and for a given recall, Sat2Graph always has the best precision. Moreover, Sat2Graph surpasses all other approaches on APLS metric. More interestingly, at the bottom part of the table Sat2Graph is compared to Seg-DLA, which uses the same CNN backbone as Sat2Graph. Here, it is clear that the superiority of Sat2Graph in comparison with the rest of approaches comes from the graph-tensor encoding rather than the stronger CNN backbone. Finally, the unique property of Sat2Graph to infer stacked roads was also evaluated. In this evaluation, Sat2Graph reported a precision of 83.11% and a recall of 49.81%. The reason for the small recall might be some small roads under wide highways roads which are missing entirely.
F_{1}-score and APLS score.
Table 2. Comparison of the best achievable TOPOIn Figure 4 the precision-recall curves for different approaches are shown. These curves were acquired by varying on major hyperparameter in every approach, which is often a probability threshold. Note that Sat2Graph is the only approach which surpasses all models at all precision-recall points for both datasets.
Fig. 4. TOPO metric precision-recall curves
Qualitative evaluation
In regular urban areas, the existing segmentation-based approach Seg-DLA has already given decent results in terms of both precision and recall, even if the satellite imagery is full of shadows and occlusions. Compared to Sat2Graph, the most apparent remaining issue of the segmentation-based approach appears at parallel roads (see Figure 5). Again, it is thought that this issue is from the fundamental limitation of segmentation-based approaches – the road-segmentation intermediate representation.
Fig. 5. Results of different approaches in a regular urban area.
As mentioned before, the inference of stacked roads is a unique feature of Sat2Graph. None of the other approaches can handle stacked roads. However, Sat2Graph may still fail to infer stacking roads in some complicated scenarios such as in Figure 6 (d-e).
Fig. 6. Stacked roads inference using different approaches
Conclusion
The creation and maintenance of digital road maps are associated with high costs and exhausting manual activity. Automated solutions for road network extraction tipically fall into two categories, segmentation-based approached and graph-based approaches. However, both approaches suffer from complementary limitations and advantages. In this work, the authors propose a new road network extraction approach, Sat2Graph, which combines the inherent advantages of both categories into one simple, unified framework. To do so, the authors present a novel encoding scheme, graph-tensor encoding (GTE), to encode the road network graph into a 3D tensor representation. Therefore, Sat2Graph is able to surpass existing solution in terms of topology-similarity metric at all precision-recall points in an evaluation over two large datasets. In addition, Sat2Graph naturally infers stacked roads that none of the existing approaches can handle.
Personal review
This paper tackles the problem of road network inference from a very interesting point of view. Here, the novelty is not a new fancy neural network, a new layer or a loss function. This is just about reinterpreting the data and its structure. It makes me remember that not all effort should be taken into make great deep neural networks, but data is very important, and it can make a difference.
In my opinion, one powerful feature of this method is that it is model agnostic. Therefore, it makes me think about other problems where this GTE strategy can be applied. For instance, a potential candidate for implementing GTE would be 2D medical images in which the interest is on delineating or tracking some curvy structure like vessels, cells or neurons. In addition, the ability of detecting stacked roads also makes GTE very interesting for processing such medical images (e.g.: detection of stacked vessels).
On the other hand, the GTE strategy presents an inherent limitation when it comes to edge encoding. The outgoing edges of a vertex are limited in number and directionality. This supposes a strong restriction which does not seem to influence the results in the current paper, yet it may limit the application of GTE to other problems.
Finally, I would like to highlight the clear structure and clarity of the explanations in the paper. It facilitates the reading and assimilation of the exposed ideas.
References
[1] Cheng, G., Wang, Y., Xu, S., Wang, H., Xiang, S., Pan, C.: Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing 55(6), 3322 - 3337 (2017)
[2] Bastani, F., He, S., Abbar, S., Alizadeh, M., Balakrishnan, H., Chawla, S., Madden, S., DeWitt, D.: RoadTracer: Automatic extraction of road networks from aerial images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4720 - 4728 (2018)
[3] Batra, A., Singh, S., Pang, G., Basu, S., Jawahar, C., Paluri, M.: Improved road connectivity by joint learning of orientation and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10385 - 10393 (2019)
[4] Chu, H., Li, D., Acuna, D., Kar, A., Shugrina, M.,Wei, X., Liu, M.Y., Torralba, A., Fidler, S.: Neural turtle graphics for modeling city road layouts. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4522 - 4530 (2019)
[5] Li, Z.,Wegner, J.D., Lucchi, A.: PolyMapper: Extracting city maps using polygons. arXiv preprint arXiv:1812.01497 (2018)
[6] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234{241. Springer (2015)
[7] Mattyus, G., Luo, W., Urtasun, R.: DeepRoadMapper: Extracting road topology from aerial images. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3438 - 3446 (2017)
[8] Biagioni, J., Eriksson, J.: Inferring road maps from global positioning system traces: Survey and comparative evaluation. Transportation research record 2291(1), 61 - 71 (2012)
[9] Van Etten, A., Lindenbaum, D., Bacastow, T.M.: Spacenet: A remote sensing dataset and challenge series. arXiv preprint arXiv:1807.01232 (2018)
[10] Yu, F., Wang, D., Shelhamer, E., Darrell, T.: Deep layer aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2403 - 2412 (2018)