Crack Detection Paper Reading

Paper LIST:

Autonomous concrete crack detection using deep fully convolutional neural network 2019
DeepCrack: Learning Hierarchical Convolutional Features for Crack Detection 2018
Holistically-Nested Edge Detection(经典边缘检测算法HED) 2015
Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection 2019
A review on computer vision based defect detection and condition assessment of concrete and asphalt civil infrastructure 2015 TODO: 精读
CrackGAN 2020
SegNet 2016
Feature Pyramid Networks for Object Detection 2017
RCF
Bi-Directional Cascade Network for Perceptual Edge Detection 2019

THE FOLLOWINGS ARE MAINLY CLASSIFICATION TASK
Image based techniques for crack detection, classification and quantification in asphalt pavement: a review 2017 CLF TODO
Crack and Noncrack Classification from Concrete Surface Images Using Machine Learning 2018 Structural Health Monitoring
Review and Analysis of Crack Detection and Classification Techniques based on Crack Types 2019 TODO
Concrete Cracks Detection Based on Deep Learning Image Classification LINK 2019
Multi-scale classification network for road crack detection 2018 CLF TODO
Learning relaxed deep supervision for better edge detection 2016 （improvement for HED）

Semantic Segmentation:

Dilated Residual Networks 2017 LINK
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs 2017 LINK
Multi-Scale Context Aggregation by Dilated Convolutions 2016 LINK
Understanding Convolution for Semantic Segmentation 2018 LINK
Rethinking Atrous Convolution for Semantic Image Segmentation 2017 LINK

Reading More：

From Paper number 16：

Deepedge 2015
High-for-low and low-for-high: Efficient Boundary detection from deep object features 2015
Canny A computational approach to edge detection 1986
Deeply supervised Nets 2015

In General

This part is for answering some general questions for this topic.

Q1: Why using Deep Learning methods instead of traditional CV algorithm:

Cracks may constantly suffer from noise in the background, leading to poor continuity and low contrast. That lead to the traditional CV algorithms have bad performance on these cases.

Q2: Which kind of tasks in CV are corresponding to Crack Detection?

Meanly 3 task:

Semantic Segmentation
Edge Detection
Classification

Q3: Main DataSets:

Concrete:

CRACK500
GAPs384
Cracktree200
sCFD
Aigle-RN & ESAR & LCMS

Pavement:

Q4: Metric For Crack Detection Task

最常见的指标之一就是PR（precision and recall）了，将二者统一，即变为F1-measure，这个在各个论文中评价crack detection 模型的性能中经常出现。

$$F_1=\frac{2*PR}{P+R}$$

该值位于0~1之间，约接近1认为模型性能越好。

其缺点为：无法很好的表明检测到的裂缝和地面真实情况之间的重叠程度，尤其是在裂缝较大时。

ODS & OIS

这两个是edge detection中的指标。边缘检测领域的标准准则是固定scale or threshhold(ODS)数据集上的最佳F值，每个图像中最佳scale or threshhold数据集上的平均F值(OIS)。

AP(准确率)
FPHBN中提出的average intersection over union(AIU)。

其中$N_t$表示阈值t的总个数,t在区间[0.01,0.99]内，间隔是0.01;对于给定的阈值t, $N_{pg}^t{}$为预测区域与真实裂缝区域相交（重叠）区域的像素个数，$N_p^t$和$N_g^t$分别表示预测裂缝区域和真值裂缝区域的像素数。因此，AIU在0到1之间，值越大，性能越好。数据集的AIU是数据集中所有图像的AIU的平均值。

提出的主要目的就是弥补PR指标的缺点，作为一个complementary measurement。AIU takes the width information into consideration to evaluate detections and illustrates the overall overlap extent between detections and ground truth.

Paper’s Detail:

This part included the detail of each paper that I have perused.

Autonomous concrete crack detection using deep fully convolutional neural network

Author: Cao Vu Dung, Le Duc Anh Paper Link

Journal: Automation in Construction 2019

Key Point: Concrete Crack Detection, FCN(Fully CNN), Semantic Segmentation(Use FCN), Evaluate crack density, Test the Model on a real video

数据集：https://data.mendeley.com/datasets/5y9wdsg2zt/1

模型很简单：Classification + Semantic Segmentation(FCN)

Semantic Segmentation Result：

总结：特点不多，主要将CV中的Semantic Segmentation的经典方法（FCN）引入至Crack Detection领域，并且使用了较大的预训练( VGG16, ResNet, Inception)模型，获得结论使用预训练模型可以获得更好的预测结果，在实验部分做了不少工作。最后用了一个现实中的视频来验证一下model的效果。

DeepCrack: Learning Hierarchical Convolutional Features for Crack Detection

Author: Qin Zou er al Paper LINK

Journal: IEEE TRANSACTIONS ON IMAGE PROCESSING(Published at 2018 Oct)

KEY WORDS: 基于SegNet，fuse the feature map from encoder and decoder at different scales，crack/line/edge detection

整体模型：

模型核心：

In DeepCrack, they first pairwisely fuse the convolutional features of the encoder network and decoder network at each scale, which produces the single-scale fused feature map, and then combine the fused feature maps at all scales into a multi-scale fusion map for crack detection. 这也是模型相比于SegNet最大的改进。

下图为SegNet的模型结构：

下图为encoder和decoder的feature map是如何聚合的（即论文中的skip-layer）。

Loss Function

使用的是cross entropy（并没有对于不平衡类别即：imbalance classification做处理）

最终的loss是由每层聚合的feature map和多层聚合的feature map的损失相加。

Experiments And Results

Modeling by caffee

Datasets:

Training: CrackTree260
Testing: CRKWH100, CrackLS315, Stone331

Compare With Other Models

Other Experiments

作者也比较了每一个scale对于最终结果的影响，即：在最终的loss中对于每一层的loss增加一个权重。
是否使用预训练模型的比较。有趣的是预训练的效果并不好，作者给出解释是：because that the pretrained model is well fit for nature image segmentation and is impossible or very difficult to be fine-tuned for crack detection.
作者讨论了在真实label中加入噪声对于模型的影响。结论是：对于噪声并不敏感。
作者还比较了不同的up-sampling方法的影响。即：传统的在max pooling时记录index的方法和bilinear interpolation的方法。结论：传统的在max pooling时记录index的up-sampling方法更好。
Different Weights on the Crack and Non-Crack Background，即在计算loss时对于不同的类别（即 crack or non-crack）赋予不同的权重。结论：会对性能提升，降低了类别不平衡的影响。
作者还比较了Running Efficiency, 上图给出的FPS为对于512×512大小的图片的实时处理速度。DeepCrack为：0.153 second per image。不过其使用的设备是GeForce GTX TITAN-X GPU和2.3GHz主频的E5-2630 CPU.

Holistically-Nested Edge Detection(HED)

Author: Saining Xie er al Paper LINK

Conference: CVPR 2015

KEY WORDS: 基于VGGNET，mutil-scale fuse，edge detection

此paper为使用CNN(VGG)的经典边缘检测模型，后期出现在多篇Crack Detection模型的baseline中。下一篇Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection得部分思路也是来源于HED的。这里主要总结和介绍一下模型部分。

其中Holistically表示该算法试图训练一个image-to-image的网络；Nested则强调在生成的输出过程中通过不断的集成和学习得到更精确的边缘预测图的过程。HED使用VGG改造的网络，针对整张图片提取特征信息，使用multi-scale fusion，multi-loss的方法。

值得注意的是，该方法的速度还是比较快的，可以达到0.4s per image(on the NYU Depth dataset)

Model Detail

下图为整个HED的模型结构。特征提取使用的是pre-trained的VGG网络。

作者给出performance提升的原因是：

patch-based CNN edge detection methods
FCN-like image-to-image training allows models to simultaneously train on a significantly larger amount of samples
deep supervision in model guides the learning of more transparent features
interpolating the side outputs in the end-to-end learning encourages coherent contributions from each layer(used in fuse)

本文的一大亮点就是使用了与以往不同的multi-scale fuse的方法。如下图所示：

(a)Multi-stream learning 示意图，可以看到图中的平行的网络下（不同的网络有不同的参数和receptive field），输入是同时输进去，经过不同路的network之后，再连接到一个global out layer得到输出。

(b)Skip-layer network learning 示意图，该方法主要连接各个单的初始网络流得到特征图，并将图结合在一起输出。

这里（a）和（b）都是使用一个输出的loss函数进行单一的回归预测，而边缘检测可能通过多个回归预测得到结合的边缘图效果更好。

(c)Single model on multiple inputs 示意图，单一网络，图像resize方法得到多尺度进行输入。当用于test端时，有点类似于集成学习的感觉。

(d)Training independent networks ，通过多个独立网络分别对不同深度和输出loss进行多尺度预测，该方法下训练样本量较大。

(e)Holistically-nested networks，本文提出的结构，从（d）演化来，类似地是一个相互独立多网络多尺度预测系统，但是将multiple side outputs组合成一个单一深度网络。

这种方法的好处在于：基于hidden layer的supervision更有利于performance的提升；也相对的减小了模型的复杂度；在fuse时可以自由的调配每个scale的权重，或者通过学习来学到一个更好的参数。

作者主要使用了VGG16，并对其进行小幅度修改。

Loss Function

总的side-output loss为每一个side-output的加权和，权重可以通过学习来获得。由于edge detection任务的特性，大部分pixel为非edge的，所以存在明显的类别不平衡，每个side-output的loss使用下图的公式来解决这个问题。

另外一部分的loss是fuse loss。

h为fuse的一组参数。最终的loss和学习目标为：

Experiment Results:

Results on BSDS500 datasets.

Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection

Author: Fan Yang er al Paper LINK

Conference: CVPR 2019

KEY WORDS: Pavement Crack Detection，主要将Feature Pyramid Network(FPN)应用于Crack Detection task，提出新的 measurement for crack detection: average intersection over union (AIU)

模型细节：

FPHBN使用了HED（本篇文档中也有HED的介绍与总结）作为其backbone net，在其之上引入了FPN的思想（FPN在本文档中也有介绍与总结）。

下图就是HED与FPHBN的模型对比图，可以很明显的看到，在HED后面加入了类似于FPN的思想（top-down + skip connection）和Hierarchy Boostings。Hierarchy Boostings可以起到平衡top layer与bottom layer的权重，可以让模型pay attention to hard examples.

整个模型分为四个部分：

A bottom-up architecture for hierarchical feature extraction
A feature pyramid for merging context information to lower layers using a top-down architecture
Side networks for deep supervision learning. The side network at each level performs crack prediction individually.
A hierarchical boosting module to adjust sample weights in a nested way

这里主要介绍一下Hierarchical Boosting，1~3部分即为HED+FPN，这二者在本文中都有介绍。

使用Hierarchical Boosting是为了解决HED的一个缺点，即：the network cannot effectively learn parameters from misclassified samples during training phase. 这个是因为edge detection task 本质的类别不平衡导致的，虽然在HED中引入了系数β来损失函数对不同类别的权重，但这种方法并不能区分easy and hard samples，因为类别不平衡很容易导致损失函数主要被不平衡类别占据。

作者针对这个问题（还有一些其他解决方法），提出了Hierarchical Boosting来reweigh samples。

想法来源于，回看feature pyramid层，可以看到，上层的信息会传递到下层，这种信息会告诉下层那些样本是hard的，这样这些下层的网络就可以pay attention to这些hard样本了。于是作者就想到来facilitating communication between adjacent side networks.

实现起来就是重写每一个side network的损失函数：$d_i^{m+1}$ 为第m+1个side network的预测值与真实值在第i个像素的差异。

Experiments And Results

可以看到在inference时的速度也比HED有了提升。

Feature Pyramid Networks for Object Detection

Author: Lin er al Paper Link

Journal: CVPR 2017

Key Point: Adding Pyramid representation into Deep Learning Methods, Multi-scale and pyramidal hierarchy

Other’s Note: ZhiHu Link

总结： FPN本质是另一种特征提取的结构，获得的representation可以用于很多不同的task，例如object detection 或者 semantic segmentation(crack detection就可以看为是这种任务)。其最大的好处是：在保持原有性能的情况下，极大的降低了模型的复杂度，让模型更feasible。现在FPN已经经常作为各种Detecton和segmentation算法的标准组件。

下图给出了几种常见的pyramid形式的特征提取方法。例如b方法就是常见的CNN所使用的结构。

值得注意的是，在图中边框更粗的代表了更high-level的语义的特征（很符合直观）。个人理解：许多常见的multi-scale方法就是类似c的方法，将每层CNN的结果聚合作为预测结果，如SSD(Single Shot Detector)，SegNet和上面的HED。作者给出的(c)方法的缺点为：This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. The high-resolution maps have low-level features that harm their representational capacity for object recognition.

(a) 中的 Featurized image pyramid结构在ImageNet 或 COCO上取得了很好的表现，因为这种方法产生了multi-scale的representation而且每层特征提取都semantically strong。这种方法的主要问题就是：inference time过长，导致实际难以应用。

下图为FPN网络的具体图示，作者也引入了类似ResNet的skip connection。

整体结构主要包含三个部分：

Bottom-up pathway 正常使用backbone网络，如Resnet
Top-down pathway 使用nearest neighbor upsampling
Lateral connections 如下图所示。

SegNet: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Author: Vijay Badrinarayanan er al

Journal: TPAMI 2015

Key Point: Semantic Segmentation，compare different upsampling methods

本文提出的SegNet和FCN、DeepLab-LargeFOV、DeconvNet做了比较，这种比较揭示了在实现良好分割性能的前提下内存使用情况与分割准确性的权衡。
SegNet的主要动机是场景理解的应用。因此它在设计的时候考虑了要在预测期间保证内存和计算时间上的效率。
定量的评估表明，SegNet在和其他架构的比较上，时间和内存的使用都比较高效。

模型细节

Encoder: VGG16，移除全连接层(基本操作)

Decoder: Use pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling.

特点：相比于最经典的FCN等，有更少的参数，相对更快的inference time。

实际上，SegNet与FCN最大的差别就在于upsampling的不同。作者在文章中还探讨了不同upsampling对性能的影响和每个方法可能的优劣。

通过上表分析，可以得到如下分析结果：

bilinear interpolation 表现最差，说明了在进行segmentation时，decoder是可学习的还是非常重要的。
SegNet-Basic与FCN-Basic对比，均具有较好的精度，不同点在于SegNet存储空间消耗小，FCN-Basic由于feature map进行了降维，所以时间更短。
SegNet-Basic与FCN-Basic-NoAddition对比，两者的decoder有很大相似之处，SegNet-Basic的精度更高，一方面是由于SegNet-Basic具有较大的decoder,同时说明了encoder过程中低层次feature map的重要性。
FCN-Basic-NoAddition与SegNet-Basic-SingleChannelDecoder：证明了当面临存储消耗，精度和inference时间的妥协的时候，我们可以选择SegNet，当内存和inference时间不受限的时候，模型越大，表现越好。

作者总结到：

encoder特征图全部存储时，性能最好。这最明显地反映在语义轮廓描绘度量（BF）中。
当限制存储时，可以使用适当的decoder（例如SegNet类型）来存储和使用encoder特征图（维数降低，max-pooling indices）的压缩形式来提高性能。
更大更复杂的decoder提高了网络的性能。

Concrete Cracks Detection Based on Deep Learning Image Classification

Author: Wilson Ricardo Leal da Silva 1,OrcID andDiogo Schwerz de Lucena 2 LINK

Journal: The 18th International Conference on Experimental Mechanics (ICEM 2018) 材料领域的会议

总结：模型真的非常简单。使用VGG16作为Backbone Net，在其上面做transfer Learning（pre-train model）。

整体感觉，没想到2018年的会议上也有这么水的文章，不过引用了一些可以扩展阅读的文献。

模型训练使用NVIDIA Tesla K80（约12G显存）

使用无人机（UAV）做Crack detection

Robust crack detection for unmanned aerial vehicles inspection in an acontrario decision framework 2015
Development of Crack Detection System with Unmanned Aerial Vehicles and Digital Image Processing 2015

Deep Learning System for Automated Cracking

Basis and Design of a Deep Learning System for Automated Cracking Survey 2017
Automatic Pavement Crack Detection Based on Structured Prediction with the Convolutional Neural Network. 2018

Crack and Noncrack Classification from Concrete Surface Images Using Machine Learning

Author: Hyunjun Kim er al LINK

Journal: Structural Health Monitoring 2018

Key Point: Concrete crack identification/classification，feature preprocessing(use speeded-up robust features) VS CNN, Detect existence and location of cracks

作者给出的contribution为：

an efficient classification framework based on a crack candidate region (CCR) is proposed to effectively categorize cracks and noncracks
comparative analysis between SURF-based and CNN-based methods is conducted to evaluate the classification performances
a comprehensive crack identification in the presence of crack-like noncracks is conducted for practical applications

作者给出的SURF（speeded-up robust feature）是一种计算机视觉领域中的特征点检测的传统算法（无学习能力）。可以获得distinctive features。其主要包含以下两个步骤：

interest point detection
interest point description

SURF-based classification

这部分与CV中的Bag-of-Words(Visual categorization with bags of keypoints 2004)的流程很相似，共分为三个阶段：

Feature Extraction: 使用SURF对每个4*4的子区域生成一个64维feature vector。
Visual vocabulary construction(K-meas Clustering): 所有interest point的feature vector都用于visual word，该词用作代表性的小图像段，用来描述诸如颜色，形状和表面纹理之类的特征。之后在使用K-means来进行聚类，聚类的结果被分为许多组，这些组就是visual vocabulary or the bag of features.
Classification(SVM): 训练时，先试用feature extraction+k-means生成词汇表，然后BoW将每一个图片在每一个组内的数目统计下来，生成一个feature histograms来输入给SVM训练。

下图为SURF-based classification和CNN-based classification(使用Fast RCNN)的示意图。

模型细节

主要由两部分组成：

generation of CCRs
SURF-based or CNN-based classifications

具体的流程为：

先将图片binarization，这样可以找到crack和crack-like noncrack，并对得到的结果的每个部分赋予类别。然后，使用CNN or SURF来提取特征，之后在进行分类。

CCR（crack candidate region）

下图为生成CCR的流程图。其中image binarization: 将所有pixel的取值二值化，0为黑色，1为白色计算时基于一个阈值，使用的方法为Sauvola’s binarization（该方法在noisy和high-contrast的图像上性能出色）来计算该阈值。

作者给出这种方法的好处为：

获得的CCR只关注cracks和crack-liked noncrack.
计算时更加高效。因为只有被选出的CCR才会进入train and test stages.

SURF-based and CNN-based classification models

大体流程和对比如下图所示：

CNN框架使用AlexNet。生成feature后，进入各自不同的分类模型，分类模型的介绍在上一节有写。

最终模型的流程图为：

实验部分

居然是使用MATLAB来implement的！训练i7-7700 GTX1080，最终的分类结果。

作者在多种超参数组合下，计算了各种性能指标，包括：P、R，F1、Acc，计算时间等。

结论是CNN-based methods的性能更好的。不过在训练速度上难以对比，因为CNN based method使用的是GPU，而SURF使用的是CPU。训练速度对比图：

只不过在有些样例上，使用CNN的方法效果不如SURF，作者给出的原因是：the local features extracted using the SURF can in some instances correctly classify the CCRs hat were incorrectly categorized using the CNN-based method.

作者猜测使用：使用DeepLearning提取的feature和local feature的结合会进一步提高魔性的性能。

这里作者给出了一个很有趣、也很合理的结论，作者使用CCR后，相当于在训练时的样本基本上只包括crack 和 crack-like noncrack，这样相比于使用crack和intact surface来训练，极大的提升了模型性能。正如下图所示：

Richer Convolutional Features for Edge Detection(RCF)

Author: Yun Liu er al

Journal: CVPR 2017

另一篇经典的edge detection的算法。算法的特点：性能优于HED，速度基本与HED持平。其中也提出的Fast RCF：**在BSDS500数据集上，ODS指标可以达到0.806 with(30 FPS)**。人类在这个benchmark上的水平为：0.803.

相比于HED的改进就在于以下三点：

使用了每次卷积后的结果，而不是同HED只使用每个stage的最后一个输出
改变了loss，引入了对于类别不平衡的修正
引入Multiscale Hierarchical Edge Detection，会从一定程度上提升准确度，显然会导致运算速度的下降。

本质：exploits multi-scale and multi-level information

Architecture

同HED，也是使用VGG16network，将VGG16的5个conv阶段分开，分别产生一个side-output.（这部分与HED很相似）。

下图为RCF的网络结构：

整体就是一个VGG16，与HED的不同之处在于HED是在每一个conv stage的最后取最后的输出做multi-scale fuse，拼接后再过一个1*1的卷积得到最终的结果。

而RCF进一步的使用了网络中每次卷积的结果，在每个stage的conv的结果都被输出而聚合。这也正是paper的标题：Richer Convolutional Features的由来。

NOTE：每个stage的划分标准是根据max pooling layer来分割的。

Loss

使用的每一个loss如下图：

可以看到本质就是一个二分类的交叉熵，训练数据本身的正例和反例的不均匀性（即edge在图中的pixel的数量是远远小于非edge的），所以α和β来起到一个类别平衡的作用。而超参数λ用来balance positive and negative samples.

最终的loss是各个loss的和，如下图所示：

Multiscale Hierarchical Edge Detection

在test time时，为了提高模型的准确性，使用了image pyramids，有点类似于使用集成学习的方法。具体来讲：we resize an image to construct an image pyramid, and each of these images is input to our single-scale detector separately. Then, all resulting edge probability maps are resized to original image size using bilinear interpolation. At last, these maps are averaged to get a final prediction map.

即如下图所示：

Experiments

训练时也使用了同HED一样的数据增强。在NYUD数据集上，为了使用深度信息，作者使用HHA，来进一步做对比。

CrackGAN

Author: Kaige Zhang, Yingtao Zhang, and H. D. Cheng LINK

Journal: CVPR 2020

Key Point: Pavement crack detection，generative adversarial learning, partially accurate ground truths(GTs)

Problems

Why label is partially accurate GTs?

在我们获取数据集的时候（这里尤其是pavement数据），往往是通过在一辆车上固定一个相机来拍摄，之后再由人工手工标注而获得的。由于是通过这种方式来获取label的，crack往往都非常的细，相对于background而言是很少的，而且他们的边界也相对模糊。这也导致了对于pixel-level的ground truth标记困难。所以在实践中，往往只将crack标记为1-pixel的曲线，这种GT也被称为labor-light GT（减小了人工标注的成本）。所以很明显，这种GT并不是完全准确的（pixel-level），所以被称为partially accurate ground truths(GTs)
Then What Happened when using FCN-based methods？

显然，这种情况下，这是一个明显的类别不平衡问题，前人的解决方法往往是使用对于loss做一个修正。作者认为这种方法是无法解决这个问题的，因为partially accurate GTs的存在。所以就导致了the network will simply converge to the status that treats the entire crack image as BG (labeled with zero), and still can achieve a good detection accuracy or loss (BG-samples dominate the accuracy calculation). 这就是ALL BLACK problem。此问题也经常出现在其他的pixel-level pavement crack detection之中。

So, CrackGAN is solving:

solving ALL BLACK issue
proposes the crack-patch-only (CPO) supervised adversarial learning and the asymmetric U-Net architecture to perform
the end-to-end training
the network can be trained with partially accurate GTs generated by labor-light method which can reduce the workload of preparing GTs significantly
solve data imbalance problem which is the byproduct of the proposed approach

Model Architecture

D is a pretrained discriminator obtained directly from a pre-trained DC-GAN using crack-GT patches only. 这类的discriminator会令网络一直产生包含crack GT的image，作者认为这个是解决ALL BLACK问题的关键。

作者为了解决这个问题，在loss中加入了额外的generative adversarial loss.

实际上就是使用一个预训练好的discriminator（训练一个只使用crack ground-truth的data，这样如果出现没有crack就认为其是fake，这样的discriminator作者称为one-class）来防止出现将整个图片都预测为ground truth的情况。

下图就是对于discriminator预训练的过程（训练一个DC-GAN），其中CPO是由人工产生的。文中多次提到的COP-supervision指的是：the training data are prepared with crack patches only, without involving any non-crack patches and “all black” patches.

训练好这个discriminator后，这个discriminator和模型的核心 U-net一起训练。

至于为什么只使用CPO-supervision就可以让整个模型也处理不包含crack的图像，作者在C. Asymmetric U-Net for BG-image translation给出了解释。目前这部分还不是很明白，涉及到了pix2pix GAN的Receptive field. 而我本人对于GAN的了解不是很多，所以部分之后会在了解了GAN之后继续更新的。

最后还有一部分是加入了一个pixel-level的loss. y是经过处理的partially accurate GT（将1-pixel width的crack变为3-pixel宽）。直观来讲：就是U-net生成结果的crack在扩展后的真实label内。

作者还指出，这个asymmetric U-net可以处理任意尺寸的input image.

Reference

K. Zhang, H. D. Cheng, and B. Zhang, “Unified approach to pavement crack and sealed crack detection using pre-classification based on transfer learning,” J. Comput. Civil Eng., vol. 32, no. 2 (04018001), 2018.
K. Zhang, H. D. Cheng, and S. Gai, “Efficient Dense-Dilation Network for Pavement Crack Detection with Large Input Image Size,” in Proc. IEEE ITSC 2018, Maui.
F. C. Chen, and M. R. Jahanshahi, “NB-CNN: Deep learning-based crack detection using convolutional neural network and Nave Bayes data fusion,” IEEE Transactions on Industrial Electronics, vol. 65, no. 5, pp.4392-4400, 2018.
S. Park, S. Bang, and H. Kim, “Patch-Based Crack Detection in Black Box Images Using Convolutional Neural Networks,” Journal of Computing in Civil Engineering, vol. 33, no. 3 (040190170), 2019.
N. D. Hoang, Q. L. Nguyen, and V. D. Tran, “Automatic recognition of asphalt pavement cracks using metaheuristic optimized edge detection algorithms and convolution neural network,” Automation in Construction, vol. 94, pp. 203-213, 2018.
K. Gopalakrishnan, S. K. Khaitan, A. Choudhary, and A. Agrawal, “Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection,” Construction and Building Materials, vol. 157, pp. 322-330, 2017.
Q. Zou, Z. Zhang, Q. Li, X. Qi, Q. Wang, S. Wang, “DeepCrack: Learning hierarchical convolutional features for crack detection,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1498-512, 2019.
F. Yang, L. Zhang, S. Yu, D. Prokhorov, X. Mei, H. Ling, “Feature pyramid and hierarchical boosting network for pavement crack detection,” arXiv preprint arXiv:1901.06340. 2019.
M. Mirza, and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE CVPR, 2017.
A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proc. ICLR, 2016.

Learning relaxed deep supervision for better edge detection

Author: Yu Liu er al LINK

Journal: CVPR 2016

文中很有特点的一点就是使用：relaxed supervision，即：使用额外的relaxed label，这些relaxed label是使用simple detector来获取的（比如：Canny算子）之后再与真实的ground truth 相结合。

本质来讲就是RDS(本篇文章提出的模型)使用relaxed labels 和额外的训练数据来 retrain HED，这样可以获得更好的性能的提升。

问题引出：

作者认为HED的模型对于所有的intermediate layer只使用了一个general supervision（有真实标记的）。作者认为这样让网络的diversities消失了，即hierarchical layer(CNN)不同层所获取特征不再具有多样性。所以作者就想引入：diversities supervision 来重新赋予网络的CNN layer重新获得coarse to fine的表示。这种diversities supervision就是作者本文提出的relaxed deep supervision(RDS).

RDS 加入了额外的label（原有的label是edge为positive，non-edge为negative）。额外加入的label使用simple detector来获取的（比如：Canny算子）。之后将这些relaxed label加入的原始的ground truth来训练。

Contribution：

提出了RDS这种relaxed label来提升HED算法的性能
因为在许多edge detection任务中，数据标注的成本很高，导致训练数据很少，例如BSDS500只有200个训练数据。作者还证明了使用使用一些在大型数据集上预选练的模型，之后fine-tune，可以提升模型的性能。

Related Work：

pixel-level的work有: DeepEdge和其的一篇改进。extract deep features peer pixel and classify it to edge and non-edge class.
patch-level的work有：DeepContour。这类方法：estimate edge maps for the input patches and then integrate them for the whole edge map.
image-level主要就是：HED了。一类end-to-end的方法。

作者主要工作就是将HED与DSN（deeply supervised net）相结合

Model‘s Architecture：

与HED极其的相似，不同点在于：

HED：每个side output和fuse output都用真实label
RDS：side output用自己生成的relaxed label+positive label；最终的fuse output的label为真实label

实际上就是作者手工让hierarchical layer强行学到不一样的东西。 结果显示这样可以提升性能。

作者在生成relaxed label时有两种选择，分别为Canny算子和SE Detector。

下图为这些方法在不同参数下提取到的特征的结果。通常来讲，SE detector可以获得更稀疏的relaxed label.

Loss：

其中$L_{side}$如下图所示，这里Relaxed label为2，GT为1，non-edge为0。 α和β是用来平衡类别不平衡带来的影响

Pretrain with CEA:

CEA（coarse edge annotation）指的是在大型的例如semantic segmentation数据集PSACAL VOC上，是有一些edge的label的，但其并不精细，如下图所示：

使用这些大型的数据集来预训练RDS模型，之后在BSDS500使用RDS的方法进行fine tune，可以达到更好的效果。

整体流程如下图所示：

Multi-Scale Context Aggregation by Dilated Convolutions

Author: Fisher Yu 2016 LINK 第一篇提出空洞卷积的方法。空洞卷积是下面的那篇BDCN使用到了，这也是我第一次了解这个卷积操作。

Journal: ICLR 2016

提出Dilated Conv的paper. 一种专门design for dense prediction（例如：semantic segmentation）的方法

这种空洞卷积（Dilated Conv）support exponential expansion of the receptive field without loss of resolution or coverage!

作者还指出，将这种Dilated Conv可以作为插入模块，来提升这种dense prediction的性能。

总结：

Dilated Conv的好处：

扩大感受野：在deep net中为了增加感受野且降低计算量，总要进行降采样(pooling或s2/conv)，这样虽然可以增加感受野，但空间分辨率降低了。为了能不丢失分辨率，且仍然扩大感受野，可以使用空洞卷积。这在检测，分割任务中十分有用。一方面感受野大了可以检测分割大目标，另一方面分辨率高了可以精确定位目标。
捕获多尺度上下文信息：空洞卷积有一个参数可以设置dilation rate，当设置不同dilation rate时，感受野就会不一样，也即获取了多尺度（multi-scale）信息。

所以我们可以对于一个feature map使用几组不同dilation rate和padding的dilated Conv来让获取multi-scale信息，之后再concatenate并过一个conv聚合并学习这样的multi-scale的信息。

而语义分割（semantic segmentation）由于需要获得较大的分辨率图，因此经常在网络的最后两个stage，取消降采样操作，之后采用空洞卷积弥补丢失的感受野。

背景：

在图像分割领域，图像输入到CNN（典型的网络比如FCN）中，FCN先像传统的CNN那样对图像做卷积再pooling，降低图像（feature）尺寸的同时增大感受野，但是由于图像分割预测是pixel-wise的输出，所以要将pooling后较小的图像尺寸upsampling到原始的图像尺寸进行预测（upsampling一般采用deconv反卷积操作），之前的pooling操作使得每个pixel预测都能看到较大感受野信息。因此图像分割FCN中有两个关键，一个是pooling减小图像尺寸增大感受野，另一个是upsampling扩大图像尺寸。在先减小再增大尺寸的过程中，肯定有一些信息损失掉了，那么能不能设计一种新的操作，不通过pooling也能有较大的感受野看到更多的信息呢？答案就是dilated conv。

Detail About Dilated Conv：

对于公式化卷积的过程可以如下所示。

其使用的卷积核与普通CNN一直，只不过对于每一个Dilated Conv存在一个factor l.

下图就是Dilated Conv的一个直观实例，也显示出了其特点：The number of parameters associated with each layer is identical. The receptive field grows exponentially while the number of parameters grows linearly.

dilated的好处是不做pooling损失信息的情况下，加大了感受野，让每个卷积输出都包含较大范围的信息。在图像需要全局信息或者语音文本需要较长的sequence信息依赖的问题中，都能很好的应用dilated conv。

下图是传统的卷积：或者说是1-dilated Conv

而下图是2-dilated Conv

值得注意的是这里的这里对于filter的初始化不能是随机的。

Bi-Directional Cascade Network for Perceptual Edge Detection

Author: Jianzhong He er al LINK 一篇比较新的2019年的edge detection的方法。

Journal: CVPR 2019

此方法基本上是目前edge detection的state-of-art的方法！

作者给出的改进在于：

Bi-Directional Cascade Network：增强不同scale可以学到的不同的东西。即让模型focus on meaningful details and deep layers should depict object-level boundaries. 即CNN的每一层应该有其单独的supervision，而不是像之前主流的HED和RCF使用统一的supervision。这种每层不同的supervision的方法可以让网络的不同层关注不同的信息，例如浅层关注于一些detail，而深层关注于表述整个物体的edge. 实现这个的本质是利用相邻卷积层的预测和真实的supervision共同作用，作为每层的supervision。可以看为是一种对于loss的修改，这种修改赋予了层与层之间交流信息的能力。
加入额外的Scale Enhancement Module（SEM）：目的是增强每个scale的学习能力。SEM本质就是一组有不同dilated rate的dilated Conv.

关于空洞卷积的一些细节和思考可以见我的另一篇note.

整体来讲，空洞卷积可以effectively increases the size of receptive fields of network neurons. 对于这种pixel level的任务，很大的receptive field是非常重要的。

整体的模型大致如下图所示。

其中的ID Block是Incremental Detection Block，使用VGG网络的每个卷积部分加上SEMs组成的。This bi-directional cascade structure enforces each layer to focus on a specific scale, allowing for a more rational training procedure.

在我看来，模型相比于HED，RCF这种经典的multi-scale方法的改进之处在于：

改进了loss function，引入了不同scale layer的communicate。使用的是每层之间的supervise是相互关联的
增加每个scale的表示能力（每个scale额外使用 dilated convolution）

（i）首先介绍作者是如何增强不同scale可以学到的不同的东西而并不依赖额外的监督信息（增加额外的监督信息见：Relaxed Deep Supervision这篇paper，此blog中也有涉及），实现这个的本质是利用相邻卷积层的预测和真实的supervision共同作用，作为每层的supervision。可以看为是一种对于loss的修改，这种修改赋予了层与层之间交流信息的能力。

作者将真实的label拆解为:

其中Y是真实的label，$Y_s$指的是在scale s上的label. 目标是学到一个detector D, that capable of detecting edges at different scales，D本质上就是一个神经网络N，其有S个conv层，The pooling layers make adjacent convolutional layers depict image patterns at different scales.

优化目标可以写为：

其中$P_s=D_s(N_s(x))$为在scale s的edge prediction。最终的D可以看为$D_s$的ensemble.

显然，下面我们需要定义出每一个$Y_s$，如果每一个$Y_s$都合理，那么显然我们可以让模型的不同卷积学到不同的scale，也是这种改进的核心所在。

然而直接手工的分解$Y$显然是非常困难的。作者使用的方法就是根据真实label Y和其他层的prediction来估计$Y_s$.

但这种估计并不是一个很好地layer-specific supervision. （原因见原文，本质就是这种方法对于每个scale产生的loss的梯度是一致的，dose not necessarily differentiate the scales depicted by different layers）。

所以作者用了两个complementary supervisions来估计$Y_s$，一个是来自smaller scale，另一个是来自larger scale. 这两个不同的supervision会训练两个detector at each scale.

即：

最终就是利用这两个$P_s^{s2d}+P_s^{d2s}$ to interpolate edge detection at scale s.

(ii) 接着介绍第二个改进，即加入额外的Scale Enhancement Module（SEM）：目的是增强每个scale的学习能力。SEM本质就是一组有不同dilated rate的dilated Conv.

最终的输出与HED和RCF一致，都是使用每个中间层结果的fuse再过一个1*1的conv.

每个SEM的结构如图所示，for each SEM we apply K dilated convolutions with different dilation rates.

(iii)最终的loss:

可以看到每个ID Blog由两个layer-specific side supervision来训练，最终也用了fuse 了所有中间层产生的final predict来与真实label产生loss.

最终的loss如下图所示.

L()是一个balanced Cross entropy. Because of the inconsistency of annotations among different annotators, we also introduce a threshold γ for loss computation.即：

这是其产生的一个结果。

Reference: 本部分对应的paper在pdf中用波浪线给出

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
Learning deep structured multi-scale features using attention-gated crfs for contour prediction. In NIPS, pages 3964–3973, 2017. About Edge detection
Attention to scale: Scale-aware semantic image segmentation. In CVPR, pages 3640–3649, 2016.
dilated convolution
Network Cascade