Note: This is a virtual presentation. Here is the link for where the presentation will be taking place.
Title: Detecting Unknown Instances Using CNNs
Abstract: Deep convolutional neural networks (DCNNs) have shown impressive performance improvements for object detection and recognition problems. However, a vast majority of DCNN-based recognition methods are designed for a closed world, where the primary assumption is that all categories are known a priori. In many real-world applications, this assumption does not necessarily hold. Generally, incomplete knowledge of the world is present at training time, and unknown classes can be submitted to an algorithm during testing. The goal of a visual recognition system is then to reject samples from unknown classes and classify samples from known classes.
In the first part of my talk, I will present new DCNNs for anomaly detection based on one-class classification. The main idea is to use zero-centered Gaussian noise in the feature space as the pseudo-negative class and train the network using the cross-entropy loss. A method in which both the classifier and the feature representations are learned together in an end-to-end fashion will also be presented. In the second part of the talk, I will present a multi-class category detection method that uses both global and local information to predict whether a test image belongs to one of the known classes or to an unknown category. Specifically, the model is trained using one network to perform image-level category prediction and another network to perform patch-level category prediction. We evaluate the effectiveness of all these methods on multiple publicly available datasets and show that these approaches achieve better performance than previous state-of-the-art methods.
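The pseudo-negative idea from the first part of the talk can be sketched in a few lines. This is only a toy illustration under stated assumptions: the 2-D "features", the logistic-regression classifier standing in for the network's classification layers, and all names (`train_one_class`, `score`, `sigma`) are hypothetical and not taken from the talk.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_one_class(pos_feats, sigma=0.5, epochs=300, lr=0.1, seed=0):
    """Fit a linear classifier: known-class features vs. Gaussian pseudo-negatives."""
    rng = random.Random(seed)
    dim = len(pos_feats[0])
    # Pseudo-negative class: zero-centered Gaussian noise in feature space.
    neg_feats = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in pos_feats]
    data = [(x, 1.0) for x in pos_feats] + [(x, 0.0) for x in neg_feats]
    w = [0.0] * dim
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Plain batch gradient descent on the mean cross-entropy loss.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def score(w, b, x):
    # Probability that x belongs to the known class.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical known-class features clustered away from the origin.
data_rng = random.Random(1)
known = [[3.0 + data_rng.gauss(0, 0.3), 3.0 + data_rng.gauss(0, 0.3)]
         for _ in range(100)]
w, b = train_one_class(known)
known_score = score(w, b, [3.0, 3.0])    # sample resembling the known class
unknown_score = score(w, b, [0.0, 0.0])  # sample resembling the pseudo-negatives
```

In this sketch a sample is rejected as unknown when its score falls below a chosen threshold; in the end-to-end setting described in the abstract, the features themselves would also be learned rather than fixed.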
Title: Student-Teacher Learning Techniques for Bilingual and Low Resource OCR
Abstract: Optical Character Recognition (OCR) is the automatic generation of a transcription given a line image of text. Current methods have been very successful on printed English text, with Character Error Rates of less than 1%. However, clean datasets are not commonly seen in real-life applications. There is a move in OCR towards 'text in the wild', conditions where there are lower-resolution images such as store fronts, street signs, and billboards. Oftentimes these texts contain multiple scripts, especially in countries where multiple languages are spoken. In addition, Latin characters are widely seen no matter what the language. The presence of multilingual text poses a unique challenge.
Traditional OCR methods involve text localization, script identification, and then text recognition. A separate system is used in each task and the results from one system are passed to the next. However, the downside of this pipeline approach is that errors propagate downstream and there is no way of providing feedback upstream. These downsides can be mitigated with fully integrated approaches, where one large system does text localization, script identification, and text recognition jointly. These approaches are also sometimes known as end-to-end approaches in literature.
With larger and larger networks, there is also a need for a greater amount of training data. However, this data may be difficult to obtain if the target language is low resource. There are also problems if the data that is obtained is in a slightly different domain, for example, printed versus handwritten text. This is where synthetic data generation techniques and domain adaptation techniques can be helpful.
Given these current challenges in OCR, this thesis proposal is focused on training integrated (i.e., end-to-end) bilingual systems and on domain adaptation techniques. Both of these objectives can be achieved using student-teacher learning methods. The basis of this approach is to have a trained teacher model contribute an additional loss function while a student model is trained. The outputs of the teacher are used as soft targets for the student to learn. The following experiments will be performed:
Title: Optical coherence tomography signal processing in complex domain
Abstract: Optical coherence tomography (OCT) plays an indispensable role in clinical fields such as ophthalmology and dermatology. Over the past 30 years, OCT has gone through tremendous developments, which have come from both hardware improvements and novel signal processing techniques. Hardware improvements such as the use of adaptive optics (AO) and vertical-cavity surface-emitting lasers (VCSELs) help push the fundamental limits of OCT imaging capability. Novel signal processing techniques aim to push the imaging capability beyond current hardware architecture limitations. Often, novel signal processing techniques achieve better performance than hardware modifications while keeping cost to a minimum. The purpose of this dissertation proposal is to develop novel OCT signal processing techniques that provide new imaging capabilities and overcome current imaging limitations.
The OCT signal, as the result of the interference between the back-scattered light from the sample and the reference light, is complex and contains both amplitude and phase information. The amplitude information is mostly used for OCT structural imaging, while the phase information is mostly used for OCT functional imaging. Usually, amplitude-based methods are more robust since they are less prone to noise, while phase-based methods are better for precise quantitative measurements since they are more sensitive to micro-displacements. This dissertation proposal focuses on three advanced OCT signal processing techniques in both the amplitude and phase domains.
The first signal processing technique proposed is the amplitude-based BC-mode OCT image visualization for microsurgery guidance, where multiple sparsely sampled B-scans are combined to generate a single cross-section image with enhanced instrument and tissue layer visibility and reduced shadowing artifacts. The performance of the proposed method is demonstrated by guiding a 30-gauge needle into an ex-vivo human cornea.
The second signal processing technique proposed is the amplitude-based optical flow OCT (OFOCT) for determining accurate velocity fields. A modified continuity constraint is used to compensate for the Fourier-domain OCT (FDOCT) sensitivity fall-off. Spatial-temporal smoothness constraints are used to make the optical flow problem well-posed and to reduce noise in the velocity fields. The accuracy of the proposed method is verified through phantom flow experiments using a diluted milk powder solution as the scattering medium, for both advective flow and turbulent flow.
The third signal processing technique proposed is phase-based. A wrapped Gaussian mixture model (WGMM) is proposed to stabilize the phase of swept-source OCT (SSOCT) systems. The OCT signal phase is divided into several components and each component is fully analyzed. The WGMM is developed based on this analysis. A closed-form iterative solution of the WGMM is derived using the expectation-maximization (EM) algorithm. The performance of the proposed method is demonstrated through OCT imaging of ex-vivo mouse cornea and anterior chamber.
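For reference, the textbook wrapped normal density and a K-component mixture of it are written out below. This is the standard form only, and the exact parameterization used in the proposal may differ; the symbols \theta (wrapped phase), \mu_k, \sigma_k, \pi_k and K are notation introduced here for illustration.

```latex
% Wrapped normal density on the circle, \theta \in [-\pi, \pi)
f_{\mathrm{WN}}(\theta;\mu,\sigma)
  = \frac{1}{\sigma\sqrt{2\pi}}
    \sum_{n=-\infty}^{\infty}
    \exp\!\left(-\frac{(\theta-\mu+2\pi n)^{2}}{2\sigma^{2}}\right)

% K-component wrapped Gaussian mixture
p(\theta) = \sum_{k=1}^{K} \pi_k \, f_{\mathrm{WN}}(\theta;\mu_k,\sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

The infinite sum accounts for the fact that phase is only observed modulo 2\pi; in practice it is commonly truncated to a few terms around n = 0.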
For all three of the proposed methods above, progress has been made in theoretical modeling, numerical implementation, and experimental verification. All the algorithms have been implemented on the graphics processing unit (GPU) of the OCT system for real-time data processing. Preliminary results demonstrate good performance of the proposed methods. The final thesis work will include optimizing the proposed methods and applying the implemented algorithms to both ex-vivo and in-vivo biomedical research for overall system testing and analysis.
Title: Single Image Based Crowd Counting Using Deep Learning
Abstract: Estimating count and density maps from crowd images has a wide range of applications such as video surveillance, traffic monitoring, public safety and urban planning. In addition, techniques developed for crowd counting can be applied to related tasks in other fields of study such as cell microscopy, vehicle counting and environmental survey. The task of crowd counting and density map estimation from a single image is a difficult problem since it suffers from multiple issues like occlusions, perspective changes, background clutter, non-uniform density, intra-scene and inter-scene variations in scale and perspective. These issues are further exacerbated in highly congested scenes. In order to overcome these challenges, we propose a variety of different deep learning architectures that specifically incorporate various aspects such as global/local context information, attention mechanisms, specialized iterative and multi-level multi-pathway fusion schemes for combining information from multiple layers in a deep network. Through extensive experimentations and evaluations on several crowd counting datasets, we demonstrate that the proposed networks achieve significant improvements over existing approaches.
We also recognize the need for large amounts of data for training the deep networks and their inability to generalize to new scenes and distributions. To overcome this challenge, we propose novel semi-supervised and weakly-supervised crowd counting techniques that effectively leverage large amounts of unlabeled/weakly-labeled data. In addition to developing techniques with ability to learn from limited labeled data, we also introduce a new large-scale crowd counting dataset which can be used to train considerably larger networks. The proposed data consists of 4,372 high resolution images with 1.51 million annotations. We made explicit efforts to ensure that the images are collected under a variety of diverse scenarios and environmental conditions. The dataset provides a richer set of annotations like dots, approximate bounding boxes, blur levels, etc.
Title: Towards End-to-end Non-autoregressive speech applications
Abstract: Sequence labeling is a fascinating and challenging topic in the speech research community. The sequence-to-sequence model, a particularly popular end-to-end model, has been proposed for various sequence labeling tasks. Autoregressive models are the dominant approach; they predict the labels one by one, conditioning on previous results. This makes training easier and more stable. However, this simplicity also makes inference inefficient, particularly for lengthy output sequences. To speed up inference, researchers have become interested in another type of sequence-to-sequence model, known as non-autoregressive models. In contrast to autoregressive models, non-autoregressive models predict the whole sequence within a constant number of iterations.
In this proposal, two different types of non-autoregressive models for speech applications are proposed: mask-based approach and noise-based approach. To demonstrate the effectiveness of the two proposed methods, we explored their usage for two important topics: speech recognition and speech synthesis. Experiments reveal that the proposed methods can match the performance of state-of-the-art autoregressive models with a much shorter inference time.
Title: Deep Learning Based Methods for Ultrasound Image Segmentation and Magnetic Resonance Image Reconstruction
Abstract: In recent years, deep learning (DL) algorithms, in particular convolutional networks, have rapidly become a methodology of choice for analyzing medical images. They have shown promising performance in many medical image analysis (MIA) problems, including classification, segmentation and reconstruction. However, the inherent differences between natural images and medical images (ultrasound, MRI, etc.) have hindered the performance of DL-based methods that were originally designed for natural images. Another obstacle for DL-based MIA comes from the limited availability of large-scale training datasets, as it has been shown that large and diverse datasets can effectively improve the robustness and generalization ability of DL networks.
In this thesis, we develop various deep learning-based approaches to address two medical image analysis problems. In the first problem, we focus on computer assisted orthopedic surgery (CAOS) applications that use ultrasound as the intra-operative imaging modality. This problem requires an automatic, real-time algorithm to detect and segment bone surfaces and shadows in order to guide the orthopedic surgeon to a standardized diagnostic viewing plane with minimal artifacts. Because the available datasets are relatively small and images differ across ultrasound machines, we develop DL-based frameworks that integrate a local phase filtering technique, thus improving robustness.
Finally, we propose a fast and accurate Magnetic Resonance Imaging (MRI) reconstruction framework using a novel Convolutional Recurrent Neural Network (CRNN). Extensive experiments and evaluation on knee and brain datasets have shown that it outperforms traditional compressed sensing and other DL-based methods. Furthermore, we extend this method to enable multi-sequence reconstruction, where T2-weighted MR images can guide and improve the reconstruction of amide proton transfer-weighted
Carlos Castillo, Department of Electrical and Computer Engineering
Shanshan Jiang, Department of Radiology and Radiological Science
Ilker Hacihaliloglu, Department of Biomedical Engineering (Rutgers University)
Title: Harmonization of Structural MRI for Consistent Image Analysis
Abstract: Magnetic resonance imaging (MRI) is a flexible, non-invasive medical imaging modality that uses strong magnetic fields and radio-frequency pulses to produce images with excellent contrast in the soft tissues of the body. MRI is commonly used in diagnosis and monitoring of many conditions, but is especially useful in disorders of the central nervous system, such as multiple sclerosis (MS), where the brain and spinal cord are heavily involved. An MRI scan normally contains a number of imaging volumes, where different pulse sequence parameters are selected to highlight different tissue properties. These volumes can then be used together to provide complementary information about the imaged area. Flexible design of the imaging system allows for a variety of questions to be answered during a single scanning session, but also comes with a cost. As there are many parameters to define when designing an imaging sequence, there is no common standard that is widely used. These differences lead to variability in image appearance between manufacturers, imaging centers, and even individual scanners. As an example, a commonly acquired MR volume is a T1-weighted image, where differences in a specific magnetic property (longitudinal relaxation time, or T1) are highlighted. However, this general effect can be achieved with a myriad of different pulse sequences even before the individual parameters are considered. This is perhaps most apparent in the difference between T1-weighted images with and without a preparatory inversion pulse, where images with an inversion pulse tend to have a much clearer contrast between grey and white matter in the brain. With the advent of advanced machine learning methods, variations such as the example above create a large problem, as accurate methods become closely tied to the data used to train them and any variation in inputs can have unknown effects on output quality.
This problem sets the stage for image harmonization, where synthetic “harmonized” images are produced after acquisition to provide consistent inputs to image analysis routines.
This thesis aims to develop harmonization strategies for structural brain MR images that will allow for the synthesis of harmonized images from differing inputs. These images can then be used downstream in automated analysis pipelines, most commonly whole-brain segmentation for volumetric analysis. Recently, deep learning-based techniques have been shown to be excellent candidates in the realm of image synthesis and can be readily incorporated in harmonization tasks. However, this is complicated, as training data (especially in multi-site settings) is rarely available. This work will approach these problems by covering three main topics:
Title: Intraoperative Optical Coherence Tomography Guided Deep Anterior Lamellar Keratoplasty
Abstract: Deep anterior lamellar keratoplasty (DALK) is a highly challenging procedure requiring micron accuracy to guide a “big bubble” needle into the stroma of the cornea down to Descemet’s Membrane (DM). It has important advantages over penetrating keratoplasty (PK), including a lower rejection rate, less endothelial cell loss, and increased graft survival. Currently, this procedure relies heavily on visualization through a surgical microscope, the surgeon’s own surgical experience, and tactile feel to determine the relative position of the needle and DM. Optical coherence tomography (OCT) is a well-established, non-invasive optical imaging technology that can provide high-speed, high-resolution, three-dimensional images of biological samples. Since it was first demonstrated in 1991, OCT has emerged as a leading technology for ophthalmic visualization, especially for retinal structures, and has been widely applied in ophthalmic surgery and research. Common-path (CP) OCT systems use a single A-scan to deduce the tissue layer information and can be operated at a much higher speed. This synergizes well with handheld tools and automated surgical systems, which require fast response times. CP-OCT has been integrated into a wide range of microsurgical tools for procedures such as epiretinal membrane peeling and subretinal injection.
In this proposal, a common-path swept-source OCT system (CP-SSOCT) is proposed to guide DALK procedures. An OCT distal-sensor integrated needle and an OCT-guided micro-control ocular surgical system (AUTO-DALK) will be designed and evaluated. This device will allow for the autonomous insertion of a needle for pneumo-dissection based on the depth-sensing results from the OCT system. An earlier prototype of AUTO-DALK was tested on ex-vivo porcine corneas, including a comparison with expert manual needle insertion. The results showed that the precision and consistency of the needle placement were increased, which could lead to better visual outcomes and fewer complications. Future work will include improving the overall design for in-vivo testing and clinical use, advanced convolutional neural network based tracking, and system validation on a larger sample size.
Jin U. Kang (adviser), Department of Electrical and Computer Engineering
Israel Gannot, Department of Electrical and Computer Engineering
Xingde Li, Department of Biomedical Engineering
Title: Unsupervised Domain Adaptation for Speaker Verification in the Wild
Abstract: Performance of automatic speaker verification (ASV) systems is very sensitive to mismatch between training (source) and testing (target) domains. The best way to address domain mismatch is to perform matched condition training – gather sufficient labeled samples from the target domain and use them in training. However, in many cases this is too expensive or impractical. Usually, gaining access to unlabeled target domain data, e.g., from open source online media, and labeled data from other domains is more feasible. This work focuses on making ASV systems robust to uncontrolled (‘wild’) conditions, with the help of some unlabeled data acquired from such conditions.
Given acoustic features from both domains, we propose learning a mapping function, a deep convolutional neural network (CNN) with an encoder-decoder architecture, between the features of the two domains. We explore training the network in two different scenarios: training on paired speech samples from both domains and training on unpaired data. In the former case, where the paired data is usually obtained via simulation, the CNN is treated as a non-linear regression function and is trained to minimize the L2 loss between the original and predicted features from the target domain. Though this approach is effective, we provide empirical evidence that it introduces distortions that affect verification performance. To address this, we explore training the CNN using an adversarial loss (along with L2), which makes the predicted features indistinguishable from the original ones and thus improves verification performance.
The above framework, though effective, cannot be used to train the network on unpaired data obtained by independently sampling speech from both domains. In this case, we first train a CNN using an adversarial loss to map features from source to target. We then map the predicted features back to the source domain using an auxiliary network, and minimize a cycle-consistency loss between the original and reconstructed source features.
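The cycle-consistency term described above can be illustrated with a minimal sketch. The assumptions here are loud ones: the abstract does not say which distance is used, so a mean absolute (L1) difference is assumed; the two mapping functions are simple hypothetical stand-ins for the CNN and the auxiliary network; and the adversarial term is omitted entirely.

```python
def cycle_consistency_loss(source_feats, forward_map, backward_map):
    """Mean absolute difference between source features and their reconstruction.

    forward_map:  source -> target (stand-in for the mapping CNN)
    backward_map: target -> source (stand-in for the auxiliary network)
    """
    total = 0.0
    count = 0
    for x in source_feats:
        reconstructed = backward_map(forward_map(x))
        for original_value, recon_value in zip(x, reconstructed):
            total += abs(original_value - recon_value)
            count += 1
    return total / count


# Hypothetical toy mappings on feature vectors (not real networks).
def to_target(x):
    return [2.0 * v + 1.0 for v in x]


def to_source_exact(y):
    # Exact inverse of to_target, so the cycle reconstructs the input.
    return [(v - 1.0) / 2.0 for v in y]


def to_source_poor(y):
    # Ignores the offset, so every reconstructed value is off by 0.5.
    return [v / 2.0 for v in y]


feats = [[0.0, 1.0, 2.0], [-1.0, 0.5, 3.0]]
loss_exact = cycle_consistency_loss(feats, to_target, to_source_exact)
loss_poor = cycle_consistency_loss(feats, to_target, to_source_poor)
```

A backward mapping that undoes the forward mapping yields a loss near zero, while one that does not is penalized; minimizing this term is what lets training proceed without paired samples.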
To prevent the CNN from over-fitting when trained on limited amounts of data, we present a simple regularization technique. Our unsupervised adaptation approach using feature mapping also complements its supervised counterpart, where adaptation is done using labeled data from both domains. We focus on three domain mismatch scenarios: (1) sampling frequency mismatch between the domains, (2) channel mismatch, and (3) robustness to far-field and noisy speech acquired in wild conditions.
Title: Coherence-based learning from raw ultrasound data for breast mass diagnosis
Abstract: Breast cancer is the most prevalent cancer among women in the United States, with approximately one in eight women being diagnosed in their lifetimes. Imaging modalities such as mammography, MRI, and ultrasound are employed to non-invasively visualize breast masses in order to determine the need for a biopsy. However, each of these methods results in a significant number of patients requiring biopsies of benign masses. Ultrasound in particular is praised for its low cost, painlessness, and portability, yet the false positive rate of breast ultrasound can be as high as 93% depending on the type of mass in question. Most commonly, diagnosis is performed using the brightness-mode (B-mode) image present on most clinical ultrasound scanners, which transitions naturally to the use of B-mode images for segmentation and classification of breast masses. Ultimately, segmentation and classification of breast masses can be summarized as analysis of a grayscale image. While this approach has been successful, information is lost during the B-mode image formation process.
An alternative approach to the lossy process of information extraction from B-mode images is to leverage features (e.g., spatial coherence) of backscattered ultrasound waves to determine the content of a breast mass. I will first describe my contributions to improve the diagnostic quality of breast ultrasound images by leveraging spatial coherence information. Next, I will present my deep learning approach to overcome limitations with real-time implementation of coherence-based imaging techniques. Finally, I will present a new method to learn the high-dimensional features encoded within backscattered ultrasound waves in order to differentiate benign from malignant breast masses.