Emerging non-volatile memories (NVMs) have exhibited great potential for accelerating machine learning (ML) computation. A plethora of NVM technologies are being explored for various computing architectures such as systolic arrays and computing-in-memory crossbars. We are particularly interested in training acceleration, aiming to develop efficient ML hardware with learning capability. Such hardware would open up exciting opportunities in resource-constrained scenarios such as edge devices. Training workloads typically require accurate and frequent updates of neural network models; ML training therefore imposes more stringent requirements on write efficiency (energy and latency) and endurance in the underlying hardware fabric. Moreover, training algorithms may be susceptible to circuit/device non-idealities, limiting the achievable model accuracy. To that end, thoughtful cross-layer design spanning device/circuit technology, the mapping of neural network models onto hardware, and the algorithm architecture will be crucial. We will incorporate such non-ideal behaviors at the technology-specific device/circuit level and carry out co-exploration of ML models and dedicated hardware, as sketched below. The goal is to identify viable paths toward robust and efficient machine intelligence with localized and continual learning capability.
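To illustrate what "incorporating non-ideal behaviors" might look like at the modeling level, the sketch below shows one minimal way to fold device non-idealities into the evaluation of a crossbar-style matrix-vector product. All function names, the number of conductance levels, and the noise magnitude are hypothetical assumptions for illustration, not measured device parameters or the specific models we will ultimately use.

```python
import numpy as np

def map_to_conductance(weights, levels=16):
    """Quantize weights to a limited number of conductance levels,
    a crude stand-in for finite write precision in an NVM crossbar."""
    w_max = np.abs(weights).max()
    step = 2 * w_max / (levels - 1)
    return np.round(weights / step) * step

def apply_device_noise(weights, sigma=0.05, rng=None):
    """Perturb each programmed weight with multiplicative Gaussian noise,
    modeling cycle-to-cycle programming variation (sigma is an assumed value)."""
    rng = np.random.default_rng() if rng is None else rng
    return weights * (1.0 + sigma * rng.standard_normal(weights.shape))

def crossbar_matvec(weights, x, levels=16, sigma=0.05, rng=None):
    """Replace the ideal matrix-vector product with its non-ideal counterpart:
    quantize to conductance levels, add programming noise, then compute W_noisy @ x."""
    w_prog = apply_device_noise(map_to_conductance(weights, levels), sigma, rng)
    return w_prog @ x

# Example: compare ideal vs. non-ideal outputs for a random layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)) * 0.1
x = rng.standard_normal(128)
y_ideal = W @ x
y_noisy = crossbar_matvec(W, x, levels=16, sigma=0.05, rng=rng)
print("relative error:", np.linalg.norm(y_noisy - y_ideal) / np.linalg.norm(y_ideal))
```

Hooks of this kind, substituted into the forward and update paths of a training loop, would let the co-exploration quantify how quantization granularity, write noise, and endurance-driven update policies trade off against model accuracy for a given device technology.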