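My first recurrent deep system (mentioned above) [1,2] partially overcame the fundamental problem [A5] through a deep RNN stack pre-trained in unsupervised fashion, which accelerated subsequent supervised learning. It was a working deep learning system in the modern, post-2000 sense, the first neural hierarchical temporal memory, and the first "very deep learner".
However, deep learning research has a long history. In 1965, Ivakhnenko and Lapa [71] published the first general, working learning algorithm for supervised deep feedforward multilayer perceptrons. A 1971 paper described an 8-layer deep network trained by the "Group Method of Data Handling", a method that remains popular in the new millennium. Given a training set of input vectors with corresponding target output vectors, layers are grown incrementally and trained by regression analysis, then pruned with the help of a separate validation set, with regularization used to weed out superfluous units. The number of layers and the number of units per layer can both be learned in a problem-dependent fashion.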
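The layer-growing procedure behind the Group Method of Data Handling can be illustrated with a small sketch. It is only an approximation of the idea, not the algorithm of [71]: it assumes quadratic two-input units fitted by least squares, uses validation error as the sole pruning criterion (standing in for the regularization mentioned in the text), and the names `fit_gmdh`, `predict_gmdh`, `keep`, and `max_layers` are hypothetical.

```python
import itertools
import numpy as np


def _features(a, b):
    # Quadratic polynomial terms for one pair of inputs (an assumed unit type).
    return np.column_stack([np.ones_like(a), a, b, a * b, a * a, b * b])


def _apply(units, A):
    # Evaluate every unit of one layer; each unit yields one output column.
    return np.column_stack([_features(A[:, i], A[:, j]) @ w for i, j, w in units])


def fit_gmdh(X_train, y_train, X_val, y_val, keep=4, max_layers=8):
    """Grow layers until the best validation error stops improving."""
    layers, best_err = [], np.inf
    A_train, A_val = X_train, X_val
    for _ in range(max_layers):
        candidates = []
        # One candidate unit per pair of current inputs, trained by regression.
        for i, j in itertools.combinations(range(A_train.shape[1]), 2):
            F = _features(A_train[:, i], A_train[:, j])
            w, *_ = np.linalg.lstsq(F, y_train, rcond=None)
            pred = _features(A_val[:, i], A_val[:, j]) @ w
            candidates.append((float(np.mean((pred - y_val) ** 2)), i, j, w))
        # Prune: keep only the units that do best on the separate validation set.
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:keep]
        if not survivors or survivors[0][0] >= best_err:
            break  # a further layer does not help, so the depth is settled
        best_err = survivors[0][0]
        units = [(i, j, w) for _, i, j, w in survivors]
        layers.append(units)
        A_train, A_val = _apply(units, A_train), _apply(units, A_val)
    return layers, best_err


def predict_gmdh(layers, X):
    # Feed the inputs through every layer; the first unit of the last layer
    # is the one with the lowest validation error.
    A = X
    for units in layers:
        A = _apply(units, A)
    return A[:, 0]
```

In this sketch both the depth (where the loop stops) and the width (which units survive) are determined by the data rather than fixed in advance, which mirrors the problem-dependent structure learning described above.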
[1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992 (based on TR FKI-148-91, 1991).
[2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. An ancient experiment with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NN can be found here - try Google Translate in your mother tongue.
[4] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, eds., A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[4a] Y. Bengio, P. Simard, P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN 5(2), p 157-166, 1994.
[5] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
[6] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
[7] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005.
[8] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006.
[9] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
[10] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009.
[11] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google, Mountain View, California, 2011.
[12] A. Graves, A. Mohamed, G. E. Hinton. Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013, Vancouver, 2013.
[12a] T. Bluche, J. Louradour, M. Knibbe, B. Moysset, F. Benzeghiba, C. Kermorvant. The A2iA Arabic Handwritten Text Recognition System at the OpenHaRT2013 Evaluation. Submitted to DAS 2014.
[13] J. Hawkins, D. George. Hierarchical Temporal Memory - Concepts, Theory, and Terminology. Numenta Inc., 2006.
[14] R. Kurzweil. How to Create a Mind: The Secret of Human Thought Revealed. ISBN 0670025291, 2012.
[15] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 2006.
[16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541-551, 1989.
[16a] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel. Handwritten digit recognition with a back-propagation network. Proc. NIPS 1989, 2, Morgan Kaufmann, Denver, CO, 1990.
[17] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010.
[18] D. H. Hubel, T. N. Wiesel. Receptive Fields, Binocular Interaction And Functional Architecture In The Cat's Visual Cortex. Journal of Physiology, 1962.
[19] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4): 193-202, 1980. Scholarpedia.
[19a] K. Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979.
[20] M. Riesenhuber, T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience 11, p 1019-1025, 1999.
[20a] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403-412, 1989. PDF. HTML. Local competition in the Neural Bucket Brigade (figures omitted).
[21] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011.
[22] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, 2012.
[23] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[24] S. Behnke. Hierarchical Neural Networks for Image Interpretation. Dissertation, FU Berlin, 2002. LNCS 2766, Springer, 2003.
[25] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi-Column Deep Neural Network for Traffic Sign Classification. Neural Networks 32: 333-338, 2012.
[25a] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. A Committee of Neural Networks for Traffic Sign Classification. International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011.
[25b] J. Stallkamp, M. Schlipsing, J. Salmen, C. Igel. INI Benchmark Website: The German Traffic Sign Recognition Benchmark for IJCNN 2011.
[25c] Qualifying for IJCNN 2011 competition: results of 1st stage (January 2011).
[25d] Results for IJCNN 2011 competition (2 August 2011).
[26] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, 2012.
[26a] M. D. Zeiler, R. Fergus. Visualizing and Understanding Convolutional Networks. TR arXiv:1311.2901 [cs.CV], 2013.
[27] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, B. Catanzaro. Deep Learning with COTS HPC Systems. ICML 2013.
[28] J. Masci, A. Giusti, D. Ciresan, G. Fricout, J. Schmidhuber. A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks. ICIP 2013.
[28a] A. Giusti, D. Ciresan, J. Masci, L. M. Gambardella, J. Schmidhuber. Fast Image Scanning with Deep Max-Pooling Convolutional Neural Networks. ICIP 2013.
[29] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
[29a] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin (eds.), System Modeling and Optimization: Proc. IFIP (1981), Springer, 1982.
[29b] P. J. Werbos. Backwards Differentiation in AD and Neural Nets: Past Links and New Opportunities. In H. M. Bücker, G. Corliss, P. Hovland, U. Naumann, B. Norris (eds.), Automatic Differentiation: Applications, Theory, and Implementations, 2006.
[29c] S. E. Dreyfus. The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control, 18(4):383-385, 1973.
[30] Y. LeCun. Une procedure d'apprentissage pour reseau a seuil asymetrique. Proceedings of Cognitiva 85, 599-604, Paris, France, 1985.
[31] D. E. Rumelhart, G. E. Hinton, R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318-362. MIT Press, 1986.
[32] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.
[33] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. TR CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
[34] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.
[35] D. H. Ballard. Modular learning in neural networks. Proc. AAAI-87, Seattle, WA, p 279-284, 1987.
[36] G. E. Hinton. Connectionist learning procedures. Artificial Intelligence 40, 185-234, 1989.
[37] B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263-269, 1989.
[38] J. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243-248, 1992.
[39] J. Martens and I. Sutskever. Training Recurrent Neural Networks with Hessian-Free Optimization. In Proc. ICML 2011.
[40] K. Fukushima. Artificial vision by multi-layered neural networks: Neocognitron and its advances. Neural Networks, vol. 37, pp. 103-119, 2013. Link.
[41a] G. B. Orr, K. R. Müller, eds., Neural Networks: Tricks of the Trade. LNCS 1524, Springer, 1999.
[41b] G. Montavon, G. B. Orr, K. R. Müller, eds., Neural Networks: Tricks of the Trade. LNCS 7700, Springer, 2012.
[41c] Lots of additional tricks for improving (e.g., accelerating, robustifying, simplifying, regularising) NN can be found in the proceedings of NIPS (since 1987), IJCNN (of IEEE & INNS, since 1989), ICANN (since 1991), and other NN conferences since the late 1980s. Given the recent attention to NN, many of the old tricks may get revived.
[42] H. Baird. Document image defect models. IAPR Workshop, Syntactic & Structural Pattern Recognition, p 38-46, 1990.
[43] P. Y. Simard, D. Steinkraus, J. C. Platt. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. ICDAR 2003, p 958-962, 2003.
[44] I. J. Goodfellow, A. Courville, Y. Bengio. Spike-and-Slab Sparse Coding for Unsupervised Feature Discovery. Proc. ICML, 2012.
[45] D. Ciresan, U. Meier, J. Schmidhuber. Transfer Learning for Latin and Chinese Characters with Deep Neural Networks. Proc. IJCNN 2012, p 1301-1306, 2012.
[45a] D. Ciresan, J. Schmidhuber. Multi-Column Deep Neural Networks for Offline Handwritten Chinese Character Classification. Preprint arXiv:1309.0261, 1 Sep 2013.
[46] D. Scherer, A. Mueller, S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. ICANN 2010.
[47] J. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95. Augustinus, 1993.
[48] R. E. Schapire. The Strength of Weak Learnability. Machine Learning 5(2): 197-227, 1990.
[49] M. A. Ranzato, C. Poultney, S. Chopra, Y. LeCun. Efficient learning of sparse representations with an energy-based model. Proc. NIPS, 2006.
[50] M. Ranzato, F. J. Huang, Y. Boureau, Y. LeCun. Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition. Proc. CVPR 2007, Minneapolis, 2007.