Text Data Management and Analysis – Zhai and Massung

Search is not applied magic, though it is certainly applied mathematics. The standard textbooks on the science and application of information retrieval date back to the period from 2008 to 2012. Over the last decade there has been a great deal of research into information retrieval optimisation, even if this is not obvious to users of search applications. In addition, the boundary between ‘enterprise search’ and ‘text analytics’ has become increasingly blurred, to the benefit of all concerned. The problem with the standard textbooks is that they do little to bridge the chasm between information retrieval and search, and offer very few examples of how the underlying mathematics translates to real life.

The chasm has now been bridged very successfully by Professor ChengXiang Zhai and his student Sean Massung at the University of Illinois at Urbana-Champaign in Text Data Management and Analysis, which is published by Morgan & Claypool for the Association for Computing Machinery (ACM). In effect this 500-page book is the printed version of MOOCs on text retrieval and text mining that were first offered in 2015. The benefit of these antecedents shows in the clarity of both the writing and the layout. The tagline is ‘A practical introduction to information retrieval and text mining’ and the content certainly matches the marketing. The book is divided into four parts:

  • Overview, with some of the core principles needed to understand subsequent chapters
  • Seven chapters on text data access
  • Eight chapters on text mining
  • A short section on unified text data management and analysis

Some applied mathematics is unavoidable, but where it is required the presentation is clear enough for readers without a grounding in probability and computational linguistics to follow the issues being presented. As the authors note, this book is much wider in scope than earlier books, covering topics such as probabilistic topic modelling. It also shows clearly not only the intersection between search and text mining but also the integrated analysis of textual and non-textual data. In addition there is a companion toolkit, MeTA, which implements many of the techniques presented in the book and is also integrated into the exercises at the end of each chapter. The toolkit has been widely used by students on the MOOC, so it is clearly a robust application. The book is available in both print and e-book formats. The benefit of the e-book version is the internal linking to references and diagrams, but you will probably find the printed version easier to browse. The book has an excellent index.

This book has been published at a time when search and text analytics are converging ever more rapidly. Don’t be put off by the exercises. The book will certainly be of value to students on computer science courses and on more advanced degrees in information retrieval. My experience suggests that many IT managers with responsibility for enterprise search have a background in computer science but never had the opportunity to get into the level of detail needed to fully understand how search and text mining applications achieve apparent magic. This book will be of considerable benefit to them. It will also support open source search developers who have the coding skills to work with Lucene, Solr and Elastic but may not have a full grasp of the underlying science of text analysis. It is certainly not the case that all search and text mining applications work the same way! Readers of this book will begin to understand that ‘search’ is actually a set of components, that each of the approaches selected by vendors (and open source developers) has benefits and challenges, and that getting the best out of any search application takes more than just playing design games with the user interface.

Martin White