Data science in my book Dancing with Python

In my blog entry “Quantum computing in my book Dancing with Python,” I described the book's quantum computing content. I also published the entry “Availability of my book Dancing with Python and its table of contents.”

Today, I want to specifically list what I discuss in the book in what I term “an extended definition of data science.” The core chapters are in Part III. Here are their titles, introductions, and chapter tables of contents:

III Advanced Features and Libraries

12 Searching and Changing Text

We represent much of the world’s information as text. Think of all the words in all the digital newspapers, e-books, PDF files, blogs, emails, texts, and social media services such as Twitter and Facebook. Given a block of text, how do we search it to see if some desired information is present? How can we change the text to add formatting or corrections or extract information?

Chapter 4, Stringing You Along, covered Python’s string functions and methods. This chapter begins with regular expressions and then proceeds to natural language processing (NLP) basics: how to go from a string of text to some of the meaning contained therein.

12.1 Core string search and replace methods
12.2 Regular expressions
12.3 Introduction to Natural Language Processing
12.4 Summary
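
To give a flavor of the first two sections, here is a minimal sketch of string methods next to regular expressions, using only the standard library. The sample text, the patterns, and the variable names are my own illustrations and are not taken from the book; the email pattern in particular is deliberately simplistic.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Contact ada@example.com or grace@example.org before 2021-08-31."

# Core string methods: membership test, search, and replace.
has_contact = "Contact" in text
position = text.find("grace")
replaced = text.replace("Contact", "Email")

# Regular expressions: find every email-like token (a simplistic pattern).
email_pattern = re.compile(r"[\w.]+@[\w.]+\.\w+")
emails = email_pattern.findall(text)

# Capture groups pull the pieces out of an ISO-style date.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
match = date_pattern.search(text)
year, month, day = match.groups()

# re.sub rewrites each match; here the date is reordered as day/month/year.
reformatted = date_pattern.sub(r"\3/\2/\1", text)

print(emails)
print(reformatted)
```

The string methods are enough when you know the exact text you are looking for. The regular expressions earn their keep when you only know its shape.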

13 Creating Plots and Charts

Among mathematicians and computer scientists, it’s said that a picture is worth 2^10 words. Okay, that’s a bad joke, but it’s one thing to manipulate and compute with data and quite another to create stunning visualizations that convey useful information.

While there are many ways of building images and charts, Matplotlib is the most widely used Python library for doing so. [MAT] Matplotlib is very flexible and can produce high-quality output for print or digital media. It also has great support for a wide variety of backends that give you powerful mouse-driven interactivity. Generally speaking, if you have a coding project and you need to visualize numeric information, see if Matplotlib already does what you want. This chapter covers the core functionality of this essential library.

13.1 Function plots
13.2 Bar charts
13.3 Histograms
13.4 Pie charts
13.5 Scatter plots
13.6 Moving to three dimensions
13.7 Summary

14 Analyzing Data

While we can use fancy names like “data science,” “analytics,” and “artificial intelligence” to talk about working with data, sometimes you just want to read, write, and process files containing many rows and columns of information. People have been doing this interactively for years, typically using applications like Microsoft Excel® and online apps like Google Sheets™.

By manipulating data “programmatically,” I mean that we use Python functions and methods. This chapter uses the popular pandas library to create and manipulate these collections of rows and columns, called DataFrames. [PAN] [PCB] We will later introduce other methods in Chapter 15, Learning, Briefly. Before we discuss DataFrames, let’s review some core ideas from statistics.

14.1 Statistics
14.2 Cats and commas
14.3 pandas DataFrames
14.4 Data cleaning
14.5 Statistics with pandas
14.6 Converting categorical data
14.7 Cats by gender in each locality
14.8 Are all tortoiseshell cats female?
14.9 Cats in trees and circles
14.10 Summary
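
As a small taste of working with rows and columns programmatically, here is a sketch that reads comma-separated text into a pandas DataFrame and computes a couple of summary statistics. The cat names, localities, weights, and column names are invented for this example and are not the dataset used in the book.

```python
import io

import pandas as pd

# Hypothetical data, invented for this illustration.
csv_text = """name,locality,weight_kg
Milo,Springfield,4.2
Luna,Shelbyville,3.8
Oscar,Springfield,5.1
"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
cats = pd.read_csv(io.StringIO(csv_text))

# A basic statistic over one column.
mean_weight = cats["weight_kg"].mean()

# Group the rows by locality and average the weights within each group.
by_locality = cats.groupby("locality")["weight_kg"].mean()

print(cats)
print(by_locality)
```

The same few calls work unchanged whether the data has three rows or three million, which is the main appeal over doing this by hand in a spreadsheet.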

15 Learning, Briefly

Machine learning is not new, but it and its sub-discipline, deep learning, are now being used extensively for many applications in artificial intelligence (AI). There are hundreds of academic and practical coding books about machine learning.

This final chapter introduces machine learning and neural networks primarily through the scikit-learn (sklearn) module. Consider this a jumping-off point where you can use the Python features you’ve learned in this book to go more deeply into these essential AI areas if they interest you.

15.1 What is machine learning?
15.2 Cats again
15.3 Feature scaling
15.4 Feature selection and reduction
15.5 Clustering
15.6 Classification
15.7 Linear regression
15.8 Concepts of neural networks
15.9 Quantum machine learning
15.10 Summary

This book is an introduction, so my goal is to get you started on a broad range of topics. For example, here are the Python modules and packages discussed or used in each of the four chapters in Part III:

12 Searching and Changing Text: re, flashtext, spacy
13 Creating Plots and Charts: matplotlib, numpy, mpl_toolkits.mplot3d
14 Analyzing Data: pandas, numpy, matplotlib, squarify, matplotlib-venn
15 Learning, Briefly: sklearn, pandas, numpy

I mention in passing in the book several other packages, such as pytorch, as pointers for further exploration. I did not include in the list above standard modules such as math, random, and sys.

Call for papers: Education, Research, and Application of Quantum Computing – HICSS 2022

Education, Research, and Application of Quantum Computing

My IBM Quantum colleague Dr. Andrew Wack and I are hosting a minitrack at the Hawaii International Conference on System Sciences (HICSS) 2022.

The description of the minitrack is:

There is no question that quantum computing will be a technology that will spur breakthroughs in natural science, AI, and computational algorithms such as those used in finance. IBM, Google, Honeywell, and several startups are working hard to create the next generation of “supercomputers” based on universal quantum technology.

What exactly is quantum computing, how does it work, how do we teach it, how do we leverage it in education and research, and what will it take to achieve these quantum breakthroughs?

The purpose of this minitrack is to bring together educators and researchers who are working to bring quantum computing into the mainstream.

We are looking for reports that

  • improve our understanding of how to integrate quantum computing into business, machine learning, computer science, and applied mathematics university curriculums,
  • describe hands-on student experiences with the open-source Qiskit quantum software development kit, and
  • extend computational techniques for business, finance, and economics from classical to quantum systems.

It is part of the Decision Analytics and Service Science track at HICSS.

Please consider submitting a report and sharing this Call for Papers with your colleagues.

Math and Analytics at IBM Research: 50+ Years

Soon after I arrived back in IBM Research last July after 13 years away in the Software Group and Corporate, I was shown a 2003 edition of the IBM Journal of Research and Development that was dedicated to the Mathematical Sciences group at 40. From that, I and others assumed that this year, 2013, was the 50th anniversary of the department.

Herman Goldstine at IBM Research

I set about lining up volunteers to organize the anniversary events for the year and sent an email to our 300 worldwide members of what is now called the Business Analytics and Mathematical Sciences strategy area. Not long afterwards, I received a note from Alan Hoffman, a former director of the department, saying that he was pretty sure that the department had been around since 1958 or 1959. So our 50th Anniversary became the 50+ Anniversary. Evidently mathematicians know the theory of arithmetic but don’t always practice it correctly.

The first director of the department was Herman Goldstine, who joined after working on the ENIAC computer and a stint at the Institute for Advanced Study in Princeton. Goldstine is pictured in the first photo on the right at a reception at the T.J. Watson Research Center in the early 1960s. Goldstine died in 2004, but all other directors of the department are still alive.

Directors of the Mathematical Sciences Department at IBM Research

We decided that the first event of the year celebrating the (more than) half century of the department would be a reunion of the directors for a morning of panel discussions. This took place this last Wednesday, May 1, 2013.

Reunion of the directors of the Math Sciences Department at IBM Research
Photo credit: Mary Beth Miller

I started the day by giving a glimpse of what the department looks like today: the above-mentioned 300 Ph.D.s, software engineers, postdocs, and other staff distributed over the areas of optimization, analytics, visual analytics, and social business in 10 of IBM’s 12 global labs.

I then introduced our panel pictured in the photo above. From left to right we have me, Brenda Dietrich, Bill Pulleyblank, Shmuel Winograd, Roy Adler (a mathematician who was in the department during the tenures of all the other directors except me), Alan Hoffman, Dick Toupin, Hirsh Cohen, and Ralph Gomory.

Ralph Gomory, Benoit Mandelbrot, and other IBM researchers pondering a math problem

My goal for the discussion was to go back and look at some of the history and culture of the math department over the last five decades. I was hoping we would hear anecdotes and stories of what life was like, the challenges they faced, and the major successes and disappointments.

Other than a few questions I had prepared, I wasn’t sure where our conversation would go. The many researchers who joined us in the auditorium at the T. J. Watson Research Center in Yorktown Heights, NY, or via the video feed going out to the other worldwide labs would have a chance to ask questions near the end of the morning.

I’m not going to go over every question and answer but rather give you the gist of what we spoke about.

  • Ralph Gomory reminded us that the department was started in a much different time, during the Cold War. The problems they were trying to solve using the hardware and the software of the day were often highly confidential. However, every era of the department has had its own focus, burning problems to be solved, and operational environment.
  • Hirsh Cohen got his inspiration for the mathematics he did by solving practical problems such as those related to the large mainframe-connected printers. Many people feel that mathematics shouldn’t stray too far from the concrete, but it is not that simple. This isn’t just applied mathematics, it is a way of looking for inspiration that may express itself in more theoretical ways. The panelists mentioned more than once that the original posers of business or engineering problems might not recognize the mathematics that was developed in response. (I think there is nothing wrong with theoretical mathematics with no direct connection to the physical world, but there are some areas of mathematical pursuit that I think are just silly and of marginal pure or applied interest.)
  • In response to my question about balancing business needs with the desire to advance basic science, Shmuel Winograd told me I had asked the wrong question: it was about the integration of business with basic science, not a partitioning of time or resources between them. This very much sets the tone of how you manage such a science organization in a commercial company. The successful integration of these concerns may also be why IBM Research is pretty much the sole survivor of the industrial research labs from the 1950s and 1960s.
  • There was general consensus that it is difficult to get a researcher to do science in an area in which he or she fundamentally does not want to work. This was redirected to the audience members, who were reminded to understand what they loved to do and then find a way to do it. (This sounded like a bit of a management challenge to me, and I suspect I’ll hear about it again.)
  • Time gives a great perspective on the quality and significance of scientific work that is just not obvious while you are in the middle of it. This is one of the reasons why retrospectives such as this can be so satisfying.

Discussing the future of BAMS
Photo credit: Mary Beth Miller

After the first panel and coffee break, we came back and I started the session looking at the future of the department instead of the history. We have an internal department social network community in IBM Connections and I started by summarizing some of the suggestions people came up with about what we’ll be doing in the department in five, ten, and twenty years.

Sustainability, robotic applications of cognitive computing, and mathematical algorithms for quantum computing were all suggested. Note that this was all fun speculation, not strategy development!

Eleni Pratsini, Director of Optimization Research, and Chid Apte, Director of Analytics Research, then each discussed technical topics that could be future areas for scientific research as well as having significant business use.

After the final Q&A session, we got everyone on stage for a group photo.

Photo credit: Steve Hamm

One thing that struck me when we were doing the research through the archives was how much more of a record we have of the first decade of the department than we do of the 40+ years afterwards. In those early days, each department did a typed report of its activities which was then sent to management and archived.

With the increasing use of email and, much later, digital photos, we just don’t have easy if any access to what happened month by month. As part of this 50+ Anniversary, I’m going to organize an effort to do a better job of finding and cataloging the documents, photos, and video of the department.

This should make it easier for future celebrations of the department’s history. I suspect I’m not going to make it to the 100th anniversary, but I just might get to the 75th. For the record for those who come after me, that will be in 2034.
