Data analytics: A high-level introduction for accounting practitioners

By Andrew M. Bauer, Ph.D.

Editor: Annette Nellen, Esq., CPA, CGMA

Data analytics can be defined as "the process of gathering and analyzing data and then using the results to make better decisions" (Stippich and Preber, Data Analytics: Elevating Internal Audit's Value (Institute of Internal Auditors Research Foundation 2016)). Under this definition, data analytics is clearly a process that organizations have always attempted to optimize. Yet today, with so much electronic data available from various sources, the art of data analytics is more sophisticated than ever, leaving organizations at the edge of a new frontier of analysis.

So, how well-positioned are accountants to face this new frontier? To get a preliminary idea, consider the discussions and related exhibits on data analytics produced by the Institute of Internal Auditors Research Foundation (IIARF) and Grant Thornton in the book Data Analytics, cited above; while written from the perspective of an organization's internal audit function, this book applies equally well to public accounting firms and related institutions. From its analyses, two takeaways emerge: (1) At a high level, accountants understand how planning a strong data analytics function can provide value to an organization; but, (2) at a detailed level, accountants lack a practical understanding of what tasks data analytics involves and how to implement and carry out a sophisticated data analytics function.

This column aims to help remedy the latter problem by introducing general knowledge bases in which accounting-specific education in data analytics could be rooted, a topic of particular relevance for universities and colleges that seek to integrate data analytics into their programs in the next few years. In addition, the column links some specific areas of data analytics-related computer science to a business-oriented data analytics process. In providing these links, it shares examples to explain how various potential accounting-related tasks, including those of tax practitioners, could be served by data analytics tools.

A base of computer science and programming

Organization and structure are two attributes of accounting that draw many individuals to the profession. Data analytics involves adding structure to data to enable effective and efficient decision-making. Thus, in theory, accountants should make excellent data analysts. However, it is estimated that nearly 80% of enterprise data is currently unstructured (Stippich and Preber, Data Analytics, p. 7), which means that the large majority of firm-level data is not in a readily available database format.

Thus, the data analytics process of today's businesses involves at least two primary challenges: (1) collecting and categorizing voluminous data and (2) analyzing and prioritizing relevant data. Traditional accounting education models are not currently well-designed to prepare future practitioners for these two challenges, and, thus, in practice, accountants need additional knowledge and tools to continue to be excellent data analysts. While organizations currently seem to be pairing nonaccountant data scientists with accountants, and universities are hiring computer scientists to teach students and conduct research, the long-range goal is to train accountants who are also data scientists.

As a first step, then, practitioners and educators need to continue a recent emphasis on developing a common set of tools for future accountants to acquire at a university or college level. Conceptually, this change does not require any unfamiliar topics; accounting students have traditionally been required to take courses in mathematics, computer science, and statistics. Practically, however, this change requires more robust and directed courses in these areas, where the objective should be to think like a computer scientist. As discussed throughout this column, the various computer science views of data analytics involve probabilities, correlations, and matrix and vector algebra. These views are also concerned with acquiring data, whether structured or unstructured, and conducting meaningful analysis to discover patterns—and thus knowledge—in the data.

At a practical level, a computer science view will also require knowledge of computer programming or coding. This requirement presents an immediate challenge for accounting practitioners, as 77% of organizations surveyed in 2015 reported that they use Microsoft (MS) Excel as a data analytics tool for internal audit (Stippich and Preber, Data Analytics, p. 56); by contrast, various computer-assisted audit techniques are the second most used tool at 53%, and MS Access is the third most used at 37%. However, a comforting note about programming is that, although languages differ, much overlap exists in their general syntax and structure. Even Visual Basic, which is used to create macros in MS Word and Excel, relies on structure similar to common programming languages and tools such as C++, Perl, Python, R, and SQL.

Ultimately, accountants should be encouraged to move beyond simply clicking a button in existing data analytics software—including advanced software packages such as those from SAS and ACL—to understanding how to program their own tests. Given the required structure of programming languages, again, one would expect prospective accountants to be well-suited to embrace these changes in education.

To summarize, post-secondary institutions can amend their programs to provide a menu of data analytics learning options for accounting students. All students should receive a base set of courses, including computer science (e.g., programming), algebra, and probabilities. More ambitious students should have the opportunity to obtain a minor in computer science or a double major in accounting and computer science. This education could be centered on a five-step data analytics framework (Stippich and Preber, Data Analytics, p. 9): (1) Define the question; (2) obtain the data; (3) clean and normalize the data; (4) analyze the data and understand the results; and (5) communicate the results. As "defining the question" is always important, the remaining sections of this column link the last four steps with specific computer science areas and examples for accounting practice.

Text retrieval (obtain the data)
Conceptual discussion

Text retrieval can be thought of as a search-and-find exercise. Regardless of how many documents an organization has, an individual wants to be able to find the ones most relevant to a topic or query. The text-retrieval process within an organization would be similar to the underlying processes used by Google's search engine, the search bar of a personal computer, or the search options of tax content providers such as CCH and Bloomberg BNA.

These search processes use various computer science techniques, generally centered on ranking algorithms. Usually, the idea is to have the algorithm (i.e., the system) decide which documents (or websites, etc.) are most relevant to a user-initiated search (i.e., a query) of some overall collection, or population, of documents. These algorithms can rely on vector space or probability models, which make use of algebraic and probabilistic concepts available in undergraduate courses. The objective of the algorithm is to transform each document, which likely features significant qualitative data in the form of words and sentences, into quantitative vectors reflecting the frequency of all of the words in the document.

By adjusting for words that are very rare or too frequent across documents in the collection, as well as for frequency of words within the document and the relative length of documents, the algorithm can generate a list of the most relevant documents based on how closely each aligns to the words in the query. Google's algorithms also take into account external website links to and from the particular website. Any algorithm should also be evaluated for effectiveness, which often requires initial human input to assess relevance of the "hits" on a test set of the collection. However, with relevance assessed, the evaluation metrics can be fairly straightforward, including recall: the number of relevant documents retrieved divided by the total number of relevant documents in the collection.
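As a minimal sketch of the ranking idea described above, the following Python example turns a handful of hypothetical documents into weighted word-frequency vectors, scores them against a query, and computes the ratio of relevant documents retrieved to total relevant documents. The documents, the function names, and the particular weighting (term frequency scaled by document length and by the log of inverse document frequency, commonly called TF-IDF) are illustrative assumptions rather than a method prescribed by the sources cited in this column.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one {word: weight} vector per document.

    Weight = (count / document length) * log(N / number of docs with the word),
    so words found in every document receive a weight of zero.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return vectors


def rank(query, docs):
    """Score each document by summing its weights for the query words."""
    vectors = tfidf_vectors(docs)
    query_words = set(tokenize(query))
    scores = [sum(vec.get(w, 0.0) for w in query_words) for vec in vectors]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores


def recall(retrieved, relevant):
    """Relevant documents retrieved divided by all relevant documents."""
    relevant = set(relevant)
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical collection; a reviewer has judged documents 1 and 3 relevant.
documents = [
    "Invoice for consulting services, net 30 payment terms",
    "Invoice for equipment with consignment terms and right of return",
    "Employee travel policy and expense reimbursement",
    "Sales contract with consignment terms for retail inventory",
]
order, scores = rank("consignment terms", documents)
top_two = order[:2]
print("Ranking:", order)
print("Recall of top two hits:", recall(top_two, relevant=[1, 3]))
```

In practice, the human relevance judgments (here, the hard-coded list of relevant documents) would come from reviewing a test set, as the preceding paragraph notes.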

Practical examples

Text retrieval is a useful process that can allow auditors—external, internal, or other—to move beyond random sampling. Assuming that the full collection of documents is available, a well-specified query (or queries) can identify the documents that are most likely to reflect what the auditor is searching for. For example, an auditor may want to identify sales invoices with specific contractual terms and then sample only from the invoices with such terms.
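The invoice example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the invoice records, the field names, and the helper sample_matching are hypothetical, and a simple substring match stands in for a full ranking query.

```python
import random


def sample_matching(invoices, term, k, seed=0):
    """Keep invoices whose text mentions the term, then sample up to k of them.

    A fixed seed makes the selection reproducible for the audit file.
    """
    hits = [inv for inv in invoices if term.lower() in inv["text"].lower()]
    rng = random.Random(seed)
    return rng.sample(hits, min(k, len(hits)))


# Hypothetical invoice population.
invoices = [
    {"id": "INV-001", "text": "Net 30. Bill-and-hold arrangement at customer request."},
    {"id": "INV-002", "text": "Net 30. Standard shipping terms."},
    {"id": "INV-003", "text": "Bill-and-hold; goods segregated in warehouse."},
    {"id": "INV-004", "text": "Net 60. Right of return within 90 days."},
    {"id": "INV-005", "text": "Bill-and-hold arrangement, delivery deferred."},
]

selected = sample_matching(invoices, "bill-and-hold", k=2)
print([inv["id"] for inv in selected])
```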

Cluster analysis (clean and normalize the data)
Conceptual discussion

Cluster analysis is an excellent method to clean and normalize data. At its basic level, cluster analysis is simply a form of evaluating data for outliers, using common measures of central tendency: mean (average), median (middle), and mode (most common). Using these measures in conjunction with visual plots of data, one can determine whether values are missing (and extrapolate, or perhaps interpolate, new values) and whether outlying or "noisy" data should be adjusted or smoothed (Han, Kamber, and Pei, Data Mining: Concepts and Techniques (Elsevier 3d ed. 2012)).
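A short Python sketch can make these basic cleaning steps concrete. It computes the three measures of central tendency, flags outliers, and fills a missing value by linear interpolation. The data and function names are hypothetical, and the outlier rule used here (a modified z-score based on the median, with a cutoff of 3.5) is one common convention chosen for illustration, not a rule taken from the cited text.

```python
import statistics


def summarize(values):
    """Mean, median, and mode of the non-missing values."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "mode": statistics.mode(present),
    }


def flag_outliers(values, threshold=3.5):
    """Return positions whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    which are less distorted by extreme values than the mean.
    """
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mad = statistics.median([abs(v - med) for v in present])
    if mad == 0:
        return []  # no spread to measure against
    return [
        i for i, v in enumerate(values)
        if v is not None and 0.6745 * abs(v - med) / mad > threshold
    ]


def interpolate_missing(values):
    """Fill interior gaps (None) on a straight line between known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical monthly sales: one missing month, one suspicious entry.
monthly_sales = [100, 102, None, 98, 100, 1000, 101, 99]

print(summarize(monthly_sales))
print("Outlier positions:", flag_outliers(monthly_sales))
print("After interpolation:", interpolate_missing(monthly_sales))
```

Note how the single extreme value pulls the mean far above the median, which is why the sketch measures distance from the median when flagging outliers.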

At a more advanced level, one can also use cluster analysis to determine how similar or dissimilar sets of values, objects, documents, etc., are to one another. Determining similarity or dissimilarity can use simple correlations, distance measures and angles between vectors of quantitative factors, and probabilistic methods. A useful property of advanced cluster analysis is that it does not rely on humans to classify data points. As a discovery method, it can strategically partition data much more efficiently, particularly when a grouping attribute (e.g., customer location, customer industry) is not available.
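To illustrate partitioning without human-supplied labels, the sketch below implements a bare-bones version of k-means, one widely used clustering algorithm (the column itself does not name a specific algorithm). It also includes a cosine similarity helper to show the "angle between vectors" idea. The customer figures, the choice of two clusters, and the naive initialization from the first k points are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def k_means(points, k, iterations=20):
    """Assign each point to the nearest of k centroids, then recompute centroids.

    Initialization simply uses the first k points; real implementations
    typically use a more careful starting strategy.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical customers: (annual sales, order count), already scaled.
customers = [
    (1.0, 1.2), (0.9, 1.0), (1.1, 0.8),
    (8.0, 8.5), (8.2, 7.9), (7.8, 8.1),
]
labels, centroids = k_means(customers, k=2)
print("Cluster labels:", labels)
print("Similarity of first two customers:",
      cosine_similarity(customers[0], customers[1]))
```

No grouping attribute is supplied here: the two groups emerge only from the distances between the data points.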

Practical examples

Cluster analysis could allow tax practitioners to analyze their or their client organizations' potential state or other jurisdiction tax nexus. Clustering data using information on volume of sales to customers, purchases from suppliers, and employee compensation for various jurisdictions could provide insight into identifying jurisdictions where nexus could unexpectedly arise, or has arisen. This analysis could also identify jurisdictions for which changes in transfer-pricing policies could provide opportunities to optimize the organization's average global tax rate.

Pattern discovery and text mining (analyze the data and understand the results)
Conceptual discussion

Pattern discovery attempts to uncover a set of items or sequences that occur, or could occur, frequently together in a data set. Text mining takes unstructured data (that have been retrieved) and turns them into quality information that can be analyzed. Although a collection of documents, tweets, emails, or invoices can be grouped into sets, it is the data within each object (e.g., document) that are unstructured and information-rich.
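One simple form of pattern discovery is counting which items appear together most often. The Python sketch below counts pairs of accounts that co-occur in journal entries and keeps those that meet a minimum support level (the share of entries containing the pair). The journal entries, account names, and the 40% cutoff are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs found in enough transactions."""
    counts = Counter()
    for items in transactions:
        # Sort so each pair is always recorded in the same order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical journal entries, each listed by the accounts it touches.
entries = [
    {"revenue", "accounts receivable"},
    {"revenue", "accounts receivable", "sales tax payable"},
    {"revenue", "cash"},
    {"inventory", "accounts payable"},
    {"revenue", "accounts receivable", "sales tax payable"},
]

patterns = frequent_pairs(entries, min_support=0.4)
for pair, support in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(pair, support)
```

Entries that break a well-established pairing, such as revenue recorded without a matching receivable or cash account, could then be singled out for review.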

Like text-retrieval methods, pattern discovery and text mining rely on vector space or probabilistic models to represent text data as quantitative vectors or features. However, pattern discovery typically occurs after data have already been input into a data set, while text mining more often reflects both the collection and analysis of data. In addition, while both methods can transform unstructured data (e.g., text) into structured data (e.g., word counts), pattern discovery typically reflects supervised learning where the user or developer provides "knowledge" used to classify information into groups.

By contrast, a particularly intriguing aspect of text mining is that many of its applications are conducted as unsupervised learning. For example, rather than having a user direct the system by inputting a query, the system can survey all of the included documents and, after turning the text into frequency counts, determine an automated set of categories into which to classify the documents, based on words with the greatest overlap within the categorized groups. This process is known as topic modeling.
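The sketch below is a deliberately simplified stand-in for this idea, not a full topic model (production work would more likely rely on an established technique such as latent Dirichlet allocation). It converts each document to word counts and groups documents whose counts overlap strongly, with no query and no labels supplied by the user. The journal-entry descriptions, the stopword list, and the 0.3 similarity threshold are assumptions for illustration.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "for", "and", "to", "in", "on"}


def word_counts(text):
    """Turn text into a frequency count of its non-trivial words."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine(a, b):
    """Overlap between two word-count vectors, from 0.0 to 1.0."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def group_documents(docs, threshold=0.3):
    """Place each document in the most similar existing group, or start a new one."""
    groups = []  # each group holds combined word counts and member positions
    for i, text in enumerate(docs):
        counts = word_counts(text)
        best = max(groups, key=lambda g: cosine(counts, g["counts"]), default=None)
        if best is not None and cosine(counts, best["counts"]) >= threshold:
            best["members"].append(i)
            best["counts"] += counts
        else:
            groups.append({"counts": counts, "members": [i]})
    return groups


# Hypothetical manual journal-entry descriptions.
descriptions = [
    "reclassify rent expense to prepaid rent",
    "accrue bonus payable year end",
    "reclassify prepaid rent to rent expense",
    "accrue year end bonus payable adjustment",
]

groups = group_documents(descriptions)
for g in groups:
    top_words = [w for w, _ in g["counts"].most_common(3)]
    print(g["members"], top_words)
```

The most frequent words in each group serve as a rough label for the discovered category.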

Practical examples

Pattern discovery methods could be used by accountants to predict credit risk of current or new customers and account balances based on historical patterns and trends of previous credit information. This approach could supplement both accounting and tax determination of bad debts to be written off, as well as inventory obsolescence. Tax practitioners may also be able to use historical information to predict the use of net operating losses against future potential income streams. Alternatively, pattern discovery could be used to discover significant deviations from prior-year balances, which would aid in a variety of analytical procedures and reasonability checks for financial statement and tax-compliance purposes. For example, detection algorithms could be used to identify data-entry errors when manually inputting tax return information.
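A reasonability check against prior-year balances can be sketched very simply in Python. The account balances, the function name flag_deviations, and the 25% threshold below are hypothetical; a real procedure would set thresholds by account and materiality.

```python
def flag_deviations(prior, current, pct_threshold=0.25):
    """Return {account: percent change} where the change exceeds the threshold.

    Accounts missing from the current year or with a zero prior balance
    are skipped because a percent change cannot be computed for them.
    """
    flagged = {}
    for account, prior_amount in prior.items():
        current_amount = current.get(account)
        if current_amount is None or prior_amount == 0:
            continue
        change = (current_amount - prior_amount) / abs(prior_amount)
        if abs(change) > pct_threshold:
            flagged[account] = change
    return flagged


# Hypothetical balances; the utilities figure has an extra zero keyed in.
prior_year = {"rent": 12000, "utilities": 3000, "travel": 5000}
current_year = {"rent": 12300, "utilities": 30000, "travel": 4800}

print(flag_deviations(prior_year, current_year))
```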

Among its potential uses, topic modeling provides an alternative to text-retrieval methods for an auditor to identify a set of documents to examine. The auditor could automate the grouping of sales documents by topic and then sample from the groups with topics that have greater associations with certain key contract terminology. Such a procedure could also be used in conjunction with an audit of manual entries to the accounting systems, as text descriptions (or lack of descriptions) within the entries may suggest either dubious or justifiable transactions. From a tax practitioner's viewpoint, a topic model algorithm could be used to group a collection of court cases. These cases, or groups of cases, could then be more effectively evaluated for the degree to which the judicial decision advocates for or discourages a client's intended position.

Data visualization (communicate the results)
Conceptual discussion

While data visualization in theory relies on how human senses perceive and store information, it also tends to have the most practical and intuitive appeal of the data analytics areas. Software products such as Excel and Tableau are popular because they allow a user to interface with and plot data without requiring a sophisticated computer science background. Thus, while advanced data analytics techniques will be paramount for accountants in the future, the ability to communicate data easily to various audiences continues to be an important skill.

An interesting consideration of data visualization is the interface: interactive, presentation, or storytelling. Interactive visualization relies on one user, such as a data scientist accountant, working with and plotting the data (e.g., using cluster analysis) to uncover patterns. Presentation visualization makes data available to a large audience, such as in a boardroom or other meeting. In this setting, the audience does not provide input but merely observes what the presenter displays. Storytelling, or interactive storytelling, involves users who acquire the data from the data scientist; individual users cannot change the underlying data but can look for patterns themselves.

Practical examples

Linking back through the previous sections, accountants in charge of data analytics functions can take data that have been cleaned and structured and make them available to others within the organization. Information stored in database format can be accessed by numerous software programs, and thus individuals without significant data analytics training can evaluate trends using familiar software. Such visualization tools would allow supervisors to sort the data to create and review bad debt reports, maps of nexus, and so on.

A new frontier of data analytics

While the accounting profession values the importance of data analytics, the world is at a new frontier in terms of data availability and sophisticated tools to conduct such analyses. To help practitioners better understand what this new frontier entails, this column highlights some of the general and specific knowledge that goes into a computer science view of data analytics and provides examples of how the related methods can be used for tasks in accounting and tax practice. Ultimately, practitioners should be comfortable as consumers of the results of data analytics exercises and should be advocates for changes in accounting education that focus on computer science, which will allow future accounting practitioners to be both producers and consumers of data analytics information.

This column is based, in part, on a presentation the author made at the February 2017 midyear meeting of the American Taxation Association in Phoenix.



Andrew M. Bauer is an assistant professor of accountancy at the University of Illinois at Urbana-Champaign in Champaign-Urbana, Ill. He will become an assistant professor at the University of Waterloo in Waterloo, Ontario, Canada, in June. Annette Nellen is a professor in the Department of Accounting and Finance at San José State University in San José, Calif. For more information about this column, please contact

