Detecting Tax Fraud

Along with my colleagues Nicole White and Ying Du, I built predictive models to solve a real-world tax auditing problem.

Tax audit departments have limited resources and can audit only a very small percentage of returns. To make audits more productive, we developed two predictive models. The first flags the returns that are most likely to be fraudulent. The second builds on the first by predicting how much money the government will collect in underpaid taxes from a fraudulent return.

While we were thankful to have access to real government data, the use of real data means that many of our findings are not publicly sharable. What you see on this website is edited for confidentiality.

For our first predictive model, we used Weka to create a classification tree. The tree estimates the probability of fraud for a given return, depending on the (secret) characteristics of that return:
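To give a sense of how a classification tree turns a return into a fraud probability, here is a minimal Python sketch. It is not our Weka model: the feature names (`feature_a`, `feature_b`), the thresholds, and the leaf counts are all made up for illustration, since the real characteristics are confidential. The idea it shows is general, though: a return is routed down the tree by its feature values, and the probability is the share of fraudulent training returns in the leaf where it lands.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    # Internal nodes set feature/threshold/left/right; leaves set the counts.
    feature: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when value <= threshold
    right: Optional["Node"] = None  # branch taken when value > threshold
    fraud: int = 0                  # fraudulent training returns in this leaf
    total: int = 0                  # all training returns in this leaf


def fraud_probability(node: Node, tax_return: Dict[str, float]) -> float:
    """Walk from the root to a leaf and return that leaf's fraud rate."""
    while node.feature is not None:
        if tax_return[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.fraud / node.total


# Hypothetical tree: every name and number here is invented for illustration.
tree = Node(
    feature="feature_a", threshold=10.0,
    left=Node(fraud=5, total=100),
    right=Node(
        feature="feature_b", threshold=0.5,
        left=Node(fraud=20, total=80),
        right=Node(fraud=30, total=40),
    ),
)

print(fraud_probability(tree, {"feature_a": 12.0, "feature_b": 0.9}))
```

Sorting returns by this probability gives an audit department a ranked list to work from, instead of a random sample.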

For our second predictive model, we used Excel to find a regression equation predicting the amount of money collected from an audit of a given return. Comparing the revenue from audits chosen by this model against the revenue from auditing returns at random gives us this chart:
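The comparison against random auditing can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our Excel workbook: it fits a one-variable least-squares line (our actual equation and its inputs are confidential), and the six data points are invented. It then audits the `k` returns with the highest predicted collection and compares the money collected with what `k` randomly chosen audits would be expected to bring in.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def audit_yield(predicted, actual, k):
    """Total collected from the top-k predicted returns vs. the expected
    total from k returns picked uniformly at random."""
    ranked = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    model_total = sum(actual[i] for i in ranked[:k])
    random_expected = k * sum(actual) / len(actual)
    return model_total, random_expected


# Invented data: one hypothetical predictor and the amount collected per audit.
xs = [1, 2, 3, 4, 5, 6]
collected = [100, 300, 200, 500, 400, 600]

slope, intercept = fit_line(xs, collected)
predicted = [slope * x + intercept for x in xs]

# A real evaluation would score returns the model was not fitted on.
model_total, random_expected = audit_yield(predicted, collected, k=2)
print(model_total, random_expected)
```

Repeating the comparison for each audit budget `k` traces out the two curves that a chart of this kind plots.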

The full report, available here, describes the models in more detail.