‘AI aided the investigation by applying a journalist’s human judgment identifying a particular kind of document—like a tax return or a business plan—across the entire document trove.’
The documents came from a Mauritius law firm via a whistle-blower and concern tax avoidance schemes in Africa, the Middle East, and Asia. Quartz explains how it used machine learning to identify which of the 200,000 documents were worth a closer look.
The team trained a machine learning model to find similarities between documents; training took about 13 hours on a MacBook Pro.
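To make "finding similarities between documents" concrete, here is a minimal sketch of similarity ranking: a journalist picks one example of the kind of document they want (say, a tax return), and every other document is ranked by how closely it resembles that example. This is not Quartz's model, whose details are not given here; TF-IDF weighting with cosine similarity is swapped in as a simple stand-in, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF keeps every weight positive, even for common terms.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(seed_index, docs):
    """Rank every other document by its similarity to docs[seed_index]."""
    vectors = tfidf_vectors(docs)
    seed = vectors[seed_index]
    scored = [(i, cosine(seed, v)) for i, v in enumerate(vectors) if i != seed_index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for leaked documents; docs[0] is the journalist's example.
docs = [
    "tax return for the year showing taxable income and tax payable",
    "business plan describing the market strategy and revenue projections",
    "annual tax return listing taxable income, deductions and tax payable",
    "email confirming the meeting time for next week",
]
ranking = rank_by_similarity(0, docs)
```

Here `ranking` is a list of `(document index, score)` pairs, most similar first, so the second tax return comes out ahead of the business plan and the email. A reporter would read from the top of such a list rather than through the whole trove.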
The first hurdle was training the model on secret documents: because of the secrecy, the labelling could not be handed to third parties. The second hurdle was finding documents within documents and standardizing the analysis when the source material came in different formats.
They measured success by whether the model surfaced relevant documents that would otherwise not have been identified; they could not determine whether it found all of them.