‘AI aided the investigation by applying a journalist’s human judgment identifying a particular kind of document—like a tax return or a business plan—across the entire document trove.’
A cache of documents about tax avoidance schemes in Africa, the Middle East, and Asia came from a Mauritius law firm via a whistleblower. Quartz explains how it used machine learning to identify which of the 200,000 documents were worth a closer look.
- How they did it – A machine learning model was trained to find similarities between documents, which took about 13 hours on a MacBook Pro.
- How they measured success – By finding relevant documents that otherwise wouldn’t have been identified (they couldn’t determine whether they had found all of them).
- Hurdle one: Training computers on secret documents – Overcoming the challenge of labelling documents when secrecy meant they couldn’t use third-party help.
- Hurdle two: Documents within documents – Standardizing analysis when the source material is in different formats.
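
To make the idea of finding "more documents like this one" concrete, here is a minimal sketch in Python. It is not Quartz's model: it swaps in plain TF-IDF word weighting with cosine similarity rather than a trained machine learning model, and the function names and the three toy documents are hypothetical, invented only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict of word -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Count how many documents each word appears in.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF: words found in fewer documents get more weight.
        vectors.append({
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(example, trove):
    """Return (score, index) pairs for trove documents, most similar first."""
    vectors = tfidf_vectors([example] + trove)
    query, rest = vectors[0], vectors[1:]
    scores = [(cosine(query, vec), i) for i, vec in enumerate(rest)]
    return sorted(scores, reverse=True)


# Hypothetical toy data standing in for a journalist-chosen example
# and a (much larger) document trove.
example_tax_return = "tax return income declared taxable profit for the fiscal year"
trove = [
    "minutes of the board meeting held at the office",
    "annual tax return showing taxable income and profit for the year",
    "business plan for expansion into new markets",
]

ranking = rank_by_similarity(example_tax_return, trove)
best_index = ranking[0][1]
print(trove[best_index])
```

The journalist's judgment enters through the choice of example document; the ranking only surfaces candidates for a human to read, which matches the "worth a closer look" framing above.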