[Update] ICJ and PCIJ Corpora

Notable Changes

September saw scheduled updates published for the Corpus of Decisions: International Court of Justice (CD-ICJ) and the Corpus of Decisions: Permanent Court of International Justice (CD-PCIJ). In addition to a number of minor bug fixes and enhancements, both corpora were fully recompiled with Tesseract 5, yielding an even better OCR rendering of old decisions.

The reduction in unique tokens for the English version of the CD-ICJ, measured against the original OCR published by the Court, now stands at 51.15%, up from the 50.19% reported in the Journal of Empirical Legal Studies. Faulty OCR can produce a large number of unique or rare tokens that may bias or otherwise impede statistical analysis, particularly with bag-of-words models. Please see the JELS article for more on OCR quality control. Improvements in the French version of the CD-ICJ and in both language versions of the CD-PCIJ are comparable.
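For readers who want a feel for how a unique-token reduction figure of this kind is calculated, here is a minimal Python sketch. It is an illustration only, not the pipeline used to build the corpora: the tokenizer (lowercased runs of letters) and the function names `unique_tokens` and `unique_token_reduction` are assumptions made for this example, and the JELS article remains the authoritative description of the actual method.

```python
import re


def unique_tokens(texts):
    """Collect the distinct lowercased word tokens found in a list of texts.

    Assumption for this sketch: a token is any run of letters. The corpus
    pipeline may tokenize differently.
    """
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"[^\W\d_]+", text.lower()))
    return vocab


def unique_token_reduction(original_texts, improved_texts):
    """Percentage by which the vocabulary shrinks from original to improved OCR."""
    before = len(unique_tokens(original_texts))
    after = len(unique_tokens(improved_texts))
    if before == 0:
        raise ValueError("original texts contain no tokens")
    return 100.0 * (before - after) / before


# Toy data: two OCR errors ("tbe", "Gourt") inflate the vocabulary.
original = ["The Court finds tbe Gourt has jurisdiction"]
improved = ["The Court finds the Court has jurisdiction"]
print(f"{unique_token_reduction(original, improved):.2f}%")
```

In the toy data, each misread word adds a spurious entry to the vocabulary, which is how faulty OCR inflates the number of unique tokens in a bag-of-words model.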

Also, the CD-ICJ now includes cases up to General List No 183 (Germany v Italy) and documents published before September 2022.

For full details, please see the changelog on the relevant Zenodo pages and in each Codebook.