- [Open Access] Corpus of Decisions: International Court of Justice (CD-ICJ)
- [Open Access] Corpus of Decisions: Permanent Court of International Justice (CD-PCIJ)
- [Open Access] Introducing Twin Corpora of Decisions for the International Court of Justice (ICJ) and the Permanent Court of International Justice (PCIJ) (JELS 2022)
September saw scheduled updates published for the Corpus of Decisions: International Court of Justice (CD-ICJ) and the Corpus of Decisions: Permanent Court of International Justice (CD-PCIJ). In addition to a number of minor bug fixes and enhancements, both corpora were fully recompiled with Tesseract 5, yielding an even better OCR rendering of old decisions.
For the English version of the CD-ICJ, the reduction in unique tokens relative to the original OCR published by the Court now stands at 51.15%, up from the 50.19% reported in the Journal of Empirical Legal Studies. Faulty OCR can produce a large number of unique or rare tokens that may bias or otherwise impede statistical analysis, particularly bag-of-words models. Please see the JELS article for more on OCR quality control. Improvements in the French version of the CD-ICJ and in both language versions of the CD-PCIJ are comparable.
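To make the unique-token metric concrete, here is a minimal Python sketch of how such a reduction could be computed for two OCR renderings of the same text. The tokenization rule (lowercased runs of letters), the function names, and the toy sentences are assumptions made for illustration only; the procedure actually used for the corpora is described in the JELS article and the Codebooks and may differ.

```python
import re


def unique_tokens(text: str) -> set[str]:
    """Return the set of distinct lowercased word tokens (assumed tokenization)."""
    # Runs of Unicode letters only; digits, underscores and punctuation split tokens.
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def unique_token_reduction(original: str, improved: str) -> float:
    """Percentage drop in the number of unique tokens from `original` to `improved`."""
    before = len(unique_tokens(original))
    after = len(unique_tokens(improved))
    if before == 0:
        raise ValueError("original text contains no tokens")
    return 100.0 * (before - after) / before


# Toy example: two OCR misreadings of "Court" inflate the vocabulary.
noisy = "The Court finds that the Gourt has jurisdiction and the Conrt so rules"
clean = "The Court finds that the Court has jurisdiction and the Court so rules"

reduction = unique_token_reduction(noisy, clean)
print(f"Unique tokens reduced by {reduction:.2f}%")
```

In this toy case the misread forms "Gourt" and "Conrt" each count as a separate vocabulary entry, which is the kind of inflation that can distort bag-of-words models.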
Also, the CD-ICJ now includes cases up to General List No 183 (Germany v Italy) and documents published before September 2022.
For full details, please see the changelog on the relevant Zenodo pages and in each Codebook.