How to Get Started with Legal Data Science
Overview
People often ask me how to get started with Legal Data Science.
This tutorial offers a basic introduction to the subject and gives reasons why you should not be satisfied with the traditional focus on literary methods in legal methodology.
Most importantly, it offers a large number of Open Access resources to help you on your way, whether you are looking to become a full-fledged Legal Data Scientist or just want to complement your traditional skill set with some computational approaches to the law.
Check back every once in a while, as I will add resources when I come across them.
What is Legal Data Science?
Data Science can be described as the intersection of three different disciplines (Conway 2013):
- Programming
- Statistics (and other math)
- Domain Expertise (e.g. in law)
Legal Data Science is the application of computational statistical methods to the legal domain.
Sometimes it is also called “Computational Legal Studies” or “Quantitative Legal Studies”. You will also hear the term “data analysis” a lot. There is much overlap with “Empirical Legal Studies” (ELS), although ELS is a wider field and more welcoming to qualitative scholars. Legal Data Science tends to follow a primarily quantitative approach.
Because domain expertise is so important to data science, it could be said that lawyers are already one-third Legal Data Scientist. However, programming and statistics are rarely taught at law schools and are quite intimidating if you have only been trained in the traditional literary methods of legal doctrine. So how do you approach computers and math?
I have two clear recommendations:
- Read general Data Science literature (much of it is free and Open Access!)
- Work on your own computational projects as soon as possible (NOT necessarily in public!)
General literature on Data Science will teach you programming techniques and basic statistics at the same time. Many excellent books are available free and Open Access on the Web; see below for some recommendations with links [1]. Good statistics books tend to be commercially available only, but recently I’ve been seeing more Open Access statistics texts as well. That being said, you only need to start working with dedicated statistics textbooks once you reach an advanced level. Routine descriptive statistical methods like frequency tables and histograms go a long way.
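To give a sense of how little code a frequency table needs, here is a minimal sketch in Python, chosen purely for illustration (base R covers the same ground with functions such as `table()` and `hist()`). The list of decision types is invented toy data and the function names are hypothetical. Because the values are categories rather than numbers, the chart is a simple text bar chart, the categorical cousin of a histogram.

```python
from collections import Counter

# Hypothetical toy data: the type of each decision in a small, made-up sample.
decision_types = [
    "judgment", "order", "judgment", "advisory opinion",
    "order", "judgment", "order", "judgment",
]


def frequency_table(values):
    """Return (value, count, share) rows, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common()]


def text_bar_chart(values):
    """Return one line per value, with one '#' per occurrence."""
    return [f"{value:<18}{'#' * n}" for value, n, _ in frequency_table(values)]


# Print the frequency table: value, absolute count, relative share.
for value, n, share in frequency_table(decision_types):
    print(f"{value:<18}{n:>3}{share:>8.1%}")

# Print the bar chart below it.
print("\n".join(text_bar_chart(decision_types)))
```

Even this much already answers useful questions about a dataset, such as which category dominates and how unevenly the categories are distributed.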
Why start your own projects even before you are good at any of this? The reason is simple: you have to write a lot of bad code before you are able to write good code. Also, the more useful the problems you work on are to your own interests, the more motivated you will be. No one becomes an expert in a day. Building expertise in any field takes time, sustained practice and determination.
Why learn Programming and Statistics?
Why should you bother with learning how to program and how to interpret complicated statistics? You are a lawyer! You don’t want to become a programmer or a statistician! [2] Perhaps you could simply add some AI filler talk to your papers and slides and call it a day? Works for most people and they still get invited to “innovation events” and are hailed as “thought leaders”, right?
This is why:
- Walk the walk instead of just talking the talk. Regulating the information society is important, but the digital transformation is driven by code and math, not by clever legal arguments. If you want to be more than a sad AI cheerleader on LinkedIn, you have to understand how digital technology works and perhaps build some of it yourself.
- A little knowledge is a dangerous thing, but willful ignorance even more so. Lawyers often want to use statistics-based empirical knowledge in circumstances that have significant real-world consequences. And often they cannot avoid doing so, e.g. most judges must decide cases brought before them [3]. If you do so, make sure you have at least some basic competence in what you are attempting. Willful ignorance is dangerous, because you don’t understand how much you don’t know. Even a little knowledge can help you understand your limits, make better decisions and find better experts for the problems you don’t understand.
- Maintain the autonomy of the legal discipline. The challenges of modernity (COVID, climate change, ChatGPT) are inescapable. You can either withdraw into your lovely doctrinal corner and uncritically accept what experts from other disciplines (if they will talk to you) tell you about the state of the world, or you can acquire enough expertise to critically evaluate current empirical and technical knowledge yourself.
The R Programming Language
Introductions to R
The Pirate’s Guide to R is very entertaining and I highly recommend it. In fact, I started out with it myself. R for Data Science is the classical open textbook.
- Phillips, YaRrr! The Pirate’s Guide to R (2018)
- Wickham and Grolemund, R for Data Science, 1st edition (2017)
- Wickham, Cetinkaya-Rundel and Grolemund, R for Data Science, 2nd edition (2023)
- Alschner, Data Science for Lawyers (2022)
Reference Works
Many classical problems in Data Science have classical solutions that are listed in “cookbooks” or similar overviews. The R Cookbook is incredibly useful, so don’t miss it. From Data to Viz and the R Graph Gallery are my first ports of call if I want to try a new type of diagram.
- Holtz, From Data to Viz
- Holtz, R Graph Gallery - Help and inspiration for R charts
- Long & Teetor, R Cookbook (O’Reilly 2019, 2nd ed)
- Chang, R Graphics Cookbook (O’Reilly 2023)
Using R
While you can use R directly in the command line, almost no one does this day-to-day. An integrated development environment (IDE) assists you with advice, code snippets, syntax highlighting and many other helpers that make your life easier.
For beginners I strongly recommend the R Studio IDE. Check out the Code section on my website for my personal workflow and tools. R Studio is also available as a cloud solution.
- Local installation of R and R Studio: Installing R and R Studio [My recommendation]
- R in the cloud: Posit Cloud [free accounts available]
- R in the browser (local): WebR [useful for demos]
- R in the browser (remote): MyCompiler [useful for demos]
Advanced R Programming
- Wickham, ggplot2: Elegant Graphics for Data Analysis (Springer 2023, 3rd ed)
- Wickham, Advanced R (Chapman & Hall 2019)
- Baruffa, The Big Book of R (2023)
- Burns, The R Inferno (2011)
Statistics
Introduction to Statistics
- Agresti, Statistical Methods for the Social Sciences (Pearson 2017, 5th ed)
- McElreath, Statistical Rethinking (CRC Press 2020, 2nd ed)
- Johnson, Ott and Dogucu, Bayes Rules! An Introduction to Applied Bayesian Modeling (CRC Press 2022)
Causal Inference
- Ho & Rubin (2011). Credible causal inference for empirical legal studies. Annual Review of Law and Social Science, 7(1), 17-40.
- Huntington-Klein, The Effect: An Introduction to Research Design and Causality (Chapman and Hall 2022)
- McElreath, Statistical Rethinking (CRC Press 2020, 2nd ed)
- Pearl & Mackenzie, The Book of Why (Basic Books 2018)
Network Analysis and Graph Theory
- Zweig, Network Analysis Literacy (Springer 2016)
Open Legal Data
So you’re working hard on the methods, but where do you get the legal data?
Fortunately for you, I’ve published more than a dozen Open Access legal datasets under permissive licenses with open source code in the open data section on my website.
There you’ll find many ready-to-use datasets on international law (ICJ, PCIJ, UNSC) and German law (German federal courts, federal laws) for you to hone your new skills with. Almost all data sets are corpora, that is, data sets composed of texts with associated metadata (usually texts of court decisions, but also UN Security Council resolutions and parliamentary texts).
Also, the Liquid Legal Institute maintains a comprehensive collaborative list of legal data sets from many different authors on GitHub.
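To make the idea of a corpus (texts with associated metadata) more concrete, here is a minimal sketch in Python, used purely for illustration. Everything in it is hypothetical: the `Document` class, its field names and the three placeholder entries are invented and do not reflect the structure or content of any of the datasets mentioned above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    """One corpus entry: the full text plus metadata describing it."""
    doc_id: str
    court: str
    decided: date
    text: str


# Hypothetical placeholder entries; IDs, courts, dates and texts are invented.
corpus = [
    Document("doc-001", "Court A", date(2001, 5, 14),
             "The application is admissible."),
    Document("doc-002", "Court B", date(2010, 11, 3),
             "The appeal is dismissed."),
    Document("doc-003", "Court A", date(2019, 2, 27),
             "The Court finds that it has jurisdiction."),
]


def filter_by_court(docs, court):
    """Select documents using metadata only, without reading the text."""
    return [doc for doc in docs if doc.court == court]


def word_count(doc):
    """Count whitespace-separated tokens in the text of a document."""
    return len(doc.text.split())


# Combine both layers: pick documents by metadata, then measure their text.
for doc in filter_by_court(corpus, "Court A"):
    print(doc.doc_id, doc.decided.year, word_count(doc))
```

The point of the design is the separation of layers: metadata lets you select and group documents cheaply, while the text is what you actually analyze.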
Why R and not Python?
Why recommend only materials related to R and not Python?
Python and R are the two most important programming languages in Data Science. Python is certainly the more popular language of the two.
The primary reason why this overview focuses on R is a pragmatic one: I work with R all the time and very little with Python. I can give some fairly good advice on using R, but not on Python.
There are other reasons:
- R was developed for non-programmers, whereas Python requires you to be a lot more comfortable with a computer and is therefore quite popular with computer scientists
- The R community is very open and friendly and is composed primarily of non-statisticians
- R is widespread in certain academic disciplines that interest me most (statistics, peace research, political science, psychology)
- R has the better ecosystem in terms of high-end statistics and data visualization, whereas Python has the better machine learning ecosystem
- Many routine statistical methods are already built into R, whereas with Python you first have to find, install and learn additional packages (i.e. extensions)
- The dependency management systems in Python are a complete mess (see XKCD No. 1987)
So which language is the best for you?
For most lawyers — meaning those without significant Data Science ambitions — the answer is easy: try both and stick with the language that feels best to you and is easiest to use. Usability in this sense also includes the available IDEs (e.g. R Studio vs PyCharm), the quality of package documentation and the opportunity to ask questions and receive helpful replies (whether online or from people you work with).
If you are pursuing a career in Data Science, the question is more important and the answer depends on whom you want to work with. If you intend to develop machine learning applications with computer scientists, then Python is probably better. If you model empirical phenomena with political scientists or psychologists, then R is probably better.
That being said, choosing the “wrong language” when you start out is less fatal than you might think. You can learn a second programming language much faster than your first one, because many core programming techniques (loops, functions, formal logic, set theory, REGEX) are not language-specific [4]. The same goes for mathematical techniques: math is the same math everywhere.
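As a rough illustration of these shared building blocks, the sketch below combines a regular expression, a function, a loop and two set operations. It is written in Python only as an example; base R has direct counterparts such as `for` loops, `function()`, `intersect()`, `union()` and `gregexpr()` with `regmatches()`. The two sentences and the citation pattern are invented for this example and far too simple for real citation extraction.

```python
import re

# Hypothetical pattern: "Art." + article number + abbreviation of the statute.
ARTICLE_PATTERN = re.compile(r"Art\.\s*(\d+)\s+([A-Z][A-Za-z]+)")


def cited_articles(text):
    """Return the set of (article number, statute) pairs cited in a text."""
    return {
        (int(number), statute)
        for number, statute in ARTICLE_PATTERN.findall(text)
    }


# Invented example sentences, not quotations from real decisions.
texts = [
    "The claim is based on Art. 2 GG and Art. 5 GG.",
    "The court relied on Art. 5 GG as well as Art. 14 GG.",
]

# Loop: apply the same function to every text.
citations_per_text = []
for text in texts:
    citations_per_text.append(cited_articles(text))

# Set theory: intersection (cited in both) and union (cited in either).
cited_in_both = citations_per_text[0] & citations_per_text[1]
cited_in_any = citations_per_text[0] | citations_per_text[1]

print(sorted(cited_in_both))
print(sorted(cited_in_any))
```

Only the surface syntax would change in another language; the pattern, the idea of wrapping a step in a function and the set logic carry over as they are.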
[1] Many Open Access books can also be bought as a hard copy or commercial e-book. If you like the book, consider buying it to support the author(s) and show your appreciation for their work.
[2] But if you do, props to you!
[3] Supreme Courts and Constitutional Courts usually have wide discretion in the cases they accept, but the greater the real-world consequences, the more likely it is that even they are forced to deliberate on a case. In other words, the more empirical the problem, the less an apex court can avoid dealing with it.
[4] The implementations will differ in syntax and details, but the concepts as such are the same and can be learned and understood in the abstract.