Data cleaning is a tedious but crucial prerequisite for many digital humanities projects. Whether due to inconsistencies in the original data or mistakes created during data entry, data may contain problems like inconsistent spellings and abbreviations or mismatched date formats. These errors can trip up analytical tools and prevent a researcher from moving forward with analysis.
Examples of data cleaning activities include:
- Standardizing abbreviations and spellings
- Cleaning up OCR’d data
- Preparing geodata: formatting a list of addresses so they can be geocoded in bulk
- Preparing corpora: splitting a flowing text into separate text files with associated metadata
Data cleaning can be time-consuming, but the reward is a data set that can be used easily with a variety of analytical tools. The tools and best practices outlined below will help you speed up repetitive tasks and ensure your data is portable and reusable.
You might want to get in touch with a digital humanities consultant to discuss your project before you get started. A preliminary consulting session can help you identify some best practices for your particular data set given your project’s end goal. How you decide to structure your data as you clean and prepare it may affect the type of analysis you can perform on it (read about an example of this on the DH blog). For an intensive introduction to data cleaning tools, attend our weeklong intensive, “Data Workflows and Network Analysis” at the Digital Humanities at Berkeley Summer Institute.
Before you begin
Before you start manipulating your data set, consider these best practices to avoid headaches later in the process.
- Save your files in the UTF-8 encoding to ensure your documents maintain their integrity and are usable with a variety of tools. This is of greatest concern to scholars working with non-Roman scripts, but text encoding can cause issues for a variety of scholars. Read a short introduction to the Unicode standard.
- When possible, opt for human-readable file formats over proprietary, binary file formats (e.g. choose .csv over .xlsx for spreadsheets; choose .txt over .doc for text files). Proprietary formats, such as those created by Microsoft Excel and Word, contain extraneous binary code that makes them less transparent and harder to manage over the years. Human-readable formats, like .csv, can be used directly by many tools and can be opened and manipulated with a simple text editor.
- Save multiple copies of your research data
- Each time you perform a new operation to transform your data, save a new copy of the file and give it a detailed and informative name (e.g. 2016-02-24-Yearbook-datetransform)
- Back up your data often. Take advantage of free cloud storage solutions like Google Drive or Box.
- Consider sharing your data on GitHub, publishing it on your scholarly blog, or sharing it on a research website.
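The encoding and file-format advice above can be put into practice in a few lines of Python. This is a minimal sketch using hypothetical sample records (not from any real data set); the file name follows the dated naming convention suggested earlier:

```python
import csv

# Hypothetical research data: a header row plus two records
rows = [
    ["name", "year", "place"],
    ["Đặng Thai Mai", "1902", "Nghệ An"],  # non-Roman characters survive in UTF-8
    ["José Martí", "1853", "La Habana"],
]

# Write a plain-text .csv file, explicitly requesting the UTF-8 encoding
with open("2016-02-24-example-data.csv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading it back with the same encoding preserves every character
with open("2016-02-24-example-data.csv", encoding="utf-8", newline="") as f:
    restored = list(csv.reader(f))

print(restored[1][0])  # Đặng Thai Mai
```

Passing `encoding="utf-8"` explicitly, rather than relying on the platform default, is what keeps the file portable between tools and operating systems.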
Regular expressions (also known as “regex”) are useful for speeding up repetitive tasks in data cleaning. Regular expressions allow a user to describe a set of patterns (e.g. every set of three digits; all lines that begin with a capital letter). Regular expressions can be used in several contexts. For example, many text editors, like Sublime Text, allow you to use regular expressions with their Find and Replace features. You can also use regular expressions in Python, using the re module.
Regular expressions come in handy when cleaning up data that has recurring errors. For example, a data set might have improperly formatted dates (e.g. day-month-year format instead of month-day-year format). Using regular expressions in Sublime Text, you might use Find and Replace to identify all dates in the file and change them to the appropriate format.
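The date-reformatting task just described can be sketched with Python’s re module. The text below is a hypothetical example; the pattern assumes dates written as two-digit day, two-digit month, four-digit year:

```python
import re

# Hypothetical messy text with dates in day-month-year order
text = "Letters sent 24-02-1916 and 03-11-1917 from the front."

# Capture day, month, and year as separate groups...
pattern = re.compile(r"\b(\d{2})-(\d{2})-(\d{4})\b")

# ...then reorder the captured groups into month-day-year
fixed = pattern.sub(r"\2-\1-\3", text)

print(fixed)  # Letters sent 02-24-1916 and 11-03-1917 from the front.
```

The same pattern and replacement work in Sublime Text’s Find and Replace dialog when regular-expression mode is enabled.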
- RegexOne: an interactive introduction to regular expressions
- “Understanding Regular Expressions”, The Programming Historian: an applied introduction to regular expressions
- “Cleaning OCR’d text with Regular Expressions”, The Programming Historian: a tutorial using regular expressions in Python
OpenRefine (formerly known as Google Refine) is a free, open-source tool for cleaning, transforming, and exploring data. It accepts common spreadsheet formats such as .xlsx and .csv. One of OpenRefine’s most helpful features is its ability to track and keep a record of changes to the data. If you make a mistake while cleaning your data, you can rewind to any step in the data transformation process. OpenRefine can also identify duplicate entries, cluster variations of the same word (e.g. history, HISTORY, and History can be merged into one label), and remove blank entries.
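OpenRefine’s clustering happens through its graphical interface, but the idea behind its simplest method — grouping values that collide on a normalized key — can be sketched in Python. This is an illustrative toy with hypothetical column values, not OpenRefine’s actual (more sophisticated) fingerprinting algorithm:

```python
from collections import Counter

# Hypothetical column values with inconsistent capitalization
values = ["history", "HISTORY", "History", "History",
          "biology", "Biology", "biology"]

# Group variants under a normalized key (here, simple case-folding)
clusters = {}
for v in values:
    clusters.setdefault(v.casefold(), []).append(v)

# Merge each cluster into its most frequent spelling
merged = [Counter(clusters[v.casefold()]).most_common(1)[0][0]
          for v in values]

print(merged)  # ['History', 'History', 'History', 'History',
               #  'biology', 'biology', 'biology']
```

OpenRefine lets you inspect each proposed cluster and pick the merge target by hand, which is usually safer than merging automatically as this sketch does.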
- Thomas Padilla, “Getting Started with OpenRefine”: a great applied introduction to OpenRefine basics with DH use cases
- Available via the library: Verborgh and De Wilde, Using OpenRefine: a comprehensive guide to using OpenRefine.