How Much Data Do I Need?
As an analyst and evaluator, the question I am asked most often is, “How much data do I need for this analysis?” The short answer: you need enough usable, representative data to answer your questions. Here’s how you get there.
It’s common to focus on a survey’s response rate, or to count rows of data, as a way to show we have enough data. When I teach Management Concepts’ five-day Analytics Boot Camp course, I run an activity using marbles to illustrate the importance of knowing how representative your data is of your whole target group. If I draw all solid-colored marbles, but I know there are cats-eye marbles in the bag, I might have to keep selecting marbles to find one. If they’re all at the bottom of the bag and I never see them, is that okay?

Here are some things you should find out before you collect any data, whether from people or from databases:
- How many should I have in my target group? How many do I have in the sample I’ve selected?
- What are the characteristics of the target group (age, education, years of service, and so on)? If you can imagine a category, think about the possibilities within it and write them out. Do those characteristics look the same, proportionately, in my sample and in the target group?
- If someone were making a decision about me based on this information, would I be comfortable with it? What might cause me to be concerned?
- What constitutes “enough” data in this situation? (Hint: Ask your stakeholders before you start collection.)
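The proportional comparison in the checklist above can be sketched in a few lines of Python. The categories and counts here are hypothetical, made up purely for illustration:

```python
# Compare category proportions between the full target group and a sample.
# Categories, counts, and the 10% flag threshold are all hypothetical.

def proportions(counts):
    """Convert raw category counts into proportions of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

target_group = {"0-5 yrs": 350, "6-15 yrs": 400, "16+ yrs": 250}  # whole population
sample       = {"0-5 yrs": 60,  "6-15 yrs": 30,  "16+ yrs": 10}   # who responded

for category in target_group:
    t = proportions(target_group)[category]
    s = proportions(sample)[category]
    flag = "  <-- check representativeness" if abs(t - s) > 0.10 else ""
    print(f"{category}: target {t:.0%}, sample {s:.0%}{flag}")
```

If the sample’s proportions drift far from the target group’s, you may be holding a bag of solid-colored marbles when you needed a cats-eye.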
These analytic questions apply whether you are conducting an employee satisfaction survey or pulling large volumes of data from a government database. In many cases, you can’t get everything you want, and having all the data may be impossible (think about Department of Labor datasets, for example).
I often see people collect data that looks nothing like what they intended – they want a cats-eye marble and have only solid-colored marbles. It’s hard to answer anyone’s questions when you don’t have what you need. Just because you have access to some marbles doesn’t mean they are the marbles you need for your analysis.
Usable Data
Not all data is clean or fit for analysis. People responding to surveys may skip questions, leave out information, or answer in a way that is unclear. We can’t guess what they meant, so that data has to be excluded from analysis. In the cleaning and screening process (where analysts examine all the details and decide what can actually be used), as much as 20% of the data can turn out to be unusable, depending on how it was collected – and that affects representativeness. Data you thought was sufficient may fall short once the incomplete, error-laden, or otherwise compromised records (a software error, for example) are removed. You have to plan ahead for some of the data being unusable for reasons beyond your control.
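That screening step can be sketched simply. This is a minimal illustration, assuming survey rows stored as dictionaries; the field names and rows are hypothetical:

```python
# Screen raw survey rows and measure how much of the data is actually usable.
# Field names and example rows are hypothetical, for illustration only.

REQUIRED_FIELDS = ["satisfaction", "years_of_service"]

def is_usable(row):
    """A row is usable only if every required field is present and non-blank."""
    return all(row.get(field) not in (None, "") for field in REQUIRED_FIELDS)

raw_rows = [
    {"satisfaction": 4, "years_of_service": 12},
    {"satisfaction": None, "years_of_service": 3},  # skipped a question
    {"satisfaction": 5, "years_of_service": ""},    # blank answer
    {"satisfaction": 2, "years_of_service": 7},
]

usable = [row for row in raw_rows if is_usable(row)]
print(f"{len(usable)} of {len(raw_rows)} rows usable "
      f"({1 - len(usable) / len(raw_rows):.0%} lost to cleaning)")
```

Real screening rules are usually richer than a presence check, but even this simple pass shows how quickly the usable count can fall below what you collected.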
Tips for Collecting the Right Amount of Data
- Know what your data should look like when you get it (based on the categories that describe your source – age, total number of employees, ethnicity, gender, years of service, location, and so on) so you can tell when it falls outside the range you expect.
- Plan to collect more data than you need in case you have to remove up to 20% of it during cleaning. That means that if you need 100 usable rows of data for analysis, you should probably start with at least 125 rows, since 125 × 0.8 = 100 (starting with 120 would leave you with only 96 after a 20% loss).
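Working backward from an expected loss rate makes that arithmetic explicit. A quick sketch (the function name and loss rate are illustrative, not a fixed rule):

```python
import math

def rows_to_collect(usable_needed, expected_loss_rate):
    """How many rows to collect so that, after losing expected_loss_rate
    of them to cleaning, at least usable_needed rows remain."""
    return math.ceil(usable_needed / (1 - expected_loss_rate))

print(rows_to_collect(100, 0.20))  # 100 / 0.8 -> collect 125 rows
```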
- Calculate your response rate in advance so you know what number is workable. Yes, you can find a response rate table on the Internet that tells you the number of responses needed at a 95% confidence level. Great! Be careful, however: those guidelines don’t account for data removed in the cleaning process or for the representativeness of your data set. For example, using one of these tables at a 95% confidence level, a target group of 70 people requires at least 59 usable responses – an 84% response rate, if you can get it. If only 30 people reply (or a system error leaves you able to use only 30 of the 59 responses), you cannot be confident in your results, because that is only a 43% response rate.
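The table values in the example above are consistent with the standard finite-population sample-size formula popularized by Krejcie and Morgan. A sketch of that calculation, assuming a 95% confidence level, a 5% margin of error, and maximum variability (p = 0.5) – the usual defaults behind such tables:

```python
def required_sample(population, chi_sq=3.841, p=0.5, margin=0.05):
    """Finite-population sample size (Krejcie & Morgan style):
    n = X^2 * N * p * (1-p) / (d^2 * (N-1) + X^2 * p * (1-p))
    where X^2 = 3.841 is chi-square at 95% confidence, 1 df."""
    numerator = chi_sq * population * p * (1 - p)
    denominator = margin**2 * (population - 1) + chi_sq * p * (1 - p)
    return round(numerator / denominator)

print(required_sample(70))  # 59 usable responses, matching the table value
```

Remember the article’s caveat: this tells you how many *usable* responses you need, not how many to collect – cleaning losses and representativeness are on top of this number.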
Not many people talk about these issues in a non-technical way, and the representativeness and usability of data are often linked closely with data visualization (once you see the data in charts, you start to question what’s going on). Here are a few blog posts I have found helpful for understanding the complexity of the topic (these are not endorsements, just references whose explanations I found useful):
Evergreen Data: http://stephanieevergreen.com/dataviz-inequality_pt1/
Stephen Few: http://www.perceptualedge.com/blog/
About the Author: Stephanie Fuentes
Stephanie Fuentes serves as an instructor for Management Concepts’ Analytics Boot Camp and HR Analytics courses. In this role, she teaches Federal professionals to make data-based decisions and the mechanics of analytics that will empower them to be better data consumers. Stephanie has worked with staff in the Departments of Agriculture and Energy in her day job as a consultant. She has experience with analytics and workforce learning and development in a variety of industries. Stephanie holds an MA.Ed. in Instructional Technology from the University of Colorado at Denver, and an MBA in Operations Management and a Ph.D. in Organizational Learning, both from the University of New Mexico. In her free time, she enjoys spending time in nature with her family, gardening, and sewing her own clothes.