For performance testing, it is often necessary to import a large data set to test against. However, importing large data sets presents its own challenges. Below I want to share some tips on how to deal with those challenges.
- Begin by making backups. Not just of your current data, but also of the large data set you want to import. You may need to transform the data before importing it, and then it is useful to be able to go back to the original.
- Start with a representative subset of the large data set. This will allow you to test the import process without having to wait hours for feedback. Only when you’re convinced that everything works as expected should you import the whole large data set. A minimal sketch of taking such a subset follows this list.
- Test the limited data set end-to-end. For instance, the product I’m currently working on consists of a Content Management System (CMS, where people author content) and a Delivery System (DS, where people use the content). Data is imported into the CMS, edited, and finally published to the DS. In this situation, it is not enough to have a successful import into the CMS; the publication to the DS must also succeed.
- Automate the import. When things go wrong, you need to perform the import multiple times, and it saves time to be able to run the import with a single command. Even if the import succeeds on the first try (one can dream), you might want to redo it later, e.g. for performance testing against a new release, or when a new, even larger, data set becomes available. A sketch of such a wrapper script follows this list.
- If you need to transform the data to make the import work, make sure to put the transformation scripts under version control, just like your regular code (you do use a version control system, don’t you?). The build scripts that automate the import should also be put under version control.
- If you cannot get your hands on real-world data, you may still be able to do performance testing using generated data. The downside of this approach is that the generated data will probably not contain the exotic border cases that are usually present in real-life data. A sketch of generating such data follows this list.
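
Here is a minimal sketch of taking a representative subset, assuming the large data set is a single newline-delimited export (for example JSON Lines); the sample rate, seed, and file layout are assumptions you would adapt to your own data:

```python
#!/usr/bin/env python3
"""Take a reproducible random sample of a large newline-delimited export."""
import random
import sys

SAMPLE_RATE = 0.01  # keep roughly 1% of the records (assumption; tune as needed)
SEED = 42           # fixed seed so the same subset can be regenerated

def sample(source_path: str, subset_path: str) -> None:
    rng = random.Random(SEED)
    kept = total = 0
    with open(source_path, encoding="utf-8") as src, \
         open(subset_path, "w", encoding="utf-8") as dst:
        for line in src:  # stream line by line, never load the whole file
            total += 1
            if rng.random() < SAMPLE_RATE:
                dst.write(line)
                kept += 1
    print(f"kept {kept} of {total} records")

if __name__ == "__main__":
    sample(sys.argv[1], sys.argv[2])
```

A purely random sample may miss rare record types; if your data has distinct categories, sampling a fixed number of records per category keeps the subset more representative.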
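Automating the import can be as simple as a wrapper script that runs each step in order and stops at the first failure. The step commands below (transform.py, cms-import, cms-publish) are hypothetical placeholders for whatever your project actually uses:

```python
#!/usr/bin/env python3
"""Run the whole import pipeline with a single command."""
import subprocess
import sys

# Hypothetical commands; replace with your own transform/import/publish steps.
STEPS = [
    ["python", "transform.py", "data/original", "data/transformed"],
    ["cms-import", "--source", "data/transformed"],
    ["cms-publish", "--target", "delivery"],
]

def main() -> int:
    for step in STEPS:
        print("running:", " ".join(step))
        result = subprocess.run(step)
        if result.returncode != 0:  # stop at the first failing step
            print("step failed, aborting import", file=sys.stderr)
            return result.returncode
    print("import finished successfully")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

This wrapper script belongs under version control, right next to the transformation scripts it calls.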
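Finally, a sketch of generating synthetic data; the record structure, field names, and sizes are invented and should be shaped after your actual schema. Since generated data lacks the exotic border cases of real data, it helps to deliberately sprinkle in a few unusual records:

```python
#!/usr/bin/env python3
"""Generate a synthetic data set when real-world data is not available."""
import json
import random

SEED = 42
RECORD_COUNT = 1_000_000  # assumption; size it to match your performance target

def generate(path: str) -> None:
    rng = random.Random(SEED)
    with open(path, "w", encoding="utf-8") as out:
        for i in range(RECORD_COUNT):
            record = {
                "id": i,
                "title": f"Document {i}",
                # vary the body size so the records are not all identical
                "body": "lorem ipsum " * rng.randint(1, 200),
            }
            if i % 100_000 == 0:
                # a few deliberately odd records to partially compensate
                # for the border cases that generated data normally lacks
                record["title"] = "Ünïcödé & <odd> " + "x" * 10_000
            out.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    generate("generated-data.jsonl")
```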