I recently had an interesting discussion about escaping user-submitted data. By escaping, I mean modifying the data so that any characters or symbols that are special to a given medium are taken literally instead of being processed.
Here’s a very common scenario: Your website allows users to submit text which you save in your database and later display for other users to see. It could be comments, descriptions, names, addresses, etc.
You want to make sure that they don’t mess around with your system (security is important, right?), so you escape any markup they include in their text. Escaping makes sure that the </body> tag they put in by “mistake” or that accidental <script>…</script> tag won’t fire when the text is displayed inline.
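To make that concrete, here is a minimal sketch in Python using the standard library’s html.escape. The comment text and the variable names are made up for illustration.

```python
import html

# A hypothetical comment containing markup the user "accidentally" typed.
comment = 'Nice post! <script>alert("hi")</script>'

# html.escape replaces &, <, > and (by default) quote characters with
# HTML entities, so a browser shows the text instead of running it.
escaped = html.escape(comment)

print(escaped)
# Nice post! &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

The browser renders the escaped string as the literal text the user typed, script tag and all, without executing anything.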
That brings me to my big question… when should you escape this data?
On the way in or the way out?
Escaping on the way in, for example before the text is saved in the database, might seem safer because you’re sure that whichever way the data leaves your system (or whichever developer uses it), it will be escaped. But I think that’s the wrong approach.
I’ve always had the belief that your user-inputted data is sacred and should always be stored in its original form. You never know what you’ll be doing with it later.
Why does that help us here?
It’s because escaping is different for each output medium, and you might not know ahead of time which medium that will be!
Let’s go back to our original example. Let’s say you decided to escape any HTML tags before saving it to the database. Output should be easy now. You just dump the text on your webpage and call it a day.
But a few days later, your boss tells you a new output format should be supported in your system – for example, CSV files.
If you dump that escaped text into the CSV file, it will be escaped for HTML, which is not what we want. Not only that, CSV files require their own escaping. You also risk double escaping, where text that was already escaped for one medium gets escaped again for another.
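Here is a rough sketch of what goes wrong, again in Python with only the standard library. The sample text and the to_csv_row helper are hypothetical, and the “database” is just a variable.

```python
import csv
import html
import io

# Hypothetical user input containing characters special to HTML and CSV.
raw = 'Tom & Jerry, "best" <b>friends</b>'

# Escaped on the way in: this is what would sit in the database.
stored = html.escape(raw)


def to_csv_row(fields):
    """Format one CSV row using the csv module's own quoting rules."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue()


# The CSV export now carries HTML entities that mean nothing in a spreadsheet.
csv_from_stored = to_csv_row([stored])

# Exporting the original text lets the CSV writer apply CSV escaping only:
# the field is wrapped in quotes and inner quotes are doubled.
csv_from_raw = to_csv_row([raw])

# Escaping the stored value a second time mangles it further:
# "&amp;" becomes "&amp;amp;".
double_escaped = html.escape(stored)

print(csv_from_stored)
print(csv_from_raw)
print(double_escaped)
```

The same field comes out three different ways, and only the row built from the original text is what a CSV reader expects.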
So escaping data on the way in has two problems. First, you can’t decide ahead of time which medium to escape for. Second, you could lose data that is important to you through double escaping.
Doing it on the way out makes sure you are making the decision at the best possible time in the data’s lifecycle and keeps the system flexible to new requirements for data display.
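As a sketch of what escaping on the way out could look like, assume the text is stored untouched and each output path applies its own rules. The function names (render_html, render_csv, render_json) are my own invention for this example, not part of any framework.

```python
import csv
import html
import io
import json

# Stored exactly as the user typed it (hypothetical sample).
raw = 'He said "hi" & left </body>'


def render_html(text):
    """Escape for HTML at the moment the text goes into a page."""
    return "<p>" + html.escape(text) + "</p>"


def render_csv(text):
    """Let the csv module handle quoting for a one-field row."""
    buf = io.StringIO()
    csv.writer(buf).writerow([text])
    return buf.getvalue()


def render_json(text):
    """Let the json module handle string escaping."""
    return json.dumps({"comment": text})


print(render_html(raw))
print(render_csv(raw))
print(render_json(raw))
```

When the boss asks for yet another format, it becomes one more small function, and nothing already in the database has to change.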
The downside is discipline. Forgetting to escape something can be a problem from a security perspective and from a user experience perspective. As they say, with great power comes great responsibility.
Fortunately, popular frameworks and libraries already include escaping functionality. It just becomes the responsibility of the developers to make sure it actually gets used.
What do you think?
Update: Some people have pointed out the performance cost of escaping on every output. They’re absolutely right! You could keep the original text and also store an encoded copy in a separate field. That way you have the freedom to use the original value or the pre-processed one.
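One way this could look, as a small sketch: a record that keeps the original text and computes the HTML-escaped copy once, when it is created. The Comment class and its field names are hypothetical. In a real system the two values would be two columns in the same table.

```python
import html
from dataclasses import dataclass, field


@dataclass
class Comment:
    # The original text, never modified.
    raw: str
    # Pre-escaped copy for the HTML path, computed once from raw.
    html_cached: str = field(init=False)

    def __post_init__(self):
        self.html_cached = html.escape(self.raw)


comment = Comment("1 < 2 & 3 > 2")
print(comment.raw)          # original, ready for CSV, JSON, or anything else
print(comment.html_cached)  # 1 &lt; 2 &amp; 3 &gt; 2
```

The catch is that the cached copy has to be refreshed whenever the original is edited, so the extra field buys speed at the cost of a little more bookkeeping.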