When do you escape your data?

I recently had an interesting discussion about escaping user-submitted data. By escaping, I mean modifying the data so that any medium-specific special characters or symbols are taken literally and not processed.

Here’s a very common scenario: Your website allows users to submit text which you save in your database and later display for other users to see. It could be comments, descriptions, names, addresses, etc.

You want to make sure that they don’t mess around with your system (security is important, right?), so you escape any markup they include in their text. Escaping makes sure that the </body> tag they put in by “mistake”, or that accidental <script>…</script> tag, won’t fire when the text is displayed inline.
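To make that concrete, here is a minimal sketch of HTML escaping using Python’s standard html module. The comment text is made up for illustration, and a real site would more likely rely on its template engine’s escaping than call this by hand.

```python
import html

# A made-up comment containing markup the user added by "mistake".
comment = 'Nice post!</body><script>alert("hi")</script>'

# html.escape replaces &, <, > and (by default) quote characters with
# HTML entities, so the browser shows the text instead of running it.
escaped = html.escape(comment)

print(escaped)
# Nice post!&lt;/body&gt;&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```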

That brings me to my big question… when should you escape this data?

On the way in or the way out?

Escaping on the way in, meaning before the text is saved in the database, might seem safer because whichever way the data leaves your system (and whichever developer uses it), it will already be escaped. But I think that’s the wrong approach.

I’ve always had the belief that your user-inputted data is sacred and should always be stored in its original form. You never know what you’ll be doing with it later.

Why does that help us here?

It’s because escaping is different for each output medium, and you might not know the medium ahead of time!

Let’s go back to our original example. Let’s say you decided to escape any HTML tags before saving it to the database. Output should be easy now. You just dump the text on your webpage and call it a day.

But a few days later, your boss tells you a new output format should be supported in your system – for example, CSV files.

If you dump that escaped text into the CSV file, it will be escaped for HTML, which is not what we want. Not only that, CSV files require their own escaping. You also risk double escaping when text already escaped for one medium is escaped again for another.
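Here is a rough sketch of the difference, assuming Python’s standard html and csv modules. The sample text and the helper names to_html and to_csv_row are hypothetical, chosen only for this illustration.

```python
import csv
import html
import io

# Raw text exactly as the user typed it (made-up example).
raw = 'Tom & Jerry said "hi", then left'

def to_html(text):
    """Escape for an HTML page: &, <, > and quotes become entities."""
    return html.escape(text)

def to_csv_row(fields):
    """Escape for CSV: the csv module quotes fields and doubles inner quotes."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()

# Escaping on the way out: each medium gets its own rules applied to the raw text.
html_out = to_html(raw)
csv_out = to_csv_row([raw])

# Escaping on the way in: the stored value is already HTML-escaped,
# so the CSV export ends up carrying HTML entities.
csv_of_html_escaped = to_csv_row([html.escape(raw)])

print(html_out)             # Tom &amp; Jerry said &quot;hi&quot;, then left
print(csv_out)              # "Tom & Jerry said ""hi"", then left"
print(csv_of_html_escaped)  # "Tom &amp; Jerry said &quot;hi&quot;, then left"
```

The last line is the problem case: anyone opening that CSV in a spreadsheet sees &amp; and &quot; instead of what the user actually wrote.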

So escaping data on the way in has two problems. First, you can’t know ahead of time which medium to escape for. Second, double escaping can corrupt data that is important to you.
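Double escaping is easy to reproduce. The sketch below assumes the text was HTML-escaped before being stored and is then escaped again by whatever renders the page; the sample string is made up.

```python
import html

raw = "Fish & Chips <3"

stored = html.escape(raw)        # escaped on the way in
rendered = html.escape(stored)   # escaped again on the way out

print(stored)    # Fish &amp; Chips &lt;3
print(rendered)  # Fish &amp;amp; Chips &amp;lt;3

# A browser decodes entities once, so the reader sees the stored,
# entity-laden form rather than the text the user typed.
seen_by_reader = html.unescape(rendered)
print(seen_by_reader)  # Fish &amp; Chips &lt;3
```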

Doing it on the way out makes sure you are making the decision at the best possible time in the data’s lifecycle and keeps the system flexible to new requirements for data display.

The downside is discipline. Forgetting to escape something can be a problem from a security perspective and from a user experience perspective. As they say, with great power comes great responsibility.

Fortunately, popular frameworks and libraries already include escaping functionality. It just becomes the responsibility of the developers to make sure that happens.

What do you think?

Update: Some people have pointed out performance. They’re absolutely right! You could keep the original text and also store an escaped copy in a separate field. That way you have the freedom to use the original value or the pre-processed one.
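One possible shape for that, sketched with Python’s built-in sqlite3 module and an in-memory database. The table name, the column names body_raw and body_html, and the save_comment helper are all assumptions made for this example, not part of any particular framework.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    " id INTEGER PRIMARY KEY,"
    " body_raw TEXT NOT NULL,"    # the user's text, untouched
    " body_html TEXT NOT NULL"    # cached HTML-escaped copy
    ")"
)

def save_comment(conn, text):
    """Store the original text plus a pre-escaped copy for HTML output."""
    cur = conn.execute(
        "INSERT INTO comments (body_raw, body_html) VALUES (?, ?)",
        (text, html.escape(text)),
    )
    conn.commit()
    return cur.lastrowid

comment_id = save_comment(conn, "<b>1 < 2</b>")

body_raw, body_html = conn.execute(
    "SELECT body_raw, body_html FROM comments WHERE id = ?",
    (comment_id,),
).fetchone()

print(body_raw)   # <b>1 < 2</b>
print(body_html)  # &lt;b&gt;1 &lt; 2&lt;/b&gt;
```

The web page reads body_html, while a CSV export or any future format starts from body_raw. The cached copy has to be regenerated whenever the raw text changes.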

9 Responses to When do you escape your data?

  1. Daniel Trebbien says:

    This reminds me of an answer on Stack Overflow: http://stackoverflow.com/questions/3591317/am-i-safe-from-a-mysql-injection/3591455#3591455

    In web apps I have always escaped “on the way out” to use your terminology. I wonder what percentage of web developers escape on the way in.

    • matan says:

      good question. i’d guess some reasons for doing it on the way in:

      1. Security focus. to make sure the users of the data don’t forget. Not the best idea imho, but it’s been done.

      2. Performance. usually this is when they keep the original and keep another escaped copy in order to reduce processing.

      3. Oops. They don’t realize the implications.

  2. […] When do you escape your data? « n0tw0rthy. […]

  3. James says:

    Easy. You don’t escape your data. You hold it in the raw format and escape it on the way out of the database. This way you can change it at a later time to what you need it for. If required you can cache it rendered if you have a performance problem converting it.

    The other thing that should be added is a validation layer. You may want this in the interface but it is normally better to constrain it either in the database or in the data layer. You can unit test for it to insist that it is not breaking your security rules.

  4. Phil Brass says:

    You escape your data whenever it is transmitted across a semantic boundary. It is probably also wise to canonicalize it at this time.

    Let’s say you are, for some reason, still using concatenation to build SQL queries to store the user’s data, so you’re building an insert statement. You will need to escape single-quotes “on the way in”, and ensure that all user-controlled values are quoted. You may also need to escape other characters in the inbound string.

    What you may not think about is that the web app framework has probably already un-escaped this data for you as well. Whenever data crosses a semantic boundary, such as going from “a URL-encoded parameter of an HTTP request” to “a literal string value in an SQL insert statement”, you have to decode it from the source context, and re-encode it into the destination context.

    When you talk about escaping in HTML output, you also have to realize that there are multiple possible semantic contexts in any given HTML page. In some of these contexts, there is no way to safely encode user data (between tags, for example). Other contexts will require other forms of escaping to prevent specially formatted data from breaking out of the containing context and causing mischief. For example, different escaping rules may be necessary if the user-controlled data is being written into a JavaScript string literal, versus an HTML attribute value, versus a text entity.

    • matan says:

      Nice summary Phil. You’re right that when you dig into the details, this is what needs to be done.

      Escaping on the “way in” for security reasons is an important point, and something i don’t really explain in my post. I guess my assumption is for user-generated data that isn’t used to create concatenated SQL queries.

      I like the way you described semantic boundaries. Good stuff.

  5. aaaa says:

    I escape my data on 28th Oct, when my contract ends.

  6. demotivator says:

    demotivator…

    […]When do you escape your data? « n0tw0rthy[…]…
