When do you escape your data?

October 7, 2011

I recently had an interesting discussion regarding escaping user-inputted data. By escaping, I mean modifying the data so that any medium-specific special characters or symbols are taken literally and not processed.

Here’s a very common scenario: Your website allows users to submit text which you save in your database and later display for other users to see. It could be comments, descriptions, names, addresses, etc.

You want to make sure that they don’t mess around with your system (security is important, right?), so you escape any markup they include in their text. Escaping makes sure that the </body> tag they put in by “mistake” or that accidental <script>…</script> tag won’t fire when the text is displayed inline.

That brings me to my big question… when should you escape this data?

On the way in or the way out?

Escaping on the way in, i.e. before the text is saved in the database, might seem safer because you’re sure that whichever way it leaves your system (or whichever developer uses that data), it will be escaped. But I think that’s the wrong approach.

I’ve always had the belief that your user-inputted data is sacred and should always be stored in its original form. You never know what you’ll be doing with it later.

Why does that help us here?

It’s because escaping changes for each output medium and you might not know which medium it is ahead of time!

Let’s go back to our original example. Let’s say you decided to escape any HTML tags before saving it to the database. Output should be easy now. You just dump the text on your webpage and call it a day.

But a few days later, your boss tells you a new output format should be supported in your system – for example, CSV files.

If you dump that escaped text to the CSV file, it will be escaped for HTML, which is not what we want. Not only that, CSV files require their own escaping.  You also risk double escaping when text that was already escaped for one medium gets escaped again for another.

So escaping data on the way in has two problems. First, you can’t decide ahead of time which medium to escape it for. Second, you can corrupt or lose data that is important to you, for example by double escaping.
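As a rough sketch of the difference, here is the same raw text escaped for two media using Python's standard library. The helper names to_html and to_csv_row are made up for illustration, and Python is only a stand-in for whatever language your site is written in.

```python
import csv
import html
import io


def to_html(text):
    # Escape for HTML output: &, <, > and quotes become entities.
    return html.escape(text)


def to_csv_row(fields):
    # Escape for CSV output: fields containing quotes or commas get
    # wrapped in quotes, and embedded quotes are doubled.
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue()


raw = 'Tom & "Jerry" <script>alert(1)</script>'

html_safe = to_html(raw)       # right for a web page
csv_safe = to_csv_row([raw])   # right for a CSV export

# What happens if the HTML-escaped value was stored and escaped again later:
double_escaped = to_html(to_html("a & b"))  # "a &amp;amp; b"
```

The raw value stays untouched, and each output path applies the escaping that its own medium needs.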

Doing it on the way out makes sure you are making the decision at the best possible time in the data’s lifecycle and keeps the system flexible to new requirements for data display.

The downside is discipline. Forgetting to escape something can be a problem from a security perspective and from a user experience perspective. As they say, with great power comes great responsibility.

Fortunately, popular frameworks and libraries already include escaping functionality. It just becomes the responsibility of the developers to make sure that happens.

What do you think?

Update: Some people have pointed out performance.  They’re absolutely right!  You could keep the original text and also store an encoded copy in a separate field.  That way you have the freedom to use the original value or the pre-processed one.


What would you learn if you had a week of paid time to do it?

August 30, 2011

A while ago, I posted some of my thoughts on what I thought was a good job interview question (with quite a bit of constructive feedback). I wanted to share another question that I use which helps in a different way. It’s another open one that goes like this:

If you got a full week of paid time-off to learn some new technical subject or improve an existing one, what would it be (and why)?

In other words, what have you been dying to try out but didn’t have time to do at work (while still getting paid for it)?

I like this question because:

  • It’s time-boxed.  You only have a week, so it focuses on achievable goals in the eyes of the candidate.
  • There is no “spare-time” requirement; it can be done during work hours.
  • It gives some interesting – and sometimes surprising – insight on the candidate.

For example, were they interested in a new tool or language? That’s a great subject to discuss to understand what interests them.

How about improving some existing skill? Deep dive into it to see what they think they are missing and why.

How about writing up a prototype or POC of an idea they had? Ask them why they haven’t done it in their spare time.

Some candidates literally are surprised by the fact that they haven’t really thought about it at all!

In the end, it leads to some great conversation and is surprisingly refreshing for a job interview setting. It can also be a great way for a candidate to stand out from the others in your memory. There is a sinister side to it though…

Candidates sometimes (inadvertently) admit to not knowing some skills you think are important. Oops. Not good for them, but better to know early, right?

In all, this question has been very helpful for me.  Hopefully it will be for you too.

What would you learn if you had a chance?


Graph Databases FTW

August 18, 2011

I always try to slip in some new technology when working on some new idea or project to keep myself from doing the same ‘ol.  Recently, I’ve been working on my own (very many) weekend project.  Because this is a social website, I had to go through the paces of persisting the “social graph”.  Namely, who your friends are and who their friends are (rinse and repeat).

Having chosen MongoDB as my somewhat trusty back-end database, it would have been convenient to shoe-horn that fat social graph into the database.  I mean, blogs and books have been written about how to do this in your typical relational database.   Some are still figuring out the best way.  Some even got a really big shoe-horn.  Even MongoDB, which is a document db but similar enough, has its own solution page for this common task.

Getting it right with one of these solutions isn’t easy, especially when you start to scale and new features demand deeper traversals into the social graph.  Not to mention it is quite the maintenance nightmare.

So from day one, I figured I’d go all out and get myself a shiny graph database to add to my technology stack and see what a difference it would make.  Neo4J seemed like as good a choice as any, so I downloaded it and plugged away.  Luckily, I caught it right in the transition to a REST-based server, which is a much better design for my needs.

The first thing I noticed, having read through all the documentation and their cheat sheet on reproducing IMDB, is that designing a graph database schema is a pretty natural thought process.  I literally grabbed a piece of paper and was planning to rewrite my whole data layer to just use Neo4J! I mean, it has transactions, embedded Lucene for full-text search and indexing, a nifty back-end management tool, and it solves my use-cases.

Having some sanity though, I stopped myself before actually doing it and scaled it back to just dealing with the social graph.  I figured I’d give MongoDB a chance to do what I already wrote it to do – which is persist all the crap I gave it – and I added Neo4J to only contain the social graph information, i.e. user identifiers and their relationships.  Each user document in MongoDB kept the node ID of the Neo4J node – and each node in Neo4J kept the user ID of the user document in MongoDB. That way I could look up any user node in the graph and start any traversals as needed.  And when I found what I was looking for, I could look it up in MongoDB to get all the details. (Side-note: the built-in Lucene in Neo4J is also a great solution for many lookup/search use-cases but I personally didn’t have any use for it since I’m using Solr already).

Honestly, it was a painless process and works pretty damn nicely.  It’s also very fast and flexible now.  Do I want to know who your friends are? No problem.  Do I want to know who your friends’ friends’ friends’ friends are?  I can do that too by changing a function parameter.  And the results will come back pretty much immediately.
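To make the "changing a function parameter" idea concrete, here is a minimal sketch of a depth-limited traversal. It is not the Neo4J API: the graph is a plain in-memory dictionary, and the function name friends_within and the sample users are invented for illustration.

```python
from collections import deque


def friends_within(graph, start, depth):
    """Return every user reachable from `start` in at most `depth` hops."""
    seen = {start}
    found = set()
    frontier = deque([(start, 0)])
    while frontier:
        user, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                found.add(friend)
                frontier.append((friend, dist + 1))
    return found


# Hypothetical social graph: user ID -> list of friend IDs.
social_graph = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
}

direct_friends = friends_within(social_graph, "ann", 1)
friends_of_friends = friends_within(social_graph, "ann", 2)
```

Going one level deeper only means passing a larger depth. The resulting user IDs are what you would then look up in the document store for the full details.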

Overall I think I’ll be using Neo4J – and graph databases in general – for those tasks involving non-trivial relationships.  For everything?  Probably not….but getting it to support paging is a good start.


What have you developed in your spare time?

June 15, 2011

Throughout the years, I’ve interviewed quite a few developers and I’ve recently been reflecting on what single question can give me the most information about the ability and passion someone has for programming.

I’ve concluded that this one gives me the best ammo to work with:

What have you developed in your spare time?

I love this question because it touches a few areas at once.  Spare time is a valuable resource which you usually dedicate to what you enjoy most.  Dedicating that to development is a huge indicator of where your passions lie. On the other hand, getting a “Huh?” or “In my SPARE time?” in return is probably a good indicator that this person isn’t what I’m looking for.

What they’ve been working on is also an interesting indicator.  Did they contribute to an open source project? Port a popular tool to a new language? Build some nifty tool to try out some new tech? Reading into the type of development they did and why can really give some insight into what motivates or challenges them.

It also touches on their ability to stay up to date and be self-taught.  So many new technologies and ideas are being made available that it’s hard to keep up.  It’s almost impossible to do so at your “day job”.

That’s why I like that question and use it as often as I can when interviewing candidates.

What is your favorite one?

Update:

I very much appreciate all the reactions and opinions regarding this specific topic.  In no way am I trying to pigeonhole people one way or the other; this is just my humble opinion based on my personal experiences.  The main point I want to stress is that I like to know where someone’s passions lie. If someone says “After working at my job all day, why should I work more at home?”  I totally agree!  It shouldn’t be considered “work”.

It also can (or should) be an occasional thing.  I try to spend 2-3 hours a week on average working on some idea, or testing out some new technology, or just reading a good book.  That’s hardly overwhelming.

I’d also argue that over time it becomes more important.  When you build a deep body of knowledge and experience in one field, it becomes your prism to view new problems.  Expanding your horizon can create some interesting (and sometimes surprising) ideas for new projects, ideas to solve stubborn problems from the past, or just some personal enjoyment.

The point is to have an itch regarding programming and feel the need to scratch it.


Scale using your users

June 8, 2011

I’ll use an example scenario to show what I mean.

As part of your site functionality, you allow users to create, update, search for, and view “thingies”.  This thingie has an address.

When people look at your thingie (er, that doesn’t really sound right), you want to display a nice dynamic map of the address with a cute little red marker.  You don’t have the coordinates for that address and therefore want to do some geocoding.

Having those coordinates would allow you to support all types of fancy location-based searches to tell if someone is near your thingie.  Even better, your cute little red marker would be in the right place.

So how do you go about doing this?  Well, some possible solutions could be:

1) Send a geocode request when saving/updating your thingie

A good first start, but probably not the best.  This has a few disadvantages, but for the sake of simplicity, let’s say the main one is that you don’t want users waiting for unnecessary synchronous steps to complete.  Factor in network uncertainty, timeouts, etc. and you have a nice set of disasters waiting as your site grows.

You also realize that these geocoding services have limits on daily usage!  Lots of new thingies a day would be great for you, but you’d be tied to your daily geocoding allowance and throttling requirements.  What to do?

So you come up with:

2) Create a queue to batch your geocoding requests

Better idea! When you create new thingies – and assuming you don’t have the addresses cached somewhere – you queue them in your uber geocoding async job.  It chugs away in the background sending nicely throttled requests and stops after reaching the daily limit.  Unfortunately, unless you want to pay lots of cash to up your daily limit, your queue will start getting longer and longer as it can’t keep up with new thingies.  Also, the geocoding might not even be done for thingies that your users are viewing right now.
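A bare-bones sketch of such a queue shows why it falls behind. Everything here is an assumption for illustration: the GeocodeQueue class, the fake_geocode stand-in, and the tiny daily limit are all hypothetical, and the throttling between requests is left out.

```python
from collections import deque


class GeocodeQueue:
    """Background queue that geocodes at most `daily_limit` addresses per run."""

    def __init__(self, geocode, daily_limit):
        self.geocode = geocode          # hypothetical callable: address -> (lat, lng)
        self.daily_limit = daily_limit  # assumed quota imposed by the service
        self.pending = deque()

    def enqueue(self, thingie_id, address):
        self.pending.append((thingie_id, address))

    def run_daily_batch(self):
        """Process the oldest requests until the daily quota is used up."""
        results = {}
        for _ in range(min(self.daily_limit, len(self.pending))):
            thingie_id, address = self.pending.popleft()
            results[thingie_id] = self.geocode(address)
        return results


def fake_geocode(address):
    # Stand-in for a real geocoding call; returns dummy coordinates.
    return (0.0, 0.0)


queue = GeocodeQueue(fake_geocode, daily_limit=2)
for thingie_id in range(5):
    queue.enqueue(thingie_id, f"{thingie_id} Main St")

done_today = queue.run_daily_batch()
backlog = len(queue.pending)
```

With five new thingies and a quota of two, three requests are still waiting after the first run. Whenever more thingies arrive per day than the quota allows, the backlog keeps growing.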

Fortunately, this is a great scenario of how you can take advantage of your users to help you grow.  And you do it by leveraging client-side JavaScript.  Each user using your site is a potential resource which you can use to offload some of your work.  Help your users help you!

So why not have the users geocode for you?

3) Have the first user who views your map geocode it for you

I personally like this solution best because it’s being lazy.  When you have to display your map and that thingie has no coordinates yet, have the user send out a geocoding request from their client machine and use the results to show the map.  More importantly, send another async request afterwards to your server to save that value for future users!  Now you both get what you want at the right time.

Now assuming the geocoding API you’re using allows anonymous requests (like Google), you don’t have to worry about daily limits, throttling issues, and other headaches.

The point of it all

Now this was all a hypothetical scenario to stress a point.  Lots of services today have an HTTP API accessible via JavaScript and this will probably grow and grow.  Align that with the increased horsepower of modern machines and improved browser JavaScript engines and you have an army of resources waiting to be tapped if you can figure out how to do so.  The great part is this resource scales linearly with demand.

The tricky part though is changing your frame of mind when writing features to consider this new option.  Why use server bandwidth and resources to do it when the client can do it just as well?

Have you had any scenarios like this?


Queue like in Vegas

May 31, 2011

I just got back from Las Vegas.  From the second you land there, you can smell the alcohol in the air.  It’s also probably the only place I know where you can gamble while waiting in baggage claim….and people do!

At any rate, we gambled, we lost, we left.  Statistically, that’s about right so no hard feelings.

One thing that I did notice while there was how queues are serviced.  I’m not normally that observant, but it was pretty easy considering that:

a) I just finished a great book on the subject called “The Principles of Product Development Flow” (the book is much better than the title suggests).

b) You spend half your time in Las Vegas standing in lines.

One particular incident involved me standing in line at the MGM Grand Concierge to pick up some show tickets.  This was how the line looked:

Queue Per Server

Each server / concierge (the dots) has its own queue of people.  Is this a good solution?  Of course not!  This is the perfect situation that breeds line-choice-remorse.  It’s that well known phenomenon which occurs after choosing a line –  you soon realize the line next to you is faster while you’re stuck behind a guy that is buying tickets for twenty-five people, a restaurant reservation, maps for all of Nevada, a limousine, some shoe string, an igloo, and a towel.  At least that is what it seemed like at the time.

In short, having a Queue Per Server doesn’t account for variation.  Assuming that each person has the same priority, they join the lines at random intervals, and each person takes a random amount of time to service, this is the perfect recipe for some queues to be blocked up for long periods of time while other queues might be empty.  It also doesn’t serve people in the order they joined.

Luckily, most of the other lines in Las Vegas were like this:

Single Queue, Multiple Servers

While these lines might look longer, they are actually more efficient.  These queues have one line which includes all the people in the order joined, and multiple servers each taking a person from the front of the queue.  Making the same assumptions as the previous example, this method is actually more resistant to variation.  If one person takes a much longer time, only one server will be blocked and the queue will continue to flow.  In addition, it’s better for your mental health.
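The difference between the two layouts can be sketched with a tiny deterministic model. The assumptions are mine, purely for illustration: everyone arrives at time 0, in the Queue Per Server case people join the lines in round-robin order, and the function names and service times are invented.

```python
import heapq


def waits_queue_per_server(service_times, servers):
    """Each person joins line (i % servers) and waits for everyone ahead in it."""
    free_at = [0] * servers
    waits = []
    for i, service in enumerate(service_times):
        line = i % servers
        waits.append(free_at[line])
        free_at[line] += service
    return waits


def waits_single_queue(service_times, servers):
    """One shared line; the next person goes to whichever server frees up first."""
    free_at = [0] * servers
    heapq.heapify(free_at)
    waits = []
    for service in service_times:
        start = heapq.heappop(free_at)
        waits.append(start)
        heapq.heappush(free_at, start + service)
    return waits


# One slow customer (20 minutes) followed by five quick ones, two servers.
service_times = [20, 1, 1, 1, 1, 1]

per_server_total = sum(waits_queue_per_server(service_times, 2))  # 44
single_queue_total = sum(waits_single_queue(service_times, 2))    # 10
```

With a queue per server, the two people stuck behind the slow customer wait 20 and 21 minutes. With a single queue, the slow customer blocks only one server and everyone else flows through the other one.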

The perfect example of this is the taxi line at the airport.  The line was literally hundreds of people long, and if you didn’t know better you’d think it was going to take hours to get a taxi to your hotel.  In reality it took 10 minutes.  We were literally walking non-stop in the line.  They had a system where 15 taxis would drive into predetermined spots and drive off with the soon-to-be poor vacationers.  Pure awesomeness.

These decisions happen in software development too, especially in multi-threaded environments and Message Queues.  I wonder where they got the idea from….


ORM’s hidden cost

May 28, 2011

I think my first post will start with some venting so bear with me.

ORM solutions have really made a huge impact on software development with relational databases.  It’s pretty much expected from any modern object oriented language to have a built-in ORM solution and/or multiple 3rd party implementations of one.

You could even write a pretty impressive list of benefits of ORMs like productivity, avoidance of vendor lock-in (except DB specific ORMs), and maintenance.  But in spite of all this, I do find that ORMs have a very real and prevalent hidden cost:

ORMs set the bar too low for who gets to write data access code

You might have seen this happen, or you may have been that person once, but time and again I’ve seen how the lack of understanding of what’s happening “behind the scenes” causes very painful performance, stability, and design issues: N+1 query problems, fetching too much data, and ignoring batch updates, to name just a few common ones.

ORMs take the discipline of fine-tuned and precise database interaction and take a huge warm dump on it.

Now don’t be mistaken, the elite can and will use ORMs in the best possible way and in all the right places, but let’s be honest about the majority.  I’ve seen the blank stare with eyes glazed when I asked a developer why they wrote 3 nested loops to find a specific value of the child of the child of a mapped ORM entity instead of just writing a specific query for it. Or why they retrieved and iterated over all 5 million items to set the same value of the same field instead of writing an update statement.
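To illustrate that last case, here is a small sketch using Python's built-in sqlite3 module instead of any particular ORM. The items table, its columns, and both function names are made up, and the row-by-row loop only imitates what traversing mapped entities and calling setters ends up doing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO items (status) VALUES (?)", [("new",)] * 1000)


def update_row_by_row(conn, status):
    """Load every row and write each one back: one statement per item."""
    rows = conn.execute("SELECT id FROM items").fetchall()
    for (item_id,) in rows:
        conn.execute(
            "UPDATE items SET status = ? WHERE id = ?", (status, item_id)
        )
    return len(rows)  # number of UPDATE statements issued


def update_in_one_statement(conn, status):
    """Let the database do the work with a single UPDATE."""
    cursor = conn.execute("UPDATE items SET status = ?", (status,))
    return cursor.rowcount  # number of rows changed by that one statement


statements_issued = update_row_by_row(conn, "archived")
rows_changed = update_in_one_statement(conn, "published")
```

Both functions leave the table in the same kind of state, but the first one issues a thousand and one statements where the second issues one. At 5 million rows that gap is what brings everything crashing down.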

Somehow people assume that writing specific queries for specific cases when using an ORM is an admission of defeat when they can easily just traverse the object graph and call their setters and getters.  It all comes crashing down though when you have enough data.

In the end using ORMs properly requires even more knowledge and ability and it should be entrusted to those people who can get it right.

Those of you who have passed this hurdle should congratulate yourselves. There aren’t many of you.