One of the things that we changed a while back that has made our lives easier was to tag every visit to the site with a session ID. Basically, if the session cookie doesn’t exist, one is created, and the cookie’s expiration time is reset with each page load, unless it times out (30 minutes for our sites).
The data stored is just the IP of the client and the UNIX time stamp of the cookie’s creation.
Then we append this to each line of the HTTP server log as an extra field.
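A minimal sketch of the mechanism in Python, assuming a framework-agnostic request handler; the function and cookie names here are hypothetical stand-ins, not our actual code:

```python
import time

SESSION_TIMEOUT = 30 * 60  # 30 minutes, as on our sites

def get_or_create_session(cookies, client_ip, now=None):
    """Return this request's session ID, minting one if the cookie is absent.

    The ID is just the client's IP plus the UNIX timestamp of creation,
    so two sessions from the same IP collide only if they start in the
    same second.
    """
    now = int(time.time()) if now is None else now
    sid = cookies.get("session_id")
    if sid is None:
        sid = "%s-%d" % (client_ip, now)
    # Hand the cookie back with a fresh max_age: this models resetting
    # the expiration on every page load, so the session dies only after
    # 30 idle minutes.
    return sid, {"session_id": sid, "max_age": SESSION_TIMEOUT}
```

The returned ID is the value that gets appended to each line of the HTTP server log as the extra field.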
Doing so makes it very straightforward to pull out individual sessions for analysis. It also gets around some of the problems of IP overlap behind routers: if you are trying to look at classroom usage of a resource and all the traffic is logged as coming from the same IP, it’s not really possible to tell how many concurrent sessions there are, nor what the traffic pattern is. Because the session ID is stored by each client, the sessions keep separate IDs unless (in this case) two of them begin at the same time.
I overheard a brief snippet of conversation this morning, talking about usage of a particular feature on an education-based web site.
They said that they assumed most of the traffic to said feature was classroom-based.
This reminded me that this is a problem for which I have never found a usable answer:
How can we separate “education” traffic from the overall traffic noise?
It occurred to me that if a large percentage of K-12 schools in the US used the *.k12.(state).us domain for their servers, then looking for those domain names in the logs would give some assessment of in-class usage. It wouldn’t trap things like teachers working from home, etc., but that might be OK if we could work under the assumption that detecting *.k12.(state).us is a likely indicator of “in class” usage.
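Assuming the log’s client field holds a resolved hostname rather than a raw IP (which depends on whether your server does reverse DNS lookups), the filter reduces to a regular expression; a minimal sketch in Python:

```python
import re

# Matches hostnames under *.k12.(state).us, where (state) is a
# two-letter state code, e.g. host.district.k12.il.us.
K12_PATTERN = re.compile(r"\.k12\.[a-z]{2}\.us$", re.IGNORECASE)

def is_k12_host(hostname):
    """True if the client hostname sits under a *.k12.(state).us domain."""
    return bool(K12_PATTERN.search(hostname.strip()))
```

Counting the sessions whose hostnames pass this filter would give a lower bound on likely in-class usage.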
Problem is, I can’t (quickly) ascertain to what extent the schools ARE on *.k12.(state).us versus some other domain name. One co-worker thought that there were actually FEWER schools on it than five years ago. It seems to me that, were I in the state-level office that has to keep track of school domains, I would greatly prefer they register under a common (and somewhat protected) domain space even if the actual network providers were a sundry lot.
Can anyone out there shed any light on this? Even getting a series of fuzzy stats that could be strung together to make a fudge factor on observed traffic would be better than “we’re just assuming that MOST is K-12” without actually looking at the logs…
It’s very interesting, but at first glance I think it still relies on a lot of assumptions about web site topologies that are geared more to commercial sites. I need to give it a more thorough read…
“Bounce” refers to a web visit of one page, i.e., the visitor “bounces” in from somewhere and leaves immediately. The “bounce rate” is the percentage of visits that only go to one page.
On commercial sites, the expected “conclusion” of a visit is usually a sale. So, a visit consisting of a single page generally can’t really result in a sale (unless you went RIGHT to the check-out page and handed them $$$ to buy nothing). So, typically, single-page visits are considered “failures” because there’s nothing to show for it in terms of the bottom line.
Of course SOME commercial bounces might be successes in disguise: e.g., the visitor goes to a commercial site to research something from a Google search, finds the information they want on the first page, and then heads out to the brick and mortar store to purchase the item. Some retailers are even trying to discern when this happens through surveys.
BUT for the non-commercial site, the circumstance of bounce visits is far more murky. A visitor could only view one page and then leave completely happy with all of their goals achieved. Maybe they’ve even bookmarked the page for future reference – you won’t know from that little log file entry.
How can you clarify the situation? Here are some of the things I use to filter my set of single-page visits. There’s no guarantee that my underlying assumptions are correct all of the time, but it provides some sense of categorization. At the very least, we can separate the different categories into “Success”, “Likely Success”, “Could Go Either Way”, “Likely Failure”, and “Failure”.
- The referrer is key: if it’s a bookmark, then there’s a reasonably good chance that the “bounce” visit isn’t a failure. Depending on the page content (e.g., if it’s a page with explicit information, such as a “Contact Info” page, or a page listing your hours, etc.) you might be able to distinguish between the Successes and the “Could Go Either Way” situations.
- If the referrer is from a known source with established ties to your site (or at least you can guess WHY they linked to your site), I would count that as a likely success – again depending on the link at the other site’s end: if it’s promoting particular content, then it can almost certainly be tagged as a success.
- If the referrer is a search engine, things are murkier. Consider all the times that you do a search, try a possible solution and realize upon arrival that it’s a dead-end, and hit the back button… OR the times you struck it rich and got what you wanted (possibly after some number of previous failed visits elsewhere).
- If the referrer is unknown to you and you haven’t checked its context, then it could go either way.
Depending on how pessimistic you want to be, you can get a “refined” bounce rate by choosing which categories you want to interpret as failures, remembering that the conventional wisdom (from the commercial world) is that ALL single-page visits are failures.
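As a sketch of that bookkeeping: the labels are the categories from the list above, and which of them count as failures is your (pessimism-dependent) choice; the default chosen here is just one possible reading:

```python
# Which categories count as failures is a judgment call; this default
# treats only the two most pessimistic labels as failures.
FAILURE_CATEGORIES = {"Likely Failure", "Failure"}

def refined_bounce_rate(bounce_categories, total_visits,
                        failure_cats=FAILURE_CATEGORIES):
    """Bounce rate counting only the chosen categories as failures.

    bounce_categories: one category label per single-page visit.
    total_visits: all visits to the site, single-page or not.
    """
    failures = sum(1 for cat in bounce_categories if cat in failure_cats)
    return failures / total_visits
```

Passing `failure_cats={"Could Go Either Way", "Likely Failure", "Failure"}` instead recovers a more conservative rate, and counting every single-page visit reproduces the commercial convention.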
It’s also helpful to look at the metadata from the referring URL where search engines are involved. Even non-commercial sites might want to invest in Google Ads or enhancing their ranking in search engines. Taking a look at what search terms were used that resulted in a visit (even if they’re off-base) is enlightening!
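For referrers that carry the search in their query string, the terms can be pulled out with the standard library. Treating any “q” parameter as the search phrase is an assumption: it fits Google, but other engines use other parameter names.

```python
from urllib.parse import urlparse, parse_qs

def search_terms(referrer):
    """Pull the search phrase out of a search-engine referrer URL, if any.

    Assumes the engine puts the phrase in a "q" query parameter; returns
    None when no such parameter is present.
    """
    params = parse_qs(urlparse(referrer).query)
    terms = params.get("q")
    return terms[0] if terms else None
```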
Caught this posting on the “Occam’s Razor” blog.
(Too) Many of the metrics requests I receive end up being one of the “standard” requests, peppered (maybe) with some custom filtering.
Almost all of them can be re-phrased as questions that start “How many…?”
- How many (hits, views, users, etc.) did we get last (week, month, year)?
- How many times did (insert subset of pages) get visited?
The others tend to be in terms of creating “Top 10” lists:
- What’s our most popular (page type)?
All of these can be calculated of course, and there are lots of products out there that will handle those tabulations for you so that the effort is reduced to filling out a short form on a page and copying/pasting the results. So the challenge isn’t getting the pertinent information to the interested party. Instead, I find myself compelled to ask “Why?” More often than not, the answer is that there’s a report being crafted and the authors wanted “some numbers”…
OK – but – WHY? What message are you trying to convey? Are these stats really the best information that supports your message?
“Well, these are the numbers that EVERYONE reports (and therefore expects).”
Yes. That’s very true. But how are you putting these numbers in perspective so that comparisons of site A with sites B, C, D… Z tell you something? So, the challenge seems to be:
- trying to “second guess” what the REAL request is;
- getting the data and presenting it in a useful context;
- educating people to understand that yes – while having the “popular” metrics is nice, there’s almost always a better way to “paint the picture” or “tell the story”.
So – all these “expected” stats have uses, but:
- whereas in real estate it’s “location, location, location!”, for web metrics it’s “context, context, context!”
- in many circumstances it’s not the value that’s important, it’s the difference between that value and some expectation of it:
- predictions that were made have (not) been met;
- outlying behavior indicates activity that might have been monitored more closely (or should be in the future).
- while these numbers are certainly obtainable for all sites (commercial and non-commercial) with the same methods, their interpretation isn’t always the same.
What do I mean by the last point? Well, if you have a commercial site (i.e., you’re selling things online so that the point of sale happens on the site) then a one-page visit is almost always a “failure” in the sense that the sale didn’t get made: aside from a donation system where you truly can complete the transaction with a single click (from an external site), the expectation is that the user will have to visit at least one other page to select an item to purchase. Non-commercial sites and especially education sites frequently have visits where the user has used a search engine on a specific phrase, and “struck gold” going to the site where the information they sought was immediately available. Objective achieved, and they move on.
So there are benefits:
- quick and easy retrieval and establishment of a time series of data (so you can watch the number of visitors increase weekly/monthly/yearly);
- comparison with stats from other sites, especially those that are most similar to yours;
- you’re using terms that other metrics aficionados understand;
but there are also downsides:
- you’re not making the most of your data, because you’re crafting a “story” defined by someone else rather than concentrating on what makes your site special;
- it takes more resources to do deeper mining, and spending all of your metrics budget on popular stats might not be the best investment;
- you could be discouraged even though your site is actually a smashing success because the numbers just don’t seem “impressive enough”.
Let’s spend a few posts with each of the “favorite” stats, point out their strengths, uses, etc. but for the most part rip them to shreds. ☺
… Call it a statistical catharsis.
Outside of many McDonald’s, a sign tells you the number they’ve served is in excess of 99 billion. If that refers to customers, it must include repeat visits, unless they’ve used a time machine to reach most of their clientele who are primarily space aliens. And while the number sounds impressive, it doesn’t really say much in terms of impact. Or, it might refer to burgers, but then it doesn’t say who the recipients are (or if they’re all human…). Or it might refer to the number of times a transaction occurred, but then that might or might not include orders that don’t include burgers… In short, I don’t know what it means, but it certainly sounds like a huge number, and I suspect that that’s truly the only message – left entirely to the interpretation of the viewer of the sign, with the expectation that there are very few things with that many zeroes to use for comparison.
Frequently we fall into the same trap with respect to metrics. We WANT to be seen as important, or at least respected, having achieved our goals that involved designing, building, launching, and maintaining our web sites – it’s a lot of work!
So we start looking for impressive-sounding numbers – things that are not subject to fluctuations. Cumulative numbers always increase, and they’re low-hanging fruit.
In a way they’re like the odometer in your car, which is handy. But if you think about it, there’s not that much joy in watching those miles tally up. Really, there are typically two things that you use that meter for:
- working out distances between way points on trips;
- knowing when to do maintenance.
Why is it then that we’re trying to make our web stats into more?
Plots of cumulative stats (page views, visitors, etc.) fall into this trap: because they’re cumulative they never go down, and you get that familiar “hockey stick” shape. (Have you seen any lately that made you go “WOW!”?) But going back to our odometer metaphor, the two aforementioned uses do have their metrics counterparts:
First, we do have way points we measure from: monthly stats, yearly stats, etc., and measuring those offsets provides us with an indication of growth (among other things). OK – that’s fairly obvious. But the web site “trip odometer” has an interesting twist in other (shorter-term) ways. What about the comparative increase in hits of part of the site when new content is added and looking for the time frame that things “go viral”? I can see this being important for sites with RSS feeds: you want to know if the content being served is rapidly perused and the comparative popularity of each article (or articles on the same topic).
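The waypoint arithmetic is simply first differences of the cumulative series; a minimal sketch:

```python
def period_deltas(cumulative):
    """Turn a cumulative series (e.g., total visits at each month's end)
    into per-period growth: the 'trip odometer' readings."""
    return [later - earlier
            for earlier, later in zip(cumulative, cumulative[1:])]
```

A sudden jump in these deltas right after new content goes up is exactly the sort of short-term “going viral” signal worth watching for.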
As for the second – preventive maintenance – while I’ve never seen a site development strategy put in the context of the “5,000,000 visit checkup” it sort of makes sense, and it would be interesting to see what sort of effect it would have if we chose to revisit design, UI, and even regular maintenance checks (dead links are a great example) based upon milestones of visits or page views.
Forthcoming: why distributions are so cool, and when they are/aren’t “normal”.
A few years ago I was asked to write a series of articles on web metrics for the National Science Digital Library (NSDL).
- Using Web Metrics to Estimate Impact: I – The Lawlessness of Averages
- Using Web Metrics to Estimate Impact: II – When Counting Doesn’t “Count”
- Using Web Metrics to Estimate Impact: III – Growing Pains?
- Using Web Metrics to Estimate Impact: IV – The “Path” to Understanding Users
I’ll be revisiting all of the topics in these papers (especially path analysis) over the coming months.
What’s interesting (to me at least) is that most of the issues/challenges/insights haven’t changed much in this time:
- Usage doesn’t often follow a “simple” pattern: time on site isn’t monolithic. For most information-driven sites, there’s a distinct bimodal distribution of very short visits and longer visits. One thing that has occurred to me (and something I’ve been meaning to follow up on) is to see if there’s a characteristic change in behavior with repeat visits by the same user (insofar as I can determine that it’s probably the same user) – chiefly the behavioral shift from discovering what the site has to offer and “getting your feet wet” in terms of the navigation and user interface, to more frequently jumping to specific places on the site for quick information retrieval. For classroom web usage, that behavior might manifest itself in searching for materials on a specific topic, bookmarking it, and then using those materials in class (or in an assignment) later on. (Another interesting experiment would be to see if specific content is accessed from the same rough location within a short span of time – this would’ve been harder to do in the past, but geolocation services are readily available – so add another “project” to the to-do list!)
- Path analysis is still tough to “crack” – even if thousands of visitors are coming to the site with the same mindset, they aren’t hitting the same content and they don’t get to that content in the same way (and that’s also influenced by how the site’s information is laid out, the UI and navigation, etc.). I’ve made some headway on this, and I’m still doing tests when I have the time…
- Even though metrics has been a hot topic for several years now, there’s still an over-reliance on both scalar data (i.e., single values, esp. averages from which we’re somehow supposed to infer deep meaning) and “Top 10” lists (which can identify “hot” topics), typically without the follow-up demographic or time-series analysis to discover overall usage patterns.
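The bimodal time-on-site pattern from the first point above can be made visible by splitting visit durations at a cutoff; the 60-second threshold here is an illustrative assumption, not a measured mode boundary:

```python
SHORT_CUTOFF = 60  # seconds; illustrative assumption, tune per site

def split_bimodal(durations, cutoff=SHORT_CUTOFF):
    """Split visit durations (in seconds) into short and long modes."""
    short = [d for d in durations if d < cutoff]
    longer = [d for d in durations if d >= cutoff]
    return short, longer
```

Histogramming the two lists separately is a quick check on whether the short-visit mode really is distinct on your site.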
In other ways, things have changed. For example, there’s less “terror” involved regarding registration (thanks to social networking sites), although the underlying issues of privacy and attempts to otherwise compromise one’s security have grown, not lessened. Putting those aside for the moment (because our intentions are completely noble), registration gives us FAR more information to work with. Even so, other developments (such as geolocation services) mean that the “anonymous user” isn’t quite as anonymous as in the past.
To quote Spiderman: With great power comes great responsibility. …
So, you’ve got some numbers – say monthly visits. If your site is just starting out, you’re probably relying on viral exposure to increase traffic: i.e., if you build it, they will come. Or perhaps you’ve done some degree of marketing, even if it’s only getting your site listed somewhere or noticed by search engines.
If you’ve been around for a while, you might have a regular client base (in that you know that some of your visitors are repeat visitors) though you’re also relying on word-of-mouth from those satisfied users to drive traffic.
So, what does something like “monthly visits” MEAN?