A Comprehensive Strategy for Using Web Site Statistics
Track The Effectiveness Of Your Marketing Effort
Part 2
By Carlton Lovegrove
Originally Published: March 15, 2005
Continued From:
<<< A Comprehensive Strategy for Using Web Site Statistics Part 1
Log file data
URLs and Referrer Fields
While these fields are useful to analyze and provide reasonable characterizations, several limitations make path reconstruction difficult. The URL recorded is the URL as requested by the user, not the location of the file returned by the server. This behavior can cause false tabulation for pages when the requested page contains relative hyperlinks, symbolic links, and/or hard-coded expansion/translation rules; for example, directories do not always translate to index.html. It can also lead to two paths being considered different when in actuality they contain the same content. While both pieces of information are useful, the canonical file system-based URL returned by the server would arguably be more useful, as it removes the ambiguity about which resource was returned to the user.
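As a rough illustration of the ambiguity described above, the following Python sketch maps requested URLs onto an approximate canonical path before tabulation. The function name canonical_path and the assumption that directories translate to index.html are hypothetical; a real server may use different translation rules, and symbolic links cannot be resolved from the log alone.

```python
import posixpath
from urllib.parse import urlsplit

# Assumption for illustration: this server serves index.html for a directory request.
DIRECTORY_INDEX = "index.html"

def canonical_path(requested_url: str) -> str:
    """Return an approximate canonical path for a URL as it appears in the log."""
    path = urlsplit(requested_url).path or "/"
    is_directory = path.endswith("/")
    # normpath collapses "/./" and "/../" segments and drops a trailing slash.
    normalized = posixpath.normpath(path)
    if is_directory:
        normalized = posixpath.join(normalized, DIRECTORY_INDEX)
    return normalized

# Two differently written requests that tabulate as the same page.
print(canonical_path("/products/"))
print(canonical_path("/a/../products/./index.html"))
```

A request for "/products" without the trailing slash is left unchanged here, because the log line alone does not say whether it names a file or a directory.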
The content of the referrer field also varies widely. Some browsers and proxies do not send this information to the server, for privacy and other reasons. The value of the referrer field is also undefined when the user requests a page by typing in the URL, selects a page from their Favorites/Bookmarks list, or uses other navigational aids such as the history list. Furthermore, several browsers provide conflicting values for the referrer field. To illustrate, suppose a user selects a listing for the Dell Corporation on Yahoo. In requesting the Dell splash page, the browser provides the URL of the Yahoo page as the value of the referrer field. Now suppose the user clicks through to the Products page, returns to the Dell splash page, and reloads it. In several popular browsers, the second request for the Dell splash page again carries the Yahoo URL as its referrer, although the last page viewed on the user's surfing path was the Products page on the Dell site. If one reconstructs paths by relying upon the referrer field, two user paths may be identified instead of one. Given these limitations, strong reliance upon the information in the referrer field may be more problematic than one would initially expect.
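The splash page scenario can be reproduced with a small Python sketch. It assumes a deliberately naive rule, namely that any request with an external or missing referrer starts a new path. The function name and the hosts dell.example and yahoo.example are placeholders for illustration, not real log data.

```python
from urllib.parse import urlsplit

def split_paths_by_referrer(requests, site_host):
    """Group (url, referrer) pairs into paths using only the referrer field.

    Naive rule (an assumption for this sketch): a request whose referrer is
    missing or points outside site_host starts a new path.
    """
    paths = []
    for url, referrer in requests:
        internal = bool(referrer) and urlsplit(referrer).netloc == site_host
        if internal and paths:
            paths[-1].append(url)
        else:
            paths.append([url])
    return paths

# One user: search listing -> splash page -> products -> back to splash, reload.
requests = [
    ("/", "http://yahoo.example/listing"),
    ("/products", "http://dell.example/"),
    ("/", "http://yahoo.example/listing"),  # reload resends the original referrer
]
print(split_paths_by_referrer(requests, "dell.example"))
# prints [['/', '/products'], ['/']]
```

The single visit is split into two paths, which is the miscount the example above warns about.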
User Agent Fields
The user agent field also suffers from imprecise semantics, differing implementations, and missing data. This can partially be attributed to browser vendors' use of the field for content negotiation. Because the rendering of HTML differs from browser to browser, servers can alter the HTML they return based upon which browser is on the other end. Consequently, the user agent field may contain the names of multiple browsers. Some proxies also append information to this field. In addition, the value of the user agent field can vary across requests made by the same user using the same Web browser. Adding to the confusion, there is no standardized way to determine whether requests are made by autonomous agents (e.g., robots), semi-autonomous agents acting on behalf of users (e.g., tools that copy a set of pages for off-line reading), or humans following hyperlinks in real time. It is important to distinguish these classes of requests when attempting to model surfing behaviors.
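Because no standard exists, any classification of user agents is a heuristic. The Python sketch below shows one possible substring-based approach. The hint lists, the category labels, and the function name are assumptions made for illustration; the rule will misclassify agents that do not announce themselves.

```python
# Hypothetical substring hints; real log analysis needs a maintained list.
ROBOT_HINTS = ("bot", "crawler", "spider")
OFFLINE_HINTS = ("wget", "httrack", "offline")

def classify_agent(user_agent):
    """Guess the class of client from the user agent string alone."""
    if not user_agent or user_agent == "-":
        return "unknown"  # field missing or logged as "-"
    ua = user_agent.lower()
    if any(hint in ua for hint in ROBOT_HINTS):
        return "robot"
    if any(hint in ua for hint in OFFLINE_HINTS):
        return "offline-copier"
    # Everything else is presumed to be a person browsing, which may be wrong.
    return "presumed-browser"

for agent in (
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "Wget/1.9.1",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "-",
):
    print(classify_agent(agent), "<-", agent)
```

The third sample string names both Mozilla and MSIE, which illustrates why matching on a single browser name is unreliable.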
Interpreting Cookies
Although cookies were initially implemented to facilitate shopping carts, a common use of cookies is to uniquely identify users within a web site. Cookies work in the following manner. When a person visits a cookie-enabled web site, the server replies with the content and a unique identifier called a cookie, which the browser stores on the user's machine. On subsequent requests to the same web site, the browser includes the value of the cookie with each request. Because the identifier is unique, all requests carrying the same cookie are known to come from the same browser. Since multiple people may use the same browser, a cookie may not actually represent a single user, but most web sites are willing to accept this limitation and treat each cookie as a single user. Recently, browser vendors have provided users with controls to select the cookie policy that matches their privacy preferences. These controls let users choose how much they are told when visiting a server that issues cookies, and also let them prevent their browser from sending cookies at all. Consequently, unless a site requires people to accept cookies to receive content, the cookie field may be null, which leaves the task of identifying user paths to the other recorded fields.
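The exchange described above can be sketched in Python with the standard library's http.cookies module. The cookie name visitor_id and the function name are hypothetical, and generating the identifier with uuid4 is one choice among many; a production site would also set attributes such as an expiry date.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name for this sketch

def visitor_id_for(request_cookie_header):
    """Return (visitor_id, set_cookie_header).

    set_cookie_header is None when the browser already sent an identifier,
    otherwise it is the header the server should add to its response.
    """
    cookie = SimpleCookie()
    if request_cookie_header:
        cookie.load(request_cookie_header)
    if COOKIE_NAME in cookie:
        return cookie[COOKIE_NAME].value, None
    # First visit (or cookies disabled): issue a fresh unique identifier.
    new_id = uuid.uuid4().hex
    response = SimpleCookie()
    response[COOKIE_NAME] = new_id
    return new_id, response.output(header="Set-Cookie:")

first_id, header = visitor_id_for(None)           # first request, no cookie
print(header)
again_id, header = visitor_id_for(f"{COOKIE_NAME}={first_id}")
print(again_id == first_id, header)               # same browser recognized
```

A browser that refuses cookies would receive a new identifier on every request, which is why the cookie field alone cannot identify such users.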
Given the limitations of the information recorded in Web access logs, it is not surprising that sites require users to adhere to cookies and defeat caching to gain more accurate usage information. Still, numerous sites either do not use cookies or do not require users to accept a cookie to gain access to content. In these cases, determining unique users and their paths through a web site is typically done heuristically.
Even when cookies are used, several scenarios are possible when a previously encountered cookie is processed. If the request comes from the same host, regardless of the user agent, it is treated as being issued by the same user, because a unique cookie is issued to only one browser. If the user agent field remains the same but the host changes, it is still the same user and some form of IP/domain name change is occurring. This often occurs with users behind firewalls and with ISPs that load-balance proxies. However, if the same cookie arrives with a different user agent, an error has most likely occurred, as cookies are not shared across browsers. If no cookie is present, the site statistics software can resort to using IP addresses. If the request comes from a known host, it could be from either a new user or the same user; otherwise, the request is from a different user. It is important to point out that these latter two cases could also be issued by crawling software that does not support cookies.
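These rules can be written down as a small decision procedure. The Python sketch below applies them in the order the paragraph gives them, so the same-host rule takes precedence over the differing-agent rule; that ordering, the function names, and the result labels are assumptions of this sketch.

```python
def classify_known_cookie(prev_host, prev_agent, host, agent):
    """Interpret a request carrying a cookie that has been seen before."""
    if host == prev_host:
        return "same user"                      # agent is ignored in this case
    if agent == prev_agent:
        return "same user, host changed"        # e.g. load-balanced proxies
    return "probable error"                     # cookies are not shared across browsers

def classify_no_cookie(host, known_hosts):
    """Fall back to the IP address or host name when no cookie is present."""
    if host in known_hosts:
        return "ambiguous: new or same user"
    return "different user"

print(classify_known_cookie("a.example", "UA-1", "a.example", "UA-2"))
print(classify_known_cookie("a.example", "UA-1", "b.example", "UA-1"))
print(classify_known_cookie("a.example", "UA-1", "b.example", "UA-2"))
print(classify_no_cookie("c.example", {"a.example", "b.example"}))
```

Either no-cookie outcome may also be a crawler that ignores cookies, so the labels are best treated as hints rather than firm counts.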
An interesting set of scenarios arises when a new cookie is encountered. If the request is from a host that has already been processed, the previous value of the cookie was null, and the user agent is the same, it is fair to conclude that the request is from a new user who received their first cookie from the server in response to the previous request. If the client is not using cookie obfuscation software, one would expect all following requests from this user to contain the same cookie. However, if the previous request from the same host and agent carried a different cookie, the request could come from the same user obfuscating cookies, or from a new user at the same ISP using the same browser version and platform as the user of the previous request. Barring other supporting evidence, such as the referrer field or the site's topology, it is difficult to determine which scenario is correct. If a new cookie arrives from the same host but with a user agent different from the previous request, it is fair to assume that a new user has entered the site. Of course, a new cookie from a new host, regardless of the agent, is a new user.
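The new-cookie cases can be summarized in the same way. In this Python sketch, last_seen is a hypothetical mapping from each host to the agent and cookie of its most recent request; the names and result labels are assumptions for illustration only.

```python
from typing import Dict, NamedTuple, Optional

class Previous(NamedTuple):
    """Agent and cookie recorded for the most recent request from a host."""
    agent: str
    cookie: Optional[str]  # None when that request carried no cookie

def classify_new_cookie(host: str, agent: str, last_seen: Dict[str, Previous]) -> str:
    """Interpret a request whose cookie value has not been seen before."""
    prev = last_seen.get(host)
    if prev is None:
        return "new user"                        # new cookie from a new host
    if prev.agent != agent:
        return "new user"                        # same host, different agent
    if prev.cookie is None:
        return "new user, first cookie issued"   # cookie set by the previous reply
    return "ambiguous"                           # obfuscation, or another user at the same ISP

last_seen = {
    "a.example": Previous(agent="UA-1", cookie=None),
    "b.example": Previous(agent="UA-1", cookie="abc"),
}
print(classify_new_cookie("a.example", "UA-1", last_seen))
print(classify_new_cookie("b.example", "UA-1", last_seen))
print(classify_new_cookie("b.example", "UA-2", last_seen))
print(classify_new_cookie("c.example", "UA-1", last_seen))
```

Resolving the ambiguous case would need extra evidence, such as the referrer field or the site's link structure, which this sketch does not consult.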
Continued:
A Comprehensive Strategy for Using Web Site Statistics Part 3 >>>
Carlton Lovegrove holds a PhD in Information Systems.