Screen scraping is any automated process for extracting content from a website for use in another context. It is accomplished using programs called robots, web crawlers, spiders, or simply bots.
Search engines also deploy bots to crawl the web, but most retrieve primarily static content and visit a given website somewhat infrequently, placing very little burden on the sites visited. The bots that typically cause trouble seek out dynamic content in response to user inputs and visit the targeted site repeatedly, sometimes contributing significantly to the visited site’s overall traffic.
Recent litigation between two high-profile companies, Oracle and SAP, is a reminder to consider a few general rules before deploying a bot.
Rule #1: Don’t Be a Spammer
People, including judges, hate spam. If you use a bot to collect email addresses or other contact information and send unsolicited information, you are likely to anger someone enough to cause them to sue. Chances are you will lose the resulting case.
In one of the earliest web scraping cases, America Online Inc. v. LCGM, Inc., 46 F. Supp. 2d 444 (E.D. Va. 1998), LCGM used a bot to collect names of AOL customers and sent spam to those customers. The court found that LCGM interfered with AOL’s computer system by occupying its capacity and that sending spam may have injured AOL’s business goodwill.
Rule #2: Think Twice Before Scraping Your Competitors
In eBay v. Bidder’s Edge, 100 F. Supp. 2d 1058 (N.D. Cal. 2000), Bidder’s Edge used a bot to pull data from eBay’s website in order to aggregate information across numerous online auction sites. The bot visited eBay’s site about 100,000 times per day without authorization and in violation of eBay’s stated policies. Damage was established in the form of lost server capacity.
In Oyster Software, Inc. v. Forms Processing, Inc., C-00-0724 JCS, 2001 U.S. Dist. LEXIS 22520 (N.D. Cal. Dec. 6, 2001), Oyster complained that Forms Processing used a bot to copy the metatags on Oyster’s site in order to redirect web user searches to its own site, where it sold directly competing products. The court held that the mere fact that the conduct was an unauthorized use of Oyster’s computer system was sufficient for Oyster’s claims to proceed to trial.
In EF Cultural Travel BV v. Explorica, Inc., 274 F.3d 577 (1st Cir. 2001), Explorica deployed a bot to scrape 60,000 lines of data from EF’s website to determine all its prices for global high school student tours in order to offer a competing service at lower prices. EF’s server was accessed more than 30,000 times. The court held that EF’s expenditure of substantial sums to assess the extent, if any, of the physical damage to its website caused by the intrusion was sufficient to support an injunction forbidding Explorica’s scraping.
Most recently, Oracle sued SAP AG, No. 07-01658 (N.D. Cal. filed March 22, 2007), alleging that SAP used the login credentials of Oracle software customers to unleash a bot on Oracle’s customer support website and download thousands of files and documents. According to Oracle, SAP could then use those materials to provide lower-cost support services to Oracle software customers without investing the time and expense of developing such solutions itself. This case is ongoing.
Rule #3: Maximize Benefit and Minimize Disruption to the Visited Site
Lawsuits typically result because the proprietor of a scraped site dislikes the way the scraped data is used. While spammers and direct competitors may simply be out of luck, there are other uses that companies may ignore, accept, tolerate, or even welcome.
Dynamically generated content places more strain on web servers than static content. If the bot you are thinking of using requests thousands of dynamically generated results every day in perpetuity, then the owner of the targeted site is going to notice and will not be amused.
Find a way to:
1. retrieve static content instead of dynamic content,
2. limit the total number of requests coming from the bot,
3. limit the amount of data retrieved in each request, or
4. space out the requests over longer time intervals.
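Items 2 through 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recommended or legally sufficient implementation: the class name PoliteThrottle, the fetch helper, the budget and interval values, the byte cap, and the example User-Agent string are all hypothetical, and an appropriate limit for any real site would have to be worked out with (or inferred from) that site.

```python
import time
import urllib.request


class PoliteThrottle:
    """Caps the total number of requests and spaces them out in time."""

    def __init__(self, max_requests, min_interval_s,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests      # hypothetical total request budget
        self.min_interval_s = min_interval_s  # hypothetical minimum gap, in seconds
        self._clock = clock                   # injectable so the logic can be tested
        self._sleep = sleep
        self._sent = 0
        self._last = None

    def acquire(self):
        """Wait until the next request is allowed; return False once the budget is spent."""
        if self._sent >= self.max_requests:
            return False
        if self._last is not None:
            wait = self.min_interval_s - (self._clock() - self._last)
            if wait > 0:
                self._sleep(wait)
        self._last = self._clock()
        self._sent += 1
        return True


def fetch(url, throttle, max_bytes=65536,
          user_agent="example-bot/0.1 (contact: admin@example.com)"):
    """Read at most max_bytes from url, or return None if the budget is exhausted."""
    if not throttle.acquire():
        return None
    # The User-Agent value is a placeholder; it identifies the bot rather than hiding it.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read(max_bytes)
```

A bot built this way would create one throttle, for example PoliteThrottle(max_requests=100, min_interval_s=30), and route every request through fetch. The sketch does not address item 1; whether static content is available depends entirely on the target site.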
If none of these is feasible, consider asking permission from the visited sites. Technically savvy readers may be thinking of ways to disguise the origin of the bot’s requests. Bidder’s Edge tried a rotating system of proxy services to disguise the origin of its bot’s requests to eBay’s site. This merely started a technological arms race in which eBay developed ever more sophisticated means of detecting Bidder’s Edge’s requests while Bidder’s Edge kept trying, and failing, to hide where they came from. Being sneaky also doesn’t make you look good in court.
You may get away with scraping if:
1. your use for the data is not aimed at stealing or harassing a target site’s customers,
2. you can find a minimally-intrusive means of collecting the data, and
To have a site welcome your bot, you need something more: a planned use that benefits the site itself. Bidder’s Edge and Tickets.com, which Ticketmaster sued for scraping its event listings, actually drove more traffic to the sites they targeted, but that was not enough to satisfy the plaintiffs. Some sites want to ensure that their advertising is viewed and so will object to any middleman that sends customers straight to the products, services, or data they seek. Others are more practical and will accept such arrangements so long as an appropriate fee is paid.
None of this means frustration is unwarranted when companies are able to charge for otherwise freely available information. There is also an inherent line-drawing problem in encouraging individuals to visit one’s site for free while suing a company that visits 1,000 times per day. Would 500 times be too much? How about 150? You cannot know in advance when you will step across that subjective line.
Courts have not been terribly sympathetic to screen scraping, and so principled positions are typically going to give way to practical solutions.