There is a lot of hullabaloo about Artificial Intelligence. And rightfully so, because we humans have understood very clearly that automating some aspects of our work will free up time for the aspects that are far more critical.

For example, for an underwriter at a medical insurance provider, inspecting pre-insurance health checkup documents to assign a risk score to a profile is a necessary job, but it is extremely mundane, and having human workers do it makes the process prone to error.

AI comes to the rescue in such a situation. An AI model can be trained today on such documents to detect anomalies when it encounters a new document and highlight them to the medical insurance provider, leaving the critical task of assigning the risk score to the underwriter.

Another example is Tuberculosis detection through nodules found in chest X-rays. As an established practice, radiologists go through the chest X-rays along with rapid sputum test results and, through experience, are able to figure out whether a certain nodule formation points towards TB. Once it is detected, the radiologist can plan further diagnosis or medical care. However, again, this is prone to human error and time-consuming in general, especially since the radiologist still has to go through every case before reaching the one that needs medical care, losing precious time.

Now, AI comes to the rescue again. An AI model trained on a multiplicity of chest X-rays can reduce that time to seconds, while the radiologist focuses on the medical care that is required.

Today, almost every industry abounds with examples of such use cases being solved with AI models, and this class of capability has been given the name ANI (Artificial Narrow Intelligence).

But such a capability makes AI models extremely data hungry. Therefore, a data scientist's true and real headache is getting hold of a large enough set of data to train such a model.

Today, with its ever-increasing reach, the Internet has become a gold mine for data.

So scraping data from public websites, known as web scraping, has become the norm for sourcing data for those AI models.

So there are two big questions: what risk does this create? And how do you do it so that legal risk is minimized?

Web scraping is not illegal in itself.

The act of running a script through a website and retrieving information has not, in itself, been deemed illegal under any law.

Let us examine where a contested situation could arise.

Take LinkedIn as an example: you, I, and everyone else who posts written, video, or pictorial content on LinkedIn still retain ownership of that content.

We knew some of you would be amazed, so here is the link.

Now, content creators provide licenses to their works to LinkedIn as a content host. LinkedIn, in turn, owns the copyright to the way the content is structured, the code, and the overall makeup of the website.

So even if content creators disclaim copyright ownership of their content, essentially releasing it into the public domain, the overall page would consist of copyrighted plus public-domain content.

But the scraping code would not know this.

It would indiscriminately search for, say, images referenced in the page's HTML and CSS and download them for you. And that is where the problem starts.
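To make the point concrete, here is a minimal sketch of how indiscriminate such a scraper is. It assumes Python with the widely used requests and BeautifulSoup libraries and a hypothetical example.com target; it is an illustration, not a recommendation to run it against anyone's site.

```python
# Hypothetical sketch: a naive scraper that downloads every image it can find
# on a page. Nothing in it can tell user-generated content apart from the
# site's own copyrighted layout, code, or assets.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def scrape_all_images(page_url: str, out_dir: str = "images") -> None:
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Grab every <img> tag indiscriminately -- the script has no notion of
    # who owns the image or under what license it was posted.
    for i, img in enumerate(soup.find_all("img")):
        src = img.get("src")
        if not src:
            continue
        img_url = urljoin(page_url, src)
        data = requests.get(img_url, timeout=10).content
        with open(os.path.join(out_dir, f"img_{i}.jpg"), "wb") as f:
            f.write(data)


# scrape_all_images("https://example.com")  # hypothetical target
```

The script sees URLs and bytes, not ownership or licenses, which is exactly why the legal questions below arise.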

This is true for Facebook, Amazon, eBay, and every other e-commerce or social media website of today.

Further, these websites find it very troubling to be hit by millions of such scraping scripts in a day (not a week, not a year). They have spent a fortune to create a platform for creators and are responsible for keeping their content safe. And then there is the personal information that abounds on these websites.

So they sue businesses that scrape, crawl, or sniff data from them. Read more here and here.

Laws across the globe are still playing catch-up on whether accessing such copyrighted material is legal ("Access") and, further, whether the use of such material for creating AI models can be deemed legal ("Use") (fair use, for our lawyer friends).

So there are two issues here: Access and Use.

Some jurisdictions have come to terms with the reality of AI and have defined such Use as fair use under copyright law (e.g. the EU Copyright Directive, 2019); however, the legality of Access is still being decided on a case-by-case basis. Also, it is unlikely that you have a lawful basis for handling the personal information received as part of the scraping in your business (unless a company has hired you to do the scraping for it).

But then, you cannot wait for the lawmakers of the world to come to a conclusion before your pet data science project bears fruit. Right?

So what can be done?

You may follow this algorithmic approach.

Step 1. Are you looking for data at random over the Internet, or do you know the website you want to collect data from? For the former, go to Step 2; for the latter, go to Step 3. Or go directly to Step 6.

Step 2. If you are looking for images of impersonal objects, start with Google Images filtered by "Labeled for reuse", or search stock photo sites. If this does not work, go to Step 5.

Step 3. If you know the website you are going to scrape, go straight to the bottom of its page. Did you find a Terms of Use or Terms and Conditions link? Open it and CTRL+F for "scrap". Did you find the word? Stop! Go to Step 5. Otherwise, go to Step 4. (A small script automating this check is sketched after these steps.)

Step 4. Are you accessing a European website, e.g. Amazon.co.uk? Then you are mostly OK. If you are accessing a website of any other region, go to Step 5.

Step 5. Can you encourage your employees or friends to take photos and send them to you, or can you buy the photos? If yes, you do not have to worry. Otherwise, go to Step 6.

Step 6. Send the types of data required for your purpose, how you intend to use the data, and the websites you are accessing to your lawyer. Or to [email protected]
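For Step 3, the manual CTRL+F check can be roughed out in code. The sketch below is my own assumption of how to automate it (the author describes a manual check): it fetches a site's robots.txt and a given Terms page and flags wording that suggests scraping is restricted. The example.com URLs and the keyword list are hypothetical, and a clean result here is not legal clearance.

```python
# Rough sketch of Step 3 as code: look for signals that a site restricts
# scraping, via robots.txt and a keyword scan of its Terms of Use page.
import urllib.robotparser

import requests


def scraping_red_flags(site: str, terms_url: str, path: str = "/") -> list:
    flags = []

    # robots.txt is a machine-readable (though not legally binding) signal.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(site.rstrip("/") + "/robots.txt")
    rp.read()
    if not rp.can_fetch("*", site.rstrip("/") + path):
        flags.append("robots.txt disallows fetching this path")

    # The manual CTRL+F from Step 3, done programmatically.
    terms_text = requests.get(terms_url, timeout=10).text.lower()
    for word in ("scrap", "crawl", "spider", "harvest"):
        if word in terms_text:
            flags.append('Terms page mentions "%s"' % word)
    return flags


# Hypothetical usage -- if the list is non-empty, stop and go to Step 5:
# print(scraping_red_flags("https://example.com",
#                          "https://example.com/terms-of-use"))
```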