Workflow Model: Download All Pages Before Parsing Them
Website scraping is batch work, and there are basically two workflow models for the job. One is: download page 1 => parse page 1 and save its records => download page 2 => parse page 2 => … . The other is: download all the pages => parse and save all the pages. You should go with the "download all, then parse and save" approach.
Why is that? Think about failure scenarios. You may fail in the middle of a batch run because of a parsing error (usually caused by a not-so-robust parsing program). If you go with "download one, parse one" and a parsing error happens, you may have to spend a while investigating and fixing your program, during which AWS EC2 (if you are using it) keeps charging you, and the site you are scraping may notice your "attack" and bring up an anti-scraping mechanism. Worse, when you retry, you may have to re-download the pages you have already downloaded, unless your program recorded where it last stopped and knows how to resume from there. That is doable, but tricky, and normally not worth it. Finally, one retry doesn't necessarily work. Your program may have bugs again and again; it usually takes more than two revisions to get a solid version, which brings even more frustration.
On the other hand, with a "download all first" approach a parsing error does not lead to any of the problems above. You already have all your pages. With all the material in hand, you are under much less pressure.
Time management is another reason to choose "download all first". You don't want to restart downloading, because it is time-consuming, whereas re-running the parsing takes only a few minutes. To sum up: first deal with the things that are not totally under your control, then do the remaining work with fewer worries.
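To make the split concrete, here is a minimal sketch of the two-phase workflow. downloadPage, parsePage and saveRecords are hypothetical placeholders for your own downloading, parsing and persistence code; the page count is just for illustration.

```java
import java.util.List;

// Minimal sketch of the "download all first" workflow (illustrative, not a full implementation).
public class TwoPhaseScraper {

    public static void main(String[] args) throws Exception {
        int totalPages = 200; // assumed page count for illustration

        // Phase 1: download everything to disk; a parsing bug cannot interrupt this phase.
        for (int page = 1; page <= totalPages; page++) {
            downloadPage(page); // writes the raw HTML to a local file
        }

        // Phase 2: parse and save from the local copies; a re-run costs minutes, not hours.
        for (int page = 1; page <= totalPages; page++) {
            saveRecords(parsePage(page));
        }
    }

    static void downloadPage(int page) { /* fetch the page and write it to its local file */ }
    static List<String> parsePage(int page) { return List.of(); /* parse the local file */ }
    static void saveRecords(List<String> records) { /* persist to a database or CSV */ }
}
```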
Save All Files in a Fixed Path and Use a Single File Path API
Page downloading involves retrying. You don't want to re-download the pages you already fetched in previous attempts. One way to achieve this is to check whether the corresponding files already exist, which is why you must save the files to a fixed path on every attempt.
You may also want to use a single file path API in every module of your program to decide where the files are or should be, so that you don't need to pass paths around as module parameters. This not only simplifies your code but also enforces the fixed-path scheme.
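A minimal sketch of what such a single file path API could look like; the base directory and the file naming scheme are assumptions for illustration only.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// One class that every module asks for file locations, so the fixed-path scheme is enforced in one place.
public final class PageStore {

    private static final Path BASE_DIR = Paths.get("/data/scrape/target-site"); // hypothetical fixed root

    private PageStore() {}

    /** Fixed, deterministic location of a downloaded page: same input, same path on every run. */
    public static Path pageFile(String category, int pageIndex) {
        return BASE_DIR.resolve(category).resolve("page-" + pageIndex + ".html");
    }

    /** True if the page was already downloaded in a previous attempt, so it can be skipped. */
    public static boolean alreadyDownloaded(String category, int pageIndex) {
        return Files.exists(pageFile(category, pageIndex));
    }
}
```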
Log Errors and Important Statistics
You must log errors so you can find out whether you got all the data, how failures happened, and which pages need to be re-downloaded.
You should also record key statistics, such as how many records a landing page says there will be, so that you can validate your final results against that number. The statistics also give you a basis for measuring how long the job takes.
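For illustration, here is a sketch of the kind of log lines that make later validation possible, using java.util.logging; the logger name, message formats and method names are my own assumptions, not a fixed scheme.

```java
import java.util.logging.Logger;

// Log the numbers you will need later: expected record counts and failed downloads.
public class ScrapeLog {

    private static final Logger LOG = Logger.getLogger("scraper");

    public static void expectedRecords(String category, int count) {
        // The landing page's own count, used later to validate the final results.
        LOG.info("EXPECTED category=" + category + " records=" + count);
    }

    public static void downloadFailed(String url, Exception cause) {
        // Failed URLs can be grepped out of the log and re-downloaded in a follow-up run.
        LOG.warning("DOWNLOAD_FAILED url=" + url + " reason=" + cause);
    }
}
```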
Make Your Downloading Faster and More Robust
To make the downloading faster, you can adopt a thread-pool-based design and download the pages in parallel. You should also reuse your HTTP connections, since establishing a connection is quite time-consuming. If you are using Java, try Apache HttpClient's PoolingHttpClientConnectionManager.
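Here is a minimal sketch of that design, assuming Apache HttpClient 4.x; the pool sizes, thread count and URL pattern are illustrative assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

// Thread pool for parallel downloads plus a pooled connection manager for connection reuse.
public class ParallelDownloader {

    public static void main(String[] args) throws Exception {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(20);            // total connections shared by all workers (assumed)
        cm.setDefaultMaxPerRoute(10);  // connections kept open per target host (assumed)

        CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                .build();

        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int page = 1; page <= 100; page++) {
            final String url = "https://example.com/list?page=" + page; // hypothetical target
            pool.submit(() -> {
                try {
                    String html = client.execute(new HttpGet(url),
                            response -> EntityUtils.toString(response.getEntity()));
                    // save html to its fixed path here
                } catch (Exception e) {
                    // log and leave this page for a retry pass
                }
            });
        }
        pool.shutdown();
    }
}
```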
Let your downloading worker retry 2-5 times when it fails to download a page, to increase the chance that you get all the data in one batch. You can have it sleep 100 ms before retrying, so that the website can "take a breath" and serve you again. You must also decide what counts as a failure and which failures are "retriable" (see the sketch after this list). Here is my list:
1. An HTTP error with status code >= 500 is a retriable failure
2. An HTTP 200 response saying something like "cannot work for now" is a retriable failure
3. An HTTP 200 response with too little data is a retriable failure
4. A network issue such as a timeout or an IO exception is a retriable failure
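A sketch of how the retry loop and this classification could be wired together; fetch(), the attempt count and the "too little data" threshold are hypothetical placeholders, not a fixed recipe.

```java
// Retry a few times, sleeping 100 ms between attempts, and only for "retriable" failures.
public class RetryingWorker {

    private static final int MAX_ATTEMPTS = 3;      // somewhere in the 2-5 range
    private static final int MIN_BODY_LENGTH = 500; // "too little data" threshold, assumed

    public String downloadWithRetry(String url) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                Response r = fetch(url);
                if (r.status >= 500) {
                    // retriable: server-side error
                } else if (r.status == 200 && r.body.contains("cannot work for now")) {
                    // retriable: HTTP 200 that is really a failure message
                } else if (r.status == 200 && r.body.length() < MIN_BODY_LENGTH) {
                    // retriable: suspiciously small response
                } else {
                    return r.body; // success
                }
            } catch (java.io.IOException e) {
                // retriable: timeout or other network issue
            }
            Thread.sleep(100); // let the website "take a breath"
        }
        return null; // log this URL as failed so it can be re-downloaded later
    }

    static class Response { int status; String body; }

    private Response fetch(String url) throws java.io.IOException {
        // perform the actual HTTP GET here
        return new Response();
    }
}
```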
Deal with the Website’s Performance Issues
A typical performance problem of the target website is that it may fail or refuse to respond when you query a large data set. For example, if you search for something within the "shoes" category, you may get the results quickly; but when you run the same search within the "clothes" category, it may take quite a while or even fail.
Another problem is related to paging. You may find that the program runs well while it scrapes the first few pages, then starts to fail once the page index reaches 100 or so. This happens a lot with Lucene/Solr-based websites.
To deal with both problems, you can split your target category into smaller ones and run the query on each. Small categories normally have far fewer records and fewer pages.
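A rough sketch of this splitting strategy, with subCategoriesOf() and scrapeCategory() as hypothetical helpers standing in for your own category lookup and scraping code.

```java
import java.util.List;

// Query many small categories instead of one huge one, so deep paging is avoided.
public class CategorySplitter {

    public void scrape(String bigCategory) {
        List<String> subCategories = subCategoriesOf(bigCategory); // e.g. "clothes" -> "t-shirts", "coats", ...
        for (String sub : subCategories) {
            // Each sub-category has far fewer records and pages, so the site is
            // less likely to time out or fail on it.
            scrapeCategory(sub);
        }
    }

    private List<String> subCategoriesOf(String category) { return List.of(); }
    private void scrapeCategory(String category) { /* query and download its pages */ }
}
```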
Exploit the Cloud if Necessary
A scraping program must run on a capable machine, unless you are only targeting a small dataset. It needs network access close to your target website. It must have multiple CPUs if the program is multi-threaded, and it should have plenty of memory. Finally, it must have large storage, otherwise your disk could fill up quickly.
Your own computer may not satisfy all of these requirements. In that case, try the cloud. I often use AWS EC2 to scrape US websites; I can get everything I need from EC2, and its cost is low and flexible.
Be Honest with Your Sponsor
If you are scraping for yourself, you can skip this part. Read on if you are a freelancer, or if you are doing it for a client or your boss.
Websites themselves may not be accurate. A landing page may say it has 10k records when it actually has only 9.5k. A site may tell you that you'll get 1,000 records in total when you query 20 records per page, yet when you choose 100 records per page to make the run faster, you end up with only 900.
Find these inaccuracies and let your sponsor know. Make it clear that it is not your fault that some records will be missing.
Sometimes you simply can't download all the records, due to the site's poor performance, technical difficulties, or time and budget limits. Let your sponsor know this too. Ask what percentage of missing records they can tolerate, and agree on an acceptable threshold.
Be honest with your sponsor. It is better that you find the problems before they do.
When the scraping is done, provide your sponsor with a validation report. Tell them how many records are missing for each category (you can work this out by analyzing the final results and the logs) and provide links for them to check. Let them feel that the job is under their control.