Question: Use Scrapy to iteratively get items from two sites

Question

I'm halfway through writing a scraper with Scrapy and am worried that its asynchronous behaviour may result in issues.

I start on a page that has several links a, from each of which I get an item x. These x are saved (downloaded). Then I go to another page b, where I use some info obtained from one of the a links (it is the same for all of them) to select and download an item y.
Then I "pair" x and y. How I pair them is not important; what matters is that both x and y exist (have been downloaded).
At that point I would consider my starting page (start_urls) processed, and I would get the link that 'turns' the page (as in: I'm on page 1 and am now going to page 2), which I then Request to start the process over from the beginning.

The code looks roughly like this:

# ..imports, class etc.

start_urls = ['bla']
start_url_buddy = 'bli'


def parse(self, response):

    urls = response.xpath(...)
    for url in urls:
        yield scrapy.Request(url, callback=self.parse_child)

    yield scrapy.Request(self.start_url_buddy, callback=self.parse_buddy)

    pair_function(self.info)

    # Finished processing start page. Now turning the page.
    # could do something like this to get the next page:
    nextpage_url = response.xpath(...@href)
    yield scrapy.Request(nextpage_url)

    # or maybe something like this?
    self.start_urls.append(response.xpath(...@href))

# links `a`
def parse_child(self, response):

    # info for processing link `b`
    self.info = response.xpath(...)

    # Download link
    x = response.xpath(...)
    # urlopen etc. write x to file in central dir

# link `b`
def parse_buddy(self, response):

    # Download link
    y = response.xpath(...self.info...)
    # urlopen etc. write y to file in central dir

I haven't gotten to the page-turning part yet and am worried about whether it will work as intended (I'm still fiddling with the pairing function at the moment; getting the xs and y works fine for one page). I don't care in what order the xs and y are fetched, as long as that happens before pair_function and before 'turning the page' (when the parse function should be called again).

I have looked at a couple of other SO questions like this one, but I haven't been able to get a clear answer from them. My basic problem is that I'm unsure how exactly the asynchronicity is implemented (it doesn't seem to be explained in the docs?).

EDIT: To be clear, what I'm afraid will happen is that yield scrapy.Request(nextpage_url) will be processed before the previous requests have gone through. I'm now thinking I can maybe safeguard against that by just appending to start_urls (as I've done in the code) after everything has been done (the logic being that this should result in the parse function being called on the appended URL?).

Answers

You won't be able to know when a request has finished: Scrapy processes all of your requests, but it doesn't wait for the requested server to return a response before moving on to the next pending request.

With asynchronous calls you don't know "when" they will finish, but you do know "where", and that is what the callback method is for. So, for example, if you want to be sure that one request is made only after another has completed, you can do something like:

def callback_method_1(self, response):
    # working with response 1
    yield Request(url2, callback=self.callback_method_2)

def callback_method_2(self, response):
    # working with response 2, after response 1
    yield Request(url3, callback=self.callback_method_3)

def callback_method_3(self, response):
    # working with response 3, after response 2 
    yield myfinalitem

In this example you know for sure that the first request completed before the url2 request, and that one before url3. As you can see, you don't know exactly "when" these requests are done, but you do know "where".
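As a rough sketch only (the URLs and xpaths are placeholders, and the pending counter is an assumption added for illustration, not something from your code), the same chaining idea could look like this for your spider: request page b only after the last a link has been handled, and turn the page only after y has been downloaded:

import scrapy

class PairSpider(scrapy.Spider):
    name = 'pair'
    start_urls = ['http://example.com/page1']   # placeholder listing page
    buddy_url = 'http://example.com/buddy'      # placeholder page b

    def parse(self, response):
        child_urls = response.xpath('//a[@class="child"]/@href').extract()         # placeholder xpath
        self.next_page = response.xpath('//a[@rel="next"]/@href').extract_first()  # placeholder xpath
        self.pending = len(child_urls)   # how many `a` links still have to be downloaded
        for url in child_urls:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_child)

    def parse_child(self, response):
        # info is constant across the `a` links, per your description
        self.info = response.xpath('//span[@id="info"]/text()').extract_first()  # placeholder xpath
        x = response.xpath('//a[@id="x"]/@href').extract_first()                 # placeholder xpath
        # ... download / save x here ...
        self.pending -= 1
        if self.pending == 0:
            # every x has been handled -> only now request page b
            yield scrapy.Request(self.buddy_url, callback=self.parse_buddy)

    def parse_buddy(self, response):
        y = response.xpath('//a[@id="y"]/@href').extract_first()  # placeholder, would use self.info
        # ... download / save y, then pair x and y ...
        # both x and y exist -> only now turn the page
        if self.next_page:
            yield scrapy.Request(response.urljoin(self.next_page), callback=self.parse)

Keeping state on the spider (self.pending, self.info, self.next_page) is only safe here because one listing page is processed at a time: the next-page request isn't yielded until the current chain has finished.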

Also remember that one way to communicate between callbacks is the meta request argument.
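For instance, a minimal sketch of handing data from one callback to the next via meta (the 'info' key and the buddy URL are just illustrative):

def parse_child(self, response):
    info = response.xpath('//span[@id="info"]/text()').extract_first()  # placeholder xpath
    # attach `info` to the next request instead of storing it on the spider
    yield scrapy.Request('http://example.com/buddy',   # placeholder URL
                         callback=self.parse_buddy,
                         meta={'info': info})

def parse_buddy(self, response):
    info = response.meta['info']   # read back what the previous callback attached
    # ... use `info` to select and download y ...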
