I'm halfway through writing a scraper with Scrapy and am worried that its asynchronous behaviour may result in issues.
I start on a page that has several links `a`, from each of which I get an `x` which is saved (downloaded). Then I go to another page `b`, where I use some info I got from the `a` links (it is constant for all of them) to select and download `y`. Then I "pair" `x` and `y`; how I pair them is not important, what matters is just that `x` and `y` both exist (are downloaded).
Now I would consider my starting page (`start_urls`) processed, and I would get the link to 'turn' the page (as in: I'm on page 1 and am now going to page 2), which I then Request to start the process from the beginning.
The code looks roughly like this:
    # ..imports, class etc.

    start_url = ['bla']
    start_url_buddy = ['bli']

    def parse(self, response):
        urls = response.xpath(...)
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_child)
        yield scrapy.Request(start_url_buddy, callback=self.parse_buddy)
        # Finished processing start page. Now turning the page.
        # Could do something like this to get the next page:
        nextpage_url = response.xpath(...@href)
        # or maybe something like this?
        # self.start_urls.append(nextpage_url)

    # links `a`
    def parse_child(self, response):
        # info for processing link `b`
        self.info = response.xpath(...)
        # download link
        x = response.xpath(...)
        # urlopen etc., write x to file in central dir

    # link `b`
    def parse_buddy(self, response):
        # download link
        y = response.xpath(...self.info...)
        # urlopen etc., write y to file in central dir
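A side note on the `self.info` attribute in the sketch above: once more than one page is in flight, callbacks from different pages interleave, so one page's `parse_child` can overwrite `self.info` before another page's `parse_buddy` reads it. Scrapy lets you attach such state to the request itself via `cb_kwargs` (or `Request.meta`) instead of the spider instance. The following is a toy, Scrapy-free sketch of the difference (all class and value names are made up for illustration):

```python
# Toy illustration: callbacks from two pages interleave. Storing
# page-specific state on `self` can be overwritten before it is used;
# state attached to each request (as Scrapy's cb_kwargs does) cannot.

class SharedStateSpider:
    """Mimics `self.info = ...` shared across callbacks."""
    def parse_child(self, page):
        self.info = f"info-from-{page}"
    def parse_buddy(self, page):
        return (page, self.info)  # may see another page's info

class PerRequestSpider:
    """Mimics passing info along with the request via cb_kwargs."""
    def parse_child(self, page):
        return {"info": f"info-from-{page}"}  # travels with the request
    def parse_buddy(self, page, info):
        return (page, info)

# Interleaved schedule: child(1), child(2), buddy(1)
shared = SharedStateSpider()
shared.parse_child(1)
shared.parse_child(2)          # overwrites page 1's info
wrong = shared.parse_buddy(1)  # page 1 paired with page 2's info

per = PerRequestSpider()
kw1 = per.parse_child(1)
kw2 = per.parse_child(2)
right = per.parse_buddy(1, **kw1)  # page 1 keeps its own info

print(wrong)   # (1, 'info-from-2')
print(right)   # (1, 'info-from-1')
```

In real Scrapy code the second pattern would look like `yield scrapy.Request(url, callback=self.parse_buddy, cb_kwargs={'info': info})`, with `parse_buddy(self, response, info)` receiving it as an argument.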
I haven't gotten to the page-turning part yet and am worried whether it will work as intended (I'm fiddling with the merge function atm; getting `x` and `y` works fine for one page). I don't care in what order the `x` and `y` are gotten, as long as it's before `pair_function` and 'turning the page' (when the `parse` function should be called again).
I have looked at a couple of other SO questions like this one, but I haven't been able to get a clear answer from them. My basic problem is that I'm unsure how exactly the asynchronicity is implemented (it doesn't seem to be explained in the docs?).
EDIT: To be clear, what I'm scared will happen is that `yield scrapy.Request(nextpage_url)` will be called before the previous ones have gone through. I'm now thinking I can maybe safeguard against that by just appending to `start_urls` (as I've done in the code) after everything has been done (the logic being that this should result in the `parse` function being called on the appended URL?).
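My current mental model of the asynchronicity (simplified, and not Scrapy's actual engine): `parse` is a generator; every Request it yields is merely handed to a scheduler, downloads then run concurrently, and each callback fires whenever its response happens to arrive. So callbacks run in completion order, not in yield order. The following single-threaded toy simulates that with made-up per-request latencies (all names here are invented for illustration, nothing is real Scrapy API):

```python
import heapq

# Toy model of Scrapy's scheduling: yielding a Request only schedules it;
# its callback runs when the simulated download finishes. With different
# latencies, the next page's parse can run before the previous page's
# x and y are saved.

class Request:
    def __init__(self, url, callback, latency):
        self.url, self.callback, self.latency = url, callback, latency

log = []

def parse(url):
    log.append(f"parse {url}")
    yield Request(f"{url}/child", parse_child, latency=3)
    yield Request(f"{url}/buddy", parse_buddy, latency=5)
    if url == "page1":                         # "turn the page" once
        yield Request("page2", parse, latency=1)

def parse_child(url):
    log.append(f"save x from {url}")
    return ()                                  # schedules nothing new

def parse_buddy(url):
    log.append(f"save y from {url}")
    return ()

def crawl(start):
    counter, pending = 0, [(0, 0, Request(start, parse, 0))]
    while pending:
        done_at, _, req = heapq.heappop(pending)     # earliest-finishing download
        for new in req.callback(req.url):            # run callback, schedule more
            counter += 1
            heapq.heappush(pending, (done_at + new.latency, counter, new))

crawl("page1")
print(log)
# ['parse page1', 'parse page2', 'save x from page1/child',
#  'save x from page2/child', 'save y from page1/buddy',
#  'save y from page2/buddy']
```

Note how `parse page2` appears before `save y from page1/buddy`: the feared interleaving does happen in this model, since yielding the next-page request in `parse` doesn't wait for the child/buddy callbacks. If that model is right, the ordering guarantee would have to come from only yielding the next-page request from the last callback in the chain, rather than from `parse` itself.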