Tags: python, scrapy, web-scraping

Websites getting crawled but not scraped Scrapy

Published on 2020-03-27 10:15:49

I have been scraping this website to store property listings, but while some properties do get scraped, others only get crawled and never scraped:

class CapeWaterfrontSpider(scrapy.Spider):
    name = "cape_waterfront"
    start_urls = ['https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES']

    def parse(self, response):
        for prop in response.css('div.col-sm-6.col-md-12.grid-sizer.grid-item'):

            link = prop.css('div.property-image a::attr(href)').get()

            bedrooms = prop.css('div.property-details li.bedrooms::text').getall()
            bathrooms = prop.css('div.property-details li.bathrooms::text').getall()
            gar = prop.css('div.property-details li.garages::text').getall()

            if len(bedrooms) == 0:
                bedrooms.append(None)
            else:
                bedrooms = bedrooms[1].split()
            if len(bathrooms) == 0:
                bathrooms.append(None)
            else:
                bathrooms = bathrooms[1].split()
            if len(gar) == 0:
                gar.append(None)
            else:
                gar = gar[1].split()

            yield scrapy.Request(
                link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    'title': ' '.join(prop.css('div.property-details p.intro::text').get().split()),
                    'price': ''.join(prop.css('div.property-details p.price::text').get().split()),
                    'bedrooms': str(bedrooms),
                    'bathroom':  str(bathrooms),
                    'garages': str(gar)
                }},
                callback=self.get_loc,
            )

        next_page = response.css('p.form-control-static.pagination-link a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Any suggestions on how to make this work? Thank you very much in advance.

Questioner: saraherceg
Answered by SIM, 2019-07-03 22:41

The way you have defined the selectors is error prone. Moreover, a few of the selectors are faulty and do not work at all. The link to the next page is not working either: the spider only scrapes page 1 and then quits. Lastly, I don't know of any next_sibling equivalent in CSS selectors, so I had to dig out that next-sibling text in a somewhat awkward manner.

import scrapy


class CapeWaterfrontSpider(scrapy.Spider):
    name = "cape_waterfront"
    start_urls = ['https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES']

    def parse(self, response):

        for prop in response.css('.grid-item'):
            link = prop.css('.property-image a::attr(href)').get()

            bedrooms = [elem.strip() for elem in prop.css(".bedrooms::text").getall()]
            bedrooms = bedrooms[-2] if len(bedrooms)>=1 else None

            bathrooms = [elem.strip() for elem in prop.css(".bathrooms::text").getall()]
            bathrooms = bathrooms[-2] if len(bathrooms)>=1 else None

            gar = [elem.strip() for elem in prop.css(".garages::text").getall()]
            gar = gar[-2] if len(gar)>=1 else None

            yield scrapy.Request(
                link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    'bedrooms': bedrooms,
                    'bathroom':  bathrooms,
                    'garages': gar
                }},
                callback=self.get_loc,
            )

        next_page = response.css('.pagination-link a.next::attr(href)').get()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def get_loc(self,response):
        items = response.meta['item']
        print(items)
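
For completeness, rather than just printing, get_loc would normally pull the remaining detail-page fields into the item and yield it. Here is a minimal sketch; the location selector is an assumption, so check the detail-page markup and adjust it:

def get_loc(self, response):
    item = response.meta['item']
    # assumed selector for the location on the detail page; adjust to the real markup
    item['location'] = response.css('h1.property-title::text').get(default='').strip()
    yield item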

If you want a cleaner approach to get those three fields, I think XPath is what you'll want to stick to:

for prop in response.css('.grid-item'):
    link = prop.css('.property-image a::attr(href)').get()
    bedrooms = prop.xpath("normalize-space(.//*[contains(@class,'bedrooms')]/label/following::text())").get()
    bathrooms = prop.xpath("normalize-space(.//*[contains(@class,'bathrooms')]/label/following::text())").get()
    gar = prop.xpath("normalize-space(.//*[contains(@class,'garages')]/label/following::text())").get()

I've left out a couple of fields for brevity; I suppose you can manage those yourself.
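
If you want to sanity-check any of these expressions before running the spider, scrapy shell is handy (using the listing URL from start_urls):

scrapy shell 'https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES'
>>> prop = response.css('.grid-item')[0]
>>> prop.xpath("normalize-space(.//*[contains(@class,'bedrooms')]/label/following::text())").get()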