python - Website being crawled but not scraped (Scrapy)

SIM 2019-07-03 22:41

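The way you have defined your selectors is error-prone. Moreover, a few of the selectors you defined hardly work at all. The link to the next page would not work either: it would only go to page 1 and then quit. Lastly, I don't know of any next_sibling usage in CSS selectors, so I had to dig out the next sibling in a somewhat awkward way.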

import scrapy


class CapeWaterfrontSpider(scrapy.Spider):
    name = "cape_waterfront"
    start_urls = ['https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES']

    def parse(self, response):

        for prop in response.css('.grid-item'):
            link = prop.css('.property-image a::attr(href)').get()

            # The value is the second-to-last text node, so [-2] needs at least two entries
            bedrooms = [elem.strip() for elem in prop.css(".bedrooms::text").getall()]
            bedrooms = bedrooms[-2] if len(bedrooms) >= 2 else None

            bathrooms = [elem.strip() for elem in prop.css(".bathrooms::text").getall()]
            bathrooms = bathrooms[-2] if len(bathrooms) >= 2 else None

            gar = [elem.strip() for elem in prop.css(".garages::text").getall()]
            gar = gar[-2] if len(gar) >= 2 else None

            yield scrapy.Request(
                link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    'bedrooms': bedrooms,
                    'bathroom':  bathrooms,
                    'garages': gar
                }},
                callback=self.get_loc,
            )

        next_page = response.css('.pagination-link a.next::attr(href)').get()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def get_loc(self, response):
        # The fields collected on the listing page arrive via response.meta
        items = response.meta['item']
        print(items)
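
The three "strip, then take the second-to-last text node" steps above could be folded into one helper. This is only a sketch: the name second_to_last is hypothetical, and the sample lists are made up to stand in for what prop.css(".bedrooms::text").getall() might return.

```python
from typing import List, Optional


def second_to_last(texts: List[str]) -> Optional[str]:
    """Strip each text node and return the second-to-last one, or None."""
    cleaned = [t.strip() for t in texts]
    # [-2] needs at least two entries; a shorter list would raise IndexError.
    return cleaned[-2] if len(cleaned) >= 2 else None


# Made-up samples standing in for the output of .getall()
print(second_to_last(["\n  3\n", "\n"]))
print(second_to_last(["\n"]))
print(second_to_last([]))
```

In the spider this would read bedrooms = second_to_last(prop.css(".bedrooms::text").getall()), and likewise for the other two fields.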

If you want a cleaner way to get those three fields, I think XPath is what you should stick with:

for prop in response.css('.grid-item'):
    link = prop.css('.property-image a::attr(href)').get()
    bedrooms = prop.xpath("normalize-space(.//*[contains(@class,'bedrooms')]/label/following::text())").get()
    bathrooms = prop.xpath("normalize-space(.//*[contains(@class,'bathrooms')]/label/following::text())").get()
    gar = prop.xpath("normalize-space(.//*[contains(@class,'garages')]/label/following::text())").get()

For brevity, I have left out two or three fields; I expect you can handle those yourself.

SIM 2019-07-03 22:42:53

@saraherceg I've included both the CSS selectors and the XPath in the script to dig out those items.
