Tags: python, scrapy, web-scraping

Websites getting crawled but not scraped Scrapy

Published on 2020-03-27 10:15:49

I have been scraping this website to store property listings, but while some properties do get scraped, others only get crawled and never scraped:

class CapeWaterfrontSpider(scrapy.Spider):
    name = "cape_waterfront"
    start_urls = ['https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES']

    def parse(self, response):
        for prop in response.css('div.col-sm-6.col-md-12.grid-sizer.grid-item'):

            link = prop.css('div.property-image a::attr(href)').get()

            bedrooms = prop.css('div.property-details li.bedrooms::text').getall()
            bathrooms = prop.css('div.property-details li.bathrooms::text').getall()
            gar = prop.css('div.property-details li.garages::text').getall()

            if len(bedrooms) == 0:
                bedrooms.append(None)
            else:
                bedrooms = bedrooms[1].split()
            if len(bathrooms) == 0:
                bathrooms.append(None)
            else:
                bathrooms = bathrooms[1].split()
            if len(gar) == 0:
                gar.append(None)
            else:
                gar = gar[1].split()

            yield scrapy.Request(
                link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    'title': ' '.join(prop.css('div.property-details p.intro::text').get().split()),
                    'price': ''.join(prop.css('div.property-details p.price::text').get().split()),
                    'bedrooms': str(bedrooms),
                    'bathroom':  str(bathrooms),
                    'garages': str(gar)
                }},
                callback=self.get_loc,
            )

        next_page = response.css('p.form-control-static.pagination-link a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Any suggestions on how to make this work? Thank you very much in advance.

Questioner: saraherceg
Answered by SIM, 2019-07-03 22:41

The way you have defined the selectors is error prone. Moreover, a few of the selectors are faulty and do not work at all. The link to the next page is not working either: the spider only scrapes page 1 and then quits. Lastly, I don't know of any next_sibling equivalent in CSS selectors, so I had to dig out that next-sibling text in a somewhat awkward manner.

import scrapy


class CapeWaterfrontSpider(scrapy.Spider):
    name = "cape_waterfront"
    start_urls = ['https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES']

    def parse(self, response):

        for prop in response.css('.grid-item'):
            link = prop.css('.property-image a::attr(href)').get()

            bedrooms = [elem.strip() for elem in prop.css(".bedrooms::text").getall()]
            bedrooms = bedrooms[-2] if len(bedrooms)>=1 else None

            bathrooms = [elem.strip() for elem in prop.css(".bathrooms::text").getall()]
            bathrooms = bathrooms[-2] if len(bathrooms)>=1 else None

            gar = [elem.strip() for elem in prop.css(".garages::text").getall()]
            gar = gar[-2] if len(gar)>=1 else None

            yield scrapy.Request(
                link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    'bedrooms': bedrooms,
                    'bathroom':  bathrooms,
                    'garages': gar
                }},
                callback=self.get_loc,
            )

        next_page = response.css('.pagination-link a.next::attr(href)').get()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def get_loc(self,response):
        items = response.meta['item']
        print(items)
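
For completeness, rather than just printing, get_loc would normally pull the remaining detail-page fields into the item and yield it. Here is a minimal sketch; the location selector is an assumption, so check the detail-page markup and adjust it:

def get_loc(self, response):
    item = response.meta['item']
    # assumed selector for the location on the detail page; adjust to the real markup
    item['location'] = response.css('h1.property-title::text').get(default='').strip()
    yield item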

If you want a cleaner approach to get those three fields, I think XPath is what you'll want to stick to:

for prop in response.css('.grid-item'):
    link = prop.css('.property-image a::attr(href)').get()
    bedrooms = prop.xpath("normalize-space(.//*[contains(@class,'bedrooms')]/label/following::text())").get()
    bathrooms = prop.xpath("normalize-space(.//*[contains(@class,'bathrooms')]/label/following::text())").get()
    gar = prop.xpath("normalize-space(.//*[contains(@class,'garages')]/label/following::text())").get()

I've left out a couple of fields for brevity; I suppose you can manage those yourself.
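
If you want to sanity-check any of these expressions before running the spider, scrapy shell is handy (using the listing URL from start_urls):

scrapy shell 'https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES'
>>> prop = response.css('.grid-item')[0]
>>> prop.xpath("normalize-space(.//*[contains(@class,'bedrooms')]/label/following::text())").get()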