Base spider dynamic pagination with AJAX and Selenium

I am trying to run my base spider for dynamic pagination, but the crawl is not working. I am using Selenium to handle the AJAX-driven dynamic pagination. The URL I am using is http://www.demo.com. Here is my code:

    # -*- coding: utf-8 -*-
    import scrapy
    import re
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.spider import BaseSpider
    from demo.items import demoItem
    from selenium import webdriver

    def removeUnicodes(strData):
        if(strData):
            #strData = strData.decode('unicode_escape').encode('ascii','ignore')
            strData = strData.encode('utf-8').strip()
            strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
            #print 'Output:', strData
        return strData

    class demoSpider(scrapy.Spider):
        name = "demourls"
        allowed_domains = ["demo.com"]
        start_urls = ['http://www.demo.com']

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            print "*****************************************************"
            self.driver.get(response.url)
            print response.url
            print "______________________________"
            hxs = Selector(response)
            item = demoItem()
            finalurls = []
            while True:
                next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
                try:
                    next.click()
                    # get the data and write it to scrapy items
                    item['pageurl'] = response.url
                    item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                    urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
                    print '**********************************************2***url*****************************************', urls
                    for url in urls:
                        print '---------url-------', url
                        finalurls.append(url)
                    item['urls'] = finalurls
                except:
                    break
            self.driver.close()
            return item

My items.py is:

    from scrapy.item import Item, Field

    class demoItem(Item):
        page = Field()
        urls = Field()
        pageurl = Field()
        title = Field()

When I try to crawl it and export to JSON, I get a JSON file like:

 [{"pageurl": "http://www.demo.com", "urls": [], "title": "demo"}] 

I am not able to crawl all the URLs, since they are loaded dynamically.
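I suspect the problem is that Selector(response) only sees the HTML that Scrapy originally downloaded, not the DOM after Selenium has executed the JavaScript, so the linkToDetails anchors are simply not there yet. A minimal check of what I mean, run inside parse() after self.driver.get(response.url) (the variable names are just for illustration):

    from scrapy.selector import Selector

    rendered = Selector(text=self.driver.page_source)  # DOM after the AJAX ran
    static = Selector(response)                        # raw HTML Scrapy downloaded
    print len(rendered.xpath('//a[@id="linkToDetails"]'))  # non-zero once rendered
    print len(static.xpath('//a[@id="linkToDetails"]'))    # likely 0 if links are injected by JS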

I hope the following code helps.

somespider.py

    # -*- coding: utf-8 -*-
    import scrapy
    import re
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.spider import BaseSpider
    from demo.items import DemoItem
    from selenium import webdriver

    def removeUnicodes(strData):
        if(strData):
            strData = strData.encode('utf-8').strip()
            strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
        return strData

    class demoSpider(scrapy.Spider):
        name = "domainurls"
        allowed_domains = ["domain.com"]
        start_urls = ['http://www.domain.com/used/cars-in-trichy/']

        def __init__(self):
            self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub",
                                           webdriver.DesiredCapabilities.HTMLUNITWITHJS)

        def parse(self, response):
            self.driver.get(response.url)
            self.driver.implicitly_wait(5)
            hxs = Selector(response)
            item = DemoItem()
            finalurls = []
            while True:
                next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
                try:
                    next.click()
                    # get the data and write it to scrapy items
                    item['pageurl'] = response.url
                    item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                    urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')
                    for url in urls:
                        url = url.get_attribute("href")
                        finalurls.append(removeUnicodes(url))
                    item['urls'] = finalurls
                except:
                    break
            self.driver.close()
            return item

items.py

    from scrapy.item import Item, Field

    class DemoItem(Item):
        page = Field()
        urls = Field()
        pageurl = Field()
        title = Field()

Note: You need to have the Selenium RC server running, because HTMLUNITWITHJS works only with Selenium RC when using Python.

Start your Selenium RC server by issuing the command:

 java -jar selenium-server-standalone-2.44.0.jar 

Run your spider with the command:

    scrapy crawl domainurls -o someoutput.json

First of all, you don't need to click the showMoreCars button, since it is triggered automatically after the page loads. Instead, waiting a moment will be enough.
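If a fixed pause turns out to be flaky, an explicit wait is a sturdier alternative (my own suggestion, not part of the original answer): it blocks until the detail links actually exist, up to a timeout.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get("http://www.carwale.com/used/cars-in-trichy/")
    # wait up to 10 seconds for at least one detail link to appear
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//a[@id="linkToDetails"]')))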

Setting your Scrapy code aside, Selenium by itself is able to capture all the hrefs for you. Here is an example of what you need to do in Selenium:

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://www.carwale.com/used/cars-in-trichy/#city=194&kms=0-&year=0-&budget=0-&pn=2")
    driver.implicitly_wait(5)
    urls = driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')
    for url in urls:
        print url.get_attribute("href")
    driver.close()

All you need to do is combine this with your Scrapy part (a sketch of that combination follows the output below).

Output:

    http://www.carwale.com/used/cars-in-trichy/renault-pulse-s586981/
    http://www.carwale.com/used/cars-in-trichy/marutisuzuki-ritz-2009-2012-s598266/
    http://www.carwale.com/used/cars-in-trichy/mahindrarenault-logan-2007-2009-s607757/
    http://www.carwale.com/used/cars-in-trichy/marutisuzuki-ritz-2009-2012-s589835/
    http://www.carwale.com/used/cars-in-trichy/hyundai-santro-xing-2003-2008-s605866/
    http://www.carwale.com/used/cars-in-trichy/chevrolet-captiva-s599023/
    http://www.carwale.com/used/cars-in-trichy/chevrolet-enjoy-s595824/
    http://www.carwale.com/used/cars-in-trichy/tata-indicav2-s606823/
    http://www.carwale.com/used/cars-in-trichy/tata-indicav2-s606617/
    http://www.carwale.com/used/cars-in-trichy/marutisuzuki-estilo-2009-2014-s592745/
    http://www.carwale.com/used/cars-in-trichy/toyota-etios-2013-2014-s605950/
    http://www.carwale.com/used/cars-in-trichy/tata-indica-vista-2008-2011-s599001/
    http://www.carwale.com/used/cars-in-trichy/opel-corsa-s591616/
    http://www.carwale.com/used/cars-in-trichy/hyundai-i20-2008-2010-s596173/
    http://www.carwale.com/used/cars-in-trichy/tata-indica-vista-2012-2014-s600753/
    http://www.carwale.com/used/cars-in-trichy/fiat-punto-2009-2011-s606934/
    http://www.carwale.com/used/cars-in-trichy/mitsubishi-pajero-s597849/
    http://www.carwale.com/used/cars-in-trichy/fiat-linea20082014-s596079/
    http://www.carwale.com/used/cars-in-trichy/tata-indicav2-s597390/
    http://www.carwale.com/used/cars-in-trichy/mahindra-xylo-2009-2012-s603434/
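To show what that combination could look like, here is a minimal sketch that folds the Selenium extraction above into a Scrapy spider returning the DemoItem defined earlier. The spider name and the 5-second implicit wait are assumptions of mine, not taken from the original code:

    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver
    from demo.items import DemoItem  # the item class shown earlier

    class CombinedSpider(scrapy.Spider):  # hypothetical name
        name = "combined"
        allowed_domains = ["carwale.com"]
        start_urls = ['http://www.carwale.com/used/cars-in-trichy/']

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            self.driver.get(response.url)
            self.driver.implicitly_wait(5)  # assumed: give the AJAX-loaded links time to render
            item = DemoItem()
            item['pageurl'] = response.url
            # read the hrefs from the Selenium-rendered DOM, not from the Scrapy response
            item['urls'] = [a.get_attribute("href")
                            for a in self.driver.find_elements_by_xpath('//a[@id="linkToDetails"]')]
            self.driver.close()
            return item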