"500 Internal Server Error" when combining Scrapy over Splash with an HTTP proxy

I'm trying to run a Scrapy spider in a Docker container using Splash (to render JavaScript) and Tor via Privoxy (to provide anonymity). Here is the docker-compose.yml I'm using to this end:

    version: '3'

    services:
      scraper:
        build: ./apk_splash
        # environment:
        #   - http_proxy=http://tor-privoxy:8118
        links:
          - tor-privoxy
          - splash
      tor-privoxy:
        image: rdsubhas/tor-privoxy-alpine
      splash:
        image: scrapinghub/splash

where the scraper has the following Dockerfile:

    FROM python:alpine
    RUN apk --update add libxml2-dev libxslt-dev libffi-dev gcc musl-dev libgcc openssl-dev curl bash
    RUN pip install scrapy scrapy-splash scrapy-fake-useragent
    COPY . /scraper
    WORKDIR /scraper
    CMD ["scrapy", "crawl", "apkmirror"]

and the spider I'm trying to run is

    import scrapy
    from scrapy_splash import SplashRequest
    from apk_splash.items import ApkmirrorItem


    class ApkmirrorSpider(scrapy.Spider):
        name = 'apkmirror'
        allowed_domains = ['apkmirror.com']
        start_urls = [
            'http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/',
        ]
        custom_settings = {'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url=url, callback=self.parse, endpoint='render.html', args={'wait': 0.5})

        def parse(self, response):
            item = ApkmirrorItem()
            item['url'] = response.url
            item['developer'] = response.css('.breadcrumbs').xpath('.//*[re:test(@href, "^/(?:[^/]+/){1}[^/]+/$")]/text()').extract_first()
            item['app'] = response.css('.breadcrumbs').xpath('.//*[re:test(@href, "^/(?:[^/]+/){2}[^/]+/$")]/text()').extract_first()
            item['version'] = response.css('.breadcrumbs').xpath('.//*[re:test(@href, "^/(?:[^/]+/){3}[^/]+/$")]/text()').extract_first()
            yield item

where I've added the following to settings.py:

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    SPLASH_URL = 'http://splash:8050/'
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

With the environment for the scraper container commented out, the scraper more or less works. I get logs containing the following:

    scraper_1 | 2017-07-11 13:57:19 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: None)
    scraper_1 | 2017-07-11 13:57:19 [scrapy.core.scraper] DEBUG: Scraped from 
    scraper_1 | {'app': 'Androbench (Storage Benchmark)',
    scraper_1 | 'developer': 'CSL@SKKU',
    scraper_1 | 'url': 'http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/',
    scraper_1 | 'version': '5.0'}
    scraper_1 | 2017-07-11 13:57:19 [scrapy.core.engine] INFO: Closing spider (finished)
    scraper_1 | 2017-07-11 13:57:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    scraper_1 | {'downloader/request_bytes': 1508,
    scraper_1 | 'downloader/request_count': 3,
    scraper_1 | 'downloader/request_method_count/GET': 2,
    scraper_1 | 'downloader/request_method_count/POST': 1,
    scraper_1 | 'downloader/response_bytes': 190320,
    scraper_1 | 'downloader/response_count': 3,
    scraper_1 | 'downloader/response_status_count/200': 2,
    scraper_1 | 'downloader/response_status_count/404': 1,
    scraper_1 | 'finish_reason': 'finished',
    scraper_1 | 'finish_time': datetime.datetime(2017, 7, 11, 13, 57, 19, 488874),
    scraper_1 | 'item_scraped_count': 1,
    scraper_1 | 'log_count/DEBUG': 5,
    scraper_1 | 'log_count/INFO': 7,
    scraper_1 | 'memusage/max': 49131520,
    scraper_1 | 'memusage/startup': 49131520,
    scraper_1 | 'response_received_count': 3,
    scraper_1 | 'scheduler/dequeued': 2,
    scraper_1 | 'scheduler/dequeued/memory': 2,
    scraper_1 | 'scheduler/enqueued': 2,
    scraper_1 | 'scheduler/enqueued/memory': 2,
    scraper_1 | 'splash/render.html/request_count': 1,
    scraper_1 | 'splash/render.html/response_count/200': 1,
    scraper_1 | 'start_time': datetime.datetime(2017, 7, 11, 13, 57, 13, 788850)}
    scraper_1 | 2017-07-11 13:57:19 [scrapy.core.engine] INFO: Spider closed (finished)
    apksplashcompose_scraper_1 exited with code 0

However, if I comment the environment lines back in in the docker-compose.yml, I get a 500 Internal Server Error:

    scraper_1 | 2017-07-11 14:05:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying  (failed 3 times): 500 Internal Server Error
    scraper_1 | 2017-07-11 14:05:07 [scrapy.core.engine] DEBUG: Crawled (500)  (referer: None)
    scraper_1 | 2017-07-11 14:05:07 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response : HTTP status code is not handled or not allowed
    scraper_1 | 2017-07-11 14:05:07 [scrapy.core.engine] INFO: Closing spider (finished)
    scraper_1 | 2017-07-11 14:05:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    scraper_1 | {'downloader/request_bytes': 3898,
    scraper_1 | 'downloader/request_count': 7,
    scraper_1 | 'downloader/request_method_count/GET': 4,
    scraper_1 | 'downloader/request_method_count/POST': 3,
    scraper_1 | 'downloader/response_bytes': 6839,
    scraper_1 | 'downloader/response_count': 7,
    scraper_1 | 'downloader/response_status_count/200': 1,
    scraper_1 | 'downloader/response_status_count/500': 6,
    scraper_1 | 'finish_reason': 'finished',
    scraper_1 | 'finish_time': datetime.datetime(2017, 7, 11, 14, 5, 7, 866713),
    scraper_1 | 'httperror/response_ignored_count': 1,
    scraper_1 | 'httperror/response_ignored_status_count/500': 1,
    scraper_1 | 'log_count/DEBUG': 10,
    scraper_1 | 'log_count/INFO': 8,
    scraper_1 | 'memusage/max': 49065984,
    scraper_1 | 'memusage/startup': 49065984,
    scraper_1 | 'response_received_count': 3,
    scraper_1 | 'retry/count': 4,
    scraper_1 | 'retry/max_reached': 2,
    scraper_1 | 'retry/reason_count/500 Internal Server Error': 4,
    scraper_1 | 'scheduler/dequeued': 4,
    scraper_1 | 'scheduler/dequeued/memory': 4,
    scraper_1 | 'scheduler/enqueued': 4,
    scraper_1 | 'scheduler/enqueued/memory': 4,
    scraper_1 | 'splash/render.html/request_count': 1,
    scraper_1 | 'splash/render.html/response_count/500': 3,
    scraper_1 | 'start_time': datetime.datetime(2017, 7, 11, 14, 4, 46, 717691)}
    scraper_1 | 2017-07-11 14:05:07 [scrapy.core.engine] INFO: Spider closed (finished)
    apksplashcompose_scraper_1 exited with code 0

In short, when using Splash to render JavaScript, I can't also successfully use HttpProxyMiddleware to route traffic through Tor via Privoxy. Can anyone see what's going wrong here?
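As an aside: Splash's HTTP API also accepts a proxy argument per request (a proxy URL or the name of a proxy profile), which scrapy-splash forwards through args. I haven't verified this in my setup, but a sketch of passing the proxy that way, so that Splash itself (rather than the scraper container) goes through Privoxy, would look like this (minimal hypothetical spider; tor-privoxy is the service name from my docker-compose.yml):

    import scrapy
    from scrapy_splash import SplashRequest


    class ProxiedSpider(scrapy.Spider):
        # hypothetical minimal spider, just to illustrate the 'proxy' argument
        name = 'apkmirror_proxied'
        start_urls = ['http://www.apkmirror.com/']

        def start_requests(self):
            for url in self.start_urls:
                # 'proxy' is passed through to Splash, which then makes its
                # outgoing requests via Tor/Privoxy itself
                yield SplashRequest(
                    url=url,
                    callback=self.parse,
                    endpoint='render.html',
                    args={'wait': 0.5, 'proxy': 'http://tor-privoxy:8118'},
                )

        def parse(self, response):
            self.logger.info('rendered %s (%d bytes)', response.url, len(response.body))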

Update

Following Paul's comment, I tried adapting the splash service as follows:

    splash:
      image: scrapinghub/splash
      volumes:
        - ./splash/proxy-profiles:/etc/splash/proxy-profiles

where I added a 'splash' directory to the main directory like so:

    .
    ├── apk_splash
    ├── docker-compose.yml
    └── splash
        └── proxy-profiles
            └── proxy.ini

and proxy.ini reads

    [proxy]
    host=tor-privoxy
    port=8118

As I understand it, this should cause the proxy to always be used (i.e., the default whitelist is ".*" and there is no blacklist).
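For reference, the same profile with the rules spelled out explicitly would look like this, as far as I understand the Splash proxy-profile format (the [rules] section here just restates the documented defaults):

    [proxy]
    ; hostname of the Privoxy service from docker-compose.yml
    host=tor-privoxy
    port=8118

    [rules]
    ; match every URL (this is the documented default anyway)
    whitelist=
        .*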

However, if I docker-compose build and docker-compose up again, I still get HTTP 500 errors. So the question remains how to resolve these.
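One way I intend to narrow this down is to hit Splash directly from inside the Compose network and inspect the error body, bypassing Scrapy entirely (a sketch; url and proxy are standard arguments of Splash's render.html endpoint):

    # run from e.g. the scraper container; hostnames as in docker-compose.yml
    curl -G 'http://splash:8050/render.html' \
         --data-urlencode 'url=http://www.apkmirror.com/' \
         --data-urlencode 'proxy=http://tor-privoxy:8118'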

(Incidentally, this question seems similar to https://github.com/scrapy-plugins/scrapy-splash/issues/117; however, I'm not using Crawlera, so I'm not sure how to adapt the answer.)

Update 2

Following Paul's second comment, I checked that tor-privoxy resolves inside the container by doing the following (while it was still running):

    ~$ docker ps -l
    CONTAINER ID        IMAGE                      COMMAND                  CREATED             STATUS              PORTS               NAMES
    04909e6ef5cb        apksplashcompose_scraper   "scrapy crawl apkm..."   2 hours ago         Up 8 seconds                            apksplashcompose_scraper_1
    ~$ docker exec -it $(docker ps -lq) /bin/bash
    bash-4.3# python
    Python 3.6.1 (default, Jun 19 2017, 23:58:41)
    [GCC 5.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import socket
    >>> socket.gethostbyname('tor-privoxy')
    '172.22.0.2'

As for how I run Splash, it's via a linked container, similar to the way described in https://splash.readthedocs.io/en/stable/install.html#docker-folder-sharing. I've verified that /etc/splash/proxy-profiles/proxy.ini is present in the container:

    ~$ docker exec -it apksplashcompose_splash_1 /bin/bash
    root@b091fbef4c78:/# cd /etc/splash/proxy-profiles
    root@b091fbef4c78:/etc/splash/proxy-profiles# ls
    proxy.ini
    root@b091fbef4c78:/etc/splash/proxy-profiles# cat proxy.ini
    [proxy]
    host=tor-privoxy
    port=8118

I will try Aquarium, but the question remains: why doesn't the current setup work?

Update 3

Following the Aquarium project structure suggested by paul trmbrth, I found that it is essential to name the .ini file default.ini rather than proxy.ini (otherwise it doesn't get "picked up" automatically). I got the scraper working this way (see my self-answer to How to use Scrapy with Splash and Tor over Privoxy in Docker Compose).
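Concretely, the layout that ended up working is the same as above, with only the profile file renamed:

    .
    ├── apk_splash
    ├── docker-compose.yml
    └── splash
        └── proxy-profiles
            └── default.ini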