How to scrape Casenet using Scrapy FormRequest?























I would like to scrape this site: https://www.courts.mo.gov/casenet/cases/searchCases.do?searchType=name



Here is my code:



import scrapy
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from Challenge6.items import Challenge6Item

class CasenetSpider(scrapy.Spider):
    name = "casenet"

    def start_requests(self):
        start_urls = [
            "https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name"
        ]
        Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse", follow=True),)
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        data = {
            inputVO.lastName: 'smith',
            inputVO.firstName: 'fred',
            inputVO.yearFiled: 2010,
        }
        yield scrapy.FormRequest(url="https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name", formdata=data, callback=self.parse_pages)
        casenet_row = Selector(response).xpath('//tr[@align="left"]')

    def parse_pages(self, response):
        for row in casenet_row:
            if "Part Name" not in row or "Address on File" not in row:
                item = Challenge6Item()
                item['name'] = quote.xpath('div[@class="tags"]/a[@class="tag"]/text()').extract()
                yield item
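Two problems are visible in the posted `parse` method even before any network error: the `formdata` keys are written as bare names (`inputVO.lastName`), which Python would try to resolve as an attribute of an undefined `inputVO` object and raise a `NameError`, and `scrapy.FormRequest` expects string values. A minimal sketch of a well-formed payload follows; the `inputVO.*` field names are taken from the posted code, not verified against the live form:

```python
# Case.net form field names look like "inputVO.lastName"; they are plain
# strings, so they must be quoted when used as dict keys.
data = {
    'inputVO.lastName': 'smith',
    'inputVO.firstName': 'fred',
    'inputVO.yearFiled': '2010',  # numeric fields are sent as strings
}

# scrapy.FormRequest(url=..., formdata=data, ...) wants str keys and values:
assert all(isinstance(k, str) and isinstance(v, str) for k, v in data.items())
```

With a dict like this, `scrapy.FormRequest(url=..., formdata=data, callback=self.parse_pages)` serializes the fields as an ordinary POST body. Separately, the deprecation warning in the log is silenced by importing `Rule` from `scrapy.spiders` rather than `scrapy.contrib.spiders`.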


However, I'm getting this error:




/var/www/html/challenge6/Challenge6/Challenge6/spiders/casenet_crawler.py:3: ScrapyDeprecationWarning: Module scrapy.contrib.spiders is deprecated, use scrapy.spiders instead
  from scrapy.contrib.spiders import Rule
2018-11-14 17:47:54 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: Challenge6)
2018-11-14 17:47:54 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-4.4.0-1066-aws-x86_64-with-Ubuntu-16.04-xenial
2018-11-14 17:47:54 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Challenge6.spiders', 'SPIDER_MODULES': ['Challenge6.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'Challenge6'}
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled item pipelines:
2018-11-14 17:47:55 [scrapy.core.engine] INFO: Spider opened
2018-11-14 17:47:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-14 17:47:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/robots.txt> (failed 1 times):
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/robots.txt> (failed 2 times):
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.courts.mo.gov/robots.txt> (failed 3 times):
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.courts.mo.gov/robots.txt>:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
ResponseNeverReceived:
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name> (failed 1 times):
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name> (failed 2 times):
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name> (failed 3 times):
2018-11-14 17:47:56 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
ResponseNeverReceived:
2018-11-14 17:47:56 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-14 17:47:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
 'downloader/request_bytes': 1455,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 11, 14, 23, 47, 56, 195277),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'memusage/max': 52514816,
 'memusage/startup': 52514816,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 11, 14, 23, 47, 55, 36009)}




What am I doing wrong?


































  • The error message says it was unable to retrieve `courts.mo.gov/casenet/cases/nameSearch.do?searchType=name`. Can you check whether the URL exists and is reachable?
    – shanmuga
    Nov 15 at 12:39
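The reachability check suggested above can be done with the standard library alone, outside Scrapy. This is a sketch assuming Python 3 (`urllib.request`); the browser-like `User-Agent` header is an assumption, on the theory that some servers drop requests from default Python clients:

```python
import urllib.request


def probe(url, timeout=10):
    """Return the HTTP status code if `url` responds, else None."""
    # Browser-like User-Agent, in case the server rejects default clients.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except Exception:
        return None


# Hypothetical usage; the result depends on the network and the site's TLS setup:
# probe("https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name")
```

If this probe also fails from the same machine, the problem is connectivity or TLS negotiation rather than the spider itself; Twisted's `ResponseNeverReceived` is consistent with a connection or handshake that never completes.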















python-2.7 web-scraping scrapy scrapy-spider

asked Nov 15 at 0:14 by Jorden Whitbey
edited Nov 15 at 10:21 by Jesse de Bruijne
















