First pass at Indiana state scraping #20

Open
marks wants to merge 3 commits into dobtco:master from marks:master

marks commented Mar 4, 2014

This is my first pass, loosely based on the IL scraper's JavaScript. There is more work to do, but this is a start.

Scraper output:

mba62:openrfps-scrapers mark$ bin/openrfps run scrapers/in/rfps.js 
[ 'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=300-14-63265&desc=Contract+for+Services+for+Invasive+Plant+Control&method=NEGOTIATED BID&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-058&desc=Centralized+Production+and+Direct+Distribution+of+License+Plates+and+Registration+Documents&method=RFP&code=7',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFQ_ISF21&desc=Misc.+Furniture&method=NOTICE TO BIDDERS&code=P',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFQ_ISF20&desc=End+Tables&method=NOTICE TO BIDDERS&code=P',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFQ_ISF19&desc=Sofas&method=NOTICE TO BIDDERS&code=P',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=300-14-63398&desc=Outright+Purchase+of+Airboat+and+Trailer+for+IDNR&method=NEGOTIATED BID&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFI-14-73&desc=Electronic+Media+Destruction+and+Shredding+Services&method=RFI&code=K',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=ASA-14-079&desc=QPA+for+Snack+Products+for+Pen+Products&method=NEGOTIATED BID&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-75&desc=Hard+Copy+Book+Collections&method=RFP&code=B',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFI-14-081&desc=Development,+Operation+and+Maintenance+of+an+Inn+and+Related+Facilities+at+Potato+Creek+State+Park&method=RFI&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-074&desc=Paint+and+Paint+Supplies&method=RFP&code=W',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-049&desc=Sustained+Statewide+Public+Relations+and+Marketing+Campaign&method=RFP&code=7',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=ASA-14-076&desc=Lab+Supplies&method=NEGOTIATED BID&code=V',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=ASA-14-089&desc=QPA+for+Cosmetic+Grade+Soap+Products+for+Pen+Products&method=NEGOTIATED BID&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=570-14-24746&desc=Wireless+Paging+System&method=NEGOTIATED BID&code=8',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-83&desc=DNA+Sample+Collection+Services&method=RFP&code=K' ]
Done scraping!
Cached results to scrapers/in/rfps.json
[
  {
    "html_url": "http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=300-14-63265&desc=Contract+for+Services+for+Invasive+Plant+Control&method=NEGOTIATED BID&code=T",
    "id": "300-14-63265",
    "type": "NEGOTIATED BID",
    "title": "Contract for Services for Invasive Plant Control",
    "responses_open_at": "3/4/2014",
    "contact_name": "Deaton, Teresa"
  },
  {
    "html_url": "http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-058&desc=Centralized+Production+and+Direct+Distribution+of+License+Plates+and+Registration+Documents&method=RFP&code=7",
    "id": "RFP-14-058",
    "type": "RFP",
    "title": "Centralized Production and Direct Distribution of License Plates and Registration Documents",
    "responses_open_at": "3/4/2014",
    "contact_name": "Thiemann, Adam"
  },
  {
    "html_url": "http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFQ_ISF21&desc=Misc.+Furniture&method=NOTICE TO BIDDERS&code=P",
    "id": "RFQ_ISF21",
    "type": "NOTICE TO BIDDERS",
    "title": "Misc. Furniture",
    "responses_open_at": "3/5/2014",
    "contact_name": "Archer, Mary Beth"
  }
]
13 RFPs not printed for length considerations
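
In the output above, each item's `id`, `type`, and `title` match the `spec`, `method`, and `desc` query parameters of its `html_url`. As a rough sketch of how those fields could be derived from a link alone (this is an assumption for illustration, not necessarily how `scrapers/in/rfps.js` does it, and `fieldsFromDetailUrl` is a hypothetical helper name):

```javascript
// Hypothetical helper (not taken from the PR): derive basic RFP fields
// from the query string of an Indiana bid detail URL.
function fieldsFromDetailUrl(detailUrl) {
  // The WHATWG URL parser tolerates the raw space in e.g. "NEGOTIATED BID",
  // and searchParams decodes "+" in the description back to spaces.
  const params = new URL(detailUrl).searchParams;
  return {
    html_url: detailUrl,
    id: params.get('spec'),      // e.g. "300-14-63265"
    type: params.get('method'),  // e.g. "NEGOTIATED BID"
    title: params.get('desc'),   // human-readable description
  };
}

// Example using the first link from the scraper output.
const example = fieldsFromDetailUrl(
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=300-14-63265&desc=Contract+for+Services+for+Invasive+Plant+Control&method=NEGOTIATED BID&code=T'
);
console.log(example);
```

Fields such as `responses_open_at` and `contact_name` do not appear in the URL, so they would still have to come from the page HTML.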

Test output:

mba62:openrfps-scrapers mark$ bin/openrfps test scrapers/in/rfps.js 
The scraper returns at least one result: OK
item.id is returned for all items: OK
item.type is valid for all items: Not OK
item.contact_email is a proper address (or blank): OK
download URLs are valid (or blank): OK
item.id is unique for each item: OK
item.title is returned for all items: OK
NIGP codes are digits: OK

Contributor

ajb commented Mar 4, 2014

Hey Mark, looks awesome. I'll just leave this open for you to add to?

ajb added the wip label Mar 4, 2014

Author

marks commented Mar 4, 2014

@adamjacobbecker - thanks. In the spirit of having others add to it (it's at a place where it is useful but could be more useful), I'd prefer you merge it so others know it's there and they can add to it before I get to it. Thoughts on that approach?

Contributor

ajb commented Mar 4, 2014

Not sure if we have a good workflow for that right now.

I might suggest updating the wiki page to remove the link to Indiana from the first list, and adding a link to this PR in the "In Progress" section. Does that make sense?

Author

marks commented Mar 4, 2014

OK - I removed it from the first list (I read too fast and thought that was a complete list) and it is now in the in-progress list. I see IL is in the in-progress list but is also in the master branch, which is a little confusing to a newcomer.

If this were my project, I'd put anything in progress (and working, of course) in the master branch. NBD either way though. Hope to have time to add additional data but working with their HTML gave me enough fun for one night ;)

Thanks
