Wednesday, December 3, 2014

Help needed: Program or configure a web / XML crawler

Hey y'all,



I need someone familiar with basic web-crawling and XML-parsing techniques who can program or customize a crawler for me that extracts data records from a website.



Here is the site (it may take a minute to load on a slow computer or connection):



http://ift.tt/1u2EIPk



This page has >2,300 links to personal profiles; each link carries a relative path in a data-link attribute, like this (I am snipping out the other attributes):


Code:



<a data-link="xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt"></a>
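
A minimal sketch of how those data-link values could be collected with Python's standard library. It assumes the attributes are present in the HTML as served (not injected later by JavaScript); the names DataLinkCollector and extract_profile_links are made up for illustration.

```python
from html.parser import HTMLParser


class DataLinkCollector(HTMLParser):
    """Collects the data-link attribute of every <a> tag, in page order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "data-link" and value:
                    self.links.append(value)


def extract_profile_links(html_text):
    """Return all data-link paths found in the given HTML text."""
    collector = DataLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


sample = '<a data-link="xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt"></a>'
print(extract_profile_links(sample))
```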


Those resolve to full URLs such as


Code:



http://ift.tt/1FNTlgd


Those .txt files contain XML like this:




Quote:








<?xml version="1.0" encoding="UTF-8"?>



<person>

<first_name><![CDATA[Ken]]></first_name>

<middle_name></middle_name>

<last_name><![CDATA[Gorski]]></last_name>

<title></title>

<degree><![CDATA[B Architecture Professional Degree, University of Kansas, 1972]]></degree>

<city><![CDATA[El Paso]]></city>

<state><![CDATA[TX]]></state>

<country><![CDATA[US]]></country>

<occupation_status>Degreed + Licensed</occupation_status>

<tech_biography><![CDATA[I'm a licensed architect and AIA member.]]></tech_biography>

<statement_911><![CDATA[I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.]]></statement_911>

<photo></photo>

<license_info><![CDATA[6477 TX]]></license_info>

</person>



I want that converted into a simpler CSV/spreadsheet-like format, such as


Code:



url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info

xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt|Ken||Gorski||B Architecture Professional Degree, University of Kansas, 1972|El Paso|TX|US|Degreed + Licensed|I'm a licensed architect and AIA member.|I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.||6477 TX



(The same info must go into the same column every time. I believe every .txt file contains a tag for every data item, so it would suffice to output just the data without headers, provided an empty field, i.e. just the "|" delimiter, is emitted when a tag contains no CDATA.)
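
A rough sketch of that conversion for a single profile, using Python's xml.etree.ElementTree. The field order is taken from the header line above. The function name profile_to_row is hypothetical. Collapsing line breaks and replacing any literal "|" inside the text with "/" are assumptions made so that every profile stays on one line with a fixed number of columns.

```python
import xml.etree.ElementTree as ET

# Column order copied from the header line of the desired output.
FIELDS = [
    "first_name", "middle_name", "last_name", "title", "degree",
    "city", "state", "country", "occupation_status", "tech_biography",
    "statement_911", "photo", "license_info",
]


def profile_to_row(link, xml_bytes):
    """Turn one <person> document into a single pipe-delimited line."""
    root = ET.fromstring(xml_bytes)
    cells = [link]
    for field in FIELDS:
        # Empty or missing tags become an empty cell, so columns never shift.
        text = root.findtext(field) or ""
        # Assumption: fold line breaks into spaces and swap stray pipes for "/".
        text = " ".join(text.split()).replace("|", "/")
        cells.append(text)
    return "|".join(cells)


sample_xml = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<person>"
    b"<first_name><![CDATA[Ken]]></first_name>"
    b"<middle_name></middle_name>"
    b"<last_name><![CDATA[Gorski]]></last_name>"
    b"<title></title>"
    b"<city><![CDATA[El Paso]]></city>"
    b"<photo></photo>"
    b"<license_info><![CDATA[6477 TX]]></license_info>"
    b"</person>"
)
print(profile_to_row("xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt", sample_xml))
```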



Then the same for the same sort of linked profiles on


Code:



http://ift.tt/1osvHwE

http://ift.tt/1vLbW9E

http://ift.tt/1FNTlgf

...



I would like a little tool that I can run on these URLs whenever I need to update my database: the input is the page with all the names and links, the output is a list of all profiles. It could be something you program from scratch, or perhaps you can recommend a freeware tool that does just that sort of thing and can be configured by a half-witted fellow like myself.
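
For the "run it whenever I need an update" part, the overall loop might look roughly like the sketch below (Python, standard library only). It is an outline under assumptions, not a finished tool: extract_links and to_row are placeholders for whatever link-extraction and XML-to-row code is used, and fetch and crawl are names invented here. Relative paths are resolved against the URL the index page finally loads from, since the short links above redirect.

```python
import sys
import urllib.request
from urllib.parse import urljoin


def fetch(url):
    """Download a URL; return (final_url, body_bytes) after any redirects."""
    with urllib.request.urlopen(url, timeout=120) as resp:
        return resp.geturl(), resp.read()


def crawl(index_url, extract_links, to_row, fetch=fetch, out=sys.stdout):
    """Write one row per linked profile to `out`; return the failures.

    extract_links(html_text) -> list of relative profile paths (placeholder)
    to_row(link, xml_bytes)  -> one pipe-delimited line       (placeholder)
    """
    base_url, body = fetch(index_url)
    links = extract_links(body.decode("utf-8", errors="replace"))
    failed = []
    for link in links:
        try:
            # Resolve the relative path against the index page's real URL.
            _, xml_bytes = fetch(urljoin(base_url, link))
            out.write(to_row(link, xml_bytes) + "\n")
        except Exception as exc:
            # Keep going: one broken profile should not stop a 2,300-link run.
            failed.append((link, exc))
    return failed
```

Calling crawl once per index page and appending to the same output file would give one combined list; the returned failures show which profiles need a retry.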





Thanks!!





via International Skeptics Forum http://ift.tt/1FNTlgh
