Hey y'all,
I need someone familiar with some easy web-crawling and XML-reading techniques who can program or customize a crawler for me to read out data records from a website.
Here is the site (may take a minute to load, if you are on a weak computer or connection):
http://ift.tt/1u2EIPk
This page has >2,300 links to personal profiles; each link carries a relative path in a data-link attribute, like this (I am snipping out the other attributes):
Code:
<a data-link="xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt"></a>
And that links to full URLs such as
Code:
http://ift.tt/1FNTlgd
Those .txt files contain stuff like this:
Quote:
<?xml version="1.0" encoding="UTF-8"?>
<person>
  <first_name><![CDATA[Ken]]></first_name>
  <middle_name></middle_name>
  <last_name><![CDATA[Gorski]]></last_name>
  <title></title>
  <degree><![CDATA[B Architecture Professional Degree, University of Kansas, 1972]]></degree>
  <city><![CDATA[El Paso]]></city>
  <state><![CDATA[TX]]></state>
  <country><![CDATA[US]]></country>
  <occupation_status>Degreed + Licensed</occupation_status>
  <tech_biography><![CDATA[I'm a licensed architect and AIA member.]]></tech_biography>
  <statement_911><![CDATA[I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.]]></statement_911>
  <photo></photo>
  <license_info><![CDATA[6477 TX]]></license_info>
</person>
Which I want to have translated into a simpler CSV/spreadsheet-like format such as
Code:
url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt|Ken||Gorski||B Architecture Professional Degree, University of Kansas, 1972|El Paso|TX|US|Degreed + Licensed|I'm a licensed architect and AIA member.|I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.||6477 TX
(The same info must go into the same column every time. I believe all .txt files contain tags for every data item, so it would suffice to output just the data without headers, provided that a tag with no CDATA still produces an empty field, i.e. two "|" delimiters in a row.)
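To make the mapping concrete, here is a rough sketch of the conversion for a single profile, assuming Python 3 and only its standard library. The function name profile_to_row and the FIELDS list are my own placeholders; the field order is simply the header line above. Collapsing line breaks and swapping any stray "|" inside the text for "/" is also just an assumption about how to keep one record on one line.

```python
import xml.etree.ElementTree as ET

# Column order taken from the header line in the post (after the url column).
FIELDS = [
    "first_name", "middle_name", "last_name", "title", "degree", "city",
    "state", "country", "occupation_status", "tech_biography",
    "statement_911", "photo", "license_info",
]


def profile_to_row(rel_url, xml_text):
    """Turn one <person> document into a single pipe-delimited line."""
    person = ET.fromstring(xml_text)
    values = [rel_url]
    for tag in FIELDS:
        # Missing or empty tags become an empty field, so columns stay aligned.
        text = person.findtext(tag, default="") or ""
        # Assumption: collapse line breaks and replace stray pipes so that
        # each record stays on one line with a fixed number of columns.
        text = " ".join(text.split()).replace("|", "/")
        values.append(text)
    return "|".join(values)
```

The parser returns CDATA content as ordinary text, so no special handling is needed for it.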
Then the same for the same sort of linked profiles on
Code:
http://ift.tt/1osvHwE
http://ift.tt/1vLbW9E
http://ift.tt/1FNTlgf
...
I would like to have a little tool that I can run on these URLs whenever I need to update my database: the input is the page with all the names and links, the output is a list with all profiles. Either something you program from scratch, or perhaps you can recommend a freeware tool that does just that sort of thing and can be configured by a half-witted fellow like myself.
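The crawling half of such a tool might look roughly like the sketch below, again assuming Python 3 with only the standard library. The names DataLinkCollector, extract_links, fetch_text and crawl are hypothetical. It also assumes two things I have not verified: that the data-link attributes are present in the page's raw HTML rather than filled in by JavaScript, and that the relative paths resolve against the index page's own URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class DataLinkCollector(HTMLParser):
    """Collects the data-link attribute of every <a> tag on the index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Only keep links that look like the profile files in the post.
            if name == "data-link" and value and value.endswith(".xml.txt"):
                self.links.append(value)


def extract_links(html_text):
    """Return all relative profile paths found in the index page HTML."""
    collector = DataLinkCollector()
    collector.feed(html_text)
    return collector.links


def fetch_text(url):
    """Download a URL and decode it as UTF-8 (the encoding the files declare)."""
    with urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(index_url):
    """Yield (relative_path, xml_text) for every profile linked from the index."""
    for rel in extract_links(fetch_text(index_url)):
        # Assumption: relative paths resolve against the index page URL.
        yield rel, fetch_text(urljoin(index_url, rel))
```

Running crawl() once per index page and writing one line per profile would give the update run described above; with more than 2,300 files per page, a short pause between requests would probably be polite to the server.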
Thanks!!
via International Skeptics Forum http://ift.tt/1FNTlgh