May 27, 2012

How to export a blog from blogspot.com and use the output...

First you have to log in at http://www.blogger.com and open the settings (Einstellungen in the German interface).

There you choose "Other" (Sonstiges).

There you find the link "Export blog" (Blog exportieren).

After confirming the dialog that follows, you get one big XML file.

I got a file named blog-05-27-2012.xml. This file contains everything from your blog:
  • Layout
  • Users
  • Configuration
  • All postings (incl. comments, date, labels, ...)
  • Locales
  • Meta description
  • Timezone, timestamp format
  • ...
The problem is: how do you extract the postings from this file?

First, extract only the lines that contain the XML tag "entry":
grep "<entry>" blog-05-27-2012.xml > blog.entry.xml
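Because several entries may share one very long line, the number of lines grep returns says little about how many entries there are. A minimal sketch for counting the tags themselves, so you can compare the number against later steps (count_entries is a made-up helper name, not part of any tool):

```shell
# count_entries is a hypothetical helper name used only for this sketch.
count_entries() {
    # $1: path to the exported XML file
    # grep -o prints every match on its own line, so wc -l counts the tags
    grep -o '<entry>' "$1" | wc -l
}

# Example call (file name as used above):
# count_entries blog-05-27-2012.xml
```

The count includes configuration entries as well as posts and comments, so it is only an upper bound for the number of postings.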
Then put every entry on its own line (the \n in the replacement needs GNU sed):
sed 's/<entry>/\n<entry>/g' blog.entry.xml > blog.newline.xml
Now you have some lines with configuration details. You can remove them with this command:
grep -v "<email>noreply@blogger.com</email>" blog.newline.xml | grep "<author>" > blog.posts.xml
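The three steps so far can also be chained into one pipeline without the intermediate files. This is only a sketch: extract_posts is a made-up function name, and it assumes GNU sed, since the \n in the replacement is not understood by every sed.

```shell
# extract_posts is a hypothetical helper name used only for this sketch.
extract_posts() {
    # $1: path to the exported XML file
    grep '<entry>' "$1" \
      | sed 's/<entry>/\n<entry>/g' \
      | grep -v '<email>noreply@blogger.com</email>' \
      | grep '<author>'
    # 1st grep: keep only lines that contain entries
    # sed:      start a new line before every entry (GNU sed assumed)
    # 2nd grep: drop the configuration entries written by Blogger itself
    # 3rd grep: keep only lines that have an author
}

# Example call (file names as used above):
# extract_posts blog-05-27-2012.xml > blog.posts.xml
```

Keeping the intermediate files, as in the commands above, is still handy when you want to inspect what each step removed.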
Now this XML contains a lot of tags:
  • id
  • author
  • title
  • content
  • link
  • published
  • updated
  • uri
  • email
  • category
  • name
and some more...

If you want to get one line per post like "date**title**content", you can use the following command. The nonsense word gruzelwurbel serves as a temporary field separator that is unlikely to appear in any post:
sed -e "s/<title type='text'>/gruzelwurbel/g" \
    -e "s/<\/title><content type='html'>/gruzelwurbel/g" \
    -e 's/<\/content>/gruzelwurbel/g' \
    -e 's/<published>/gruzelwurbel/g' \
    -e 's/<\/published>/gruzelwurbel/g' blog.posts.xml \
  | awk -F gruzelwurbel '{printf("%s**%s**%s\n",$2,$4,$5)}'
-> 2008-01-01T00:00:00.000-08:00**Gästebuch**Um einen Kommentar im Gästebuch zu hinterlassen bitte "Kommentar veröffentlichen" anklicken.
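To see why awk prints fields 2, 4 and 5, it can help to list every field of a line together with its number. A small sketch (show_fields is a made-up helper name; it expects lines on standard input in which the tags have already been replaced by gruzelwurbel):

```shell
# show_fields is a hypothetical helper name used only for this sketch.
show_fields() {
    # Split each input line at the separator word and print "number: field"
    awk -F gruzelwurbel '{ for (i = 1; i <= NF; i++) printf("%d: %s\n", i, $i) }'
}

# Example with a shortened, made-up line:
# printf '%s\n' '<entry>gruzelwurbel2008-01-01gruzelwurbel<updated/>gruzelwurbelTitlegruzelwurbelBodygruzelwurbel</entry>' | show_fields
```

In a line of that shape, field 2 is the text between the published tags, field 3 is whatever sits between the date and the title, and fields 4 and 5 are the title and the content.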

HTML in the content is escaped as &lt; and &gt;. To undo this, the following two sed commands can be used:
cat file | sed 's/&lt;/</g' | sed 's/&gt;/>/g' > newfile
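Escaped content can contain further XML entities, for example &amp; and &quot;. The following sketch handles these two as well; unescape_html is a made-up function name, and the list of entities is an assumption that may not cover everything in your export. &amp; is replaced last, so that a text like &amp;lt; ends up as &lt; and not as <.

```shell
# unescape_html is a hypothetical helper name used only for this sketch.
unescape_html() {
    # Reads from standard input, writes to standard output.
    # In the last expression the & is written as \& because a bare &
    # in a sed replacement means "the matched text".
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g'
}

# Example call:
# unescape_html < file > newfile
```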
