Run like Hell: How to export a blog from blogspot.com and use the output...

May 27, 2012

How to export a blog from blogspot.com and use the output...

First, log in at http://www.blogger.com and open the settings (Einstellungen in the German interface).

There, choose "Other" (Sonstiges).

On that page you will find the link "Export blog" (Blog exportieren).

After confirming the export dialog, you get one big XML file.

I got a file named blog-05-27-2012.xml (the name contains the export date). This file contains everything in your blog:

Layout
Users
Configuration
All postings (incl. comments, date, labels, ...)
Locales
Meta description
Timezone, timestamp format
...

The problem is: how do you extract the postings from this file?

First, extract only the lines that contain the XML tag "entry":

grep "<entry>" blog-05-27-2012.xml > blog.entry.xml
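Before going on, it can help to know how many entries the export holds, so that the later steps can be checked against that number. A minimal sketch, assuming a grep that supports the -o option (GNU and BSD grep do); count_entries is a made-up helper name, not part of the original commands:

```shell
# count_entries FILE: print how often "<entry>" occurs in FILE.
# Hypothetical helper for a quick sanity check.
# grep -o prints every match on its own line, so wc -l counts matches
# even when the whole export sits on one very long line.
count_entries() {
    grep -o "<entry>" "$1" | wc -l | tr -d ' '
}

# Usage (file name as used in this post):
# count_entries blog-05-27-2012.xml
```

The number includes every kind of entry in the file, not only postings, so expect it to be larger than the number of posts.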

Then put every entry on its own line:

sed 's/<entry>/\n<entry>/g' blog.entry.xml > blog.newline.xml
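The \n in the replacement part is understood as a newline by GNU sed, but other sed implementations (for example the BSD sed on macOS) may not treat it that way. A sketch of a more portable variant, assuming a POSIX sed: a backslash followed by a real line break. split_entries is a made-up name:

```shell
# split_entries FILE: print FILE with a line break before every <entry>.
# Hypothetical helper. The backslash at the end of the first sed line
# escapes a literal newline, so the next line has to start at column 0.
split_entries() {
    sed 's/<entry>/\
<entry>/g' "$1"
}

# Usage:
# split_entries blog.entry.xml > blog.newline.xml
```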

Now you have some lines with configuration details. You can remove them with this command:

grep -v "<email>noreply@blogger.com</email>" blog.newline.xml |grep "<author>" > blog.posts.xml
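The three steps can also be chained in one pipeline, without the intermediate files. This is only a sketch: extract_posts is a made-up helper name, and the line break is written in the portable backslash-newline form instead of \n, so it does not depend on GNU sed:

```shell
# extract_posts FILE: print one line per entry that survives the filters
# used in the three commands above (hypothetical helper).
#   1. keep only lines containing <entry>
#   2. start a new line before every <entry>
#   3. drop lines with the noreply@blogger.com address
#   4. keep only lines that contain an <author> tag
extract_posts() {
    grep "<entry>" "$1" |
        sed 's/<entry>/\
<entry>/g' |
        grep -v "<email>noreply@blogger.com</email>" |
        grep "<author>"
}

# Usage:
# extract_posts blog-05-27-2012.xml > blog.posts.xml
```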

Now this XML contains a lot of tags:

id
author
title
content
link
published
updated
uri
email
category
name

and some more...

If you want to get a file with one line per post like "date**title**content" you can use the following command:

cat blog.posts.xml | sed $'s/<title type=\'text\'>/gruzelwurbel/g'|sed $'s/<\/title><content type=\'html\'>/gruzelwurbel/g'|sed 's/<\/content>/gruzelwurbel/g'|sed 's/<published>/gruzelwurbel/g'|sed 's/<\/published>/gruzelwurbel/g'| awk -F gruzelwurbel '{printf("%s**%s**%s\n",$2,$4,$5)}'

-> 2008-01-01T00:00:00.000-08:00**Gästebuch**Um einen Kommentar im Gästebuch zu hinterlassen bitte "Kommentar veröffentlichen" anklicken.
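Once the output of that command is redirected into a file, the "**" separator makes it easy to work with. A small sketch, assuming the result was saved as blog.lines.txt (a made-up name) and that neither the date nor the title contains "**". It prints the day and the title of every post, sorted by the timestamp text, which matches chronological order as long as all posts carry the same UTC offset. list_titles is a hypothetical helper:

```shell
# list_titles FILE: print "YYYY-MM-DD title" for every line of FILE,
# where FILE has the form date**title**content (hypothetical helper).
# -F'[*][*]' makes awk split on the two-character separator "**";
# substr($1, 1, 10) keeps only the date part of the timestamp.
list_titles() {
    LC_ALL=C sort "$1" | awk -F'[*][*]' '{ print substr($1, 1, 10) " " $2 }'
}

# Usage:
# list_titles blog.lines.txt
```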

HTML is escaped with &lt; and &gt;. To convert it back, the following two sed commands can be used:

cat file | sed 's/&lt;/</g' | sed 's/&gt;/>/g' > newfile
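The content may also contain other escaped characters such as &amp;quot; and &amp;amp;. A sketch that handles these two as well, assuming no further entities need decoding; unescape_html is a made-up name. The &amp;amp; rule comes last, otherwise a text like &amp;amp;lt; would be decoded twice:

```shell
# unescape_html [FILE...]: decode &lt; &gt; &quot; and &amp; (hypothetical helper).
# In the replacement part of sed, "&" means "the matched text",
# so a literal ampersand has to be written as \&.
# With no FILE argument, sed reads standard input.
unescape_html() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g' "$@"
}

# Usage:
# unescape_html file > newfile
```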
