From Blogsphere to a Static Site (Part 2) - Cleaning up the HTML

Blogsphere lets you create entries as RichText or as plain HTML. To export them I need to grab the HTML, either the manually entered one or the one generated from RichText, clean it up (especially my manually entered HTML), and then replace image sources and internal links with the new URL syntax. To make this happen I created two functions that save images and attachments and build a lookup list, so the HTML cleanup has a mapping table to work with:


    private void saveImage(Document doc) {
        String sourceDirectory = this.config.sourceDirectory + this.config.imageDirectory;
        try {
            String subject = doc.getItemValueString("ImageName");
            Date created = doc.getCreated().toJavaDate();
            @SuppressWarnings("rawtypes")
            Vector attNames = this.s.evaluate("@AttachmentNames", doc);
            String description = doc.getItemValueString("ImageName");
            String oldURL = this.config.oldImageLocation + doc.getItemValueString("ImageUNID") + "/$File/";
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy");
            String year = sdf.format(created);
            FileEntry fe = this.imgEntries.add(subject, oldURL, description, created);

            for (Object attObj : attNames) {
                try {
                    String attName = attObj.toString();
                    String newURL = this.config.webBlogLocation + this.config.imageDirectory + year + "/" + attName;
                    fe.add(attName, newURL, description, created);
                    String outDir = sourceDirectory + year + "/";
                    this.ensureDirectory(outDir);
                    EmbeddedObject att = doc.getAttachment(attName);
                    att.extractFile(outDir + attName);
                    Utils.shred(att);
                } catch (NotesException e) {
                    e.printStackTrace();
                } catch (Exception e2) {
                    e2.printStackTrace();
                }
            }

        } catch (NotesException e) {
            e.printStackTrace();
        }

    }

    private void saveImageFromURL(String href, String targetName) {

        String fetchFromWhere = "https://" + this.config.bloghost + href;
        try {
            byte[] curImg = Request.Get(fetchFromWhere).execute().returnContent().asBytes();
            this.saveIfChanged(curImg, targetName);
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
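
The `saveIfChanged` helper called above is not shown in this post. A minimal sketch of what such a helper could look like, using only the JDK, follows. The class name `ChangeAwareWriter` and the boolean return value are assumptions made for illustration; the original presumably returns nothing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not the code used for the actual migration.
final class ChangeAwareWriter {

    /**
     * Writes content to targetName unless the file already holds exactly
     * these bytes. Returns true when the file was (re)written.
     */
    static boolean saveIfChanged(byte[] content, String targetName) throws IOException {
        Path target = Paths.get(targetName);
        if (Files.isRegularFile(target)
                && Arrays.equals(Files.readAllBytes(target), content)) {
            return false; // identical content: leave the file and its timestamp alone
        }
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent); // make sure e.g. the year folder exists
        }
        Files.write(target, content);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("blogexport");
        String target = dir.resolve("2017").resolve("logo.png").toString();
        byte[] image = {1, 2, 3};
        System.out.println(saveIfChanged(image, target)); // true: first write
        System.out.println(saveIfChanged(image, target)); // false: same bytes again
    }
}
```

Skipping identical files keeps their timestamps stable, which can help when the export is run repeatedly and the output directory is synced or version controlled.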

With the images saved, the HTML cleanup can proceed. As mentioned before, I'm using JSoup to process crappy HTML. It allows easy extraction of elements and attributes, so processing links and images takes just a few lines:


    private String cleanupHTMLlinksAndImages(String source, String location) {

        org.jsoup.nodes.Document hDoc = Jsoup.parse(source);
        this.cleanupTagUrlAttribute(hDoc, "img", "src", location);
        this.cleanupTagUrlAttribute(hDoc, "a", "href", location);
        return hDoc.body().html();

    }

    private void cleanupTagUrlAttribute(org.jsoup.nodes.Document hDoc, String elementName, String attName, String location) {

        String query = elementName + "[" + attName + "]";
        Elements elements = hDoc.select(query);

        for (Element element : elements) {
            String attValue = element.attr(attName).trim();
            // Keys in the mapping table are lower-case, so normalize once before the lookup
            String replace = this.mapperOldNewURLs.get(attValue.toLowerCase());
            if (replace != null) {
                System.out.println("Replacing:" + attValue + " with " + replace);
                element.attr(attName, replace);
            }
        }
    }
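
The lookup only works when the keys in `mapperOldNewURLs` were stored lower-cased, since the attribute value is lower-cased before the lookup. A minimal sketch of such a mapping table, using only the JDK, is shown here. The class `UrlMapper`, its method names, and the sample URLs are hypothetical; the post does not show how the real map is filled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the mapperOldNewURLs lookup table.
final class UrlMapper {

    private final Map<String, String> oldToNew = new HashMap<>();

    /** Registers a mapping; the old URL is stored trimmed and lower-cased. */
    void register(String oldUrl, String newUrl) {
        oldToNew.put(normalize(oldUrl), newUrl);
    }

    /** Returns the new URL for a known old URL, otherwise the value unchanged. */
    String resolve(String attValue) {
        return oldToNew.getOrDefault(normalize(attValue), attValue);
    }

    private static String normalize(String url) {
        // Locale.ROOT avoids locale-specific case rules (e.g. the Turkish dotless i)
        return url.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        UrlMapper mapper = new UrlMapper();
        // Sample URLs are made up for illustration
        mapper.register("/blog/images.nsf/0/ABC123/$File/Logo.PNG", "/blog/images/2017/Logo.PNG");
        System.out.println(mapper.resolve(" /BLOG/images.nsf/0/abc123/$file/logo.png "));
        System.out.println(mapper.resolve("https://example.com/"));
    }
}
```

Normalizing in one place on both the write and the read side means a link typed with different casing in hand-written HTML still finds its replacement, while unknown URLs pass through untouched.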

The returned HTML not only has its attribute values cleaned up; because JSoup parses the source and serializes it again, the result is also valid, clean HTML.
Next stop: rendering output. Stay tuned!

Posted by Stephan H Wissel on 17 April 2017 | Comments (0) | categories: Blog
