Collect, Enrich, & Store News Data

1. Create a list of publisher Really Simple Syndication (RSS) feeds to aggregate news. Each feed item includes:
  • publisher
  • article title
  • short summary

2. Develop Python scripts, together with the company, to retrieve feed items daily and enrich them with additional metadata.
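
As a rough illustration of steps 1 and 2, the sketch below parses an RSS 2.0 document with Python's standard library and attaches some extra metadata to each item. The sample feed, the parse_feed helper, and the category and retrieved_at fields are assumptions made for this example, not the project's actual script or schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical sample feed; a real script would download this XML
# from each publisher's RSS URL (for example with urllib.request).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Publisher</title>
    <item>
      <title>Example headline</title>
      <description>Example short summary.</description>
      <link>https://example.com/story</link>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text, rss_map):
    """Turn RSS 2.0 XML into a list of dicts, one per feed item."""
    channel = ET.fromstring(xml_text).find("channel")
    publisher = channel.findtext("title", default="")
    retrieved_at = datetime.now(timezone.utc).isoformat()
    items = []
    for item in channel.findall("item"):
        items.append({
            "publisher": publisher,
            "title": item.findtext("title", default=""),
            "summary": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
            # Added metadata (illustrative, not the project's actual fields).
            "category": rss_map,
            "retrieved_at": retrieved_at,
        })
    return items


items = parse_feed(SAMPLE_RSS, "General News")
print(items[0]["publisher"], "|", items[0]["title"])
```

Running a function like this once a day for every feed in the list would produce the raw items that the later steps enrich and index.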

3. Learn and use the Java-based Stanford Natural Language Processing (NLP) tools and server to extract entities of interest from the title and short summary.

Stanford CoreNLP splits sentences into tokens, tags each token with its part of speech, and performs Named Entity Recognition (NER) against the token stream (PERSON, LOCATION, etc.).
A Part-Of-Speech Tagger (POS Tagger), such as the one included in Stanford CoreNLP, is a piece of software that reads text in some language and assigns a part of speech to each word.

For example:
Timothy PERSON
Geithner PERSON
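
One way to get output like this from Python is to send text to a running CoreNLP server over HTTP and read the NER tag of each token from its JSON response. The sketch below assumes a server listening on localhost port 9000 (the server's default); the annotate and extract_entities helpers and the sample response are written for this example only.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of a locally running CoreNLP server.
CORENLP_URL = "http://localhost:9000"


def annotate(text, url=CORENLP_URL):
    """POST text to the CoreNLP server and return the parsed JSON response."""
    props = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    request = urllib.request.Request(f"{url}/?{query}", data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_entities(annotation, wanted=("PERSON", "LOCATION", "ORGANIZATION")):
    """Return (word, NER tag) pairs for tokens whose tag is in `wanted`."""
    return [
        (token["word"], token["ner"])
        for sentence in annotation.get("sentences", [])
        for token in sentence.get("tokens", [])
        if token.get("ner") in wanted
    ]


# Hand-written sample shaped like the server's JSON output, so the
# parsing step can be tried without a running server.
sample = {"sentences": [{"tokens": [
    {"word": "Timothy", "pos": "NNP", "ner": "PERSON"},
    {"word": "Geithner", "pos": "NNP", "ner": "PERSON"},
    {"word": "spoke", "pos": "VBD", "ner": "O"},
]}]}

print(extract_entities(sample))
```

With a server running, extract_entities(annotate(title + ". " + summary)) would apply the same filtering to a real feed item.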

4. Index and verify the enhanced documents with the Apache Solr search platform. A Python script submits the enhanced documents to the Solr API so they can be searched later.
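
A minimal sketch of the submission step, assuming Solr runs on localhost port 8983 with a core named news (both are placeholders, as are the helper names). It posts a JSON array of documents to Solr's update handler and commits in the same request.

```python
import json
import urllib.request

# Assumed Solr location and core name; adjust for the actual deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/news/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build a POST request carrying the documents as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(docs):
    """Send the documents to Solr and return its parsed JSON reply."""
    with urllib.request.urlopen(build_update_request(docs)) as response:
        return json.loads(response.read().decode("utf-8"))


# Field names taken from the example document in this post.
docs = [{
    "publisher_name": "FOX News",
    "neutrality_article_author": "Ryan Morik",
    "RSSmap": "General News",
}]
request = build_update_request(docs)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it easy to inspect the payload before anything reaches the index.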

An example document returned by a Solr query could look like this:

"publisher_name":"FOX News",
"neutrality_article_thumbnail_url":["{delim}"],
"A woman was attacked by a shark in the Maldives while free diving, and it was all caught on camera. The video of the bite has resurfaced on the internet.",
"neutrality_article_title":"Video of shark attacking snorkeler in Maldives resurfaces; tourist was left with 6-inch wound",
"neutrality_article_author":"Ryan Morik",
"RSSmap":"General News",

This is the point at which we verify in Solr that the query results look suitable to use for training.
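
The verification could be scripted roughly as follows. This sketch assumes the same placeholder Solr address and core name (localhost:8983, news); the helper names and the choice of required fields are illustrative. It builds a select query for one publisher and flags returned documents that lack a field the training data would need.

```python
import json
import urllib.parse
import urllib.request

# Assumed Solr location and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/news/select"

# Fields every document should carry; this particular set is an assumption.
REQUIRED_FIELDS = ("publisher_name", "neutrality_article_title")


def build_select_url(field, value, rows=10, base=SOLR_SELECT_URL):
    """Build a select URL that matches `value` as a phrase in `field`."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{field}:"{escaped}"', "rows": rows, "wt": "json"}
    return f"{base}?{urllib.parse.urlencode(params)}"


def fetch_docs(url):
    """Run the query and return the list of matching documents."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))["response"]["docs"]


def missing_fields(doc, required=REQUIRED_FIELDS):
    """List the required fields that are absent or empty in a document."""
    return [name for name in required if not doc.get(name)]


url = build_select_url("publisher_name", "FOX News", rows=5)
print(url)

# A document missing its title is reported as incomplete.
print(missing_fields({"publisher_name": "FOX News"}))
```

Against a live index, running missing_fields over fetch_docs(url) gives a quick list of documents to fix before they are used for training.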





