Insight Data Engineering Coding Challenge
This is a simple, single Python 2.7 program approach to the coding challenge. Since this was a simple, one-off project, I chose to hardcode the file paths of the input and output, as the file structure was clearly described in the challenge. This means that run.sh only runs the Python script and doesn't pass it anything. I chose to solve this problem as it was presented, as opposed to how it might actually be implemented in the real world, i.e., receiving a data stream from an API. Because we were provided a text file with all of the tweets in it, it made more sense to do most of the preprocessing all at once, rather than go line by line. Most of the processing of the JSON tweets was fairly simple, as Python has built-in libraries to handle JSON and datetime stamps. Also, encoding the strings as ASCII removed the majority of the escape characters. The only dependencies this program has are the json and datetime libraries, which to my knowledge come with every Python distribution.
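A minimal sketch of that per-tweet parsing, assuming the standard Twitter created_at format and entities.hashtags layout; parse_tweet is an illustrative name and not necessarily the function in the submitted script:

```python
import json
from datetime import datetime

def parse_tweet(line):
    """Parse one JSON tweet line into (timestamp, hashtags); Python 2.7 style."""
    tweet = json.loads(line)
    # skip rate-limit messages and other non-tweet records
    if 'created_at' not in tweet:
        return None
    ts = datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    # ASCII-encode hashtag text to drop most escape characters
    tags = {h['text'].encode('ascii', 'ignore')
            for h in tweet.get('entities', {}).get('hashtags', [])}
    return ts, tags
```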
As for the graph of the hashtags, that was a little more challenging and fun. Instead of using or building a more standard graph class, I realized that there might be a simpler approach, because the only information needed from the graph is its nodes' average degree. I thought this would be a great job for a dictionary, with the keys as nodes and the values as sets of edges. The rolling one-minute average degree proved to be the most challenging part. I solved this by making each edge a tuple, with one part being the neighboring hashtag and the other being the timestamp. If a new tweet came in with the same hashtag, the edge was updated with the newer timestamp. Pruning edges that were too old (older than one minute) was fairly simple: scan the graph by iterating over the dictionary and comparing the timestamps. Edges that were more than a minute old were removed, and any node left with no edges was also removed. Calculating the average degree of the graph was also fairly simple, just needing to iterate over the dictionary, sum the lengths of the values (the edge sets), and divide that by the length of the dictionary (the number of nodes). Then it's a simple matter of doing this as each new tweet comes in and writing the average degree to the output file.
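The sketch below shows roughly what this dictionary-based pseudo-graph looks like. It is illustrative only; the function names here are not the insert_node()/remove_node() functions from the actual script.

```python
from datetime import timedelta

# Keys are hashtags (nodes); values are sets of (neighbor_hashtag, timestamp)
# tuples (edges). The window for the rolling average is one minute.
WINDOW = timedelta(seconds=60)

def insert_tweet(graph, hashtags, timestamp):
    """Add every pairwise edge among a tweet's hashtags, keeping the newest timestamp."""
    for tag in hashtags:
        edges = graph.setdefault(tag, set())
        for other in hashtags:
            if other == tag:
                continue
            # discard any stale copy of this edge so the newer timestamp wins
            edges -= {(t, ts) for (t, ts) in edges if t == other}
            edges.add((other, timestamp))

def prune(graph, now):
    """Remove edges older than one minute, then drop nodes left with no edges."""
    for tag in list(graph):
        graph[tag] = {(t, ts) for (t, ts) in graph[tag] if now - ts <= WINDOW}
        if not graph[tag]:
            del graph[tag]

def average_degree(graph):
    """Sum of edge-set sizes divided by the number of nodes."""
    if not graph:
        return 0.0
    return sum(len(edges) for edges in graph.values()) / float(len(graph))
```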
I believe that the data structure I have created is rather elegant and easily scalable, but that its implementation could be further refined. The insert_node() function works well and is quick. The remove_node() function is not as clean; I likely would have written it differently if I were working with a real data stream. In that case there would be easier ways to handle removing a node that are faster and more scalable. I thought about trying to implement something closer to a real-world approach here, but the created_at timestamp of a tweet only has one-second resolution, which means large numbers of tweets need to be removed at once. I thought about using the actual millisecond timestamp at the end of the JSON, but based on the wording of the challenge I believe a single average degree was wanted per tweet at the one-second resolution, so using the milliseconds would produce the wrong output for this.
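For illustration only, here is a rough sketch of one faster eviction strategy along those lines: keep edges in a time-ordered deque (assuming tweets arrive roughly in order and that the insert step also appends each (timestamp, tag, neighbor) entry to the deque), so pruning only ever touches the oldest entries instead of scanning every node. None of the names here come from the submitted script.

```python
from collections import deque

def prune_with_queue(graph, queue, now, window):
    """Pop expired (timestamp, tag, neighbor) entries from the left of the deque."""
    while queue and now - queue[0][0] > window:
        ts, tag, neighbor = queue.popleft()
        edges = graph.get(tag)
        # only remove if the edge hasn't been refreshed by a newer tweet
        if edges and (neighbor, ts) in edges:
            edges.discard((neighbor, ts))
            if not edges:
                del graph[tag]
```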
Overall, I really enjoyed this challenge and had fun coming up with a pseudo-graph made from a dictionary of strings, sets, and tuples.