No doubt, Git is one of the best things you can use for managing your code. But sometimes you put yourself in such a bad situation that you start cursing yourself for using it. I am working in a team on a fairly large project, and I had to switch to another branch because the code base I had differed greatly from the one in Git, yet for various reasons I had to work with that source. Since my base code was not the same as Git's master, I made a commit and started working on top of it. I could not simply merge, because there were 20k+ files in total and it was impractical to go through them to resolve the conflicts. I then thought of applying patches for all the later commits, but that gave me lots of conflicts too. The files I had worked on were not touched by anyone else, so the simple way out was to take all the files I had edited and use them as they were. I am no Git expert, and I was too lazy to look for the proper way to do that. There may be several nice ways to deal with this situation using Git alone, but as I said, I am no Git expert, and that is not what this post is about.
So I created a simple PHP script that reads the patches and identifies the files affected by those commits. I thought I would share the code, hence this post.
As I already explained, I had created a series of patches, so I could read the affected files from them. Being quite short on time, I wrote some quick code, which could certainly be improved in efficiency and accuracy.
Here is the script I wrote:
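<?php
// Use the file names as array keys so each one is stored only once.
$result = array();
// Patches are numbered 0001 to 0088.
for ($i = 1; $i < 89; $i++) {
    $fname = glob('patch/' . substr("0000" . $i, -4) . "*");
    if (empty($fname)) {
        continue; // no patch with this serial number
    }
    $entry = $fname[0];
    // Keep only the part before the first "diff --", which lists the changed files.
    $content = explode('diff --', file_get_contents($entry));
    $content = $content[0];
    $re = "/([a-zA-Z0-9_]+)(\.php|\.js|\.css)/";
    preg_match_all($re, $content, $out, PREG_PATTERN_ORDER);
    $out[0] = array_unique($out[0]);
    foreach ($out[0] as $val) {
        $result[$val] = true;
    }
}
foreach ($result as $key => $val) {
    echo $key . "<br />";
}
?>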
Explanation
The simple thing to do was to read each patch and look for file names. I looked for three extensions: .php, .js and .css. I put the script in a directory and the patches in its subdirectory patch/.
The patch file names started with a serial number, from 0001 up to the number of files. I simply generated the names myself and let glob() fill in the rest. I am sure you will ask why I didn't simply read the directory. Actually, my initial code caused the script to exceed the timeout, and I was not sure whether it was the number of files or something else. So I decided to control the number of files, and that is why the code looks like this.
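As a small aside, here is a sketch of how the serial-number prefix could be built with sprintf() instead of the substr() trick. The helper name patchPattern() is my own invention for illustration, and it assumes the same patch/ subdirectory and four-digit numbering described above.

```php
<?php
// Hypothetical helper: build the glob pattern for the Nth patch file.
function patchPattern(int $n, string $dir = 'patch'): string
{
    // sprintf('%04d', 7) gives "0007", the same as substr("0000" . 7, -4).
    return $dir . '/' . sprintf('%04d', $n) . '*';
}

echo patchPattern(7), "\n";   // patch/0007*
echo patchPattern(88), "\n";  // patch/0088*
```

Either form works for numbers up to 9999. I find the sprintf() version a little easier to read, but that is a matter of taste.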
The reason for the long execution time was running preg_match_all on long patch files (a couple of them were 20 MB!). We don't need to search for file names in the changes themselves; only the portion containing the list of files matters. So I search for file names only in that portion, by exploding the contents on "diff --" and searching in the first array value.
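To make the idea concrete, here is a minimal sketch of the header-only search as a standalone function. The function name filesInPatchHeader() and the sample patch text are made up for illustration, and I am assuming the patches look like `git format-patch` output, where the list of changed files comes before the first "diff --" line. It uses strpos() and substr() rather than explode(), which should give the same header text without building an array of every chunk.

```php
<?php
// Hypothetical helper: return the unique file names mentioned in the part
// of a patch that comes before the first "diff --" line.
function filesInPatchHeader(string $patch): array
{
    // Keep only the text before the first "diff --"; the bulky hunks follow it.
    $pos = strpos($patch, 'diff --');
    $header = ($pos === false) ? $patch : substr($patch, 0, $pos);

    // Same idea as the script: a name followed by .php, .js or .css.
    preg_match_all('/[a-zA-Z0-9_]+\.(?:php|js|css)/', $header, $matches);

    return array_values(array_unique($matches[0]));
}

// Made-up sample in the shape of a `git format-patch` file.
$sample = <<<'EOT'
Subject: [PATCH] Fix login form

 app/login.php    | 4 ++--
 assets/style.css | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/app/login.php b/app/login.php
--- a/app/login.php
+++ b/app/login.php
+include 'helper.php';
EOT;

print_r(filesInPatchHeader($sample));
```

With this sample, helper.php is not reported, because it only appears after the first "diff --" and that part is never searched.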
To keep the list unique, I used the file name as the array index, so no memory is wasted on duplicates. I believe the rest makes sense without any explanation.
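Here is a tiny sketch of that trick on its own, with made-up file names standing in for the matches from several patches. The variable names are just for illustration.

```php
<?php
// Made-up matches from several patches, with repeats.
$matches = ['login.php', 'style.css', 'login.php', 'app.js', 'style.css'];

$seen = [];
foreach ($matches as $name) {
    // Assigning to an existing key overwrites it, so no duplicate is stored.
    $seen[$name] = true;
}

// The keys are the unique names, in the order they were first seen.
$files = array_keys($seen);
print_r($files);
```

The array behaves like a set here: the value true is only a placeholder, and the keys carry the actual list.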
The above code was sufficient and efficient enough in my case. Let me know what you think about this.